<a href="https://colab.research.google.com/github/andricisabina/IMDB-Movie-Analysis---Python-Project/blob/main/Movie%20Analysis%20in%20Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q streamlit geopandas pydeck



In [None]:
!npm install localtunnel

[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G[0K⠏[1G[0K⠋[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G[0K⠏[1G[0K⠋[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G[0K
added 22 packages in 4s
[1G[0K⠇[1G[0K
[1G[0K⠇[1G[0K3 packages are looking for funding
[1G[0K⠇[1G[0K  run `npm fund` for details
[1G[0K⠇[1G[0K

**HOMEPAGE**
The homepage describes the project and its objectives. It is also where the user uploads the .csv file that the remaining pages analyse.

In [None]:
%%writefile Home.py
import streamlit as st
import pandas as pd

st.set_page_config(page_title="Home", page_icon="🎬", layout="wide")
st.title("🎬 IMDB Movie Trends and Insights")

st.markdown("""

### 🎯 Project Objective

This project explores the IMDb Movie Dataset to generate strategic insights for companies operating in the movie production and distribution industry. We aim to discover patterns
and relationships that can guide decision-making in a highly competitive and creative market. The main goal is to identify the factors that influence a movie’s success (whether
defined by revenue, ratings, or popularity) and translate these findings into recommendations for companies seeking to expand or optimize their presence in the entertainment
industry.

### 💡 Why Are These Insights Relevant?

  Production and distribution companies need to make informed decisions to:

- Minimize financial risks
- Increase the probability of commercial success
- Tailor marketing strategies
- Choose promising scripts, genres, and collaborators

### 📈 Target Users

- Movie production studios
- Streaming platforms
- Distributors and sales agents
- Marketing teams
- Investors and analysts

---

### 📂 Upload Your IMDb Dataset (CSV)
""")

uploaded_file = st.file_uploader("Upload IMDb CSV file", type="csv")

if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)
    st.session_state["imdb_data"] = df

    st.success("✅ File uploaded and stored successfully!")
    st.subheader("📄 Preview of Your Dataset")
    st.dataframe(df.head())

    # Dataset description
    st.markdown("""
    ### 🧾 Dataset Description

    - The dataset contains data on movies listed on IMDb, capturing aspects such as genre, rating, and revenue.
    ---
    ### 📌 Variable Descriptions

    | **Column**              | **Description**                                                                 |
    |-------------------------|---------------------------------------------------------------------------------|
    | `Rank`                  | The position of the movie based on its ranking within the dataset.              |
    | `Title`                 | The name of the movie.                                                          |
    | `Genre`                 | The primary and secondary genres of the movie (comma-separated).                |
    | `Description`           | A short synopsis or plot summary of the movie.                                  |
    | `Director`              | The name of the movie’s director.                                               |
    | `Actors`                | A list of main actors featured in the movie (comma-separated).                  |
    | `Year`                  | The year the movie was released.                                                |
    | `Runtime (Minutes)`     | The duration of the movie in minutes.                                           |
    | `Rating`                | The IMDb user rating (scale: 0–10).                                             |
    | `Votes`                 | The number of user votes submitted for the movie on IMDb.                       |
    | `Revenue (Millions)`    | The box office revenue of the movie, in millions of USD.                        |
    | `Metascore`             | The Metacritic score (critic-based) for the movie (scale: 0–100).               |
    """)

else:
    st.warning("⚠️ Please upload a CSV file to begin.")


Writing Home.py
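
The later pages assume that the uploaded file contains the columns listed in the variable-description table. A minimal sketch of a header check is shown below. The helper name `missing_columns` and the two-column sample are hypothetical and are not part of `Home.py`; the helper takes the CSV as text, whereas `st.file_uploader` returns a file-like object.

```python
import csv
import io

# Column names taken from the variable-description table in Home.py.
EXPECTED_COLUMNS = [
    "Rank", "Title", "Genre", "Description", "Director", "Actors", "Year",
    "Runtime (Minutes)", "Rating", "Votes", "Revenue (Millions)", "Metascore",
]


def missing_columns(csv_text, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]


# Hypothetical two-column upload, used only to exercise the helper.
sample = "Rank,Title\n1,Example Movie\n"
absent = missing_columns(sample)
print(absent)
```

In the app, a non-empty result could be reported with `st.error` before the DataFrame is stored in `st.session_state`.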


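**Descriptive Analysis**

Streamlit loads the additional pages from a `pages/` directory next to `Home.py`. Create it first: `%%writefile` does not create missing directories and fails with `FileNotFoundError` otherwise.

In [None]:
!mkdir -p pages

In [None]: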
%%writefile pages/1_Descriptive_Analysis.py
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

st.set_page_config(page_title="Descriptive Analysis", page_icon="📊")
st.title("📊 Descriptive Analysis")

st.markdown("""
### 🎯 Defining the Problem
This analysis aims to understand the **basic statistical characteristics** of key numeric variables in the IMDb dataset.
We also want to identify any **missing values or extreme values** (outliers) which could impact future analyses like regression or clustering.
""")

st.markdown("""
### 📌 Information Required
We focus only on the numeric columns from the dataset:
- Runtime (Minutes)
- Rating
- Votes
- Revenue (Millions)
- Metascore
""")

st.markdown("""
### 🧮 Methods and Formulas
For this descriptive analysis we apply:
- `.describe()` from pandas for summary statistics (count, mean, std, min, quartiles, max)
- `.isnull().sum()` to identify missing values
- `sns.boxplot()` to detect extreme values (outliers)
""")

if "imdb_data" in st.session_state:
    df = st.session_state["imdb_data"]
    numeric_cols = ['Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)', 'Metascore']

    st.markdown("### 📊 Results")

    st.subheader("📈 Descriptive Statistics")
    st.dataframe(df[numeric_cols].describe())

    st.subheader("❗ Missing Values")
    missing = df[numeric_cols].isnull().sum()
    missing = missing[missing > 0].to_frame(name="Missing Count")
    st.dataframe(missing)

    st.subheader("📦 Boxplots for Numerical Columns")
    for col in numeric_cols:
        fig, ax = plt.subplots(figsize=(6, 2))
        sns.boxplot(x=df[col], ax=ax, color="#B7B5E4")
        ax.set_title(col)
        st.pyplot(fig)
        plt.close(fig)  # release the figure so reruns do not accumulate open figures

    st.subheader("📘 Economic Interpretation ")

    st.subheader("🎥 Runtime (Minutes)")
    st.write("""
    Most movies last around 1 hour and 50 minutes, which is normal for big-screen films.
    This length helps cinemas show the movie several times a day, which means more tickets sold.
    Very short movies might not feel worth the ticket price, and very long ones leave less room for other showings, so they can make less money.
    """)

    st.subheader("⭐ Rating")
    st.write("""
    Movie ratings show how much people liked a film. Most movies get a rating between 6.5 and 7.5, meaning they’re okay or good.
    Higher ratings can help movies make more money, because people are more likely to go see a movie their friends or online reviews say is great.
    So, good ratings often lead to better sales.
    """)

    st.subheader("👥 Votes")
    st.write("""
    The number of votes tells us how many people watched and rated a movie.
    A few movies have a lot of votes — over a million — which shows they were very popular.
    Most movies don’t get as many votes, which means fewer people watched them.
    In business, this tells us that just a few movies get most of the audience and money, while the rest are less known.
    """)

    st.subheader("💰 Revenue (Millions)")
    st.write("""
    Revenue means how much money a movie made. Most movies earn less than 100 million, even when they are popular,
    but some earn hundreds of millions, or even more than 900 million.
    This shows that blockbusters (big-budget movies with lots of ads and famous actors)
    can make a lot of money. But they’re also risky — if they fail, the losses are huge.
    So, film companies have to be careful and balance big projects with smaller, safer ones.
    """)

    st.subheader("🧠 Metascore")
    st.write("""
    The Metascore shows how much critics liked the movie. Most scores are between 47 and 72, so many movies are seen as average or okay by critics.
    A high Metascore doesn’t always mean more money, but it can help movies win awards, get better deals with streaming platforms, and make more money in the long run.
    It’s also important for smaller movies that don’t rely on big marketing.
    """)

else:
    st.warning("⚠️ Please upload the CSV file from the homepage first.")

Writing pages/1_Descriptive_Analysis.py
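
The boxplots flag extreme values with the 1.5 × IQR rule, which is the default whisker setting in matplotlib and seaborn. The sketch below reproduces that rule in plain Python so the thresholds can be inspected directly. The helper name `iqr_outliers` and the runtime values are made up for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Hypothetical runtimes in minutes.
runtimes = [90, 100, 105, 110, 115, 120, 125, 130, 190]
lower, upper, outliers = iqr_outliers(runtimes)
print(lower, upper, outliers)  # 75.0 155.0 [190]
```

Here Q1 is 105 and Q3 is 125, so the fences sit at 75 and 155, and only the 190-minute film would be drawn as an individual point beyond the whiskers.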



**Encoding and Scaling**

In [None]:
%%writefile pages/2_Encoding_and_Scaling.py
import streamlit as st
import pandas as pd
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer
import matplotlib.pyplot as plt
import seaborn as sns

st.set_page_config(page_title="Encoding & Scaling", page_icon="🔣")
st.title("🔣 Encoding & Scaling")


st.markdown("""
### 🎯 Defining the Problem
Machine learning models and statistical algorithms require numerical inputs.
We must convert categorical variables (like genres) into numeric format and scale numeric variables to standardize their ranges.
""")

st.markdown("""
### 📌 Information Required
- **Categorical column**: `Genre`
- **Numerical columns**: `Runtime (Minutes)`, `Rating`, `Votes`, `Revenue (Millions)`, `Metascore`
""")

if "imdb_data" in st.session_state:
    df = st.session_state["imdb_data"].copy()

    st.markdown("""
    ### 🧮 Methods
    - **Encoding**: We use `MultiLabelBinarizer` to one-hot encode genres (splitting multi-label strings).
    - **Scaling**: We use `StandardScaler` from `sklearn` to center and scale numeric columns (mean=0, std=1).
    """)

    st.subheader("🎬 Unique Genres Extracted and One-Hot Encoded")
    df['Genre List'] = df['Genre'].str.split(",")  # split into lists
    mlb = MultiLabelBinarizer()
    genre_encoded = pd.DataFrame(mlb.fit_transform(df['Genre List']), columns=mlb.classes_)
    st.dataframe(genre_encoded.head())

    st.subheader("📏 Scaling Numeric Variables")
    numeric_cols = ['Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)', 'Metascore']
    numeric_df = df[numeric_cols].dropna()
    scaler = StandardScaler()
    scaled = pd.DataFrame(scaler.fit_transform(numeric_df), columns=numeric_df.columns)
    st.dataframe(scaled.head())

    st.markdown("### 📊 Results — Correlation Heatmap (Scaled Data)")
    fig, ax = plt.subplots(figsize=(6, 4))
    sns.heatmap(scaled.corr(), annot=True, cmap="PuRd", ax=ax)
    st.pyplot(fig)

    st.write("""

---
🔝 **Strongest Positive Correlations:**

- **Votes & Revenue (0.64):** Movies that receive more votes tend to generate higher revenues.
  This makes sense, as more popular films (with more audience engagement) usually perform better at the box office.
- **Rating & Metascore (0.67):** User ratings and critic scores tend to align fairly well.
  A higher rating is typically associated with a higher Metascore, showing consensus between audiences and critics.

---

📈 **Moderate Positive Correlations:**

- **Rating & Votes (0.52):** Popular movies (with many votes) are often rated higher, though not always.
- **Runtime & Votes (0.40):** Slight tendency for longer movies to get more votes, possibly due to being more prominent or heavily marketed.
- **Runtime & Rating (0.37):** A small positive relationship—longer movies might be perceived as more in-depth or serious, leading to better ratings.

---

📉 **Weak or Negligible Correlations:**

- **Metascore & Revenue (0.14):** Revenue isn’t strongly tied to critic scores.
  Commercial success may depend more on marketing and genre than critical acclaim.
- **Runtime & Metascore (0.22):** Very weak connection between a film’s length and its critical evaluation.
- **Rating & Revenue (0.22):** Suggests that a higher rating doesn’t always guarantee higher revenue.
""")




    st.subheader("🔠 Encoding Genres")
    st.write("""
    Movies often belong to more than one genre, like Action and Sci-Fi.
    If we just used text, computers wouldn’t understand it.
    By turning each genre into a separate column (called one-hot encoding), we make sure the model knows exactly what genres are in each movie.
    This helps us later group movies by type or predict their performance based on genre.
    """)

    st.subheader("📏 Scaling Numeric Variables")
    st.write("""
    Some values, like revenue and votes, have very large numbers compared to things like rating or runtime.
    Without scaling, big numbers would dominate distance-based methods such as k-means.
    Scaling (StandardScaler) gives every feature a mean of 0 and a standard deviation of 1, so each feature contributes equally.
    For linear regression, scaling does not change how well the model fits, but it makes the coefficients easier to compare.
    """)

else:
    st.warning("⚠️ Please upload the IMDb dataset on the Home page first.")
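
To make the two transformations concrete, here is a small pure-Python sketch of what the page asks `MultiLabelBinarizer` and `StandardScaler` to do. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the function names `multi_hot` and `standardize` and the sample values are hypothetical, and unlike the page's `str.split(",")` the sketch also strips whitespace around genre names.

```python
import statistics


def multi_hot(genre_strings):
    """Encode comma-separated genre strings as 0/1 rows, one column per genre."""
    split = [[g.strip() for g in s.split(",")] for s in genre_strings]
    classes = sorted({g for row in split for g in row})  # sorted column order
    rows = [[1 if c in row else 0 for c in classes] for row in split]
    return classes, rows


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, as StandardScaler uses
    return [(v - mean) / std for v in values]


# Hypothetical sample values.
genres = ["Action,Sci-Fi", "Drama", "Action,Drama"]
classes, encoded = multi_hot(genres)
print(classes)  # ['Action', 'Drama', 'Sci-Fi']
print(encoded)  # [[1, 0, 1], [0, 1, 0], [1, 1, 0]]

votes = [100, 200, 300]
z = standardize(votes)
print([round(v, 3) for v in z])  # [-1.225, 0.0, 1.225]
```

After standardization the values have mean 0 and standard deviation 1, regardless of the original unit.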


**Grouping and Aggregation**

In [None]:
%%writefile pages/3_Grouping_and_Aggregation.py
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

st.set_page_config(page_title="Grouping & Aggregation", page_icon="🧮")
st.title("🧮 Grouping & Aggregation")

st.markdown("""
### 🎯 Defining the Problem
We want to identify **temporal trends and genre performance** by grouping data by year and genre.
This helps detect which periods or categories perform best in revenue, votes, and ratings.
""")

st.markdown("""
### 📌 Information Required
We will group the dataset using:
- `Year` for time-based analysis
- `Genre` (primary genre only) for category-based comparison

We’ll aggregate:
- **Average Rating**
- **Average Revenue**
- **Average Votes**
""")

if "imdb_data" in st.session_state:
    df = st.session_state["imdb_data"].copy()

    st.markdown("""
    ### 🧮 Methods and Calculations
    We use `groupby()` in pandas to compute averages by:
    - `Year`
    - `Primary Genre`
    Then we visualize trends using line plots and bar plots.
    """)

    # Extract primary genre
    df["Primary Genre"] = df["Genre"].apply(lambda x: x.split(",")[0] if pd.notnull(x) else x)

    # Group by Year
    year_group = df.groupby("Year")[["Rating", "Revenue (Millions)", "Votes"]].mean().reset_index()

    # Group by Primary Genre
    genre_group = df.groupby("Primary Genre")[["Rating", "Revenue (Millions)", "Votes"]].mean().reset_index()

    st.markdown("### 📊 Results")

    st.subheader("📅 Average Metrics by Year")
    st.dataframe(year_group.round(2))

    st.subheader("🎬 Average Metrics by Primary Genre")
    st.dataframe(genre_group.round(2))

    st.subheader("📈 Trend: Average Rating Over the Years")
    fig1, ax1 = plt.subplots()
    sns.lineplot(data=year_group, x="Year", y="Rating", ax=ax1, color="#DB83B0")
    ax1.set_ylabel("Average Rating")
    st.pyplot(fig1)

    st.subheader("💰 Top Genres by Average Revenue")
    top_revenue = genre_group.sort_values("Revenue (Millions)", ascending=False).head(10)
    fig2, ax2 = plt.subplots()
    sns.barplot(data=top_revenue, x="Revenue (Millions)", y="Primary Genre", ax=ax2, palette="PuRd_r")
    st.pyplot(fig2)

    st.markdown("### 📘 Economic Interpretation ")

    st.subheader("📅 Temporal Trends ")
    st.write("""
    When we look at movies over time, we see that average ratings have stayed pretty stable,
    especially after 2010. This means that even though styles and technologies change,
    people’s opinions about movie quality don’t shift much year by year.
    This helps producers know that quality expectations stay consistent over time.
    """)

    st.subheader("🎬 Genre ")
    st.write("""
    Some genres like Action, Adventure, and Animation earn the most money on average.
    This shows that audiences are willing to pay more to see exciting, big-budget movies.
    So, film companies may choose to invest more in these types of movies if they want bigger profits.
    """)

    st.write("""
    Genres like Biography or Documentary often get high ratings, but not many people vote for them.
    This means they’re liked by the people who see them, but not many watch them.
    These genres are good for smaller audiences or platforms like film festivals or streaming services,
    where reviews matter more than ticket sales.
    """)

else:
    st.warning("⚠️ Please upload the IMDb dataset from the Home page first.")
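
The grouping step can be sketched without pandas to show what `groupby("Primary Genre").mean()` computes: take the first genre of each movie, then average the metric within each group while skipping missing values. The rows and the helper name `mean_revenue_by_primary_genre` below are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows; None stands in for a missing revenue value.
movies = [
    {"Genre": "Action,Sci-Fi", "Revenue (Millions)": 300.0},
    {"Genre": "Action,Adventure", "Revenue (Millions)": 100.0},
    {"Genre": "Drama", "Revenue (Millions)": 40.0},
    {"Genre": "Drama,Romance", "Revenue (Millions)": None},
]


def mean_revenue_by_primary_genre(rows):
    """Average revenue per primary genre, ignoring rows without revenue."""
    totals = defaultdict(lambda: [0.0, 0])  # genre -> [sum, count]
    for row in rows:
        revenue = row["Revenue (Millions)"]
        if revenue is None:  # pandas' mean() also skips missing values
            continue
        primary = row["Genre"].split(",")[0]  # first listed genre
        totals[primary][0] += revenue
        totals[primary][1] += 1
    return {genre: total / count for genre, (total, count) in totals.items()}


averages = mean_revenue_by_primary_genre(movies)
print(averages)  # {'Action': 200.0, 'Drama': 40.0}
```

Because only the first listed genre is used, a film tagged "Action,Sci-Fi" counts toward Action and never toward Sci-Fi, which is worth keeping in mind when reading the genre ranking.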


**Clustering with ScikitLearn**

In [None]:
%%writefile pages/4_Clustering_with_ScikitLearn.py
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

st.set_page_config(page_title="Clustering (K-Means)", page_icon="🧩")
st.title("🧩 K-Means Clustering on Movies")

st.markdown("""
### 🎯 Defining the Problem
We want to group movies into clusters based on performance indicators like revenue, votes, and ratings.
This helps identify movie "types" such as blockbusters, flops, or critically acclaimed.
""")

st.markdown("""
### 📌 Information Required
We’ll use these numeric features:
- `Rating`
- `Votes`
- `Revenue (Millions)`
- `Metascore`
- `Runtime (Minutes)`
""")

if "imdb_data" in st.session_state:
    df = st.session_state["imdb_data"].copy()


    st.markdown("""
    ### 🧮 Methods
    - Drop missing values in selected columns
    - Apply `StandardScaler` to normalize data
    - Fit `KMeans` with 3 clusters
    - Visualize clusters in 2D using the first two scaled features (`Rating` and `Votes`)
    """)


    features = ['Rating', 'Votes', 'Revenue (Millions)', 'Metascore', 'Runtime (Minutes)']
    df_cluster = df[features].dropna().copy()  # explicit copy so adding the Cluster column is safe
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(df_cluster)

    # Fit K-Means
    kmeans = KMeans(n_clusters=3, random_state=42, n_init='auto')
    labels = kmeans.fit_predict(X_scaled)
    df_cluster["Cluster"] = labels


    st.markdown("### 📊 Results")

    st.subheader("Cluster Centers (Scaled Features)")
    centers_df = pd.DataFrame(kmeans.cluster_centers_, columns=features)
    st.dataframe(centers_df)

    st.subheader("Cluster Visualization")
    fig, ax = plt.subplots()
    sns.scatterplot(
        x=X_scaled[:, 0], y=X_scaled[:, 1],
        hue=labels, palette="deep", alpha=0.6, ax=ax
    )
    ax.set_xlabel(features[0])
    ax.set_ylabel(features[1])
    st.pyplot(fig)

    st.markdown("### 📘 Economic Interpretation")
    st.write("""
    After grouping the movies using K-Means, we can look at what kind of movies are in each group (cluster).
    Even though the algorithm doesn't label the groups, we can guess based on the average values.
    """)

    st.markdown("""
    - 🎬 **Cluster 0**: These movies likely have **very high revenue and votes**.
      They could be **blockbusters** — big-budget films with lots of viewers and strong box office results.

    - 🎞️ **Cluster 1**: These have **average values** — not too high or too low.
      They may be **mid-range or standard films** — not famous, but not bad either.

    - 🎥 **Cluster 2**: These seem to have **high ratings or Metascore**, but **lower revenue and votes**.
      This could mean **critically loved or indie films** — maybe popular with critics, but not seen by many people.
    """)

else:
    st.warning("⚠️ Please upload the IMDb dataset from the Home page first.")
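
For readers who want to see what happens inside the clustering step, the following is a minimal sketch of Lloyd's algorithm, the iteration that k-means is built on. It is a simplification: the starting centroids are passed in by hand (scikit-learn chooses them with k-means++ by default), and the points are hypothetical 2-D values rather than the five scaled movie features.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated hypothetical groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The cluster numbers carry no meaning of their own: a different starting point can swap which group is called 0, so the interpretation of each cluster should always be read off the cluster centers.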


**Merge**

In [None]:
%%writefile pages/5_Merge_Data.py
import streamlit as st
import pandas as pd

st.set_page_config(page_title="Merge Data", page_icon="🔗", layout="wide")
st.title("🔗 Merging IMDb Dataset with Country Information")


if "imdb_data" not in st.session_state:
    st.warning("⚠️ Please upload your IMDb dataset on the Home page first.")
    st.stop()

imdb_df = st.session_state["imdb_data"]

try:
    country_df = pd.read_csv("country_only_randomized.csv")
except FileNotFoundError:
    st.error("❌ The file 'country_only_randomized.csv' was not found in the current directory.")
    st.stop()


with st.expander("📄 Preview IMDb Dataset"):
    st.dataframe(imdb_df.head())

with st.expander("🌍 Preview Country Dataset"):
    st.dataframe(country_df.head())


st.subheader("🔀 Merged Dataset Preview")


imdb_df = imdb_df.reset_index(drop=True)
country_df = country_df.reset_index(drop=True)
merged_df = pd.concat([imdb_df, country_df], axis=1)

st.session_state["merged_data"] = merged_df


merged_file_path = "imdb_movie_dataset_country.csv"
merged_df.to_csv(merged_file_path, index=False)
st.success(f"📁 Merged data saved to `{merged_file_path}`")

st.dataframe(merged_df.head())

st.download_button(
    label="⬇️ Download Merged CSV",
    data=merged_df.to_csv(index=False),
    file_name="imdb_movie_dataset_country.csv",
    mime="text/csv"
)


**Statsmodels - Regression**

In [None]:
%%writefile pages/6_Regression_with_Statsmodels.py
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

st.set_page_config(page_title="Multiple Regression", page_icon="📈")
st.title("📈 Multiple Linear Regression (statsmodels)")

st.markdown("""
### 🎯 Defining the Problem
We want to understand which factors influence **box office revenue** the most.
We'll use multiple regression to quantify how variables like rating, votes, and metascore affect movie revenue.
""")

st.markdown("""
### 📌 Information Required
We’ll use the following:
- **Target variable**: `Revenue (Millions)`
- **Independent variables**:
    - `Rating`
    - `Votes`
    - `Metascore`
    - `Runtime (Minutes)`
""")

if "imdb_data" in st.session_state:
    df = st.session_state["imdb_data"].copy()

    st.markdown("""
    ### 🧮 Methods and Calculations
    - Clean dataset: drop rows with missing values
    - Define X (features) and y (target)
    - Add constant term (intercept) using `sm.add_constant()`
    - Fit OLS model using `statsmodels.api.OLS()`
    - Show summary statistics and coefficients
    """)


    cols = ['Revenue (Millions)', 'Rating', 'Votes', 'Metascore', 'Runtime (Minutes)']
    df_reg = df[cols].dropna()

    y = df_reg['Revenue (Millions)']
    X = df_reg[['Rating', 'Votes', 'Metascore', 'Runtime (Minutes)']]
    X = sm.add_constant(X)

    model = sm.OLS(y, X).fit()


    st.markdown("### 📊 Results")

    st.subheader("📄 Regression Summary")
    st.text(model.summary())

    st.subheader("📉 Residual Plot")
    fig, ax = plt.subplots()
    sns.residplot(x=model.fittedvalues, y=model.resid, ax=ax, lowess=True, color="#DB83B0")
    ax.set_xlabel("Fitted Values")
    ax.set_ylabel("Residuals")
    st.pyplot(fig)

    st.markdown("### 📘 Economic Interpretation ")

    st.write("""
    The regression model shows how different features (like rating or votes) affect how much money a movie makes.
    It gives us coefficients, which tell us if something has a positive or negative impact on revenue.
    """)

    st.markdown("""
    -  **Votes**: A strong positive effect — the more people vote for a movie, the more money it likely makes.
      This makes sense because more viewers usually means higher ticket sales.

    -  **Rating**: Also a positive effect — well-rated movies usually earn more.
      A higher rating can attract more people and build trust.

    -  **Metascore**: The effect is smaller and sometimes unclear.
      Critics' opinions matter, but not as much as what everyday viewers think.

    -  **Runtime**: The impact is small — so just being longer or shorter doesn’t really predict revenue on its own.
    """)

    st.subheader("📊 How Good is the Model?")
    st.write("""
    The **R-squared value** tells us how much of the revenue we can explain using our variables.
    A higher number (closer to 1) means a better prediction.
    In our case, the model explains a **good portion of the variation** in revenue, but not all,
    which is normal because things like marketing, actors, or release date also matter.

    This kind of analysis helps movie studios and investors focus on the features that really make a financial difference.
    """)

else:
    st.warning("⚠️ Please upload the IMDb dataset from the Home page first.")
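
The quantities reported by the regression page (coefficients, residuals, R-squared) can be illustrated with a one-predictor least-squares fit in plain Python. This is a simplified sketch, not the multiple regression that `sm.OLS` fits: `fit_line` is a hypothetical helper and the five data points are invented.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    ss_res = sum(r ** 2 for r in residuals)        # unexplained variation
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)   # total variation
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, residuals, r_squared


# Hypothetical predictor and target values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, intercept, residuals, r_squared = fit_line(x, y)
print(round(slope, 3), round(intercept, 3), round(r_squared, 3))  # 0.6 2.2 0.6
```

With an intercept in the model the residuals sum to zero, which is why a residual plot is read for patterns around the zero line rather than for its overall level.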



**GeoPandas**

In [None]:
%%writefile pages/7_GeoPandas_Visualization.py
import streamlit as st
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

st.set_page_config(page_title="GeoPandas Map", page_icon="🗺")
st.title("🌍 Average Movie Revenue by Country")

# Problem definition
st.markdown("We want to visualize how *average movie revenue* varies by country.")

try:
    # Load the merged dataset
    df = pd.read_csv("imdb_movie_dataset_country.csv")

    # Average revenue by country
    country_avg = df.groupby("Country")["Revenue (Millions)"].mean().reset_index()

    # 🔁 Rename "United States" to match shapefile naming
    country_avg["Country"] = country_avg["Country"].replace({
        "United States": "United States of America"
    })

    # Drop countries with no revenue
    country_avg = country_avg.dropna(subset=["Revenue (Millions)"])

    # ✅ Load world map from reliable source (NACIS Natural Earth)
    url = "https://naciscdn.org/naturalearth/110m/cultural/ne_110m_admin_0_countries.zip"
    world = gpd.read_file(url)

    # Merge world geometries with country revenue
    world_map = world.merge(country_avg, how="left", left_on="NAME", right_on="Country")

    # 🌍 Plot the map
    fig, ax = plt.subplots(figsize=(12, 6))
    world_map.plot(
        column="Revenue (Millions)",
        cmap="YlOrRd",
        linewidth=0.8,
        ax=ax,
        edgecolor='black',
        legend=True,
        missing_kwds={"color": "lightgrey", "label": "No data"}
    )
    ax.set_title("Average Movie Revenue by Country", fontsize=14)
    ax.axis("off")
    st.pyplot(fig)

    st.markdown("### 📘 Economic Interpretation")

    st.write("""
    The map shows how much money movies make on average in different countries.
    Darker countries have higher average movie revenue.
    Lighter countries make less from movies.
    Grey means we don’t have enough data for that country.
    """)

    st.markdown("""
    - **United States** has the highest revenue — this makes sense because Hollywood is the biggest film industry in the world.
    - **United Kingdom** and **Canada** also have strong movie markets with big studios and international audiences.
    - Some countries might have fewer movies in the dataset or focus more on local productions with smaller budgets.
    """)

    st.write("""
    💡 This kind of map helps us understand where movies tend to make the most money.
    It’s useful for:
    - Planning where to release or promote a movie
    - Deciding where to invest in film production
    - Understanding global trends in the film industry
    """)

except Exception as e:
    st.error(f"❌ Could not generate map. Error: {e}")


In [None]:
!streamlit run Home.py &>/content/logs.txt &

In [None]:
import urllib.request
print("Password/Endpoint IP for localtunnel is:", urllib.request.urlopen('https://ipv4.icanhazip.com').read().decode('utf8').strip("\n"))

In [None]:
!npx localtunnel --port 8501