![image1](https://s3-us-west-2.amazonaws.com/prd-rteditorial/wp-content/uploads/2018/03/13153742/RT_300EssentialMovies_700X250.jpg)

# Predicting Movie Revenues: An Analytical Approach

## Introduction

In the competitive world of film production, predicting a movie's revenue before its release is both a challenge and a necessity. It allows stakeholders to make informed decisions regarding marketing strategies, distribution plans, and other critical aspects. This project analyzes in depth the features that might impact a movie's revenue, with the ultimate goal of developing a robust predictive model. The model aims to estimate the revenue of movies based on features such as budget, genre, production companies and countries, release date, and runtime.

## Business Context

**Background:**
The global film industry is a multi-billion-dollar industry with substantial economic impact and influence. Understanding the potential revenue a movie can generate is crucial for production companies, investors, and other stakeholders involved in the movie-making process.

**Problem Statement:**
Building an accurate and reliable model to predict a movie's revenue can significantly enhance decision-making, budget allocation, and risk assessment. This project aims to build such a model by utilizing a diverse dataset encompassing various aspects of movies and employing machine learning techniques.

**Objective:**
The primary objective is to explore, analyze, and model the dataset to predict the revenue of movies. The model's performance is continuously assessed and refined to ensure its reliability and accuracy.

## Data Overview

The dataset used in this project is a comprehensive collection of movie data (The Movies Dataset, available at https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?select=movies_metadata.csv), including features like:

- **Budget:** The amount of money allocated for the movie production.
- **Genres:** The categories or genres associated with the movie.
- **Original Language:** The original language in which the movie was filmed.
- **Production Companies:** The companies responsible for producing the movie.
- **Production Countries:** The countries where the movie was produced.
- **Release Date:** The date when the movie was released.
- **Runtime:** The duration of the movie.
- **Revenue:** The total revenue generated by the movie (target variable).

## Analytical Approach

The project involves multiple steps including:

1. **Data Cleaning:** Ensuring the dataset is clean and formatted correctly for analysis and modeling.
2. **Exploratory Data Analysis (EDA):** Understanding the data distribution, relationships, and patterns.
3. **Baseline Model Building:** Employing machine learning algorithms to build a predictive model.
4. **Model Refinement:** Making necessary adjustments for improvement.

By systematically executing these steps, the project aspires to deliver a dependable model for predicting movie revenues, contributing to the efficient and effective decision-making process in the film industry.

## Conclusion and Future Work

The project promises insightful observations and a reliable predictive model for movie revenues (MAE ~9 MM USD). Future work may involve integrating additional data, employing more advanced modeling techniques, and exploring further feature engineering opportunities to boost the model's performance.


Another project of mine using some of the same datasets is Movie Revenue Prediction & Success Classification: https://www.kaggle.com/code/victorpaschoalini/movie-revenue-prediction-success-classification/notebook

# 1. Data Cleaning and Feature Engineering

## 1.1 Importing all the modules used in the project

In [None]:
import pandas as pd
import ast
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

## 1.2. Selecting the features

The dataset has 24 features, but the majority will not help in the modelling (e.g., overview, popularity, votes), either because they are not tabular data or because the information only becomes available after a movie's release. We are therefore only going to use the features that are most useful for a predictive model.

In [None]:
# Defining the selected columns and the function to extract genres

selected_columns = ['budget', 'genres', 'original_language', 'original_title',
                    'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime']

# Selecting the desired columns
movies_metadata_df = pd.read_csv("/kaggle/input/the-movies-dataset/movies_metadata.csv", low_memory=False)
movies_metadata_df = movies_metadata_df[selected_columns]

## 1.3. Making the "genres", "production companies" and "production countries" features useful

These features are stored as JSON-like strings (lists of dictionaries) and in most cases hold more than one value per row. We are therefore going to extract the first three genres, the primary production company together with the number of production companies, and the primary production country together with the number of production countries.
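To make the format concrete, here is a minimal sketch of parsing one such value. The sample string is made up for illustration: it mimics the shape of a cell in the `genres` column but is not copied from the dataset.

```python
import ast

# Hypothetical raw value, shaped like one cell of the 'genres' column:
# a list of dicts stored as a string (single quotes, so not strict JSON).
sample_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# ast.literal_eval safely evaluates Python literals (lists, dicts, strings, numbers).
parsed = ast.literal_eval(sample_genres)

# Keeping only the names, in the order they are listed.
genre_names = [entry['name'] for entry in parsed]
print(genre_names)  # ['Animation', 'Comedy']
```

A strict JSON parser such as `json.loads` would reject this sample because of the single quotes, which is one reason to reach for `ast.literal_eval` here.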

In [None]:
# Defining the helper function to extract genres
def extract_genres(genres_str):
    try:
        genres_list = ast.literal_eval(genres_str)
        primary_genre = genres_list[0]['name'] if len(genres_list) > 0 else None
        secondary_genre = genres_list[1]['name'] if len(genres_list) > 1 else None
        tertiary_genre = genres_list[2]['name'] if len(genres_list) > 2 else None
        return primary_genre, secondary_genre, tertiary_genre
    except (ValueError, SyntaxError):
        return None, None, None

# Applying the function to extract genres
movies_metadata_df[['primary_genre', 'secondary_genre', 'tertiary_genre']] = movies_metadata_df['genres'].apply(
    lambda x: pd.Series(extract_genres(x)))

# Dropping the original 'genres' column
movies_metadata_df = movies_metadata_df.drop(columns=['genres'])

# Previewing the dataset after handling 'genres' column
movies_metadata_df.head()

In [None]:
# Defining the helper function to handle unexpected values and extract production companies information
def extract_production_companies(prod_comp_str):
    try:
        prod_comp_list = ast.literal_eval(prod_comp_str)
        if not prod_comp_list or not isinstance(prod_comp_list, list):
            return None, 0
        primary_prod_comp = prod_comp_list[0]['name'] if len(prod_comp_list) > 0 else None
        num_prod_companies = len(prod_comp_list)
        return primary_prod_comp, num_prod_companies
    except (ValueError, SyntaxError):
        return None, 0

# Applying the function to extract production companies information
movies_metadata_df[['primary_production_company', 'num_production_companies']] = movies_metadata_df['production_companies'].apply(
    lambda x: pd.Series(extract_production_companies(x)))

# Defining the helper function to handle unexpected values and extract production countries information
def extract_production_countries(prod_country_str):
    try:
        prod_country_list = ast.literal_eval(prod_country_str)
        if not prod_country_list or not isinstance(prod_country_list, list):
            return None, 0
        primary_prod_country = prod_country_list[0]['iso_3166_1'] if len(prod_country_list) > 0 else None
        num_prod_countries = len(prod_country_list)
        return primary_prod_country, num_prod_countries
    except (ValueError, SyntaxError):
        return None, 0

# Applying the function to extract production countries information
movies_metadata_df[['primary_production_country', 'num_production_countries']] = movies_metadata_df['production_countries'].apply(
    lambda x: pd.Series(extract_production_countries(x)))

# Dropping the original 'production_companies' and 'production_countries' columns
movies_metadata_df = movies_metadata_df.drop(columns=['production_companies', 'production_countries'])

# Previewing the dataset after handling 'production_companies' and 'production_countries' columns
movies_metadata_df.head()


## 1.4. Classic data cleaning steps

In [None]:
# Converting 'budget' to numeric
movies_metadata_df['budget'] = pd.to_numeric(movies_metadata_df['budget'], errors='coerce')

# Converting 'release_date' to datetime
movies_metadata_df['release_date'] = pd.to_datetime(movies_metadata_df['release_date'], errors='coerce')

# Extracting 'year' and 'month' from 'release_date'
movies_metadata_df['release_year'] = movies_metadata_df['release_date'].dt.year
movies_metadata_df['release_month'] = movies_metadata_df['release_date'].dt.month

# Dropping the original 'release_date' column
movies_metadata_df = movies_metadata_df.drop(columns=['release_date'])

# Handling missing values by filling them with appropriate placeholders or zeros
# (assigning the result back avoids chained in-place fills, which newer pandas versions warn about)
categorical_fill_columns = ['primary_genre', 'secondary_genre', 'tertiary_genre',
                            'primary_production_company', 'primary_production_country']
movies_metadata_df[categorical_fill_columns] = movies_metadata_df[categorical_fill_columns].fillna('Unknown')

numerical_fill_columns = ['budget', 'revenue', 'runtime', 'num_production_companies',
                          'num_production_countries', 'release_year', 'release_month']
movies_metadata_df[numerical_fill_columns] = movies_metadata_df[numerical_fill_columns].fillna(0)

# Previewing the cleaned dataset
movies_metadata_df.head()


The dataset has been further cleaned and processed:

* budget has been converted to a numeric type.

* release_date has been converted to datetime, and year and month have been extracted as separate features (release_year and release_month).

* Missing values in genre, production company, and production country have been filled with 'Unknown'.

* Missing values in numerical columns have been filled with 0.

Now, to finish the data cleaning steps, we are going to drop the rows without budget information and set the right index.

In [None]:
# Dropping the rows with budget = 0
movies_metadata_df = movies_metadata_df[movies_metadata_df['budget'] != 0].reset_index(drop=True)

# Checking the shape of the dataset after dropping rows with budget = 0
print(movies_metadata_df.shape)

# Setting 'original_title' as the index
movies_metadata_df.set_index('original_title', inplace=True)


In [None]:
# movies_metadata_df.to_csv('clean_data.csv', index=False) # uncomment to download the cleaned dataset

# 2. EDA (Exploratory Data Analysis)

Now let's proceed with the Exploratory Data Analysis (EDA). Here are the aspects explored and visualized:

1. **Distribution of Numerical Features:**
   - Distribution of `budget`, `revenue`, `runtime`, and `num_production_companies`.
   
   
2. **Categorical Feature Analysis:**
   - Number of movies per `primary_genre` and `primary_production_country`.
   
   
3. **Time-based Analysis:**
   - Number of movies released each year (`release_year`).
   - Monthly distribution of movie releases (`release_month`).
   
   
4. **Correlation Analysis:**
   - Correlation between numerical features.
   
   
5. **Revenue Analysis:**
   - Revenue vs. Budget.
   - Revenue vs. Runtime.

## 2.1. Distribution of Numerical Features
   - Distribution of `budget`, `revenue`, `runtime`, and `num_production_companies`.

In [None]:
# Setting the style of seaborn
sns.set(style="whitegrid")

# 1. Distribution of Numerical Features
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(14, 12))
sns.histplot(data=movies_metadata_df, x='budget', bins=30, ax=axes[0, 0], kde=True, color='blue')
axes[0, 0].set_title('Distribution of Budget')
sns.histplot(data=movies_metadata_df, x='revenue', bins=30, ax=axes[0, 1], kde=True, color='green')
axes[0, 1].set_title('Distribution of Revenue')
sns.histplot(data=movies_metadata_df, x='runtime', bins=30, ax=axes[1, 0], kde=True, color='red')
axes[1, 0].set_title('Distribution of Runtime')
sns.histplot(data=movies_metadata_df, x='num_production_companies', bins=30, ax=axes[1, 1], kde=True, color='purple')
axes[1, 1].set_title('Distribution of Number of Production Companies')
plt.tight_layout()
plt.show()


The histograms above provide insights into the distribution of `budget`, `revenue`, `runtime`, and `num_production_companies`:

- **Budget**: Most movies have a budget less than 100 million, with a peak at lower budgets.
- **Revenue**: A similar trend is observed in revenue, with most movies earning less than 100 million.
- **Runtime**: The majority of movies have a runtime around 90 to 120 minutes.
- **Number of Production Companies**: Most movies are associated with just one production company, and very few movies have more than 5 production companies involved.
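
Because budget and revenue are concentrated at low values with a long right tail, a log transform is a common way to make such distributions easier to inspect. The sketch below uses synthetic numbers (not the dataset) and the `np.log1p` transform, which is an assumption of this example rather than a step taken in the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for 'revenue' (illustrative only).
rng = np.random.default_rng(0)
revenue = pd.Series(rng.lognormal(mean=16, sigma=1.5, size=1000))

# log1p(x) = log(1 + x), which stays defined when x is 0.
log_revenue = np.log1p(revenue)

# Skewness close to 0 indicates a roughly symmetric distribution.
print(f"skew before: {revenue.skew():.2f}")
print(f"skew after:  {log_revenue.skew():.2f}")
```

The same idea could be tried on the real data by plotting `np.log1p(movies_metadata_df['revenue'])` instead of the raw column.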

## 2.2. Categorical Feature Analysis
   - Number of movies per `primary_genre` and `primary_production_country`.

In [None]:
# 2. Categorical Feature Analysis

# Number of Movies per Primary Genre
plt.figure(figsize=(16, 6))
sns.countplot(data=movies_metadata_df, y='primary_genre', order=movies_metadata_df['primary_genre'].value_counts().index, palette="viridis")
plt.title('Number of Movies per Primary Genre')
plt.xlabel('Number of Movies')
plt.ylabel('Primary Genre')
plt.show()

# Number of Movies per Primary Production Country (Top 20)
plt.figure(figsize=(16, 6))
sns.countplot(data=movies_metadata_df, y='primary_production_country', order=movies_metadata_df['primary_production_country'].value_counts().head(20).index, palette="viridis")
plt.title('Number of Movies per Primary Production Country (Top 20)')
plt.xlabel('Number of Movies')
plt.ylabel('Primary Production Country')
plt.show()


The visualizations above show the number of movies per `primary_genre` and `primary_production_country`:

- **Primary Genre**: Drama, Comedy, and Action are the top three genres with the highest number of movies. There's also a significant number of movies with unknown primary genres.
- **Primary Production Country**: The United States (US) is by far the leading country in terms of movie production, followed by other countries like the United Kingdom (GB), France (FR), and Germany (DE).

## 2.3. Time-based Analysis
   - Number of movies released each year (`release_year`).
   - Monthly distribution of movie releases (`release_month`).
   

In [None]:
# 3. Time-based Analysis

# Number of Movies Released Each Year
plt.figure(figsize=(18, 6))
sns.countplot(data=movies_metadata_df, x='release_year', palette="viridis")
plt.title('Number of Movies Released Each Year')
plt.xlabel('Release Year')
plt.ylabel('Number of Movies')
plt.xticks(rotation=90)
plt.show()

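# Monthly Distribution of Movie Releases
# Rows with a missing release date were given month 0 in section 1.4;
# excluding them keeps the 12 bars aligned with the 12 month labels.
monthly_df = movies_metadata_df[movies_metadata_df['release_month'] > 0]
plt.figure(figsize=(12, 6))
sns.countplot(data=monthly_df, x='release_month', palette="viridis")
plt.title('Monthly Distribution of Movie Releases')
plt.xlabel('Release Month')
plt.ylabel('Number of Movies')
plt.xticks(ticks=range(12), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()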


The visualizations above provide insights into the time-based distribution of movie releases:

- **Number of Movies Released Each Year**: 
  - There's a steady increase in the number of movies released each year, with a noticeable peak around 2014-2015. 
  - There's a sharp decline in 2016 and beyond, which might be due to incomplete data for these years.

- **Monthly Distribution of Movie Releases**: 
  - The most popular months for movie releases are January and September, followed by October and December. 
  - These trends might be related to the timing of film festivals, awards season, and holiday periods.

## 2.4. Correlation Analysis
   - Correlation between numerical features.

In [None]:
# 4. Correlation Analysis

# Calculating the correlation matrix
correlation_matrix = movies_metadata_df[['budget', 'revenue', 'runtime', 'num_production_companies', 'num_production_countries', 'release_year', 'release_month']].corr()

# Plotting the heatmap for correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix of Numerical Features')
plt.show()


The heatmap above displays the correlation between various numerical features:

- **Budget and Revenue**: There is a positive correlation (0.74) between budget and revenue, indicating that movies with higher budgets tend to generate higher revenues.
- **Runtime and Revenue**: The correlation between runtime and revenue is relatively low (0.17).
- **Number of Production Companies and Revenue**: This pair also has a low correlation (0.15).
- **Release Year and Revenue**: There is also a low correlation (0.05) between release year and revenue.
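
`DataFrame.corr()` uses the Pearson coefficient by default. As a reminder of what that number measures, the sketch below recomputes it by hand on a few made-up budget/revenue pairs and compares the result with pandas. The values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values in millions of USD (illustrative only).
toy = pd.DataFrame({
    'budget':  [10.0, 25.0, 50.0, 100.0, 150.0],
    'revenue': [30.0, 20.0, 140.0, 260.0, 300.0],
})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means.
dx = toy['budget'] - toy['budget'].mean()
dy = toy['revenue'] - toy['revenue'].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity.
r_pandas = toy['budget'].corr(toy['revenue'])
print(round(r_manual, 4), round(r_pandas, 4))
```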

## 2.5. Revenue Analysis
   - Revenue vs. Budget.
   - Revenue vs. Runtime.

In [None]:
# 5. Revenue Analysis

# Revenue vs Budget
plt.figure(figsize=(10, 6))
sns.scatterplot(data=movies_metadata_df, x='budget', y='revenue', color='blue', alpha=0.6)
plt.title('Revenue vs Budget')
plt.xlabel('Budget')
plt.ylabel('Revenue')
plt.show()

# Revenue vs Runtime
plt.figure(figsize=(10, 6))
sns.scatterplot(data=movies_metadata_df, x='runtime', y='revenue', color='green', alpha=0.6)
plt.title('Revenue vs Runtime')
plt.xlabel('Runtime')
plt.ylabel('Revenue')
plt.show()


The scatter plots above display the relationship between revenue and other features:

- **Revenue vs. Budget**: 
  - There is a clear trend indicating that movies with higher budgets generally tend to generate higher revenues. 
  - There are also instances of movies with low budgets generating high revenues, indicating the presence of some successful low-budget movies.

- **Revenue vs. Runtime**: 
  - There is no distinct trend observed between revenue and runtime.

# 3. Baseline Model Building

When starting to make predictions, it is common practice to begin with a baseline model. The most classic model for this type of problem is a simple linear regression. Before fitting the model, it is a good idea to do some feature engineering so that the categorical data can be used.

## 3.1. Feature Engineering

To use the categorical columns, they must first be one-hot encoded. Target encoding is another option and would probably suit `primary_production_company` better, because there are many different companies. For language and country, a better approach would probably be to use only two classes, US and other, because every other country has very little data.
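
The two alternatives mentioned above are not used in this notebook, but they could look roughly like the sketch below. The toy frame, the `is_us` and `company_encoded` column names, and the values are all hypothetical.

```python
import pandas as pd

# Toy frame with hypothetical values (illustrative only).
toy = pd.DataFrame({
    'primary_production_country': ['US', 'US', 'GB', 'FR', 'US', 'Unknown'],
    'primary_production_company': ['A', 'A', 'B', 'C', 'B', 'A'],
    'revenue': [100.0, 300.0, 50.0, 20.0, 150.0, 10.0],
})

# Binary flag: 1 for US productions, 0 for everything else.
toy['is_us'] = (toy['primary_production_country'] == 'US').astype(int)

# Target encoding: replace each company by the mean revenue of its rows.
# In a real pipeline the means should come from the training split only,
# otherwise the target leaks into the features.
company_means = toy.groupby('primary_production_company')['revenue'].mean()
toy['company_encoded'] = toy['primary_production_company'].map(company_means)
print(toy[['is_us', 'company_encoded']])
```

The notebook itself continues with plain one-hot encoding.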

In [None]:
# One-Hot Encoding for Categorical Variables
categorical_columns = ['primary_genre', 'secondary_genre', 'tertiary_genre', 'primary_production_company', 
                       'primary_production_country', 'original_language']

# Performing one-hot encoding
movies_metadata_encoded_df = pd.get_dummies(movies_metadata_df, columns=categorical_columns, drop_first=True)

# Previewing the dataset after one-hot encoding
movies_metadata_encoded_df.head()


## 3.2. Creating the Model

As mentioned before, the chosen baseline model is the classic Linear Regression. The next steps are the train/test split, model building, prediction, and model evaluation.

In [None]:
# Defining the features (X) and the target (y)
X = movies_metadata_encoded_df.drop(columns=['revenue'])
y = movies_metadata_encoded_df['revenue']

# Splitting the dataset into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Building a Linear Regression Model
linear_reg_model = LinearRegression()
linear_reg_model.fit(X_train, y_train)

# Predicting the target for the testing set
y_pred = linear_reg_model.predict(X_test)

# Calculating the Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)

mae

The baseline model has an MAE (Mean Absolute Error) of ~50 MM USD. That error is acceptable for high-budget movies, but it is large for the far more common movies with budgets below 100 MM USD.
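
To illustrate why the same absolute error matters more for smaller movies, here is a small worked example of the MAE with made-up numbers (in millions of USD). The arrays are illustrative and not model output.

```python
import numpy as np

# Made-up actual and predicted revenues in millions of USD (illustrative only).
toy_actual = np.array([5.0, 20.0, 80.0, 400.0])
toy_predicted = np.array([30.0, 45.0, 60.0, 330.0])

# MAE is the average of the absolute errors.
toy_abs_errors = np.abs(toy_actual - toy_predicted)
toy_mae = toy_abs_errors.mean()
print(toy_abs_errors)  # [25. 25. 20. 70.]
print(toy_mae)         # 35.0

# The same absolute error is a much larger share of a small revenue.
toy_relative_errors = toy_abs_errors / toy_actual
print(toy_relative_errors)
```

In this toy case, an error of 25 is five times the revenue of the smallest movie, while an error of 70 is less than a fifth of the largest one.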

# 4. Model Refinement

To improve the MAE of the model, we are going to revisit the feature engineering (in this case the one-hot encoding) and exclude the outliers.

## 4.1. One-Hot Encoding

The first step is to revisit how the categorical columns enter the model through one-hot encoding. After some tests, I found that using all the categorical features led to overfitting, so we are going to keep only one (primary genre).

In [None]:
# Excluding specified columns
new_cleaned_data_df = movies_metadata_df.drop(columns=['original_language', 'secondary_genre', 'tertiary_genre', 'primary_production_company', 'primary_production_country'])

# Viewing the first few rows of the modified dataset
new_cleaned_data_df.head()


The data now contains only numerical features and a single categorical feature. Next, we one-hot encode this feature.

In [None]:
# One-Hot Encoding for 'primary_genre'
new_cleaned_data_encoded_df = pd.get_dummies(new_cleaned_data_df, columns=['primary_genre'], drop_first=True)

# Defining the features (X) and the target (y) again
X = new_cleaned_data_encoded_df.drop(columns=['revenue'])
y = new_cleaned_data_encoded_df['revenue']

# Splitting the dataset into training and testing sets again
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Building a Linear Regression Model again
linear_reg_model = LinearRegression()
linear_reg_model.fit(X_train, y_train)

# Predicting the target for the testing set
y_pred = linear_reg_model.predict(X_test)

# Calculating the Mean Absolute Error (MAE) again
mae = mean_absolute_error(y_test, y_pred)

mae


The Mean Absolute Error (MAE) of the Linear Regression model on the testing set is approximately 43,119,430 (around 43.1 million).

This value is lower than the baseline MAE of ~50 million, which is a positive indication. It suggests that simplifying the model and focusing on the most relevant features might help improve the model's performance.

Now, let's try to drop even the genre column.

In [None]:
# Excluding 'primary_genre' related columns (after one-hot encoding)
cols_to_drop = [col for col in new_cleaned_data_encoded_df.columns if 'primary_genre' in col]
new_cleaned_data_encoded_df = new_cleaned_data_encoded_df.drop(columns=cols_to_drop)

# Defining the features (X) and the target (y) again
X = new_cleaned_data_encoded_df.drop(columns=['revenue'])
y = new_cleaned_data_encoded_df['revenue']

# Splitting the dataset into training and testing sets again
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Building a Linear Regression Model again
linear_reg_model = LinearRegression()
linear_reg_model.fit(X_train, y_train)

# Predicting the target for the testing set
y_pred = linear_reg_model.predict(X_test)

# Calculating the Mean Absolute Error (MAE) again
mae = mean_absolute_error(y_test, y_pred)

mae


The Mean Absolute Error (MAE) of the Linear Regression model on the testing set, after excluding the `primary_genre`, is approximately 42,529,754 (around 42.5 million).

This is a further reduction in the MAE, suggesting that excluding the genre information has led to a slight improvement in the model's prediction accuracy.

This was expected: there are many (>20) different primary genres, most with fewer than 300 movies, so this feature caused more problems (overfitting) than gains. The genre could still be useful with more feature engineering, for example by merging the rare genres into an "Other" category. Splitting drama, comedy, and action into finer categories (such as action-war or drama-historical) could also be a good idea, because these three genres contain a large share of the movies.
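
Merging rare genres into a single category could be sketched as follows. The toy series and the `min_count` threshold are hypothetical; a suitable threshold for the real data would need to be chosen by experiment.

```python
import pandas as pd

# Toy genre column with hypothetical counts (illustrative only).
genres = pd.Series(['Drama'] * 5 + ['Comedy'] * 4 + ['Western'] + ['Music'])

# Hypothetical threshold: genres with fewer rows than this are merged.
min_count = 3

counts = genres.value_counts()
rare_genres = counts[counts < min_count].index

# Keep frequent genres as they are and replace rare ones with 'Other'.
grouped_genres = genres.where(~genres.isin(rare_genres), 'Other')
print(grouped_genres.value_counts())
```

One-hot encoding the grouped column then produces far fewer sparse columns than encoding every genre separately.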

The next refinement step is handling outliers.

## 4.2. Handling Outliers

We can begin by examining and handling outliers in the numeric features (`budget`, `runtime`, and `revenue`). Outliers can significantly impact the performance of a linear regression model.

In [None]:
# Plotting boxplots to visualize the distribution and identify outliers for 'budget', 'runtime', and 'revenue'
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))

# Boxplot for 'budget'
axes[0].boxplot(new_cleaned_data_encoded_df['budget'])
axes[0].set_title('Budget')

# Boxplot for 'runtime'
axes[1].boxplot(new_cleaned_data_encoded_df['runtime'].dropna())  # Dropping NaN values for plotting
axes[1].set_title('Runtime')

# Boxplot for 'revenue'
axes[2].boxplot(new_cleaned_data_encoded_df['revenue'])
axes[2].set_title('Revenue')

plt.show()


From the boxplots, we can observe that there are outliers in the `budget`, `runtime`, and `revenue` columns. These outliers can have a significant impact on the performance of the model.

For handling outliers, we can use the Interquartile Range (IQR) method, where we can define bounds, and values outside these bounds can be considered as outliers.

In [None]:
# Defining a function to remove outliers using the IQR method
def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    return df

# Removing outliers from 'budget', 'runtime', and 'revenue'
new_cleaned_data_encoded_df = remove_outliers_iqr(new_cleaned_data_encoded_df, 'budget')
new_cleaned_data_encoded_df = remove_outliers_iqr(new_cleaned_data_encoded_df, 'runtime')
new_cleaned_data_encoded_df = remove_outliers_iqr(new_cleaned_data_encoded_df, 'revenue')

# Checking the shape of the DataFrame after removing outliers
new_cleaned_data_encoded_df.shape


After removing the outliers, the dataset now contains 6,470 entries, down from the original size. This reduction helps in eliminating extreme values that can adversely affect the model's performance.

Now we proceed to retrain the Linear Regression model with this cleaned data to check if there is an improvement in the Mean Absolute Error (MAE).

In [None]:
# Defining the features (X) and the target (y) again after removing outliers
X = new_cleaned_data_encoded_df.drop(columns=['revenue'])
y = new_cleaned_data_encoded_df['revenue']

# Splitting the dataset into training and testing sets again
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Building a Linear Regression Model again
linear_reg_model = LinearRegression()
linear_reg_model.fit(X_train, y_train)

# Predicting the target for the testing set
y_pred = linear_reg_model.predict(X_test)

# Calculating the Mean Absolute Error (MAE) again
mae = mean_absolute_error(y_test, y_pred)

mae


The Mean Absolute Error (MAE) of the Linear Regression model on the testing set, after handling outliers, is approximately 9,173,138 (around 9.2 million). 

This is a substantial improvement compared to the previous MAE of around 42.5 million. It demonstrates the significant impact of handling outliers on improving the model's performance.

I also tried more advanced approaches, such as other algorithms (random forest and gradient boosting) and polynomial features, but the gains were minimal.
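
In the cells above, outliers are removed before the train/test split, so the test set no longer contains high-revenue movies and part of the MAE drop reflects an easier test set. One way to check how much the model itself improves is to filter outliers in the training split only and evaluate on an untouched test split. The sketch below shows this on synthetic data; the data, the `iqr_mask` helper, and the single `budget` feature are assumptions made for illustration, and the two IQR masks are computed independently rather than sequentially as in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the notebook's DataFrame (illustrative only).
rng = np.random.default_rng(0)
budget = rng.lognormal(mean=16, sigma=1.0, size=2000)
revenue = budget * rng.lognormal(mean=0.5, sigma=0.8, size=2000)
df = pd.DataFrame({'budget': budget, 'revenue': revenue})

def iqr_mask(series):
    """Return True for values inside the 1.5 * IQR fences."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Split first, so the test set keeps every kind of movie.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

# Filter outliers in the training split only.
keep = iqr_mask(train_df['budget']) & iqr_mask(train_df['revenue'])
train_filtered = train_df[keep]

model = LinearRegression()
model.fit(train_filtered[['budget']], train_filtered['revenue'])

# Evaluate on the unfiltered test set.
test_mae = mean_absolute_error(test_df['revenue'], model.predict(test_df[['budget']]))
print(f"rows kept for training: {len(train_filtered)} of {len(train_df)}")
print(f"MAE on unfiltered test set: {test_mae:,.0f}")
```

An MAE measured this way is directly comparable with the earlier models, because every model is scored on the same kind of test data.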

## 4.3. Adding More Datasets (discussion)

A possible next step would be to add more data for the model to work with. The five datasets below could bring value; I did not use them because they are much smaller than The Movies Dataset (all are available on Kaggle).

In [None]:
data_movies = pd.read_csv('/kaggle/input/movies/movies.csv')
print(data_movies.head())
data_movies.info()
data_movies.describe()

Features that could be added: rating, director, star and writer.

In [None]:
data_celebrities = pd.read_csv('/kaggle/input/forbes-celebrity-100-since-2005/forbes_celebrity_100.csv')
print(data_celebrities.head())
data_celebrities.info()
data_celebrities.describe()

This could be combined with the star feature from the previous dataset: if the star is on that list, then they are a major Hollywood star.
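
A minimal sketch of such a flag is shown below. The toy frames, the person names, and the `Name` column label are hypothetical; the real column names would need to be checked in the two datasets.

```python
import pandas as pd

# Toy stand-ins for the two datasets; names and column labels are hypothetical.
movies = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'star': ['Jane Doe', 'John Roe', 'Alex Poe'],
})
celebrities = pd.DataFrame({'Name': ['Jane Doe', 'Alex Poe', 'Sam Moe']})

# Flag movies whose star appears in the celebrity list.
movies['has_major_star'] = movies['star'].isin(celebrities['Name']).astype(int)
print(movies)
```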

In [None]:
data_tmdb_5000 = pd.read_csv('/kaggle/input/tmdb-movie-metadata/tmdb_5000_movies.csv')
print(data_tmdb_5000.head())
data_tmdb_5000.info()
data_tmdb_5000.describe()

The keywords feature could be an easier-to-use alternative to long descriptions. Using these more complex features could lead to very interesting EDA and even better prediction models.

In [None]:
data_wiki = pd.read_csv('/kaggle/input/wikipedia-movie-plots/wiki_movie_plots_deduped.csv')
print(data_wiki.head())
data_wiki.info()
data_wiki.describe()

Following on from the previous comment, the Plot feature could lead to a more complex and interesting analysis.

In [None]:
data_highest_grossing = pd.read_csv('/kaggle/input/top-1000-highest-grossing-movies/Highest Holywood Grossing Movies.csv')
print(data_highest_grossing.head())
data_highest_grossing.info()
data_highest_grossing.describe()

An interesting feature engineering idea is to use revenue divided by budget as a success factor; a ratio of 2 is usually considered very good by Hollywood standards. Using this dataset, we could also create a "very successful" feature (or metric). This was done in a previous project of mine: https://www.kaggle.com/code/victorpaschoalini/movie-revenue-prediction-success-classification/notebook
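
The ratio described above could be computed as in the sketch below. The numbers and the `revenue_to_budget` and `is_success` column names are made up for illustration; the threshold of 2 comes from the rule of thumb mentioned above.

```python
import pandas as pd

# Made-up budgets and revenues in USD (illustrative only).
toy = pd.DataFrame({
    'budget':  [10_000_000, 50_000_000, 100_000_000],
    'revenue': [35_000_000, 60_000_000, 250_000_000],
})

# Revenue-to-budget ratio; budgets must be non-zero for this to be defined.
toy['revenue_to_budget'] = toy['revenue'] / toy['budget']

# Hypothetical success label using the threshold of 2.
toy['is_success'] = toy['revenue_to_budget'] >= 2
print(toy)
```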

# Conclusion

Budget is the most important feature when trying to estimate revenue; most budgets are below 100 MM USD.

Using a simple linear regression model and dropping the budget, runtime, and revenue outliers, we got an MAE of ~9.2 MM USD for the revenues.