# Can We Predict a Movie's Popularity (Votes) Based on Its Features Like Genre, Year, and Runtime?

In this phase, our goal is to answer the question:  
**"Can we predict a movie's popularity (votes) based on its features like genre, year, and runtime?"**

The target variable is the **number of votes** that the movie received. The features we will rely on are:
- **Year**: The year the movie was released.
- **Runtime**: The duration of the movie in minutes.
- **Genres**: The columns prefixed with `Genre_`, each representing a genre or genre combination of the movie (e.g., `Genre_Action, Adventure, Drama`).

The dataset is split into training and testing sets, with 80% of the rows used for training and 20% for testing.

The objective is to select the model with the best predictive performance on the test set, measured by MAE, RMSE, and R² Score.

## Step 1: Data Preparation and Feature Selection

In this step, we load the cleaned dataset, keep only the relevant features (release year, runtime, and the genre columns) together with the target, and drop any rows with missing values. We then split the data into training and testing sets for the modeling steps that follow.

In [42]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv('Primary_Processed_Data.csv')

# Define the target column and basic features
target_column = 'Votes'
basic_features = ['Year', 'Runtime']

# Automatically extract all genre columns (columns starting with 'Genre_')
genre_columns = [col for col in df.columns if col.startswith('Genre_')]

# Combine all selected features + target into a new dataframe
selected_columns = basic_features + genre_columns + [target_column]
df_selected = df[selected_columns].copy()

# Check for missing values
print(df_selected.isnull().sum())

# Drop rows with missing values (to avoid training issues)
df_selected.dropna(inplace=True)

# Split the data into features (X) and target (y)
X = df_selected.drop(target_column, axis=1)
y = df_selected[target_column]

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



Year                                                          0
Runtime                                                       0
Genre_Action, Adventure, Animation, Drama                     0
Genre_Action, Adventure, Animation, Drama, Family, Fantasy    0
Genre_Action, Adventure, Drama                                0
                                                             ..
Genre_War, Drama, Thriller, Mystery                           0
Genre_War, History, Thriller                                  0
Genre_War, History, Thriller, Drama                           0
Genre_Western                                                 0
Votes                                                         0
Length: 276, dtype: int64
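
The column selection and the 80/20 split can be checked on a small example. The sketch below uses a hypothetical toy frame (the `toy` data, the `Title` column, and the two genre columns are assumptions made for illustration, not part of the real dataset) to show that the `Genre_` prefix filter picks up only the genre columns and that `test_size=0.2` holds out one fifth of the rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame; the column names only mimic the real dataset's layout.
toy = pd.DataFrame({
    "Title": [f"Movie {i}" for i in range(10)],
    "Year": [2000 + i for i in range(10)],
    "Runtime": [90 + 5 * i for i in range(10)],
    "Genre_Action": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "Genre_Comedy, Drama": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "Votes": [0.1 * i for i in range(10)],
})

# Same prefix filter as in the cell above: keep columns starting with 'Genre_'.
genre_cols = [c for c in toy.columns if c.startswith("Genre_")]
features = ["Year", "Runtime"] + genre_cols

X_toy = toy[features]
y_toy = toy["Votes"]

# 80% of the rows go to training, 20% to testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(genre_cols)
print(X_tr.shape, X_te.shape)
```

With 10 rows, the split leaves 8 rows for training and 2 for testing, and the `Title` column is excluded because it is neither a basic feature nor a `Genre_` column.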


## Step 2: Building and Evaluating the Baseline Model

In this step, we build a simple baseline model using **Linear Regression**.  
This model will serve as a benchmark to compare more advanced models later.  
We train the model on the training set, make predictions on the test set, and evaluate performance using MAE, RMSE, and R² Score.
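
For reference, the three metrics can be written out as follows. This is the standard formulation, stated here as a reminder rather than taken from the notebook; the symbols are assumptions of this note: $y_i$ is the true vote value of test movie $i$, $\hat{y}_i$ is the model's prediction, $\bar{y}$ is the mean of the true test values, and $n$ is the number of test rows.

$$
\begin{aligned}
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \\
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\end{aligned}
$$

MAE and RMSE are in the same units as the target, and lower is better. R² equals 1 for perfect predictions, 0 for a model that always predicts $\bar{y}$, and becomes negative when the model's squared error exceeds that of the mean predictor.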


In [43]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Build the baseline model using Linear Regression
baseline_model = LinearRegression()
baseline_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_baseline = baseline_model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred_baseline)
rmse = np.sqrt(mean_squared_error(y_test, y_pred_baseline))
r2 = r2_score(y_test, y_pred_baseline)

# Print the evaluation metrics
print("Baseline Model - Linear Regression:")
print(f"MAE: {mae}")
print(f"RMSE: {rmse}")
print(f"R² Score: {r2}")



Baseline Model - Linear Regression:
MAE: 1311549886.845346
RMSE: 3290435998.4628363
R² Score: -2.640036316040506e+20


## Step 3: Building and Evaluating the Random Forest Regressor Model

In this step, we build the **Random Forest Regressor** model.  
Random Forest is an ensemble learning method that trains many decision trees on bootstrap samples of the data and averages their predictions, which usually gives a more accurate and more stable result than a single tree.  
We train the model, make predictions on the test set, and evaluate its performance using MAE, RMSE, and R² Score.


In [44]:
from sklearn.ensemble import RandomForestRegressor

# Build the Random Forest Regressor model
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model
mae_rf = mean_absolute_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

# Print the evaluation metrics
print("\nRandom Forest Regressor:")
print(f"MAE: {mae_rf}")
print(f"RMSE: {rmse_rf}")
print(f"R² Score: {r2_rf}")



Random Forest Regressor:
MAE: 0.12826966950409097
RMSE: 0.20482426726470995
R² Score: -0.0229768956421863
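
To make the ensemble idea behind Random Forest concrete, the following sketch checks that a fitted forest's prediction is the average of its individual trees' predictions. It runs on synthetic data from `make_regression`, not on the movie dataset; the names `X_demo`, `y_demo`, and `forest`, and the choice of 20 trees, are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption; unrelated to the movie dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Collect each tree's predictions: one row per tree, one column per sample.
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])

# Averaging over the trees should reproduce the forest's own prediction.
manual_average = per_tree.mean(axis=0)

print(per_tree.shape)
print(np.allclose(manual_average, forest.predict(X_demo)))
```

Averaging many trees reduces the variance of a single tree, but it cannot create signal that the features do not contain.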


## Step 4: Building and Evaluating the Gradient Boosting Regressor Model

In this step, we build the **Gradient Boosting Regressor** model.  
Gradient Boosting is an ensemble learning technique that builds the model sequentially: each new tree is fit to the errors left by the trees before it.  
We train the model, make predictions, and evaluate its performance using MAE, RMSE, and R² Score.


In [46]:
from sklearn.ensemble import GradientBoostingRegressor

# Build the Gradient Boosting Regressor model
gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_gb = gb_model.predict(X_test)

# Evaluate the model
mae_gb = mean_absolute_error(y_test, y_pred_gb)
rmse_gb = np.sqrt(mean_squared_error(y_test, y_pred_gb))
r2_gb = r2_score(y_test, y_pred_gb)

# Print the evaluation metrics
print("\nGradient Boosting Regressor:")
print(f"MAE: {mae_gb}")
print(f"RMSE: {rmse_gb}")
print(f"R² Score: {r2_gb}")



Gradient Boosting Regressor:
MAE: 0.13702244344263556
RMSE: 0.1940174246397845
R² Score: 0.08212302358981927
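
The sequential idea behind Gradient Boosting can be sketched by hand for squared-error loss: start from the mean of the target, then repeatedly fit a small tree to the current residuals and add a scaled-down version of its prediction. This is a simplified illustration on synthetic data, not scikit-learn's actual implementation; the learning rate, tree depth, and number of stages are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption; unrelated to the movie dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

learning_rate = 0.1  # illustrative value
n_stages = 50        # illustrative value

# Stage 0: predict the mean of the target for every sample.
prediction = np.full_like(y_demo, y_demo.mean())
train_mse = [np.mean((y_demo - prediction) ** 2)]

for stage in range(n_stages):
    # The residuals are the errors left by the ensemble built so far.
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, residuals)
    # Add a shrunken correction from the new tree.
    prediction = prediction + learning_rate * tree.predict(X_demo)
    train_mse.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE: {train_mse[0]:.1f} -> {train_mse[-1]:.1f}")
```

The training error shrinks at every stage because each tree corrects part of what the previous ones missed. Whether that carries over to the test set depends on how much signal the features hold.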




### **Model Results and Evaluation**

#### **1- Baseline Model: Linear Regression**  
- **R² Score:** -2.64e+20  
- **Reason for exclusion:**  
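  The baseline model performed extremely poorly. An **R² Score** below zero means the model predicts worse than simply predicting the mean of the target, and a value of -2.64e+20 together with an **MAE** of about 1.3e+9 (against roughly 0.13 for the other two models) points to numerically unstable predictions rather than a meaningful fit. A plausible cause, not verified here, is the large number of sparse genre-combination columns (273 of the 275 features).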
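
A small worked example shows how an R² Score can drop below zero. The four target values and the two sets of predictions are made up for illustration: always predicting the mean gives an R² of exactly 0, while predictions that are further from the truth than the mean give a negative score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up target values for illustration.
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Predictor 1: always predict the mean of the target (2.5).
mean_pred = np.full_like(y_true, y_true.mean())

# Predictor 2: predictions in reversed order, worse than the mean.
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])

r2_mean = r2_score(y_true, mean_pred)
r2_bad = r2_score(y_true, bad_pred)

# For bad_pred: squared errors sum to 9 + 1 + 1 + 9 = 20, and the total
# sum of squares around the mean is 5, so R² = 1 - 20 / 5 = -3.
print(r2_mean, r2_bad)
```

R² has no lower bound, so a model whose predictions are off by many orders of magnitude can produce an arbitrarily large negative value.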


#### **2- Random Forest Regressor**  
- **R² Score:** -0.023  
- **Reason for exclusion:**  
  Although the model produced the lowest **MAE** (0.128) of the three models, its **R² Score** was slightly negative, meaning that in squared-error terms it predicts marginally worse than always predicting the mean of the target. The model therefore explains none of the variance in the votes and is not reliable for prediction.



#### **3- Gradient Boosting Regressor**  
- **R² Score:** 0.082  
- **Reason for selection:**  
  This model achieved the lowest **RMSE** (0.194) and the only **positive R² Score** of the three models, so it explains some of the variance in the target variable, although only about 8%. Its **MAE** (0.137) is slightly higher than that of the Random Forest (0.128), so the improvement is in squared-error terms rather than across every metric.


### **Best Model: Gradient Boosting Regressor**  
The **Gradient Boosting Regressor** was selected as the best model because it was the only one to achieve a **positive R² Score** and it had the lowest **RMSE**. Its R² of 0.082 is still low, which suggests that year, runtime, and genre alone carry limited information about a movie's vote count.
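
The metrics printed in Steps 2 to 4 can be collected into one table to compare the models side by side. The sketch below copies the reported values by hand; the `results` frame and the `best_by_metric` dictionary are names introduced here for illustration.

```python
import pandas as pd

# Values copied from the outputs of Steps 2, 3, and 4.
results = pd.DataFrame(
    {
        "MAE": [1311549886.845346, 0.12826966950409097, 0.13702244344263556],
        "RMSE": [3290435998.4628363, 0.20482426726470995, 0.1940174246397845],
        "R2": [-2.640036316040506e+20, -0.0229768956421863, 0.08212302358981927],
    },
    index=["Linear Regression", "Random Forest", "Gradient Boosting"],
)

# Lower is better for MAE and RMSE; higher is better for R².
best_by_metric = {
    "MAE": results["MAE"].idxmin(),
    "RMSE": results["RMSE"].idxmin(),
    "R2": results["R2"].idxmax(),
}

print(results)
print(best_by_metric)
```

Laid out this way, the table shows that Gradient Boosting leads on RMSE and R², while Random Forest has the lowest MAE.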


