<div style="text-align: center; background-color: #559cff; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  Final Project - Programming For Data Science @ FIT-HCMUS, VNU-HCM 📌
</div>

<div style="text-align: center; background-color: #b1d1ff; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  Stage 4.0 - Data modelling
</div>

**In this part, we use a Random Forest regression model to predict movie ratings from Main Genres, Motion Picture Rating, Runtime, Release Year, Number of Ratings, Budget, Gross in US & Canada, Gross worldwide, and Opening Weekend Gross in US & Canada.**

## Import

In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

## Read data

In [3]:
pd.set_option('display.max_columns', None)
movie_df = pd.read_csv('./../data/processed/IMDbMovies_processed.csv')
movie_df.head()

Unnamed: 0,Title,Summary,Director,Writer,Main Genres,Motion Picture Rating,Runtime (Minutes),Release Year,Rating (Out of 10),Number of Ratings (in thousands),Budget (in milions),Gross in US & Canada (in milions),Gross worldwide (in milions),Opening Weekend Gross in US & Canada (in milions)
0,Napoleon,An epic that details the checkered rise and fa...,Ridley Scott,David Scarpa,"Action,Adventure,Biography",R,158.0,2023.0,6.7,38.0,67.8,37.514498,84.968381,20.638887
1,The Hunger Games: The Ballad of Songbirds & Sn...,Coriolanus Snow mentors and develops feelings ...,Francis Lawrence,"Michael Lesslie,Michael Arndt,Suzanne Collins","Action,Adventure,Drama",PG-13,157.0,2023.0,7.2,37.0,100.0,105.043414,191.729235,44.607143
2,The Killer,"After a fateful near-miss, an assassin battles...",David Fincher,"Andrew Kevin Walker,Luc Jacamon,Alexis Nolent","Action,Adventure,Crime",R,118.0,2023.0,6.8,117.0,67.8,46.8,0.421332,12.5
3,Leo,A 74-year-old lizard named Leo and his turtle ...,"David Wachtenheim,Robert Smigel,Robert Marianetti","Paul Sado,Robert Smigel,Adam Sandler","Animation,Comedy,Family",PG,102.0,2023.0,7.0,10.0,67.8,46.8,87.1,12.5
4,Thanksgiving,"After a Black Friday riot ends in tragedy, a m...",Eli Roth,"Eli Roth,Jeff Rendell","Horror,Mystery,Thriller",R,106.0,2023.0,7.0,9.1,67.8,25.408677,29.666585,10.306272


## Feature engineering

In [4]:
df = movie_df[['Main Genres', 'Motion Picture Rating', 'Runtime (Minutes)', 
               'Release Year', 'Rating (Out of 10)', 'Number of Ratings (in thousands)',
               'Budget (in milions)', 'Gross in US & Canada (in milions)', 'Gross worldwide (in milions)',
               'Opening Weekend Gross in US & Canada (in milions)']].dropna()

# Multi-hot encode 'Main Genres': each movie lists several comma-separated genres,
# so every genre becomes its own 0/1 indicator column prefixed with 'Genre_'
genre_dummy = df['Main Genres'].str.get_dummies(sep=',').add_prefix('Genre_')
df = pd.concat([genre_dummy, df], axis=1)

# Label-encode 'Motion Picture Rating': each class is mapped to an integer code
# (codes follow the sorted order of the class names, not any notion of severity)
label_encoder = LabelEncoder()
df['Motion Picture Rating (encoded)'] = label_encoder.fit_transform(df['Motion Picture Rating'])

df = df.drop(['Main Genres', 'Motion Picture Rating'], axis=1)
df

Unnamed: 0,Genre_Action,Genre_Adventure,Genre_Animation,Genre_Biography,Genre_Comedy,Genre_Crime,Genre_Documentary,Genre_Drama,Genre_Family,Genre_Fantasy,Genre_Film-Noir,Genre_History,Genre_Horror,Genre_Music,Genre_Musical,Genre_Mystery,Genre_Romance,Genre_Sci-Fi,Genre_Sport,Genre_Thriller,Genre_War,Genre_Western,Runtime (Minutes),Release Year,Rating (Out of 10),Number of Ratings (in thousands),Budget (in milions),Gross in US & Canada (in milions),Gross worldwide (in milions),Opening Weekend Gross in US & Canada (in milions),Motion Picture Rating (encoded)
0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,158.0,2023.0,6.7,38.0,67.8,37.514498,84.968381,20.638887,14
1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,157.0,2023.0,7.2,37.0,100.0,105.043414,191.729235,44.607143,12
2,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,118.0,2023.0,6.8,117.0,67.8,46.800000,0.421332,12.500000,14
3,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,102.0,2023.0,7.0,10.0,67.8,46.800000,87.100000,12.500000,11
4,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,106.0,2023.0,7.0,9.1,67.8,25.408677,29.666585,10.306272,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9078,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,95.0,2020.0,6.3,24.0,67.8,46.800000,87.100000,12.500000,11
9079,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,98.0,2003.0,6.4,15.0,6.4,0.767373,2.561820,0.050278,9
9080,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,152.0,1952.0,6.5,16.0,4.0,36.000000,36.000000,12.500000,13
9081,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,152.0,1952.0,6.5,125.2,67.8,46.800000,87.100000,12.500000,13
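
To make the two encodings above concrete, here is a minimal pure-Python sketch of what they compute. The toy rows and the names `genres`, `ratings`, `vocab`, and `codes` are hypothetical and not taken from the dataset. The label-encoding part mirrors how `LabelEncoder` assigns codes (by sorted class name), which is why the integer codes reflect alphabetical order rather than how restrictive a rating is.

```python
# Toy rows (hypothetical values, not taken from the dataset)
genres = ["Action,Adventure", "Comedy", "Action,Comedy"]
ratings = ["R", "PG-13", "R"]

# Multi-hot encoding: one 0/1 column per distinct genre, columns in sorted order
vocab = sorted({g for row in genres for g in row.split(",")})
multi_hot = [[int(g in row.split(",")) for g in vocab] for row in genres]

# Label encoding: each distinct class maps to its position in sorted order
classes = sorted(set(ratings))
codes = [classes.index(r) for r in ratings]

print(vocab)      # ['Action', 'Adventure', 'Comedy']
print(multi_hot)  # [[1, 1, 0], [0, 0, 1], [1, 0, 1]]
print(classes)    # ['PG-13', 'R']
print(codes)      # [1, 0, 1]
```

Tree-based models can still split on such integer codes, but the ordering they imply is arbitrary, so one-hot encoding the rating would be a reasonable alternative to try.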


## Data preparation

In [6]:
X = df.drop('Rating (Out of 10)', axis=1)  # Features
y = df['Rating (Out of 10)']  # Target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Training model & Prediction

In [7]:
model = RandomForestRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
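
Because `RandomForestRegressor()` is created without a `random_state`, the scores reported further down will vary slightly between runs. The sketch below shows one way to make a run repeatable and to inspect which features the forest relies on. It uses synthetic stand-in data, and the names `X_demo`, `y_demo`, and `demo_model` as well as the settings `random_state=42` and `cv=5` are assumptions for illustration, not the notebook's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (hypothetical): 500 rows, 4 numeric features,
# where the target depends mostly on feature 0 and somewhat on feature 1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fixing random_state makes the fitted forest repeatable across runs
demo_model = RandomForestRegressor(n_estimators=100, random_state=42)
demo_model.fit(X_tr, y_tr)

# 5-fold cross-validated R^2 on the training portion gives a less
# split-dependent estimate than a single train/test evaluation
cv_r2 = cross_val_score(demo_model, X_tr, y_tr, cv=5, scoring="r2")
print("CV R^2 per fold:", cv_r2)

# Impurity-based importances sum to 1; larger means the feature was used
# for more effective splits
importances = demo_model.feature_importances_
print("Feature importances:", importances)
```

Applied to the notebook's `X_train` and `y_train`, the same pattern would show which of the genre, budget, and gross columns contribute most to the predictions.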

## Evaluation
- We use two metrics to evaluate model performance:
    - Mean squared error (MSE): the average of the squared differences between predicted and actual ratings; lower is better.
    - R-squared (coefficient of determination): the proportion of variance in the dependent variable that is predictable from the independent variables; 1 is a perfect fit, and 0 matches always predicting the mean.
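
The two metrics can be written out by hand as a small sketch. The helper names `mse` and `r_squared` and the toy values are hypothetical; the definitions follow the standard formulas that `mean_squared_error` and `r2_score` implement.

```python
def mse(y_true, y_pred):
    """Average of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy ratings (hypothetical values)
y_true_demo = [6.0, 7.0, 8.0]
y_pred_demo = [6.5, 7.0, 7.5]

print(mse(y_true_demo, y_pred_demo))        # (0.25 + 0 + 0.25) / 3, about 0.167
print(r_squared(y_true_demo, y_pred_demo))  # 1 - 0.5 / 2 = 0.75

# Always predicting the mean rating gives R-squared of 0, the usual baseline
print(r_squared(y_true_demo, [7.0, 7.0, 7.0]))  # 0.0
```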

In [8]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean squared error (MSE): ", mse)
print("R-squared: ", r2)

Mean squared error (MSE):  0.42072669003021157
R-squared:  0.5741516810056819


## Results analysis

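Analysis results: *MSE* is about 0.42 (a root mean squared error of about 0.65 rating points) and *R-squared* is about 0.57, so the model explains a little over half of the variance in ratings. These are acceptable results, showing that the model predicts ratings reasonably well from the other features.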
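
As a rough aid to reading these numbers, the sketch below converts the reported MSE into an error on the rating scale and compares it with a mean-only baseline. It relies on the identity R-squared = 1 - MSE / Var(y_test), so the variance of the test ratings can be recovered from the two printed values. The variable names are hypothetical.

```python
import math

# Values printed by the evaluation cell
mse_reported = 0.42072669003021157
r2_reported = 0.5741516810056819

# Typical prediction error in rating points
rmse = math.sqrt(mse_reported)

# Variance of the test ratings implied by R^2 = 1 - MSE / variance
implied_variance = mse_reported / (1 - r2_reported)

# RMSE of a baseline that always predicts the mean test rating
baseline_rmse = math.sqrt(implied_variance)

print(round(rmse, 2))           # 0.65
print(round(baseline_rmse, 2))  # 0.99
```

Under this reading, the model reduces the typical error from roughly one rating point to roughly two thirds of a point.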