# Conclusions: Random Forest Model Applied to the IMDB Movie Dataset to Predict Average Rating

## Mean Absolute Error

We obtained an MAE of 0.26 on the validation set, which is reasonably accurate on the 0-10 scale of the IMDB rating.
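
An MAE figure is easier to judge next to a naive baseline, such as always predicting the median training rating. The sketch below is only an illustration: the ratings are made up (not taken from the dataset), and `mae`, `y_train_demo` and `y_valid_demo` are hypothetical names.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical ratings for illustration only; not taken from the dataset.
y_train_demo = np.array([6.0, 7.0, 8.0, 5.0, 9.0])
y_valid_demo = np.array([6.5, 7.5, 5.5, 8.5])

# Baseline: always predict the median rating seen in training.
baseline_pred = np.full_like(y_valid_demo, np.median(y_train_demo))
baseline_mae = mae(y_valid_demo, baseline_pred)
print(baseline_mae)
```

If the model's MAE is well below the baseline MAE computed this way on the real validation split, the features are contributing real predictive signal.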

## Feature Importance

1. **Worldwide_Gross**: the most important feature for predicting a movie's IMDB rating (makes sense).

2. **Metascore**: the second most important feature for predicting a movie's IMDB rating (makes sense).

3. **Budget**: the least important feature for predicting a movie's IMDB rating (a bit surprising; perhaps the best-rated movies did not always have the largest budgets).
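
The ranking above comes from the impurity-based `feature_importances_` attribute. One possible cross-check, not part of the original analysis, is scikit-learn's `permutation_importance`, which measures how much the score drops when a single column is shuffled. The sketch below uses synthetic data (an assumption made purely for illustration), so its numbers say nothing about the movie dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data for illustration only: the target depends mostly on column 0.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.1 * rng.normal(size=200)

# Hold out the last 50 rows so importance is measured on unseen data.
X_fit, X_held = X_demo[:150], X_demo[150:]
y_fit, y_held = y_demo[:150], y_demo[150:]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_fit, y_fit)

# Impurity-based importances (what feature_importances_ reports) sum to 1.
impurity_importance = model.feature_importances_

# Permutation importance: drop in score when one column is shuffled.
result = permutation_importance(
    model, X_held, y_held, n_repeats=5, random_state=0,
    scoring="neg_mean_absolute_error",
)

for name, imp, perm in zip(["f0", "f1", "f2"], impurity_importance,
                           result.importances_mean):
    print(f"{name}: impurity={imp:.3f}, permutation={perm:.3f}")
```

If both measures give the same ordering on the real data, the ranking is less likely to be an artifact of how impurity-based importance is computed.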

## Real Values vs Predicted Values

The scatterplot shows that for the lowest-rated movies our predictions were too "optimistic" (above the real rating). Moving right along the x-axis (real values), the predictions get closer to the reference line. Finally, for the highest ratings the model becomes "pessimistic" (predicting below the real rating), and accuracy drops sharply.
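
The "optimistic" and "pessimistic" regions can also be put into numbers by averaging the signed residual (predicted minus real) within bands of the real rating. This is a minimal sketch with made-up values; `mean_residual_by_bin` and the band edges 5.0 and 8.0 are hypothetical choices, not something taken from the notebook.

```python
import numpy as np

def mean_residual_by_bin(y_true, y_pred, edges):
    """Mean of (predicted - real) for each band of the real value.

    Band 0 is below edges[0], band 1 is between edges[0] and edges[1], etc.
    A positive mean indicates over-prediction, a negative one under-prediction.
    """
    y_true = np.asarray(y_true, dtype=float)
    residuals = np.asarray(y_pred, dtype=float) - y_true
    band_ids = np.digitize(y_true, edges)
    return {int(b): float(residuals[band_ids == b].mean())
            for b in np.unique(band_ids)}

# Hypothetical ratings for illustration only; not taken from the dataset.
real = [3.0, 4.0, 6.5, 7.0, 8.5, 9.0]
pred = [4.0, 4.5, 6.5, 7.0, 8.0, 8.0]

bias = mean_residual_by_bin(real, pred, edges=[5.0, 8.0])
print(bias)
```

Applied to `y_valid` and `movie_preds`, a positive value in the lowest band and a negative value in the highest band would confirm the pattern seen in the plot.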

## Error Distribution

Our graph shows that 80% of the validation predictions had an absolute error of 0.4 or less.
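
Reading a percentage off a cumulative curve depends on the histogram bin edges, so the share can also be computed directly from the absolute errors. The sketch below uses made-up values; `share_within` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np

def share_within(y_true, y_pred, threshold):
    """Fraction of predictions whose absolute error is <= threshold."""
    errors = np.abs(np.asarray(y_pred, dtype=float)
                    - np.asarray(y_true, dtype=float))
    return float(np.mean(errors <= threshold))

# Hypothetical ratings for illustration only; not taken from the dataset.
real = [7.0, 6.0, 8.0, 5.0, 9.0]
pred = [7.2, 6.3, 8.1, 5.9, 8.7]

share = share_within(real, pred, threshold=0.4)
print(f"{share:.0%} of predictions are within 0.4 of the real rating")
```

Calling `share_within(y_valid, movie_preds, 0.4)` on the real validation split would give the exact figure behind the statement above.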

## Final Comments

This model is simple, yet it achieved decent accuracy in predicting IMDB ratings from the features ['Metascore', 'Budget', 'Worldwide_Gross'].

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the CSV file into a DataFrame
df = pd.read_csv(r'C:\Users\czset\Downloads\archive\IMDB_Movies_Dataset.csv')
df = df.set_index(df.columns[0])
df.index.name = 'Index' 

# Remove spaces from columns names
df.columns = df.columns.str.replace(' ', '_')

# Remove rows containing NaNs
df = df.dropna()

# Display the DataFrame
df

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\czset\\Downloads\\archive\\IMDB_Movies_Dataset.csv'

In [None]:
df.columns

In [None]:
# Work on a copy so the cleaning steps below do not modify df
df_train = df.copy()

In [None]:
# Extract numeric values
import re

# Strip every non-digit character from a string value
def clean_convert(value):
    # Note: this also drops decimal points and any currency symbol or code,
    # so amounts in different currencies end up on the same numeric scale
    clean_value = re.sub(r'[^0-9]', '', value)
    return clean_value

# Apply function to 'Budget' column and convert values to numeric type
df_train['Budget'] = df_train['Budget'].apply(clean_convert)
df_train['Budget'] = pd.to_numeric(df_train['Budget'])

# Verify result
print(df_train['Budget'].head())



In [None]:
# Apply function to 'Worldwide_Gross' column and convert values to numeric type 
df_train['Worldwide_Gross'] = df_train['Worldwide_Gross'].apply(clean_convert)
df_train['Worldwide_Gross'] = pd.to_numeric(df_train['Worldwide_Gross'])

# Verify result
print(df_train['Worldwide_Gross'].head())

In [None]:
# Choose target and features
y = df_train.Average_Rating
features = ['Metascore','Budget','Worldwide_Gross']
X = df_train[features]


In [None]:
# Divide data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0
)

In [None]:
X_train

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Apply random forests model
forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(X_train, y_train)

movie_preds = forest_model.predict(X_valid)
print(mean_absolute_error(y_valid, movie_preds))

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Measuring Features importance
# Defining importances.
importances = forest_model.feature_importances_

# Creating a dataframe with importances
feature_importance_df = pd.DataFrame({'feature': features, 'importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='importance', ascending=False)

# Plotting importances
plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_importance_df)
plt.title('Feature Importance')
plt.show()


In [None]:
# Scatterplot of predicted values vs validation values
plt.figure(figsize=(10, 6))
plt.scatter(y_valid, movie_preds, alpha=0.3)
plt.plot([y_valid.min(), y_valid.max()], [y_valid.min(), y_valid.max()], 'r--')  # Reference line
plt.xlabel('Validation values')
plt.ylabel('Predicted Values')
plt.title('Real values vs. Predicted Values')
plt.show()

In [None]:
# Error distribution
y_pred = movie_preds.copy()
y_test = y_valid.copy()
errors = abs(y_pred - y_test)

plt.figure(figsize=(10, 6))
plt.hist(errors, bins=25, edgecolor='k', alpha=0.7)
plt.xlabel('Absolute Error')
plt.ylabel('Frequency')
plt.title('Error Distribution')
plt.show()


In [None]:
# Error distribution

import numpy as np
import matplotlib.pyplot as plt

y_pred = movie_preds.copy()
y_test = y_valid.copy()

# Calculate the absolute errors
errors = abs(y_pred - y_test)

# Set up the plot
fig, ax1 = plt.subplots(figsize=(10, 6))

# Plot the histogram of errors
counts, bins, patches = ax1.hist(errors, bins=25, edgecolor='k', alpha=0.7)
ax1.set_xlabel('Absolute Error')
ax1.set_ylabel('Frequency')
ax1.set_title('Error Distribution')

# Set up the secondary y-axis
ax2 = ax1.twinx()

# Calculate the cumulative percentage of frequency
cum_counts = np.cumsum(counts)
cum_perc = cum_counts / cum_counts[-1] * 100

# Plot the red line for cumulative percentage at each bin's right edge,
# since cum_perc[i] counts all errors up to bins[i + 1]
ax2.plot(bins[1:], cum_perc, 'r-', linewidth=2, label='Cumulative Percentage')
ax2.set_ylabel('Cumulative Percentage (%)')
ax2.legend(loc='upper left')

# Display the plot
plt.show()
