# Film development for PProductions: predicting IMDb ratings, pt. 2

## Author: Letícia Zorzi Rama

## Project goal

Perform an analysis on a film database to guide PProductions on which type of film should be developed next by predicting IMDb ratings.

#### Summary
1. Project setup, pt. 1 (PProductions_EDA.ipynb)
2. Exploratory Data Analysis - EDA (PProductions_EDA.ipynb)
3. Project setup, pt. 2 (this notebook)
4. Predicting from the data (this notebook)

## 3. Project setup, pt. 2

This part consists of:
1. Imports
2. Loading the `imdb_tmdb` dataset

In [1]:
# 1. Imports

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.decomposition import TruncatedSVD
import lightgbm as lgb
import joblib

In [2]:
# 2. Loading the dataset

df = pd.read_csv("imdb_tmdb.csv")

## 4. Predicting IMDb ratings from the data

#### Type of problem
This is a regression problem: predict a numeric and continuous target `imdb_rating` (float, usually between 0 and 10).

#### Variables used and/or their transformations

- Numeric variables (some used directly, others with simple transforms):
    - `meta_score`: strong signal of critics’ reception (keep numeric)
    - `runtime`: consider a quadratic term if the relationship is non-linear (keep numeric)
    - `no_of_votes`, `budget`, `revenue`: EDA shows strong positive correlation with economic measures (use the log of each). Reasoning: logs stabilize variance and reduce skew

- Categorical variables:
    - `main_genre`, `certificate`: one-hot encoding. EDA shows genre correlates with the economic measures
    - `director`, `star1..star4`: target encoding (smoothed mean of the target `imdb_rating`). This captures reputation without exploding dimensionality. Reasoning: many directors/actors have very few movies in the dataset, so smoothing prevents high-variance estimates

- Text features:
    - `overview`: TF-IDF (max_features=5–10k) then TruncatedSVD to keep first ~10 components (dense numeric summary). Reasoning: EDA showed TF-IDF + logistic regression couldn’t reliably infer genre (imbalance + generic overviews). TF-IDF still adds marginal signal to rating (tone, keywords). Dimensionality reduction prevents overfitting
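
The smoothed mean used for `director` and `star1..star4` can be written out explicitly. In the notation below (the symbols are introduced here for illustration), $n_c$ is the number of films in category $c$, $\bar{y}_c$ is their mean rating, $\bar{y}$ is the global mean rating, and $m$ is the smoothing weight (10 in this notebook's `target_encode`):

$$
\hat{y}_c = \frac{n_c \, \bar{y}_c + m \, \bar{y}}{n_c + m}
$$

As a worked example with a hypothetical global mean of 7.9: a director with a single film rated 9.0 gets $(1 \cdot 9.0 + 10 \cdot 7.9) / (1 + 10) = 88 / 11 = 8.0$, so one film barely moves the estimate away from the global mean.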

Note: the numeric variables were transformed in pt. 1 of the project (the EDA notebook). The other variables are transformed in step 4 of this notebook.
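
For reference, the log transform applied to the numeric variables might look like the sketch below. The toy values are made up, and the use of `np.log1p` is an assumption based on how the Shawshank example later in this notebook builds `log_votes`; the actual transform lives in the EDA notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; the real columns come from imdb_tmdb.csv
toy = pd.DataFrame({
    "no_of_votes": [25_000, 150_000, 2_343_110],
    "budget": [1e6, np.nan, 2.5e7],
})

# log1p(x) = log(1 + x): defined at 0, and NaN stays NaN for later imputation
toy["log_votes"] = np.log1p(toy["no_of_votes"])
toy["log_budget"] = np.log1p(toy["budget"])

print(toy)
```

Missing values are left untouched here on purpose, since the preprocessing pipeline imputes numeric columns with the median.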

#### Best model that approximates the data, its pros and cons

Gradient Boosted Trees - LightGBM

- Why:
    - Excellent at handling heterogeneous tabular data, non-linear interactions, missing values, and mixed categorical/numeric features.
    - Fast training and good default performance.

- Pros:
    - High predictive performance on tabular data.
    - Handles non-linearities and interactions automatically.
    - Works with target-encoded categorical features.
    - Fast inference (production-friendly).

- Cons:
    - Less interpretable than a simple linear model
    - Requires careful early stopping/hyperparameter tuning to avoid overfitting if the dataset is small

#### Explaining the chosen model performance measure

RMSE (Root Mean Squared Error)

- Why:
    - RMSE is the most common continuous error metric for rating prediction tasks
    - RMSE penalizes larger errors more than MAE (Mean Absolute Error), which is useful because being off by 1.0+ in IMDb rating matters more than being off by 0.1
    - The resulting CV RMSE ≈ 0.091 means the typical prediction error is under 0.1 rating points. For example, if a movie’s true IMDb rating is 8.7, the model usually predicts somewhere between 8.6 and 8.8. Because the target encodings are computed from the full dataset before cross-validation, this figure is likely optimistic.
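
To make the difference between RMSE and MAE concrete, here is a small sketch with made-up ratings. Both prediction sets have the same MAE, but the one containing a single large miss has a clearly higher RMSE. The helper names `rmse` and `mae` are defined here for illustration only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: squares the errors before averaging."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: every error counts in proportion to its size."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.abs(diff)))

# Made-up ratings for four films
y_true = [7.0, 7.0, 7.0, 7.0]
even_misses = [7.4, 7.4, 7.4, 7.4]    # every prediction is off by 0.4
one_big_miss = [7.1, 7.1, 7.1, 8.3]   # three small misses and one of 1.3

print("MAE :", mae(y_true, even_misses), mae(y_true, one_big_miss))
print("RMSE:", rmse(y_true, even_misses), rmse(y_true, one_big_miss))
```

Both sets have an MAE of about 0.4, while the RMSE rises from about 0.4 to about 0.66 for the set with the large miss.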

In [3]:
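# Transforming categorical variables 'director' and 'star1...star4': target encoding

def target_encode(train, col, target, m=10):
    """Smoothed target encoding: shrink each category's mean rating toward the global mean.

    m is the smoothing weight, i.e. how many "pseudo-films" at the global mean each category gets.
    """
    mean_global = train[target].mean()
    counts = train.groupby(col)[target].count()
    means = train.groupby(col)[target].mean()
    smooth = (means * counts + mean_global * m) / (counts + m)
    # Map each row's category to its smoothed mean (rows with a missing category stay NaN)
    return train[col].map(smooth)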

for col in ["director", "star1", "star2", "star3", "star4"]:
    df[f"{col}_enc"] = target_encode(df, col, "imdb_rating")

# star score: mean of 4 stars
df["star_score"] = df[["star1_enc", "star2_enc", "star3_enc", "star4_enc"]].mean(axis=1)
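
Encoding every row from statistics that include the row's own rating lets the target leak into the features. One possible way to reduce that is out-of-fold encoding, sketched below on toy data: each row is encoded using only rows from the other folds. The function names, the toy frame, and the choice of `KFold` are assumptions for illustration, not the method used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def smoothed_means(train, col, target, m=10):
    """Smoothed per-category mean of the target, plus the global mean as a fallback."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    smooth = (stats["mean"] * stats["count"] + global_mean * m) / (stats["count"] + m)
    return smooth, global_mean

def target_encode_oof(data, col, target, n_splits=5, m=10, seed=42):
    """Encode each row with statistics computed on the other folds only."""
    encoded = pd.Series(np.nan, index=data.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(data):
        smooth, global_mean = smoothed_means(data.iloc[fit_idx], col, target, m)
        # Categories unseen in the fitting folds fall back to the global mean
        values = data.iloc[enc_idx][col].map(smooth).fillna(global_mean)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded

# Hypothetical toy data
demo = pd.DataFrame({
    "director": list("AABBCCAB"),
    "imdb_rating": [8.0, 7.5, 7.0, 7.2, 8.5, 8.1, 7.8, 7.4],
})
demo["director_enc"] = target_encode_oof(demo, "director", "imdb_rating", n_splits=4)
print(demo)
```

With this scheme, changing a film's own rating does not change its own encoded value, which is the property the in-sample version lacks.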

In [4]:
# numeric, categorical, text columns
num_cols = ["meta_score", "runtime", "log_votes", "log_gross", "log_budget", "log_revenue",
            "director_enc", "star_score"]
cat_cols = ["main_genre", "certificate"]
text_col = "overview"

In [5]:
# Preprocessing pipeline

# numeric variables are imputed with SimpleImputer (median) and scaled with StandardScaler
# categorical variables are imputed with "Unknown" and encoded with OneHotEncoder
# the text feature is vectorized with TF-IDF and reduced with TruncatedSVD

text_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ("svd", TruncatedSVD(n_components=10, random_state=42)),
])

preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
    ("text", text_pipe, text_col),  # scalar column name, so TfidfVectorizer receives 1-D text
])
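
To see what the text branch of the preprocessor produces, here is a toy sketch. The four overviews are made up, and `n_components=2` is used only because the toy corpus is tiny. The TF-IDF matrix is sparse with one column per term or bigram; TruncatedSVD turns it into a small dense matrix.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Made-up overviews, for illustration only
overviews = [
    "Two imprisoned men bond over a number of years.",
    "A young wizard discovers his magical heritage.",
    "A detective hunts a serial killer in a rainy city.",
    "Two friends travel across the country in an old car.",
]

toy_text_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svd", TruncatedSVD(n_components=2, random_state=42)),
])

dense = toy_text_pipe.fit_transform(overviews)
print(dense.shape)  # (4, 2): one row per overview, one column per SVD component
```
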


In [6]:
# Model

model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05,
                          num_leaves=31, random_state=42, verbose=-1)

pipeline = Pipeline([("pre", preprocessor),
                     ("model", model)])

In [7]:
# Cross-validation

gkf = GroupKFold(n_splits=5)
y = df["imdb_rating"]
groups = df["director"]  # group by director so one director's films never span train and test folds

# cross_val_score forwards `groups` to gkf.split, and the pipeline is refit on each training fold
cv_scores = cross_val_score(pipeline, df, y,
                            groups=groups, cv=gkf,
                            scoring="neg_root_mean_squared_error")
print("CV RMSE:", -cv_scores.mean())  # scores are negated RMSE, so flip the sign



CV RMSE: 0.09147664372869124
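
The grouping behaviour of `GroupKFold` can be checked on a toy example. The director labels below are placeholders; the sketch only verifies that no group appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Placeholder director labels, one per film
directors = np.array(["A", "A", "B", "B", "C", "C", "D", "E"])
X = np.zeros((len(directors), 1))  # features are irrelevant for the split itself

overlaps = []
for fold, (train_idx, test_idx) in enumerate(GroupKFold(n_splits=3).split(X, groups=directors)):
    # Directors present in both the training and the test part of this fold
    overlap = set(directors[train_idx]) & set(directors[test_idx])
    overlaps.append(overlap)
    print(f"fold {fold}: test directors = {sorted(set(directors[test_idx]))}, overlap = {overlap}")
```

Every `overlap` set is empty, which is what keeps a director's other films from informing predictions for the held-out ones.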




In [8]:
# Train final model

pipeline.fit(df, y)

In [9]:
# Predict for Shawshank example

shawshank = {
'series_title': 'The Shawshank Redemption',
'released_year': 1994,
'certificate': 'A',
'runtime': 142.0,
'genre': 'Drama',
'overview': 'Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.',
'meta_score': 80.0,
'director': 'Frank Darabont',
'star1': 'Tim Robbins',
'star2': 'Morgan Freeman',
'star3': 'Bob Gunton',
'star4': 'William Sadler',
'no_of_votes': 2343110,
'gross': 28341469.0,
'main_genre': 'Drama',
'overview_len': 19,
'log_votes': np.log1p(2343110),
'budget': np.nan,
'revenue': np.nan,
'log_budget': np.nan,
'log_revenue': np.nan,
'log_gross': np.log1p(28341469.0)
}


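# Shawshank isn't in the dataset, so build a one-row frame with every variable the model uses.
# The '*_enc' columns and 'star_score' are missing from the dict: each encoding is filled with
# the mean of the corresponding encoded column in the dataset (a neutral fallback value).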
shaw_df = pd.DataFrame([shawshank])
for col in ["director", "star1", "star2", "star3", "star4"]:
    shaw_df[f"{col}_enc"] = target_encode(df, col, "imdb_rating").mean()
shaw_df["star_score"] = shaw_df[["star1_enc", "star2_enc", "star3_enc", "star4_enc"]].mean(axis=1)

In [10]:
# Predicting IMDb rating for Shawshank

pred = pipeline.predict(shaw_df)[0]
print("Predicted IMDb rating for Shawshank:", round(pred, 2))

Predicted IMDb rating for Shawshank: 8.18




In [11]:
# Saving the model
joblib.dump(pipeline, "model.pkl")

['model.pkl']

#### Predicted IMDb rating for Shawshank: 8.18

#### Testing the model with other titles

In [12]:
# Loading the model 
pipeline = joblib.load("model.pkl")

In [13]:
# Testing the model with 10 random titles from the dataset
# (these rows were used to fit the final model, so this is a sanity check, not a held-out test)
random_df = df.sample(10)
random_df_index = random_df.index.tolist()

for index in random_df_index:
    pred = pipeline.predict(random_df.loc[[index]])
    title = random_df.at[index, "series_title"]        
    actual = random_df.at[index, "imdb_rating"] 
  
    print(f"{title} - actual IMDb rating: [{actual}] - predicted IMDb rating: {pred}")

Scent of a Woman - actual IMDb rating: [8.0] - predicted IMDb rating: [7.9990877]
La La Land - actual IMDb rating: [8.0] - predicted IMDb rating: [7.99546147]
The Fugitive - actual IMDb rating: [7.8] - predicted IMDb rating: [7.80386946]
Casablanca - actual IMDb rating: [8.5] - predicted IMDb rating: [8.4965125]
Hachi: A Dog's Tale - actual IMDb rating: [8.1] - predicted IMDb rating: [8.09803197]
Coco - actual IMDb rating: [8.4] - predicted IMDb rating: [8.39198626]
The Last Picture Show - actual IMDb rating: [8.0] - predicted IMDb rating: [7.99956992]
Rushmore - actual IMDb rating: [7.7] - predicted IMDb rating: [7.70137451]
Gongdong gyeongbi guyeok JSA - actual IMDb rating: [7.8] - predicted IMDb rating: [7.80115722]
Cape Fear - actual IMDb rating: [7.7] - predicted IMDb rating: [7.69938666]


