# Film development for PProductions: predicting IMDb ratings - Report

## Author: Letícia Zorzi Rama

### 2. Exploratory data analysis (EDA) - Insights

#### 2.a. Recommended movie for someone unknown

The recommended movie for someone unknown is:

1. ***The Godfather***, released in 1972, 175.0 min., directed by Francis Ford Coppola.

Extra insight! The rest of the top 10 recommended movies for someone unknown:

2. *The Dark Knight*, released in 2008, 152.0 min., directed by Christopher Nolan.
3. *The Lord of the Rings: The Return of the King*, released in 2003, 201.0 min., directed by Peter Jackson.
4. *Pulp Fiction*, released in 1994, 154.0 min., directed by Quentin Tarantino.
5. *The Lord of the Rings: The Fellowship of the Ring*, released in 2001, 178.0 min., directed by Peter Jackson.
6. *Schindler's List*, released in 1993, 195.0 min., directed by Steven Spielberg.
7. *The Godfather: Part II*, released in 1974, 202.0 min., directed by Francis Ford Coppola.
8. *Forrest Gump*, released in 1994, 142.0 min., directed by Robert Zemeckis.
9. *Inception*, released in 2010, 148.0 min., directed by Christopher Nolan.
10. *The Lord of the Rings: The Two Towers*, released in 2002, 179.0 min., directed by Peter Jackson.

#### 2.b. Main factors related to a film's high grossing expectations

- `gross` and `no_of_votes` are strongly positively correlated (ρ ≈ 0.6): high-grossing movies reach more viewers and therefore collect more votes
- `gross` and `budget` are strongly positively correlated (ρ ≈ 0.6): a higher budget allows higher production value and marketing spend, which enables a high gross but doesn’t guarantee it
- `gross` and `revenue` are even more strongly positively correlated (ρ ≈ 0.8), since gross is a subset of revenue
  
- `main_genre` is a related factor: Horror, Action, Family, Biography, and Adventure tend to have the highest gross
- `certificate`: PG (Parental Guidance) movies tend to have the highest gross
- `director`: Gareth Edwards, Anthony Russo, Josh Cooley, Roger Allers, and Tim Miller are the top 5 directors making high-grossing movies

#### 2.c. Insights gained from the Overview column

With the current TF-IDF + logistic regression approach, the model is not able to reliably infer the film’s genre from the ‘overview’ column. Because ‘Drama’ is overrepresented in the dataset, the classifier defaults to predicting this genre, while underrepresented genres are almost never predicted. 

This indicates that the ***imbalance in the dataset*** and the ***generic nature of overviews*** limit the predictive power of this approach.

### 3. Predicting IMDb ratings from the data

#### Type of problem
This is a regression problem: the target `imdb_rating` is a continuous numeric value (float, on IMDb's 1 to 10 scale).

#### Variables used and/or their transformations

- Numeric variables (some used directly, others with simple transforms):
    - `meta_score`: strong signal of critics’ reception (keep numeric)
    - `runtime`: keep numeric; consider adding a quadratic term if the relationship is non-linear
    - `no_of_votes`, `budget`, `revenue`: EDA shows strong positive correlation with economic measures; use the log of each. Reasoning: logs stabilize variance and reduce skew
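
The numeric transforms above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper name `transform_numeric`, the choice of `log1p` (which tolerates zeros), and the example values are all assumptions made for the sake of the example.

```python
import math


def transform_numeric(row):
    """Build numeric features for one movie (hypothetical helper).

    `row` is assumed to be a dict holding the raw numeric columns.
    """
    return {
        # Used as-is
        "meta_score": row["meta_score"],
        "runtime": row["runtime"],
        # Optional quadratic term, in case rating vs. runtime is non-linear
        "runtime_sq": row["runtime"] ** 2,
        # log1p(x) = log(1 + x): compresses heavy right tails, defined at 0
        "log_no_of_votes": math.log1p(row["no_of_votes"]),
        "log_budget": math.log1p(row["budget"]),
        "log_revenue": math.log1p(row["revenue"]),
    }


# Made-up values, for illustration only (not taken from the dataset)
example = {
    "meta_score": 80.0,
    "runtime": 120.0,
    "no_of_votes": 1_000_000,
    "budget": 50_000_000,
    "revenue": 300_000_000,
}
features = transform_numeric(example)
```

After the transform, a six-fold difference between budget and revenue becomes a gap of under two units on the log scale, which is the skew reduction the reasoning above refers to.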

- Categorical variables:
    - `main_genre`, `certificate`: one-hot encoding. EDA shows genre correlates with economic measures
    - `director`, `star1..star4`: target encoding (smoothed mean of the target `imdb_rating`). This captures reputation without exploding dimensionality. Reasoning: many directors and actors have very few movies in the dataset, so smoothing prevents high-variance estimates
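
One common way to smooth a target encoding is to shrink each category's mean toward the global mean, weighted by how many rows the category has. The sketch below assumes that formula; the function name `smoothed_target_encoding`, the `smoothing` parameter, and the toy data are hypothetical and may differ from what the notebook does.

```python
from collections import defaultdict


def smoothed_target_encoding(categories, targets, smoothing=10.0):
    """Map each category to a smoothed mean of the target.

    encoded = (sum_of_targets + smoothing * global_mean) / (count + smoothing)

    Categories with few rows are pulled toward the global mean;
    categories with many rows stay close to their own mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    encoding = {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }
    return encoding, global_mean


# Toy data (invented): director "A" has three movies, "B" only one
directors = ["A", "A", "A", "B"]
ratings = [9.0, 9.0, 9.0, 7.0]
encoding, global_mean = smoothed_target_encoding(directors, ratings, smoothing=2.0)

# Unseen directors at prediction time can fall back to the global mean
encoded_unseen = encoding.get("C", global_mean)
```

Because the encoding is built from the target itself, it should be fitted on training rows only (ideally out-of-fold) so that validation scores are not inflated by leakage.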

- Text features:
    - `overview`: TF-IDF (`max_features` between 5,000 and 10,000), then TruncatedSVD keeping the first ~10 components (a dense numeric summary). Reasoning: EDA showed TF-IDF + logistic regression couldn’t reliably infer genre (imbalance + generic overviews), but TF-IDF can still add marginal signal for rating (tone, keywords). Dimensionality reduction helps prevent overfitting
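
A minimal sketch of the TF-IDF + TruncatedSVD step with scikit-learn is shown below. The overview strings are invented placeholders, `stop_words="english"` is an assumption, and `n_components=3` is used only because the toy corpus is tiny; the report's setting of roughly 10 components would apply to the full dataset.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Invented overview texts, standing in for the `overview` column
overviews = [
    "A banker is sentenced to life in prison and befriends a fellow inmate.",
    "An aging crime boss hands control of his empire to his reluctant son.",
    "A thief who steals secrets through dreams takes on one last job.",
    "A hobbit and his companions set out to destroy a powerful ring.",
    "A businessman saves his workers from persecution during the war.",
    "A slow-witted man witnesses decades of historical events.",
]

text_pipeline = make_pipeline(
    # Sparse bag-of-words weights, capped vocabulary size
    TfidfVectorizer(max_features=5000, stop_words="english"),
    # Dense low-dimensional summary of the sparse TF-IDF matrix
    TruncatedSVD(n_components=3, random_state=0),
)

# One row per overview, one column per SVD component
overview_features = text_pipeline.fit_transform(overviews)
```

The resulting dense columns can be concatenated with the numeric and encoded categorical features before training. As with target encoding, the pipeline should be fitted on training data and only applied (`transform`) to validation or test data.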

#### Best model that approximates the data, its pros and cons

Gradient Boosted Trees - LightGBM

- Why:
    - Excellent at handling heterogeneous tabular data, non-linear interactions, missing values, and mixed categorical/numeric features.
    - Fast training and good default performance.

- Pros:
    - High predictive performance on tabular data.
    - Handles non-linearities and interactions automatically.
    - Works with target-encoded categorical features.
    - Fast inference (production-friendly).

- Cons:
    - Less interpretable than a simple linear model
    - Requires careful early stopping and hyperparameter tuning to avoid overfitting if the dataset is small

#### Explaining the chosen model performance measure

RMSE (Root Mean Squared Error)

- Why:
    - RMSE is the most common continuous error metric for rating prediction tasks
    - RMSE penalizes larger errors more than MAE (Mean Absolute Error) does, which is useful because being off by 1.0+ in IMDb rating matters more than being off by 0.1
    - The resulting RMSE ≈ 0.091 means the model’s typical prediction error is under 0.1 rating points. In other words, if a movie’s true IMDb rating is 8.7, the model usually predicts somewhere between 8.6 and 8.8.
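
To illustrate why RMSE is more sensitive to large errors than MAE, the sketch below compares two invented sets of predictions with the same MAE. The function names and numbers are illustrative only and are not results from the model.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [8.0, 8.0, 8.0, 8.0]
even_errors = [8.5, 7.5, 8.5, 7.5]    # every prediction off by 0.5
one_big_error = [8.0, 8.0, 8.0, 6.0]  # three exact, one off by 2.0

mae_even, rmse_even = mae(y_true, even_errors), rmse(y_true, even_errors)
mae_big, rmse_big = mae(y_true, one_big_error), rmse(y_true, one_big_error)
```

Both prediction sets have an MAE of 0.5, but the RMSE is 0.5 for the evenly spread errors and 1.0 when the same total error is concentrated in a single large miss.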

### 4. Predicted IMDb rating for 'The Shawshank Redemption': 8.18

Note: the `PProductions_Prediction.ipynb` notebook loads the model and predicts IMDb ratings for other titles.