# Movie Rating Prediction
**Goal:** Predict IMDb movie ratings from movie metadata and compare several ML regression models.

# Introduction

IMDb ratings reflect collective audience perception and are influenced by multiple factors such as genre, director, cast popularity, budget, and user engagement (votes). Predicting movie ratings is a classic supervised regression problem that showcases:

- Data cleaning & feature engineering <br>
- Exploratory Data Analysis (EDA)<br>
- Regression modeling & evaluation <br>
- Model interpretation

# Problem Statement

Given movie metadata, predict the IMDb rating (continuous value between 1 and 10).

# Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load Dataset

In [None]:
cols=["Name","Year","Duration","Genre","Rating","Votes","Director","Actor 1","Actor 2","Actor 3"]
df = pd.read_csv("/kaggle/input/imdb-india-movies/IMDb Movies India.csv", usecols=cols, encoding="latin1")
df.head()

# Dataset Description

| Column Name | Type | Description |
|------------|------|-------------|
| Name | Categorical | Movie title |
| Year | Numerical | Year of release (raw values are text wrapped in parentheses) |
| Duration | Numerical | Length of the movie in minutes (raw values carry a ` min` suffix) |
| Genre | Categorical | Genre(s) of the movie |
| Rating | Numerical (Target) | IMDb user rating (1–10) |
| Votes | Numerical | Number of IMDb users who voted (raw values use comma thousands separators) |
| Director | Categorical | Director of the movie |
| Actor 1 | Categorical | Main actor of the movie |
| Actor 2 | Categorical | Second main actor of the movie |
| Actor 3 | Categorical | Third main actor of the movie |

In [None]:
df.info()
df.describe()

# Data Cleaning

## 1. Handle Missing Values

In [None]:
# Remove '(' and ')' and convert to numeric
df['Year'] = df['Year'].astype(str)
df['Year'] = df['Year'].str.replace('[()]', '', regex=True)
df['Year'] = pd.to_numeric(df['Year'], errors='coerce')
# Handle numerical missing values
df['Year'] = df['Year'].fillna(df['Year'].median())

In [None]:
df['Duration'] = df['Duration'].astype(str)
df['Duration'] = df['Duration'].str.replace(' min', '', regex=False)
df['Duration'] = pd.to_numeric(df['Duration'], errors='coerce')
df['Duration'] = df['Duration'].fillna(df['Duration'].median())


In [None]:
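# Remove thousands separators so that values with commas parse as numbers
# (replacing ',' with a space would make them unparseable and coerce them to NaN)
df['Votes'] = df['Votes'].astype(str)
df['Votes'] = df['Votes'].str.replace(',', '', regex=False)
df['Votes'] = pd.to_numeric(df['Votes'], errors='coerce')
df['Votes'] = df['Votes'].fillna(df['Votes'].median())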

In [None]:
# Drop rows where target is missing
df = df.dropna(subset=['Rating'])
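
The cleaning steps above can be checked on a few hand-written rows. This is a minimal sketch, assuming the raw columns look like `(2019)`, `109 min`, and `1,086`, which is what the string replacements imply. The `raw` frame and the `clean_numeric` helper are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical rows mimicking the raw string formats handled above
raw = pd.DataFrame({
    "Year": ["(2019)", "(1997)", None],
    "Duration": ["109 min", None, "90 min"],
    "Votes": ["1,086", "8", None],
})

def clean_numeric(series, pattern):
    """Strip text matching the regex `pattern`, convert to numbers, fill gaps with the median."""
    stripped = series.astype(str).str.replace(pattern, "", regex=True)
    numeric = pd.to_numeric(stripped, errors="coerce")  # unparseable values become NaN
    return numeric.fillna(numeric.median())

cleaned = pd.DataFrame({
    "Year": clean_numeric(raw["Year"], r"[()]"),
    "Duration": clean_numeric(raw["Duration"], r"\s*min"),
    "Votes": clean_numeric(raw["Votes"], r","),
})
print(cleaned)
```

Printing the cleaned frame next to the raw one makes it easy to spot a replacement that silently turns valid values into NaN.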

## 2. Feature Selection

In [None]:
# Define features and target
features = ['Year', 'Duration', 'Votes', 'Genre', 'Director', 'Actor 1', 'Actor 2', 'Actor 3']
target = 'Rating'

X = df[features]
y = df[target]

# Exploratory Data Analysis

## IMDb Rating Distribution

In [None]:
sns.histplot(y, bins=20, kde=True)
plt.title('Distribution of IMDb Ratings')
plt.show()

## Correlation Analysis

In [None]:
numeric_features = ['Year', 'Duration', 'Votes']
sns.heatmap(df[numeric_features + [target]].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation of Numeric Features with Rating')
plt.show()

## EDA Insights

- Votes has a strong positive correlation with Rating <br>
- Year and Duration show weak correlation <br>
- Ratings follow a near-normal distribution
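
Pearson correlation, which `df.corr()` computes by default, measures only linear association, so a heavy-tailed column such as a vote count can understate a monotonic relationship. The sketch below uses made-up numbers (not taken from the dataset) to show how a log transform can change the coefficient. Whether this applies to the real Votes column is an assumption that would need checking.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only: votes grow tenfold while rating rises by one point
toy = pd.DataFrame({
    "Votes": [10, 100, 1_000, 10_000, 100_000],
    "Rating": [5.0, 6.0, 7.0, 8.0, 9.0],
})

# Pearson correlation on the raw vote counts
pearson_raw = toy["Votes"].corr(toy["Rating"])

# Pearson correlation after a log10 transform of the vote counts
pearson_log = np.log10(toy["Votes"]).corr(toy["Rating"])

print(f"raw: {pearson_raw:.2f}, log10: {pearson_log:.2f}")
```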

# Feature Engineering & Preprocessing

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
num_features = ['Year', 'Duration', 'Votes']
cat_features = ['Genre', 'Director', 'Actor 1', 'Actor 2', 'Actor 3']

preprocessor = ColumnTransformer([('num', StandardScaler(), num_features),
                                ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features)])

# Train - Test Split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model Training

## Linear Regression

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
lin_model = Pipeline([('prep', preprocessor),('model', LinearRegression())])
lin_model.fit(X_train, y_train)

## Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf_model = Pipeline([
    ('prep', preprocessor),
    ('model', RandomForestRegressor(n_estimators=200, max_depth=15, random_state=42))
])
rf_model.fit(X_train, y_train)
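
Because preprocessing and model live in one `Pipeline`, a fitted pipeline can score a new movie given as raw columns. The sketch below rebuilds a smaller version of the pipeline on invented rows so that it runs on its own. All `toy_*` names, the reduced column set, and the values are assumptions for illustration, not part of the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows; names and numbers are invented for illustration
toy_movies = pd.DataFrame({
    "Year": [2001, 2010, 2015, 2019],
    "Duration": [120, 95, 140, 110],
    "Votes": [500, 40, 12000, 900],
    "Genre": ["Drama", "Comedy", "Drama", "Action"],
    "Director": ["Director A", "Director B", "Director A", "Director C"],
})
toy_ratings = pd.Series([7.1, 5.4, 8.0, 6.2])

toy_prep = ColumnTransformer([
    ("num", StandardScaler(), ["Year", "Duration", "Votes"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Genre", "Director"]),
])
toy_pipe = Pipeline([
    ("prep", toy_prep),
    ("model", RandomForestRegressor(n_estimators=20, random_state=42)),
])
toy_pipe.fit(toy_movies, toy_ratings)

# Score one new movie passed as a one-row DataFrame with the same column names
new_movie = pd.DataFrame([{
    "Year": 2021, "Duration": 100, "Votes": 300,
    "Genre": "Drama", "Director": "Director Z",
}])
toy_prediction = toy_pipe.predict(new_movie)
print(toy_prediction)
```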

# Model Evaluation

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(model, X_test, y_test):
    preds = model.predict(X_test)
    return {
        'MAE': mean_absolute_error(y_test, preds),
        'RMSE': np.sqrt(mean_squared_error(y_test, preds)),
        'R2': r2_score(y_test, preds)
    }

lin_results = evaluate(lin_model, X_test, y_test)
rf_results = evaluate(rf_model, X_test, y_test)

lin_results, rf_results

## Model Evaluation Results

| Model              | MAE  | RMSE | R²   |
|-------------------|------|------|------|
| Linear Regression | 1.93 | 2.77 | -3.14 |
| Random Forest     | 0.85 | 1.12 | 0.32  |


## Inference of Each Metric

| Metric   | Meaning                                                                             | Interpretation                                                                                               |
| -------- | ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
| **MAE**  | Mean Absolute Error — average absolute difference between predicted and true rating | Linear: 1.93 → on average predicts ~2 points off. Random Forest: 0.85 → predicts <1 point off.               |
| **RMSE** | Root Mean Squared Error — penalizes larger errors more                              | Linear: 2.77 → large errors occur often. RF: 1.12 → more consistent predictions.                             |
| **R²**   | Coefficient of determination — fraction of variance explained                       | Linear: -3.14 → model is worse than predicting mean. RF: 0.32 → model explains ~32% of variation in ratings. |
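
The definitions in the table can be written out directly with NumPy. The sketch below uses three made-up ratings and a hypothetical helper `regression_metrics` to show why R² is 0 for a model that always predicts the mean and turns negative once predictions are worse than that baseline.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R² directly from their definitions."""
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares around the mean
    return {"MAE": mae, "RMSE": rmse, "R2": 1 - ss_res / ss_tot}

# Made-up ratings for illustration
toy_true = np.array([6.0, 7.0, 8.0])

# Baseline: always predict the mean rating
mean_baseline = regression_metrics(toy_true, np.full(3, toy_true.mean()))

# Predictions that miss by more than the baseline does
bad_model = regression_metrics(toy_true, np.array([9.0, 4.0, 10.0]))

print(mean_baseline)
print(bad_model)
```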


## Key Takeaways

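A negative R² means the linear regression model does worse on the test set than a baseline that always predicts the mean rating. A likely cause, not verified here, is overfitting: one-hot encoding Director and the three Actor columns produces a very large number of sparse columns, and an unregularized linear model can fit the training data almost exactly while generalizing poorly.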
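
One plausible reason for a poor test R², not verified on this dataset, is that an unregularized linear model with more features than training rows can fit the training data exactly. The sketch below reproduces that effect on synthetic data only; the sizes, the seed, and the `demo_*` names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic data (not the IMDb dataset): 30 training rows, 100 features,
# and a target that depends only on the first feature plus noise
n_fit, n_holdout, n_features = 30, 200, 100
demo_X_fit = rng.normal(size=(n_fit, n_features))
demo_X_holdout = rng.normal(size=(n_holdout, n_features))
demo_y_fit = demo_X_fit[:, 0] + rng.normal(scale=0.5, size=n_fit)
demo_y_holdout = demo_X_holdout[:, 0] + rng.normal(scale=0.5, size=n_holdout)

demo_model = LinearRegression().fit(demo_X_fit, demo_y_fit)

# With more features than rows the training fit is essentially perfect,
# while the score on unseen rows is far lower
demo_train_r2 = r2_score(demo_y_fit, demo_model.predict(demo_X_fit))
demo_holdout_r2 = r2_score(demo_y_holdout, demo_model.predict(demo_X_holdout))
print(f"train R2: {demo_train_r2:.3f}, holdout R2: {demo_holdout_r2:.3f}")
```

Comparing training and test scores for `lin_model` in the same way would show whether this explanation holds for the notebook's model.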

## Inference

- Random Forest is clearly better <br>
     - Can capture non-linear relationships <br>
     - Handles high-cardinality categorical features like Director/Actors <br>

- Linear Regression fails <br>
     - It receives the same one-hot encoded features, but without regularization it likely overfits the many sparse Director/Actor columns <br>
     - Cannot model interactions between votes, genre, and actors <br>

- MAE ~ 0.85 <br>
     - Predictions are off by less than 1 rating point on average <br>
     - Acceptable for IMDb rating predictions (ratings are subjective anyway)

# SVR, KNeighbors Regressor, Gradient Boosting Regressor

In [None]:
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor


# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=200, max_depth=15, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=200, max_depth=5, learning_rate=0.1, random_state=42),
    'SVR (RBF kernel)': SVR(kernel='rbf', C=10, epsilon=0.2),
    'KNN Regressor': KNeighborsRegressor(n_neighbors=5)
}

# Evaluate all models
results = []

for name, model in models.items():
    pipe = Pipeline([
        ('prep', preprocessor),
        ('model', model)
    ])
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)
    
    results.append({
        'Model': name,
        'MAE': mean_absolute_error(y_test, preds),
        'RMSE': np.sqrt(mean_squared_error(y_test, preds)),
        'R2': r2_score(y_test, preds)
    })

# Convert to DataFrame for display
results_df = pd.DataFrame(results)
results_df.sort_values('R2', ascending=False, inplace=True)
results_df


## Inferences

- Gradient Boosting slightly outperforms Random Forest (best R², lowest errors) <br>
- SVR and KNN perform reasonably but worse than tree-based models <br>
- Linear Regression fails (negative R²): it does worse than predicting the mean rating, which points to overfitting or to relationships a linear model cannot capture <br>

“Tree-based models (Random Forest and Gradient Boosting) outperform SVR, KNN, and Linear Regression. Gradient Boosting achieves the lowest MAE and RMSE and the highest R², making it the most suitable model for predicting IMDb ratings.”