# **Project Name**    - Amazon Prime Video Content Analysis (ML Submission)


##### **Project Type**    - ML / Regression
##### **Contribution**    - Vishnu Patel PV
##### **Tools & Libraries**    - Python, Pandas, Scikit-learn, Matplotlib, NumPy, Jupyter Notebook


# **Project Summary -**
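This notebook follows the Sample_ML_Submission_Template and trains regression models on the Amazon Prime titles dataset.
It selects a target automatically among the numeric columns by ranking them first on non-null ratio and then on variance. On this data the ranking selects `runtime`, because it is fully populated and has the highest variance among the fully populated columns, rather than a quality or popularity metric such as `imdb_score` or `tmdb_popularity`. The notebook then prepares features from the titles dataset (numeric fields, top-10 genre flags, and a movie/show flag), trains a mean baseline, a Linear Regression, and a Random Forest, evaluates them on a held-out test set, and closes with conclusions and recommendations.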


# **GitHub Link -**
https://github.com/Vishnu27122004/Amazon-Prime-EDA.git

# **Problem Statement**

Predict a numeric quality/popularity metric for Amazon Prime titles to help content ranking or recommendation.


#### **Business Objective**
- Build a model to predict a chosen numeric target (e.g., IMDb score) to support content recommendations and acquisitions.
- Evaluate model performance and provide actionable insights.


# **Dataset Description**

We use the `titles.csv.zip` and `credits.csv.zip` files from the project zip. The model uses fields from `titles` such as `release_year`, `genres`, `runtime`, `imdb_score`, and `tmdb_popularity`. The `credits` file is loaded but not yet used to build features.

Files:
- `/mnt/data/amazon_project/Amazon Prime EDA/titles.csv.zip`
- `/mnt/data/amazon_project/Amazon Prime EDA/credits.csv.zip`

Rows (as loaded earlier): Titles: 9,871 rows; Credits: 124,235 rows.


In [1]:
# Load data and automatically select a suitable numeric target column
import pandas as pd, numpy as np
titles = pd.read_csv(r'/mnt/data/amazon_project/Amazon Prime EDA/titles.csv.zip', compression='zip', low_memory=False)
credits = pd.read_csv(r'/mnt/data/amazon_project/Amazon Prime EDA/credits.csv.zip', compression='zip', low_memory=False)

# Clean column names
titles.columns = [c.strip() for c in titles.columns]
credits.columns = [c.strip() for c in credits.columns]

# Identify numeric candidate targets with sufficient non-null values
numeric_cols = titles.select_dtypes(include=[np.number]).columns.tolist()
# also consider numeric-like columns stored as object
for c in titles.columns:
    if c not in numeric_cols:
        try:
            converted = pd.to_numeric(titles[c], errors='coerce')
            if converted.notnull().sum() / len(titles) > 0.6:
                numeric_cols.append(c)
        except (ValueError, TypeError):
            pass

# Filter candidates by non-null ratio and variance
candidates = []
for c in numeric_cols:
    nonnull_ratio = titles[c].notnull().mean()
    variance = titles[c].dropna().var() if titles[c].dropna().shape[0]>1 else 0
    if nonnull_ratio > 0.5 and variance>0:
        candidates.append((c, nonnull_ratio, variance))

candidates = sorted(candidates, key=lambda x: (-x[1], -x[2]))  # prefer more non-null then higher variance

print('Numeric candidate targets (name, nonnull_ratio, variance):')
for row in candidates:
    print(row)

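# Choose the top candidate automatically, else default to 'imdb_score' if present.
# Note: the ranking favours fully populated columns, so it can pick a feature-like column such as 'runtime'.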
if candidates:
    target_col = candidates[0][0]
elif 'imdb_score' in titles.columns:
    target_col = 'imdb_score'
else:
    # fallback: create a proxy popularity if tmdb_popularity or imdb_votes exists
    if 'tmdb_popularity' in titles.columns:
        target_col = 'tmdb_popularity'
    elif 'imdb_votes' in titles.columns:
        target_col = 'imdb_votes'
    else:
        target_col = None

print('\nSelected target column for ML:', target_col)


Numeric candidate targets (name, nonnull_ratio, variance):
('runtime', 1.0, 1123.0853933061724)
('release_year', 1.0, 666.1597847696274)
('tmdb_popularity', 0.9445851484145477, 900.2459041088689)
('imdb_score', 0.8965656974977206, 1.8059101611277215)
('imdb_votes', 0.8955526289129774, 2108660350.946103)
('tmdb_score', 0.7890791206564685, 2.304280573798549)

Selected target column for ML: runtime
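
The ranking above picks `runtime` because it sorts on non-null ratio before anything else, which does not match the stated goal of predicting a quality or popularity metric. One possible alternative, shown here only as a sketch, is to walk an explicit preference list and take the first column that passes the same basic checks. The `PREFERRED_TARGETS` list, the `pick_target` helper, and the tiny `demo` frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical preference order; adjust it to the business question.
PREFERRED_TARGETS = ["imdb_score", "tmdb_popularity", "imdb_votes"]


def pick_target(frame, preferred=PREFERRED_TARGETS, min_nonnull=0.5):
    """Return the first preferred column that exists, is mostly non-null, and is not constant."""
    for col in preferred:
        if col not in frame.columns:
            continue
        values = pd.to_numeric(frame[col], errors="coerce")
        if values.notnull().mean() > min_nonnull and values.nunique(dropna=True) > 1:
            return col
    return None


# Tiny illustrative frame with made-up values (not the real dataset).
demo = pd.DataFrame({
    "runtime": [90, 45, 120, 100],
    "release_year": [2001, 2015, 1999, 2020],
    "imdb_score": [6.1, None, 7.4, 5.8],
})
chosen = pick_target(demo)
print("Selected target:", chosen)
```

With a helper like this, `runtime` and `release_year` stay available as features instead of competing to be the target.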


In [2]:
# Feature engineering: create usable features
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Work on a copy
df = titles.copy()

# target defined in previous cell; ensure it exists
try:
    target_col
except NameError:
    target_col = None

if target_col is None:
    raise ValueError('No suitable target column found. Please choose a target manually.')

# Basic features: release_year, runtime, imdb_score (if not target), tmdb_score, tmdb_popularity, genres
# Clean release_year
if 'release_year' in df.columns:
    df['release_year'] = pd.to_numeric(df['release_year'], errors='coerce')
else:
    if 'year' in df.columns:
        df['release_year'] = pd.to_numeric(df['year'], errors='coerce')

# runtime numeric
if 'runtime' in df.columns:
    df['runtime'] = pd.to_numeric(df['runtime'], errors='coerce')

# Clean genres into top-k one-hot features
import re
def clean_genre_cell(x):
    if pd.isna(x): return []
    s = str(x)
    s = re.sub(r"[\[\]\'\"]", "", s)
    parts = re.split(r"[|,;]", s)
    parts = [p.strip().lower() for p in parts if p.strip()!='']
    return parts

if 'genres' in df.columns:
    df['genres_list'] = df['genres'].apply(clean_genre_cell)
    all_genres = df['genres_list'].explode().dropna().value_counts()
    top_genres = all_genres.head(10).index.tolist()
    for g in top_genres:
        df['genre_' + g] = df['genres_list'].apply(lambda lst: 1 if g in lst else 0)
else:
    top_genres = []

# Use 'type' as feature if exists (MOVIE/SHOW)
if 'type' in df.columns:
    df['type_is_show'] = df['type'].apply(lambda x: 1 if str(x).strip().upper()=='SHOW' else 0)

# Select numeric feature columns
feature_cols = ['release_year', 'runtime', 'tmdb_score', 'tmdb_popularity', 'imdb_score', 'imdb_votes']
feature_cols = [c for c in feature_cols if c in df.columns and c!=target_col]
# add genre features
feature_cols += [c for c in df.columns if c.startswith('genre_')]
# add type flag if exists
if 'type_is_show' in df.columns:
    feature_cols.append('type_is_show')

# Prepare X and y
X = df[feature_cols].copy()
y = pd.to_numeric(df[target_col], errors='coerce')

# Drop rows where y is null
mask = y.notnull()
X = X[mask]
y = y[mask]

# Impute missing numeric values with the median.
# Caveat: the imputer is fitted on all rows before the split, so test rows contribute to the medians.
num_imputer = SimpleImputer(strategy='median')
X_imputed = pd.DataFrame(num_imputer.fit_transform(X), columns=X.columns, index=X.index)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.2, random_state=42)
print('Features used:', X_train.columns.tolist())
print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)

Features used: ['release_year', 'tmdb_score', 'tmdb_popularity', 'imdb_score', 'imdb_votes', 'genre_drama', 'genre_comedy', 'genre_thriller', 'genre_action', 'genre_romance', 'genre_crime', 'genre_documentation', 'genre_horror', 'genre_family', 'genre_european', 'type_is_show']
Train shape: (7896, 16) Test shape: (1975, 16)
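
The cell above fits the median imputer on every row before splitting, so the test rows influence the imputed values. A minimal sketch of one way to avoid that is to split first and wrap the imputer and the model in a scikit-learn `Pipeline`, so the medians are learned from the training rows only. The data below is synthetic and the names `X_demo`, `y_demo`, and `model` are illustrative assumptions, not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the feature matrix and target (made-up values).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "release_year": rng.integers(1950, 2023, size=200).astype(float),
    "imdb_score": rng.normal(6.0, 1.0, size=200),
})
X_demo.iloc[::7, 1] = np.nan  # knock out some imdb_score values
y_demo = 60 + 0.2 * (X_demo["release_year"] - 1950) + rng.normal(0, 5, size=200)

# Split first, so nothing is fitted on the test rows.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the imputer on X_tr only and reuses those medians at predict time.
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("rf", RandomForestRegressor(n_estimators=20, random_state=42)),
])
model.fit(X_tr, y_tr)
preds = model.predict(X_te)
print(preds.shape)
```

The same pipeline object can be passed to cross-validation or a parameter search, which repeats the train-only imputation inside each fold.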


In [3]:
# Modeling: Baseline, Linear Regression, Random Forest
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

results = {}

# Baseline: predict mean
dummy = DummyRegressor(strategy='mean')
dummy.fit(X_train, y_train)
y_pred_dummy = dummy.predict(X_test)
results['dummy_rmse'] = np.sqrt(mean_squared_error(y_test, y_pred_dummy))
results['dummy_r2'] = r2_score(y_test, y_pred_dummy)

# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
results['lr_rmse'] = np.sqrt(mean_squared_error(y_test, y_pred_lr))
results['lr_r2'] = r2_score(y_test, y_pred_lr)

# Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
results['rf_rmse'] = np.sqrt(mean_squared_error(y_test, y_pred_rf))
results['rf_r2'] = r2_score(y_test, y_pred_rf)

print('Model results (lower RMSE better, higher R2 better):')
for k, v in results.items():
    print(k, v)

# Feature importances for RF
importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
display(importances.head(20))

Model results (lower RMSE better, higher R2 better):
dummy_rmse 32.579010913356555
dummy_r2 -4.0557840419452873e-07
lr_rmse 23.565387322176353
lr_r2 0.4767929945406527
rf_rmse 21.06553595203713
rf_r2 0.5819103545586066


type_is_show           0.358491
release_year           0.131915
tmdb_popularity        0.131629
imdb_votes             0.110323
imdb_score             0.073769
tmdb_score             0.051553
genre_drama            0.042843
genre_action           0.018760
genre_romance          0.015456
genre_documentation    0.013498
genre_comedy           0.011559
genre_family           0.011096
genre_thriller         0.010539
genre_crime            0.007775
genre_european         0.005461
genre_horror           0.005333
dtype: float64
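
The values printed above are the Random Forest's impurity-based `feature_importances_`, which are computed from the training data. Permutation importance on held-out rows is a common cross-check. The sketch below uses synthetic data in which the target depends strongly on a `type_is_show` flag and not on a `noise` column; all values and the `rf_demo` name are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target is driven by type_is_show, while noise is irrelevant.
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    "type_is_show": rng.integers(0, 2, size=n),
    "noise": rng.normal(size=n),
})
y_demo = 100 - 60 * X_demo["type_is_show"] + rng.normal(0, 5, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
rf_demo = RandomForestRegressor(n_estimators=30, random_state=42).fit(X_tr, y_tr)

# Shuffle each column of the test set in turn and measure the drop in score.
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=5, random_state=42)
perm_importances = pd.Series(perm.importances_mean, index=X_te.columns).sort_values(ascending=False)
print(perm_importances)
```

If the two rankings disagree on the real data, the permutation result is usually the safer one to report, since it is measured on rows the model has not seen.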


# **Conclusions & Recommendations**

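- The automatic selection chose `runtime` as the target. On the held-out test set the Random Forest reached RMSE ≈ 21.07 and R² ≈ 0.58, compared with RMSE ≈ 23.57 and R² ≈ 0.48 for Linear Regression and RMSE ≈ 32.58 and R² ≈ 0 for the mean baseline. That is roughly a 35% reduction in RMSE over the baseline for the Random Forest and about 28% for Linear Regression.
- `type_is_show` is the most important Random Forest feature (≈ 0.36), followed by `release_year` and `tmdb_popularity` (≈ 0.13 each). This likely reflects the difference in typical runtime between shows and movies.
- If performance is not sufficient, consider feature engineering (text embeddings of descriptions, actor/director features from credits), hyperparameter tuning, additional data sources, or gradient boosting (XGBoost/LightGBM).
- Save the best model and export predictions for submission.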
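
Saving the model and exporting predictions can be sketched as follows, assuming `joblib` (installed alongside scikit-learn) for persistence and a plain CSV for the predictions. The file names `best_model.joblib` and `predictions.csv`, the temporary output directory, and the synthetic data are illustrative assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for a fitted "best" model (made-up data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])
y_demo = 2 * X_demo["f1"] + rng.normal(0, 0.1, size=50)
best_model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# A temporary directory is used here; any writable path works.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "best_model.joblib")
pred_path = os.path.join(out_dir, "predictions.csv")

# Persist the fitted model and write one prediction per row.
joblib.dump(best_model, model_path)
predictions = pd.DataFrame({"prediction": best_model.predict(X_demo)}, index=X_demo.index)
predictions.to_csv(pred_path, index_label="row_id")

# Reload and confirm the saved model reproduces the same predictions.
reloaded = joblib.load(model_path)
same = np.allclose(reloaded.predict(X_demo), best_model.predict(X_demo))
print("Reloaded model matches:", same)
```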

# **Next steps**
- Hyperparameter tuning with GridSearchCV or RandomizedSearchCV
- Text features from `description` using TF-IDF or transformers
- Use credits to generate actor/director popularity features
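
As a starting point for the tuning step, here is a minimal `RandomizedSearchCV` sketch on synthetic data. The parameter grid, the number of iterations, and the fold count are arbitrary illustrative choices, not tuned recommendations for this dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (made-up values).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 0.5, size=120)

# Small, illustrative search space.
param_distributions = {
    "n_estimators": [20, 50],
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,                               # number of sampled configurations
    cv=3,                                   # 3-fold cross-validation
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    random_state=42,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)
```

Because the scorer is the negated RMSE, the sign is flipped when reporting. `search.best_estimator_` is refit on all the data passed to `fit` and can be evaluated on the held-out test set afterwards.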


# **References**
- Dataset provided in project zip

# **Appendix**
- All code is present in this notebook. Figures and saved models can be stored in `/mnt/data/`.