#Model Training : Box office prediction

This codebook documents the model training and evaluation pipeline for the Box Office 7-Day Domestic Revenue Prediction project. Building on the preprocessed and feature-engineered dataset, this notebook is responsible for constructing reproducible train–validation–test splits and training multiple predictive models to benchmark performance and understand model behavior.

The codebook implements three complementary modeling approaches: a linear regression as a transparent baseline, LightGBM as a high-performance gradient boosting model for capturing non-linear interactions, and CatBoost for robust handling of categorical features and complex relationships. Each model is trained under consistent data splits and evaluated using standardized error metrics to enable fair comparison.

The goal of this codebook is twofold: first, to establish a reliable baseline against which more complex models can be judged, and second, to assess how much predictive lift is gained from non-linear and ensemble-based methods. Together, these experiments provide both interpretability and performance insights, guiding model selection for accurate and economically meaningful box office revenue forecasting.

Finally, the best-fitting model is used to predict the 7-day domestic box office of Avatar: Fire and Ash.

##All imports

In [None]:
# Install once (quiet)
!pip install -q catboost optuna lightgbm

# Core
import pandas as pd
import numpy as np
import datetime as dt
import re
import json

# Models & metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from catboost import CatBoostRegressor, Pool
import lightgbm as lgb
import optuna





##Loading Feature list

In [None]:
df=pd.read_csv('/content/drive/MyDrive/Box office prediction dataset/features.csv')


In [None]:
df.info()

##Y Variable

In [None]:
y=df['Seven_Day_Total']

In [None]:
df.info()

Date-time conversions that were missed in preprocessing

In [None]:
df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")

In [None]:
# 2) Derive numeric year (int) for splitting
df["year"] = df["release_date"].dt.year

##Preprocessing Step: Creating Covid Flag
This flag was added late in the project to account for the pressure on movie theaters during the COVID-19 pandemic (release years 2020 and 2021).

In [None]:
df["covid_era"] = df["year"].between(2020, 2021).astype(int)


#Creating X: Final feature list

In [None]:
x=df[['Distributor','Theaters','runtime','log_total_marketing_assets',
     'director_weighted_rating','weekend_release','is_holiday','holiday_type','spoken_languages_iso','has_same_day_competitor',
     'same_day_competitors','month_sin','month_cos','budget_log','cast_top5_popularity_sum_log','is_franchise','franchise_size_log'
     ,'installment_number','is_sequel','keywords_count_log','kw_superhero','kw_fantasy_scifi','kw_horror','kw_animation','kw_romance','kw_biopic'
     ,'kw_crime_thriller','kw_comedy_drama','kw_action_adventure','kw_family','genres_norm','num_genres','num_production_companies','has_major_studio',
     'num_prodcos_log','mpaa_std','is_US','is_China','is_India','num_countries','covid_era','year','Seven_Day_Total']]

In [None]:
x.info()

In [None]:
y.info()

#Train-Validation-Test split

In [None]:
# 3) Time-based split
train_df = x[x["year"] <= 2019]
val_df   = x[(x["year"] >= 2020) & (x["year"] <= 2022)]
test_df  = x[x["year"] >= 2023]

# 4) Features/target
TARGET = "Seven_Day_Total"
X_train, y_train = train_df.drop(columns=[TARGET]), train_df[TARGET]
X_val,   y_val   = val_df.drop(columns=[TARGET]),   val_df[TARGET]
X_test,  y_test  = test_df.drop(columns=[TARGET]),  test_df[TARGET]

# 5) Sanity checks
print("Train years:", int(X_train["year"].min()), "-", int(X_train["year"].max()), "| rows:", len(X_train))
print("Val years:  ", int(X_val["year"].min()),   "-", int(X_val["year"].max()),   "| rows:", len(X_val))
print("Test years: ", int(X_test["year"].min()),  "-", int(X_test["year"].max()),  "| rows:", len(X_test))


In [None]:
# 1) Totals add up (excluding rows with missing year)
n_all = df["year"].notna().sum()
n_train, n_val, n_test = len(X_train), len(X_val), len(X_test)
print("Covered rows:", n_train + n_val + n_test, "of", n_all)

# 2) Year ranges are disjoint and ordered
print("Train years unique:", sorted(train_df["year"].unique()))
print("Val years unique:  ", sorted(val_df["year"].unique()))
print("Test years unique: ", sorted(test_df["year"].unique()))

# 3) No overlap of indices
print("Overlap train∩val:", len(set(train_df.index) & set(val_df.index)))
print("Overlap val∩test: ", len(set(val_df.index) & set(test_df.index)))
print("Overlap train∩test:", len(set(train_df.index) & set(test_df.index)))

# 4) Any NaT release dates that were dropped by the filters?
print("Rows with NaT release_date:", df["release_date"].isna().sum())

# 5) Target availability check
for name, part in [("train", train_df), ("val", val_df), ("test", test_df)]:
    missing_target = part["Seven_Day_Total"].isna().sum()
    print(f"{name}: missing Seven_Day_Total = {missing_target}")

# 6) Optional distribution sanity checks
for name, part in [("train", train_df), ("val", val_df), ("test", test_df)]:
    print(f"{name}: mean target={part['Seven_Day_Total'].mean():.2f}, "
          f"median={part['Seven_Day_Total'].median():.2f}, "
          f"n={len(part)}")


#CatBoost Regressor:
CatBoost Regressor was chosen because it **handles categorical features natively**, reducing the need for extensive one-hot encoding and lowering the risk of information loss. It is particularly effective with **heterogeneous tabular data**, which fits the mix of financial, categorical, and engineered features in this dataset. CatBoost is **robust to overfitting**, especially on smaller-to-medium datasets, due to its ordered boosting strategy. It captures **non-linear interactions** between features that linear models cannot. Finally, it delivers strong performance with **minimal feature scaling and preprocessing**, making it well-suited for this problem.
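To make the "native categorical handling" point concrete, here is a minimal sketch of the ordered target-statistic idea: each row's category is encoded from the targets of rows that come earlier in a random permutation, so a row never sees its own label. This is a simplified illustration, not CatBoost's internal implementation; the function name, the `prior_weight` parameter, and the toy distributors are hypothetical.

```python
import random

def ordered_target_stats(categories, targets, prior_weight=1.0, seed=42):
    """Encode each row's category using only rows that precede it in a random order.

    Simplified illustration of the ordered target-statistic idea; this is not
    CatBoost's internal implementation.
    """
    n = len(categories)
    prior = sum(targets) / n                  # global mean used as the prior
    order = list(range(n))
    random.Random(seed).shuffle(order)        # random permutation of the rows

    sums, counts = {}, {}                     # running totals per category
    encoded = [0.0] * n
    for i in order:
        cat = categories[i]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # The row's own target is excluded: only earlier rows contribute
        encoded[i] = (s + prior_weight * prior) / (c + prior_weight)
        sums[cat] = s + targets[i]
        counts[cat] = c + 1
    return encoded

# Hypothetical distributors and log-revenues, for illustration only
distributors = ["A", "B", "A", "C", "A", "B"]
log_revenue = [17.0, 15.0, 18.0, 12.0, 16.0, 14.0]
print(ordered_target_stats(distributors, log_revenue))
```

A category seen only once (here "C") falls back to the prior, which is one reason rare distributors do not produce extreme encodings.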


##Before hyper-params tuning

In [None]:
# Log-transform the target for stability
y_train_log = np.log1p(y_train)
y_valid_log = np.log1p(y_val)

# Identify categorical columns
cat_cols = X_train.select_dtypes(include='object').columns.tolist()

# Fill NaN values in categorical columns with a placeholder string
# CatBoost requires categorical features to be strings or integers, not NaN
X_train_processed = X_train.copy()
X_val_processed = X_val.copy()
for col in cat_cols:
    X_train_processed[col] = X_train_processed[col].fillna('')
    X_val_processed[col] = X_val_processed[col].fillna('')

cat_feature_indices = [X_train_processed.columns.get_loc(c) for c in cat_cols]

train_pool = Pool(
    data=X_train_processed,
    label=y_train_log,
    cat_features=cat_feature_indices
)
valid_pool = Pool(
    data=X_val_processed,
    label=y_valid_log,
    cat_features=cat_feature_indices
)

# Reasonable starting hyperparameters
cb = CatBoostRegressor(
    loss_function="RMSE",         # in log-space because labels are log1p
    eval_metric="RMSE",
    depth=8,
    learning_rate=0.05,
    l2_leaf_reg=3.0,
    n_estimators=5000,
    random_seed=42,
    od_type="Iter",               # early stopping
    od_wait=200,
    verbose=200
)

cb.fit(train_pool, eval_set=valid_pool, use_best_model=True)

# Predict and invert log
valid_pred_log = cb.predict(valid_pool)
valid_pred = np.expm1(valid_pred_log)

# Metrics
rmse_log = np.sqrt(mean_squared_error(y_valid_log, valid_pred_log))
mae = mean_absolute_error(y_val, valid_pred)

print(f"Validation RMSE (log-space): {rmse_log:.4f}")
print(f"Validation MAE (original scale): {mae:,.0f}")

# Feature importance
fi = pd.Series(cb.get_feature_importance(train_pool), index=X_train_processed.columns).sort_values(ascending=False)
print(fi.head(25))

Overall Model Interpretation

The model was stopped at 780 iterations, indicating early stopping once validation performance stopped improving, which helps prevent overfitting.

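A Validation RMSE (log-space) of 1.1437 suggests moderate predictive accuracy on log-transformed revenue. Because the target is log-transformed, errors are multiplicative rather than additive: exp(1.1437) ≈ 3.1, so a typical prediction is off by roughly a factor of three.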

A Validation MAE of ~11.0 million (original scale) means that, on average, predictions differ from actual 7-day revenue by about USD 11M.

Feature importance values indicate how strongly each variable contributes to reducing prediction error, not directionality.

Feature-by-Feature Explanation (Descending Importance)
1. Theaters (56.20)

The strongest predictor. Opening theater count directly constrains revenue potential—wider releases almost always generate higher opening-week grosses. This dominates all other signals.

2. Distributor (5.08)

Encodes distribution strength, release strategy, and bargaining power with exhibitors. Major distributors typically secure more screens, better timing, and stronger marketing support.

3. budget_log (4.74)

Budget reflects production scale and expected commercial ambition. The log transform captures diminishing returns while still signaling blockbuster-level investment.

4. log_total_marketing_assets (3.98)

Represents overall marketing intensity. Strongly predictive, but secondary to theaters and budget, aligning with real-world box office dynamics.

5. keywords_count_log (3.10)

Acts as a proxy for narrative complexity, genre richness, and discoverability. More keywords often reflect higher audience targeting and content depth.

6. director_weighted_rating (3.08)

Captures historical credibility and audience trust in the director. Well-rated directors consistently improve opening performance.

7. runtime (2.92)

Runtime affects daily screening capacity and audience perception. Both very short and very long runtimes carry trade-offs the model learns non-linearly.

8. mpaa_std (2.46)

Rating impacts addressable audience size (e.g., PG-13 vs R). This is a meaningful constraint on revenue, especially for family-driven openings.

9. cast_top5_popularity_sum_log (2.45)

Star power matters, but with diminishing returns. The log scale reflects that once a cast is highly popular, incremental fame adds less value.

10. year (2.01)

Captures macro trends such as inflation, market growth/decline, streaming competition, and post-pandemic effects.

11. month_sin (1.73)

Encodes seasonal release cycles. Certain periods (summer, holidays) systematically outperform others.

12. franchise_size_log (1.37)

Larger franchises carry built-in audience awareness, but benefits taper as franchises grow longer.

13. month_cos (1.15)

Complements month_sin to fully encode cyclical seasonality without artificial ordering.

14. holiday_type (1.12)

Opening near major holidays boosts attendance due to extended leisure time and family viewing.

15. installment_number (0.92)

Later installments behave differently than originals—often front-loaded but sometimes fatigued—captured here numerically.

16. num_genres (0.87)

Genre breadth can widen appeal, but excessive mixing may dilute targeting.

17. genres_norm (0.86)

Specific genre combinations influence audience size and expectations beyond just count.

18. num_countries (0.83)

International production involvement can signal higher budgets or global appeal, even when predicting domestic revenue.

19. num_production_companies (0.75)

More producers often indicate larger-scale projects, though with limited marginal gain.

20. num_prodcos_log (0.70)

Log version confirms diminishing returns—useful, but weaker than raw scale effects.

21. has_major_studio (0.67)

Major studio backing provides structural advantages, though much of its effect overlaps with theaters and distributor.

22. spoken_languages_iso (0.65)

Language diversity influences accessibility but is less critical for domestic opening performance.

23. same_day_competitors (0.63)

Competition slightly depresses revenue, but its effect is smaller than release scale and marketing.

24. is_franchise (0.46)

Binary franchise membership matters less than how big or which installment the franchise is.

25. weekend_release (0.44)

Most films already open on weekends, limiting its discriminative power.
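The month_sin and month_cos features discussed above rely on a cyclical encoding. The preprocessing code is not part of this notebook, so the sketch below assumes a common convention (January at angle 0, twelve equal steps around the unit circle); the function names are hypothetical.

```python
import math

def encode_month(month):
    """Map a calendar month (1-12) to a point on the unit circle."""
    angle = 2.0 * math.pi * (month - 1) / 12.0   # assumed convention: January at angle 0
    return math.sin(angle), math.cos(angle)

def circular_distance(m1, m2):
    """Euclidean distance between two months in (sin, cos) space."""
    s1, c1 = encode_month(m1)
    s2, c2 = encode_month(m2)
    return math.hypot(s1 - s2, c1 - c2)

# December and January are neighbours on the circle, unlike in the raw 1..12 coding
print(circular_distance(12, 1), circular_distance(1, 2), circular_distance(1, 7))
```

Both components are needed: sin alone cannot distinguish months that are mirror images on the circle, which is why the two features appear as a pair in the importance list.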

CatBoost: Categorical Prep & Pool Construction (log-target)

In [None]:
# 1) Explicit list of categoricals used with CatBoost
cat_cols = [c for c in ["Distributor", "holiday_type", "spoken_languages_iso", "genres_norm", "mpaa_std"]
            if c in X_train.columns]

# 2) Sanitize categoricals in each split: fill NaN and cast to string
for frame in (X_train, X_val, X_test):
    for c in cat_cols:
        # Replace NaN/NaT with a placeholder, then ensure string dtype
        frame[c] = frame[c].where(frame[c].notna(), "Unknown").astype(str)

# (Optional) Quick check that no NaNs remain in categoricals
print(X_train[cat_cols].isna().sum())
print(X_val[cat_cols].isna().sum())
print(X_test[cat_cols].isna().sum())

# 3) Recompute categorical indices after any column changes
cat_idx = [X_train.columns.get_loc(c) for c in cat_cols]

# 4) Build Pools safely
from catboost import Pool
import numpy as np

y_train_log = np.log1p(y_train.astype(float))
y_val_log   = np.log1p(y_val.astype(float))
y_test_log  = np.log1p(y_test.astype(float))

train_pool = Pool(X_train, label=y_train_log, cat_features=cat_idx)
valid_pool = Pool(X_val,   label=y_val_log,   cat_features=cat_idx)
test_pool  = Pool(X_test,  label=y_test_log,  cat_features=cat_idx)


##Hyper-parameter tuning: CatBoost Regressor

In [None]:
def objective(trial):
    params = {
        "loss_function": "RMSE",
        "eval_metric":   "RMSE",
        "random_seed":   42,
        "depth": trial.suggest_int("depth", 6, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.02, 0.2, log=True),
        "l2_leaf_reg":   trial.suggest_float("l2_leaf_reg", 1.0, 10.0, log=True),
        "bagging_temperature": trial.suggest_float("bagging_temperature", 0.0, 1.0),
        "random_strength":     trial.suggest_float("random_strength", 0.0, 2.0),
        "rsm":                 trial.suggest_float("rsm", 0.7, 1.0),
        "border_count":        trial.suggest_int("border_count", 128, 254),
        "grow_policy": trial.suggest_categorical("grow_policy", ["SymmetricTree", "Depthwise"]),
        "od_type": "Iter",
        "od_wait": 200,
        "verbose": False,
    }
    n_estimators = trial.suggest_int("n_estimators", 1500, 6000, step=500)

    model = CatBoostRegressor(**params, n_estimators=n_estimators)
    model.fit(train_pool, eval_set=valid_pool, use_best_model=True)

    val_pred_log = model.predict(valid_pool)
    rmse_log = np.sqrt(mean_squared_error(y_val_log, val_pred_log))

    # The score is only known after the full fit, so there is no intermediate
    # value to report and pruning at this point would save no compute.
    return rmse_log


In [None]:
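# Seed the sampler so the hyper-parameter search is reproducible
study = optuna.create_study(
    direction="minimize",
    study_name="catboost_boxoffice",
    sampler=optuna.samplers.TPESampler(seed=42),
)
study.optimize(objective, n_trials=50, show_progress_bar=True)  # adjust trials as you wish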

print("Best value (val RMSE log):", study.best_value)
print("Best params:", study.best_params)


##After Hyper-params tuning: CatBoost Regressor

In [None]:
# Combine train + valid for final fit
X_trv = pd.concat([X_train, X_val], axis=0)
y_trv = pd.concat([y_train, y_val], axis=0)
y_trv_log = np.log1p(y_trv.astype(float))

cat_cols = [c for c in X_trv.columns if X_trv[c].dtype == "object"]
cat_idx  = [X_trv.columns.get_loc(c) for c in cat_cols]

trv_pool  = Pool(X_trv,  label=y_trv_log,  cat_features=cat_idx)
test_pool = Pool(X_test, label=y_test_log, cat_features=cat_idx)

best_params = study.best_params.copy()
best_n_estimators = best_params.pop("n_estimators")

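# Step 1: find the early-stopped tree count on the validation split.
# The test set is never passed as eval_set, so it cannot influence model selection.
probe_model = CatBoostRegressor(
    loss_function="RMSE",
    eval_metric="RMSE",
    random_seed=42,
    od_type="Iter",
    od_wait=200,
    verbose=False,
    n_estimators=best_n_estimators,
    **best_params
)
probe_model.fit(train_pool, eval_set=valid_pool, use_best_model=True)
final_n_estimators = probe_model.tree_count_   # trees kept after use_best_model

# Step 2: refit on train + valid with that fixed tree count and no eval_set
final_model = CatBoostRegressor(
    loss_function="RMSE",
    eval_metric="RMSE",
    random_seed=42,
    verbose=200,
    n_estimators=final_n_estimators,
    **best_params
)

final_model.fit(trv_pool)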

# Evaluate on TEST
test_pred_log = final_model.predict(test_pool)
test_pred     = np.expm1(test_pred_log)

# Metrics (compat with older sklearn: compute RMSE via sqrt(MSE))
rmse_log = np.sqrt(mean_squared_error(y_test_log, test_pred_log))
mae      = mean_absolute_error(y_test, test_pred)
rmse     = np.sqrt(mean_squared_error(y_test, test_pred))

# R-squared (report both spaces)
r2_log   = r2_score(y_test_log, test_pred_log)   # on log target
r2_orig  = r2_score(y_test,     test_pred)       # on original scale

print(f"TEST RMSE (log-space): {rmse_log:.4f}")
print(f"TEST R^2  (log-space): {r2_log:.4f}")
print(f"TEST MAE  (original) : {mae:,.0f}")
print(f"TEST RMSE (original) : {rmse:,.0f}")
print(f"TEST R^2  (original) : {r2_orig:.4f}")


The tuned CatBoost model achieved a test RMSE of 0.74 in log-space; since exp(0.74) ≈ 2.1, a typical prediction falls within roughly a factor of two of the actual value. The model explains nearly 89% of the variance in log first-week revenue and approximately 58% on the original revenue scale, where a few very large releases dominate the squared error. These results substantially outperform earlier baselines and indicate strong predictive signal under a time-aware evaluation protocol.
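The gap between log-space and original-scale R² is expected when errors are multiplicative. The sketch below uses hypothetical grosses and error factors (not project data) to show how the same predictions can score very differently in the two spaces; the helper `r2` is a plain re-implementation of the usual formula.

```python
import math

def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical 7-day grosses (USD) and predictions, each off by a moderate factor
actual = [1e5, 1e6, 1e7, 1e8, 5e8]
factors = [1.5, 0.7, 1.4, 0.8, 0.6]
predicted = [a * f for a, f in zip(actual, factors)]

actual_log = [math.log1p(a) for a in actual]
predicted_log = [math.log1p(p) for p in predicted]

r2_log = r2(actual_log, predicted_log)
r2_orig = r2(actual, predicted)
print(f"R^2 log-space: {r2_log:.3f} | R^2 original scale: {r2_orig:.3f}")
```

On the original scale the largest title contributes almost all of the squared error, so one miss on a blockbuster lowers R² far more than several comparable relative misses on small releases.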

#Scaling and Encoding

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# ---- inputs you already have ----
# X_train, X_val, X_test, y_train, y_val, y_test

# Explicit categoricals (single-label)
CAT_COLS = [c for c in ["Distributor", "holiday_type", "spoken_languages_iso", "mpaa_std"] if c in X_train.columns]

# Multi-label column (pipe-separated string, e.g., "Action|Adventure")
GENRES_COL = "genres_norm" if "genres_norm" in X_train.columns else None
TOP_K_GENRES = 12  # adjust if you want wider/narrower

# Optional: cap long-tail categories to "Other" based on TRAIN distribution only
def cap_categories_train_only(x_tr, x_va, x_te, col, top_n=30):
    top = x_tr[col].value_counts().head(top_n).index
    x_tr[col] = np.where(x_tr[col].isin(top), x_tr[col], "Other")
    x_va[col] = np.where(x_va[col].isin(top), x_va[col], "Other")
    x_te[col] = np.where(x_te[col].isin(top), x_te[col], "Other")
    return x_tr, x_va, x_te

# Sanitize categoricals: fill NaN → "Unknown", cast to str
def clean_cats_inplace(df, cols):
    for c in cols:
        df[c] = df[c].where(df[c].notna(), "Unknown").astype(str)

In [None]:
# Copy to avoid SettingWithCopy surprises
Xtr, Xva, Xte = X_train.copy(), X_val.copy(), X_test.copy()

# Clean + cap single-label categoricals
clean_cats_inplace(Xtr, CAT_COLS)
clean_cats_inplace(Xva, CAT_COLS)
clean_cats_inplace(Xte, CAT_COLS)

for c in CAT_COLS:
    Xtr, Xva, Xte = cap_categories_train_only(Xtr, Xva, Xte, c, top_n=30)

# One-Hot Encoder on TRAIN only
ohe=OneHotEncoder(handle_unknown="ignore",sparse_output=False,drop=None)
ohe.fit(Xtr[CAT_COLS])

def apply_ohe(df):
    arr = ohe.transform(df[CAT_COLS])
    cols = ohe.get_feature_names_out(CAT_COLS)
    ohe_df = pd.DataFrame(arr, columns=cols, index=df.index)
    return pd.concat([df.drop(columns=CAT_COLS), ohe_df], axis=1)

Xtr_ohe = apply_ohe(Xtr)
Xva_ohe = apply_ohe(Xva)
Xte_ohe = apply_ohe(Xte)

ohe_cols = list(ohe.get_feature_names_out(CAT_COLS))


In [None]:
def fit_genre_space(xtr, col, k=TOP_K_GENRES):
    # build top-K list from TRAIN only
    g = (
        xtr[col].fillna("")
        .astype(str).str.split("|")
        .explode().str.strip()
    )
    top = g[g != ""].value_counts().head(k).index.tolist()
    return top

def add_genre_multihot(df, col, top_genres):
    if col is None:
        return df, []
    base = df[col].fillna("").astype(str)
    out_cols = []
    for g in top_genres:
        cname = f"genre_{g}"
        # re.escape guards against genre names containing regex metacharacters
        df[cname] = base.str.contains(fr"(?:^|\|){re.escape(g)}(?:\||$)", regex=True).astype(int)
        out_cols.append(cname)
    return df.drop(columns=[col], errors="ignore"), out_cols

if GENRES_COL:
    top_genres = fit_genre_space(Xtr_ohe, GENRES_COL, k=TOP_K_GENRES)
    Xtr_ohe, genre_cols = add_genre_multihot(Xtr_ohe, GENRES_COL, top_genres)
    Xva_ohe, _         = add_genre_multihot(Xva_ohe, GENRES_COL, top_genres)
    Xte_ohe, _         = add_genre_multihot(Xte_ohe, GENRES_COL, top_genres)
else:
    genre_cols = []


In [None]:
all_cols = Xtr_ohe.columns.tolist()

non_ohe_cols = [c for c in all_cols if c not in ohe_cols + genre_cols]
num_cols_lin = [c for c in non_ohe_cols if np.issubdtype(Xtr_ohe[c].dtype, np.number)]

# Standardize numerics for LINEAR only (fit on TRAIN, apply to VAL/TEST)
scaler = StandardScaler()
Xtr_lin = Xtr_ohe.copy()
Xva_lin = Xva_ohe.copy()
Xte_lin = Xte_ohe.copy()

Xtr_lin[num_cols_lin] = scaler.fit_transform(Xtr_lin[num_cols_lin])
Xva_lin[num_cols_lin] = scaler.transform(Xva_lin[num_cols_lin])
Xte_lin[num_cols_lin] = scaler.transform(Xte_lin[num_cols_lin])


In [None]:
# For LightGBM (tree-based), use the unscaled OHE+multi-hot frames:
Xtr_lgbm, Xva_lgbm, Xte_lgbm = Xtr_ohe, Xva_ohe, Xte_ohe

#Baseline: LinearRegression

In [None]:
ytr_log = np.log1p(y_train.astype(float))
yva_log = np.log1p(y_val.astype(float))
yte_log = np.log1p(y_test.astype(float))

lr = LinearRegression()
lr.fit(Xtr_lin, ytr_log)

# Validation
pred_va_log = lr.predict(Xva_lin)
pred_va = np.expm1(pred_va_log)

rmse_log_va = np.sqrt(mean_squared_error(yva_log, pred_va_log))
mae_va      = mean_absolute_error(y_val, pred_va)
r2_log_va   = r2_score(yva_log, pred_va_log)

print(f"[Baseline Linear] VAL RMSE (log): {rmse_log_va:.4f} | VAL MAE: {mae_va:,.0f} | VAL R^2 (log): {r2_log_va:.4f}")

# Test
pred_te_log = lr.predict(Xte_lin)
pred_te = np.expm1(pred_te_log)

rmse_log_te = np.sqrt(mean_squared_error(yte_log, pred_te_log))
mae_te      = mean_absolute_error(y_test, pred_te)
r2_log_te   = r2_score(yte_log, pred_te_log)

print(f"[Baseline Linear] TEST RMSE (log): {rmse_log_te:.4f} | TEST MAE: {mae_te:,.0f} | TEST R^2 (log): {r2_log_te:.4f}")


#LightGBM Regressor

##Before hyper parameters tuning

In [None]:
# Function to sanitize column names
def sanitize_lgbm_col_names(df):
    cols = df.columns
    new_cols = []
    for col in cols:
        new_col = re.sub(r'[^A-Za-z0-9_]+', '_', col)
        new_col = new_col.replace(' ', '_')
        new_cols.append(new_col)
    df.columns = new_cols
    return df

# Sanitize column names for LightGBM DataFrames
Xtr_lgbm_sanitized = sanitize_lgbm_col_names(Xtr_lgbm.copy())
Xva_lgbm_sanitized = sanitize_lgbm_col_names(Xva_lgbm.copy())
Xte_lgbm_sanitized = sanitize_lgbm_col_names(Xte_lgbm.copy())

# LightGBM works fine on OHE + raw numerics; we keep the log-target
lgb_train = lgb.Dataset(Xtr_lgbm_sanitized, label=ytr_log)
lgb_valid = lgb.Dataset(Xva_lgbm_sanitized, label=yva_log, reference=lgb_train)

params = {
    "objective": "regression",
    "metric": "rmse",
    "learning_rate": 0.05,
    "num_leaves": 63,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,
    "min_data_in_leaf": 20,
    "seed": 42,
}

lgbm_model = lgb.train(
    params,
    lgb_train,
    num_boost_round=5000,
    valid_sets=[lgb_train, lgb_valid],
    valid_names=["train","valid"],
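    # early_stopping's `verbose` is a boolean; periodic logging needs log_evaluation
    callbacks=[lgb.early_stopping(stopping_rounds=200), lgb.log_evaluation(period=200)]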
)

# Evaluate on TEST (log-space)
pred_te_log_lgbm = lgbm_model.predict(Xte_lgbm_sanitized, num_iteration=lgbm_model.best_iteration)
pred_te_lgbm = np.expm1(pred_te_log_lgbm)

rmse_log_te_lgbm = np.sqrt(mean_squared_error(yte_log, pred_te_log_lgbm))
mae_te_lgbm      = mean_absolute_error(y_test, pred_te_lgbm)
r2_log_te_lgbm   = r2_score(yte_log, pred_te_log_lgbm)

print(f"[LightGBM] TEST RMSE (log): {rmse_log_te_lgbm:.4f} | TEST MAE: {mae_te_lgbm:,.0f} | TEST R^2 (log): {r2_log_te_lgbm:.4f}")

##Hyper-parameter tuning and Final LightGBM training

In [None]:
optuna.logging.set_verbosity(optuna.logging.WARNING)  # quiet, ASCII-safe

# Sanitize column names for LightGBM DataFrames
Xtr_lgbm = sanitize_lgbm_col_names(Xtr_lgbm.copy())
Xva_lgbm = sanitize_lgbm_col_names(Xva_lgbm.copy())
Xte_lgbm = sanitize_lgbm_col_names(Xte_lgbm.copy())

lgb_train = lgb.Dataset(Xtr_lgbm, label=ytr_log)
lgb_valid = lgb.Dataset(Xva_lgbm, label=yva_log, reference=lgb_train)

def objective(trial: optuna.Trial) -> float:
    params = {
        "objective": "regression",
        "metric": "rmse",
        "verbosity": -1,                  # no unicode logs
        "seed": 42,
        "feature_pre_filter": False,      # Add this line to resolve the LightGBMError
        "num_leaves": trial.suggest_int("num_leaves", 31, 255),
        "max_depth": trial.suggest_int("max_depth", -1, 16),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 10, 200),
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-4, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-4, 10.0, log=True),
        "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0.0, 1.0),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.6, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.6, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 0, 7),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.15, log=True),
        "extra_trees": trial.suggest_categorical("extra_trees", [False, True]),
    }
    num_boost_round = trial.suggest_int("num_boost_round", 1500, 6000, step=500)

    model = lgb.train(
        params,
        lgb_train,
        num_boost_round=num_boost_round,
        valid_sets=[lgb_valid],           # only valid set
        valid_names=["valid"],
        callbacks=[
            lgb.early_stopping(stopping_rounds=200, verbose=False),
            lgb.log_evaluation(period=0), # no periodic logs
        ],
    )
    return model.best_score["valid"]["rmse"]

# No pruner is configured: the objective reports no intermediate values,
# so a pruner would never trigger. A seeded sampler keeps the search reproducible.
study = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50, show_progress_bar=False)

# Safe ASCII prints
print("Best val log-RMSE:", study.best_value)
print("Best params:", json.dumps(study.best_params, ensure_ascii=True))

# Retrain on train+valid (quiet); pd.concat keeps the feature names
X_trv_lgbm = pd.concat([Xtr_lgbm, Xva_lgbm], axis=0)
y_trv_log_lgbm = np.concatenate([ytr_log, yva_log])
lgb_trv = lgb.Dataset(X_trv_lgbm, label=y_trv_log_lgbm)

best = study.best_params.copy()
num_boost_round = best.pop("num_boost_round")

# Named final_lgbm so the tuned CatBoost `final_model` is not overwritten
final_lgbm = lgb.train(
    {**best, "objective": "regression", "metric": "rmse", "verbosity": -1, "seed": 42},
    lgb_trv,
    num_boost_round=num_boost_round,
    valid_sets=[lgb_trv],
    valid_names=["train_valid"],
    callbacks=[lgb.log_evaluation(period=0)],
)

# Test evaluation (explicit sqrt for older sklearn)
# No early stopping was used in this fit, so predict with all trees
pred_test_log = final_lgbm.predict(Xte_lgbm)
pred_test = np.expm1(pred_test_log)

rmse_log = np.sqrt(mean_squared_error(yte_log, pred_test_log))
mae      = mean_absolute_error(y_test, pred_test)
rmse     = np.sqrt(mean_squared_error(y_test, pred_test))
r2_log   = r2_score(yte_log, pred_test_log)
r2_orig  = r2_score(y_test, pred_test)

print("TEST_RMSE_LOG:", float(rmse_log))
print("TEST_R2_LOG:",  float(r2_log))
print("TEST_MAE:",     float(mae))
print("TEST_RMSE:",    float(rmse))
print("TEST_R2:",      float(r2_orig))

| **Metric**                     | **Linear Regression (Baseline)** | **CatBoost Regressor** | **LightGBM Regressor** |
| ------------------------------ | -------------------------------- | ---------------------- | ---------------------- |
| **TEST RMSE (log-space)**      | 0.9539                           | **0.7626**             | 0.7746                 |
| **TEST R² (log-space)**        | 0.8136                           | **0.8809**             | 0.8771                 |
| **TEST MAE (original scale)**  | 16,330,805                       | **10,484,272**         | 10,570,600             |
| **TEST RMSE (original scale)** | N/A                              | **27,656,886**         | 28,378,886             |
| **TEST R² (original scale)**   | N/A                              | **0.5539**             | 0.5303                 |


#Model Selection: CatBoost Regressor

CatBoost was selected as the final model because it delivered the **best overall predictive performance** on every reported test metric while remaining well suited to the structure of the dataset. Compared to Linear Regression, CatBoost captured complex non-linear relationships and interactions inherent in box office dynamics (log-space RMSE 0.7626 vs 0.9539), and it marginally outperformed LightGBM (0.7626 vs 0.7746 in log-space, with similarly small gaps on the original scale). Its native handling of categorical features such as distributor, holiday type, MPAA rating, genres, and languages reduced the need for aggressive encoding and helped preserve informative signals. CatBoost’s ordered boosting strategy is also designed to limit overfitting, which is consistent with its generalization under a time-aware train–test split. Together, these properties gave CatBoost the lowest error and highest explanatory power of the three models, making it the most reliable choice for first-week box office revenue prediction in this project.


#Loading Avatar: Fire and Ash data for prediction

The Avatar data was preprocessed with the same steps as the training dataset. Because the preprocessing code is identical, refer to the Data-preprocessing codebook for details; the codebook that produced this CSV file is not shared separately.

In [None]:
avatar_pred=pd.read_csv("/content/drive/MyDrive/Box office prediction dataset/avatar_features.csv")

In [None]:
avatar_pred.info()

The first set of code blocks prepares x_avatar as a model-ready inference sample by mirroring the exact feature structure used during training. It engineers a covid_era indicator from the release year, selects the finalized feature set in the same order as the training matrix, and renames release_year to year to ensure schema consistency. The code then applies categorical sanitation (clean_cats_inplace) to fill missing values and enforce string types, followed by category capping using thresholds learned only from the training data. This guarantees that unseen or rare categories in the Avatar sample are safely mapped to "Other" without leaking test-time information into the model.

The second block constructs a CatBoost Pool for inference using the previously computed categorical feature indices, which tell CatBoost which columns to treat as categorical. The CatBoost model cb, fitted in the "Before hyper-params tuning" section, then predicts log-transformed first-week revenue for the Avatar sample. Finally, the prediction is converted back to the original dollar scale using np.expm1, producing an interpretable revenue estimate. Together, these steps keep inference aligned with the training pipeline and free of test-time leakage.

In [None]:
avatar_pred["covid_era"] = avatar_pred["release_year"].between(2020, 2021).astype(int)

In [None]:
x_avatar=avatar_pred[['Distributor','Theaters','runtime','log_total_marketing_assets',
     'director_weighted_rating','weekend_release','is_holiday','holiday_type','spoken_languages_iso','has_same_day_competitor',
     'same_day_competitors','month_sin','month_cos','budget_log','cast_top5_popularity_sum_log','is_franchise','franchise_size_log'
     ,'installment_number','is_sequel','keywords_count_log','kw_superhero','kw_fantasy_scifi','kw_horror','kw_animation','kw_romance','kw_biopic'
     ,'kw_crime_thriller','kw_comedy_drama','kw_action_adventure','kw_family','genres_norm','num_genres','num_production_companies','has_major_studio',
     'num_prodcos_log','mpaa_std','is_US','is_China','is_India','num_countries','covid_era','release_year']].copy()  # copy so later in-place edits do not touch avatar_pred

In [None]:
y_avatar='Seven_Day_Total'

In [None]:
x_avatar = x_avatar.rename(columns={'release_year': 'year'})

In [None]:
# cat_cols is the CatBoost categorical list, which also covers genres_norm
clean_cats_inplace(x_avatar, cat_cols)

In [None]:
for c in CAT_COLS:
    # Pass copies so the training and test frames are not modified as a side effect
    _, x_avatar, _ = cap_categories_train_only(X_train.copy(), x_avatar, X_test.copy(), c, top_n=30)

In [None]:
avatar_pool = Pool(x_avatar, cat_features=cat_idx)
print("CatBoost Pool for x_avatar created successfully.")

#Predicted Revenue for Avatar

In [None]:
predicted_log_revenue = cb.predict(avatar_pool)
predicted_revenue = np.expm1(predicted_log_revenue)

print(f"Predicted revenue for 'Avatar: Fire and Ash': ${predicted_revenue[0]:,.2f}")

$124,989,450.64 is the model’s point estimate of first-7-day domestic box office for “Avatar: Fire and Ash.” The CatBoost model predicts log(revenue + 1); the code then applies np.expm1() to invert that transform so the result is in dollars.
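The log1p / expm1 pair described above can be checked with a short round trip. This sketch uses Python's math module instead of NumPy so that it runs without dependencies; the variable names are illustrative only.

```python
import math

revenue = 124_989_450.64              # point estimate quoted above, in USD
log_value = math.log1p(revenue)       # log(revenue + 1): the scale the model predicts on
recovered = math.expm1(log_value)     # exp(value) - 1: inverse transform back to dollars

print(f"log1p: {log_value:.4f}  ->  expm1: {recovered:,.2f}")

# log1p is also defined at zero, which a plain log is not
print(math.log1p(0.0))
```

Using log1p rather than a plain logarithm keeps the transform defined for a title with zero recorded revenue.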

Why the model landed here
CatBoost learned relationships from 2,573 past releases and combined the inputs supplied for this title into a non-linear forecast: Theaters (release scale), Distributor and the major-studio flag, budget_log, log_total_marketing_assets, franchise signals (is_franchise, franchise_size_log, installment_number), seasonality and holiday features (month_sin, month_cos, holiday_type), and same-day competition. In the feature importances reported earlier, Theaters dominates, with material contributions from Distributor, budget_log, and marketing intensity; those likely pushed the prediction into the ~$125M range, moderated by timing (month/holiday), competition, runtime, and content features.

How to interpret accuracy
The project’s tuned CatBoost shows RMSE ≈ 0.76 in log space and MAE ≈ $10.5M on the original scale. The MAE is an average over all test releases, most of which gross far less than a tentpole title, so $125M ± $10M understates the uncertainty for a film of this size (raw-scale RMSE is already ≈ $27–28M). Because the model's errors are multiplicative, the log-space RMSE is the better guide: it implies that a typical forecast can miss by roughly a factor of two in either direction, with wider uncertainty for exceptional blockbusters.
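As a rough worked example of what a log-space RMSE implies in dollars, the sketch below converts the point forecast and an RMSE of 0.76 into a multiplicative range. It assumes errors are symmetric in log space and ignores the +1 inside log1p, so it is an indicative band rather than a calibrated prediction interval; `multiplicative_band` is a hypothetical helper.

```python
import math

def multiplicative_band(point_estimate, rmse_log):
    """Return the (low, high) range implied by +/- one log-space RMSE."""
    factor = math.exp(rmse_log)
    return point_estimate / factor, point_estimate * factor

# Values quoted in this notebook: the point forecast and the tuned model's log-space RMSE
low, high = multiplicative_band(124_989_450.64, 0.76)
print(f"Roughly ${low:,.0f} to ${high:,.0f}")
```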