<div align="center">
    
# 5.0 Modeling

## 5.1 Table of Contents<a id='5.1_Table_of_Contents'></a>
* [5.1 Table of Contents](#5.1_Table_of_Contents)
* [5.2 Introduction](#5.2_Introduction)
* [5.3 Library Imports](#5.3_Library_Imports)
* [5.4 Data Loading](#5.4_Data_Loading)
* [5.5 Baseline Model](#5.5_Baseline_Model)
* [5.6 Model Comparison](#5.6_Model_Comparison)
  * [5.6.1 Linear Regression](#5.6.1_Linear_Regression)
  * [5.6.2 Regularized Models (Ridge, Lasso, ElasticNet)](#5.6.2_Regularized_Models)
  * [5.6.3 Tree-Based Models (RF, GBM)](#5.6.3_Tree_Based_Models)
* [5.7 Model Selection & Tuning](#5.7_Model_Selection_Tuning)
* [5.8 Final Model Training](#5.8_Final_Model_Training)
* [5.9 Save Artifacts](#5.9_Save_Artifacts)
* [5.10 Summary](#5.10_Summary)

## 5.2 Introduction<a id='5.2_Introduction'></a>

This notebook builds predictive models for ski resort weekend ticket prices using the processed and feature-engineered dataset.
We’ll train baseline and advanced models, evaluate them on the validation set using R², RMSE, and MAE, and select the model that offers the best trade-off between performance and generalization.

At the end:

* The final model and evaluation metrics will be saved for reuse in the next notebook (sys_06_model_evaluation.ipynb).
* The trained model will be applied to estimate Big Mountain Resort’s optimal price.

## 5.3 Library Imports<a id='5.3_Library_Imports'></a>

In [1]:
import numpy as np
import pandas as pd
import os
from pathlib import Path
from joblib import load, dump

# Models
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Visualization
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

## 5.4 Data Loading<a id='5.4_Data_Loading'></a>

In [2]:
# Load artifacts: transformed training/validation features, targets, and feature names

art = Path("../artifacts")

X_train = np.load(art / "X_train_tf.npy")
X_val   = np.load(art / "X_val_tf.npy")
y_train = pd.read_csv(art / "y_train.csv").squeeze()
y_val   = pd.read_csv(art / "y_val.csv").squeeze()

feature_names = pd.read_csv(art / "feature_names.csv").squeeze().tolist()

print("Data Loaded")
print("Train:", X_train.shape, "| Val:", X_val.shape)
print("y_train mean:", round(y_train.mean(), 2))

Data Loaded
Train: (216, 92) | Val: (55, 92)
y_train mean: 64.32
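
Because the features and targets come from separate files, a quick alignment check before modeling can catch a mismatched export early. The sketch below is one possible form of such a check; `check_alignment` and the toy arrays are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def check_alignment(X, y, feature_names):
    """Raise ValueError if features, target, and feature names disagree in size."""
    if X.shape[0] != len(y):
        raise ValueError(f"X has {X.shape[0]} rows but y has {len(y)} values")
    if X.shape[1] != len(feature_names):
        raise ValueError(
            f"X has {X.shape[1]} columns but {len(feature_names)} feature names"
        )
    if not np.isfinite(X).all():
        raise ValueError("X contains NaN or infinite values")


# Toy stand-ins (assumed data) with the same layout as the arrays loaded above.
X_demo = np.zeros((4, 3))
y_demo = pd.Series([60.0, 70.0, 80.0, 90.0])
names_demo = ["f0", "f1", "f2"]

check_alignment(X_demo, y_demo, names_demo)  # passes silently when everything lines up
print("Shapes are consistent")
```

In the notebook itself, the same call would presumably be `check_alignment(X_train, y_train, feature_names)`, repeated for the validation pair.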


## 5.5 Baseline Model<a id='5.5_Baseline_Model'></a>

In [3]:
# Simple mean baseline for reference
y_pred_baseline = np.repeat(y_train.mean(), len(y_val))

baseline_mae = mean_absolute_error(y_val, y_pred_baseline)
baseline_rmse = np.sqrt(mean_squared_error(y_val, y_pred_baseline))
baseline_r2 = r2_score(y_val, y_pred_baseline)

print(f"Baseline MAE: {baseline_mae:.2f}")
print(f"Baseline RMSE: {baseline_rmse:.2f}")
print(f"Baseline R²: {baseline_r2:.3f}")

Baseline MAE: 17.24
Baseline RMSE: 23.42
Baseline R²: -0.003
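
The mean baseline can also be written with scikit-learn's `DummyRegressor`, which may be convenient when the baseline should share the fit/predict interface of the other models. The sketch below uses synthetic prices (the arrays `y_tr` and `y_va` and their distribution are assumptions, not the resort data). It also illustrates why the baseline R² above is slightly negative: a constant prediction scores exactly 0 only when it equals the validation mean, and the training mean generally differs from it a little.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(17)
y_tr = rng.normal(64.0, 20.0, size=200)  # synthetic "training" prices (assumed)
y_va = rng.normal(64.0, 20.0, size=50)   # synthetic "validation" prices (assumed)
X_tr = np.zeros((200, 1))                # the dummy model ignores the features
X_va = np.zeros((50, 1))

dummy = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
pred = dummy.predict(X_va)

# Every prediction is the training mean, exactly like np.repeat(y_train.mean(), ...).
print("All predictions equal training mean:", np.allclose(pred, y_tr.mean()))

# R² of a constant prediction can never exceed 0 on the data it is scored against.
r2 = r2_score(y_va, pred)
print(f"Dummy R² on validation: {r2:.4f}")
```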


## 5.6 Model Comparison<a id='5.6_Model_Comparison'></a>

### 5.6.1 Linear Regression<a id='5.6.1_Linear_Regression'></a>

In [4]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_val)

def eval_model(name, y_true, y_pred):
    """Print MAE, RMSE, and R² for one model and return them as a pandas Series."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE = sqrt(MSE)
    r2 = r2_score(y_true, y_pred)
    print(f"{name:<25} | MAE: {mae:.2f} | RMSE: {rmse:.2f} | R²: {r2:.3f}")
    return pd.Series({"Model": name, "MAE": mae, "RMSE": rmse, "R2": r2})

results = []
results.append(eval_model("Linear Regression", y_val, y_pred_lr))

Linear Regression         | MAE: 7.84 | RMSE: 10.92 | R²: 0.782
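
To make the three metrics reported by `eval_model` concrete, the sketch below computes them by hand on four made-up prices and compares the results with the scikit-learn functions. The numbers are illustrative only and have no connection to the resort data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([50.0, 60.0, 70.0, 80.0])  # toy "actual" prices
y_pred = np.array([52.0, 58.0, 73.0, 77.0])  # toy predictions

err = y_true - y_pred                        # [-2, 2, -3, 3]

mae = np.mean(np.abs(err))                   # (2 + 2 + 3 + 3) / 4 = 2.5
rmse = np.sqrt(np.mean(err ** 2))            # sqrt((4 + 4 + 9 + 9) / 4) = sqrt(6.5)
ss_res = np.sum(err ** 2)                    # 26
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 225 + 25 + 25 + 225 = 500
r2 = 1.0 - ss_res / ss_tot                   # 1 - 26/500 = 0.948

print(f"MAE: {mae:.2f} | RMSE: {rmse:.2f} | R²: {r2:.3f}")
# → MAE: 2.50 | RMSE: 2.55 | R²: 0.948

# The hand-computed values agree with scikit-learn's implementations.
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred))))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

MAE and RMSE are in the same units as the target (dollars here), while RMSE penalizes large misses more heavily because errors are squared before averaging.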


### 5.6.2 Regularized Models (Ridge, Lasso, ElasticNet)<a id='5.6.2_Regularized_Models'></a>

In [5]:
for model_name, model in [
    ("Ridge", Ridge(alpha=1.0, random_state=17)),
    ("Lasso", Lasso(alpha=0.01, random_state=17)),
    ("ElasticNet", ElasticNet(alpha=0.01, l1_ratio=0.5, random_state=17))
]:
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    results.append(eval_model(model_name, y_val, preds))

Ridge                     | MAE: 7.63 | RMSE: 10.61 | R²: 0.794
Lasso                     | MAE: 7.86 | RMSE: 10.87 | R²: 0.784
ElasticNet                | MAE: 7.61 | RMSE: 10.62 | R²: 0.794
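
The regularized models differ from plain linear regression by penalizing large coefficients, with `alpha` controlling the strength of the penalty. The sketch below shows that effect for Ridge on synthetic data from `make_regression` (an assumption, used as a stand-in for the resort features): as `alpha` grows, the L2 norm of the fitted coefficients shrinks.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression problem (assumed data, not the resort dataset).
X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=17)

alphas = [0.01, 1.0, 100.0]
norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    norm = float(np.linalg.norm(model.coef_))  # overall coefficient magnitude
    norms.append(norm)
    print(f"alpha={alpha:<6} | L2 norm of coefficients: {norm:.2f}")
```

With 92 features and only 216 training rows in this notebook, some shrinkage is a plausible reason the regularized models edge out plain linear regression on the validation set.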


### 5.6.3 Tree-Based Models (RF, GBM)<a id='5.6.3_Tree_Based_Models'></a>

In [6]:
for model_name, model in [
    ("Random Forest", RandomForestRegressor(n_estimators=200, random_state=17)),
    ("Gradient Boosting", GradientBoostingRegressor(random_state=17))
]:
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    results.append(eval_model(model_name, y_val, preds))

Random Forest             | MAE: 9.43 | RMSE: 13.15 | R²: 0.684
Gradient Boosting         | MAE: 8.27 | RMSE: 11.53 | R²: 0.757
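
All scores above come from a single validation split of 55 rows, so small differences between models may be within the noise of that split. One way to get a steadier comparison is k-fold cross-validation on the training data. The sketch below uses synthetic data and a reduced `n_estimators` for speed; both are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training set (assumed data).
X, y = make_regression(n_samples=200, n_features=20, noise=15.0, random_state=17)

cv = KFold(n_splits=5, shuffle=True, random_state=17)
candidates = {
    "Ridge": Ridge(alpha=1.0),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=50, random_state=17),
}

cv_summary = {}
for name, model in candidates.items():
    # One R² score per fold; the spread hints at how stable the ranking is.
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name:<20} | mean R²: {scores.mean():.3f} | std: {scores.std():.3f}")
```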


## 5.7 Model Selection & Tuning<a id='5.7_Model_Selection_Tuning'></a>

In [7]:
results_df = pd.DataFrame(results)
results_df.sort_values("R2", ascending=False, inplace=True)
display(results_df)

best_model_name = results_df.iloc[0]["Model"]
print(f"Best model based on R²: {best_model_name}")

|   | Model | MAE | RMSE | R2 |
|---|-------|-----|------|----|
| 1 | Ridge | 7.632985 | 10.611958 | 0.79411 |
| 3 | ElasticNet | 7.611508 | 10.617701 | 0.793887 |
| 2 | Lasso | 7.856101 | 10.865089 | 0.784171 |
| 0 | Linear Regression | 7.842236 | 10.917093 | 0.7821 |
| 5 | Gradient Boosting | 8.267835 | 11.533584 | 0.756795 |
| 4 | Random Forest | 9.427841 | 13.148313 | 0.68393 |


Best model based on R²: Ridge
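
The models in this notebook are compared at fixed hyperparameters. If tuning were added, a cross-validated grid search over Ridge's `alpha` is one common option. The sketch below is a minimal version on synthetic data; the grid values, the fold count, and the choice of RMSE as the scoring metric are assumptions, not settings taken from this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training set (assumed data).
X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=17)

# Log-spaced candidates from 0.01 to 100 (hypothetical grid).
param_grid = {"alpha": np.logspace(-2, 2, 9)}

search = GridSearchCV(
    Ridge(),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the sign
    cv=5,
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(f"Best alpha: {best_alpha:.3g} | CV RMSE: {best_rmse:.2f}")
```

Tuning this way uses only the training data, which keeps the validation set untouched for the final comparison.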


## 5.8 Final Model Training<a id='5.8_Final_Model_Training'></a>

In [8]:
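# Note: Ridge ranked highest on validation R² above (0.794 vs. 0.757 for GBM);
# the default Gradient Boosting model is the one trained and saved as the final model here.
best_model = GradientBoostingRegressor(random_state=17)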
best_model.fit(X_train, y_train)

y_pred_final = best_model.predict(X_val)
eval_model("Final Model (GBM)", y_val, y_pred_final)

Final Model (GBM)         | MAE: 8.27 | RMSE: 11.53 | R²: 0.757


Model    Final Model (GBM)
MAE               8.267835
RMSE             11.533584
R2                0.756795
dtype: object
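
A fitted gradient boosting model exposes `feature_importances_`, and pairing those values with the `feature_names` list loaded in 5.4 is one way to see which inputs drive the predictions. The sketch below shows the pattern on synthetic data; the data and the generated names are hypothetical stand-ins for `X_train` and `feature_names`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data with 3 informative features out of 8 (assumed, for illustration).
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=17
)
names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical feature names

gbm = GradientBoostingRegressor(random_state=17).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(gbm.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances.head(5))
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking rather than a precise attribution.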

## 5.9 Save Artifacts<a id='5.9_Save_Artifacts'></a>

In [9]:
# Save artifacts

dump(best_model, art / "final_model.joblib")
print("✅ Final model saved to:", art / "final_model.joblib")

✅ Final model saved to: ../artifacts/final_model.joblib
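
Since the saved model is meant to be reused in the next notebook, it can be worth confirming that it survives a save/load round trip. The sketch below does this in a temporary directory with a small model trained on synthetic data (both are assumptions, so the real artifact is left untouched).

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Small synthetic problem standing in for the real training data (assumed).
X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=17)
model = GradientBoostingRegressor(random_state=17).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "final_model.joblib"
    dump(model, path)      # serialize the fitted estimator
    reloaded = load(path)  # read it back as the next notebook would

# The reloaded estimator should reproduce the original predictions.
same = np.allclose(model.predict(X), reloaded.predict(X))
print("Predictions match after reload:", same)
```

Joblib files are generally only safe to load with the same scikit-learn version that wrote them, so recording the library version alongside the artifact may help later.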
