# Phase 2: Model Training

**Objective:** Train seven regression models to predict `purchased_last_month`.

---

## Models to Train
1. **Baseline:** Linear Regression
2. **Regularized Linear Models:** Ridge, Lasso, ElasticNet
3. **Tree-Based Models:** Random Forest, XGBoost, Gradient Boosting

All models will be saved for evaluation in Phase 3.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import pickle
import time
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
import warnings

warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


## 1. Load Preprocessed Data

In [2]:
# Load scaled data (for linear models)
X_train_scaled = pd.read_csv('X_train_scaled.csv')
X_test_scaled = pd.read_csv('X_test_scaled.csv')

# Load unscaled data (for tree-based models)
X_train = pd.read_csv('X_train.csv')
X_test = pd.read_csv('X_test.csv')

# Load targets
y_train_original = pd.read_csv('y_train_original.csv')['purchased_last_month']
y_test_original = pd.read_csv('y_test_original.csv')['purchased_last_month']
y_train_log = pd.read_csv('y_train_log.csv')['log_purchased_last_month']
y_test_log = pd.read_csv('y_test_log.csv')['log_purchased_last_month']

print("Data loaded successfully!")
print(f"\nX_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train_original.shape}")
print(f"y_test shape: {y_test_original.shape}")

Data loaded successfully!

X_train shape: (25731, 29)
X_test shape: (6433, 29)
y_train shape: (25731,)
y_test shape: (6433,)


## 2. Define Models

In [3]:
# Dictionary to store all models
models = {}
training_times = {}

print("Models will be trained with the following configurations:")
print("\n1. Linear Models (using scaled data + log-transformed target):")
print("   - Linear Regression")
print("   - Ridge Regression (alpha=1.0)")
print("   - Lasso Regression (alpha=1.0)")
print("   - ElasticNet (alpha=1.0, l1_ratio=0.5)")
print("\n2. Tree-Based Models (using unscaled data + original target):")
print("   - Random Forest (n_estimators=100, max_depth=20)")
print("   - XGBoost (n_estimators=100, max_depth=6, learning_rate=0.1)")
print("   - Gradient Boosting (n_estimators=100, max_depth=5, learning_rate=0.1)")

Models will be trained with the following configurations:

1. Linear Models (using scaled data + log-transformed target):
   - Linear Regression
   - Ridge Regression (alpha=1.0)
   - Lasso Regression (alpha=1.0)
   - ElasticNet (alpha=1.0, l1_ratio=0.5)

2. Tree-Based Models (using unscaled data + original target):
   - Random Forest (n_estimators=100, max_depth=20)
   - XGBoost (n_estimators=100, max_depth=6, learning_rate=0.1)
   - Gradient Boosting (n_estimators=100, max_depth=5, learning_rate=0.1)


## 3. Train Linear Models

Linear models will use:
- **Scaled features** (X_train_scaled)
- **Log-transformed target** (y_train_log) - reduces skew in the target, so a few very large values dominate the squared-error fit less
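The effect of the log transform on skewness can be illustrated with a small, dependency-free sketch. It assumes the target was transformed with `log1p` in Phase 1 (the notebook does not say which log variant was used), and the `counts` values are made up for illustration rather than taken from the dataset.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# Hypothetical right-skewed purchase counts (illustrative only)
counts = [10, 50, 50, 100, 100, 200, 300, 500, 1000, 10000]

# Assumption: Phase 1 used log1p; swap in math.log if it used a plain log
log_counts = [math.log1p(c) for c in counts]

raw_skew = skewness(counts)
log_skew = skewness(log_counts)
print(f"skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")

# Predictions made on the log scale must be inverted with the matching
# function (expm1 for log1p) before they are compared with original-scale values
recovered = [math.expm1(v) for v in log_counts]
```

Whichever transform Phase 1 actually applied, its exact inverse has to be used when the linear models' predictions are converted back.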

In [4]:
print("=" * 60)
print("TRAINING LINEAR MODELS")
print("=" * 60)

TRAINING LINEAR MODELS


### 3.1 Linear Regression (Baseline)

In [5]:
print("\n[1/7] Training Linear Regression...")
start_time = time.time()

lr = LinearRegression()
lr.fit(X_train_scaled, y_train_log)

training_time = time.time() - start_time
models['linear_regression'] = lr
training_times['linear_regression'] = training_time

print(f"✓ Linear Regression trained in {training_time:.2f} seconds")


[1/7] Training Linear Regression...
✓ Linear Regression trained in 0.03 seconds


### 3.2 Ridge Regression

In [6]:
print("\n[2/7] Training Ridge Regression...")
start_time = time.time()

ridge = Ridge(alpha=1.0, random_state=42)
ridge.fit(X_train_scaled, y_train_log)

training_time = time.time() - start_time
models['ridge'] = ridge
training_times['ridge'] = training_time

print(f"✓ Ridge Regression trained in {training_time:.2f} seconds")


[2/7] Training Ridge Regression...
✓ Ridge Regression trained in 0.01 seconds


### 3.3 Lasso Regression

In [7]:
print("\n[3/7] Training Lasso Regression...")
start_time = time.time()

lasso = Lasso(alpha=1.0, random_state=42, max_iter=10000)
lasso.fit(X_train_scaled, y_train_log)

training_time = time.time() - start_time
models['lasso'] = lasso
training_times['lasso'] = training_time

print(f"✓ Lasso Regression trained in {training_time:.2f} seconds")


[3/7] Training Lasso Regression...
✓ Lasso Regression trained in 0.01 seconds
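With standardized features, the L1 penalty acts as a soft threshold on each coefficient, so `alpha=1.0` can be a strong setting when the target is on a log scale. The sketch below works through the single-feature case of the objective scikit-learn documents for `Lasso`, `(1 / (2 * n)) * ||y - Xw||^2 + alpha * ||w||_1`. The data values are hypothetical; whether coefficients are actually zeroed in this notebook depends on the data and could be checked with `np.count_nonzero(lasso.coef_)`.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha, returning 0.0 inside [-alpha, alpha]."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_single_feature(x, y, alpha):
    """Lasso coefficient for one standardized feature (mean 0, variance 1)
    and a centered target."""
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n  # feature/target covariance
    return soft_threshold(rho, alpha)


# Hypothetical standardized feature and centered log-scale target
x = [-1.0, 1.0, -1.0, 1.0]
y = [-0.8, 0.8, -0.6, 0.6]

strong = lasso_single_feature(x, y, alpha=1.0)  # covariance 0.7 is below alpha
weak = lasso_single_feature(x, y, alpha=0.1)    # coefficient survives, shrunk by 0.1
print(f"alpha=1.0 -> {strong:.2f}, alpha=0.1 -> {weak:.2f}")
```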


### 3.4 ElasticNet

In [8]:
print("\n[4/7] Training ElasticNet...")
start_time = time.time()

elasticnet = ElasticNet(alpha=1.0, l1_ratio=0.5, random_state=42, max_iter=10000)
elasticnet.fit(X_train_scaled, y_train_log)

training_time = time.time() - start_time
models['elasticnet'] = elasticnet
training_times['elasticnet'] = training_time

print(f"✓ ElasticNet trained in {training_time:.2f} seconds")


[4/7] Training ElasticNet...
✓ ElasticNet trained in 0.01 seconds


## 4. Train Tree-Based Models

Tree-based models will use:
- **Unscaled features** (X_train) - trees don't require scaling
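- **Original target** (y_train_original) - no log transform, so predictions are already on the original scale; note that extreme target values still influence squared-error training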
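Trees don't require scaling because a split depends only on the ordering of feature values, not on their magnitude. A minimal sketch of that idea follows, using a hand-written single-split search rather than any library's tree implementation; the `price` and `sold` values are hypothetical.

```python
def best_split_partition(feature, target):
    """Return the row indices sent left by the threshold split that
    minimizes the total squared error of the two sides."""
    order = sorted(range(len(feature)), key=lambda i: feature[i])
    best_sse, best_left = float("inf"), None
    for cut in range(1, len(order)):
        # No threshold can separate two equal feature values
        if feature[order[cut - 1]] == feature[order[cut]]:
            continue
        sse = 0.0
        for side in (order[:cut], order[cut:]):
            mean = sum(target[i] for i in side) / len(side)
            sse += sum((target[i] - mean) ** 2 for i in side)
        if sse < best_sse:
            best_sse, best_left = sse, frozenset(order[:cut])
    return best_left


# Hypothetical feature and target values (illustrative only)
price = [5.0, 12.0, 20.0, 45.0, 80.0, 150.0]
sold = [900, 850, 400, 380, 60, 40]

# Standardize the feature the way a standard scaler would
mean_p = sum(price) / len(price)
std_p = (sum((p - mean_p) ** 2 for p in price) / len(price)) ** 0.5
price_scaled = [(p - mean_p) / std_p for p in price]

same_partition = (
    best_split_partition(price, sold) == best_split_partition(price_scaled, sold)
)
print(f"Same rows go left before and after scaling: {same_partition}")
```

Standardization preserves the order of the values, so the chosen partition is unchanged; only the numeric threshold differs.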

In [9]:
print("\n" + "=" * 60)
print("TRAINING TREE-BASED MODELS")
print("=" * 60)


TRAINING TREE-BASED MODELS


### 4.1 Random Forest

In [10]:
print("\n[5/7] Training Random Forest...")
start_time = time.time()

rf = RandomForestRegressor(
    n_estimators=100,
    max_depth=20,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1,
    verbose=0
)
rf.fit(X_train, y_train_original)

training_time = time.time() - start_time
models['random_forest'] = rf
training_times['random_forest'] = training_time

print(f"✓ Random Forest trained in {training_time:.2f} seconds")


[5/7] Training Random Forest...
✓ Random Forest trained in 0.92 seconds


### 4.2 XGBoost

In [11]:
print("\n[6/7] Training XGBoost...")
start_time = time.time()

xgb = XGBRegressor(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1,
    verbosity=0
)
xgb.fit(X_train, y_train_original)

training_time = time.time() - start_time
models['xgboost'] = xgb
training_times['xgboost'] = training_time

print(f"✓ XGBoost trained in {training_time:.2f} seconds")


[6/7] Training XGBoost...
✓ XGBoost trained in 0.14 seconds


### 4.3 Gradient Boosting

In [12]:
print("\n[7/7] Training Gradient Boosting...")
start_time = time.time()

gb = GradientBoostingRegressor(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    subsample=0.8,
    random_state=42,
    verbose=0
)
gb.fit(X_train, y_train_original)

training_time = time.time() - start_time
models['gradient_boosting'] = gb
training_times['gradient_boosting'] = training_time

print(f"✓ Gradient Boosting trained in {training_time:.2f} seconds")


[7/7] Training Gradient Boosting...
✓ Gradient Boosting trained in 2.61 seconds
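The seven training cells above repeat the same fit-and-time pattern. One way it could be factored out is sketched below. `fit_and_time` and `MeanRegressor` are hypothetical names introduced for illustration; the sketch uses `time.perf_counter()`, which suits short intervals better than `time.time()`, and a stand-in estimator so it runs without scikit-learn.

```python
import time


def fit_and_time(name, model, X, y, models, training_times):
    """Fit `model`, then store it and its wall-clock fit time under `name`."""
    start = time.perf_counter()
    model.fit(X, y)
    elapsed = time.perf_counter() - start
    models[name] = model
    training_times[name] = elapsed
    return elapsed


class MeanRegressor:
    """Stand-in estimator (hypothetical) with a scikit-learn style fit/predict."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


models, training_times = {}, {}
elapsed = fit_and_time(
    "mean_baseline", MeanRegressor(), [[1], [2], [3]], [10, 20, 30],
    models, training_times,
)
print(f"✓ mean_baseline trained in {elapsed:.4f} seconds")
```

Any estimator with a `fit(X, y)` method, such as the `Ridge` or `RandomForestRegressor` objects used in this notebook, could be passed in place of the stand-in.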


## 5. Save All Trained Models

In [13]:
print("\n" + "=" * 60)
print("SAVING MODELS")
print("=" * 60)

# Create a directory for models if it doesn't exist
import os
os.makedirs('models', exist_ok=True)

# Save each model
for model_name, model in models.items():
    filename = f'models/{model_name}.pkl'
    with open(filename, 'wb') as f:
        pickle.dump(model, f)
    print(f"✓ Saved: {filename}")

# Save training times
with open('models/training_times.pkl', 'wb') as f:
    pickle.dump(training_times, f)
print("✓ Saved: models/training_times.pkl")

# Save model metadata (which models use which data)
model_metadata = {
    'linear_models': ['linear_regression', 'ridge', 'lasso', 'elasticnet'],
    'tree_models': ['random_forest', 'xgboost', 'gradient_boosting'],
    'uses_scaled_data': ['linear_regression', 'ridge', 'lasso', 'elasticnet'],
    'uses_log_target': ['linear_regression', 'ridge', 'lasso', 'elasticnet']
}

with open('models/model_metadata.pkl', 'wb') as f:
    pickle.dump(model_metadata, f)
print("✓ Saved: models/model_metadata.pkl")

print("\n✓ All models saved successfully!")


SAVING MODELS
✓ Saved: models/linear_regression.pkl
✓ Saved: models/ridge.pkl
✓ Saved: models/lasso.pkl
✓ Saved: models/elasticnet.pkl
✓ Saved: models/random_forest.pkl
✓ Saved: models/xgboost.pkl
✓ Saved: models/gradient_boosting.pkl
✓ Saved: models/training_times.pkl
✓ Saved: models/model_metadata.pkl

✓ All models saved successfully!
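A later phase can read the saved metadata back to decide which inputs and target scale each model expects. The following is a minimal round-trip sketch under assumptions: `save_artifact`, `load_artifact`, and `needs_inverse_log` are hypothetical helpers, and a temporary directory stands in for `models/` so the example leaves no files behind.

```python
import os
import pickle
import tempfile


def save_artifact(obj, path):
    """Pickle `obj` to `path`."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_artifact(path):
    """Unpickle the object stored at `path`."""
    # Only unpickle files you created yourself: loading can execute arbitrary code
    with open(path, "rb") as f:
        return pickle.load(f)


def needs_inverse_log(model_name, metadata):
    """True if the model was trained on the log-transformed target."""
    return model_name in metadata["uses_log_target"]


model_metadata = {
    "uses_scaled_data": ["linear_regression", "ridge", "lasso", "elasticnet"],
    "uses_log_target": ["linear_regression", "ridge", "lasso", "elasticnet"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_metadata.pkl")
    save_artifact(model_metadata, path)
    restored = load_artifact(path)

for name in ("ridge", "xgboost"):
    print(f"{name}: inverse log needed = {needs_inverse_log(name, restored)}")
```

Pickled scikit-learn estimators are generally expected to be loaded with the same library version that saved them, so recording the version alongside the models may be worthwhile.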


## 6. Training Summary

In [14]:
print("\n" + "=" * 60)
print("TRAINING SUMMARY")
print("=" * 60)

print(f"\nTotal models trained: {len(models)}")
print("\nTraining times:")
for model_name, train_time in sorted(training_times.items(), key=lambda x: x[1]):
    print(f"  {model_name:20s}: {train_time:6.2f} seconds")

print(f"\nTotal training time: {sum(training_times.values()):.2f} seconds")

print("\n" + "=" * 60)
print("Models saved in ./models/ directory:")
print("=" * 60)
for model_name in models.keys():
    print(f"  ✓ models/{model_name}.pkl")

print("\n" + "=" * 60)
print("READY FOR PHASE 3: MODEL EVALUATION")
print("=" * 60)
print("\nAll models have been trained and saved.")
print("Proceed to Phase 3 to evaluate their performance!")


TRAINING SUMMARY

Total models trained: 7

Training times:
  lasso               :   0.01 seconds
  ridge               :   0.01 seconds
  elasticnet          :   0.01 seconds
  linear_regression   :   0.03 seconds
  xgboost             :   0.14 seconds
  random_forest       :   0.92 seconds
  gradient_boosting   :   2.61 seconds

Total training time: 3.72 seconds

Models saved in ./models/ directory:
  ✓ models/linear_regression.pkl
  ✓ models/ridge.pkl
  ✓ models/lasso.pkl
  ✓ models/elasticnet.pkl
  ✓ models/random_forest.pkl
  ✓ models/xgboost.pkl
  ✓ models/gradient_boosting.pkl

READY FOR PHASE 3: MODEL EVALUATION

All models have been trained and saved.
Proceed to Phase 3 to evaluate their performance!


---

## Next Steps

Proceed to **Phase 3: Model Evaluation** to:
- Load all trained models
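- Make predictions on the test set, converting the linear models' log-scale predictions back to the original scale so all models are compared on the same target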
- Calculate performance metrics (RMSE, MAE, R², MAPE)
- Compare all models
- Visualize results
- Select the best model
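For reference, the four metrics named above can be written out directly. This is a plain-Python sketch of the standard definitions, not the Phase 3 implementation; `regression_metrics` is a hypothetical helper, the sample values are made up, and skipping zero-valued targets in MAPE is one possible convention chosen here to avoid division by zero.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, R² and MAPE (in percent) on the original scale."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # Assumption: rows with a zero target are left out of MAPE
    nonzero = [(t, e) for t, e in zip(y_true, errors) if t != 0]
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
        "mape": 100 * sum(abs(e / t) for t, e in nonzero) / len(nonzero),
    }


# Hypothetical values (illustrative only)
y_true = [100, 200, 400]
y_pred = [110, 180, 400]

metrics = regression_metrics(y_true, y_pred)
for metric_name, value in metrics.items():
    print(f"{metric_name}: {value:.4f}")
```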