# 4. Advanced Models - Gold Price Forecasting

**Objective:** Implement and evaluate two gradient boosting models and one deep learning model for gold price forecasting.

**Author:** Félix Jouary  
**Dataset:** Kaggle Gold Price Dataset

**Models in this notebook:**
- XGBoost (eXtreme Gradient Boosting)
- LightGBM
- LSTM (Long Short-Term Memory) - Deep Learning

**References:**
- Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. KDD '16.
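- Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation.
- Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. NIPS 2017.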

## 4.1 Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import warnings
warnings.filterwarnings('ignore')

# Advanced ML models
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')

print("Libraries imported successfully!")

## 4.2 Load Processed Data

In [None]:
# Load scaled data
X_train_scaled = np.load('../data/processed/X_train_scaled.npy')
X_test_scaled = np.load('../data/processed/X_test_scaled.npy')
y_train = np.load('../data/processed/y_train.npy')
y_test = np.load('../data/processed/y_test.npy')

# Load feature names
feature_names = pd.read_csv('../data/processed/feature_names.csv').iloc[:, 0].tolist()

# Load original data for visualization
train_data = pd.read_csv('../data/processed/train_data.csv', parse_dates=['Date'])
test_data = pd.read_csv('../data/processed/test_data.csv', parse_dates=['Date'])

print(f"X_train shape: {X_train_scaled.shape}")
print(f"X_test shape: {X_test_scaled.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

## 4.3 Evaluation Function

In [None]:
def evaluate_model(y_true, y_pred, model_name="Model"):
    """Calculate and display regression metrics."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    r2 = r2_score(y_true, y_pred)
    
    print(f"\n{'='*50}")
    print(f"{model_name} - Evaluation Metrics")
    print(f"{'='*50}")
    print(f"RMSE:  ${rmse:.2f}")
    print(f"MAE:   ${mae:.2f}")
    print(f"MAPE:  {mape:.2f}%")
    print(f"R²:    {r2:.4f}")
    
    return {'Model': model_name, 'RMSE': rmse, 'MAE': mae, 'MAPE': mape, 'R2': r2}

# Store results
results = []
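
As a quick sanity check of the four metrics, the sketch below recomputes them by hand with NumPy on three toy values. The numbers are invented for illustration and do not come from the gold dataset. Note that MAPE divides by `y_true`, so it is undefined whenever a true value is zero; that should not occur for gold prices, but it matters if the function is reused elsewhere.

```python
import numpy as np

# Toy values, invented for illustration only (not from the gold dataset)
y_true_toy = np.array([100.0, 200.0, 400.0])
y_pred_toy = np.array([110.0, 190.0, 400.0])

errors = y_true_toy - y_pred_toy                        # [-10, 10, 0]

rmse_toy = np.sqrt(np.mean(errors ** 2))                # sqrt(200 / 3)
mae_toy = np.mean(np.abs(errors))                       # 20 / 3
mape_toy = np.mean(np.abs(errors / y_true_toy)) * 100   # (10% + 5% + 0%) / 3

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_toy = 1 - ss_res / ss_tot

print(f"RMSE: {rmse_toy:.4f}")
print(f"MAE:  {mae_toy:.4f}")
print(f"MAPE: {mape_toy:.2f}%")
print(f"R²:   {r2_toy:.4f}")
```

RMSE is larger than MAE here because squaring weights the larger errors more heavily; the two coincide only when all absolute errors are equal.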

## 4.4 Model 1: XGBoost

XGBoost (eXtreme Gradient Boosting) is an optimized gradient boosting algorithm known for:
- Regularization to prevent overfitting
- Parallel processing for speed
- Handling missing values
- Built-in cross-validation

**Reference:** Chen & Guestrin (2016) - XGBoost: A Scalable Tree Boosting System

In [None]:
# Time Series Cross-Validation
tscv = TimeSeriesSplit(n_splits=5)

# XGBoost hyperparameter tuning
xgb_params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

xgb_grid = GridSearchCV(
    XGBRegressor(random_state=42, n_jobs=-1, verbosity=0),
    xgb_params,
    cv=tscv,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

print("Training XGBoost with GridSearchCV...")
xgb_grid.fit(X_train_scaled, y_train)

print(f"\nBest parameters: {xgb_grid.best_params_}")
print(f"Best CV score (neg MSE): {xgb_grid.best_score_:.2f}")
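
To make the cross-validation scheme concrete, here is a minimal sketch of how `TimeSeriesSplit(n_splits=5)` partitions a time-ordered array. The 12-sample array is a toy stand-in for the training set, chosen only so the folds are easy to read.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered toy samples standing in for the real training set
X_toy = np.arange(12).reshape(-1, 1)
tscv_toy = TimeSeriesSplit(n_splits=5)

folds = []
for fold, (train_idx, val_idx) in enumerate(tscv_toy.split(X_toy), start=1):
    folds.append((train_idx, val_idx))
    print(f"Fold {fold}: train={train_idx.tolist()} val={val_idx.tolist()}")

# In every fold, all training indices precede all validation indices
no_lookahead = all(tr.max() < va.min() for tr, va in folds)
print(f"No look-ahead in any fold: {no_lookahead}")
```

Each fold trains on an expanding window of past samples and validates on the block that follows, so no future observation is used to predict an earlier one. The early folds train on very little data, which is worth keeping in mind when reading the averaged CV score.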

In [None]:
# Use the best estimator (GridSearchCV already refit it on the full training set)
xgb_model = xgb_grid.best_estimator_

# Predictions
y_pred_xgb_train = xgb_model.predict(X_train_scaled)
y_pred_xgb_test = xgb_model.predict(X_test_scaled)

# Evaluate
print("TRAINING SET:")
_ = evaluate_model(y_train, y_pred_xgb_train, "XGBoost (Train)")

print("\nTEST SET:")
xgb_results = evaluate_model(y_test, y_pred_xgb_test, "XGBoost")
results.append(xgb_results)

In [None]:
# XGBoost Feature Importance
xgb_importance = pd.DataFrame({
    'Feature': feature_names,
    'Importance': xgb_model.feature_importances_
}).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 8))
plt.barh(xgb_importance['Feature'].head(15)[::-1], xgb_importance['Importance'].head(15)[::-1])
plt.xlabel('Importance')
plt.title('Top 15 Feature Importances - XGBoost')
plt.tight_layout()
plt.savefig('../reports/figures/xgb_feature_importance.png', dpi=150)
plt.show()

print("Top 10 most important features (XGBoost):")
print(xgb_importance.head(10))

## 4.5 Model 2: LightGBM

LightGBM is a gradient boosting framework that uses histogram-based, leaf-wise tree learning. Compared with XGBoost, it is often:
- Faster to train
- Lighter on memory
- Competitive or better in accuracy on large datasets

**Reference:** Ke et al. (2017) - LightGBM: A Highly Efficient Gradient Boosting Decision Tree

In [None]:
# LightGBM hyperparameter tuning
lgbm_params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7, -1],
    'learning_rate': [0.01, 0.05, 0.1],
    'num_leaves': [31, 50, 100]
}

lgbm_grid = GridSearchCV(
    LGBMRegressor(random_state=42, n_jobs=-1, verbose=-1),
    lgbm_params,
    cv=tscv,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

print("Training LightGBM with GridSearchCV...")
lgbm_grid.fit(X_train_scaled, y_train)

print(f"\nBest parameters: {lgbm_grid.best_params_}")
print(f"Best CV score (neg MSE): {lgbm_grid.best_score_:.2f}")
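
The cost of both grid searches can be estimated before running them. The sketch below counts candidates with scikit-learn's `ParameterGrid`, using copies of the two grids defined above. The `effective_leaves` helper is hypothetical: it rests on the assumption that a tree limited to depth `d` cannot have more than `2**d` leaves, so some LightGBM combinations may describe the same trees.

```python
from sklearn.model_selection import ParameterGrid

# Copies of the grids from the two tuning cells above
grids = {
    "XGBoost": {
        "n_estimators": [100, 200, 300],
        "max_depth": [3, 5, 7],
        "learning_rate": [0.01, 0.05, 0.1],
        "subsample": [0.8, 1.0],
        "colsample_bytree": [0.8, 1.0],
    },
    "LightGBM": {
        "n_estimators": [100, 200, 300],
        "max_depth": [3, 5, 7, -1],
        "learning_rate": [0.01, 0.05, 0.1],
        "num_leaves": [31, 50, 100],
    },
}
N_SPLITS = 5  # matches TimeSeriesSplit(n_splits=5)

fit_counts = {}
for name, grid in grids.items():
    n_candidates = len(ParameterGrid(grid))
    fit_counts[name] = n_candidates * N_SPLITS
    print(f"{name}: {n_candidates} candidates x {N_SPLITS} folds = {fit_counts[name]} fits")


def effective_leaves(num_leaves, max_depth):
    """Hypothetical helper: leaf count after applying the 2**max_depth cap."""
    if max_depth <= 0:  # -1 means no depth limit in LightGBM
        return num_leaves
    return min(num_leaves, 2 ** max_depth)


# Count LightGBM candidates that remain distinct once the depth cap is applied
distinct_lgbm = {
    (p["n_estimators"], p["learning_rate"], p["max_depth"],
     effective_leaves(p["num_leaves"], p["max_depth"]))
    for p in ParameterGrid(grids["LightGBM"])
}
print(f"LightGBM candidates that differ after the depth cap: {len(distinct_lgbm)}")
```

Under that assumption, every `num_leaves` value is capped at 8 when `max_depth=3`, so part of the LightGBM grid is likely redundant. Trimming those combinations, or switching to a randomized search, would reduce the run time without narrowing the space actually explored.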

In [None]:
# Use the best estimator (GridSearchCV already refit it on the full training set)
lgbm_model = lgbm_grid.best_estimator_

# Predictions
y_pred_lgbm_train = lgbm_model.predict(X_train_scaled)
y_pred_lgbm_test = lgbm_model.predict(X_test_scaled)

# Evaluate
print("TRAINING SET:")
_ = evaluate_model(y_train, y_pred_lgbm_train, "LightGBM (Train)")

print("\nTEST SET:")
lgbm_results = evaluate_model(y_test, y_pred_lgbm_test, "LightGBM")
results.append(lgbm_results)

## 4.6 Model 3: LSTM (Deep Learning)

LSTM (Long Short-Term Memory) is a type of recurrent neural network (RNN) designed to learn long-term dependencies in sequential data.

**Why LSTM for time series:**
- Captures temporal patterns
- Can process sequences of varying length (this notebook uses fixed 30-step windows)
- Memory cells retain information over time

**Reference:** Hochreiter & Schmidhuber (1997) - Long Short-Term Memory

In [None]:
# Import TensorFlow/Keras
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

print(f"TensorFlow version: {tf.__version__}")

In [None]:
# Prepare data for LSTM (needs 3D input: samples, timesteps, features)
# We'll use a sequence length of 30 days

def create_sequences(X, y, seq_length=30):
    """Create sequences for LSTM input."""
    X_seq, y_seq = [], []
    for i in range(seq_length, len(X)):
        X_seq.append(X[i-seq_length:i])
        y_seq.append(y[i])
    return np.array(X_seq), np.array(y_seq)

# Create sequences
SEQ_LENGTH = 30

X_train_lstm, y_train_lstm = create_sequences(X_train_scaled, y_train, SEQ_LENGTH)
X_test_lstm, y_test_lstm = create_sequences(X_test_scaled, y_test, SEQ_LENGTH)

print(f"LSTM Training shape: {X_train_lstm.shape}")
print(f"LSTM Test shape: {X_test_lstm.shape}")
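
A small worked example helps verify how windows and targets line up. The sketch below reuses a copy of `create_sequences` on a toy series of 10 steps with a window of 3; the toy arrays are invented for illustration.

```python
import numpy as np


def create_sequences(X, y, seq_length=30):
    """Copy of the function above: window of past rows -> target at row i."""
    X_seq, y_seq = [], []
    for i in range(seq_length, len(X)):
        X_seq.append(X[i - seq_length:i])
        y_seq.append(y[i])
    return np.array(X_seq), np.array(y_seq)


# Toy series: 10 time steps, 1 feature; the target at step i is 10 * i
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = np.arange(10, dtype=float) * 10

X_seq, y_seq = create_sequences(X_toy, y_toy, seq_length=3)

print(f"X_seq shape: {X_seq.shape}")   # (samples, timesteps, features)
print(f"y_seq shape: {y_seq.shape}")
print(f"First window rows: {X_seq[0].ravel().tolist()}")
print(f"First target:      {y_seq[0]}")
```

The first window holds rows 0 to 2 and its target is `y[3]`, so the feature row at the target's own index is not part of the window. The first `seq_length` samples of each split also produce no target, which is why the LSTM is evaluated on 30 fewer test points than the tree models.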

In [None]:
# Build LSTM model
def build_lstm_model(input_shape):
    model = Sequential([
        LSTM(50, return_sequences=True, input_shape=input_shape),
        Dropout(0.2),
        LSTM(50, return_sequences=False),
        Dropout(0.2),
        Dense(25, activation='relu'),
        Dense(1)
    ])
    
    model.compile(
        optimizer=Adam(learning_rate=0.001),
        loss='mse',
        metrics=['mae']
    )
    
    return model

# Create model
lstm_model = build_lstm_model((SEQ_LENGTH, X_train_scaled.shape[1]))
lstm_model.summary()
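
The parameter counts printed by `summary()` can be checked by hand. A standard LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias, giving `4 * (units * (input_dim + units) + units)` weights. The sketch below applies this to the architecture above; `n_features = 10` is a hypothetical value, since the real feature count depends on `X_train_scaled.shape[1]`.

```python
def lstm_params(units, input_dim):
    """Weights in a standard LSTM layer: 4 gates x (kernel + recurrent kernel + bias)."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(units, input_dim):
    """Weights in a fully connected layer: kernel + bias."""
    return units * input_dim + units


n_features = 10  # hypothetical; replace with X_train_scaled.shape[1]

layer_counts = {
    "LSTM(50) #1": lstm_params(50, n_features),
    "LSTM(50) #2": lstm_params(50, 50),   # input is the 50-unit output of layer 1
    "Dense(25)": dense_params(25, 50),
    "Dense(1)": dense_params(1, 25),
}
total_params = sum(layer_counts.values())

for layer, count in layer_counts.items():
    print(f"{layer:<12} {count:>7,}")
print(f"{'Total':<12} {total_params:>7,}")
```

Dropout layers add no weights. Only the first LSTM layer's count changes with the number of input features; the rest of the network is fixed at 21,501 parameters.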

In [None]:
# Train LSTM with early stopping
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)

print("Training LSTM model...")
history = lstm_model.fit(
    X_train_lstm, y_train_lstm,
    epochs=100,
    batch_size=32,
    validation_split=0.1,
    callbacks=[early_stop],
    verbose=1
)

print(f"\nTraining stopped at epoch {len(history.history['loss'])}")
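
The patience rule behind early stopping can be sketched in plain Python. This is a simplified illustration of the idea and not Keras's implementation: it assumes a minimum improvement of zero, and both the function name and the toy loss curve are invented for the example.

```python
import math


def simulate_early_stopping(val_losses, patience=10):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a list of validation losses."""
    best_loss = math.inf
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Toy validation curve: improves for 3 epochs, then degrades
toy_val_losses = [5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 3.8]
stop_epoch, best_epoch = simulate_early_stopping(toy_val_losses, patience=3)
print(f"Stopped at epoch {stop_epoch}, best weights from epoch {best_epoch}")
```

With `restore_best_weights=True`, the model kept after training corresponds to the best epoch, not the last one, so the epoch count printed above is `patience` epochs past the weights actually used whenever early stopping triggers.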

In [None]:
# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss
axes[0].plot(history.history['loss'], label='Training Loss')
axes[0].plot(history.history['val_loss'], label='Validation Loss')
axes[0].set_title('LSTM Training History - Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss (MSE)')
axes[0].legend()

# MAE
axes[1].plot(history.history['mae'], label='Training MAE')
axes[1].plot(history.history['val_mae'], label='Validation MAE')
axes[1].set_title('LSTM Training History - MAE')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('MAE')
axes[1].legend()

plt.tight_layout()
plt.savefig('../reports/figures/lstm_training_history.png', dpi=150)
plt.show()

In [None]:
# LSTM Predictions
y_pred_lstm_train = lstm_model.predict(X_train_lstm, verbose=0).flatten()
y_pred_lstm_test = lstm_model.predict(X_test_lstm, verbose=0).flatten()

# Evaluate
print("TRAINING SET:")
_ = evaluate_model(y_train_lstm, y_pred_lstm_train, "LSTM (Train)")

print("\nTEST SET:")
lstm_results = evaluate_model(y_test_lstm, y_pred_lstm_test, "LSTM")
results.append(lstm_results)

## 4.7 Overfitting Analysis

In [None]:
# Compare train vs test performance
models_names = ['XGBoost', 'LightGBM', 'LSTM']
train_preds = [y_pred_xgb_train, y_pred_lgbm_train, y_pred_lstm_train]
test_preds = [y_pred_xgb_test, y_pred_lgbm_test, y_pred_lstm_test]
y_trains = [y_train, y_train, y_train_lstm]
y_tests = [y_test, y_test, y_test_lstm]

overfitting_analysis = []
for name, train_pred, test_pred, y_tr, y_te in zip(models_names, train_preds, test_preds, y_trains, y_tests):
    train_rmse = np.sqrt(mean_squared_error(y_tr, train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_te, test_pred))
    gap = ((test_rmse - train_rmse) / train_rmse) * 100
    
    overfitting_analysis.append({
        'Model': name,
        'Train RMSE': train_rmse,
        'Test RMSE': test_rmse,
        'Gap (%)': gap
    })

overfit_df = pd.DataFrame(overfitting_analysis)
print("Overfitting Analysis - Advanced Models:")
print(overfit_df.to_string(index=False))
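
The gap column is a relative difference, which a short worked example makes easier to interpret. The helper name and the RMSE values below are hypothetical and chosen only to illustrate the arithmetic.

```python
def rmse_gap_pct(train_rmse, test_rmse):
    """Relative increase of test RMSE over train RMSE, in percent."""
    return (test_rmse - train_rmse) / train_rmse * 100


# Hypothetical RMSE values, for illustration only
gap_moderate = rmse_gap_pct(train_rmse=10.0, test_rmse=25.0)
gap_tiny_train = rmse_gap_pct(train_rmse=0.5, test_rmse=25.0)

print(f"Train 10.0 -> Test 25.0: gap = {gap_moderate:.0f}%")
print(f"Train  0.5 -> Test 25.0: gap = {gap_tiny_train:.0f}%")
```

Both cases have the same test error, yet the second gap is far larger because the training RMSE sits in the denominator. A model that nearly memorizes the training set therefore shows an inflated gap, so the percentage is best read alongside the absolute test RMSE.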

## 4.8 Model Comparison

In [None]:
# Create comparison dataframe
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('RMSE')

print("Advanced Models Comparison (sorted by RMSE):")
print(results_df.to_string(index=False))

In [None]:
# Visualize comparison
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

colors = ['#2ecc71', '#3498db', '#9b59b6']

# RMSE
axes[0, 0].barh(results_df['Model'], results_df['RMSE'], color=colors)
axes[0, 0].set_xlabel('RMSE ($)')
axes[0, 0].set_title('RMSE by Model (lower is better)')

# MAE
axes[0, 1].barh(results_df['Model'], results_df['MAE'], color=colors)
axes[0, 1].set_xlabel('MAE ($)')
axes[0, 1].set_title('MAE by Model (lower is better)')

# MAPE
axes[1, 0].barh(results_df['Model'], results_df['MAPE'], color=colors)
axes[1, 0].set_xlabel('MAPE (%)')
axes[1, 0].set_title('MAPE by Model (lower is better)')

# R²
axes[1, 1].barh(results_df['Model'], results_df['R2'], color=colors)
axes[1, 1].set_xlabel('R²')
axes[1, 1].set_title('R² by Model (higher is better)')

plt.tight_layout()
plt.savefig('../reports/figures/model_comparison_advanced.png', dpi=150)
plt.show()

## 4.9 Predictions Visualization

In [None]:
# Plot predictions for all advanced models
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# XGBoost
axes[0].plot(test_data['Date'], y_test, label='Actual', alpha=0.8)
axes[0].plot(test_data['Date'], y_pred_xgb_test, label='XGBoost', alpha=0.8)
axes[0].set_title('XGBoost Predictions')
axes[0].set_ylabel('Price (USD)')
axes[0].legend()

# LightGBM
axes[1].plot(test_data['Date'], y_test, label='Actual', alpha=0.8)
axes[1].plot(test_data['Date'], y_pred_lgbm_test, label='LightGBM', alpha=0.8)
axes[1].set_title('LightGBM Predictions')
axes[1].set_ylabel('Price (USD)')
axes[1].legend()

# LSTM (note: shorter due to sequence creation)
lstm_dates = test_data['Date'].iloc[SEQ_LENGTH:].reset_index(drop=True)
axes[2].plot(lstm_dates, y_test_lstm, label='Actual', alpha=0.8)
axes[2].plot(lstm_dates, y_pred_lstm_test, label='LSTM', alpha=0.8)
axes[2].set_title('LSTM Predictions')
axes[2].set_xlabel('Date')
axes[2].set_ylabel('Price (USD)')
axes[2].legend()

plt.tight_layout()
plt.savefig('../reports/figures/advanced_models_predictions.png', dpi=150)
plt.show()

## 4.10 Save Models and Results

In [None]:
# Save models
joblib.dump(xgb_model, '../models/xgboost.pkl')
joblib.dump(lgbm_model, '../models/lightgbm.pkl')
lstm_model.save('../models/lstm_model.keras')

# Save results
results_df.to_csv('../reports/advanced_results.csv', index=False)

print("All models saved successfully!")
print("\nFiles created:")
print("- ../models/xgboost.pkl")
print("- ../models/lightgbm.pkl")
print("- ../models/lstm_model.keras")
print("- ../reports/advanced_results.csv")
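
Before relying on the saved files, it is worth confirming that a reloaded model reproduces the original predictions. The sketch below shows that round trip with `joblib` in a temporary directory, using a small `LinearRegression` as a stand-in for the tuned models; the stand-in model and toy data are assumptions made for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model and toy data (assumptions for this example only)
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel() + 1.0
stand_in = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stand_in.pkl")
    joblib.dump(stand_in, path)      # same call used for xgboost.pkl / lightgbm.pkl
    reloaded = joblib.load(path)

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(stand_in.predict(X_demo), reloaded.predict(X_demo))
print(f"Reloaded model matches original: {same_predictions}")
```

The same check applies to the `.pkl` files written above. The `.keras` file is read back with `tf.keras.models.load_model` instead of `joblib.load`. Pickled models are generally tied to the library versions that produced them, so recording the xgboost, lightgbm, and TensorFlow versions next to the files makes later reloading more reliable.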

## 4.11 Summary

### Advanced Models Implemented:

1. **XGBoost** - State-of-the-art gradient boosting
   - Reference: Chen & Guestrin (2016)
   - Regularization, parallel processing, handles missing values

2. **LightGBM** - Fast gradient boosting framework
   - Reference: Ke et al. (2017)
   - Often faster and more memory-efficient than XGBoost

3. **LSTM** - Deep learning for sequences
   - Reference: Hochreiter & Schmidhuber (1997)
   - Captures long-term temporal dependencies

### Key Findings:

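- XGBoost and LightGBM are tuned with **GridSearchCV** over **TimeSeriesSplit** folds; the LSTM is validated on a hold-out fraction instead (`validation_split=0.1`)
- LSTM uses **Early Stopping** to limit overfitting
- XGBoost and LightGBM often outperform traditional ML models on tabular data; whether that holds here is checked against the baselines in the next step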

### Next Steps:

- Compare all models (baseline + advanced)
- Select best model for production
- Final conclusions and recommendations

In [None]:
# Final summary
print("="*60)
print("ADVANCED MODELS SUMMARY")
print("="*60)
print(f"\nBest Advanced Model: {results_df.iloc[0]['Model']}")
print(f"Best RMSE: ${results_df.iloc[0]['RMSE']:.2f}")
print(f"Best MAE: ${results_df.iloc[0]['MAE']:.2f}")
print(f"Best MAPE: {results_df.iloc[0]['MAPE']:.2f}%")
print(f"Best R²: {results_df.iloc[0]['R2']:.4f}")
print("="*60)