# Session 3: Overfitting & Complex Models

## Learning Objectives

In this session, we will:

1. **3-Class Classification** - Predict positive, negative, and flat returns
2. **Explore multiple models** - From simple to complex (Logistic Regression → Decision Tree → Random Forest → Gradient Boosting → Neural Networks)
3. **Understand overfitting** - See how complex models can memorize training data but fail on test data
4. **Visualize model performance** - Interactive plots with confusion matrices, learning curves, and metrics
5. **Analyze bias-variance trade-off** - When to use simple vs complex models

## The Problem

In Sessions 1-2, we used binary classification (up/down). In reality, many days have **flat returns** (near zero).

**3-Class Classification:**
- **Positive**: Return > threshold (e.g., > 0.001)
- **Flat**: -threshold ≤ Return ≤ threshold (e.g., -0.001 to 0.001)
- **Negative**: Return < -threshold (e.g., < -0.001)
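
The three rules above can be sketched as a small labeling function. This is only an illustration on toy numbers: the helper name `label_three_classes` and the toy returns are assumptions, not part of the session code, and other implementations (for example `pd.cut`) would work just as well.

```python
import numpy as np
import pandas as pd

def label_three_classes(returns: pd.Series, threshold: float) -> pd.Series:
    """Map returns to -1 (negative), 0 (flat), or 1 (positive)."""
    conditions = [returns > threshold, returns < -threshold]
    # Anything matching neither condition (including values exactly on a
    # threshold, and NaN) falls into the default "flat" class.
    labels = np.select(conditions, [1, -1], default=0)
    return pd.Series(labels, index=returns.index, name="label")

# Toy returns (made up for illustration)
toy_returns = pd.Series([0.004, -0.0005, -0.01, 0.001, 0.0])
toy_labels = label_three_classes(toy_returns, threshold=0.001)
print(toy_labels.tolist())
```

Note that a return exactly equal to the threshold is treated as flat, matching the `≤` in the definition of the Flat class.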

**Key question:** Can complex models improve performance, or do they just overfit?


## 1. Setup and Data Loading


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
from pathlib import Path
import sys
import warnings
warnings.filterwarnings('ignore')

# Configure Plotly for Jupyter notebooks
# Set default renderer - works in Jupyter Notebook, JupyterLab, and VS Code
pio.renderers.default = "notebook"  # Use "jupyterlab" if in JupyterLab, "vscode" if in VS Code

# Set style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

# Import our modules
sys.path.insert(0, str(Path("..").resolve()))

from eda.analysis import basic_summary
from features.engineering import prepare_features, prepare_target
from backtesting.engine import backtest_strategy, print_backtest_metrics


In [None]:
# Load Session 2 data (same dataset with complex non-linear relationship)
data_path = Path("../data/saved/stock_session2.csv")
df = pd.read_csv(data_path, parse_dates=["timestamp"])

print(f"Data loaded: {df.shape[0]} rows, {df.shape[1]} columns")
df.head()


## 2. Prepare 3-Class Target

We'll create three classes based on a threshold parameter. This threshold defines what we consider "flat" returns.


In [None]:
# PARAMETER: Threshold for flat returns (students can play with this)
FLAT_THRESHOLD = 0.003  # Returns between -0.003 and 0.003 are considered "flat"

# Prepare features
feature_cols = ["X1", "X2", "X3", "X4"]
X = prepare_features(df, feature_cols=feature_cols)

# Prepare target: 3-class classification
y_returns = prepare_target(df, target_col="returns")

# Create 3 classes: -1 (negative), 0 (flat), 1 (positive)
# TODO: fill in the blank to create the multi-class label y_3class using FLAT_THRESHOLD
y_3class = pd.Series(
    ...
)

print(f"Features shape: {X.shape}")
print(f"\n3-Class Distribution:")
print(y_3class.value_counts().sort_index())
print(f"\nClass percentages:")
for cls in [-1, 0, 1]:
    pct = (y_3class == cls).mean() * 100
    label = {-1: "Negative", 0: "Flat", 1: "Positive"}[cls]
    print(f"  {label}: {pct:.2f}%")

# Visualize return distribution with thresholds
plt.figure(figsize=(12, 5))
plt.hist(y_returns, bins=100, edgecolor='black', alpha=0.7)
plt.axvline(FLAT_THRESHOLD, color='r', linestyle='--', label=f'Positive threshold: {FLAT_THRESHOLD}')
plt.axvline(-FLAT_THRESHOLD, color='r', linestyle='--', label=f'Negative threshold: {-FLAT_THRESHOLD}')
plt.axvline(0, color='gray', linestyle='-', alpha=0.5)
plt.xlabel('Returns')
plt.ylabel('Frequency')
plt.title('Return Distribution with 3-Class Thresholds')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()


## 3. Train-Test Split

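**Important:** We'll use the same chronological 80/20 split as before. Because X4 only influences returns in the second half of the data, most of the training set carries no X4 signal, which makes overfitting to first-half patterns easier to see.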
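
As a quick illustration of why the split is chronological rather than random, here is a minimal sketch on a toy frame. The helper `chronological_split` and the toy data are assumptions made for this example; the point is only that every training row precedes every test row, so the model never sees the future.

```python
import numpy as np
import pandas as pd

def chronological_split(frame: pd.DataFrame, train_frac: float = 0.8):
    """Split rows in their existing order, without shuffling."""
    cut = int(len(frame) * train_frac)
    return frame.iloc[:cut], frame.iloc[cut:]

# Toy time-ordered data (made up for illustration)
toy = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=10, freq="D"),
    "value": np.arange(10),
})
toy_train, toy_test = chronological_split(toy)

# Every training timestamp precedes every test timestamp
print(toy_train["timestamp"].max() < toy_test["timestamp"].min())
print(len(toy_train), len(toy_test))
```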


In [None]:
# Chronological split: 80% train, 20% test
split_idx = int(len(df) * 0.8)

X_train = X.iloc[:split_idx]
X_test = X.iloc[split_idx:]
y_train = y_3class.iloc[:split_idx]
y_test = y_3class.iloc[split_idx:]

print(f"Train set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"\nTrain class distribution:")
print(y_train.value_counts().sort_index())
print(f"\nTest class distribution:")
print(y_test.value_counts().sort_index())

# Check time-varying component: X4 only affects the second half of the data
print("\nTime-varying component check:")
print(f"  Train (first 80%): X4 mean = {X_train['X4'].mean():.4f}")
print(f"  Test (last 20%):   X4 mean = {X_test['X4'].mean():.4f}")
print("  Note: the X4 relationship is only active in the second half of the data, so models may overfit to first-half patterns!")


## 4. Multiple Models: Simple to Complex

We'll test models of increasing complexity and see how they perform on train vs test sets.
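
To make "train vs test" concrete, here is one possible way to score a single model on both splits and compute the gap. It runs on synthetic data from `make_classification`, not on the session dataset, and the helper `evaluate` and the choice of macro-averaged F1 are assumptions for this sketch; a weighted average would be an equally reasonable choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data (an assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, flip_y=0.2, random_state=0,
)
cut = int(len(X_demo) * 0.8)
X_tr, X_te = X_demo[:cut], X_demo[cut:]
y_tr, y_te = y_demo[:cut], y_demo[cut:]

def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit on the training split, then score both splits."""
    model.fit(X_tr, y_tr)
    pred_tr, pred_te = model.predict(X_tr), model.predict(X_te)
    row = {
        "train_acc": accuracy_score(y_tr, pred_tr),
        "test_acc": accuracy_score(y_te, pred_te),
        "train_f1": f1_score(y_tr, pred_tr, average="macro"),
        "test_f1": f1_score(y_te, pred_te, average="macro"),
    }
    # A large positive gap suggests the model memorized the training data
    row["acc_gap"] = row["train_acc"] - row["test_acc"]
    row["f1_gap"] = row["train_f1"] - row["test_f1"]
    return row

simple_row = evaluate(LogisticRegression(max_iter=1000), X_tr, y_tr, X_te, y_te)
deep_row = evaluate(DecisionTreeClassifier(random_state=0), X_tr, y_tr, X_te, y_te)
print("logistic:", {k: round(v, 3) for k, v in simple_row.items()})
print("deep tree:", {k: round(v, 3) for k, v in deep_row.items()})
```

An unconstrained decision tree typically fits its training data almost perfectly, so its gap is the one to watch.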


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report
)

# Define models to test (simple to complex)
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree (depth=5)': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Decision Tree (depth=20)': DecisionTreeClassifier(max_depth=20, random_state=42),
    'Random Forest (10 trees)': RandomForestClassifier(n_estimators=10, max_depth=10, random_state=42),
    'Random Forest (100 trees)': RandomForestClassifier(n_estimators=100, max_depth=20, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, max_depth=5, random_state=42),
    'Neural Network (small)': MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=42),
    'Neural Network (large)': MLPClassifier(hidden_layer_sizes=(100, 50, 25), max_iter=1000, random_state=42),
}

# Train and evaluate all models
results = []

for name, model in models.items():
    # Train
    # TODO: fill in the blanks (each ... is a placeholder)
    model.fit(...)

    # Predictions
    y_train_pred = ...
    y_test_pred = ...

    # Metrics
    train_acc = ...
    test_acc = ...
    train_f1 = ...
    test_f1 = ...

    # Gap (overfitting indicator)
    acc_gap = ...
    f1_gap = ...
    
    results.append({
        'model': name,
        'train_acc': train_acc,
        'test_acc': test_acc,
        'train_f1': train_f1,
        'test_f1': test_f1,
        'acc_gap': acc_gap,
        'f1_gap': f1_gap,
        'model_obj': model,
        'y_test_pred': y_test_pred
    })

results_df = pd.DataFrame(results)
print("Model Performance Comparison:")
print(results_df[['model', 'train_acc', 'test_acc', 'acc_gap', 'train_f1', 'test_f1', 'f1_gap']].round(4))


In [None]:
# Visualize overfitting: Train vs Test Accuracy

# TODO: plot train vs. test accuracy for all models to highlight the overfitting gap


### Question 4.1: Overfitting Patterns

**What patterns do you see?**


**Key insight:** More complexity ≠ Better performance. The sweet spot balances bias and variance.
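
One way to look for that sweet spot is to sweep a single complexity knob and watch both scores. The sketch below varies `max_depth` of a decision tree on synthetic, deliberately noisy data (an assumption for illustration, not the session dataset). Training accuracy generally keeps climbing with depth, while test accuracy usually stops improving at some point.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy 3-class data (an assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, flip_y=0.3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

# Sweep tree depth from very simple to unconstrained (None)
sweep = []
for depth in [1, 2, 4, 8, 16, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    sweep.append((depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te)))

for depth, train_acc, test_acc in sweep:
    gap = train_acc - test_acc
    print(f"max_depth={depth!s:>4}  train={train_acc:.3f}  test={test_acc:.3f}  gap={gap:.3f}")
```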


## 5. Confusion Matrices

Let's visualize confusion matrices for different models to see where they make mistakes.
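
Before plotting, it helps to see what the matrix contains. In this small sketch with made-up labels, rows are true classes and columns are predicted classes, in the order given by `labels`. Dividing each row by its sum turns counts into per-class recall on the diagonal.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = [-1, 0, 1]  # Negative, Flat, Positive

# Made-up labels for illustration only
y_true_toy = np.array([-1, -1, 0, 0, 0, 1, 1, 1])
y_pred_toy = np.array([-1, 0, 0, 0, 1, 1, 1, -1])

# Rows = true class, columns = predicted class
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)
print(cm)

# Row-normalize so the diagonal holds the recall of each class
cm_row_norm = cm / cm.sum(axis=1, keepdims=True)
print(cm_row_norm.round(2))
```

Passing `labels` explicitly keeps the row and column order fixed even if a model never predicts one of the classes.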


In [None]:
# Select a few models to visualize
models_to_plot = ['Logistic Regression', 'Decision Tree (depth=20)', 'Random Forest (100 trees)', 'Neural Network (large)']

# TODO: plot the confusion matrix for each model above (tip: Plotly heatmaps make them easier to read)


### Question 5.1: Confusion Matrix Interpretation

**What do the confusion matrices tell us?**


## 6. Per-Class Performance Metrics

Let's see how each model performs on each class.


In [None]:
# Calculate per-class metrics for each model
per_class_results = []

for _, row in results_df.iterrows():
    y_pred = row['y_test_pred']
    
    # Per-class precision, recall, F1
    precision = precision_score(y_test, y_pred, average=None, labels=[-1, 0, 1], zero_division=0)
    recall = recall_score(y_test, y_pred, average=None, labels=[-1, 0, 1], zero_division=0)
    f1 = f1_score(y_test, y_pred, average=None, labels=[-1, 0, 1], zero_division=0)
    
    for i, cls in enumerate([-1, 0, 1]):
        per_class_results.append({
            'model': row['model'],
            'class': {-1: 'Negative', 0: 'Flat', 1: 'Positive'}[cls],
            'precision': precision[i],
            'recall': recall[i],
            'f1': f1[i]
        })

per_class_df = pd.DataFrame(per_class_results)

# Interactive plot: Per-class F1 scores
fig = go.Figure()

for cls in ['Negative', 'Flat', 'Positive']:
    cls_data = per_class_df[per_class_df['class'] == cls]
    fig.add_trace(go.Scatter(
        x=cls_data['model'],
        y=cls_data['f1'],
        mode='lines+markers',
        name=cls,
        marker=dict(size=8)
    ))

fig.update_layout(
    title='Per-Class F1 Score by Model',
    xaxis_title='Model',
    yaxis_title='F1 Score',
    hovermode='x unified',
    height=500,
    xaxis=dict(tickangle=-45)
)

fig.show()  # uses the default renderer configured in Setup

# Heatmap of F1 scores
pivot_f1 = per_class_df.pivot(index='model', columns='class', values='f1')

fig = go.Figure(data=go.Heatmap(
    z=pivot_f1.values,
    x=pivot_f1.columns,
    y=pivot_f1.index,
    colorscale='Viridis',
    text=pivot_f1.values.round(3),
    texttemplate='%{text}',
    textfont={"size": 10},
    colorbar=dict(title="F1 Score")
))

fig.update_layout(
    title='F1 Score Heatmap: Model vs Class',
    xaxis_title='Class',
    yaxis_title='Model',
    height=600
)

fig.show()  # uses the default renderer configured in Setup


## 7. Learning Curves

Learning curves show how training and validation scores change as the training set grows, which reveals how quickly each model learns.


In [None]:
from sklearn.model_selection import learning_curve

# Select a few models for learning curves
# TODO: use learning_curve to compare how quickly each model learns (logistic regression and tree models only)
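
As a reference for the call itself, here is a minimal sketch of `learning_curve` on synthetic data (an assumption for illustration, not the session dataset). It returns the training sizes used plus one row of scores per size and one column per cross-validation fold, so averaging over `axis=1` gives the curve to plot. For time-ordered data, passing a `TimeSeriesSplit` object as `cv` may be a better fit than the plain `cv=5` used here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic 3-class data (an assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Integer train_sizes are interpreted as absolute sample counts
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X_demo, y_demo,
    train_sizes=[80, 160, 240, 320, 400],
    cv=5,
    scoring="accuracy",
)

# Average across the 5 folds to get one point per training size
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={n:>3}  train={tr:.3f}  validation={va:.3f}")
```

A persistent gap between the two curves points to variance (overfitting), while two low curves that have converged point to bias (underfitting).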


## 8. Feature Importance Analysis

Let's see which features different models consider important.
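
Not every model exposes importances the same way, so one possible approach is a small helper that checks what is available. The helper `importance_table` is hypothetical and the data is synthetic. Tree ensembles provide `feature_importances_`, linear models provide `coef_` (averaged here over classes in absolute value, which is only comparable across features on similar scales), and models with neither return `None`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

feature_names = ["X1", "X2", "X3", "X4"]

# Synthetic 3-class data (an assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

def importance_table(model, names):
    """Return a sorted importance Series, or None if unavailable."""
    if hasattr(model, "feature_importances_"):
        values = model.feature_importances_
    elif hasattr(model, "coef_"):
        # Mean absolute coefficient across classes as a rough proxy
        values = np.abs(model.coef_).mean(axis=0)
    else:
        return None
    return pd.Series(values, index=names).sort_values(ascending=False)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
lr = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

rf_table = importance_table(rf, feature_names)
lr_table = importance_table(lr, feature_names)
print(rf_table)
print(lr_table)
```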


In [None]:
# TODO: where available, use model.feature_importances_ to compare feature importances across models

### Question 8.1: Feature Importance Insights

**What can we learn from feature importance?**



## 9. Summary: Bias-Variance Trade-off

Let's create a final visualization showing the bias-variance trade-off.


In [None]:
# TODO: for each model, plot train accuracy, test accuracy, and the overfitting gap