# Model Evaluation: Comparing Forecasting Methods

This notebook evaluates the performance of different forecasting models:
- ARIMA
- SARIMA (Seasonal ARIMA)
- SARIMAX (Seasonal ARIMA with eXogenous variables)
- Prophet
- LSTM
- TCN (Temporal Convolutional Network)
- Transformer

We'll use data from **10 randomly selected companies** and compare model accuracy using multiple metrics.

## Model Input/Output Summary

### Input Format
All models accept the same input format:
- **Input**: `pd.Series` - A time series of values (e.g., daily returns) with a datetime index
- **Data Type**: Single univariate time series (one column of values over time)

### Output Format
All models return:
- **Forecast**: `pd.Series` - Predicted values for future periods with datetime index
- **Confidence Intervals**: `pd.Series` or `pd.DataFrame` (if available) - Upper and lower bounds
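
The output contract above can be illustrated with a minimal baseline. This is only a sketch: `naive_mean_forecast` is a hypothetical function that is not part of `src.forecast`, and the 1.96-standard-deviation band is an illustrative normality assumption rather than how any of the evaluated models builds its intervals.

```python
from typing import Optional, Tuple

import numpy as np
import pandas as pd


def naive_mean_forecast(
    series: pd.Series, forecast_horizon: int = 5
) -> Tuple[pd.Series, Optional[pd.DataFrame]]:
    """Hypothetical baseline that follows the (forecast, intervals) contract."""
    # Future business-day index starting one business day after the last observation
    future_index = pd.bdate_range(
        start=series.index[-1] + pd.offsets.BDay(1), periods=forecast_horizon
    )
    mean, std = series.mean(), series.std()
    forecast = pd.Series(mean, index=future_index, name="forecast")
    # Rough 95% band under a normality assumption (illustrative only)
    intervals = pd.DataFrame(
        {"lower": mean - 1.96 * std, "upper": mean + 1.96 * std},
        index=future_index,
    )
    return forecast, intervals


# Synthetic daily returns on business days (not real market data)
idx = pd.bdate_range("2024-01-01", periods=100)
returns = pd.Series(np.random.default_rng(0).normal(0, 0.01, size=100), index=idx)

forecast, intervals = naive_mean_forecast(returns, forecast_horizon=5)
print(forecast.index.min(), list(intervals.columns))
```

A model without interval support would return `None` in the second position, which is why callers should check for it before plotting bounds.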

### Model-Specific Details

#### 1. ARIMA
- **Input**: Univariate time series (pd.Series)
- **Internal Processing**: Uses entire historical series to fit ARIMA model
- **Output**: 
  - Forecast: `pd.Series` with future dates
  - Confidence Intervals: `pd.DataFrame` with 'lower' and 'upper' columns

#### 2. SARIMA
- **Input**: Univariate time series (pd.Series)
- **Internal Processing**: Uses entire historical series, captures seasonal patterns (e.g., weekly patterns)
- **Output**: 
  - Forecast: `pd.Series` with future dates
  - Confidence Intervals: `pd.DataFrame` with 'lower' and 'upper' columns

#### 3. SARIMAX
- **Input**: Univariate time series (pd.Series) + optional exogenous variables
- **Internal Processing**: Uses historical series with additional features (rolling stats, lags)
- **Output**: 
  - Forecast: `pd.Series` with future dates
  - Confidence Intervals: `pd.DataFrame` with 'lower' and 'upper' columns

#### 4. Prophet
- **Input**: Univariate time series (pd.Series) - converts to DataFrame with 'ds' (dates) and 'y' (values)
- **Internal Processing**: Uses entire historical series, handles seasonality
- **Output**: 
  - Forecast: `pd.Series` with future dates
  - Confidence Intervals: `pd.DataFrame` with 'lower' and 'upper' columns

#### 5. LSTM
- **Input**: Univariate time series (pd.Series)
- **Internal Processing**: 
  - Creates sequences of `lookback_window` (default: 60) past values
  - Normalizes data using MinMaxScaler
  - Input shape: `(batch_size, lookback_window, 1)`
- **Output**: 
  - Forecast: `pd.Series` with future dates
  - Confidence Intervals: `None` (not available)

#### 6. TCN
- **Input**: Univariate time series (pd.Series)
- **Internal Processing**: 
  - Creates sequences of `lookback_window` (default: 60) past values
  - Normalizes data using MinMaxScaler
  - Uses dilated convolutions with causal padding
  - Input shape: `(batch_size, lookback_window, 1)`
- **Output**: 
  - Forecast: `pd.Series` with future dates
  - Confidence Intervals: `None` (not available)

#### 7. Transformer
- **Input**: Univariate time series (pd.Series)
- **Internal Processing**: 
  - Creates sequences of `lookback_window` (default: 60) past values
  - Normalizes data using MinMaxScaler
  - Uses multi-head attention mechanism
  - Input shape: `(batch_size, lookback_window, 1)`
- **Output**: 
  - Forecast: `pd.Series` with future dates
  - Confidence Intervals: `None` (not available)

### Key Differences
- **ARIMA, SARIMA, SARIMAX & Prophet**: Use entire historical series, provide confidence intervals
- **LSTM, TCN, Transformer**: Use sliding windows of past values, no confidence intervals
- **All models**: Return forecasts as `pd.Series` with datetime index matching the input frequency
- **SARIMA**: Extends ARIMA with seasonal components (useful for weekly/monthly patterns)
- **SARIMAX**: Extends SARIMA with exogenous variables (additional features like rolling statistics)
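
The sliding-window preparation used by the neural models can be sketched as follows. This is an assumption-laden illustration, not the code in `src.forecast`: `make_windows` is a hypothetical helper, and the min-max scaling is done directly in NumPy as a stand-in for `MinMaxScaler`.

```python
import numpy as np


def make_windows(values: np.ndarray, lookback_window: int):
    """Build (samples, lookback_window, 1) inputs and next-step targets."""
    # Min-max scale to [0, 1]; a NumPy stand-in for sklearn's MinMaxScaler
    v_min, v_max = values.min(), values.max()
    scaled = (values - v_min) / (v_max - v_min)

    X, y = [], []
    for i in range(lookback_window, len(scaled)):
        X.append(scaled[i - lookback_window:i])  # the previous lookback_window values
        y.append(scaled[i])                      # the value to predict
    # Add a trailing feature axis so the shape is (samples, lookback_window, 1)
    X = np.asarray(X)[..., np.newaxis]
    return X, np.asarray(y), (v_min, v_max)


# Synthetic series of 100 values (not real returns)
values = np.linspace(-0.02, 0.03, 100)
X, y, (v_min, v_max) = make_windows(values, lookback_window=60)
print(X.shape, y.shape)  # (40, 60, 1) (40,)

# Predictions made in the scaled space are mapped back to return units like this
restored = y * (v_max - v_min) + v_min
```

With a lookback of 60, the first 60 observations serve only as context, so a series of length `n` yields `n - 60` training samples.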


In [None]:
# Import necessary libraries
import sys
import os
import pandas as pd
import numpy as np
import warnings
from pathlib import Path
from typing import Dict, List, Tuple
import random

# Work around a TensorFlow/protobuf version mismatch by forcing the
# pure-Python protobuf implementation (must be set before TensorFlow is imported)
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'

# Suppress TensorFlow warnings and protobuf errors
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
warnings.filterwarnings('ignore')

# Try to fix protobuf issue by setting environment variable before importing TensorFlow
try:
    import tensorflow as tf
    # Suppress the specific protobuf error
    import logging
    logging.getLogger('tensorflow').setLevel(logging.ERROR)
except ImportError:
    pass

# Add project root to path
# In Jupyter notebooks, Path.cwd() gives the current working directory
# We assume the notebook is run from the project root or evaluation/ directory
current_dir = Path.cwd()
if current_dir.name == 'evaluation':
    project_root = current_dir.parent
else:
    # If running from project root, use current directory
    project_root = current_dir

sys.path.insert(0, str(project_root))

# Import forecast functions
from src.forecast import (
    arima_forecast,
    sarima_forecast,
    sarimax_forecast,
    prophet_forecast,
    lstm_forecast,
    tcn_forecast,
    transformer_forecast
)

print("Libraries imported successfully!")
print(f"Project root: {project_root}")
print(f"Current directory: {current_dir}")
print("\nNote: If you encounter protobuf errors with TensorFlow models,")
print("try running: pip install protobuf==3.20.3")



Libraries imported successfully!
Project root: /home/dang.cpm/__MY_SPACE__/VinUni/FinLove
Current directory: /home/dang.cpm/__MY_SPACE__/VinUni/FinLove/evaluation

Note: If you encounter protobuf errors with TensorFlow models,
try running: pip install protobuf==3.20.3


In [2]:
# Fix protobuf compatibility issue
# This cell fixes the common TensorFlow/protobuf version mismatch error
import subprocess
import sys

def fix_protobuf():
    """Attempt to fix protobuf version compatibility."""
    try:
        # Try to install compatible protobuf version
        print("Attempting to fix protobuf compatibility...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "protobuf==3.20.3"], 
                             stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        print("✓ Protobuf version fixed. Please restart the kernel and run all cells again.")
        return True
    except Exception as e:
        print(f"Could not automatically fix protobuf ({e}). Please run manually:")
        print("  pip install protobuf==3.20.3")
        print("Or: pip install --upgrade protobuf")
        return False

# Uncomment the line below if you're getting protobuf errors
# fix_protobuf()

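# Alternative: use the pure-Python protobuf implementation.
# This only takes effect if it is set before protobuf/TensorFlow is first imported,
# so after changing it here, restart the kernel and run this cell before any TensorFlow import.
import os
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'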

print("Protobuf compatibility settings applied.")
print("If errors persist, restart kernel after running: pip install protobuf==3.20.3")


Protobuf compatibility settings applied.
If errors persist, restart kernel after running: pip install protobuf==3.20.3


## Step 1: Load and Prepare Data

We'll randomly select 10 companies and load their data.
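
The loader in this step expects a three-row header: column names, a ticker row, and a date-header row, followed by the data. The sketch below shows how `skiprows=[1, 2]` handles that layout. The CSV content is synthetic (made-up ticker `XYZ` and round-number prices), chosen only to mimic the structure.

```python
import io

import pandas as pd

# Synthetic sample mimicking the three-row header layout (values are made up)
sample_csv = """Price,Close,High,Low,Open,Volume
Ticker,XYZ,XYZ,XYZ,XYZ,XYZ
Date,,,,,
2024-01-02,100.0,101.0,99.0,100.5,1000
2024-01-03,101.5,102.0,100.0,100.0,1200
"""

# header=0 keeps the first row as column names; skiprows drops the two metadata rows
df = pd.read_csv(io.StringIO(sample_csv), header=0, skiprows=[1, 2])

# The first column is labelled 'Price' but actually holds the dates
df = df.rename(columns={"Price": "Date"})
df["Date"] = pd.to_datetime(df["Date"])
close = df.set_index("Date")["Close"]
print(close)
```

Skipping the metadata rows at read time keeps the `Close` column numeric; if they were read as data, pandas would infer an object dtype and `pct_change` would fail later.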


In [3]:
# Define data directory
data_dir = project_root / 'data'

# Get all CSV files (excluding cache directory)
csv_files = [f for f in os.listdir(data_dir) if f.endswith('.csv')]

# Randomly select 10 companies
random.seed(42)  # For reproducibility
num_companies = min(10, len(csv_files))  # Use 10 or all available if less than 10
selected_companies = random.sample(csv_files, num_companies)

print(f"Selected {num_companies} companies: {[f.split('_')[0] for f in selected_companies]}")
print(f"Total files available: {len(csv_files)}")

def load_company_data(filepath: Path) -> Tuple[str, pd.Series]:
    """
    Load company data from CSV file.
    
    The CSV structure is:
    - Row 1: Column names (Price, Close, High, Low, Open, Volume)
    - Row 2: Ticker metadata
    - Row 3: Date header row
    - Row 4+: Actual data with dates in first column
    
    Args:
        filepath: Path to the CSV file
        
    Returns:
        Tuple of (ticker, Series with Date index and Close prices)
    """
    # Read CSV with header from first row, skip rows 2 and 3 (ticker and Date header)
    df = pd.read_csv(filepath, header=0, skiprows=[1, 2])
    
    # Extract ticker from filename
    ticker = filepath.stem.split('_')[0]
    
    # The first column should be 'Price' which contains the dates
    # Rename it to 'Date' for clarity
    if 'Price' in df.columns:
        df = df.rename(columns={'Price': 'Date'})
    
    # Convert Date column to datetime and set as index
    df['Date'] = pd.to_datetime(df['Date'])
    df.set_index('Date', inplace=True)
    
    # Use Close price column
    if 'Close' not in df.columns:
        # If Close column doesn't exist, check what columns we have
        raise ValueError(f"Close column not found. Available columns: {df.columns.tolist()}")
    
    close_prices = df['Close'].copy()
    
    return ticker, close_prices

# Load data for each selected company
company_data = {}
for filename in selected_companies:
    filepath = data_dir / filename
    ticker, prices = load_company_data(filepath)
    company_data[ticker] = prices
    print(f"\n{ticker}:")
    print(f"  Data points: {len(prices)}")
    print(f"  Date range: {prices.index[0]} to {prices.index[-1]}")
    print(f"  Price range: ${prices.min():.2f} - ${prices.max():.2f}")


Selected 10 companies: ['MA', 'PG', 'XLK', 'NKE', 'XLB', 'HD', 'XLV', 'AAPL', 'AMZN', 'XLY']
Total files available: 29

MA:
  Data points: 2512
  Date range: 2015-11-27 00:00:00 to 2025-11-21 00:00:00
  Price range: $76.10 - $598.17

PG:
  Data points: 2512
  Date range: 2015-11-27 00:00:00 to 2025-11-21 00:00:00
  Price range: $56.85 - $175.07

XLK:
  Data points: 2512
  Date range: 2015-11-27 00:00:00 to 2025-11-21 00:00:00
  Price range: $34.66 - $304.13

NKE:
  Data points: 2512
  Date range: 2015-11-27 00:00:00 to 2025-11-21 00:00:00
  Price range: $44.32 - $167.31

XLB:
  Data points: 2512
  Date range: 2015-11-27 00:00:00 to 2025-11-21 00:00:00
  Price range: $30.68 - $95.72

HD:
  Data points: 2512
  Date range: 2015-11-27 00:00:00 to 2025-11-21 00:00:00
  Price range: $88.72 - $423.60

XLV:
  Data points: 2512
  Date range: 2015-11-27 00:00:00 to 2025-11-21 00:00:00
  Price range: $54.01 - $154.61

AAPL:
  Data points: 2512
  Date range: 2015-11-27 00:00:00 to 2025-11-21 00:00

## Step 2: Calculate Returns and Split Data

We'll calculate daily returns and split each series chronologically: the first 80% of observations for training and the last 20% for testing.


In [4]:
# Calculate returns for each company
# Returns = (Price_t - Price_{t-1}) / Price_{t-1}
returns_data = {}
for ticker, prices in company_data.items():
    returns = prices.pct_change().dropna()
    returns_data[ticker] = returns
    print(f"\n{ticker} Returns:")
    print(f"  Data points: {len(returns)}")
    print(f"  Mean return: {returns.mean():.6f}")
    print(f"  Std return: {returns.std():.6f}")
    print(f"  Min return: {returns.min():.6f}")
    print(f"  Max return: {returns.max():.6f}")

# Split data into train and test sets
# Use 80% for training, 20% for testing
train_test_split = {}
for ticker, returns in returns_data.items():
    split_idx = int(len(returns) * 0.8)
    train_returns = returns.iloc[:split_idx]
    test_returns = returns.iloc[split_idx:]
    
    train_test_split[ticker] = {
        'train': train_returns,
        'test': test_returns
    }
    
    print(f"\n{ticker} Split:")
    print(f"  Train: {len(train_returns)} points ({train_returns.index[0]} to {train_returns.index[-1]})")
    print(f"  Test: {len(test_returns)} points ({test_returns.index[0]} to {test_returns.index[-1]})")



MA Returns:
  Data points: 2511
  Mean return: 0.000844
  Std return: 0.016934
  Min return: -0.127254
  Max return: 0.166109

PG Returns:
  Data points: 2511
  Mean return: 0.000454
  Std return: 0.011794
  Min return: -0.087373
  Max return: 0.120090

XLK Returns:
  Data points: 2511
  Mean return: 0.000891
  Std return: 0.015279
  Min return: -0.138140
  Max return: 0.134257

NKE Returns:
  Data points: 2511
  Mean return: 0.000220
  Std return: 0.019837
  Min return: -0.199809
  Max return: 0.155315

XLB Returns:
  Data points: 2511
  Mean return: 0.000421
  Std return: 0.013125
  Min return: -0.110084
  Max return: 0.117601

HD Returns:
  Data points: 2511
  Mean return: 0.000588
  Std return: 0.015489
  Min return: -0.197938
  Max return: 0.137508

XLV Returns:
  Data points: 2511
  Mean return: 0.000426
  Std return: 0.010513
  Min return: -0.098610
  Max return: 0.077057

AAPL Returns:
  Data points: 2511
  Mean return: 0.001094
  Std return: 0.018354
  Min return: -0.128647
 

## Step 3: Define Evaluation Metrics

We'll calculate multiple metrics to compare model performance:
- MAE (Mean Absolute Error)
- RMSE (Root Mean Squared Error)
- MAPE (Mean Absolute Percentage Error)
- R² (Coefficient of Determination)
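
Written out, with $y_t$ the actual return, $\hat{y}_t$ the forecast, $\bar{y}$ the mean of the actual values over the evaluation window, and $n$ the number of evaluated points, the metrics are as follows. The MAPE sum runs only over the set $M$ of points with $|y_t| > 10^{-10}$, mirroring the zero-division guard in `calculate_metrics`:

$$
\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n} \left| y_t - \hat{y}_t \right|,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n} \left( y_t - \hat{y}_t \right)^2}
$$

$$
\mathrm{MAPE} = \frac{100}{|M|}\sum_{t \in M} \left| \frac{y_t - \hat{y}_t}{y_t} \right|,
\qquad
R^2 = 1 - \frac{\sum_{t=1}^{n} \left( y_t - \hat{y}_t \right)^2}{\sum_{t=1}^{n} \left( y_t - \bar{y} \right)^2}
$$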


In [5]:
def calculate_metrics(y_true: pd.Series, y_pred: pd.Series) -> Dict[str, float]:
    """
    Calculate evaluation metrics for forecast predictions.
    
    Args:
        y_true: True values
        y_pred: Predicted values
        
    Returns:
        Dictionary with metric names and values
    """
    # Align indices in case they don't match exactly
    common_idx = y_true.index.intersection(y_pred.index)
    if len(common_idx) == 0:
        # If no common index, use positional alignment
        min_len = min(len(y_true), len(y_pred))
        y_true_aligned = y_true.iloc[:min_len].values
        y_pred_aligned = y_pred.iloc[:min_len].values
    else:
        y_true_aligned = y_true.loc[common_idx].values
        y_pred_aligned = y_pred.loc[common_idx].values
    
    # Calculate metrics
    mae = np.mean(np.abs(y_true_aligned - y_pred_aligned))
    rmse = np.sqrt(np.mean((y_true_aligned - y_pred_aligned) ** 2))
    
    # MAPE - handle division by zero
    mask = np.abs(y_true_aligned) > 1e-10
    if mask.sum() > 0:
        mape = np.mean(np.abs((y_true_aligned[mask] - y_pred_aligned[mask]) / y_true_aligned[mask])) * 100
    else:
        mape = np.nan
    
    # R² score
    ss_res = np.sum((y_true_aligned - y_pred_aligned) ** 2)
    ss_tot = np.sum((y_true_aligned - np.mean(y_true_aligned)) ** 2)
    r2 = 1 - (ss_res / ss_tot) if ss_tot > 0 else np.nan
    
    return {
        'MAE': mae,
        'RMSE': rmse,
        'MAPE': mape,
        'R²': r2
    }

print("Evaluation metrics function defined!")


Evaluation metrics function defined!


## Step 4: Evaluate Models

We'll evaluate each model on all 10 companies and collect the results.

**Note**: This may take a while as we're evaluating 7 models × 10 companies = 70 model runs.
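
Each model in the evaluation cell follows the same pattern: call the forecaster, compute metrics, and fall back to NaN metrics on failure. That pattern can be sketched as a loop over a model registry. Everything here is hypothetical and simplified: `mean_forecaster` and `failing_forecaster` are stand-ins for the real functions in `src.forecast`, and only MAE and RMSE are computed.

```python
from typing import Callable, Dict

import numpy as np
import pandas as pd

NAN_METRICS = {"MAE": np.nan, "RMSE": np.nan}


def mean_forecaster(train: pd.Series, horizon: int) -> pd.Series:
    """Hypothetical stand-in: repeats the training mean."""
    return pd.Series(np.full(horizon, train.mean()))


def failing_forecaster(train: pd.Series, horizon: int) -> pd.Series:
    """Hypothetical stand-in that always fails, to exercise the fallback."""
    raise RuntimeError("model did not converge")


def evaluate_all(
    models: Dict[str, Callable[[pd.Series, int], pd.Series]],
    train: pd.Series,
    actual: pd.Series,
) -> Dict[str, Dict[str, float]]:
    """Run every model; record NaN metrics when a model raises."""
    results = {}
    for name, forecast_fn in models.items():
        try:
            pred = forecast_fn(train, len(actual)).to_numpy()
            err = actual.to_numpy() - pred
            results[name] = {
                "MAE": float(np.mean(np.abs(err))),
                "RMSE": float(np.sqrt(np.mean(err ** 2))),
            }
        except Exception as exc:
            print(f"  {name} failed: {exc}")
            results[name] = dict(NAN_METRICS)
    return results


# Synthetic returns (not real market data): 100 training points, 30 test points
series = pd.Series(np.random.default_rng(42).normal(0, 0.01, size=130))
train, actual = series.iloc[:100], series.iloc[100:]

results = evaluate_all(
    {"Mean": mean_forecaster, "Broken": failing_forecaster}, train, actual
)
print(results)
```

Recording NaN instead of re-raising lets one failing model leave the other runs intact, and pandas' `mean()` skips NaN by default when the results are averaged later.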


In [None]:
# Define forecast horizon (number of days to predict)
forecast_horizon = 30

# Store results for each company and model
results = {}

# Track progress
total_companies = len(train_test_split)
company_num = 0

# Evaluate each company
for ticker, data in train_test_split.items():
    company_num += 1
    print(f"\n{'='*60}")
    print(f"Evaluating models for {ticker} ({company_num}/{total_companies})")
    print(f"{'='*60}")
    
    train_returns = data['train']
    test_returns = data['test']
    
    # Limit test set to forecast_horizon for fair comparison
    actual_test = test_returns.iloc[:forecast_horizon]
    
    results[ticker] = {}
    
    # 1. ARIMA
    print(f"\n[1/7] ARIMA...")
    try:
        forecast, _ = arima_forecast(
            train_returns,
            forecast_horizon=forecast_horizon,
            auto_select=True
        )
        metrics = calculate_metrics(actual_test, forecast)
        results[ticker]['ARIMA'] = metrics
        print(f"  ✓ ARIMA completed - MAE: {metrics['MAE']:.6f}, RMSE: {metrics['RMSE']:.6f}")
    except Exception as e:
        print(f"  ✗ ARIMA failed: {str(e)}")
        results[ticker]['ARIMA'] = {'MAE': np.nan, 'RMSE': np.nan, 'MAPE': np.nan, 'R²': np.nan}
    
    # 2. SARIMA
    print(f"\n[2/7] SARIMA...")
    try:
        forecast, _ = sarima_forecast(
            train_returns,
            forecast_horizon=forecast_horizon,
            auto_select=True
        )
        metrics = calculate_metrics(actual_test, forecast)
        results[ticker]['SARIMA'] = metrics
        print(f"  ✓ SARIMA completed - MAE: {metrics['MAE']:.6f}, RMSE: {metrics['RMSE']:.6f}")
    except Exception as e:
        print(f"  ✗ SARIMA failed: {str(e)}")
        results[ticker]['SARIMA'] = {'MAE': np.nan, 'RMSE': np.nan, 'MAPE': np.nan, 'R²': np.nan}
    
    # 3. SARIMAX
    print(f"\n[3/7] SARIMAX...")
    try:
        forecast, _ = sarimax_forecast(
            train_returns,
            forecast_horizon=forecast_horizon,
            auto_select=True
        )
        metrics = calculate_metrics(actual_test, forecast)
        results[ticker]['SARIMAX'] = metrics
        print(f"  ✓ SARIMAX completed - MAE: {metrics['MAE']:.6f}, RMSE: {metrics['RMSE']:.6f}")
    except Exception as e:
        print(f"  ✗ SARIMAX failed: {str(e)}")
        results[ticker]['SARIMAX'] = {'MAE': np.nan, 'RMSE': np.nan, 'MAPE': np.nan, 'R²': np.nan}
    
    # 4. Prophet
    print(f"\n[4/7] Prophet...")
    try:
        forecast, _ = prophet_forecast(
            train_returns,
            forecast_horizon=forecast_horizon,
            yearly_seasonality=False,  # Disable for daily returns
            weekly_seasonality=True,
            daily_seasonality=False
        )
        metrics = calculate_metrics(actual_test, forecast)
        results[ticker]['Prophet'] = metrics
        print(f"  ✓ Prophet completed - MAE: {metrics['MAE']:.6f}, RMSE: {metrics['RMSE']:.6f}")
    except Exception as e:
        print(f"  ✗ Prophet failed: {str(e)}")
        results[ticker]['Prophet'] = {'MAE': np.nan, 'RMSE': np.nan, 'MAPE': np.nan, 'R²': np.nan}
    
    # 5. LSTM
    print(f"\n[5/7] LSTM...")
    try:
        forecast, _ = lstm_forecast(
            train_returns,
            forecast_horizon=forecast_horizon,
            lookback_window=60,
            lstm_units=50,
            epochs=20,  # Reduced for faster evaluation
            batch_size=32,
            use_cache=True,
            ticker=ticker
        )
        metrics = calculate_metrics(actual_test, forecast)
        results[ticker]['LSTM'] = metrics
        print(f"  ✓ LSTM completed - MAE: {metrics['MAE']:.6f}, RMSE: {metrics['RMSE']:.6f}")
    except Exception as e:
        print(f"  ✗ LSTM failed: {str(e)}")
        results[ticker]['LSTM'] = {'MAE': np.nan, 'RMSE': np.nan, 'MAPE': np.nan, 'R²': np.nan}
    
    # 6. TCN
    print(f"\n[6/7] TCN...")
    try:
        forecast, _ = tcn_forecast(
            train_returns,
            forecast_horizon=forecast_horizon,
            lookback_window=60,
            num_filters=64,
            kernel_size=3,
            num_blocks=2,
            epochs=20,  # Reduced for faster evaluation
            batch_size=32,
            use_cache=True,
            ticker=ticker
        )
        metrics = calculate_metrics(actual_test, forecast)
        results[ticker]['TCN'] = metrics
        print(f"  ✓ TCN completed - MAE: {metrics['MAE']:.6f}, RMSE: {metrics['RMSE']:.6f}")
    except Exception as e:
        print(f"  ✗ TCN failed: {str(e)}")
        results[ticker]['TCN'] = {'MAE': np.nan, 'RMSE': np.nan, 'MAPE': np.nan, 'R²': np.nan}
    
    # 7. Transformer
    print(f"\n[7/7] Transformer...")
    try:
        forecast, _ = transformer_forecast(
            train_returns,
            forecast_horizon=forecast_horizon,
            lookback_window=60,
            d_model=64,
            num_heads=4,
            num_layers=2,
            epochs=20,  # Reduced for faster evaluation
            batch_size=32,
            use_cache=True,
            ticker=ticker
        )
        metrics = calculate_metrics(actual_test, forecast)
        results[ticker]['Transformer'] = metrics
        print(f"  ✓ Transformer completed - MAE: {metrics['MAE']:.6f}, RMSE: {metrics['RMSE']:.6f}")
    except Exception as e:
        print(f"  ✗ Transformer failed: {str(e)}")
        results[ticker]['Transformer'] = {'MAE': np.nan, 'RMSE': np.nan, 'MAPE': np.nan, 'R²': np.nan}
    
    print(f"\n{ticker} evaluation completed!")

print(f"\n{'='*60}")
print("All evaluations completed!")
print(f"{'='*60}")



Evaluating models for MA (1/10)

[1/5] ARIMA...


  ✓ ARIMA completed - MAE: 0.006683, RMSE: 0.008497

[2/5] Prophet...




  ✓ Prophet completed - MAE: 0.006624, RMSE: 0.008486

[3/5] LSTM...



  ✓ LSTM completed - MAE: 0.007102, RMSE: 0.008642

[4/5] TCN...



  ✓ TCN completed - MAE: 0.006476, RMSE: 0.008317

[5/5] Transformer...
  ✓ Transformer completed - MAE: 0.006461, RMSE: 0.008569

MA evaluation completed!

Evaluating models for PG (2/10)

[1/5] ARIMA...




  ✓ ARIMA completed - MAE: 0.009129, RMSE: 0.012760

[2/5] Prophet...
  ✓ Prophet completed - MAE: 0.009124, RMSE: 0.012779

[3/5] LSTM...
  ✓ LSTM completed - MAE: 0.008949, RMSE: 0.012631

[4/5] TCN...
  ✓ TCN completed - MAE: 0.008900, RMSE: 0.012494

[5/5] Transformer...
  ✓ Transformer completed - MAE: 0.013467, RMSE: 0.016978

PG evaluation completed!

Evaluating models for XLK (3/10)

[1/5] ARIMA...




  ✓ ARIMA completed - MAE: 0.006228, RMSE: 0.007676

[2/5] Prophet...
  ✓ Prophet completed - MAE: 0.005525, RMSE: 0.007089

[3/5] LSTM...
  ✓ LSTM completed - MAE: 0.007008, RMSE: 0.008071

[4/5] TCN...
  ✓ TCN completed - MAE: 0.006258, RMSE: 0.007436

[5/5] Transformer...
  ✓ Transformer completed - MAE: 0.013014, RMSE: 0.014397

XLK evaluation completed!

Evaluating models for NKE (4/10)

[1/5] ARIMA...




  ✓ ARIMA completed - MAE: 0.009150, RMSE: 0.011840

[2/5] Prophet...
  ✓ Prophet completed - MAE: 0.009150, RMSE: 0.012131

[3/5] LSTM...
  ✓ LSTM completed - MAE: 0.010015, RMSE: 0.012918

[4/5] TCN...
  ✓ TCN completed - MAE: 0.008234, RMSE: 0.010792

[5/5] Transformer...
  ✓ Transformer completed - MAE: 0.008082, RMSE: 0.010245

NKE evaluation completed!

Evaluating models for XLB (5/10)

[1/5] ARIMA...




  ✓ ARIMA completed - MAE: 0.006788, RMSE: 0.008374

[2/5] Prophet...
  ✓ Prophet completed - MAE: 0.006770, RMSE: 0.008461

[3/5] LSTM...
  ✓ LSTM completed - MAE: 0.006076, RMSE: 0.008272

[4/5] TCN...
  ✓ TCN completed - MAE: 0.007991, RMSE: 0.009349

[5/5] Transformer...
  ✓ Transformer completed - MAE: 0.006627, RMSE: 0.009083

XLB evaluation completed!

Evaluating models for HD (6/10)

[1/5] ARIMA...




  ✓ ARIMA completed - MAE: 0.009244, RMSE: 0.011964

[2/5] Prophet...
  ✓ Prophet completed - MAE: 0.008997, RMSE: 0.012016

[3/5] LSTM...
  ✓ LSTM completed - MAE: 0.008523, RMSE: 0.011205

[4/5] TCN...
  ✓ TCN completed - MAE: 0.009948, RMSE: 0.012831

[5/5] Transformer...
  ✓ Transformer completed - MAE: 0.008603, RMSE: 0.011316

HD evaluation completed!

Evaluating models for XLV (7/10)

[1/5] ARIMA...




  ✓ ARIMA completed - MAE: 0.005441, RMSE: 0.007148

[2/5] Prophet...
  ✓ Prophet completed - MAE: 0.005468, RMSE: 0.007191

[3/5] LSTM...
  ✓ LSTM completed - MAE: 0.006227, RMSE: 0.007765

[4/5] TCN...
  ✓ TCN completed - MAE: 0.005518, RMSE: 0.007128

[5/5] Transformer...
  ✓ Transformer completed - MAE: 0.006103, RMSE: 0.008060

XLV evaluation completed!

Evaluating models for AAPL (8/10)

[1/5] ARIMA...




  ✓ ARIMA completed - MAE: 0.007262, RMSE: 0.008755

[2/5] Prophet...
  ✓ Prophet completed - MAE: 0.007709, RMSE: 0.009109

[3/5] LSTM...
  ✓ LSTM completed - MAE: 0.007368, RMSE: 0.008866

[4/5] TCN...
  ✓ TCN completed - MAE: 0.007457, RMSE: 0.008829

[5/5] Transformer...
  ✓ Transformer completed - MAE: 0.008171, RMSE: 0.009938

AAPL evaluation completed!

Evaluating models for AMZN (9/10)

[1/5] ARIMA...




  ✓ ARIMA completed - MAE: 0.010575, RMSE: 0.012368

[2/5] Prophet...
  ✓ Prophet completed - MAE: 0.011027, RMSE: 0.012704

[3/5] LSTM...
  ✓ LSTM completed - MAE: 0.010725, RMSE: 0.012437

[4/5] TCN...
  ✓ TCN completed - MAE: 0.010545, RMSE: 0.012409

[5/5] Transformer...
  ✓ Transformer completed - MAE: 0.011245, RMSE: 0.013293

AMZN evaluation completed!

Evaluating models for XLY (10/10)

[1/5] ARIMA...




  ✓ ARIMA completed - MAE: 0.005357, RMSE: 0.007260

[2/5] Prophet...
  ✓ Prophet completed - MAE: 0.005385, RMSE: 0.007413

[3/5] LSTM...
  ✓ LSTM completed - MAE: 0.004910, RMSE: 0.006910

[4/5] TCN...
  ✓ TCN completed - MAE: 0.005043, RMSE: 0.006978

[5/5] Transformer...
  ✓ Transformer completed - MAE: 0.006663, RMSE: 0.008359

XLY evaluation completed!

All evaluations completed!


## Step 5: Create Comparison Table

Now we'll create a comprehensive comparison table showing all metrics for all models.


In [None]:
# Create a comprehensive results table
all_results = []

# Collect results for each company and model
for ticker, model_results in results.items():
    for model_name, metrics in model_results.items():
        all_results.append({
            'Company': ticker,
            'Model': model_name,
            'MAE': metrics['MAE'],
            'RMSE': metrics['RMSE'],
            'MAPE': metrics['MAPE'],
            'R²': metrics['R²']
        })

# Create DataFrame
results_df = pd.DataFrame(all_results)

# Display the results table
print("\n" + "="*80)
print("MODEL COMPARISON RESULTS")
print("="*80)
print("\nDetailed Results:")
print(results_df.to_string(index=False))

# Create a summary table with averages across companies
print("\n" + "="*80)
print("AVERAGE METRICS ACROSS ALL COMPANIES")
print("="*80)

summary_data = []
for model in ['ARIMA', 'SARIMA', 'SARIMAX', 'Prophet', 'LSTM', 'TCN', 'Transformer']:
    model_data = results_df[results_df['Model'] == model]
    if len(model_data) > 0:
        summary_data.append({
            'Model': model,
            'MAE (avg)': model_data['MAE'].mean(),
            'RMSE (avg)': model_data['RMSE'].mean(),
            'MAPE (avg)': model_data['MAPE'].mean(),
            'R² (avg)': model_data['R²'].mean(),
            'MAE (std)': model_data['MAE'].std(),
            'RMSE (std)': model_data['RMSE'].std()
        })

summary_df = pd.DataFrame(summary_data)
print("\nSummary Statistics:")
print(summary_df.to_string(index=False))



MODEL COMPARISON RESULTS

Detailed Results:
Company       Model      MAE     RMSE        MAPE        R²
     MA       ARIMA 0.006683 0.008497  285.311476 -0.023658
     MA     Prophet 0.006624 0.008486  258.490351 -0.021104
     MA        LSTM 0.007102 0.008642  243.770144 -0.058927
     MA         TCN 0.006476 0.008317  335.668377  0.019103
     MA Transformer 0.006461 0.008569  933.402334 -0.041172
     PG       ARIMA 0.009129 0.012760  106.868276 -0.042332
     PG     Prophet 0.009124 0.012779   99.892855 -0.045343
     PG        LSTM 0.008949 0.012631   98.241673 -0.021267
     PG         TCN 0.008900 0.012494   98.771720  0.000666
     PG Transformer 0.013467 0.016978  361.687983 -0.845194
    XLK       ARIMA 0.006228 0.007676  289.423485 -0.204227
    XLK     Prophet 0.005525 0.007089  184.269196 -0.027288
    XLK        LSTM 0.007008 0.008071  254.715218 -0.331321
    XLK         TCN 0.006258 0.007436  269.790942 -0.130300
    XLK Transformer 0.013014 0.014397 1365.662708 -3.23

## Step 6: Formatted Comparison Table

This step pivots the results into one table per metric (companies as rows, models as columns) and ranks the models by their average metrics.
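
The overall ranking in this step averages each model's rank across MAE, RMSE, and R². A small worked example makes the mechanics concrete. The three models `A`, `B`, `C` and their metric values are synthetic, not results from this evaluation.

```python
import pandas as pd

# Synthetic average metrics for three hypothetical models
avg = pd.DataFrame(
    {
        "MAE": [0.010, 0.008, 0.009],
        "RMSE": [0.013, 0.012, 0.011],
        "R²": [-0.05, 0.01, -0.02],
    },
    index=["A", "B", "C"],
)

ranks = pd.DataFrame(
    {
        "MAE": avg["MAE"].rank(),                # lower error gets rank 1
        "RMSE": avg["RMSE"].rank(),              # lower error gets rank 1
        "R²": avg["R²"].rank(ascending=False),   # higher R² gets rank 1
    }
)

# Average the three ranks; the smallest value is the best overall model
overall = ranks.mean(axis=1).sort_values()
print(overall)
```

Here `B` ranks first on MAE and R² and second on RMSE, so it comes out on top. Because MAE and RMSE are strongly correlated, this scheme effectively weights error magnitude twice relative to R².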


In [None]:
# Create a pivot table for better visualization
pivot_mae = results_df.pivot(index='Company', columns='Model', values='MAE')
pivot_rmse = results_df.pivot(index='Company', columns='Model', values='RMSE')
pivot_r2 = results_df.pivot(index='Company', columns='Model', values='R²')

print("\n" + "="*80)
print("MAE (Mean Absolute Error) - Lower is Better")
print("="*80)
print(pivot_mae.round(6).to_string())

print("\n" + "="*80)
print("RMSE (Root Mean Squared Error) - Lower is Better")
print("="*80)
print(pivot_rmse.round(6).to_string())

print("\n" + "="*80)
print("R² (Coefficient of Determination) - Higher is Better (closer to 1)")
print("="*80)
print(pivot_r2.round(4).to_string())

# Find best model for each metric
print("\n" + "="*80)
print("BEST MODEL FOR EACH METRIC (across all companies)")
print("="*80)

# Calculate average metrics
avg_metrics = results_df.groupby('Model').agg({
    'MAE': 'mean',
    'RMSE': 'mean',
    'R²': 'mean'
}).round(6)

best_mae = avg_metrics['MAE'].idxmin()
best_rmse = avg_metrics['RMSE'].idxmin()
best_r2 = avg_metrics['R²'].idxmax()

print(f"\nBest MAE:  {best_mae} ({avg_metrics.loc[best_mae, 'MAE']:.6f})")
print(f"Best RMSE: {best_rmse} ({avg_metrics.loc[best_rmse, 'RMSE']:.6f})")
print(f"Best R²:   {best_r2} ({avg_metrics.loc[best_r2, 'R²']:.4f})")

# Overall ranking (based on average rank across all metrics)
avg_metrics['MAE_rank'] = avg_metrics['MAE'].rank()
avg_metrics['RMSE_rank'] = avg_metrics['RMSE'].rank()
avg_metrics['R²_rank'] = avg_metrics['R²'].rank(ascending=False)  # Higher is better
avg_metrics['Overall_Rank'] = (avg_metrics['MAE_rank'] + avg_metrics['RMSE_rank'] + avg_metrics['R²_rank']) / 3

print("\n" + "="*80)
print("OVERALL RANKING (lower rank is better)")
print("="*80)
ranking = avg_metrics[['MAE', 'RMSE', 'R²', 'Overall_Rank']].sort_values('Overall_Rank')
print(ranking.to_string())



MAE (Mean Absolute Error) - Lower is Better
Model       ARIMA      LSTM   Prophet       TCN  Transformer
Company                                                     
AAPL     0.007262  0.007368  0.007709  0.007457     0.008171
AMZN     0.010575  0.010725  0.011027  0.010545     0.011245
HD       0.009244  0.008523  0.008997  0.009948     0.008603
MA       0.006683  0.007102  0.006624  0.006476     0.006461
NKE      0.009150  0.010015  0.009150  0.008234     0.008082
PG       0.009129  0.008949  0.009124  0.008900     0.013467
XLB      0.006788  0.006076  0.006770  0.007991     0.006627
XLK      0.006228  0.007008  0.005525  0.006258     0.013014
XLV      0.005441  0.006227  0.005468  0.005518     0.006103
XLY      0.005357  0.004910  0.005385  0.005043     0.006663

RMSE (Root Mean Squared Error) - Lower is Better
Model       ARIMA      LSTM   Prophet       TCN  Transformer
Company                                                     
AAPL     0.008755  0.008866  0.009109  0.008829    


## Summary

This evaluation compares the performance of 7 forecasting models:
- **ARIMA**: Statistical time series model
- **SARIMA**: Seasonal ARIMA model (captures seasonal patterns)
- **SARIMAX**: Seasonal ARIMA with exogenous variables (uses additional features)
- **Prophet**: Additive trend-and-seasonality forecasting model developed at Facebook (now Meta)
- **LSTM**: Long Short-Term Memory neural network
- **TCN**: Temporal Convolutional Network
- **Transformer**: Attention-based neural network

### Metrics Explained:
- **MAE (Mean Absolute Error)**: Average absolute difference between predicted and actual values. Lower is better.
- **RMSE (Root Mean Squared Error)**: Square root of average squared differences. Penalizes large errors more. Lower is better.
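- **MAPE (Mean Absolute Percentage Error)**: Average absolute error as a percentage of the actual value. Lower is better, but it becomes unstable when actual values are close to zero, as daily returns often are, which is why values well above 100% appear in the results.
- **R² (Coefficient of Determination)**: Proportion of variance explained. Closer to 1 is better; a negative value means the forecast does worse than simply predicting the mean of the actual values.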
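
Two properties of these metrics matter when reading results on daily returns: R² can be negative, and MAPE can be dominated by a single near-zero actual value. The sketch below demonstrates both on synthetic numbers (not taken from the evaluation), with `r2` and `mape` as simplified hypothetical helpers.

```python
import numpy as np


def r2(y: np.ndarray, p: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - p) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1 - ss_res / ss_tot)


def mape(y: np.ndarray, p: np.ndarray) -> float:
    """Mean absolute percentage error, in percent."""
    return float(np.mean(np.abs((y - p) / y)) * 100)


# Synthetic returns; the third value is deliberately close to zero
y_true = np.array([0.010, -0.020, 0.0001, 0.015, -0.005])
mean_pred = np.full_like(y_true, y_true.mean())       # always predict the mean
bad_pred = np.array([-0.010, 0.020, 0.010, -0.015, 0.005])

r2_mean = r2(y_true, mean_pred)   # predicting the mean gives R² of zero
r2_bad = r2(y_true, bad_pred)     # worse than the mean gives a negative R²

mape_all = mape(y_true, bad_pred)
keep = np.abs(y_true) > 1e-3      # drop the near-zero actual value
mape_filtered = mape(y_true[keep], bad_pred[keep])

print(r2_mean, r2_bad, mape_all, mape_filtered)
```

The single near-zero observation raises MAPE by roughly an order of magnitude here, which is why MAE and RMSE are the more reliable error measures for return series.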

The models are evaluated on daily returns, which are harder to predict than price levels because returns are dominated by noise.
