# **02 - Baseline & Classical Models**
**Goal**: Implement baseline + classical forecasting models for hourly temperature (1-7 day horizons)

**Models**:
- **Baseline**: Naive, Seasonal Naive
- **Classical**: SARIMA, Exponential Smoothing

**Evaluation**: RMSE, MAE across 24/48/72/96/120/144/168 hour horizons
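
As a reminder of what the two metrics measure, here is a minimal NumPy sketch. The helper names `mae` and `rmse` and the toy numbers are illustrative assumptions; the notebook itself relies on `sklearn.metrics` for the real evaluation.

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error: average error size, in the units of the data."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

def rmse(actual, predicted):
    """Root mean squared error: like MAE, but large errors weigh more."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy values, not real temperatures
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 20.0]

toy_mae = mae(actual, predicted)
toy_rmse = rmse(actual, predicted)
print(f"MAE={toy_mae:.3f}  RMSE={toy_rmse:.3f}")
```

RMSE is never smaller than MAE on the same errors, so a large gap between the two hints at a few big misses rather than a uniform bias.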

### Step 1: Setup & Load Data

Load the cleaned hourly temperature data prepared by Member 1.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error
import warnings
warnings.filterwarnings('ignore')

# Load processed data from Member 1 (automatic download)
url = "https://raw.githubusercontent.com/aejae-da/weather-forecasting-project-gha/main/data/weather_temperature_hourly.csv"
df = pd.read_csv(url, parse_dates=['date'])
print(f"✅ Loaded Member 1 data: {df.shape}")
print("Columns:", list(df.columns))
df.head()
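
Because the later models assume a 24-step daily cycle, it may be worth confirming that the timestamps really are evenly spaced by one hour. The sketch below runs that check on a synthetic frame; the column names `date` and `y` mirror the ones used in this notebook, and the two dropped rows are an assumption made only to give the check something to find.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the loaded data (48 hourly rows)
toy = pd.DataFrame({
    "date": pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(48), unit="h"),
    "y": np.arange(48, dtype=float),
}).drop(index=[10, 11])  # simulate two missing hours

# Consecutive timestamps should be exactly one hour apart
gaps = toy["date"].diff().dropna()
irregular = gaps[gaps != pd.Timedelta(hours=1)]
n_duplicates = int(toy["date"].duplicated().sum())

print(f"Irregular steps: {len(irregular)}, duplicate timestamps: {n_duplicates}")
```

If `irregular` is non-empty on the real data, the gaps would need to be filled or handled before a period of 24 rows can be read as one day.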

### Step 2: Train/Test Split

**Split**: hold out the last 7 days (168 hours) for testing; train on everything before that

In [None]:
# Use last 7 days for testing (168 hours)
test_size = 24 * 7  # 168 hours
train = df.iloc[:-test_size].copy()
test = df.iloc[-test_size:].copy()

print(f"Train: {len(train)} hours ({len(train)/24:.1f} days)")
print(f"Test:  {len(test)} hours (7 days)")
print(f"Train period: {train['date'].min().date()} to {train['date'].max().date()}")
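
A positional split like this is only chronological if the rows are already sorted by time. The following sketch wraps the same idea in a hypothetical helper, `split_last_hours`, that sorts first and then holds out the final rows; it is shown on shuffled synthetic data and is not part of the original pipeline.

```python
import numpy as np
import pandas as pd

def split_last_hours(frame, n_test, time_col="date"):
    """Sort by time, then hold out the final `n_test` rows (hypothetical helper)."""
    ordered = frame.sort_values(time_col).reset_index(drop=True)
    if not 0 < n_test < len(ordered):
        raise ValueError("n_test must be between 1 and len(frame) - 1")
    return ordered.iloc[:-n_test].copy(), ordered.iloc[-n_test:].copy()

# Synthetic 10-day hourly frame, deliberately shuffled
toy = pd.DataFrame({
    "date": pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(240), unit="h"),
    "y": np.arange(240, dtype=float),
}).sample(frac=1.0, random_state=0)

toy_train, toy_test = split_last_hours(toy, n_test=24 * 7)
print(len(toy_train), len(toy_test))
```

The key property to verify is that the latest training timestamp precedes the earliest test timestamp, so no future information leaks into training.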

### Step 3: Baseline Models

**Naive**: Repeat last observed value  
**Seasonal Naive**: Repeat last 24h pattern

*Baseline = "what to beat"*

In [None]:
# Create forecasts for 7 horizons: 24,48,72,96,120,144,168 hours
horizons = [24, 48, 72, 96, 120, 144, 168]

results = {}

# 3.1 Naive (last observed value)
naive_forecasts = {}
for h in horizons:
    last_value = train['y'].iloc[-1]
    naive_forecasts[h] = [last_value] * h
    mae = mean_absolute_error(test['y'].iloc[:h], naive_forecasts[h])
    rmse = np.sqrt(mean_squared_error(test['y'].iloc[:h], naive_forecasts[h]))
    results[f'Naive_h{h}'] = {'MAE': mae, 'RMSE': rmse}

print("✅ Naive baseline complete")

In [None]:
# 3.2 Seasonal Naive (last 24h pattern repeats)
seasonal_forecasts = {}
for h in horizons:
    pattern = train['y'].iloc[-24:].values  # Last 24 hours
    forecast = np.tile(pattern, (h//24 + 1))[:h]
    seasonal_forecasts[h] = forecast
    mae = mean_absolute_error(test['y'].iloc[:h], forecast)
    rmse = np.sqrt(mean_squared_error(test['y'].iloc[:h], forecast))
    results[f'Seasonal_h{h}'] = {'MAE': mae, 'RMSE': rmse}

print("✅ Seasonal naive complete")
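
To make the two baselines concrete, here is a small sketch on a toy series with a season length of 4 instead of 24. The function names are hypothetical, and the ceiling division is an assumption that lets the horizon be any length, not only a multiple of the season.

```python
import numpy as np

def naive_forecast(history, horizon):
    """Repeat the last observed value for every future step."""
    history = np.asarray(history, dtype=float)
    return np.full(horizon, history[-1])

def seasonal_naive_forecast(history, horizon, season_length):
    """Repeat the last full season until the horizon is covered."""
    history = np.asarray(history, dtype=float)
    pattern = history[-season_length:]
    reps = -(-horizon // season_length)  # ceiling division
    return np.tile(pattern, reps)[:horizon]

# Two toy "days" of four observations each
history = np.array([1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5])

flat = naive_forecast(history, horizon=6)
cyclic = seasonal_naive_forecast(history, horizon=6, season_length=4)
print(flat)
print(cyclic)
```

The naive forecast is a flat line, while the seasonal naive forecast keeps the shape of the last cycle, which is why it tends to be the harder baseline on data with a strong daily pattern.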

### Step 4: Classical Models (SARIMA)

**SARIMA(1,1,1)(1,1,1,24)**  
- Order (1,1,1): AR, differencing, MA  
- Seasonal (1,1,1,24): Daily cycles
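
The two differencing terms (`d=1` and `D=1`) mean the ARMA parts are fitted to the series after one regular difference and one lag-24 difference. The sketch below applies both by hand to a synthetic series (linear trend plus a fixed daily sine cycle, chosen only for illustration) to show what they remove.

```python
import numpy as np

hours = np.arange(24 * 10)
# Synthetic series: linear trend + fixed daily cycle (illustrative only)
series = 0.05 * hours + 5.0 * np.sin(2 * np.pi * hours / 24)

# Seasonal difference: y_t - y_{t-24} removes the repeating daily cycle
seasonal_diff = series[24:] - series[:-24]

# Regular difference on top: removes what is left of the linear trend
both_diff = np.diff(seasonal_diff)

print(seasonal_diff[:3])  # constant: the trend gained over 24 hours
print(np.abs(both_diff).max())  # approximately zero
```

On real temperatures the differenced series will not be exactly zero; the AR and MA terms model whatever structure remains.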

In [None]:
from statsmodels.tsa.statespace.sarimax import SARIMAX

# SARIMA(1,1,1)(1,1,1,24) - daily seasonality
print("Training SARIMA... (this takes ~2-3 minutes)")
sarima_model = SARIMAX(train['y'], order=(1,1,1), seasonal_order=(1,1,1,24))
sarima_fit = sarima_model.fit(disp=False)

sarima_forecasts = {}
for h in horizons:
    forecast = sarima_fit.forecast(steps=h)
    sarima_forecasts[h] = forecast
    mae = mean_absolute_error(test['y'].iloc[:h], forecast)
    rmse = np.sqrt(mean_squared_error(test['y'].iloc[:h], forecast))
    results[f'SARIMA_h{h}'] = {'MAE': mae, 'RMSE': rmse}

print("✅ SARIMA complete")

### Step 5: Exponential Smoothing

**ETS(A,A,A,24)**: Additive trend + daily seasonality

*Fast alternative to SARIMA*
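
For intuition about what the additive model tracks, here is a hand-rolled sketch of the additive Holt-Winters recurrences (level, trend, seasonal). The function name, the fixed smoothing parameters, and the crude first-season initialisation are all assumptions for illustration; `statsmodels` estimates the parameters and initial states itself, so its output will differ.

```python
import numpy as np

def holt_winters_additive(y, m, alpha, beta, gamma, horizon):
    """Additive Holt-Winters with a simple initialisation (needs len(y) >= 2*m)."""
    y = np.asarray(y, dtype=float)
    level = y[:m].mean()
    trend = (y[m:2 * m].mean() - y[:m].mean()) / m
    season = list(y[:m] - level)  # season[t] is the seasonal estimate at time t

    for t in range(m, len(y)):
        s_prev = season[t - m]
        prev_level, prev_trend = level, trend
        level = alpha * (y[t] - s_prev) + (1 - alpha) * (prev_level + prev_trend)
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        season.append(gamma * (y[t] - prev_level - prev_trend) + (1 - gamma) * s_prev)

    n = len(y)
    # Step h reuses the latest seasonal estimate for the same position in the cycle
    return np.array([
        level + h * trend + season[n - m + (h - 1) % m]
        for h in range(1, horizon + 1)
    ])

# Perfectly periodic toy series with season length 4 and no trend
pattern = np.array([2.0, 5.0, 3.0, 1.0])
toy_series = np.tile(pattern, 3)
toy_forecast = holt_winters_additive(toy_series, m=4, alpha=0.3, beta=0.1,
                                     gamma=0.2, horizon=6)
print(toy_forecast)
```

On this noise-free input the recurrences reproduce the cycle, which is a useful sanity check before trusting the same structure on noisy data.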

In [None]:
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# ETS with daily seasonality
print("Training Exponential Smoothing...")
ets_model = ExponentialSmoothing(train['y'],
                                trend='add',
                                seasonal='add',
                                seasonal_periods=24)
ets_fit = ets_model.fit()

ets_forecasts = {}
for h in horizons:
    forecast = ets_fit.forecast(steps=h)
    ets_forecasts[h] = forecast
    mae = mean_absolute_error(test['y'].iloc[:h], forecast)
    rmse = np.sqrt(mean_squared_error(test['y'].iloc[:h], forecast))
    results[f'ETS_h{h}'] = {'MAE': mae, 'RMSE': rmse}

print("✅ ETS complete")

### Step 6: Results Table

Compare all 4 models across 7 horizons

In [None]:
results_df = pd.DataFrame(results).T
results_df = results_df.round(3)
print("\n📊 MODEL COMPARISON (Test Set)")
print(results_df)

# Best models per horizon
print("\n🏆 BEST MODEL PER HORIZON:")
for h in horizons:
    best_model = results_df.loc[[f'Naive_h{h}', f'Seasonal_h{h}', f'SARIMA_h{h}', f'ETS_h{h}'], 'RMSE'].idxmin()
    best_rmse = results_df.loc[best_model, 'RMSE']
    print(f"H{h:2d}h: {best_model:10s} (RMSE={best_rmse:.3f})")
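
The flat table with keys like `Naive_h24` can be hard to scan. One possible reshaping, sketched below, splits each key into a model name and a horizon and pivots to a horizon-by-model grid. The numbers in `toy_results` are placeholders made up to show the mechanics; they are not results from this notebook.

```python
import pandas as pd

# Placeholder values only, NOT real results
toy_results = {
    "Naive_h24": {"MAE": 2.0, "RMSE": 2.5},
    "Naive_h48": {"MAE": 2.4, "RMSE": 3.0},
    "Seasonal_h24": {"MAE": 1.5, "RMSE": 1.9},
    "Seasonal_h48": {"MAE": 1.8, "RMSE": 2.2},
}
long_df = pd.DataFrame(toy_results).T

# Split keys such as "Naive_h24" into a model name and an integer horizon
parts = long_df.index.to_series().str.rsplit("_h", n=1, expand=True)
long_df["model"] = parts[0].values
long_df["horizon"] = parts[1].astype(int).values

# One row per horizon, one column per model
rmse_table = long_df.pivot(index="horizon", columns="model", values="RMSE")
best_per_horizon = rmse_table.idxmin(axis=1)

print(rmse_table)
print(best_per_horizon)
```

The same steps should apply to the notebook's `results` dictionary, since its keys follow the `Model_hNN` pattern.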

### Step 7: Visualization

48-hour forecasts vs actual temperature

In [None]:
# Plot best forecasts vs actual (48h horizon example)
h_example = 48
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Naive
axes[0,0].plot(test['date'].iloc[:h_example], test['y'].iloc[:h_example], label='Actual', linewidth=2)
axes[0,0].plot(test['date'].iloc[:h_example], naive_forecasts[h_example], label='Naive', linestyle='--')
axes[0,0].set_title(f'Naive Baseline (H{h_example})')
axes[0,0].legend()
axes[0,0].grid(alpha=0.3)

# Seasonal Naive
axes[0,1].plot(test['date'].iloc[:h_example], test['y'].iloc[:h_example], label='Actual', linewidth=2)
axes[0,1].plot(test['date'].iloc[:h_example], seasonal_forecasts[h_example], label='Seasonal Naive', linestyle='--')
axes[0,1].set_title(f'Seasonal Naive (H{h_example})')
axes[0,1].legend()
axes[0,1].grid(alpha=0.3)

# SARIMA
axes[1,0].plot(test['date'].iloc[:h_example], test['y'].iloc[:h_example], label='Actual', linewidth=2)
axes[1,0].plot(test['date'].iloc[:h_example], sarima_forecasts[h_example], label='SARIMA', linestyle='--')
axes[1,0].set_title(f'SARIMA (H{h_example})')
axes[1,0].legend()
axes[1,0].grid(alpha=0.3)

# ETS
axes[1,1].plot(test['date'].iloc[:h_example], test['y'].iloc[:h_example], label='Actual', linewidth=2)
axes[1,1].plot(test['date'].iloc[:h_example], ets_forecasts[h_example], label='ETS', linestyle='--')
axes[1,1].set_title(f'Exponential Smoothing (H{h_example})')
axes[1,1].legend()
axes[1,1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

### Step 8: Save Results

**Saved**:
- `baseline_classical_results.csv` ← Metrics table
- `baseline_classical_forecasts.csv` ← Raw predictions

In [None]:
# Save results table
results_df.to_csv('baseline_classical_results.csv')
print("✅ Saved: baseline_classical_results.csv")

# Save forecasts (for plotting later)
# np.asarray drops the Series index so every column lines up by position
forecasts_df = pd.DataFrame({
    'date': test['date'].iloc[:168].values,
    'actual': test['y'].iloc[:168].values,
    'naive': naive_forecasts[168],
    'seasonal': seasonal_forecasts[168],
    'sarima': np.asarray(sarima_forecasts[168]),
    'ets': np.asarray(ets_forecasts[168])
})
forecasts_df.to_csv('baseline_classical_forecasts.csv', index=False)
print("✅ Saved: baseline_classical_forecasts.csv")

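try:
    from google.colab import files  # Only available inside Google Colab
    files.download('baseline_classical_results.csv')
    files.download('baseline_classical_forecasts.csv')
except ImportError:
    pass  # Not running in Colab; the CSV files are already saved locally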