## Notebook 02: Predictive Modeling & Evaluation

Objective: To develop, calibrate, and evaluate a Seasonal Autoregressive Integrated Moving Average (SARIMA) model against established persistence baselines to define the limits of short-term thermal forecasting.

1. Model Identification & Environmental DNA

In this phase, we move from observation to estimation. Our goal is to translate the physical patterns found in Notebook 01 (Daily Waves and Thermal Inertia) into the SARIMA orders: the non-seasonal part (p, d, q) and the seasonal part (P, D, Q, s).
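
As a small illustration of how a physical pattern becomes a parameter, the seasonal period s can be derived from the sampling interval. This is a sketch that assumes the "Daily Waves" pattern repeats every 24 hours and that readings arrive every 5 minutes; the variable names are illustrative only.

```python
import pandas as pd

# Assumption: the dominant cycle is daily and the data arrive every 5 minutes.
sampling_step = pd.Timedelta("5min")
season_length = pd.Timedelta("1D")

# Number of observations in one seasonal cycle (the `s` in SARIMA).
s = int(season_length / sampling_step)
print(f"Seasonal period s = {s} observations per day")
```

A seasonal period this long tends to make the model slow to fit, so it is worth keeping in mind when the seasonal orders are chosen.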

Intentional Logic:

- Temporal Consistency: We enforce a strict 5-minute frequency to ensure the model understands the "distance" between observations.

- Autocorrelation Analysis: We use ACF and PACF plots to determine how many past "pings" directly influence the future temperature.

- Benchmark Alignment: Every result here is directly compared to the 0.0219 °C, 0.0351 °C, and 0.0549 °C benchmarks we set earlier.
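
For reference, a persistence ("no change") baseline can be scored as sketched below. This is a minimal illustration on synthetic data, assuming the benchmarks are mean absolute errors of persistence forecasts; the helper `persistence_mae`, the synthetic series, and the three horizons are hypothetical and are not taken from the earlier notebook.

```python
import numpy as np
import pandas as pd

def persistence_mae(series: pd.Series, horizon_steps: int) -> float:
    """MAE of forecasting y[t + h] with y[t] (the 'no change' forecast)."""
    forecast = series.shift(horizon_steps)      # value observed h steps earlier
    errors = (series - forecast).abs().dropna()  # first h rows have no forecast
    return float(errors.mean())

# Synthetic stand-in for the temperature column: a slow linear drift.
idx = pd.date_range("2024-01-01", periods=100, freq="5min")
demo = pd.Series(20.0 + 0.01 * np.arange(100), index=idx)

for h in (1, 3, 6):  # hypothetical horizons: 5, 15, and 30 minutes
    print(f"h={h}: MAE = {persistence_mae(demo, h):.4f}")
```

Any SARIMA result would need to beat the persistence error at the same horizon to justify the added complexity.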

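We start by loading the processed data. The Timestamp column must again be parsed as the index, and the frequency must be set explicitly to 5min. Without frequency information on the index, statsmodels warns and may ignore the dates when forecasting, so forecast steps can no longer be mapped reliably to timestamps. Note that asfreq also inserts a NaN row for every 5-minute slot that is missing from the file.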

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error, mean_squared_error

# --- LOGIC: LOADING CHECKPOINT DATA ---
df = pd.read_csv('../data/processed/cleaned_iot_data.csv', index_col='Timestamp', parse_dates=True)

# Explicitly set the frequency (Essential for SARIMA logic).
# Any 5-minute slot missing from the file is reinserted as a NaN row.
df = df.asfreq('5min')

print("-" * 30)
print("DATASET READY FOR MODELING")
print("-" * 30)
print(f"Total Observations: {len(df)}")
print(f"Sampling Frequency: {df.index.freqstr}")
print("-" * 30)
df.head()
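
Because asfreq fills missing timestamps with NaN, it can help to count those rows before modeling. The following is a sketch on synthetic data; the frame `raw` and the interpolation step are assumptions for illustration, not the notebook's cleaning method.

```python
import pandas as pd

# Synthetic stand-in: six 5-minute readings with one timestamp missing.
idx = pd.date_range("2024-01-01 00:00", periods=6, freq="5min")
raw = pd.DataFrame({"Temp_C": [20.0, 20.1, 20.2, 20.3, 20.4, 20.5]}, index=idx)
raw = raw.drop(idx[3])                  # simulate a dropped reading

regular = raw.asfreq("5min")            # reinserts the missing slot as NaN
n_gaps = int(regular["Temp_C"].isna().sum())
print(f"Rows inserted as NaN: {n_gaps}")

# One possible repair (an assumption): time-weighted interpolation,
# limited to short gaps of at most two consecutive readings.
filled = regular["Temp_C"].interpolate(method="time", limit=2)
print(filled)
```

If the count is zero on the real data, the frequency assignment changed nothing except the index metadata.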

2. Identifying Model Parameters (ACF & PACF)

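Intent: We need to choose the non-seasonal orders p, d, and q. In the usual reading, the PACF guides the autoregressive order p, the ACF guides the moving-average order q, and d is the number of differences needed to make the series stationary.

- ACF (Autocorrelation): Shows how strongly the current temperature correlates with its own value at each past lag.

- PACF (Partial Autocorrelation): Isolates the "direct" relationship at a given lag by removing the correlation already explained by the shorter lags in between.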
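
To make the difference between the two measures concrete, here is a NumPy-only sketch on a synthetic AR(1) series (not the sensor data). The helpers `sample_acf` and `sample_pacf` are illustrative stand-ins for what the statsmodels plots estimate, and the significance band is the common white-noise approximation.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation r[0..nlags] (biased estimator)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom
                     for k in range(nlags + 1)])

def sample_pacf(x, nlags):
    """Partial autocorrelation via Yule-Walker: for each lag k, solve the
    k-by-k system and keep the last coefficient."""
    r = sample_acf(x, nlags)
    out = [1.0]
    for k in range(1, nlags + 1):
        R = np.array([[r[abs(i - j)] for j in range(k)] for i in range(k)])
        phi = np.linalg.solve(R, r[1 : k + 1])
        out.append(phi[-1])
    return np.array(out)

# Synthetic AR(1): x[t] = 0.8 * x[t-1] + noise.
rng = np.random.default_rng(0)
n = 5000
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + rng.normal()

acf_vals = sample_acf(x, 5)
pacf_vals = sample_pacf(x, 5)
band = 1.96 / np.sqrt(n)   # approximate 95% band under a white-noise null

print("ACF :", np.round(acf_vals[1:], 2))
print("PACF:", np.round(pacf_vals[1:], 2))
print("PACF lags outside the band:",
      [k for k in range(1, 6) if abs(pacf_vals[k]) > band])
```

For a process like this, the ACF decays gradually across lags while the PACF drops close to zero after lag 1, which is the pattern that points to p = 1.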

In [None]:
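# --- LOGIC: SIGNAL FINGERPRINTING ---
# We look at the first 50 lags (50 x 5 min = 250 min, just over 4 hours)
# Drop any NaN rows that asfreq may have inserted; they would otherwise
# propagate into the correlation estimates.
temp = df['Temp_C'].dropna()

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))

plot_acf(temp, lags=50, ax=ax1)
plot_pacf(temp, lags=50, ax=ax2, method='ywm')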

plt.tight_layout()
plt.show()
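
If the ACF of the raw series decays very slowly, that is commonly read as a sign that differencing (d = 1) is needed before p and q can be read off the plots. The sketch below shows the effect on a synthetic random walk; the series and the helper `lag1_autocorr` are illustrative assumptions, not results from the sensor data.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

rng = np.random.default_rng(1)
walk = np.cumsum(rng.normal(size=2000))   # synthetic non-stationary series
diffed = np.diff(walk)                     # first difference, i.e. d = 1

print(f"Lag-1 autocorrelation, raw series:  {lag1_autocorr(walk):.3f}")
print(f"Lag-1 autocorrelation, differenced: {lag1_autocorr(diffed):.3f}")
```

On the real data, the same comparison could be made by plotting the ACF of `df['Temp_C'].diff().dropna()` next to the plot above.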