# Day-45: ARIMA Models for Forecasting

Welcome to Day 45! Today is the day we build our first true forecasting model: ARIMA (AutoRegressive Integrated Moving Average). This is the bedrock of classical time series modeling and is essential for turning those trend and seasonality insights into real predictions!

## Topics Covered:

- The ARIMA Framework
- AIC and BIC: Selecting the Best Model

## The ARIMA Framework

ARIMA forecasts future values of a series from its own past values and its own past forecast errors. The power of the model comes from combining three core components:
- $ AR(p) $ : *AutoRegressive*
    - Uses a linear combination of *past observations* (lagged values) to forecast the current observation. The 'p' is the number of lagged observations to include.
- $ I(d) $ : *Integrated*
    - Uses *differencing* of the raw observations to make the time series stationary. The 'd' is the number of times the raw observations are differenced.
- $ MA(q) $ : *Moving Average*
    - Uses a linear combination of *past forecast errors (residuals)* to forecast the current observation. The 'q' is the number of lagged forecast errors to include.
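
To see how the three pieces fit together, it can help to write them as one equation. This is a sketch that reuses the symbols of the AR and MA equations in this lesson; the name $ W_t $ for the differenced series is introduced here only for illustration. Writing $ W_t = \Delta^d Y_t $ for the series after $d$ rounds of differencing, an ARIMA(p, d, q) model is:

$ W_t = c + \Phi_1 W_{t-1} + ... + \Phi_p W_{t-p} + \Theta_1 \epsilon_{t-1} + ... + \Theta_q \epsilon_{t-q} + \epsilon_t $

For example, ARIMA(1, 1, 1) reduces to:

$ \Delta Y_t = c + \Phi_1 \Delta Y_{t-1} + \Theta_1 \epsilon_{t-1} + \epsilon_t $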

### $ I(d) $ : *Integrated* - Differencing for Stationarity

Before you can build an $AR$ or $MA$ model, your time series must be stationary.

- What is Stationarity? 
    - A time series is stationary if its statistical properties (mean, variance, and autocorrelation) do not change over time.

- The Problem: 
    - Most real-world data (like stock prices or sales) are non-stationary because they have a Trend and/or Seasonality.

- The Solution: Differencing: 
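    - The 'I' in ARIMA stands for Integrated, meaning the raw series is differenced before the AR and MA terms are fitted. Differencing replaces each observation with its change from the previous one, which removes a linear trend in one pass. The parameter $d$ is how many times this is repeated. A seasonal pattern usually calls for seasonal differencing ($Y_t - Y_{t-m}$ for a season of length $m$), which plain ARIMA does not apply. The first difference is: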

$ \Delta Y_t = Y_t - Y_{t-1}  $
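
As a quick illustration of the differencing formula, here is a minimal sketch using `numpy.diff` on two made-up toy series (the series and variable names are hypothetical, chosen only for this example). It shows that one difference flattens a straight-line trend, while a quadratic trend needs two.

```python
import numpy as np

# Hypothetical toy series: one with a linear trend, one with a quadratic trend
t = np.arange(10)
linear = 3.0 + 2.0 * t          # grows by 2 every step
quadratic = 1.0 + 0.5 * t**2    # the growth itself accelerates

# First difference (d = 1): Y_t - Y_{t-1}
d1_linear = np.diff(linear, n=1)

# Second difference (d = 2): difference the differenced series again
d2_quadratic = np.diff(quadratic, n=2)

print("first difference of linear trend:   ", d1_linear)
print("second difference of quadratic trend:", d2_quadratic)
```

Each round of differencing shortens the series by one observation, which is why a model with d = 1 effectively works with one data point fewer than the raw series.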

### $ AR(p) $ : *AutoRegressive*

The $AR$ term uses a technique called Autoregression. This is simply a linear regression of the variable against its own past values (lags).

$ Y_t = c + \Phi_1 Y_{t-1}+ \Phi_2 Y_{t-2}+...+ \Phi_p Y_{t-p}+ \epsilon_t $ 

- `Analogy`: Predicting tomorrow's temperature based on the temperature from the last three days. The parameter $p$ is the number of days you look back.

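- `Identification`: A starting value for $p$ is read from the Partial Autocorrelation Function (PACF) plot of the stationary (differenced) series. For a pure AR($p$) process the PACF is significant up to lag $p$ and then drops inside the confidence band, while the ACF decays gradually.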
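
The idea that autoregression is a linear regression on lagged values can be sketched directly with NumPy. The example below simulates a hypothetical AR(2) series with coefficients picked for illustration, then recovers them by ordinary least squares. This is only a sketch of the concept: `statsmodels` estimates ARIMA models by maximum likelihood rather than by this regression, and the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed for repeatability

# Simulate a hypothetical AR(2) process with c = 0:
# Y_t = 0.6 * Y_{t-1} - 0.3 * Y_{t-2} + eps_t
true_phi = np.array([0.6, -0.3])
n = 5000
eps = rng.standard_normal(n)
y = np.zeros(n)
for t in range(2, n):
    y[t] = true_phi[0] * y[t - 1] + true_phi[1] * y[t - 2] + eps[t]

# Regression setup: target Y_t, predictors [1, Y_{t-1}, Y_{t-2}]
target = y[2:]
X = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])

# Ordinary least squares returns estimates of (c, Phi_1, Phi_2)
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
c_hat, phi_hat = coef[0], coef[1:]

print("estimated c:  ", round(float(c_hat), 3))
print("estimated Phi:", np.round(phi_hat, 3))
```

With a long enough series the estimates should land close to the coefficients used in the simulation.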

### $ MA(q) $ : *Moving Average*

The $MA$ term uses a linear combination of past forecast errors (residuals) to correct the current forecast.

$ Y_t = \mu + \Theta_1 \epsilon_{t-1}+ \Theta_2 \epsilon_{t-2}+...+ \Theta_q \epsilon_{t-q}+ \epsilon_t $ 

- `Analogy`: If your forecast was too high yesterday and too low the day before, you use those errors to slightly adjust today's prediction. The parameter $q$ is the number of past forecast errors you include.

- `Identification`: A starting value for $q$ is read from the Autocorrelation Function (ACF) plot of the stationary (differenced) series. For a pure MA($q$) process the ACF is significant up to lag $q$ and then drops inside the confidence band, while the PACF decays gradually.
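
The ACF cut-off can be checked numerically. The sketch below simulates a hypothetical MA(1) series and computes its sample autocorrelations with a small helper, `sample_acf`, written just for this example (it is not a library function). The coefficient, seed, and sample size are arbitrary choices. For an MA(1) process the theoretical lag-1 autocorrelation is $ \Theta_1 / (1 + \Theta_1^2) $, and all higher lags are zero.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary fixed seed for repeatability

# Simulate a hypothetical MA(1) process with mu = 0:
# Y_t = eps_t + theta * eps_{t-1}
theta = 0.8
n = 5000
eps = rng.standard_normal(n + 1)
y = eps[1:] + theta * eps[:-1]


def sample_acf(x, max_lag):
    """Sample autocorrelation of x at lags 0..max_lag."""
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)]
    )


acf = sample_acf(y, max_lag=4)
band = 1.96 / np.sqrt(n)  # approximate 95% band for white noise

print("sample ACF (lags 0-4):  ", np.round(acf, 3))
print("theoretical lag-1 value:", round(theta / (1 + theta**2), 3))
print("approximate band: +/-", round(float(band), 3))
```

Lag 1 should stand well outside the band, while lags 2 and above should sit close to zero, which is the cut-off pattern that suggests q = 1.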

## AIC and BIC: Selecting the Best Model

Once you've narrowed down the possible values for p and q using the ACF and PACF plots, you often have several candidate ARIMA models (e.g., ARIMA(1,1,0), ARIMA(0,1,1), etc.).

To choose the best model, we use information criteria:

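- AIC (Akaike Information Criterion):

    - Estimates the relative quality of statistical models fitted to the same data by balancing goodness of fit (the log-likelihood) against the number of estimated parameters.

- BIC (Bayesian Information Criterion):

    - Similar to AIC, but its penalty per parameter grows with the sample size, so it penalizes extra parameters (larger p and q) more heavily and tends to pick simpler models.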

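The Rule: When comparing candidate models fitted to the same series with the same differencing order d, prefer the model with the lowest AIC or BIC. The absolute values mean nothing on their own; only the differences between models matter. The two criteria usually agree, and when they do not, BIC points to the simpler model. Either way, the goal is the sweet spot: a model that fits the data well without being overly complex (overfitting).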
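
Both criteria are simple functions of the log-likelihood $ \ln L $, the number of estimated parameters $k$, and the number of observations $n$: $ AIC = 2k - 2 \ln L $ and $ BIC = k \ln n - 2 \ln L $. The sketch below plugs in the log-likelihood reported for the ARIMA(1, 1, 1) fit in this lesson's summary output. The values of `k` and `n` are assumptions, not something the lesson states: `k = 3` counts the AR coefficient, the MA coefficient, and the noise variance, and `n = 99` is the 100 observations minus the one lost to differencing.

```python
import math

# Log-likelihood copied from the ARIMA(1, 1, 1) summary in this lesson
log_likelihood = -221.282

# Assumed values (see the lead-in): 3 estimated parameters,
# 99 effective observations after one round of differencing
k = 3
n = 99

aic = 2 * k - 2 * log_likelihood
bic = k * math.log(n) - 2 * log_likelihood

print(f"AIC = {aic:.3f}")
print(f"BIC = {bic:.3f}")
```

Under these assumptions the results agree with the AIC and BIC in the summary table to within rounding of the printed log-likelihood. The only difference between the two formulas is the penalty per parameter: 2 for AIC versus $ \ln n $ for BIC.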

In [7]:
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
import numpy as np

# Recreating the index and data array used in the Day 44 example
index = pd.date_range(start='2025', periods=100, freq='ME')
data = 5 * np.sin(np.linspace(0, 3*np.pi, 100)) + np.arange(100) * 0.5 + np.random.randn(100) * 2

# Creating the time series object
ts = pd.Series(data, index=index)

print("--- Reconstructed Time Series (ts) ---")
print(ts.head())
print("\n... (omitting middle 90 rows) ...")
print(ts.tail())

# 1. Define the ARIMA order (p, d, q)
# ARIMA(1, 1, 1): one AR term, one difference, one MA term
order = (1, 1, 1)

# 2. Fit the model
model = ARIMA(ts, order=order)
results = model.fit()

# 3. Print the results (This will show the AIC/BIC)
print(results.summary())

# 4. Forecast the next 10 periods
forecast = results.forecast(steps=10)

print("\nForecast for next 10 periods:")
print(forecast)

--- Reconstructed Time Series (ts) ---
2025-01-31    0.916678
2025-02-28    1.292150
2025-03-31    3.928571
2025-04-30    1.440891
2025-05-31    0.307240
Freq: ME, dtype: float64

... (omitting middle 90 rows) ...
2032-12-31    47.705763
2033-01-31    46.176776
2033-02-28    50.227033
2033-03-31    49.704161
2033-04-30    51.990481
Freq: ME, dtype: float64
                               SARIMAX Results                                
Dep. Variable:                      y   No. Observations:                  100
Model:                 ARIMA(1, 1, 1)   Log Likelihood                -221.282
Date:                Mon, 29 Sep 2025   AIC                            448.563
Time:                        16:38:13   BIC                            456.349
Sample:                    01-31-2025   HQIC                           451.713
                         - 04-30-2033                                         
Covariance Type:                  opg                                         
         