<a href="https://colab.research.google.com/github/arkeodev/time-series/blob/main/Statistical_Time_Series_Analysis/07-autoregressive-model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 7. Autoregressive (AR) Models

## 1. Introduction

An **Autoregressive (AR)** model is a cornerstone of time series analysis. The term *autoregressive* reflects that the variable of interest ($x_t$) is regressed on its own previous values (lags).

**Key Features**

- **Lag Dependence**: The current value depends on its past values.
- **Stationarity Requirement**: AR models assume the time series is stationary, which constrains the AR coefficients (e.g., $-1 < \phi < 1$ for AR(1); for higher orders, the roots of the characteristic polynomial must lie outside the unit circle).
- **Simplicity & Interpretability**: AR models are relatively easy to interpret and estimate, making them a common first choice for forecasting tasks.

## 2. AR(1) Model

The simplest form, **AR(1)**, posits:
$$
x_t = C + \phi x_{t-1} + \epsilon_t
$$
where

1. $( x_t )$: Value of the time series at time $( t )$.  
2. $( C )$: Constant (intercept).  
3. $( \phi )$: AR coefficient indicating how strongly $( x_{t-1} )$ influences $( x_t )$.  
4. $( \epsilon_t )$: A random error term (often assumed to be white noise).  
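
Two quantities follow directly from this equation and are used later when the synthetic series is initialized. As a short derivation, assuming $|\phi| < 1$ and white-noise errors with variance $\sigma^2$ (the symbols $\mu$ and $\sigma^2$ are introduced here for illustration), taking the expectation and the variance of both sides gives:

$$
\mu = E[x_t] = C + \phi\,\mu \;\Rightarrow\; \mu = \frac{C}{1 - \phi},
\qquad
\operatorname{Var}(x_t) = \phi^2 \operatorname{Var}(x_t) + \sigma^2 \;\Rightarrow\; \operatorname{Var}(x_t) = \frac{\sigma^2}{1 - \phi^2}
$$

For example, with $C = 2$ and $\phi = 0.7$ the long-run mean is $2 / 0.3 \approx 6.67$.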

**Interpretation**

- **$(\phi \approx 1)$**: High persistence. Shocks decay very slowly, so the series returns to its mean only gradually.  
- **$(\phi \approx 0)$**: Little to no influence from the past; the series behaves like noise around $( C )$.  
- **$(\phi < 0)$**: Negative relationship with the previous value, so the series tends to alternate above and below its mean.
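
The three cases above can be checked numerically. The following is a minimal sketch, not part of the original notebook: it simulates AR(1) series for three illustrative values of $\phi$ and reports the sample lag-1 autocorrelation. The helper names `simulate_ar1` and `lag1_autocorr`, the sample size, and the seed are assumptions chosen for the illustration.

```python
import numpy as np

def simulate_ar1(phi, c=0.0, n=2000, sigma=1.0, seed=0):
    """Simulate x_t = c + phi * x_{t-1} + eps_t (illustrative helper)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, n)
    x = np.empty(n)
    x[0] = c / (1 - phi)  # start at the stationary mean (requires |phi| < 1)
    for t in range(1, n):
        x[t] = c + phi * x[t - 1] + eps[t]
    return x

def lag1_autocorr(x):
    """Sample correlation between x_t and x_{t-1}."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

# One series per regime: persistent, no memory, oscillating.
lag1 = {phi: lag1_autocorr(simulate_ar1(phi)) for phi in (0.9, 0.0, -0.7)}
for phi, r in lag1.items():
    print(f"phi = {phi:+.1f} -> sample lag-1 autocorrelation = {r:+.2f}")
```

For a stationary AR(1) process the lag-1 autocorrelation equals $\phi$, so each printed value should land close to the coefficient that generated it.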

## 3. $AR(p)$ Model

A general **AR(p)** model includes $p$ lags:
$$
x_t = C + \phi_1 x_{t-1} + \phi_2 x_{t-2} + \dots + \phi_p x_{t-p} + \epsilon_t
$$

- Increasing $( p )$ allows for more complex dependencies (multiple past values), but every extra lag adds a parameter to estimate, so there is a trade-off between flexibility on one side and parsimony and interpretability on the other.
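
For $p > 1$, stationarity can no longer be read off a single coefficient. The standard condition is that all roots of $1 - \phi_1 z - \dots - \phi_p z^p = 0$ lie outside the unit circle. A minimal sketch of that check with NumPy follows; the function name `is_stationary` and the example coefficients are illustrative and do not come from the original notebook.

```python
import numpy as np

def is_stationary(phis):
    """Return True if every root of 1 - phi_1 z - ... - phi_p z^p
    lies strictly outside the unit circle."""
    phis = np.asarray(phis, dtype=float)
    # np.roots expects coefficients ordered from the highest power down to the constant.
    coeffs = np.r_[-phis[::-1], 1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(is_stationary([0.7]))       # True:  root at 1 / 0.7
print(is_stationary([1.1]))       # False: root at 1 / 1.1, inside the unit circle
print(is_stationary([0.5, 0.3]))  # True
print(is_stationary([0.5, 0.6]))  # False: coefficients sum to more than 1
```

For AR(1) this reduces to the familiar bound $-1 < \phi < 1$.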

## 4. An Example with Python



In the example below, we will:

1. Generate a synthetic AR(1) time series.
2. Fit an AR(1) model.
3. Compare AR models of different orders (AR(1) through AR(5)).

> **Note**: Any normalization or detailed residual analysis steps would typically go here, but they have been moved to **separate notebooks** for clarity.

### 4.1 Imports and Setup

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg

# For reproducibility
np.random.seed(42)

### 4.2 Generate Synthetic AR(1) Data

We’ll create a 100-sample time series where:
$$
x_t = 2 + 0.7 \, x_{t-1} + \epsilon_t
$$

In [2]:
n_samples = 100
sigma = 1  # standard deviation of noise

errors = np.random.normal(0, sigma, n_samples)
alpha = 2   # constant (C)
phi = 0.7   # AR(1) coefficient

# Initialize the series at its stationary mean, C / (1 - phi)
data = [alpha / (1 - phi)]

# errors[0] is unused because the first value is fixed above
for t in range(1, n_samples):
    data.append(alpha + phi * data[t-1] + errors[t])

time_series = pd.Series(data)

### 4.3 Fit an AR(1) Model

In [3]:
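# old_names=False selects the current coefficient names (e.g., y.L1);
# the argument is deprecated in recent statsmodels releases, where it is already the default
ar_model = AutoReg(time_series, lags=1, old_names=False)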
ar_results = ar_model.fit()
print(ar_results.summary())

                            AutoReg Model Results                             
Dep. Variable:                      y   No. Observations:                  100
Model:                     AutoReg(1)   Log Likelihood                -130.638
Method:               Conditional MLE   S.D. of innovations              0.905
Date:                Thu, 02 Jan 2025   AIC                            267.276
Time:                        13:05:37   BIC                            275.062
Sample:                             1   HQIC                           270.426
                                  100                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.0774      0.479      4.334      0.000       1.138       3.017
y.L1           0.6703      0.075      8.995      0.000       0.524       0.816
                                    Roots           

**Interpretation**:
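- **const**: The estimated intercept, 2.08, close to the true $(C = 2)$.
- **y.L1**: The estimated AR coefficient $(\phi)$, 0.67. Its 95% confidence interval, [0.524, 0.816], contains the true value 0.7.
- **S.D. of innovations**: The estimated standard deviation of the errors, 0.905, against a true value of 1.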
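
The summary lists the method as Conditional MLE. With Gaussian errors this coincides with ordinary least squares on the lagged values, so the estimates can be cross-checked without statsmodels. The sketch below assumes the same seed and generation code as section 4.2 and regresses $x_t$ on a constant and $x_{t-1}$; the names `c_hat`, `phi_hat`, and `implied_mean` are illustrative.

```python
import numpy as np

# Rebuild the synthetic AR(1) series from section 4.2 (same seed assumed).
np.random.seed(42)
n_samples, alpha, phi = 100, 2, 0.7
errors = np.random.normal(0, 1, n_samples)
x = np.empty(n_samples)
x[0] = alpha / (1 - phi)
for t in range(1, n_samples):
    x[t] = alpha + phi * x[t - 1] + errors[t]

# Regress x_t on [1, x_{t-1}] by ordinary least squares.
X = np.column_stack([np.ones(n_samples - 1), x[:-1]])
y = x[1:]
(c_hat, phi_hat), *_ = np.linalg.lstsq(X, y, rcond=None)

# Long-run mean implied by the estimates: C / (1 - phi).
implied_mean = c_hat / (1 - phi_hat)
print(f"const = {c_hat:.4f}, phi = {phi_hat:.4f}, implied mean = {implied_mean:.2f}")
```

Plugging the reported estimates into the same formula gives $2.0774 / (1 - 0.6703) \approx 6.30$, compared with the true long-run mean $2 / 0.3 \approx 6.67$.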

### 4.4 Compare AR(1) to Higher Orders (AR(2) to AR(5))

We can loop through various orders and see if adding lags improves fit.

In [4]:
ar_orders = range(1, 6)
results_dict = {}
log_likelihoods = []

for order in ar_orders:
    model = AutoReg(time_series, lags=order, old_names=False)
    result = model.fit()
    results_dict[order] = result
    log_likelihoods.append(result.llf)
    print(f"AR({order}) Log Likelihood: {result.llf:.2f}")
    print(result.summary())
    print("\n" + "="*50 + "\n")

AR(1) Log Likelihood: -130.64
                            AutoReg Model Results                             
Dep. Variable:                      y   No. Observations:                  100
Model:                     AutoReg(1)   Log Likelihood                -130.638
Method:               Conditional MLE   S.D. of innovations              0.905
Date:                Thu, 02 Jan 2025   AIC                            267.276
Time:                        13:05:37   BIC                            275.062
Sample:                             1   HQIC                           270.426
                                  100                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.0774      0.479      4.334      0.000       1.138       3.017
y.L1           0.6703      0.075      8.995      0.000       0.524       0.816
                      

### 4.5 Model Selection Considerations

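- **Log Likelihood**: Higher (less negative) indicates a better fit. For nested models fitted to the same sample it can never decrease when lags are added, so it cannot pick an order on its own.
- **AIC/BIC**: Information criteria that penalize complexity; lower is better, provided the models are compared on the same data.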
- **Significance of Coefficients**: If higher-order coefficients are not significant (p-values > 0.05), their inclusion may not be justified.
- **Practical Parsimony**: A simpler model with fewer lags can be preferable if it captures the essential dynamics adequately.

### 4.6 Interpretation of the Results

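1. **Log Likelihood**: Rises from AR(1) $(-130.64)$ to AR(5) $(-124.57)$. Some increase is expected whenever lags are added, and part of it here comes from the higher-order models being fitted to fewer observations (95 instead of 99), so on its own it is weak evidence for a longer model.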

2. **AIC vs. BIC**:
   - **AIC** (less penalty for complexity): Lowest with **AR(5)** $(263.136)$.
   - **BIC** (heavier penalty for complexity): Lowest with **AR(1)** $(275.062)$.
   - Conclusion: AR(1) is favored by BIC; AR(5) is favored by AIC.

3. **Significance of Additional Lags**:
   - AR(1): Lag-1 is significant $(p < 0.05)$, $\phi \approx 0.67$.
   - AR(2)–AR(5): Extra lags $(\text{y.L2}, \text{y.L3}, \dots)$ have high $p$-values, i.e., not statistically significant.

4. **Model Choice**:
   - **AR(1)** is simpler, has significant coefficients, and the best BIC.
   - **AR(5)** yields a better AIC but includes many insignificant terms.

**AR(1)** is the reasonable choice here: it is parsimonious, its single lag is significant, and it matches the process that generated the data. A higher-order model is worth its extra complexity only when there’s a compelling reason, such as better out-of-sample performance.


## 5. Summary



- **Autoregressive models** are powerful tools for **time series forecasting**, leveraging past observations to predict future values.
- For an **AR(1)** process, focus on whether $\phi$ is statistically significant and within the stationarity bounds $(-1 < \phi < 1)$.
- Higher-order **AR(p)** models can capture more complex dynamics but risk overfitting if you include too many insignificant lags.
- **Model selection** often balances goodness-of-fit (likelihood) with parsimony (fewest parameters). This can be done through **hypothesis testing** (e.g., t-tests on the coefficients) and information criteria like **AIC** or **BIC**.

