---
title: "Autocorrelation Models"
subtitle: "IN2004B: Generation of Value with Data Analytics"
author: 
  - name: Alan R. Vazquez
    affiliations:
      - name: Department of Industrial Engineering
format: 
  revealjs:
    chalkboard: false
    multiplex: false
    footer: "Tecnologico de Monterrey"
    logo: IN2004B_logo.png
    css: style.css
    slide-number: True
    html-math-method: mathjax
editor: visual
jupyter: python3
---


## Agenda

</br>

1.  Autocorrelation
2.  The ARIMA model
3.  The SARIMA model

## Load the libraries

Before we start, let's import the data science libraries into Python.


In [None]:
#| echo: true
#| output: false

# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.tools import diff

Here, we use specific functions from the **numpy**, **pandas**, **matplotlib**, **seaborn**, **sklearn** and **statsmodels** libraries in Python.

# Autocorrelation

## Problem with linear regression models

</br>

Linear regression models do not incorporate the [dependence]{style="color:purple;"} between consecutive values in a time series.

This is unfortunate because responses recorded over close time periods tend to be correlated. This correlation is called the [**autocorrelation**]{style="color:purple;"} of the time series.

Autocorrelation helps us develop a model that can make better predictions of future responses.

## What is correlation?

:::::: center
::::: columns
::: {.column width="60%"}
</br>

-   It is a measure of the strength and direction of the *linear* relationship between two numerical variables.

-   Specifically, it is used to assess the relationship between two sets of observations.

-   Correlation is between $-1$ and 1.
:::

::: {.column width="40%"}
</br>


In [None]:
#| echo: false
#| output: true
#| fig-align: center

# Load the data into Python
Ads_data = pd.read_excel('Advertising.xlsx')

# Calculate correlation
correlation = Ads_data['TV'].corr(Ads_data['Sales'])

# Create scatter plot
plt.figure(figsize=(5, 5))
sns.scatterplot(data=Ads_data, x='TV', y='Sales')
plt.xlabel('TV Advertising Budget')
plt.ylabel('Sales')
# Optionally, also add as text on the plot
plt.text(0.05, 0.95, f'Correlation = {correlation:.2f}', transform=plt.gca().transAxes,
         fontsize=12, color='blue', ha='left', va='top')
plt.tight_layout()
plt.show()

:::
:::::
::::::

## How do we measure autocorrelation?

</br></br></br>

There are two formal tools for measuring the correlation between observations in a time series:

::: incremental
-   The [***autocorrelation***]{style="color:green;"} function.

-   The [***partial autocorrelation***]{style="color:purple;"} function.
:::

## The autocorrelation function

> Measures the correlation between responses separated by $j$ periods.

For example, consider the autocorrelation between the current temperature and the temperature recorded the day before.

. . .

:::::: center
::::: columns
::: {.column width="50%"}
![](images/correlation_1.png){fig-align="center" width="465" height="288"}
:::

::: {.column width="50%"}
:::
:::::
::::::

## The autocorrelation function

> Measures the correlation between responses separated by $j$ periods.

For example, consider the autocorrelation between the current temperature and the temperature recorded the day before.

:::::: center
::::: columns
::: {.column width="50%"}
![](images/correlation_1.png){fig-align="center" width="465" height="288"}
:::
::: {.column width="50%"}
![](images/correlation_2.png){fig-align="center" width="465" height="288"}
:::
:::::
::::::

## The autocorrelation function

> Measures the correlation between responses separated by $j$ periods.

For example, consider the autocorrelation between the current temperature and the temperature recorded the day before. This is the correlation between these two columns:

![](images/correlation_combined.png){fig-align="center"}
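
The two-column idea can be sketched with **pandas**. The temperatures below are made-up illustrative values, and the names `temps` and `temp_lag1` are hypothetical.

```python
import pandas as pd

# Hypothetical daily temperatures (illustrative values, not real data)
temps = pd.DataFrame({"temp": [21.0, 22.5, 23.1, 22.0, 20.4, 19.8, 21.2, 22.9]})

# Second column: the temperature recorded the day before (lag 1)
temps["temp_lag1"] = temps["temp"].shift(1)

# Correlation between the two columns; the first row is skipped
# because its lagged value is missing
lag1_corr = temps["temp"].corr(temps["temp_lag1"])

print(temps)
print(f"Lag-1 correlation: {lag1_corr:.2f}")
```

The value shown in an autocorrelation plot can differ slightly from this one, because the plot's estimator uses the mean of the whole series for both columns.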


## Example 1

Let's consider again the dataset in the file "Amtrak.xlsx". The file contains records of Amtrak passenger numbers from January 1991 to March 2004.


In [None]:
#| echo: false
#| output: true
#| fig-align: center

Amtrak_data = pd.read_excel('Amtrak.xlsx')
Amtrak_data.head()

##

</br>


In [None]:
#| echo: false
#| output: true
#| fig-align: center

plt.figure(figsize=(10, 6))
sns.lineplot(x='Month', y='Ridership (in 000s)', data = Amtrak_data)
plt.xlabel('Month')
plt.ylabel('Ridership')
plt.title('Amtrak Ridership Over Time')
plt.show()

## Autocorrelation function

</br>

- The autocorrelation function measures the correlation between responses that are separated by a specific number of periods.

- The autocorrelation function is commonly visualized using a bar chart. 

- The horizontal axis shows the differences (or [*lags*]{style="color:brown;"}) between the periods considered, and the vertical axis shows the correlations between observations at different lags.

## Autocorrelation plot

</br>

In Python, we use the `plot_acf` function from the **statsmodels** library.


In [None]:
#| echo: true
#| output: false
#| fig-align: center

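# Create the axes first and pass them to plot_acf so the figure size is applied
fig, ax = plt.subplots(figsize=(10, 6))
plot_acf(Amtrak_data['Ridership (in 000s)'], lags=25, ax=ax)
ax.set_xlabel("Lag")
ax.set_ylabel("Correlation")
plt.show()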

The `lags` parameter controls the number of periods for which to compute the autocorrelation function.
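
To see what is computed for each lag, here is a minimal sketch of the standard sample autocorrelation estimator written with **numpy**. The helper `sample_acf` and the simulated series are hypothetical and are not the Amtrak data.

```python
import numpy as np

def sample_acf(y, lags):
    """Sample autocorrelations for lags 0, 1, ..., lags."""
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()                 # deviations from the overall mean
    denom = np.sum(dev ** 2)
    return np.array([np.sum(dev[j:] * dev[:len(y) - j]) / denom
                     for j in range(lags + 1)])

# Hypothetical monthly series with a 12-month cycle plus noise
rng = np.random.default_rng(0)
t = np.arange(120)
y = 10 + 3 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.5, size=t.size)

r = sample_acf(y, lags=25)
print(r.shape)             # one value per lag, from 0 to 25
print(np.round(r[:3], 2))  # lag 0 is always 1
```

With `lags=25` the result has 26 values, which correspond to the 26 bars in the plot.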

## The resulting plot

</br>


In [None]:
#| echo: false
#| output: true
#| fig-align: center

fig, ax = plt.subplots(figsize=(10, 6))
plot_acf(Amtrak_data['Ridership (in 000s)'], lags=25, ax=ax)
ax.set_xlabel("Lag")
ax.set_ylabel("Correlation")
plt.show()

##


:::::: center
::::: columns
::: {.column width="50%"}

::: {style="font-size: 80%;"}

</br>

- The autocorrelation plot shows that the responses and those from zero periods ago have a correlation of 1.

- The autocorrelation plot shows that the responses and those from one period ago have a correlation of around 0.45.

- The autocorrelation plot shows that the responses and those from 24 periods ago have a correlation of around 0.5.
:::

:::
::: {.column width="50%"}

</br>

In [None]:
#| echo: false
#| output: true
#| fig-align: center

fig, ax = plt.subplots(figsize=(4, 4))
plot_acf(Amtrak_data['Ridership (in 000s)'], lags=25, ax=ax)
ax.set_xlabel("Lag")
ax.set_ylabel("Correlation")
plt.show()

:::
:::
:::

## Autocorrelation patterns

</br>

- **A strong autocorrelation (positive or negative) at a lag** $j$ **greater than 1 and at its multiples** ($2j, 3j, \ldots$) typically reflects a cyclical pattern or seasonality.

- **Positive lag-1 autocorrelation** describes a series in which consecutive values *generally* move in the same direction.

- **Negative lag-1 autocorrelation** reflects oscillations in the series, where high values (*generally*) are immediately followed by low values and vice versa.
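
The three patterns can be reproduced with simulated data. This is only a sketch: the series `smooth`, `zigzag`, and `seasonal` are hypothetical, and the coefficients 0.8 and -0.8 are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
noise = rng.normal(size=n)

smooth = np.zeros(n)   # consecutive values move together
zigzag = np.zeros(n)   # high values are followed by low values
for t in range(1, n):
    smooth[t] = 0.8 * smooth[t - 1] + noise[t]
    zigzag[t] = -0.8 * zigzag[t - 1] + noise[t]

# Pattern that repeats every 12 periods, plus a little noise
seasonal = np.sin(2 * np.pi * np.arange(n) / 12) + 0.2 * noise

smooth_r1 = pd.Series(smooth).autocorr(lag=1)
zigzag_r1 = pd.Series(zigzag).autocorr(lag=1)
seasonal_r12 = pd.Series(seasonal).autocorr(lag=12)
seasonal_r24 = pd.Series(seasonal).autocorr(lag=24)

print(f"smooth, lag 1:    {smooth_r1:.2f}")     # positive
print(f"zigzag, lag 1:    {zigzag_r1:.2f}")     # negative
print(f"seasonal, lag 12: {seasonal_r12:.2f}")  # strong at the cycle length
print(f"seasonal, lag 24: {seasonal_r24:.2f}")  # and at its multiple
```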

## More about the autocorrelation function

Consider the problem of predicting the average price of a kilo of avocado this month.

For this, we have the average price from last month and the month before that.

![](images/Avocado1.png){fig-align="center"}

## 

</br></br>

The autocorrelation function for $Y_t$ and $Y_{t-2}$ includes the direct and indirect effect of $Y_{t-2}$ on $Y_t$.

![](images/Avocado3.png){fig-align="center"}

## Partial autocorrelation function

> Measures the correlation between responses that are separated by $j$ periods, **excluding correlation** due to responses separated by intervening periods.

![](images/Avocado4.png){fig-align="center"}

## 

In technical terms, the partial autocorrelation at lag 2 is obtained by fitting the following linear regression model:

$$\hat{Y}_t = \hat{\beta}_1 Y_{t-1} + \hat{\beta}_2 Y_{t-2}$$
where:

- $\hat{Y}_{t}$ is the predicted response at the current time ($t$).
- $\hat{\beta}_1$ is the *direct effect* of $Y_{t-1}$ on predicting $Y_{t}$.
- $\hat{\beta}_2$ is the *direct effect* of $Y_{t-2}$ on predicting $Y_{t}$.

The partial autocorrelation between $Y_t$ and $Y_{t-2}$ is equal to $\hat{\beta}_2$. 
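
A minimal sketch of this regression with **numpy**, using a simulated series in which $Y_{t-2}$ has no direct effect on $Y_t$. The data are hypothetical, and the fit includes an intercept column, which the equation above omits; that is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
y = np.zeros(n)
for t in range(1, n):
    # Hypothetical series: only the previous value enters directly
    y[t] = 0.6 * y[t - 1] + rng.normal()

# Regress Y_t on Y_{t-1} and Y_{t-2} (first column is the intercept)
X = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])
b0, b1, b2 = np.linalg.lstsq(X, y[2:], rcond=None)[0]

# Ordinary lag-2 correlation, which also includes the indirect effect
lag2_corr = np.corrcoef(y[2:], y[:-2])[0, 1]

print(f"beta_1 (direct effect of lag 1): {b1:.2f}")
print(f"beta_2 (direct effect of lag 2): {b2:.2f}")
print(f"lag-2 autocorrelation:           {lag2_corr:.2f}")
```

The lag-2 autocorrelation is clearly positive because $Y_{t-2}$ influences $Y_t$ through $Y_{t-1}$, while $\hat{\beta}_2$ stays close to zero.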

##

- The partial autocorrelation function is visualized using a graph similar to that for autocorrelation.

- The horizontal axis shows the differences (or *lags*) between the periods considered, and the vertical axis shows the partial correlations between observations at different lags.

In Python, we use the `plot_pacf` function from **statsmodels**.


In [None]:
#| echo: true
#| output: false
#| fig-align: center

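# Create the axes first and pass them to plot_pacf so the figure size is applied
fig, ax = plt.subplots(figsize=(10, 6))
plot_pacf(Amtrak_data['Ridership (in 000s)'], lags=25, ax=ax)
ax.set_xlabel("Lag")
ax.set_ylabel("Correlation")
plt.show()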

## 

</br>

:::::: center
::::: columns
::: {.column width="50%"}

::: {style="font-size: 80%;"}

</br>

- The partial autocorrelation plot shows that the responses and those from one period ago have a correlation of around 0.45. *This is the same value as in the autocorrelation plot.*

- The partial autocorrelation plot shows that the responses and those from two periods ago have a correlation near 0.
:::
:::
::: {.column width="50%"}

In [None]:
#| echo: false
#| output: true
#| fig-align: center

fig, ax = plt.subplots(figsize=(10, 6))
plot_pacf(Amtrak_data['Ridership (in 000s)'], lags=25, ax=ax)
ax.set_xlabel("Lag")
ax.set_ylabel("Correlation")
plt.show()

:::
:::
:::

# The ARIMA Model

## Autoregressive models

*Autoregressive models* are a type of linear regression model that directly incorporates the autocorrelation of the time series to predict the current response.

Their main characteristic is that the predictors of the current value of the series are its past values.

- An autoregressive model of order 2 has the mathematical form: $\hat{Y}_t = \hat{\beta}_0 + \hat{\beta}_1 Y_{t-1} + \hat{\beta}_2 Y_{t-2}.$

- An order 3 model looks like this: $\hat{Y}_t = \hat{\beta}_0 + \hat{\beta}_1 Y_{t-1} + \hat{\beta}_2 Y_{t-2} + \hat{\beta}_3 Y_{t-3}.$
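
As a rough sketch, an autoregressive model of any order can be fitted by least squares with **numpy**. The helpers `fit_ar` and `predict_next` are hypothetical names, and the series is simulated with known coefficients so the estimates can be checked.

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares estimates (b0, b1, ..., bp) of an order-p autoregressive model."""
    y = np.asarray(y, dtype=float)
    # Column k holds Y_{t-k} for t = p, ..., n-1
    lags = [y[p - k:len(y) - k] for k in range(1, p + 1)]
    X = np.column_stack([np.ones(len(y) - p)] + lags)
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef

def predict_next(y, coef):
    """One-step-ahead prediction from the last p observed values."""
    p = len(coef) - 1
    recent = np.asarray(y, dtype=float)[-1:-p - 1:-1]  # Y_{t-1}, ..., Y_{t-p}
    return coef[0] + np.dot(coef[1:], recent)

# Hypothetical series generated by an order-2 model
rng = np.random.default_rng(3)
n = 1000
y = np.full(n, 25.0)
for t in range(2, n):
    y[t] = 5 + 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.normal()

coef = fit_ar(y, p=2)
print(np.round(coef, 2))      # estimates of beta_0, beta_1, beta_2
print(predict_next(y, coef))  # prediction for the next period
```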

## ARIMA models

</br>

A widely used class of models that extends autoregressive models is [***ARIMA***]{style="color:purple;"} ([*Autoregressive Integrated Moving Average*]{style="color:purple;"}).

An ARIMA model is composed of three elements:

::: incremental
1. Differencing operators (*integrated*).
2. Autoregressive terms (*autoregressive*).
3. Stochastic terms (*moving average*).
:::

# The SARIMA Model

## Seasonality

**Seasonality** is a repetitive or cyclical behavior that occurs at a constant frequency.

It can be identified from the plot of the time series or by using the ACF and PACF.

To do this, the trend must be removed first.

## In Python
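
One possible way to do this, sketched with simulated monthly data: remove the trend with a first difference and then look at the autocorrelations at a few lags. The series, and the use of a first difference to remove the trend, are assumptions of this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
t = np.arange(144)
# Hypothetical monthly series: upward trend + 12-month pattern + noise
y = pd.Series(0.5 * t + 4 * np.sin(2 * np.pi * t / 12)
              + rng.normal(scale=0.5, size=t.size))

# Remove the trend with a first difference
dy = y.diff().dropna()

# Autocorrelations before and after removing the trend
lags = [1, 6, 12, 24]
acf_raw = {j: y.autocorr(lag=j) for j in lags}
acf_diff = {j: dy.autocorr(lag=j) for j in lags}

for j in lags:
    print(f"lag {j:2d}: raw = {acf_raw[j]:.2f}, differenced = {acf_diff[j]:.2f}")
```

Before differencing, the trend keeps the autocorrelation high at every lag. After differencing, the large positive values appear at lags 12 and 24, which points to $S = 12$.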

## SARIMA model

The [S]{style="color:orange;"}ARIMA model ([*Seasonal*]{style="color:orange;"} Autoregressive Integrated Moving Average) is an extension of the ARIMA model that captures seasonal patterns.

The SARIMA model has three additional elements for modeling the seasonality in the time series:

1.  Differencing operators (*integrated*) for seasonality.
2.  Autoregressive terms (*autoregressive*) for seasonality.
3.  Stochastic or moving average terms (*moving average*) for seasonality.

## Notation

Seasonality in a time series is a regular pattern of changes that repeats every [$S$ time periods]{style="color:orange;"}, where $S$ is the number of time periods until the pattern repeats again.

For example, monthly data are seasonal when high values tend to occur in some particular months and low values tend to occur in other particular months.

In this case, $S=12$ (months per year) is the span of the periodic seasonal behavior. For quarterly data, $S=4$ time periods per year.

## Seasonal differencing

It is the difference between a response and a response at a lag that is a multiple of $S$.

For example, with monthly data ($S=12$):

-   A seasonal difference of level 1 is $Y_{t} - Y_{t-12}$.
-   A seasonal difference of level 2 is $(Y_t - Y_{t-12}) - (Y_{t-12} - Y_{t-24})$.

Seasonal differencing removes the seasonal trend and can also remove a seasonal random-walk type of non-stationarity.
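
A short sketch of both differences with **pandas**, using a hypothetical series made of a linear trend plus a fixed 12-month pattern:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: linear trend plus a fixed 12-month pattern
pattern = np.array([5, 3, 0, -2, -4, -5, -4, -2, 0, 2, 4, 6], dtype=float)
t = np.arange(48)
y = pd.Series(2.0 * t + pattern[t % 12])

# Level-1 seasonal difference: Y_t - Y_{t-12}
d1 = y.diff(12)

# Level-2 seasonal difference: (Y_t - Y_{t-12}) - (Y_{t-12} - Y_{t-24})
d2 = d1.diff(12)

print(d1.dropna().unique())  # the repeating pattern cancels out
print(d2.dropna().unique())  # the remaining constant cancels out as well
```

Each seasonal difference loses the first 12 observations, so `d1` starts at month 13 and `d2` at month 25.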

## AR and MA terms for seasonality

In SARIMA, the seasonal AR and MA terms predict the current response ($Y_t$) using responses and errors at lags that are multiples of $S$.

For example, with monthly data ($S = 12$):

-   The seasonal AR model of order 1 would use $Y_{t-12}$ to predict $Y_{t}$.
-   The seasonal AR model of order 2 would use $Y_{t-12}$ and $Y_{t-24}$ to predict $Y_{t}$.
-   The seasonal MA model of order 1 would use the stochastic term $a_{t-12}$ as a predictor.
-   The seasonal MA model of order 2 would use the stochastic terms $a_{t-12}$ and $a_{t-24}$ as predictors.
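
The seasonal AR predictors can be built by shifting the series by 12 and 24 periods. The sketch below does this for a simulated series and fits the coefficients by least squares. The data and column names are hypothetical. Seasonal MA terms are left out because the errors $a_{t-12}$ are not observed directly and cannot be obtained with a simple shift.

```python
import numpy as np
import pandas as pd

S = 12
rng = np.random.default_rng(4)
n = 600
values = np.zeros(n)
for t in range(S, n):
    # Hypothetical seasonal AR(1) process: Y_t depends on Y_{t-12} only
    values[t] = 0.7 * values[t - S] + rng.normal()

frame = pd.DataFrame({"y": values})
frame["y_lag12"] = frame["y"].shift(S)      # predictor of a seasonal AR(1)
frame["y_lag24"] = frame["y"].shift(2 * S)  # extra predictor of a seasonal AR(2)
frame = frame.dropna()

# Least-squares fit of Y_t on an intercept, Y_{t-12} and Y_{t-24}
X = np.column_stack([np.ones(len(frame)), frame["y_lag12"], frame["y_lag24"]])
coef, *_ = np.linalg.lstsq(X, frame["y"].to_numpy(), rcond=None)

print(np.round(coef, 2))  # intercept, coefficient of lag 12, coefficient of lag 24
```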

# In Python

# [Return to main page](https://alanrvazquez.github.io/TEC-IN1002B-Website/)