https://medium.com/data-science/a-step-by-step-guide-to-calculating-autocorrelation-and-partial-autocorrelation-8c4342b784e8

# A Step-by-Step Guide to Calculating Autocorrelation and Partial Autocorrelation

## Partial Autocorrelation

Partial autocorrelation is a bit harder to understand. Once again, it describes the relationship between a time series and its lagged counterparts, but this time with all the intermediate effects removed. We could use the partial autocorrelation to determine, for example, how this month’s number of passengers is directly related to the number from 6 months ago. In this example, when calculating the PACF we would remove the information on how the values from 6 months ago impact the ones from 5 months ago, then the effect between the 5th lag and the 4th, and so on up until the most recent month.

This also means that the first partial autocorrelation is the same as the first autocorrelation, as there are no intermediate effects to be removed.

To make things more interesting, there are quite a lot of approaches to calculating partial autocorrelation. The function in statsmodels offers 3 methods: the Yule-Walker approach, the OLS (ordinary least squares) approach, and the Levinson-Durbin recursion approach. Additionally, there are some alternative variants for each of those (there are 7+ different combinations in total).

In this article, we focus on the OLS approach, which is based on autoregressive (AR) models. The details of the algorithm are described below.

First, we create a matrix (or DataFrame) of lagged values up until the N-th one. At this point, we do not remove any observations, even though some of the lagged values are technically not available (due to the way lags are created). Then, for each iteration (denoted as k) between 1 and N:

- we fit a linear regression model (with intercept) using the original series as the target and the lagged series up until the k-th lag as the features. In this step, we use observations starting with the k-th one, as not all of the observations are available for the lag features.
- the coefficient of the k-th variable is the k-th partial autocorrelation coefficient.

The idea behind this approach is that the variance explained by the intermediate time points is excluded from the coefficient of the k-th lag. Below we describe the differences between the two OLS methods available in statsmodels. Feel free to skip that description if it is too technical, or you just want to get to the hands-on part.

    To be precise, we are describing what statsmodels calls the “efficient” OLS method, which is also the default OLS method. The “inefficient” method is quite similar, with some small tweaks. First, before creating the matrix of lagged values, we demean the original series. Then, after creating the matrix of lags, we remove the first N observations (in the efficient approach, we iteratively decreased the number of observations instead of removing more than necessary at the very beginning). This has a clear practical implication. Let’s assume that you use the inefficient method twice, first to get the coefficients for lags up until the 2nd, and then repeat the process for lags up until the 4th. In such a scenario, the 2nd partial autocorrelation coefficient obtained in the first calculation will not be equal to the corresponding 2nd coefficient from the latter calculation. That is because a different number of observations is used for the underlying regressions. You can see the example in the accompanying Notebook (link at the end of the article).
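The practical implication can be checked with a small sketch. The code below is a plain NumPy re-implementation of the two variants as they are described above, not the statsmodels source. The helper names `lag_matrix` and `pacf_ols_sketch` and the simulated AR(1) series are assumptions made purely for illustration.

```python
import numpy as np


def lag_matrix(x, n_lags):
    """Column j-1 holds x shifted by j steps; unavailable entries stay 0."""
    lags = np.zeros((len(x), n_lags))
    for j in range(1, n_lags + 1):
        lags[j:, j - 1] = x[:len(x) - j]
    return lags


def pacf_ols_sketch(x, n_lags, efficient=True):
    """OLS-based PACF following the two variants described in the text."""
    x = np.asarray(x, dtype=float)
    if not efficient:
        x = x - x.mean()  # demean once instead of fitting an intercept
    lags = lag_matrix(x, n_lags)
    values = [1.0]
    for k in range(1, n_lags + 1):
        # efficient: drop only the first k rows; inefficient: always drop n_lags
        start = k if efficient else n_lags
        features = lags[start:, :k]
        if efficient:
            features = np.column_stack([np.ones(len(features)), features])
        coefs = np.linalg.lstsq(features, x[start:], rcond=None)[0]
        values.append(coefs[-1])  # coefficient of the k-th lag
    return np.array(values)


# simulated AR(1) series, used only as stand-in data
rng = np.random.default_rng(0)
noise = rng.normal(size=200)
series = np.zeros(200)
for t in range(1, 200):
    series[t] = 0.7 * series[t - 1] + noise[t]

eff_2 = pacf_ols_sketch(series, 2, efficient=True)
eff_4 = pacf_ols_sketch(series, 4, efficient=True)
ineff_2 = pacf_ols_sketch(series, 2, efficient=False)
ineff_4 = pacf_ols_sketch(series, 4, efficient=False)

print("efficient, lag 2:  ", eff_2[2], eff_4[2])      # same regression both times
print("inefficient, lag 2:", ineff_2[2], ineff_4[2])  # different rows are used
```

In the efficient variant the lag-2 regression uses the same rows no matter how many lags are requested, so the two printed values agree. In the inefficient variant the number of dropped rows depends on the requested number of lags, so the values differ slightly.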

Okay, that is enough of the technicalities. Let’s calculate the partial autocorrelations for the airline passengers time series. As before, we start by creating the benchmark using the pacf function from statsmodels.

pacf(df, nlags=10, method="ols")

Which generates the following array:

array([ 1.        ,  0.95893198, -0.32983096,  0.2018249 ,  0.14500798,
        0.25848232, -0.02690283,  0.20433019,  0.15607896,  0.56860841,
        0.29256358])

Then, we calculate the partial autocorrelation coefficients using the steps described above.

In [None]:
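import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.tsatools import lagmat

N_LAGS = 10

# the partial autocorrelation at lag 0 is always equal to 1
pacf_list = [1]

# matrix of lagged values; y is assumed to be the passengers series
# (a pandas Series) defined earlier
X = pd.DataFrame(lagmat(y, N_LAGS))
X.columns = [f"lag_{lag+1}" for lag in range(N_LAGS)]

for k in range(1, N_LAGS + 1):
    # regress y on lags 1..k, skipping the first k rows with unavailable lags
    fitted_model = LinearRegression().fit(X.iloc[k:, :k],
                                          y.iloc[k:])
    # the coefficient of the k-th lag is the k-th partial autocorrelation
    pacf_list.append(fitted_model.coef_[-1])

np.array(pacf_list)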

The code should be pretty self-explanatory, given it is almost a literal translation of the written steps into Python code. The only thing that might be new is the lagmat function from statsmodels. Instead of creating the lagged series manually — as we have done before in the ACF example — we can use this handy function. It has quite a few convenient features, for example, it allows us to automatically remove the first few observations that contain invalid values. For more information, please have a look at the documentation.
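As a small illustration of the trimming behaviour, the sketch below calls `lagmat` on a toy array. The input values are made up for the example; `trim="both"` is the option that drops the rows for which not all lags are available.

```python
import numpy as np
from statsmodels.tsa.tsatools import lagmat

toy = np.arange(1.0, 6.0)  # toy input: 1, 2, 3, 4, 5

# default trim="forward": the leading rows are kept and padded with zeros
padded = lagmat(toy, maxlag=2)

# trim="both": rows without a complete set of lags are removed
trimmed = lagmat(toy, maxlag=2, trim="both")

print(padded.shape)  # one row per observation, one column per lag
print(trimmed)       # column 0 is lag 1, column 1 is lag 2
```

The code in this article relies on the padded version and skips the first k rows manually inside the loop, which is why it never passes a `trim` argument.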

Our code generates the following partial autocorrelation coefficients, which are equal to the ones we generated before with the pacf function.

array([ 1.        ,  0.95893198, -0.32983096,  0.2018249 ,  0.14500798,
        0.25848232, -0.02690283,  0.20433019,  0.15607896,  0.56860841,
        0.29256358])

Note: In this article, you can find a step-by-step introduction to another method of calculating the partial autocorrelation coefficient, this time based on the correlation of residuals.
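For readers who want a feel for the residual-based idea without leaving this page, here is a minimal sketch. It is not the code from the linked article: the function name `pacf_residuals` and the simulated AR(1) series are assumptions. The idea is to regress both the current values and the k-th lag on the intermediate lags, then correlate the two sets of residuals.

```python
import numpy as np


def pacf_residuals(x, k):
    """Lag-k partial autocorrelation as a correlation of regression residuals."""
    x = np.asarray(x, dtype=float)
    target = x[k:]            # x_t
    lagged = x[:len(x) - k]   # x_{t-k}
    if k == 1:
        # no intermediate lags to remove
        return np.corrcoef(target, lagged)[0, 1]
    # intermediate lags 1..k-1, aligned with target
    inter = [x[k - j:len(x) - j] for j in range(1, k)]
    design = np.column_stack([np.ones(len(target))] + inter)
    res_target = target - design @ np.linalg.lstsq(design, target, rcond=None)[0]
    res_lagged = lagged - design @ np.linalg.lstsq(design, lagged, rcond=None)[0]
    return np.corrcoef(res_target, res_lagged)[0, 1]


# simulated AR(1) series, used only as stand-in data
rng = np.random.default_rng(1)
noise = rng.normal(size=500)
series = np.zeros(500)
for t in range(1, 500):
    series[t] = 0.7 * series[t - 1] + noise[t]

print(pacf_residuals(series, 1))  # close to the AR coefficient
print(pacf_residuals(series, 2))  # close to zero for an AR(1) process
```

The values from this approach will generally be close to, but not identical to, the OLS coefficients computed above, because the two estimators are defined differently.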

## Takeaways

- the autocorrelation function describes the relationship between a time series and its lagged counterpart,
- the partial autocorrelation describes a direct relationship, that is, it removes the effects of the intermediate lagged values,
- there are multiple ways of calculating the partial autocorrelation coefficients, perhaps the simplest one is the one based on estimating autoregressive (AR) models using OLS.

You can find the code used for this article on my GitHub. Also, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments.


## References

- https://www.statsmodels.org/stable/index.html
- Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on 2022–01–28.

# ====================================================================
# ====================================================================
# ====================================================================
# ====================================================================
# ====================================================================


https://medium.com/data-science/partial-autocorrelation-for-time-series-481a9cfa7526

# Partial Autocorrelation for Time Series Analysis

subtitle: Describing what partial autocorrelation is and its importance in time series analysis

## Introduction

In my previous post we discussed the concept of autocorrelation:

LINK

Autocorrelation is the correlation of a random variable or data series with itself at different points in time (lags). It conveys how similar the data are at different lags, enabling us to deduce some interesting features of our time series, such as seasonality and trend.
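As a quick refresher, the usual sample estimate of the autocorrelation at a given lag can be written in a few lines. This is a minimal sketch with a made-up toy series; the function name `autocorr` is an assumption for illustration.

```python
import numpy as np


def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()  # deviations from the overall mean
    if lag == 0:
        return 1.0
    # products of deviations that are `lag` steps apart, scaled by the total variation
    return np.sum(dev[lag:] * dev[:-lag]) / np.sum(dev ** 2)


toy = [1, 2, 3, 4, 5]
print(autocorr(toy, 1))  # 0.4
```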

    If you want to learn more about autocorrelation, make sure to check out my post that I linked above!

Most people have heard about autocorrelation; however, you may not know of its less popular cousin, the partial autocorrelation function (PACF). In this short and sweet post I want to describe what the PACF is, why it is useful, and go through a simple example of applying it in Python.

## What is Partial Autocorrelation?

We can begin by explaining partial correlation. This is the correlation between two random variables whilst controlling for the effect of another (or more) random variable that affects the original variables we are correlating.

Let’s say we have three random variables, X, Y and Z. The partial correlation between X and Y, excluding the effects of Z, is mathematically:

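$$
r_{XY \cdot Z} = \frac{r_{XY} - r_{XZ}\, r_{YZ}}{\sqrt{\left(1 - r_{XZ}^{2}\right)\left(1 - r_{YZ}^{2}\right)}}
$$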

Where r is the correlation coefficient that ranges between -1 and 1.
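To make this concrete, here is a minimal sketch that computes the first-order partial correlation from the three pairwise correlations and cross-checks it against the correlation of regression residuals. The simulated variables and the function names are assumptions for illustration only.

```python
import numpy as np


def partial_corr(x, y, z):
    """Partial correlation of x and y given z, from pairwise correlations."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def residual_corr(x, y, z):
    """Same quantity, obtained by regressing x and y on z and correlating residuals."""
    design = np.column_stack([np.ones(len(z)), z])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]


# x and y are related only through their common driver z
rng = np.random.default_rng(0)
z = rng.normal(size=500)
x = z + rng.normal(size=500)
y = z + rng.normal(size=500)

print(np.corrcoef(x, y)[0, 1])  # clearly positive
print(partial_corr(x, y, z))    # close to zero once z is controlled for
print(residual_corr(x, y, z))   # matches the formula-based value
```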

Partial autocorrelation is then simply the partial correlation of a time series with itself at two different points in time. Taking it one step further, it is the correlation between the time series at two different lags after removing the effect of any intermediate lags. For example, the partial autocorrelation for a lag of 2 is only the correlation that lag 1 didn’t explain.
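The lag-2 case can be worked out by hand. For a stationary series, the correlation between neighbouring points is the same everywhere, so the partial correlation between the series at time t and at time t-2, controlling for time t-1, reduces to an expression in the lag-1 and lag-2 autocorrelations only. The sketch below uses made-up autocorrelation values; the function name `pacf_lag2` is an assumption.

```python
def pacf_lag2(r1, r2):
    """Lag-2 partial autocorrelation from the lag-1 (r1) and lag-2 (r2) autocorrelations."""
    return (r2 - r1 ** 2) / (1 - r1 ** 2)


# if lag 2 is fully explained by lag 1 (r2 == r1**2), nothing is left over
print(pacf_lag2(0.7, 0.49))

# if lag 2 is more correlated than lag 1 alone would imply, the remainder is positive
print(pacf_lag2(0.7, 0.60))
```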

## Why is it Useful?

Unlike autocorrelation, partial autocorrelation doesn’t have as many uses in time series analysis. However, its main and very important use is in building forecasting models.

The PACF is used to estimate the number/order of autoregressive components when fitting Autoregressive, ARMA or ARIMA models as defined by the Box-Jenkins procedure. These models are among the most widely used and often provide strong results when training a forecasting model.

    In future blogs I plan to explain the Autoregressive, ARMA and ARIMA models. Until then, refer to the links above to learn about these respective algorithms.

Let’s now go through an example of applying the PACF in Python.

## Example in Python

We will work with the airline passenger volumes dataset:

    Data sourced from Kaggle with a CC0 licence.

In [None]:
# Import packages
import plotly.express as px
import pandas as pd

# Read in the data
data = pd.read_csv('AirPassengers.csv')

# Plot the data
fig = px.line(data, x='Month', y='#Passengers',
              labels={'#Passengers': 'Passengers', 'Month': 'Date'})

fig.update_layout(template="simple_white", font=dict(size=18),
                  title_text='Airline Passengers', width=650,
                  title_x=0.5, height=400)
fig.show()

There is a clear trend in the data and an obvious yearly seasonality.

The statsmodels Python module provides a plot_pacf function to plot the PACF at different lags. This kind of plot is also known as a correlogram:

In [None]:
# Import packages
from statsmodels.graphics.tsaplots import plot_pacf
import matplotlib.pyplot as plt

# Plot partial autocorrelation
plt.rc("figure", figsize=(11,5))
plot_pacf(data['#Passengers'], method='ywm')
plt.xlabel('Lags', fontsize=18)
plt.ylabel('Correlation', fontsize=18)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.title('Partial Autocorrelation Plot', fontsize=20)
plt.tight_layout()
plt.show()

We see that lag 1 is highly correlated and there are other highly correlated lags later in time, which are probably due to seasonal effects.

However, from this plot it is quite unclear how many autoregressors we would choose if we were building a forecasting model. Therefore, it is often recommended to simply carry out a grid-search over the possible parameters using modelling packages such as auto arima.

    The blue region is where lags are not statistically significant. We typically choose the number of autoregressors by seeing how many lags fall outside the blue region.
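The rule of thumb behind the shaded region can be sketched numerically. The snippet below assumes the common large-sample approximation in which a partial autocorrelation is treated as significant at roughly the 5% level when its absolute value exceeds 1.96 divided by the square root of the number of observations. The exact band drawn by a plotting function depends on its settings. The PACF values and the function name `significant_lags` are made up for illustration.

```python
import math


def significant_lags(pacf_values, n_obs, z=1.96):
    """Lags (excluding lag 0) whose PACF lies outside the +/- z / sqrt(n) band."""
    band = z / math.sqrt(n_obs)
    return [lag for lag, value in enumerate(pacf_values)
            if lag > 0 and abs(value) > band]


example_pacf = [1.0, 0.60, 0.30, 0.05, -0.02]  # made-up values
print(significant_lags(example_pacf, n_obs=100))  # [1, 2]
```

With these hypothetical values, only lags 1 and 2 fall outside the band, which would point towards two autoregressive terms.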

## Summary and Further Thoughts

In this post we have gained an understanding of partial autocorrelation. This is the correlation of a time series against a lagged version of itself, excluding the effect of any intermediate lags. Its primary use is in estimating the number of autoregressive components for forecasting models such as ARMA and ARIMA.

The full code used in this post is available at my GitHub here:

LINK


# ====================================================================
# ====================================================================
# ====================================================================
# ====================================================================
# ====================================================================


https://medium.com/data-science/time-series-from-scratch-autocorrelation-and-partial-autocorrelation-explained-1dd641e3076f

## Partial autocorrelation — Theory and implementation

This one is a bit tougher to understand. It does the same as regular autocorrelation — shows the correlation of a sequence with itself lagged by some number of time units. But there’s a twist. Only the direct effect is shown, and all intermediary effects are removed.

For example, you want to know the direct relationship between the number of passengers today and 12 months ago. You don’t care about anything in between.

The number of passengers 12 months ago affects the number of passengers 11 months ago, and the whole chain repeats until the most recent period. These indirect effects are removed in partial autocorrelation calculations.

You should also make the time series stationary before calculations.
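One common way to move a trending series closer to stationarity is first-order differencing, that is, replacing each value with its change from the previous period. The column name used below suggests a differenced series, but how it was built is not shown here, so treat this sketch as an assumption. The numbers are illustrative.

```python
import numpy as np

passengers = np.array([112, 118, 132, 129, 121])  # illustrative values

# first-order difference: value at t minus value at t-1 (one observation is lost)
passengers_diff = np.diff(passengers)

print(passengers_diff.tolist())  # [6, 14, -3, -8]
```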

You can use the pacf() function from statsmodels for the calculation:



In [None]:
# Calculate partial autocorrelation
from statsmodels.tsa.stattools import pacf

pacf_values = pacf(df['Passengers_Diff'])

Here’s what the values look like:

IMAGE

The correlation value at lag 12 has dropped to 0.61, indicating the direct relationship is a bit weaker. Let’s take a look at the results graphically to confirm these are still significant:

In [None]:
# Plot partial autocorrelation
from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(df['Passengers_Diff'], lags=30);

Here’s what it looks like:

IMAGE

To conclude — the lag 12 is still significant, but the lag at 24 isn’t. A couple of lags before 12 are negatively correlated to the original time series. Take some time to think about why.

There’s still one important question remaining — how do you interpret ACF and PACF plots for forecasting? Let’s answer that next.

## How to interpret ACF and PACF plots

Time series models you’ll soon learn about, such as autoregression (AR), moving average (MA), or their combination (ARMA), require you to specify one or more parameters. Plausible values for these can be read off ACF and PACF plots.

In a nutshell:

- If the ACF plot declines gradually and the PACF drops instantly, use an autoregressive (AR) model.
- If the ACF plot drops instantly and the PACF declines gradually, use a moving average (MA) model.
- If both ACF and PACF decline gradually, combine autoregressive and moving average models (ARMA).
- If both ACF and PACF drop instantly (no significant lags), it’s likely you won’t be able to model the time series.
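The first two patterns can be reproduced from theory. The sketch below takes the known autocorrelations of an AR(1) and an MA(1) process and converts them to partial autocorrelations with the Durbin-Levinson recursion. The parameter values (0.7 for both) and the function name `pacf_from_acf` are assumptions chosen for illustration.

```python
def pacf_from_acf(rho, n_lags):
    """Durbin-Levinson recursion; rho[k] is the autocorrelation at lag k."""
    pacf = [1.0]
    prev = []  # AR coefficients of the order k-1 model
    for k in range(1, n_lags + 1):
        num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # update the lower-order coefficients, then append the new last one
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


N_LAGS = 5
phi, theta = 0.7, 0.7  # illustrative parameter values

# AR(1): the autocorrelation phi**k declines gradually
acf_ar1 = [phi ** k for k in range(N_LAGS + 1)]
# MA(1): only the lag-1 autocorrelation is non-zero
acf_ma1 = [1.0, theta / (1 + theta ** 2)] + [0.0] * (N_LAGS - 1)

pacf_ar1 = pacf_from_acf(acf_ar1, N_LAGS)
pacf_ma1 = pacf_from_acf(acf_ma1, N_LAGS)

print([round(v, 3) for v in pacf_ar1])  # drops to zero after lag 1
print([round(v, 3) for v in pacf_ma1])  # shrinks gradually, alternating in sign
```

For the AR(1) process the ACF decays while the PACF cuts off after lag 1. For the MA(1) process it is the other way around: the ACF cuts off after lag 1 while the PACF decays.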

Still, reading ACF and PACF plots is challenging, and you’re far better off using grid search to find optimal parameter values. An optimal parameter combination has the lowest error (such as MAPE) or the lowest value of a general quality estimator (such as AIC). We’ll cover time series evaluation metrics soon, so stay tuned.
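To give a flavour of what such a search looks like, here is a minimal sketch that picks the order of a pure autoregressive model by AIC. It is a simplified stand-in, not a full ARMA grid search: the models are fitted by ordinary least squares in NumPy, the AIC is the Gaussian version up to an additive constant, and the simulated AR(2) series and the function name `ar_aic` are assumptions for illustration.

```python
import numpy as np


def ar_aic(x, p, max_p):
    """AIC of an AR(p) model fitted by OLS, using the same rows for every p."""
    y = x[max_p:]
    # intercept plus lags 1..p, aligned with y
    cols = [np.ones(len(y))] + [x[max_p - j:len(x) - j] for j in range(1, p + 1)]
    design = np.column_stack(cols)
    coefs = np.linalg.lstsq(design, y, rcond=None)[0]
    rss = np.sum((y - design @ coefs) ** 2)
    n = len(y)
    return n * np.log(rss / n) + 2 * (p + 1)


# simulated AR(2) series, used only as stand-in data
rng = np.random.default_rng(42)
noise = rng.normal(size=500)
series = np.zeros(500)
for t in range(2, 500):
    series[t] = 0.6 * series[t - 1] - 0.3 * series[t - 2] + noise[t]

MAX_P = 6
scores = {p: ar_aic(series, p, MAX_P) for p in range(MAX_P + 1)}
best_p = min(scores, key=scores.get)

print(scores)
print("order with the lowest AIC:", best_p)
```

Every candidate is scored on the same observations (the first `MAX_P` rows are always skipped), so the AIC values are comparable across orders.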

## Conclusion

And there you have it — autocorrelation and partial autocorrelation in a nutshell. Both functions and plots help analyze time series data, but we’ll mostly rely on brute-force parameter finding methods for forecasting. It’s much easier to do a grid search than to look at charts.

Both ACF and PACF require stationary time series. We’ve only covered stationarity briefly for now, but that will change in the following article. Stay tuned to learn everything about stationarity, stationarity tests, and testing automation.

Thanks for reading.