https://medium.com/data-science/autocorrelation-for-time-series-analysis-86e68e631f77

# Autocorrelation For Time Series Analysis

## Introduction

In time series analysis we often make inferences about the past to produce forecasts about the future. In order for this process to be successful, we must diagnose our time series thoroughly to find all its ‘nooks and crannies.’

One such diagnostic method is autocorrelation. It helps us detect certain features in our series so that we can choose the most suitable forecasting model for our data.

In this short post I want to go over what autocorrelation is, why it is useful, and how to apply it to a simple dataset in Python.


## What Is Autocorrelation?

Autocorrelation is just the correlation of the data with itself. So, instead of measuring the correlation between two random variables, we are measuring the correlation of a random variable with itself. Hence the name auto-correlation.

    Correlation is how strongly two variables are related to each other. If the value is 1, the variables are perfectly positively correlated, -1 they are perfectly negatively correlated and 0 there is no correlation.

For time series, the autocorrelation is the correlation between the values of the series at two different points in time; the gap between those points is known as the lag. In other words, we are measuring the time series against some lagged version of itself.

Mathematically, autocorrelation is calculated as:

$$
r_k = \frac {\sum_{t=k+1}^{N} (y_t - \overline{y}) (y_{t-k} - \overline{y})} {\sum_{t=1}^{N}(y_t - \overline{y})^2}
$$

Where N is the length of the time series y and k is the specified lag of the time series. So, when calculating r_1 we are computing the correlation between y_t and y_{t-1}.

    The autocorrelation between y_t and y_t would be 1 as they are identical.
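As a minimal sketch of the formula above, the function below computes r_k directly with NumPy. The name `sample_autocorr` and the five-point toy series are illustrative assumptions, chosen so the arithmetic can be checked by hand.

```python
import numpy as np

def sample_autocorr(y, k):
    """Sample autocorrelation r_k of a 1-D series at lag k (0 <= k < len(y))."""
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()                      # deviations from the overall mean
    denominator = np.sum(dev ** 2)          # sum over t = 1..N
    # Pair y_t with y_{t-k} for t = k+1..N
    numerator = np.sum(dev[k:] * dev[:len(y) - k])
    return numerator / denominator

toy = [1, 2, 3, 4, 5]                       # toy series, mean 3
r0 = sample_autocorr(toy, 0)
r1 = sample_autocorr(toy, 1)
print(r0, r1)
```

For this toy series the deviations are -2, -1, 0, 1, 2, so the denominator is 10 and the lag-1 numerator is 2 + 0 + 0 + 2 = 4, giving r_1 = 0.4, while r_0 is 1 as noted above.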

## Why Is It Useful?

As stated above, we use autocorrelation to measure the correlation of a time series with a lagged version of itself. This computation allows us to gain some interesting insight into the characteristics of our series:

    Seasonality: Let's say we find that the correlation at multiples of a certain lag is generally higher than at other lags. This means we have some seasonal component in our data. For example, if we have daily data and we find that the correlation at every lag that is a multiple of 7 is higher than at the others, we probably have some weekly seasonality.
    Trend: If the correlation for recent lags is higher and slowly decreases as the lags increase, then there is some trend in our data. Therefore, we would need to carry out some differencing to render the time series stationary.
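To illustrate the seasonality point, here is a small sketch on synthetic data. The series `weekly` (a sine wave with a period of 7 plus a little noise) and the helper `acf_values` are assumptions made up for this example; they are not part of the airline dataset used later.

```python
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(140)                                   # 20 weeks of daily data
weekly = np.sin(2 * np.pi * days / 7) + 0.1 * rng.normal(size=days.size)

def acf_values(y, max_lag):
    """Sample autocorrelations r_0..r_max_lag using the overall mean."""
    dev = np.asarray(y, dtype=float) - np.mean(y)
    denom = np.sum(dev ** 2)
    return np.array([np.sum(dev[k:] * dev[:dev.size - k]) / denom
                     for k in range(max_lag + 1)])

r = acf_values(weekly, 21)
# Lags (excluding lag 0) with the three largest autocorrelations
peaks = [int(k) for k in np.argsort(r[1:])[::-1][:3] + 1]
print(sorted(peaks))
```

With a weekly pattern this strong, the three largest values should sit at lags 7, 14 and 21, which is the "every multiple of 7" signature described above.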

To learn more about seasonality, trend and stationarity, check out my previous articles on those topics:

Let’s now go through an example in Python to make this theory more concrete!

## Python Example

For this walkthrough we will use the classic airline passenger volumes dataset:

    Data sourced from Kaggle with a CC0 licence.

In [None]:
# Import packages
import plotly.express as px
import pandas as pd

# Read in the data
data = pd.read_csv('/mnt/FE86DAF186DAAA03/Python/Secondary/Study Repositories/time-series-studies/1_datasets/AirPassengers.csv')

# Plot the data
fig = px.line(data, x='Month', y='#Passengers',
              labels=({'#Passengers': 'Passengers', 'Month': 'Date'}))

fig.update_layout(template="simple_white", font=dict(size=18),
                  title_text='Airline Passengers', width=650,
                  title_x=0.5, height=400)

There is a clear upwards trend and yearly seasonality (data points indexed by month).

We can use the plot_acf function from the statsmodels package to plot the autocorrelation of our time series at various lags. This type of plot is known as a correlogram:

In [None]:
# Import packages
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

# Plot autocorrelation
plt.rc("figure", figsize=(11,5))
plot_acf(data['#Passengers'], lags=48)
plt.ylim(0,1)
plt.xlabel('Lags', fontsize=18)
plt.ylabel('Correlation', fontsize=18)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.title('Autocorrelation Plot', fontsize=20)
plt.tight_layout()
plt.show()

We observe the following:

- There is a clear cyclical pattern in the lags every multiple of 12. As our data is indexed by month, we therefore have a yearly seasonality in our data.
- The strength of correlation is generally and slowly decreasing as the lags increase. This points to a trend in our data and it needs to be differenced to make it stationary when modelling.
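The effect of differencing on a trended series can be sketched with synthetic data. The series `trended` (a straight line plus noise) and the helper `lag1_autocorr` are assumptions for illustration only; this is not the airline data.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(500)
trended = 0.5 * t + rng.normal(size=t.size)   # linear trend plus unit-variance noise
differenced = np.diff(trended)                # y_t - y_{t-1}

def lag1_autocorr(y):
    """Sample autocorrelation at lag 1 using the overall mean."""
    dev = y - y.mean()
    return np.sum(dev[1:] * dev[:-1]) / np.sum(dev ** 2)

r1_trended = lag1_autocorr(trended)
r1_differenced = lag1_autocorr(differenced)
print(round(r1_trended, 2), round(r1_differenced, 2))
```

The trended series should show a lag-1 autocorrelation close to 1, while the differenced series no longer does. Its lag-1 value is expected to be negative here, because differencing pure noise gives a theoretical lag-1 autocorrelation of -0.5.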

The blue region is the confidence interval: lags whose bars extend beyond it are statistically significant, while those inside it are not. Therefore, when building a forecast model for this data, the next month forecast should probably only consider ~15 of the previous values due to their statistical significance.

    The lag at value 0 has a perfect correlation of 1 because we are correlating the time series with an exact copy of itself.
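For a rough feel of where such a band sits, a common approximation for white noise is ±1.96/√N at the 95% level. The sketch below applies it to the 144 monthly observations of the airline dataset. The helper name `white_noise_band` and the value 0.76 are illustrative assumptions, and the band drawn by plot_acf may differ, since by default statsmodels appears to use Bartlett's formula, which widens the band as the lag grows.

```python
import math

def white_noise_band(n_obs, z=1.96):
    """Approximate 95% limit for the autocorrelation of white noise."""
    return z / math.sqrt(n_obs)

n_obs = 144                      # monthly observations from 1949 to 1960
band = white_noise_band(n_obs)
print(round(band, 3))

# A lag is flagged as significant when its autocorrelation lies outside +/- band
lag_12_is_significant = abs(0.76) > band   # 0.76 is a made-up illustrative value
```

Since the square root of 144 is 12, the band is 1.96 / 12, or about 0.163.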

## Summary and Other Thoughts

In this post we have described what autocorrelation is and how we can use it to detect seasonality and trends in our time series. However, it does have other uses too. For example, we can use an autocorrelation plot for the residuals from a forecasting model to determine if the residuals are indeed independent. If the autocorrelation of the residuals is not mostly zero, then the fitted model has not accounted for all information and can probably be improved.
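The residual check can be sketched on a synthetic random walk with two deliberately simple "models". Everything here (the series `walk`, the two forecasts and the helper `lag1_autocorr`) is an assumption made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
walk = np.cumsum(rng.normal(size=500))        # random walk: y_t = y_{t-1} + noise

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1 using the overall mean."""
    dev = x - x.mean()
    return np.sum(dev[1:] * dev[:-1]) / np.sum(dev ** 2)

# Model 1: forecast every point with the overall mean of the series
mean_residuals = walk - walk.mean()
# Model 2: forecast every point with the previous value (naive forecast)
naive_residuals = walk[1:] - walk[:-1]

r1_mean_model = lag1_autocorr(mean_residuals)
r1_naive_model = lag1_autocorr(naive_residuals)
print(round(r1_mean_model, 2), round(r1_naive_model, 2))
```

The mean model leaves strongly autocorrelated residuals, a sign that it has missed structure in the data, whereas the residuals of the naive forecast should be close to uncorrelated for a random walk.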

The full code script used in this article can be found at my GitHub here:

# ====================================================================
# ====================================================================
# ====================================================================
# ====================================================================
# ====================================================================


https://medium.com/data-science/a-step-by-step-guide-to-calculating-autocorrelation-and-partial-autocorrelation-8c4342b784e8

# A Step-by-Step Guide to Calculating Autocorrelation and Partial Autocorrelation

If you have worked on any time series task, I am sure at one point you looked into the measures of autocorrelation, the standard approaches to identifying the nature of relationships in time series. For example, you might have used the ACF and PACF plots to determine the orders of an ARMA model.

However, have you actually wondered how those correlation coefficients are calculated? If not, this article is the right place for you. We will briefly describe what those two measures are and then show step-by-step how to calculate them in Python.

And to manage expectations — we will focus on the calculations behind the coefficients, not their interpretation and details on how to use them for time series modeling. That would be a topic for another article.

## Setup

As always, we quickly import the required libraries. We will use the functions from statsmodels as a benchmark to make sure our calculations are correct.

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.tsatools import lagmat

import matplotlib.pyplot as plt

# settings
plt.style.use("seaborn-v0_8")  # the "seaborn" style was renamed in matplotlib 3.6; use "seaborn" on older versions
plt.rcParams["figure.figsize"] = (16, 8)

## Data

For this article, we will use a data set that is simply a classic for time series: the airline passengers data set. You can find the data set in the seaborn library (sns.load_dataset("flights")), or download the slightly preprocessed version from my GitHub. For our calculations, we will be using a pandas Series called y, which contains the monthly number of airline passengers.

In [None]:
df = pd.read_csv("../data/air_passengers.csv", index_col=0)
df.index = pd.to_datetime(df.index)
y = df["#Passengers"]

I am pretty sure you are already quite familiar with the plot above :) Then, we generate the two plots that contain the ACF/PACF values. While we will not explicitly replicate the plots (though we could), we focus on the values that are represented by the points (and vertical lines) on the plots.

In [None]:
fig, ax = plt.subplots(2, 1)

plot_acf(df, ax=ax[0])
plot_pacf(df, ax=ax[1], method="ols")

Note: While we will not cover it explicitly in this article, the blue shaded areas are the confidence intervals. The values of the ACF/PACF that are inside the intervals are not considered statistically significant at the 5% level (the default setting, which we can change).

## Autocorrelation

Let’s start with the simpler of the two. In a nutshell, autocorrelation is the correlation of a time series with its lagged counterpart. The simplest example, lag 1, would tell us how correlated this month’s number of airline passengers is with the number from the previous month. Naturally, we can rephrase that sentence with an arbitrary number of lags.

After this very brief recap, let’s calculate the benchmark using statsmodels. The plots we generated before used 24 lags, but for convenience’s sake, we will consider 10 lags. Calculating the values of the ACF is as easy as the following snippet.

acf(df, nlags=10)

Which generates the following array:

array([1.        , 0.94804734, 0.87557484, 0.80668116, 0.75262542,
       0.71376997, 0.6817336 , 0.66290439, 0.65561048, 0.67094833,
       0.70271992])

We know that autocorrelation is the correlation of the time series with its lagged values. So for our calculations, we could easily create a DataFrame with the original and lagged series in separate columns and then use the corr method to calculate the Pearson’s correlation coefficients. Let’s give it a shot.

We start with generating the DataFrame:

In [None]:
acf_df = pd.DataFrame()
for lag in range(0, 11):
    acf_df[f"y_lag_{lag}"] = y.shift(lag)

acf_df

Which generates the following table:

IMAGE

The y_lag_0 column is the original series and all the other ones are shifted appropriately. Then, we calculate the correlation matrix and print the column for the original series. It shows how the original series is correlated with all the columns of the DataFrame.

acf_df.corr()["y_lag_0"].values

Which returns the following array:

array([1.        , 0.96019465, 0.89567531, 0.83739477, 0.7977347 ,
       0.78594315, 0.7839188 , 0.78459213, 0.79221505, 0.8278519 ,
       0.8827128 ])

Something is off and the values do not match our benchmark. And what could be the reason for that? As is often the case, the devil is in the details. Let’s have a look at the autocorrelation formula:

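$$
r_k = \frac {\sum_{t=k+1}^{N} (y_t - \overline{y}) (y_{t-k} - \overline{y})} {\sum_{t=1}^{N}(y_t - \overline{y})^2}
$$

Here N is the length of the series y, k is the lag, and \overline{y} is the mean of the full series.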

There are a few things that call for our attention here:

- all the series in the formula are demeaned, that is, the average value has been subtracted from them. We have not done that explicitly, but that happens under the hood in the corr method.
- what is different from the general correlation formula — we are always subtracting the mean of the original series! Because we create the lags, the mean of the original series and the lagged ones will not be the same.
- in the denominator, we divide by the variance of the original series. In Pearson’s correlation coefficient, we would divide by the multiplication of the standard deviations of the two considered variables.
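The difference between the two formulas is easiest to see on a series where it is extreme. The sketch below uses a perfectly linear series, which is an assumption chosen for illustration (it is not the airline data), and compares Pearson's coefficient on the shifted slices with the ACF formula.

```python
import numpy as np

y = np.arange(20, dtype=float)                # perfectly linear series 0, 1, ..., 19
k = 1

# Pearson: each slice is demeaned and scaled by its OWN mean and standard deviation
pearson = np.corrcoef(y[k:], y[:-k])[0, 1]

# ACF: both slices use the mean and the sum of squares of the FULL series
dev = y - y.mean()
acf_k = np.sum(dev[k:] * dev[:-k]) / np.sum(dev ** 2)

print(pearson, acf_k)
```

For this straight line the two shifted slices are perfectly correlated in Pearson's sense (1 up to floating-point error), while the ACF formula gives 565.25 / 665 = 0.85. The gap comes entirely from the choice of mean and denominator described in the bullets above.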

Let’s now calculate the ACF values using the formula above.

In [None]:
acf_list = []
mu = y.mean() 

for lag in range(0, 11):
    acf_list.append(sum((y - mu).iloc[lag:] * (y.shift(lag) - mu).iloc[lag:]) / sum((y - mu) ** 2))

np.array(acf_list)

Which returns the following array:

array([1.        , 0.94804734, 0.87557484, 0.80668116, 0.75262542,
       0.71376997, 0.6817336 , 0.66290439, 0.65561048, 0.67094833,
       0.70271992])

It is a perfect match for what we calculated using the acf function from statsmodels.

On a side note, the formula comes from probably the best book about time series forecasting, Forecasting: Principles and Practice. I highly recommend it to anyone interested in the topic.



# ====================================================================
# ====================================================================
# ====================================================================
# ====================================================================
# ====================================================================


https://medium.com/data-science/understanding-autocorrelation-in-time-series-analysis-322ad52f2199

# Understanding Autocorrelation in Time Series Analysis

subtitle: Understanding how autocorrelation works is essential for beginners to make their journey in time series analysis easier.

I began my time series analysis journey two years ago. Initially, I began learning through YouTube videos where I came across autocorrelation, a basic concept of time series analysis. According to a few videos, the definition of autocorrelation was the relationship/correlation of a time series with its previous versions in time. A few others said that, since we need two variables to compute correlation but a time series has only one, we need to compute the correlation of the time series with a ‘k’th lagged version of itself.

## What is ‘k’th lag?

This was the first question that came into my mind. As I proceeded, I got to know that a time series (y) with ‘k’th lag is its version that is ‘k’ periods behind in time, i.e. y(t-k). A time series with lag (k=1) is a version of the original time series that is 1 period behind in time, i.e. y(t-1).

Most of those videos took an example of the stock market daily prices to explain time series analysis. Those prices were recorded daily. They explained that the autocorrelation of the stock prices is the correlation of the current price with the price ‘k’ periods behind in time. So, the autocorrelation with lag (k=1) is the correlation between today’s price y(t) and yesterday’s price y(t-1). Similarly, for k=2, the autocorrelation is computed between y(t) and y(t-2).

## Now comes the main question

How can we compute the correlation (or even covariance for that matter) of today’s price with yesterday’s price? Correlation can only be computed between variables with multiple values; if I try to compute correlation between two single values, I’d get a ‘NaN’. Also, the two variables must have the same length (number of values), so I can’t even compute the correlation between y(starting point to t) and y(starting point to t-1).

Then, I started searching for a theoretical explanation of autocorrelation and came across the formula of autocorrelation as shown below.

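$$
r_k = \frac {\sum_{t=k+1}^{N} (y_t - \overline{y}) (y_{t-k} - \overline{y})} {\sum_{t=1}^{N}(y_t - \overline{y})^2}
$$

$$
\overline{y} = \frac{1}{N} \sum_{t=1}^{N} y_t
$$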

## Understanding the formula

- The formula of autocorrelation is similar (but not exactly the same) to that of correlation.
- The numerator is similar to covariance between the current and lagged versions of the time series (but doesn’t have ‘N-1’ as denominator). A closer examination of the two components of the numerator shows that the mean of the original time series, mean(y), is being subtracted from them, not mean(y(t)) and mean(y(t-k)), respectively. This makes the numerator of the formula a bit different from covariance.
- The denominator is similar to the square of standard deviation (a.k.a. variance) of the original time series (but doesn’t have ‘N-1’ as denominator).
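One way to see how the formula gets around the length problem is to line up the values it actually multiplies. The sketch below uses a short list of made-up prices (an assumption for illustration) and pairs each y(t) with its y(t-k).

```python
prices = [100, 102, 101, 105, 107, 110]   # hypothetical daily closing prices
k = 1

current = prices[k:]                # y(t)   for t = k+1 .. N
lagged = prices[:len(prices) - k]   # y(t-k) for the same values of t

pairs = list(zip(current, lagged))  # (today, k days earlier)
print(pairs)
```

Both slices have N-k values, so every y(t) has a partner y(t-k). The first k observations simply have no lagged partner and drop out of the numerator, while the mean and the denominator are still taken over the full series.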

Let’s answer the question “How do we compute autocorrelation?” by implementing it in Python.

We’ll use the Nifty (an Indian stock index tracking 50 stocks) closing price data from 17 September, 2007 to 30 July, 2021. The data is downloaded as a csv from Yahoo Finance. We’ll first prepare the data for time series analysis.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

nifty = pd.read_csv("Nifty.csv",
                     usecols=['Date', "Close"],
                     parse_dates=['Date'])

nifty.set_index("Date", inplace=True) #setting "Date" as index

nifty = nifty.asfreq("b").copy() #setting frequency to business days

nifty['Close'] = nifty['Close'].ffill() #forward filling missing values (due to holidays in India)

nifty.plot(figsize=(25,5))
plt.show()

We’ll define a function called ‘autocorr’ that returns the autocorrelation (acf) for a single lag by taking a time series array and ‘k’th lag value as inputs. This function will be nested inside another function called ‘my_auto_corr’ that returns acf for lags [0, k] by calling the ‘autocorr’ function to compute acf for each lag value.

In [None]:
def my_auto_corr(df, nlags=2):
    '''
    Returns autocorrelation coefficients for lags [0, nlags]
    
    df: dataframe
        Input dataframe of time series (the first column is used)
    nlags: int
           maximum number of lags, default 2
    
    Returns
    array: autocorrelation coefficients for lags [0, nlags]
    '''
    def autocorr(y, lag=2):
        '''
        Calculates autocorrelation coefficient for a single lag value (lag >= 1)
        
        y: array
           Input time series array
        lag: int, default: 2 
             'kth' lag value
        
        Returns
        float: autocorrelation coefficient 
        '''
        y = np.array(y).copy()
        y_bar = np.mean(y) #y_bar = mean of the time series y
        denominator = sum((y - y_bar) ** 2) #sum of squared differences between y(t) and y_bar
        numerator_p1 = y[lag:] - y_bar #y(t)-y_bar: time series from index 'lag' till the end, minus y_bar
        numerator_p2 = y[:-lag] - y_bar #y(t-k)-y_bar: time series from the start till 'lag' values before the end, minus y_bar
        numerator = sum(numerator_p1 * numerator_p2) #sum of the products (y(t)-y_bar) * (y(t-k)-y_bar)
        return (numerator / denominator)
    
    acf = [1] #initializing list with autocorrelation coefficient for lag k=0 which is always 1
    for i in range(1, (nlags + 1)):
        acf.append(autocorr(df.iloc[:, 0].values, lag=i)) #calling autocorr function for each lag 'i'
    return np.array(acf)

Let’s call the ‘my_auto_corr’ function by passing the ‘nifty’ time series data frame and nlags=10 as arguments. We’ll also compare the output of the ‘my_auto_corr’ function with that of the ‘acf’ function of ‘statsmodels’.

In [None]:
from statsmodels.tsa.stattools import acf

print(f'my_auto_corr:\n{my_auto_corr(df=nifty, nlags=10)}\n\nstatsmodels acf:\n{acf(nifty.iloc[:,0].values, nlags=10)}')

The results of ‘my_auto_corr’ are the same as those of the ‘acf’ function of ‘statsmodels’. Let’s once again look at the formula of autocorrelation that we saw earlier and try to understand it.

EQUATION

- The denominator is pretty straightforward, it is similar to the variance of the original time series, but doesn’t have ‘N-1' in the denominator. It is denoted by ‘denominator’ variable in the code.
- As discussed earlier, the numerator is similar to the covariance between the current and lagged versions of the time series (without N-1 as denominator). Let’s understand how to compute the numerator.

- The brown rectangle represents y(t) in the first part of the numerator. The mean of the original time series, mean(y), is subtracted from it. The first part is denoted by ‘numerator_p1’ in the code & y(t)-mean(y) in the formula. y(t) is fixed at the bottom and its top moves down by 1 for every unit increase in the lag (k).
- Similarly, the green rectangle represents y(t-k) in the second part of the numerator. The mean of the original time series, mean(y), is also subtracted from it. The second part is denoted by ‘numerator_p2’ in the code & y(t-k)-mean(y) in the formula. y(t-k) is fixed at the top and its bottom moves up by 1 for every unit increase in the lag (k).

As we saw earlier, the numerator of the formula is not exactly the same as covariance, and the denominator is similar to the variance of the original time series but without ‘N-1’ in the denominator. Hence, computing the covariance of the brown and green rectangles and dividing it by the variance of the original time series doesn’t give us the autocorrelation coefficient.

Breaking down the autocorrelation formula into fragments and implementing it in Python helped us understand it better. We saw how the covariance in the numerator is calculated between the current and the lagged versions of time series. Hence, it is important to know what’s under the hood to understand a concept better, be it a machine learning algorithm or a concept in statistics.

Know more about my work at https://ksvmuralidhar.in/

# ====================================================================
# ====================================================================
# ====================================================================
# ====================================================================
# ====================================================================


https://medium.com/data-science/understanding-autocorrelation-ddce4f0a9fda

# Understanding Autocorrelation

subtitle: And its impact on your data (with examples)

I've been dealing with autocorrelated data a lot lately. In finance, certain time series such as housing prices or private equity returns are notoriously autocorrelated. Properly accounting for this autocorrelation is critical to building a robust model.

First, what is autocorrelation? Autocorrelation is when past observations have an impact on current ones. For example, if you could use Apple’s returns last week to reliably predict its returns this week, then you can say that Apple’s stock price returns are autocorrelated. In math terms:

y = B0 + B1*y_lag1 + B2*y_lag2 + ... + Bn*y_lagn + error

If any of B1 to Bn is significantly nonzero, then we can say that the time series represented by y is autocorrelated.
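As a minimal sketch of this regression view, the code below simulates a series with a single lag (B1 = 0.7, all other coefficients zero) and recovers the coefficient by ordinary least squares. The simulated data, the value 0.7 and all variable names are assumptions for illustration, not estimates from real returns.

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs, true_b1 = 5000, 0.7

# Simulate y_t = true_b1 * y_{t-1} + error (B0 = 0, one lag only)
errors = rng.normal(size=n_obs)
y = np.zeros(n_obs)
for t in range(1, n_obs):
    y[t] = true_b1 * y[t - 1] + errors[t]

# Regress y_t on a constant and y_{t-1} by ordinary least squares
X = np.column_stack([np.ones(n_obs - 1), y[:-1]])
coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
b0_hat, b1_hat = coef
print(round(b0_hat, 3), round(b1_hat, 3))
```

The estimate of B1 should land near 0.7 and the estimate of B0 near zero. Judging whether a coefficient is "significantly" nonzero would also need its standard error, which this sketch leaves out.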

## Why Do We Care About Autocorrelation?

Two reasons come to my mind:

1. Many things are autocorrelated. So when we attempt to study the world by building simulations (e.g. Monte Carlo simulations of the economy), we need to take that autocorrelation into account. Otherwise our model would produce faulty results (see reason 2 for why).
2. Autocorrelation causes volatility to be understated, especially when compounded. The compounded product, that is (1+r1)*(1+r2)… of an autocorrelated random variable can have a much wider distribution of outcomes than that of a random variable with no autocorrelation.

## An Example With Inflation Data

Let’s use some actual data to study this. I downloaded some CPI data from FRED (the data repository of the Federal Reserve Bank of St. Louis). It looks like this:

IMAGE

Whenever analyzing time series data like CPI, you should start by taking the rate of change (to make it more mean-variance stationary). Let’s do that:

Here's what my raw CPI data looks like (stored in df):

                values
    date
    1950-01-01   23.51
    1950-02-01   23.61
    1950-03-01   23.64
    1950-04-01   23.65
    1950-05-01   23.77
    1950-06-01   23.88
    1950-07-01   24.07
    1950-08-01   24.20
    1950-09-01   24.34
    1950-10-01   24.50

# My CPI data is stored in a dataframe called df
# The following line calculates the monthly rate of change
df_chg = (df/df.shift(1)-1).dropna()

The result looks like this (there’s some obvious seasonality in there that we will ignore today):

IMAGE

We can do a check for autocorrelation by looking at the correlation of the monthly change in CPI against its lagged values. We can use the shift method to create the lags.

df_chg.rename({'values': 'unlagged'}, axis=1, inplace=True)

lags = 10
for i in range(lags):
    if i > 0:
        df_chg['lag_'+str(i)] = df_chg['unlagged'].shift(i)

Plotting the correlations between the unlagged values of CPI change against its various lags, we see that there’s significant autocorrelation:

IMAGE
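The correlations behind a chart like this can be computed with the corr method once the lag columns exist. The sketch below is self-contained, so it uses a synthetic series as a stand-in for the CPI changes; the DataFrame `demo`, the coefficient 0.8 and the series itself are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
# Synthetic stand-in for the monthly CPI changes: each value is 0.8 times
# the previous one plus a random shock
values = np.zeros(2000)
for t in range(1, values.size):
    values[t] = 0.8 * values[t - 1] + rng.normal()

demo = pd.DataFrame({"unlagged": values})
for i in range(1, 10):
    demo["lag_" + str(i)] = demo["unlagged"].shift(i)

# Pearson correlation of the unlagged column with each lag
# (rows with NaN from the shift are skipped pair by pair)
lag_corrs = demo.corr()["unlagged"].drop("unlagged")
print(lag_corrs.round(2))
```

Calling `lag_corrs.plot(kind="bar")` would then draw one bar per lag. On the real data, the same two lines applied to df_chg would give the correlations for the CPI changes.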

## Ignore Autocorrelation At Your Own Risk

Let’s do a quick simulation to see what happens if we ignore autocorrelation. The monthly change in inflation has a standard deviation of 0.32%. If the monthly changes were independent, normally distributed random variables, we could annualize the 0.32% by multiplying it by the square root of 12 (because we are going from monthly to annual), which gives us an annualized standard deviation of 1.11%. That’s a pretty low standard deviation and would imply that extreme events such as hyperinflation are virtually impossible.
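The square-root-of-12 rule is a special case of the formula for the variance of a sum: for n equally volatile terms with lag-k autocorrelation ρ_k, the variance of the sum is σ² [n + 2 Σ (n-k) ρ_k]. The sketch below evaluates it for independent months and for a hypothetical autocorrelation of 0.5^k. The function name and the value 0.5 are assumptions, and summing monthly changes is only an approximation to compounding them.

```python
import math

monthly_vol = 0.0032      # 0.32% monthly standard deviation, from the text
months = 12

# Independent months: variances add, so volatility scales with sqrt(12)
annual_vol_independent = monthly_vol * math.sqrt(months)

def annual_vol_autocorrelated(vol, n, rho):
    """Std of the sum of n terms whose lag-k autocorrelation is rho**k."""
    cross_terms = sum((n - k) * rho ** k for k in range(1, n))
    return vol * math.sqrt(n + 2 * cross_terms)

# rho=0.5 is a hypothetical value, not an estimate from the CPI data
annual_vol_ar = annual_vol_autocorrelated(monthly_vol, months, rho=0.5)
print(round(annual_vol_independent, 4), round(annual_vol_ar, 4))
```

With ρ = 0 the cross terms vanish and the function reproduces the 1.11% figure. Any positive autocorrelation adds positive cross terms, which is why ignoring it understates the annual volatility.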

The following code simulates a year’s worth of inflation 10,000 times so we can look at the difference in outcomes between including autocorrelation and ignoring it.

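import numpy as np
import pandas as pd

# c is the regression constant, a.k.a. the mean value
# we will set c to zero to simplify things
c = 0
# target_vol (standard deviation of the monthly shocks) and betas (the
# coefficients on the first two lags) must be defined before this point;
# their values are not shown in the article
# List to store autocorrelated simulation results
auto_correl_list = []
# List to store normally distributed simulation results
normal_list = []
# Run 10,000 scenarios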
for sims in range(10000):
    # In each scenario, generate 12 months of "inflation"
    shocks = np.random.normal(0, target_vol, (12,))
    y_list = []
    # This loop takes the 12 shocks and adds autocorrelation
    for i, e in enumerate(shocks):
        if i == 0:
            y = c + betas[0]*0 + betas[1]*0 + e
            y_list.append(y)
        elif i == 1:
            y = c + betas[0]*y_list[i-1] + betas[1]*0 + e
            y_list.append(y)
        else:
            y = c + betas[0]*y_list[i-1] + betas[1]*y_list[i-2] + e
            y_list.append(y)
    # Calculate the compounded products
    auto_correl_list.append(((pd.Series(y_list)+1).cumprod()-1).iloc[-1])
    normal_list.append(((pd.Series(shocks)+1).cumprod()-1).iloc[-1])


Let’s take a look at the distribution of outcomes. Look at how much wider the autocorrelated version (in blue) is than the normal (in orange). The simulated standard deviation of the normal (the standard deviation of the orange histogram) is basically what we calculated earlier — 1.11%. The standard deviation of the autocorrelated version is 7.67%, almost seven times higher. Notice also that the means for both are the same (both zero) — autocorrelation impacts the variance but not the mean. This has implications for regression, which I will cover in a future article.

IMAGE

Finally, let’s talk a bit about why this occurs. When something is autocorrelated (and the correlation coefficients are positive), it’s much more susceptible to feedback loops. Trends tend to snowball — for example, in cases where the last few observations were high, the next observation tends to be high as well because the next is heavily impacted by its predecessors.

We can see this snowball effect by looking at a few of the individual paths from our CPI simulation (this time simulated out to 60 months instead of just 12). In the autocorrelated version, once things get out of hand, they tend to stay that way, either going to the stratosphere or to -100%.

IMAGE

The normal version of our CPI simulation fails to capture this snowball effect and therefore understates the range of possible outcomes:

IMAGE

Which is closer to reality? My autocorrelated simulation definitely overstates the likely range of outcomes for inflation — it’s a simple simulation that fails to account for some of inflation’s other properties such as mean reversion. But without autocorrelation, you would look at your simulation results and assume that extreme phenomena like hyperinflation or persistent deflation are statistical impossibilities. That would be a grave mistake.

# ====================================================================
# ====================================================================
# ====================================================================
# ====================================================================
# ====================================================================


https://medium.com/data-science/time-series-from-scratch-autocorrelation-and-partial-autocorrelation-explained-1dd641e3076f

# Time Series From Scratch — Autocorrelation and Partial Autocorrelation Explained

subtitle: Part 5 of Time Series from Scratch series — Learn all about ACF and PACF — from theory and implementation to interpretation.

Today you’ll learn two functions for analyzing time series and choosing model parameters: the Autocorrelation function (ACF) and the Partial autocorrelation function (PACF). Both are based on correlation, a simple concept from statistics, so you’ll get a recap on that first.

The article is structured as follows:

- From correlation to autocorrelation
- Autocorrelation — Theory and implementation
- Partial autocorrelation — Theory and implementation
- How to interpret ACF and PACF plots
- Conclusion

## From correlation to autocorrelation

Both terms are tightly connected. Correlation measures the strength of the linear relationship between two sequences:

    The closer the correlation to +1, the stronger the positive linear relationship
    The closer the correlation to -1, the stronger the negative linear relationship
    The closer the correlation to 0, the weaker the linear relationship.

The following figure summarizes this concept perfectly:

IMAGE

Autocorrelation is the same, but with a twist: you’ll calculate the correlation of a sequence with itself lagged by some number of time units. Don’t worry if you don’t fully get it, as we’ll explore it next.

You’ll use the Airline passengers dataset through the article. Here’s how to import the libraries, load and plot the dataset:


In [None]:
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

import matplotlib.pyplot as plt
from matplotlib import rcParams
from cycler import cycler

rcParams['figure.figsize'] = 18, 5
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False
rcParams['axes.prop_cycle'] = cycler(color=['#365977'])
rcParams['lines.linewidth'] = 2.5


# Dataset
df = pd.read_csv('data/airline-passengers.csv', index_col='Month', parse_dates=True)

# Visualize
plt.title('Airline Passengers dataset', size=20)
plt.plot(df);

## Autocorrelation — Theory and implementation

As said before, autocorrelation shows the correlation of a sequence with itself lagged by some number of time units. Once plotted, X-axis shows the lag number, and Y-axis shows the correlation of the sequence with a sequence at that lag. Y-axis ranges from -1 to 1.

Here’s an example.

The airline passenger dataset shows the number of passengers per month from 1949 to 1960. Autocorrelation answers the following question: “How correlated is the number of passengers this month with the number of passengers in the previous month?”. Here, the previous month indicates the lag value of 1.

You can rephrase the question and ask how correlated the number of passengers this month is to the number of passengers a year ago. Then, the lag value would be 12. And this is a great question, since yearly seasonality is visible from the chart.

One thing to remember: the correlation generally weakens as the lag grows, because more recent periods tend to have more impact. Seasonal lags, such as lag 12 here, can be an exception.

Before calculating autocorrelation, you should make the time series stationary. We haven’t covered the concept of stationarity yet, but we will in the following article. In a nutshell — the mean, variance, and covariance shouldn’t change over time.

The easiest way to make time series stationary is by calculating the first-order difference. It’s not a way to statistically prove stationarity, but don’t worry about it for now.

Here’s how to calculate the first-order difference:

In [None]:
# First-order difference
df['Passengers_Diff'] = df['Passengers'].diff(periods=1)
df = df.dropna()

# Plot
plt.title('Airline Passengers dataset with First-order difference', size=20)
plt.plot(df['Passengers'], label='Passengers')
plt.plot(df['Passengers_Diff'], label='First-order difference', color='orange')
plt.legend();

Here’s what both series look like:

IMAGE

The differenced series doesn’t look completely stationary but will do for now.

You can now use the acf() function from statsmodels to calculate autocorrelation:


In [None]:
# Calculate autocorrelation
acf_values = acf(df['Passengers_Diff'])

Here’s how the values look, rounded to two decimal places:

IMAGE

The first value is 1, because a correlation between two identical series was calculated. But take a look at the 12th period: the autocorrelation value is 0.83. This tells you that the value 12 periods ago has a strong impact on the value today.

Further, you can use the plot_acf() function to inspect the autocorrelation visually:

In [None]:
# Plot autocorrelation
plot_acf(df['Passengers_Diff'], lags=30);

Here’s what it looks like:

IMAGE

The plot confirms our assumption about the correlation on lag 12. The same is visible at lag 24, but the correlation declines over time. The value 12 periods ago has more impact on the value today than the value 24 periods ago does.

Another thing to note is the shaded area. Anything inside it isn’t statistically significant.


## How to interpret ACF and PACF plots

Time series models you’ll soon learn about, such as Auto Regression (AR), Moving Averages (MA), or their combinations (ARMA), require you to specify one or more parameters. These can be obtained by looking at ACF and PACF plots.

In a nutshell:

    If the ACF plot declines gradually and the PACF drops instantly, use Auto Regressive model.
    If the ACF plot drops instantly and the PACF declines gradually, use Moving Average model.
    If both ACF and PACF decline gradually, combine Auto Regressive and Moving Average models (ARMA).
    If both ACF and PACF drop instantly (no significant lags), it’s likely you won’t be able to model the time series.
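The ACF half of these rules can be checked on simulated data. The sketch below generates one autoregressive and one moving-average series, both with a coefficient of 0.8, and compares their first few autocorrelations. The coefficient, the series length and the helper `acf_values` are assumptions for illustration, and the PACF side is not shown.

```python
import numpy as np

rng = np.random.default_rng(5)
n_obs = 20000
noise = rng.normal(size=n_obs + 1)

# AR(1): y_t = 0.8 * y_{t-1} + e_t
ar = np.zeros(n_obs)
for t in range(1, n_obs):
    ar[t] = 0.8 * ar[t - 1] + noise[t]

# MA(1): y_t = e_t + 0.8 * e_{t-1}
ma = noise[1:] + 0.8 * noise[:-1]

def acf_values(y, max_lag):
    """Sample autocorrelations r_0..r_max_lag using the overall mean."""
    dev = y - y.mean()
    denom = np.sum(dev ** 2)
    return np.array([np.sum(dev[k:] * dev[:dev.size - k]) / denom
                     for k in range(max_lag + 1)])

ar_acf = acf_values(ar, 3)   # theoretical values: 1, 0.8, 0.64, 0.512 (gradual decline)
ma_acf = acf_values(ma, 3)   # theoretical values: 1, 0.488, 0, 0 (cut-off after lag 1)
print(ar_acf.round(2), ma_acf.round(2))
```

The autoregressive series should show an ACF that shrinks by a factor of about 0.8 per lag, while the moving-average series should drop to roughly zero after lag 1, matching the "declines gradually" versus "drops instantly" patterns in the list above.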

Still, reading ACF and PACF plots is challenging, and you’re far better off using grid search to find optimal parameter values. An optimal parameter combination has the lowest error (such as MAPE) or the lowest value of a general quality estimator (such as AIC). We’ll cover time series evaluation metrics soon, so stay tuned.

## Conclusion

And there you have it — autocorrelation and partial autocorrelation in a nutshell. Both functions and plots help analyze time series data, but we’ll mostly rely on brute-force parameter finding methods for forecasting. It’s much easier to do a grid search than to look at charts.

Both ACF and PACF require stationary time series. We’ve only covered stationarity briefly for now, but that will change in the following article. Stay tuned to learn everything about stationarity, stationarity tests, and testing automation.

Thanks for reading.

# ====================================================================
# ====================================================================
# ====================================================================
# ====================================================================
# ====================================================================
