<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Time Series: Autocorrelation


# What is Autocorrelation?

In previous weeks, our analyses have been concerned with the correlation between two or more variables (height and weight, education and salary, etc.). In time series data, autocorrelation is a measure of _how correlated a variable is with itself over time_.

Examples:

- In stock market data, the stock price at one point in time is correlated with the stock price at the point directly before it.
- In sales data, sales on a Saturday are likely correlated with sales on the next Saturday and the previous Saturday, perhaps more than with sales on the preceding Friday.

**Exercise (2 mins., post immediately)**

- List at least three real-world examples of autocorrelation.

# How Do We Compute Autocorrelation?

**Recall:** Correlation between $X$ and $Y$:

$${\operatorname{corr}(X, Y) = \frac{\operatorname{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X\sigma_Y}}$$

**Now:** The autocorrelation of $X$ for a given "lag" $k$ is just the correlation between $X$ and a "lagged" version of $X$ in which all observations are shifted by $k$ time units:

$${R(k) = \frac{\operatorname{E}[(X_{t} - \mu)(X_{t-k} - \mu)]}{\sigma^2}}$$

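Here $\mu$ and $\sigma^2$ are the mean and variance of $X$, which this formula assumes are the same at every time point. With a _k_ value of one, we'd compute how correlated each value is with the one immediately before it. With a _k_ value of 10, we'd compute how correlated each value is with the one 10 time points earlier.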
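As a rough illustration of the formula above, here is a minimal NumPy sketch that estimates $R(k)$ directly from its definition. The function name `autocorrelation` and the toy series are made up for this example. The estimator averages over the pairs that are $k$ steps apart and divides by the variance of the full series, so its output can differ slightly from what Pandas or StatsModels report on short series.

```python
import numpy as np

def autocorrelation(x, k):
    """Sample estimate of R(k) = E[(X_t - mu)(X_{t-k} - mu)] / sigma^2."""
    x = np.asarray(x, dtype=float)
    if k == 0:
        return 1.0
    mu = x.mean()
    var = x.var()  # variance of the full series (sigma^2)
    # Average the products of deviations over all pairs that are k steps apart
    cov = np.mean((x[k:] - mu) * (x[:-k] - mu))
    return cov / var

# Hypothetical toy series: the pattern 1, 2, 3, 2 repeated 25 times (period 4)
x = np.tile([1.0, 2.0, 3.0, 2.0], 25)

r1 = autocorrelation(x, 1)
r2 = autocorrelation(x, 2)
r4 = autocorrelation(x, 4)
print(r1, r2, r4)
```

For this toy series the lag-4 value is 1 (the pattern repeats every four steps), the lag-2 value is -1 (values two steps apart sit on opposite sides of the mean), and the lag-1 value is 0.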

**Note:** It doesn't make much sense to talk about *the* correlation between $X$ and $Y$ if the relationship between $X$ and $Y$ was changing as the dataset was collected. In the same way, it doesn't make much sense to talk about *the* autocorrelation of $X$ with lag $k$ if the relationship between $X$ and its previous values changed over the course of the dataset.

Because time series data is gathered over time, changes in the underlying data-generating process over the course of data collection are particularly common for this kind of data. In a time series context, these kinds of changes are called *non-stationarity*.
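To make the note above concrete, the following sketch simulates a series whose data-generating process changes halfway through. Everything here is an assumption for illustration: the helper `simulate_ar1`, the coefficients 0.9 and -0.9, and the seed. The point is only that a single lag-1 autocorrelation for the whole series hides two very different regimes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

def simulate_ar1(phi, n, rng):
    """Simulate x_t = phi * x_{t-1} + noise, starting from zero."""
    x = np.zeros(n)
    noise = rng.normal(size=n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    return x

# First half: each value tends to follow the previous one (phi = 0.9).
# Second half: each value tends to flip sign relative to the previous one (phi = -0.9).
first_half = pd.Series(simulate_ar1(0.9, 500, rng))
second_half = pd.Series(simulate_ar1(-0.9, 500, rng))
whole = pd.concat([first_half, second_half], ignore_index=True)

print("lag-1 autocorrelation, first half:  ", first_half.autocorr(lag=1))
print("lag-1 autocorrelation, second half: ", second_half.autocorr(lag=1))
print("lag-1 autocorrelation, whole series:", whole.autocorr(lag=1))
```

The value for the whole series lands somewhere between the two halves and describes neither of them well.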

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (16.0, 8.0)

In [None]:
data = pd.read_csv('../assets/data/rossmann.csv', dtype={'StateHoliday': str})

In [None]:
# Cast "Date" column to Pandas Timestamp


In [None]:
# Make "Date" the row index


In [None]:
# Look at just Store 1


## Computing Autocorrelation

In [None]:
# Create a 1-unit lagged version of "Sales"


**Exercise (1 min, post immediately.)**

- Why is the first value 5020?

- Why is the last value NaN?

$\blacksquare$

In [None]:
# Compute the correlation between these series and the original "Sales" series


In [None]:
# Use .autocorr to compute this autocorrelation in one step


In [None]:
# Compute autocorrelation for "Sales" with a lag of 10
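
The steps in the cells above can be sketched on a small made-up series. This is not the Rossmann data: the values and the name `sales` are placeholders, and only the Pandas calls (`shift`, `corr`, `autocorr`) are the point.

```python
import pandas as pd

# Hypothetical stand-in for one store's daily sales (not the Rossmann data)
sales = pd.Series(
    [10.0, 12.0, 11.0, 15.0, 14.0, 18.0, 17.0, 21.0],
    index=pd.date_range("2015-01-01", periods=8, freq="D"),
    name="Sales",
)

# shift(1) moves every value one row down, so each row holds the previous
# row's value; the first row has no predecessor and becomes NaN
lagged = sales.shift(1)

# Series.corr ignores rows where either series is NaN
manual = sales.corr(lagged)

# Series.autocorr does the same shift-and-correlate in one step
one_step = sales.autocorr(lag=1)

# A larger lag works the same way
lag_3 = sales.autocorr(lag=3)

print(manual, one_step, lag_3)
```

Whether a positive or a negative shift corresponds to "earlier in time" depends on how the rows are sorted: `shift(1)` pulls in the value from the row above, while `shift(-1)` pulls in the value from the row below.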


## Plotting Autocorrelation Functions Using StatsModels and Pandas

Pandas provides convenience plots for autocorrelations.

In [None]:
# Make an autocorrelation plot with pandas


StatsModels also provides convenient functions for calculating and plotting autocorrelation. Load them up and try them out.

In [None]:
# Make an autocorrelation plot with statsmodels


This plots the correlation between the series and a lagged copy of itself for each lag shown on the horizontal axis. For instance, at lag `0` the series is perfectly correlated with itself, so the blue dot is at `1.0`. Points that fall outside the blue shaded band indicate statistically significant correlations. Big jumps in autocorrelation appear at lags that are multiples of seven. Our sales data are daily, so it makes a lot of sense that a given Monday's sales would be correlated with the prior Monday's (and the one before it, and so on), especially because every Sunday has zero sales!

These plots reveal seasonality in our time series.

In [None]:
# Get the acf values as an array 


## Partial Autocorrelation and the Partial Autocorrelation Function (PACF)

Another important chart for diagnosing your time series is the partial autocorrelation function (PACF) plot. This is similar to autocorrelation, but, instead of being just the correlation at increasing lags, it is the correlation at a given lag, _controlling for the effect of the shorter lags in between._

In [None]:
# Plot the partial autocorrelations using statsmodels


This plots the correlation at a given lag (indicated by the horizontal axis), controlling for all of the shorter lags. We continue to see big jumps in correlation at the weekly time lags, but they shrink as the lag increases.

**Note:** A lot of this pattern is driven by the zeros on Sundays.

**Exercise (1 min, post immediately.)**

- How might seasonality in a data set (monthly, weekly, etc.) show up in autocorrelation plots?

## Problems Posed by Autocorrelation

Suppose we built a linear regression model to predict sales from other features in our dataset. Unfortunately, certain properties of linear regression hold only when the residuals/error terms are independent of one another, and autocorrelation removes this independence. For instance, if our model underestimates sales on Day 20, then it is likely to underestimate them on Day 27 as well, because of the autocorrelation.

> **What are some problems that could arise when using autocorrelated data with a linear model?**
* Estimated regression coefficients are still unbiased, but they are no longer the minimum-variance unbiased estimators.
* The MSE may seriously underestimate the true variance of the errors.
* The standard error of the regression coefficients may seriously underestimate the true standard deviation of the estimated regression coefficients.
* Statistical intervals and inference procedures are no longer strictly applicable.

Statisticians have developed specialized models specifically for time series data because of these issues.
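A quick way to see the independence problem is to fit a straight line to data whose errors are autocorrelated and then inspect the residuals. The sketch below uses simulated data (a linear trend plus AR(1) errors with an arbitrary coefficient of 0.8) and `np.polyfit` as a stand-in for a linear regression. It is an illustration under those assumptions, not a formal diagnostic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed

# Hypothetical data: a linear trend in time plus AR(1) errors (phi = 0.8)
n = 500
t = np.arange(n)
errors = np.zeros(n)
shocks = rng.normal(size=n)
for i in range(1, n):
    errors[i] = 0.8 * errors[i - 1] + shocks[i]
y = 2.0 + 0.05 * t + errors

# Ordinary least squares fit of y on t (polyfit returns the slope first)
slope, intercept = np.polyfit(t, y, deg=1)
residuals = pd.Series(y - (intercept + slope * t))

# If the errors were independent, this would be close to zero
print("estimated slope:", slope)
print("lag-1 autocorrelation of residuals:", residuals.autocorr(lag=1))
```

The slope estimate is still close to the true value, but the residuals are strongly correlated with their own previous values, so an underestimate at one time point tends to be followed by another.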

# Recap

* Autocorrelation is a measure of how dependent a data point is on previous data points.
* Investigating ACF and PACF plots can help us identify seasonality in our time series data.
* Linear regression is problematic for autocorrelated data because the errors are no longer independent.

**Exercise (10 mins., pair programming)**

In [None]:
euro = pd.read_csv('../assets/data/euretail.csv')
euro.head()

- Set "Year" as the index

- Run the code below. What does `.stack()` do?

In [None]:
euro.head()

In [None]:
euro = euro.stack()

In [None]:
euro.head()

- Make a line plot of the values in `euro`

- Use `plot_acf` and `plot_pacf` to look at the autocorrelation in the data set.

- Interpret your findings.

$\blacksquare$