<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 15px">

## Autocorrelation and properties of time series data

Week 10 | Lesson 2.3

---

We are looking at the Rossmann store data to practice more timeseries plotting, look at different ways of modeling the mean (or median, etc.) of a timeseries, and learn about the autocorrelation of a vector.

Trends, moving averages, and autocorrelation are essential concepts to cover before jumping into modeling timeseries with ARIMA models.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

sns.set_style('whitegrid')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

---

## Timeseries trends

An example of an upward trend:

![](./assets/images/trend-line2.png)

A trend may "change direction", for example when it goes from increasing to decreasing. A trend can only be measured within the scope of the data collected; there may be trends that are unmeasurable if the data is incomplete.

---

## Seasonality

When patterns in the data set repeat over known, fixed periods of time, this is called **seasonality**.

A seasonal pattern exists when a series is influenced by factors relating to the cyclic nature of time, e.g. time of month, quarter, year, etc.

Seasonality is of a fixed and known period, otherwise it is not truly seasonality, and must be either attributed to another factor or counted as a set of anomalous events in the data.

![](./assets/images/seasonality_decreasing_trend.png)

---

## Rossmann store data

Load the Rossmann store data, then convert the date to datetime format and make it the index of the DataFrame:

In [2]:
data = pd.read_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/rossman_stores/rossmann.csv', skipinitialspace=True)


In [3]:
data.columns = ['store','day_of_week','date','sales','customers','open','promo','state_holiday','school_holiday']
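The date conversion and indexing described above might look like the following sketch. It uses a small made-up frame (`toy`, a hypothetical stand-in for `data`) so that it runs on its own; the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the Rossmann frame; the values are made up.
toy = pd.DataFrame({
    'store': [1, 1, 1],
    'date': ['2014-01-03', '2014-01-01', '2014-01-02'],
    'sales': [5100, 0, 4800],
})

# Parse the date strings into datetimes, then use them as the index.
toy['date'] = pd.to_datetime(toy['date'])
toy = toy.set_index('date').sort_index()

print(toy)
```

Sorting the index keeps the rows in chronological order, which the rolling and autocorrelation calculations later in the lesson rely on.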

**Plot a seaborn factorplot (renamed `catplot` in seaborn 0.9 and later) with `kind='box'` for stores 1, 4, and 9, with the day of the week on the x-axis and the sales on the y-axis:**

**Plot the sales and customer timeseries for store 1 when open:**

---

## Rolling means (and medians, etc.)

The "rolling mean" (or median) takes a specified window size and, at each time point, calculates the statistic from the current time point and the time points immediately before it that fall inside the window.


### Parameters for `rolling` functions

**`rolling().mean()`** (as well as **`rolling().median()`**) is called on the series to aggregate, and **`rolling()`** can take these parameters:

- **`window`** is the number of observations (days, for daily data) to include in the average
- **`center`** is whether the window should be centered on the date or use only data up to and including that date
- **`freq`** was accepted by older versions of pandas to set the level to roll the averages up to, but it has since been removed. To roll up first, call **`resample`** (with **`D`** for day, **`M`** for month or **`A`** for year, etc.) before calling **`rolling`**.
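As a minimal sketch of how `window` and `center` change the result, the example below uses a made-up week of daily values (the `sales` series here is hypothetical, not the Rossmann data):

```python
import pandas as pd

# Made-up daily sales for one week (illustrative values only).
idx = pd.date_range('2014-01-01', periods=7, freq='D')
sales = pd.Series([10, 20, 30, 40, 50, 60, 70], index=idx, dtype=float)

# Trailing window: each value averages the current day and the two days before it.
trailing = sales.rolling(window=3).mean()

# Centered window: each value averages the day before, the current day, and the day after.
centered = sales.rolling(window=3, center=True).mean()

print(pd.DataFrame({'sales': sales, 'trailing': trailing, 'centered': centered}))
```

The trailing version has no value for the first two days because a full window is not yet available; the centered version is missing one value at each end instead.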



**Calculate the weekly rolling median of sales for store 1 in 2014, using a weekly time period with an order 2 window:**

### Expanding mean

The "expanding mean" simply uses all datapoints up to the current time to calculate the mean, as opposed to a moving window.
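A small sketch of the difference, using a made-up series (the values are illustrative only): the expanding mean at each point averages everything seen so far, while the rolling mean only looks at the last `window` points.

```python
import pandas as pd

# Made-up values, not the Rossmann data.
sales = pd.Series([10.0, 20.0, 60.0, 30.0])

# Expanding mean: average of all points up to and including the current one.
expanding_mean = sales.expanding().mean()

# Rolling mean with a window of 2, for comparison.
rolling_mean = sales.rolling(window=2).mean()

print(pd.DataFrame({'sales': sales,
                    'expanding': expanding_mean,
                    'rolling_2': rolling_mean}))
```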

**Plot the rolling mean and the expanded mean for store 1 sales in year 2014:**

### Exponentially weighted windows

Exponentially weighted windows are one of the most common and effective ways of averaging out noise in timeseries data. The averaging is done with an "exponential decay" on the contribution of prior means, decreasing the contribution of timepoints further in the past.

The (adjusted) exponentially weighted mean $y_t$ for time $t$, given observations $x_0, x_1, ... x_t$ and a smoothing factor $0 < \alpha \le 1$, is defined as:

### $$ y_t = \frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2x_{t-2} + ... + (1 - \alpha)^{t}x_0} {1 + (1 - \alpha) + (1 - \alpha)^2 + ... + (1 - \alpha)^{t}} $$

See:

http://pandas.pydata.org/pandas-docs/stable/computation.html#exponentially-weighted-windows
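To make the definition concrete, the sketch below computes the adjusted exponentially weighted mean directly as a weighted average and compares it with pandas' `ewm(..., adjust=True).mean()`. The series `x`, the choice `alpha = 0.5`, and the helper `ewm_by_hand` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up observations and smoothing factor (illustrative only).
x = pd.Series([1.0, 2.0, 4.0, 8.0])
alpha = 0.5

# pandas' adjusted exponentially weighted mean.
ewm_pandas = x.ewm(alpha=alpha, adjust=True).mean()

def ewm_by_hand(values, alpha):
    """Adjusted exponentially weighted mean as an explicit weighted average."""
    values = np.asarray(values, dtype=float)
    out = []
    for t in range(len(values)):
        # Weights (1 - alpha)^0, (1 - alpha)^1, ... for x_t, x_{t-1}, ..., x_0.
        weights = (1 - alpha) ** np.arange(t + 1)
        # Observations in reverse order: x_t, x_{t-1}, ..., x_0.
        window = values[t::-1]
        out.append(np.sum(weights * window) / np.sum(weights))
    return np.array(out)

manual = ewm_by_hand(x.values, alpha)
print(pd.DataFrame({'x': x, 'pandas': ewm_pandas, 'by_hand': manual}))
```

A larger `alpha` puts more weight on the most recent observation, so the smoothed series follows the data more closely; a smaller `alpha` smooths more heavily.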



**Plot the rolling and exponentially weighted mean of sales data for the winter months of store 1 sales in 2014:**

---

##  Autocorrelation and the autocorrelation function (acf)

While in previous weeks our analyses have been concerned with the correlation between two or more variables (height and weight, education and salary, etc.), in time series data, autocorrelation is a measure of _how correlated a variable is with itself_.

Specifically, autocorrelation measures how closely related earlier values are with values occurring later in time.

Examples of autocorrelation:

    In stock market data the stock price at one point is correlated with the stock 
    price of the point directly prior in time. 
    
    In sales data (like we have seen), sales on a Saturday are likely correlated with 
    sales on the next Saturday and the previous Saturday, as well as with other days to a
    greater or lesser extent.

Below is the formula for the autocorrelation function (acf):

$\text{Given measurements } x_1, x_2, x_3 ... x_n \text{ at time points } t_1, t_2, t_3 ... t_n:$

### $$lag_k\;acf() = \frac{\sum_{t=k+1}^{n}\left(\;x_t - \bar{x}\;\right)\left(\;x_{t-k} - \bar{x}\;\right)}{\sum_{t=1}^n\left(\;x_t - \bar{x}\;\right)^2}$$

Compare this to the formula for correlation:

$\text{Given measurements } x_1, x_2, x_3 ... x_n \text{ and measurements } y_1, y_2, y_3 ... y_n:$

### $$r_{xy} = \frac{\sum_{i=1}^{n}\left(\;x_i - \bar{x}\;\right)\left(\;y_{i} - \bar{y}\;\right)}{\sqrt{\left(\sum_{i=1}^{n}\left(\;x_i - \bar{x}\;\right)^2\sum_{i=1}^n\left(\;y_i - \bar{y}\;\right)^2\right)}}$$
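The lag-$k$ autocorrelation formula can be translated almost term by term into NumPy. The sketch below is one way to do it; the function name `acf_lag_k` and the sample values are made up for illustration.

```python
import numpy as np

def acf_lag_k(x, k):
    """Lag-k autocorrelation, following the acf formula term by term."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dev = x - x.mean()                      # (x_t - x_bar) for every t
    # Numerator: sum over t = k+1..n of (x_t - x_bar) * (x_{t-k} - x_bar).
    numerator = np.sum(dev[k:] * dev[:n - k])
    # Denominator: sum over all t of (x_t - x_bar)^2.
    denominator = np.sum(dev ** 2)
    return numerator / denominator

values = [1, 2, 3, 4, 5]                    # made-up measurements
print(acf_lag_k(values, 0))
print(acf_lag_k(values, 1))
```

Unlike the correlation formula, the same mean and the same full-series sum of squares are used for both the original and the lagged values, so the lag 0 autocorrelation is exactly 1.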

---

## Computing Autocorrelation

To compute autocorrelation, we fix a lag _k_ which is the delta between the given point and the prior point used to compute the correlation.

With a _k_ value of 1, we'd compute how correlated a value is with the prior one. With a _k_ value of 10, we'd compute how correlated a variable is with one 10 time points earlier.
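In pandas, one way to do this is `Series.autocorr(lag=k)`, which correlates the series with a copy of itself shifted by `k` points. The sketch below uses a made-up series that repeats every 7 days (hypothetical values, not the Rossmann data) so that the weekly lag stands out.

```python
import numpy as np
import pandas as pd

# Made-up daily series with an exact 7-day repeating pattern (4 weeks).
idx = pd.date_range('2014-01-01', periods=28, freq='D')
sales = pd.Series(np.tile([5, 6, 7, 6, 8, 9, 0], 4), index=idx, dtype=float)

# Correlation of each value with the value one day earlier.
lag1 = sales.autocorr(lag=1)

# Correlation of each value with the value seven days earlier.
lag7 = sales.autocorr(lag=7)

# The same lag 7 calculation written out with shift().
lag7_by_hand = sales.corr(sales.shift(7))

print(lag1, lag7, lag7_by_hand)
```

Note that `autocorr` is a Pearson correlation over the overlapping pairs only, so its values can differ slightly from the acf formula, which uses the mean and sum of squares of the full series.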

**Calculate the autocorrelation for store 1 (when open) sales for day and week periods with lag 1:**

**Calculate the autocorrelation for store 1 (including days closed) for day with lag 7:**

**Calculate the autocorrelation for store 1 through 7 days:**

**Plot the autocorrelation for store 1 for 31 days:**

**Load the `acf` and `plot_acf` functions from statsmodels to plot the autocorrelation.**

In [4]:
from statsmodels.tsa.stattools import acf
from statsmodels.graphics.tsaplots import plot_acf

**Calculate and plot the autocorrelation again using the statsmodels functions:**