# Autocorrelation: Exploring Rossmann Drug Store Sales Data

In [1]:
import pandas as pd, numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv('data/rossmann.csv', skipinitialspace=True, low_memory=False)

In [2]:
# we are most interested in the `Date` column, which contains the date of sales for each store; convert it to datetime and set it as the index


In [3]:
# describe and EDA


In [4]:
# sort dates


In [5]:
# df of store 1


To see how holidays affect sales, we can use box plots, which let us compare the distribution of sales on holidays against all other days. On state holidays the store is closed (and, as a nice sanity check, there are 0 sales), while on school holidays sales are relatively similar to other days.
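As a numeric counterpart to the box plot, the quartiles it draws can be tabulated per group. The following is a minimal sketch on synthetic data; the frame `toy` and its values are made up for illustration, and the column names `Sales` and `SchoolHoliday` are assumed to match the Rossmann file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one store's rows (assumption: not the real Rossmann data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Sales": rng.integers(3000, 8000, size=200),
    "SchoolHoliday": rng.integers(0, 2, size=200),
})

# One row per group with count, mean, std, min, quartiles, and max.
# The quartiles are the box edges and median line of a box plot.
summary = toy.groupby("SchoolHoliday")["Sales"].describe()
print(summary[["25%", "50%", "75%"]])
```

If the quartiles of the two groups are close, the boxes in the plot will overlap heavily, which is what "relatively similar" means here.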

In [6]:
# do school holidays affect sales?


In [7]:
# does day of week affect sales?


Lastly, we want to identify larger-scale trends in our data. How did sales change from 2014 to 2015? Were there any particularly interesting outliers in terms of sales or customer visits?

In [8]:
# plot store 1 sales when open


In [9]:
# plot store 1 customer count when open


## Autocorrelation

To measure how strongly sales are correlated with their own past values, we compute the _autocorrelation_ of the 'Sales' column. In pandas, we do this with the `autocorr` method.

`autocorr` takes one argument, `lag`, which is how many rows back to shift the series before computing the correlation. With `lag` set to 1, we compute the correlation between every point and the point directly preceding it; with `lag` set to 10, we compute the correlation between every point and the point 10 rows (here, 10 days) earlier.
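To make the role of `lag` concrete, here is a small sketch on a synthetic daily series with a built-in weekly pattern. The series `sales` and its values are invented for illustration and are not taken from the Rossmann data.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: a repeating 7-day pattern plus a little noise.
dates = pd.date_range("2015-01-01", periods=70, freq="D")
rng = np.random.default_rng(1)
weekly = np.tile([5.0, 6.0, 7.0, 6.5, 8.0, 9.0, 0.0], 10)
sales = pd.Series(weekly + rng.normal(0, 0.3, size=70), index=dates)

lag1 = sales.autocorr(lag=1)   # each day vs. the day before
lag7 = sales.autocorr(lag=7)   # each day vs. the same weekday one week earlier

# autocorr(lag=k) is the Pearson correlation of the series with itself shifted by k rows.
manual_lag7 = sales.corr(sales.shift(7))

print(f"lag 1: {lag1:.3f}, lag 7: {lag7:.3f}, manual lag 7: {manual_lag7:.3f}")
```

Because the pattern repeats every seven rows, the lag-7 value comes out far higher than the lag-1 value, which is the kind of signal to look for in the real sales data.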

In [10]:
# resample all store data to average daily sales


In [11]:
# check autocorrelation for previous two weeks


In [12]:
# plot autocorrelation for different lags using pandas
from pandas.plotting import autocorrelation_plot


## Rolling Averages

If we want to investigate trends in sales over time, we start, as always, by computing simple aggregates. We want to know the mean and median sales for each month and year.

In Pandas, this is done with the `resample` method, which is very similar to `groupby`. It allows us to group over different time intervals.

We call `data.resample` and specify:

- The level to roll up to: 'D' for day, 'W' for week, 'M' for month, 'A' for year (recent pandas releases spell the last two 'ME' and 'YE')
- The aggregation to perform: 'mean', 'median', 'sum', etc. In current pandas this is a method chained onto the result, as in `data.resample('D').mean()`, rather than an argument
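A minimal sketch of a monthly roll-up, assuming a current pandas release where the aggregation is chained onto `resample`. The series is made up, and the alias `'MS'` (month start) is used instead of `'M'` only because it is spelled the same across old and new pandas versions.

```python
import numpy as np
import pandas as pd

# Toy daily sales for the first quarter of 2015 (made-up values, not Rossmann data).
dates = pd.date_range("2015-01-01", "2015-03-31", freq="D")
sales = pd.Series(np.arange(len(dates), dtype=float), index=dates, name="Sales")

# "MS" buckets rows by calendar month and labels each bucket with the month's first day.
monthly = sales.resample("MS").agg(["mean", "median"])
print(monthly)
```

The result has one row per month and one column per aggregate, much like a `groupby` on a month key would produce.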

In [13]:
# resample original sales data mean and median by month


While identifying the monthly averages is useful, we often want to compare the sales on a given date to a smaller window. To understand holiday sales, we don't want to compare late December with the entire month, but perhaps with the few days surrounding it. We can do this using rolling averages.

In [14]:
# resample to have the daily average over all stores, then find rolling mean


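`rolling` has these important parameters:

- the series to aggregate is the object `rolling` is called on (in the older `pd.rolling_mean` function it was the first argument)
- `window` is the number of rows (here, days) to include in the average
- `center` is whether the window should be centered on the date or use only data up to that date
- `freq` is the level to roll up averages to (as in `resample`): `D` for day, `M` for month, `A` for year, etc. Only older pandas versions accept it; in current versions, call `resample` first and then `rolling`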
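The effect of `window` and `center` is easiest to see on a tiny series. This sketch assumes a current pandas release, where `rolling` is a method on the series; the values are invented for illustration.

```python
import pandas as pd

dates = pd.date_range("2015-01-01", periods=7, freq="D")
sales = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0], index=dates)

# Trailing window: each value averages the current day and the two days before it.
trailing = sales.rolling(window=3, center=False).mean()

# Centered window: each value averages the day before, the day itself, and the day after.
centered = sales.rolling(window=3, center=True).mean()

print(pd.DataFrame({"sales": sales, "trailing": trailing, "centered": centered}))
```

The trailing version leaves the first two rows empty, while the centered version leaves one empty row at each end, because those positions do not have a full window of three values.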

Rather than plotting the full time series, we can plot the rolling mean, which smooths out random changes in sales and dampens outliers, helping us identify larger trends.

In [15]:
# plot rolling mean


## Pandas Window functions
Pandas `rolling` is one example of the window functions Pandas provides. Window functions operate on a set of N consecutive rows (a window) and produce an output: mean, median, min, max, sum, etc.

Another common one is `diff`, which takes the difference over time. The `diff` method takes one argument, `periods`, which is how many rows back to look for the value that is subtracted from the current row.
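As a quick illustration of `periods`, the sketch below computes day-over-day and week-over-week changes on a made-up series (the values are not from the Rossmann data).

```python
import pandas as pd

dates = pd.date_range("2015-01-01", periods=10, freq="D")
sales = pd.Series(
    [100, 120, 90, 110, 130, 150, 0, 105, 125, 95], index=dates, dtype=float
)

day_over_day = sales.diff(periods=1)    # today minus yesterday
week_over_week = sales.diff(periods=7)  # today minus the same weekday last week

# diff(periods=k) gives the same result as subtracting the series shifted by k rows.
same_as_shift = day_over_day.equals(sales - sales.shift(1))
print(week_over_week.dropna())
```

The first `periods` rows of each result are empty, since there is no earlier row to subtract.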

In addition to `rolling` functions, Pandas provides a similar collection of `expanding` functions, which, instead of a fixed window, use all values up to that point in time.
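A small sketch of `expanding` on invented values shows how the window grows with each row, and what to expect at the last row.

```python
import pandas as pd

dates = pd.date_range("2015-01-01", periods=5, freq="D")
sales = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=dates)

# Each row uses every value from the start of the series up to and including that row.
running_mean = sales.expanding().mean()
running_total = sales.expanding().sum()

print(pd.DataFrame({"sales": sales, "mean": running_mean, "total": running_total}))
```

At the last row the expanding window covers the whole series, so the expanding mean equals the overall mean and the expanding sum equals the overall sum.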

In [16]:
# calculate diff for open store 1 data


In [17]:
# compute average daily expanding sales


In [18]:
# does expanding sales at the last row work as assumed?


## Exercises

In [19]:
# plot the distribution of sales by month and compare the effect of promotions


In [20]:
# Are sales more correlated with the prior day, day of week, last month, or last year?


In [21]:
# plot the 15 day rolling mean of customers in the stores


In [22]:
# identify the date with largest drop in average sales from previous cycles: daily, weekly, etc.


In [23]:
# filter out closed days


In [24]:
# compute the total sales up until Dec. 2014


In [25]:
# When were the largest differences between 15-day moving/rolling averages?


In [26]:
# sort values
