In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Taming time series

A time series is a sequence of data points recorded at successive times, usually at equally spaced intervals.

## Importing and visualising Time Series data

In this module, we will consider the "bikes" dataset again but this time taking a time-series perspective. 

* Load the data `bikes.csv`
* Specify that you want to parse the column `'date'` as dates using `parse_dates=['date']` and
* Use the dates as the index column

Check with `.head()` that everything is fine then
* Display the temperature time series (you can either use `pandas` plotting facility by doing `bikes.plot(...)` or use `matplotlib` directly)
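
As a minimal sketch of the loading step, the snippet below reads a tiny in-memory CSV instead of `bikes.csv`. The column names and values are made up for illustration; only the use of `parse_dates` and `index_col` mirrors the instructions above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: two rows with a date column.
csv_text = "date,temperature\n2011-01-01,8.2\n2011-01-02,9.1\n"

toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

# The index should now be a DatetimeIndex rather than plain strings.
print(type(toy.index).__name__)
print(toy.head())
```

Having the dates as a `DatetimeIndex` is what makes date-based selection and resampling possible later on.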

In [None]:
bikes = pd.read_csv('data/bikes.csv', parse_dates=['date'], index_col='date')
bikes.head()

In [None]:
bikes.plot()

### Exercise: only plot the temperature
With the graph above it's hard to get an indication for the temperature in the data. Plot a graph with just the temperature as y-axis.

In [None]:
bikes.plot(y='temperature', figsize=(12, 3), fontsize=12)


### Exercise: Number of bikes in a given month

From the chart above, you can see that the data covers two years and that there is a similar pattern in both years which is to be expected for the temperatures. 

The data can be queried according to dates. For example, you can aggregate the data from just January 2012 and check the sum of bikes hired that month. Can you compare that with the number of bikes hired in August? 
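
One way to query by date, sketched here on made-up data, is partial string indexing: a label such as `"2012-01"` selects all rows in that month. The frame `toy` and its values are hypothetical and only assume a `count` column like the one in the bikes data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data: one bike hired per day over the first quarter of 2012.
days = pd.date_range("2012-01-01", "2012-03-31", freq="D")
toy = pd.DataFrame({"count": np.ones(len(days), dtype=int)}, index=days)

# Partial string indexing: a 'YYYY-MM' label selects every row in that month.
jan_total = toy["count"].loc["2012-01"].sum()
feb_total = toy["count"].loc["2012-02"].sum()

print(jan_total, feb_total)
```

With one hire per day, the totals are simply the number of days in each month, which makes the selection easy to check.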

In [None]:
jan_start = pd.Timestamp("1st January 2012")
jan_end = pd.Timestamp("31st January 2012")
bikes_jan = bikes[jan_start:jan_end]['count'].sum()

aug_start = pd.Timestamp(2012, 8, 1)
aug_end = pd.Timestamp(2012, 8, 31)
bikes_aug = bikes[aug_start:aug_end]['count'].sum()

print("{0:.0f} bikes in January vs {1:.0f} bikes in August.".format(bikes_jan, bikes_aug))

## Resampling

We can aggregate time series by resampling the points on a coarser time level. 

* Use `.resample` to get the data corresponding to monthly averages
* Display the `temperature` time series for the monthly averages. 
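
The idea can be sketched on a small made-up series. The values are hypothetical, and the sketch uses the month-start alias `'MS'` as an assumption for portability, since the spelling of the month-end alias differs between pandas versions (`'M'` in older releases, `'ME'` in newer ones).

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures: 5.0 throughout January, 7.0 throughout February.
days = pd.date_range("2012-01-01", "2012-02-29", freq="D")
temps = pd.Series(np.where(days.month == 1, 5.0, 7.0), index=days)

# 'MS' labels each bin by the month start; the mean is taken over each calendar month.
monthly = temps.resample("MS").mean()

# 'W' groups the same data into weekly bins instead.
weekly = temps.resample("W").mean()

print(monthly)
```

Resampling to a coarser level reduces the number of points: here 60 daily values become 2 monthly averages.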

In [None]:
bikes_monthly = bikes.resample('M').mean()

plt.figure(figsize=(12, 3))
plt.plot(bikes_monthly.temperature, "-o")

### Resample by the mean of each week and plot the humidity

In [None]:
bikes.plot(y='humidity', figsize=(12, 3), fontsize=12)

In [None]:
bikes_weeks = bikes.resample('W').mean()

plt.figure(figsize=(12, 3))
plt.plot(bikes_weeks.humidity, "-o")

## Parsing custom date formats

When you loaded the bikes dataset, Pandas automatically detected the format of the dates for you.
This often "just works", but there will be cases where you need to be careful about parsing and may have to do it yourself.

Load the data `NZAlcoholConsumption` and have a look at it without specifying a column to parse for dates. 

In [None]:
alcohol_consumption = pd.read_csv('data/NZAlcoholConsumption.csv')
alcohol_consumption.head()

This dataset contains data aggregated by quarter. The timestamp is stored as a string in which the first four characters represent the year and the last two the quarter. 
To transform the timestamps into dates that pandas can use directly, you can write a parser function. 
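
As a cross-check on a hand-written parser, pandas can also interpret quarter labels itself through `pd.Period`. The sketch below is an alternative illustration, not the approach used in this notebook, and assumes the labels refer to calendar quarters.

```python
import pandas as pd

# pandas understands quarter labels such as '2000Q1' when building a Period.
q1 = pd.Period("2000Q1")
q4 = pd.Period("2000Q4")

# end_time is the last instant of the quarter; normalize() resets it to midnight.
q1_end = q1.end_time.normalize()
q4_end = q4.end_time.normalize()

print(q1_end, q4_end)
```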


### Exercise: parsing quarter
Write a function `parse_quarter` that takes a string of the form `YYYYQN` and converts it to a `pandas.Timestamp` object. Use the following conversion for the quarters:

* Q1 --> Mar 31
* Q2 --> Jun 30
* Q3 --> Sep 30
* Q4 --> Dec 31

In [None]:
import re
def parse_quarter(string):
    """
    Converts a string of the form YYYYQN to a pd.Timestamp at the end of quarter N.
    """

    # Note: you could also just retrieve the first four characters of the string
    # and the last one... Regex is fun but often not necessary.
    # The pattern only accepts years 2000-2099.
    year, qn = re.search(r'^(20[0-9][0-9])(Q[1-4])$', string).group(1, 2)

    # year is a string at this point, pd.Timestamp expects integers.
    year = int(year)

    date = None

    if qn == 'Q1':
        date = pd.Timestamp(year, 3, 31)
    elif qn == 'Q2':
        date = pd.Timestamp(year, 6, 30)
    elif qn == 'Q3':
        date = pd.Timestamp(year, 9, 30)
    else:
        date = pd.Timestamp(year, 12, 31)

    return date

# Check that it works!
print(parse_quarter("2000Q3")) # should show 2000-09-30 00:00:00

### Giving the parser to pandas

Pandas can parse dates using a custom parser such as the one you just defined. To do so, pass your function via the `date_parser` option.
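
Recent pandas releases deprecate the `date_parser` option, so it can be useful to know an alternative: load the column as plain strings, then convert it afterwards. The sketch below uses a made-up in-memory CSV and a hypothetical helper `quarter_end`; both are assumptions for illustration and are not part of the dataset or of pandas.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file; column names and values are made up.
csv_text = "DATE,TotalWine\n2000Q2,2.0\n2000Q1,1.0\n"

def quarter_end(label):
    """Map a 'YYYYQN' label to the last day of that calendar quarter."""
    year, quarter = int(label[:4]), int(label[-1])
    # The first day of the quarter's last month, rolled forward to the month end.
    return pd.Timestamp(year, 3 * quarter, 1) + pd.offsets.MonthEnd(0)

df = pd.read_csv(io.StringIO(csv_text))
df["DATE"] = df["DATE"].map(quarter_end)   # convert after loading
df = df.set_index("DATE").sort_index()

print(df)
```

Converting after loading keeps the parsing logic independent of `read_csv` options, at the cost of one extra step.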

In [None]:
# reload the data using your parser, set the index to the date 
alcohol_consumption = pd.read_csv('data/NZAlcoholConsumption.csv', 
                                  parse_dates=['DATE'], 
                                  date_parser=parse_quarter,
                                  index_col='DATE')
alcohol_consumption.sort_index(inplace=True)
alcohol_consumption.head()

### Exercise: Display the time series

Now, have a look at the consumption of wine and beer, and show both on the same figure. Discuss the two time series.

In [None]:
plt.figure(figsize=(8, 6))
plt.plot(alcohol_consumption.TotalWine, 
         '-o', label='Wine')
plt.plot(alcohol_consumption.TotalBeer, 
         '-o', label='Beer')
plt.legend(fontsize=12)


The plots show that the two time series have similar seasonal patterns but different trends.
Both show that alcohol consumption peaks in the last quarter of the year and is usually at its lowest in the second quarter. 
The average beer consumption seems stable over the years, while the wine consumption seems to be steadily increasing. 

### Exercise: resample the data per year (12 months) 
Can you resample the data per year (12 months) and see whether the trends come out better? 
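
Depending on where the series starts, 12-month bins may not line up with calendar years. A sketch of one alternative, assuming calendar-year averages are what is wanted, is to group by the year of each timestamp. The series `toy` and its values are made up.

```python
import pandas as pd

# Hypothetical quarterly values for two years, indexed by quarter-end dates.
idx = pd.to_datetime([
    "2000-03-31", "2000-06-30", "2000-09-30", "2000-12-31",
    "2001-03-31", "2001-06-30", "2001-09-30", "2001-12-31",
])
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 5.0], index=idx)

# Group by the calendar year of each timestamp and average within each year.
yearly = toy.groupby(toy.index.year).mean()

print(yearly)
```

Each yearly value averages exactly four quarters, so the seasonal swing within a year cancels out and only the year-to-year change remains.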

In [None]:
alc_yearly = alcohol_consumption.resample('12M').mean()

plt.figure(figsize=(12, 3))
plt.plot(alc_yearly.TotalWine, "-o", label="Wine")
plt.plot(alc_yearly.TotalBeer, "-o", label="Beer")
plt.legend(fontsize=12)

## Moving Windows

In the cells below you will explore the effect of applying a rolling average to the data: slide a window over a fixed number of successive points, take their average, and assign it to one position in the window (either its right edge or its center).

* Use the `rolling` method from `pd.Series` ([documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.rolling.html#pandas.Series.rolling))
* specify a window of 4 points

Plot the averaged line and the original time series, and discuss. 
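
To see what the window does, here is a small sketch on a made-up series that repeats the same four values. The data are hypothetical; a centered window of odd width 3 is used in the second call only so that the middle of the window is unambiguous.

```python
import pandas as pd

# Hypothetical quarterly pattern repeated for three years (no trend).
toy = pd.Series([10.0, 20.0, 30.0, 40.0] * 3)

# Trailing window: each value is the mean of the current point and the 3 before it.
trailing = toy.rolling(window=4).mean()

# Centered window of odd width: the mean is assigned to the middle point.
centered = toy.rolling(window=3, center=True).mean()

print(trailing.tolist())
print(centered.tolist())
```

With a window equal to the seasonal period, every full window contains one complete cycle, so the trailing mean is constant. The first three entries are `NaN` because the window is not yet full.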

In [None]:
plt.figure(figsize=(8, 6))
plt.plot(alcohol_consumption.TotalWine,
         '-o', label='wine consumption')
rolling_mean = alcohol_consumption.TotalWine.rolling(window=4).mean()
plt.plot(rolling_mean, label='trend')
plt.legend(fontsize=12);

The rolling mean curve seems to capture the trend nicely and removes much of the seasonal movement. 
This curve makes it easier to appreciate the overall increase in wine consumption over time, as well as the dip in consumption in 2008. 

To explore this rolling average further, it's nice to use widgets. Have a look at the cell below and modify it at will. 

In [None]:
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

def rolling_avg_plot(window_size):
    plt.plot(alcohol_consumption.TotalWine, 
             '-o', label='wine consumption')
    rolling = alcohol_consumption.TotalWine.rolling(window=window_size).mean()
    plt.plot(rolling, label='trend')
    plt.legend();
    plt.show()

# A window needs at least one point, so the slider starts at 1.
interact(rolling_avg_plot, window_size=(1, 10));

### Exercise: plot the moving sum with a window of width 4

In [None]:
plt.figure(figsize=(8, 6))
plt.plot(alcohol_consumption.TotalWine,
         '-o', label='wine consumption')
rolling_sum = alcohol_consumption.TotalWine.rolling(window=4).sum()
plt.plot(rolling_sum, label='rolling sum (window=4)')
plt.legend(fontsize=12);


## Differencing

Differencing amounts to looking at the time series formed of differences between values separated by a given lag: 

$y'_t = y_t-y_{t-1}$

for a lag of 1. Show the time series for `TotalWine` and the differenced one (with lag 1). What do you observe? 
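
The effect of the lag can be sketched on a made-up series with a period of 4 and a small upward trend. The values are hypothetical and chosen so the differences are easy to verify by hand.

```python
import pandas as pd

# Hypothetical quarterly series: the same pattern each year, shifted up by 1 per year.
toy = pd.Series([10, 20, 30, 40, 11, 21, 31, 41, 12, 22, 32, 42])

# Lag 1: difference between consecutive points (the seasonal jumps remain).
lag1 = toy.diff(1)

# Lag 4: difference with the same quarter one year earlier (only the trend remains).
lag4 = toy.diff(4)

print(lag1.tolist())
print(lag4.tolist())
```

The first `d` values of a series differenced with lag `d` are `NaN`, since there is no earlier point to subtract.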

In [None]:
plt.figure(figsize=(8, 6))
plt.plot(alcohol_consumption.TotalWine, '-o', 
         label="original ts")
plt.plot(alcohol_consumption.TotalWine.diff(1), '-o', 
         label="differenced ts (lag=1)")
plt.legend(fontsize=12)

To get a feel for what a good lag should be (though here, intuitively, you should realise that a lag of `4` is a good idea), you can look at the cell below that shows differenced series for increasing lags. 

In [None]:
def differencing_plot(d):
    differenced_ts = alcohol_consumption.TotalWine.diff(d)
    plt.plot(differenced_ts, '-o')
    plt.show()

interact(differencing_plot, d=(1, 10));

## Autocorrelation

Autocorrelation measures the correlation (similarity) between the time series and a lagged version of itself. 

* Use the `autocorr` method ([documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.autocorr.html)) to compute the autocorrelation for lags from 1 to 12
* Display the values of the autocorrelation using a stem plot (`plt.stem`) 

What do you observe? 
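
As a sketch of what `autocorr` computes, the snippet below compares it with the correlation between a made-up series and a shifted copy of itself. The series is hypothetical and has an exact period of 4.

```python
import pandas as pd

# Hypothetical series with an exact period of 4.
toy = pd.Series([10.0, 20.0, 30.0, 40.0] * 3)

# autocorr(lag) is the Pearson correlation between the series and its shifted copy.
by_method = toy.autocorr(lag=4)
by_hand = toy.corr(toy.shift(4))

# At half the period, high values line up with low ones.
half_period = toy.autocorr(lag=2)

print(by_method, by_hand, half_period)
```

A lag equal to the period aligns each point with an identical one, which gives a correlation of essentially 1, while a lag of half the period gives a negative correlation.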

In [None]:
lags = range(1, 13)
autocorrs = [alcohol_consumption.TotalWine.autocorr(lag=lag) 
                   for lag in lags]
plt.figure(figsize=(8, 6))
plt.stem(lags, autocorrs)
plt.xlabel("Lag", fontsize=12)
plt.ylabel("Autocorrelation", fontsize=12)

It's quite clear from this plot that the time series is similar to itself at a lag of 4, and consistently so at multiples of it (lags of 8, 12, etc.).