# Time series modeling

files needed = ( none )

We continue to work with the statsmodels package [(docs)](https://devdocs.io/statsmodels/). This notebook is a brief introduction to the time series methods. 

We are only going to touch on autoregressive models here, but the statsmodels [tsa](https://www.statsmodels.org/dev/tsa.html) package also supports moving average and ARIMA models, as well as vector autoregression and error correction models. The package also includes filtering methods and time-series plotting functions. 

Time series models are often used for forecasting. If forecasting interests you, Prof. Hansen offers a [forecasting course](https://www.ssc.wisc.edu/~bhansen/460/) (ECON 460) where you can learn a lot more about these kinds of models.

In [None]:
import pandas as pd                    # for data handling
import numpy as np                     # for numerical methods and data structures
import matplotlib.pyplot as plt        # for plotting
import seaborn as sea                  # advanced plotting

import statsmodels.graphics.tsaplots as tsaplots  # Gives us the autocorrelation plot
import statsmodels.tsa as tsa                     # The time series models

from pandas_datareader import data, wb    # we are grabbing the data and wb functions from the package
import datetime as dt                     # for time and date

### GDP 

Let's model US GDP as an autoregressive process. Grab the data from FRED.

In [None]:
codes = ['GDPC1']  # The code for real GDP at FRED.
start = dt.datetime(1947, 1, 1)

usa = data.DataReader(codes, 'fred', start)
usa.info()

In [None]:
usa.rename(columns={'GDPC1':'gdp'}, inplace=True)

In [None]:
fig, ax = plt.subplots(figsize=(15,6))

ax.plot(usa.index, usa['gdp'], color = 'red')
ax.set_ylabel('bil 2012 dollars')

sea.despine(ax=ax)

plt.show()

### Stationarity

The GDP data are not stationary: there is clearly a trend. Autoregressive models are usually used on stationary data. We can filter the data to remove the trend and recover a stationary series (tsa has [methods](https://www.statsmodels.org/dev/tsa.html#time-series-filters) to do this), but GDP is usually stationary in growth rates. 

Compute the growth rate of the GDP data.

In [None]:
# The log-difference is a close approximation of the growth rate when growth rates are small. 
# Log-differences are often used in practice because they are symmetric:
# a rise followed by an equal-sized fall in logs returns the series to where it started.

usa['gdp_diff'] = np.log(usa['gdp']) - np.log(usa['gdp'].shift(1))
usa['gdp_pct'] = usa['gdp'].pct_change()
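
To see how close the two measures are, here is a minimal sketch on a made-up series (the numbers in `levels` are hypothetical, not GDP data). It compares the simple percent change with the log-difference and checks the additive property that makes log-differences convenient.

```python
import numpy as np

# Hypothetical level series: up 2%, then up 10%, then down 10%
levels = np.array([100.0, 102.0, 112.2, 100.98])

pct_change = levels[1:] / levels[:-1] - 1   # simple growth rate
log_diff = np.diff(np.log(levels))          # log-difference

print('pct change:', np.round(pct_change, 4))
print('log diff:  ', np.round(log_diff, 4))

# The gap between the two measures grows with the size of the change
print('gap:       ', np.round(pct_change - log_diff, 4))

# Log-differences add up: their sum is the log of the total change
print(np.isclose(log_diff.sum(), np.log(levels[-1] / levels[0])))
```

For the 2 percent move the two measures nearly coincide; for the 10 percent moves the gap is visibly larger.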

In [None]:
fig, ax = plt.subplots(figsize=(15,6))

ax.plot(usa.index, usa['gdp_diff'], color = 'red')
ax.plot(usa.index, usa['gdp_pct'], color = 'blue')

ax.set_ylabel('growth rate')

sea.despine(ax=ax)

plt.show()

This looks roughly stationary. Formal tests for stationarity, such as the Dickey-Fuller test, are available in statsmodels, but they are outside of the scope of this notebook. 

### Autoregressive model 

We say [autoregressive model](https://en.wikipedia.org/wiki/Autoregressive_model) because we are modeling the variable as a function of its past values. The general model is

$$ y_t = \varphi_0 + \varphi_1y_{t-1} + \varphi_2 y_{t-2} + \cdots + \varphi_p y_{t-p}+\epsilon_t.$$

This model has $p$ lags. We refer to it as an AR(p) model or an autoregressive model of order $p$. 

Determining the number of lags to include is part of specifying the model.
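
To make the equation concrete, here is a minimal sketch that simulates an AR(2) process and then recovers the coefficients by regressing $y_t$ on a constant and its own lags. The parameter values, the seed, and the use of plain least squares are assumptions for illustration; this is not the GDP data and not necessarily how statsmodels estimates the model internally.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative (assumed) AR(2) parameters: phi_0, phi_1, phi_2
phi = np.array([0.5, 0.6, -0.2])
n = 5000

# Simulate y_t = phi_0 + phi_1 y_{t-1} + phi_2 y_{t-2} + eps_t
y = np.zeros(n)
for t in range(2, n):
    y[t] = phi[0] + phi[1] * y[t - 1] + phi[2] * y[t - 2] + rng.normal()

# Regress y_t on a constant, y_{t-1}, and y_{t-2}
X = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])
phi_hat, *_ = np.linalg.lstsq(X, y[2:], rcond=None)

print('true:     ', phi)
print('estimated:', np.round(phi_hat, 3))
```

With a long simulated sample the estimates should land close to the true values. With a short sample they will be noisier.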


Let's start by looking at the data to see if it has an autoregressive property. Is there a relationship between the current value of GDP growth and its past value?

In [None]:
fig, ax = plt.subplots(figsize=(10,10))

# x-axis is last period's growth, y-axis is this period's growth (matches the axis labels below)
ax.scatter(usa['gdp_diff'].shift(1), usa['gdp_diff'], color = 'red')

# plot the 45 degree line
ax.plot([-0.02, 0.045], [-0.02, 0.045], color='black')

ax.text(0.04, 0.037, '45-degree line')

ax.set_ylabel('gdp(t)')
ax.set_xlabel('gdp(t-1)')

sea.despine(ax=ax, trim=True)

plt.show()

There appears to be a positive relationship, but it is noisy. 

### The autocorrelation function

Our plot above tells us about the $t$ and $t-1$ relationship. How about $t$ and $t-2$? $t-5$? We could continue to make the plots above, but there are better ways. 

The autocorrelation function plots $\text{corr}(y_t, y_{t-k})$ for many values of $k$. statsmodels automates this for us with the `.plot_acf()` function [(docs)](https://www.statsmodels.org/dev/generated/statsmodels.graphics.tsaplots.plot_acf.html).  
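
To see what is being plotted, here is a minimal sketch of the sample autocorrelation computed by hand with numpy. The helper name `sample_acf` and the simulated AR(1) series are assumptions for illustration; `plot_acf` also draws confidence bands and may differ in other details.

```python
import numpy as np

def sample_acf(y, nlags):
    """Sample autocorrelation of y at lags 0, 1, ..., nlags."""
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()
    denom = np.sum(dev ** 2)
    n = len(dev)
    # Lag-k autocovariance divided by the lag-0 autocovariance
    return np.array([np.sum(dev[k:] * dev[:n - k]) / denom
                     for k in range(nlags + 1)])

# Simulated AR(1) series with coefficient 0.7 (an assumed value)
rng = np.random.default_rng(0)
n = 5000
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + rng.normal()

acf = sample_acf(y, nlags=4)
print(np.round(acf, 2))
```

For an AR(1) process the autocorrelation at lag $k$ is the coefficient raised to the power $k$, so the values should decay roughly geometrically from 1.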

In [None]:
fig, ax = plt.subplots(figsize=(10,6))

# plot_acf is picky about NaNs, so drop them
# The 'lags' parameter determines how many lags to show
tsaplots.plot_acf(usa['gdp_diff'].dropna(), lags = 8, ax = ax)

ax.set_xlabel('lag = k', fontsize=14)
ax.set_ylabel(r'corr(gdp$_t$, gdp$_{t-k})$', fontsize=14)
sea.despine(ax=ax)

plt.show()

### Estimating the AR(p) model

The AR(p) model in statsmodels works like the other models. You create a model object, then you call the `.fit( )` method to estimate the parameters. 

The only real difference here is that you do not need to write out a string with the model specification. The `.AR( )` class already knows that the model is AR(p). 

In [None]:
# I couldn't get these two methods to work. Need to look into this.
# usa = usa.asfreq('q')  
# usa.index.freq = 'q'

# Need to set the frequency of the data. This is a not-very-elegant way. 
usa = usa.resample('q').mean()

# Construct the model. 
ar_mod = tsa.ar_model.AR(usa['gdp_diff'].dropna(), freq='Q')

type(ar_mod)  # What have we got?

Now we call fit [(docs)](https://www.statsmodels.org/dev/generated/statsmodels.tsa.ar_model.AR.fit.html#statsmodels.tsa.ar_model.AR.fit). 

Here we have to decide how many lags we want. There are two options to deal with. 

* ic is the criterion for choosing how many lags to include. Options are 'aic', 'bic', 'hqic', and 't-stat'. Which one to choose is a technical detail. Let's go with 'aic'. Alternatively, you can choose ic = None.
* maxlag is the maximum number of lags. If ic = None, this is the number of lags. 
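
To give a feel for what an information criterion does, here is a rough sketch that picks the lag length by minimizing the AIC over least-squares fits. The helper names (`fit_ar_ols`, `choose_lags_aic`), the simulated data, and the particular AIC formula are assumptions for illustration; the exact computation in statsmodels may differ.

```python
import numpy as np

def fit_ar_ols(y, p, hold_back):
    """OLS fit of an AR(p) on observations hold_back, ..., n-1.
    Returns the coefficients and the residual variance."""
    target = y[hold_back:]
    cols = [np.ones(len(target))]                      # constant
    for lag in range(1, p + 1):
        cols.append(y[hold_back - lag: len(y) - lag])  # y_{t-lag}
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    return coef, resid @ resid / len(target)

def choose_lags_aic(y, maxlag):
    """Return the lag length with the smallest AIC, plus all AIC values."""
    n_eff = len(y) - maxlag   # same estimation sample for every p
    aic = {}
    for p in range(maxlag + 1):
        _, sigma2 = fit_ar_ols(y, p, hold_back=maxlag)
        aic[p] = n_eff * np.log(sigma2) + 2 * (p + 1)  # fit term + penalty
    return min(aic, key=aic.get), aic

# Simulated AR(2) data with assumed coefficients 0.5 and 0.3
rng = np.random.default_rng(1)
n = 2000
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.normal()

best_p, aic = choose_lags_aic(y, maxlag=6)
print('Lags chosen by AIC:', best_p)
```

Every candidate model is fit on the same observations (the first `maxlag` are held back), so the criteria are comparable across lag lengths. Adding a lag always improves the fit a little, and the penalty term is what stops the criterion from always choosing `maxlag`.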


In [None]:
#res = ar_mod.fit(ic = None, maxlag=3)

res = ar_mod.fit(ic = 'aic')

In [None]:
# The parameters
print(res.params)


In [None]:
# The optimal number of lags is
print('The number of lags is {0}.'.format(res.k_ar))

The results object has lots of stuff in it. Try tab completion to see what's in there. 

Let's plot the fitted values.

In [None]:
fig, ax = plt.subplots(figsize=(10,6))

ax.plot(res.fittedvalues, color='blue', label = 'fitted values')
ax.plot(usa.index, usa.gdp_diff, color='black', alpha = 0.5, label = 'data')

sea.despine(ax=ax)
ax.set_ylabel('growth rate')
ax.set_title('US GDP growth rates')

ax.legend(frameon=False)

plt.show()

The fitted values track the data, but do not generate enough volatility. If we really wanted to get serious about modeling GDP growth, we would want to expand our model to something like an autoregressive-moving-average model.

## Practice

Take a few minutes and try the following. Feel free to chat with those around if you get stuck. The TA and I are here, too.

1. Read in the S\&P 500 price index from FRED. Use November 23, 2013 as the start date.
2. Resample the data to weekly frequency using mean().

3. Plot the weekly average price. Are the data stationary?

4. Compute the log difference of the weekly price data. 
5. Plot the growth rates. Are the data stationary?

6. Plot the autocorrelation function for 20 lags. What lags look meaningful? 

7. Estimate an AR(p) model of the price data. Include 1 lag.

8. Print out the number of lags used and the parameter estimates. 

9. Plot the data and the fitted values. 

10. Re-estimate the model letting the `.fit( )` method choose the number of lags. How many lags were chosen?
11. Print out the number of lags and the parameter estimates. 

\[If you have extra time, try plotting the estimates and their confidence intervals. I used errorbar, but there are many ways.\]

12. Plot the data and the fitted values from the 1-lag model and the 2-lag model. Plot just for the period 1/1/2015 through 1/1/2017.