## QBUS3850 Lab 3 – ARIMA Models

The AutoRegressive Integrated Moving Average (ARIMA) model consists of three major components:
- The AR part indicates that the variable of interest $y_t$ is regressed on its own lagged (i.e., prior $y_{t-1:t-p}$) values.
- The MA part indicates that the regression error is a linear combination of the current error term $\epsilon_t$ and the past error terms $\epsilon_{t-1:t-q}$.
- The I (for "integrated") part indicates that the data values have been replaced with the differences between consecutive values (this differencing may be applied more than once), so that ARIMA can handle non-stationary data. (Note that the AR, MA and ARMA models still require weakly stationary data as input.)

The purpose of each of these features is to make the model fit the data as well as possible. 

This week we focus mainly on implementing the AR model. Mathematically, an AR model of order $p$ is defined as:

$$y_{t} = c + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \phi_{3}y_{t-3} + \dots + \phi_{p}y_{t-p} + \epsilon_{t}$$

$$ y_{t} = c + \sum_{i = 1}^{p} \phi_{i}y_{t-i} + \epsilon_{t}$$

where $c$ is a constant, $\phi_{1}, \dots, \phi_{p}$ are the parameters of the model, and $\epsilon_{t}$ is white noise.

This is equivalent to the backshift representation, where the operator $L$ satisfies $L^{i}y_{t} = y_{t-i}$:

$$ y_{t} = c + \sum_{i = 1}^{p} \phi_{i}L^{i}y_{t} + \epsilon_{t}$$

Moving the summation term to the left-hand side and using polynomial notation, we have:

$$ \epsilon_{t} = y_{t} - \sum_{i = 1}^{p} \phi_{i}L^{i}y_{t} - c$$

$$ \epsilon_{t} = (1 - \sum_{i = 1}^{p} \phi_{i}L^{i})(y_{t} - \mu)$$

where $\mu = c/(1 - \sum_{j=1}^{p} \phi_{j})$ is the unconditional mean of the process.

An AR model can thus be viewed as the output of a filter whose input is white noise.

However, an AR(p) model is weakly stationary only under certain parameter constraints. In general, all roots of the polynomial $1 - \sum_{i=1}^{p} \phi_{i}L^{i}$ must lie outside the unit circle. For example, an AR(1) process is __weakly stationary__ only if $|\phi_{1}| < 1$.

__Question__: Check whether the process $Y_t=0.3Y_{t-1}+0.04Y_{t-2}+\epsilon_t$ is stationary.

- Solution: Solve the polynomial $1-0.3L-0.04L^2=0$ for $L$ as if $L$ were a number. By the quadratic formula (with $a=-0.04$, $b=-0.3$ and constant term $1$) the roots are

$$\begin{aligned}L &= \frac{0.3\pm\sqrt{0.3^2+4\times 0.04\times1}}{2\times (-0.04)}\\ &= \frac{0.3\pm\sqrt{0.25}}{-0.08}\\ &= \frac{0.3\pm 0.5}{-0.08}\end{aligned}$$

The roots are $-10$ and $2.5$. Both lie outside the unit circle, so the process is stationary.
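
The root check above can also be done numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `ar_is_stationary` is hypothetical and not part of the lab code. It builds the polynomial $1-\phi_1 L-\dots-\phi_p L^p$ and tests whether every root has modulus greater than 1.

```python
import numpy as np

def ar_is_stationary(phi):
    """Return (is_stationary, roots) for AR coefficients phi = [phi_1, ..., phi_p].

    Hypothetical helper, for illustration only.
    """
    phi = np.asarray(phi, dtype=float)
    # np.roots expects coefficients from the highest power down:
    # -phi_p L^p - ... - phi_1 L + 1
    coeffs = np.r_[-phi[::-1], 1.0]
    roots = np.roots(coeffs)
    # Stationary when every root lies strictly outside the unit circle
    return bool(np.all(np.abs(roots) > 1.0)), roots

# The process from the question: Y_t = 0.3 Y_{t-1} + 0.04 Y_{t-2} + e_t
stationary, roots = ar_is_stationary([0.3, 0.04])
print(stationary, roots)
```

The same function works for any order $p$, which is convenient once the quadratic formula no longer applies.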

For an AR(2) model it can be shown that stationarity holds under the following constraints:

$$|\phi_{2}| < 1, \quad \phi_{1} + \phi_{2} < 1 \quad \text{and} \quad \phi_{2} - \phi_{1} < 1$$


Selecting a suitable order $p$ can be tricky. We inspect the PACF plot and take the lag at which it cuts off as the order of the AR model. The partial autocorrelation of an AR(p) process equals zero at every lag greater than $p$, while the lags up to $p$ describe the correlation in $y_{1:p+1}$, so the appropriate maximum lag is the one beyond which the partial autocorrelations are all statistically indistinguishable from zero.
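
To see the PACF cut-off in numbers, the sketch below simulates an AR(2) process and estimates partial autocorrelations by successive least-squares fits: the lag-$k$ value is the last coefficient of a regression of $y_t$ on its first $k$ lags. This is an illustration under assumptions, not the routine used by the plotting functions later in the lab; `pacf_ols`, the coefficients 0.5 and 0.3, and the sample size are all chosen here for the example.

```python
import numpy as np

def pacf_ols(y, max_lag):
    """Partial autocorrelations at lags 1..max_lag via successive OLS fits.

    Hypothetical helper: the lag-k value is the last coefficient of a
    regression of y_t on (1, y_{t-1}, ..., y_{t-k}).
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    out = []
    for k in range(1, max_lag + 1):
        # Design matrix: intercept plus the k lagged copies of y
        lags = [y[k - j:n - j] for j in range(1, k + 1)]
        X = np.column_stack([np.ones(n - k)] + lags)
        beta, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
        out.append(beta[-1])
    return np.array(out)

# Simulate an AR(2) process (coefficients assumed for this example)
rng = np.random.default_rng(0)
phi1, phi2, n = 0.5, 0.3, 5000
e = rng.standard_normal(n)
y_ar2 = np.zeros(n)
for t in range(2, n):
    y_ar2[t] = phi1 * y_ar2[t - 1] + phi2 * y_ar2[t - 2] + e[t]

pacf_vals = pacf_ols(y_ar2, max_lag=6)
band = 1.96 / np.sqrt(n)  # approximate 95% band around zero
print(np.round(pacf_vals, 3), round(band, 3))
```

The values at lags 1 and 2 should stand well clear of the band, while those from lag 3 onward should be close to zero, which is the pattern to look for when reading a PACF plot.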

In [None]:
import matplotlib.pyplot as plt
import statsmodels as sm 
import statsmodels.api as smt
import numpy as np
import pandas as pd


## 1. AR(1) Process

In [None]:
# Setting up the random generator 
np.random.seed(0x123abc)

# Y_t = 0.7 * Y_{t-1} + e_t          
arparams = np.array([0.7])
zero_lag = np.array([1])

ar = np.r_[1, -arparams]  # add zero-lag (coefficient 1) and negate
c = 0
sigma2 = 1
 
# TODO: generate 2000 samples from the arma model in tsa library, and visualise it
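
For intuition about what such a generator produces, the AR(1) recursion can be written out by hand. This is a NumPy-only sketch and not the `tsa` call the TODO asks for; the function `simulate_ar1`, the burn-in length, the seed and the variable name `y_sim` are assumptions made for this example.

```python
import numpy as np

def simulate_ar1(phi, n, c=0.0, sigma=1.0, burnin=200, seed=0):
    """Simulate y_t = c + phi * y_{t-1} + e_t with Gaussian white noise e_t.

    Hypothetical helper, for illustration only.
    """
    rng = np.random.default_rng(seed)
    e = sigma * rng.standard_normal(n + burnin)
    y = np.zeros(n + burnin)
    for t in range(1, n + burnin):
        y[t] = c + phi * y[t - 1] + e[t]
    return y[burnin:]  # discard the start-up transient

y_sim = simulate_ar1(phi=0.7, n=20000)

# Sample moments next to the theoretical mean c/(1-phi) and variance sigma^2/(1-phi^2)
print(y_sim.mean(), 0.0)
print(y_sim.var(), 1.0 / (1.0 - 0.7 ** 2))
```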

In [None]:
# Examine the ACF and PACF plots
lags = 30
alpha = 0.05

# TODO: plot acf and pacf using smt.graphics

In [None]:
# Pandas can also produce ACF (but not PACF)
from pandas.plotting import autocorrelation_plot

# TODO: plot acf using pandas

In [None]:
# Calculate Mean 
# Since c = 0 then this result will be 0 anyway
y_uncond_mean = c / (1 - arparams[0])
print(y_uncond_mean)

diff_y = pd.Series.diff(pd.Series(y)).dropna()

sample_mean = np.mean(y)
print(sample_mean)

In [None]:
# Calculate Variance
# arma_generate_sample uses np.random.randn to generate epsilon
# We know then that by default 
# var(epsilon) = 1 (or sigma^2 = 1)
# mean(epsilon) = 0

y_uncond_var = sigma2 / (1 - np.power(arparams[0],2))
print(y_uncond_var)

sample_var = np.var(y)
print(sample_var)
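
The expression used for `y_uncond_var` follows from taking variances on both sides of the AR(1) equation, assuming the process is stationary (so that $\mathrm{Var}(y_t) = \mathrm{Var}(y_{t-1}) = \gamma_0$) and that $\epsilon_t$ is uncorrelated with $y_{t-1}$:

$$\gamma_{0} = \phi_{1}^{2}\gamma_{0} + \sigma^{2} \quad\Longrightarrow\quad \gamma_{0} = \frac{\sigma^{2}}{1-\phi_{1}^{2}}$$

With $\phi_{1} = 0.7$ and $\sigma^{2} = 1$ this gives $1/0.51 \approx 1.96$, which is the value the sample variance should be close to.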


## 2. Fit AR(1) to synthesized data

Now let’s try and fit a model to these samples and compare the parameters.

In [None]:
## y was generated from the process Y_t = 0.7 * Y_{t-1} + e_t  

# TODO: fit an ARMA model via sm.tsa.arima_model.ARMA

#### Question:
 - What does trend='nc' mean?
 - Can you interpret the result of trend='c'?
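
As a cross-check on whatever the library fit reports, the AR(1) parameters can also be estimated by ordinary least squares, regressing $y_t$ on $y_{t-1}$. This swaps in plain least squares for the library's estimator, so the numbers should be close but need not match exactly. The function `fit_ar1_ols` and its `include_constant` flag are hypothetical names, and the series is simulated here so the block runs on its own.

```python
import numpy as np

def fit_ar1_ols(y, include_constant=True):
    """Least-squares estimate of (c, phi_1, sigma^2) for an AR(1) model.

    Hypothetical helper, not the statsmodels estimator.
    """
    y = np.asarray(y, dtype=float)
    target, lagged = y[1:], y[:-1]
    if include_constant:
        X = np.column_stack([np.ones(len(lagged)), lagged])
    else:
        X = lagged[:, None]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    # Residual variance with a degrees-of-freedom correction
    sigma2_hat = resid @ resid / (len(target) - X.shape[1])
    c_hat = beta[0] if include_constant else 0.0
    return c_hat, beta[-1], sigma2_hat

# Simulated AR(1) data with phi = 0.7 (stand-in for the lab's y)
rng = np.random.default_rng(1)
e = rng.standard_normal(2000)
y_demo = np.zeros(2000)
for t in range(1, 2000):
    y_demo[t] = 0.7 * y_demo[t - 1] + e[t]

c_hat, phi_hat, sigma2_hat = fit_ar1_ols(y_demo, include_constant=False)
print(c_hat, phi_hat, sigma2_hat)
```

Running it once with and once without the constant shows how little the slope estimate changes when the true constant is zero.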

In [None]:
forecast = fit1.predict( start=0, end=len(y) + 10 )

## Plot actual vs fitted + forecasts.
plt.figure()
plt.plot( forecast )
plt.plot(y)
plt.title('Forecasts', weight = 'bold')
plt.xlim(len(y)-100,len(y)+10)
plt.show()

# Quick plot actual vs fitted + forecasts with 95% confidence intervals. 
fit1.plot_predict( start=len(y)-100, end=len(y)+10, alpha=0.05 )
print()
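
When reading the out-of-sample part of these plots, it helps to know the shape an AR(1) forecast takes. Assuming the parameters are known (in practice the fitted values are plugged in), iterating the AR(1) equation $h$ steps ahead from the last observation $y_T$ gives

$$\hat{y}_{T+h\mid T} = \mu + \phi_{1}^{h}\,(y_{T} - \mu), \qquad \mu = \frac{c}{1-\phi_{1}}$$

Because $|\phi_{1}| < 1$, the forecasts decay geometrically towards the unconditional mean $\mu$ as $h$ grows.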

In [None]:
p,q = 1,0

from statsmodels.tsa.arima_model import ARIMA

# TODO: fit an ARIMA model using order (p, 0, q)

## 3. Fit AR(p) to real data

In [None]:
data = pd.read_csv('data.csv')
data = data['Data']

#data = pd.read_csv('AirPassengers.csv')
#data = np.log( data['Passengers'] )

plt.figure()
plt.plot(data)

In [None]:
# We calculate difference series, data[1]-data[0], data[2]-data[1],...
# TODO: see pd.Series.diff

# Checking the first entry in diff_data. Why is it a nan? What should we do about it?
# TODO: handle na

# TODO: Plot the differenced data
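
The relationship between a series and its first difference can be seen on a toy example. This sketch uses NumPy's `np.diff` rather than the pandas method named in the TODO, and the values in `x` are made up for illustration. Note that `np.diff` returns one fewer element than its input.

```python
import numpy as np

x = np.array([3.0, 5.0, 4.0, 8.0, 9.0])  # made-up values for illustration

d = np.diff(x)  # x[1]-x[0], x[2]-x[1], ...

# Undo the differencing: start from x[0] and accumulate the changes
x_rebuilt = np.r_[x[0], x[0] + np.cumsum(d)]

print(d)
print(x_rebuilt)
```

The cumulative-sum step is the same idea used later in the lab to move predictions for the differenced series back to the scale of the original data.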

In [None]:
# TODO: Plot the ACF/PACF for the data using lag=30

# How to plot 95% confidence intervals?
# Note that the standard deviation is computed according to 
# Bartlett’s formula.
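
The Bartlett bands mentioned above can be computed by hand. Under Bartlett's formula the standard error of the lag-$k$ sample autocorrelation $r_k$ is $\sqrt{(1 + 2\sum_{j=1}^{k-1} r_j^2)/n}$. The sketch below is a NumPy-only illustration; the helpers `acf` and `bartlett_se` and the simulated series are assumptions made for this example, not the library's implementation.

```python
import numpy as np

def acf(y, nlags):
    """Sample autocorrelations r_0..r_nlags (r_0 = 1). Hypothetical helper."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = y @ y
    return np.array([1.0] + [y[k:] @ y[:-k] / denom for k in range(1, nlags + 1)])

def bartlett_se(r, n):
    """Standard errors of r_1..r_K from Bartlett's formula.

    se(r_k)^2 = (1 + 2 * sum_{j=1}^{k-1} r_j^2) / n
    """
    cum = np.cumsum(r[1:-1] ** 2)          # partial sums of r_1^2 .. r_{K-1}^2
    var = np.r_[1.0, 1.0 + 2.0 * cum] / n  # lag 1 uses an empty sum
    return np.sqrt(var)

# Simulated AR(1) series, used only so that the block runs on its own
rng = np.random.default_rng(2)
n_obs = 2000
e = rng.standard_normal(n_obs)
y_demo = np.zeros(n_obs)
for t in range(1, n_obs):
    y_demo[t] = 0.7 * y_demo[t - 1] + e[t]

r = acf(y_demo, nlags=30)
se = bartlett_se(r, n_obs)
upper, lower = 1.96 * se, -1.96 * se  # approximate 95% band around zero
print(np.round(r[1:6], 3), np.round(upper[:5], 3))
```

The band widens with the lag because each additional squared autocorrelation adds to the variance.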


In [None]:
# TODO: Compare it against differenced time series

#### Questions:
 - What order AR model is appropriate?
 - Fit an AR model to the differenced data.
 - Plot the differenced data against the AR model predictions.
 - Plot the original data against the AR model predictions.
 - Re-do these questions with Air Passenger data.

In [None]:
# TODO: fit an AR model on the differenced data using order (1,0) and plot the result


In [None]:
forecast = fit2.predict(start=0, end=len(diff_data) + 10 )

plt.figure()
plt.plot( data[0] + np.cumsum( forecast )  / fit2.params[0] )
plt.plot( data[0] + np.cumsum( diff_data ) )
plt.title('Forecasts', weight = 'bold')
plt.xlim( 0, len(diff_data) + 10 )
plt.show()

In [None]:
p,q = 1,0

from statsmodels.tsa.arima_model import ARIMA
# TODO: train an ARIMA model with order (p, 1, q), and plot the result