# ARIMA
### Jonathan Balaban
In regression and classification, we use features (collected during a cross-sectional study, survey, or measurement) to predict an outcome. The model and its parameters represent part of the underlying relationship between features and outcome. But what if we run out of funds to collect cross-sectional data, or need to predict future outcomes for which the features don't exist yet?

Enter [Autoregressive Integrated Moving Average (ARIMA)](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average) modeling. When outcomes are autocorrelated with their own earlier values, the outcome plot shows a pattern. That pattern can be modeled directly, letting us predict the future with a confidence level commensurate with the strength of the relationship and the proximity to known values (predictions weaken the further out we go).

![](assets/nonstationary.gif)

Notice that time series data has four general characteristics:
- Trend: growth or decay over time
- Cycles: repeating patterns
- Seasonality: repeating patterns in relation to human time blocks
    - 3 days, 1 week, 3 months (quarterly), 8 years, etc.
- Random walk: unaccountable variation due to nature
    - If a series is mostly random, the average may be the best prediction
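
As a rough illustration of how these components combine, the sketch below builds a synthetic daily series from a trend, a weekly pattern, and random noise. All values are hypothetical and chosen for illustration; this is not the Rossmann data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n_days = 7 * 16                 # 16 weeks of hypothetical daily data
t = np.arange(n_days)

trend = 0.5 * t                                # steady growth over time
seasonality = 10 * np.sin(2 * np.pi * t / 7)   # pattern repeating every 7 days
noise = rng.normal(scale=2.0, size=n_days)     # unexplained random variation

series = 100 + trend + seasonality + noise

# Averaging over each full week cancels the weekly pattern,
# so the weekly means mostly reflect the trend (about 0.5 * 7 = 3.5 per week).
weekly_means = series.reshape(-1, 7).mean(axis=1)
print(np.diff(weekly_means).round(1))
```

Averaging over whole cycles is one simple way to separate a trend from a repeating pattern before reaching for a model.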

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
#set figsize
plt.rcParams['figure.figsize'] = (20.0, 6.0)

data = pd.read_csv('data/rossmann.csv', skipinitialspace=True, low_memory=True)

# set the DateTime index




In [2]:
# Filter to Store 1

# Filter to open days

# Plot the sales over time


In [3]:
# print Sales autocorrelation for k=1,2


In [4]:
# create autocorr plot
from pandas.plotting import autocorrelation_plot


In [5]:
# plot autocorr with statsmodels
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf


Here we observe autocorrelation at 10 lag values. The values at lags 1 and 2 are the ones we computed before. They show a small, limited dependence on the last few values, suggesting that an autoregressive model might be useful.

Check: We also observe a larger spike at 7 - what does that mean?

That's the number of days in a week!

If we observed a handful of randomly distributed spikes, that would imply an MA model may be useful. Such spikes suggest that at some point in time something changed in the world, and all values shifted up or down from there in a fixed way.

That may be the case here, but if we expand the window we can see that the spikes occur regularly at 7-day intervals. This means we have a weekly cycle!
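
A small sketch of what a weekly cycle looks like in the autocorrelation values, using pandas' `Series.autocorr`. The series here is simulated with a made-up weekday effect (an assumption for illustration), not taken from the Rossmann data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_days = 7 * 52
t = np.arange(n_days)

# Hypothetical daily sales: a fixed weekday effect plus noise
weekday_effect = np.array([0, 250, 500, 750, 1000, 2000, 3000])[t % 7]
sales = pd.Series(4000 + weekday_effect + rng.normal(scale=300, size=n_days))

# Autocorrelation at lags 1..14; lags 7 and 14 should stand out
lag_corr = {k: sales.autocorr(lag=k) for k in range(1, 15)}
for k, r in lag_corr.items():
    print(f"lag {k:2d}: {r:+.2f}")
```

Spikes that recur at every multiple of 7 point to a weekly cycle, rather than the isolated, irregular spikes that would suggest an MA term.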

In [6]:
# plot ACF and PACF


## ARMA

In [7]:
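# build arma10 model
# NOTE: ARMA was removed in statsmodels 0.13; on newer versions the equivalent
# is statsmodels.tsa.arima.model.ARIMA with order=(p, 0, q)
from statsmodels.tsa.arima_model import ARMA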


In [8]:
# plot residuals


In [9]:
# plot acf


In [10]:
# plot actual vs. predicted


By passing (1, 0) as the second argument, we fit an ARMA model with p=1 and q=0. Remember, an ARMA(p, q) model is AR(p) + MA(q). This means that an ARMA(1, 0) is the same as an AR(1) model.
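
Written out, one common form of the ARMA(p, q) model is shown below. The symbols are notation introduced here, not taken from the notebook: $c$ is the intercept, $\phi_i$ are the AR coefficients, $\theta_j$ are the MA coefficients, and $\varepsilon_t$ is white noise.

$$
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
$$

With $p = 1$ and $q = 0$ the MA sum vanishes, leaving the AR(1) model:

$$
y_t = c + \phi_1 \, y_{t-1} + \varepsilon_t
$$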

In this AR(1) model we learn an intercept, or base sales value. We also learn a coefficient that tells us how much of the last sales value to include. In this case, we take the intercept of ~4700 and add the previous observation's sales (here, the previous open day) * 0.68.

Note that the coefficient here does not match the lag 1 autocorrelation, suggesting that the data is not stationary.
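
To make the AR(1) arithmetic concrete, here is a minimal sketch on simulated data (not the Rossmann sales). It assumes a stationary process built with an intercept of 4700 and a coefficient of 0.68, values chosen only to echo the numbers above, and it fits the model by ordinary least squares with NumPy rather than with statsmodels. For a stationary AR(1) series like this one, the fitted coefficient and the lag-1 autocorrelation come out nearly equal, which is why a mismatch hints at nonstationarity.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a stationary AR(1) process: y_t = c + phi * y_{t-1} + noise
c_true, phi_true, n = 4700.0, 0.68, 2000
y = np.empty(n)
y[0] = c_true / (1 - phi_true)  # start at the long-run mean
for t in range(1, n):
    y[t] = c_true + phi_true * y[t - 1] + rng.normal(scale=500.0)

# Fit by ordinary least squares: regress y_t on [1, y_{t-1}]
X = np.column_stack([np.ones(n - 1), y[:-1]])
c_hat, phi_hat = np.linalg.lstsq(X, y[1:], rcond=None)[0]

# Lag-1 autocorrelation of the same series
lag1_autocorr = np.corrcoef(y[:-1], y[1:])[0, 1]

print(f"intercept ~ {c_hat:.0f}, coefficient ~ {phi_hat:.2f}")
print(f"lag-1 autocorrelation ~ {lag1_autocorr:.2f}")

# One-step-ahead forecast: intercept plus coefficient times the last value
next_value = c_hat + phi_hat * y[-1]
```

Note that this sketch treats the constant as a true regression intercept; a library may instead report the series mean as its constant, so it is worth checking how a given implementation defines it.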

We can learn an AR(2) model, which regresses each sales value on the last two, with the following:

In [11]:
# arma20


In [12]:
# plot residuals


In [13]:
# plot acf


In [14]:
# plot actual vs. predicted


In [15]:
# model ARMA11


In [16]:
# plot residuals


In [17]:
# plot acf


In [18]:
# plot actual vs. predicted


## ARIMA

In [19]:
# build arima202


In [20]:
# create ARIMA210 model


In [21]:
# calculate autocorr of differenced sales data


In [22]:
# plot differenced sales data


In [23]:
# plot_predict


In [24]:
# plot residuals


In [25]:
# plot acf


In [26]:
# plot actual vs. predicted


In [27]:
# arima712


What would the effect on our model be if we were to increase the p, q, and d terms?

- Increasing p would extend the dependency on previous values to longer lags, but this isn't necessary past a given point.
- Increasing q would increase the dependence on unexpected jumps at a handful of points, but we did not observe those in our autocorrelation plot.
- Increasing d would apply more differencing, but with d=1 we already saw a move towards stationarity (except in a few problematic regions). Increasing to 2 may be useful if we saw an accelerating (for example, quadratic) trend, but we did not here.
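
The effect of the d term can be sketched with `numpy.diff` on two made-up trends. The quadratic series is a simple stand-in for an accelerating trend (an assumption for illustration): one round of differencing flattens a linear trend, while the curved one needs two.

```python
import numpy as np

t = np.arange(50, dtype=float)

linear = 3.0 * t + 10.0        # linear trend
quadratic = 0.5 * t**2 + 10.0  # curved (accelerating) trend

d1_linear = np.diff(linear, n=1)        # constant: the trend is gone
d1_quadratic = np.diff(quadratic, n=1)  # still trending upward
d2_quadratic = np.diff(quadratic, n=2)  # constant after a second difference

print(d1_linear[:3], d1_quadratic[:3], d2_quadratic[:3])
```

Each extra difference also shortens the series by one observation and can amplify noise, which is one reason not to raise d beyond what the trend requires.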

There are variants of ARIMA that handle the seasonal aspect better, known as Seasonal ARIMA. In short, these models fit two ARIMA models, one on the daily frequency and another on the seasonal frequency (weekly, monthly, or yearly, whichever the pattern may be). We will revisit this topic later as we discuss ways to further tune the ARIMA model to the dataset provided.