# Auto ARIMA model

Automating the process of choosing a well-performing model.


Pros:

1. Saves time

2. Removes ambiguity

3. Reduces the risk of human error

Cons:

1. Blindly putting our faith into one criterion

2. We never really see how well the other models perform

3. Topic expertise is not taken into account

4. Human error is still possible (for example, in the arguments we pass)

In [1]:
import numpy as np
import pandas as pd
import scipy
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from statsmodels.tsa.arima.model import ARIMA
from arch import arch_model
import yfinance
import warnings
warnings.filterwarnings('ignore')
sns.set()

In [17]:
raw_data = yfinance.download(
    tickers='^GSPC ^FTSE ^N225 ^GDAXI',
    start='1994-01-07',
    end='2021-10-27',
    interval='1d',
    group_by='ticker',
    auto_adjust=True,
    threads=True,
)

[*********************100%***********************]  4 of 4 completed


In [3]:
df_comp = raw_data.copy()

In [4]:
df_comp['spx'] = df_comp['^GSPC'].Close[:]
df_comp['dax'] = df_comp['^GDAXI'].Close[:]
df_comp['ftse'] = df_comp['^FTSE'].Close[:]
df_comp['nikkei'] = df_comp['^N225'].Close[:]

In [5]:
df_comp = df_comp.iloc[1:]
del df_comp['^N225']
del df_comp['^GSPC']
del df_comp['^GDAXI']
del df_comp['^FTSE']
df_comp = df_comp.asfreq('b')
df_comp = df_comp.ffill()

In [6]:
# Creating returns
df_comp['ret_spx'] = df_comp.spx.pct_change(1).mul(100)
df_comp['ret_ftse'] = df_comp.ftse.pct_change(1).mul(100)
df_comp['ret_dax'] = df_comp.dax.pct_change(1).mul(100)
df_comp['ret_nikkei'] = df_comp.nikkei.pct_change(1).mul(100)
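
For reference, `pct_change(1).mul(100)` computes the simple one-period return in percent. The symbols $P_t$ (closing price on day $t$) and $r_t$ (return on day $t$) are notation assumed here, not names from the notebook:

$$
r_t = 100 \cdot \frac{P_t - P_{t-1}}{P_{t-1}}
$$

For example, a close of 100 followed by a close of 102 gives $r_t = 2.0$. The first row of each return column has no previous price, so it is NaN, which is why the series are sliced with `[1:]` before fitting.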

In [7]:
# Splitting the data
size = int(len(df_comp) * 0.8)
df, df_test = df_comp.iloc[:size], df_comp.iloc[size:]

In [8]:
# Fitting a model
from pmdarima.arima import auto_arima

In [9]:
model_auto = auto_arima(df.ret_ftse[1:])

In [13]:
model_auto



In [14]:
model_auto.summary()

# Although the summary is labeled SARIMAX, this is a plain ARMA model:
# there is no seasonality, no exogenous variables and no integration (d = 0).
# The selected model is an ARMA(4, 5).



| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 5801.0 |
| Model: | SARIMAX(4, 0, 5) | Log Likelihood | -8972.725 |
| Date: | Sun, 31 Oct 2021 | AIC | 17967.45 |
| Time: | 20:38:23 | BIC | 18040.774 |
| Sample: | 0 | HQIC | 17992.957 |
| | - 5801 | | |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 0.0222 | 0.021 | 1.068 | 0.285 | -0.019 | 0.063 |
| ar.L1 | 0.0031 | 0.080 | 0.039 | 0.969 | -0.154 | 0.161 |
| ar.L2 | -0.5816 | 0.080 | -7.250 | 0.000 | -0.739 | -0.424 |
| ar.L3 | -0.1949 | 0.072 | -2.707 | 0.007 | -0.336 | -0.054 |
| ar.L4 | 0.2816 | 0.078 | 3.615 | 0.000 | 0.129 | 0.434 |
| ma.L1 | -0.0264 | 0.080 | -0.329 | 0.742 | -0.183 | 0.131 |
| ma.L2 | 0.5334 | 0.081 | 6.596 | 0.000 | 0.375 | 0.692 |
| ma.L3 | 0.1113 | 0.070 | 1.586 | 0.113 | -0.026 | 0.249 |
| ma.L4 | -0.2834 | 0.077 | -3.694 | 0.000 | -0.434 | -0.133 |

| | | | |
|---|---|---|---|
| Ljung-Box (Q): | 67.02 | Jarque-Bera (JB): | 7363.47 |
| Prob(Q): | 0.0 | Prob(JB): | 0.0 |
| Heteroskedasticity (H): | 1.34 | Skew: | -0.19 |
| Prob(H) (two-sided): | 0.0 | Kurtosis: | 8.51 |
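
As a reading aid, the ARMA(4, 5) form that the SARIMAX(4, 0, 5) label corresponds to can be written out as below. The notation is an assumption made for this sketch: $r_t$ is the FTSE return, $c$ the intercept, $\varphi_i$ the `ar.L` coefficients, $\theta_j$ the `ma.L` coefficients and $\epsilon_t$ the residual.

$$
r_t = c + \sum_{i=1}^{4} \varphi_i \, r_{t-i} + \sum_{j=1}^{5} \theta_j \, \epsilon_{t-j} + \epsilon_t
$$

With $d = 0$ the series is modeled without differencing, which is why no integration step appears in the equation.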


1. The rules of model selection are 'rules of thumb' rather than 'fixed' rules
2. By default, Auto ARIMA ranks models by a single criterion, the AIC
3. We could easily have overfitted while going through the models in our previous sections
4. The default arguments of the method restrict the number of AR and MA components
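
To make the criterion less of a black box, the sketch below recomputes the AIC, BIC and HQIC of the first summary from its log-likelihood, using the standard textbook definitions. The function name `information_criteria` is hypothetical, and the parameter count of 11 (intercept, 4 AR terms, 5 MA terms, error variance) is an assumption that happens to reproduce the reported AIC.

```python
import math


def information_criteria(log_likelihood, n_params, n_obs):
    """Return (AIC, BIC, HQIC) from the standard definitions.

    Hypothetical helper for illustration; not part of pmdarima or statsmodels.
    """
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * math.log(n_obs) - 2 * log_likelihood
    hqic = 2 * n_params * math.log(math.log(n_obs)) - 2 * log_likelihood
    return aic, bic, hqic


# Log-likelihood and sample size are read from the first summary above.
# n_params = 11 is an assumption: 1 intercept + 4 AR + 5 MA + 1 error variance.
aic, bic, hqic = information_criteria(-8972.725, n_params=11, n_obs=5801)
print(round(aic, 2), round(bic, 2), round(hqic, 2))
```

The three values agree with the AIC, BIC and HQIC in the summary to within rounding. All of them share the same likelihood term and differ only in how heavily they penalize extra parameters, so switching `information_criterion` can change which model wins.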

## Basic Auto ARIMA arguments

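exogenous - outside factors (e.g. other time series); newer pmdarima releases name this argument `X`

m - seasonal cycle length, i.e. the number of periods per season (e.g. 5 for a weekly cycle in business-day data)

max_order - maximum total number of AR and MA terms (p + q + P + Q); only applied when the search is not stepwise, and None removes the cap

max_p - maximum number of AR components

max_q - maximum number of MA components

max_d - maximum order of integration (number of differences)

maxiter - maximum number of iterations allowed for the coefficients to converge (convergence becomes harder as the order increases)

return_valid_fits - if True, return every valid fitted model instead of only the best one

alpha - significance level for the tests; the default is 0.05 (5%), which we should be using most of the time

n_jobs - how many models to fit in parallel (-1 indicates 'as many as possible'); only has an effect when stepwise=False

trend - deterministic trend term: 'c' (constant), 't' (linear in time) or 'ct' (both); usually 'ct'

information_criterion - criterion used to rank the models: 'aic', 'aicc', 'bic', 'hqic', 'oob'

out_of_sample_size - number of observations held out from the end of the series to validate the model selection (pass the entire dataset and set this to 20% of its length); 'oob' scoring relies on it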

Mixing stationary and non-stationary data could lead to misleading results. Make sure the dependent series and the exogenous series are the same type of data (here, all returns).
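
To get a feel for how `max_p`, `max_q` and `max_order` bound the search, the sketch below enumerates the non-seasonal (p, q) pairs an exhaustive search would be allowed to visit. `candidate_orders` is a hypothetical helper written for this illustration, not pmdarima's code: it ignores d and the seasonal orders, and the stepwise search that pmdarima uses by default visits only a subset of these pairs.

```python
from itertools import product


def candidate_orders(max_p, max_q, max_order=None):
    """Hypothetical helper: list the (p, q) pairs allowed by the limits."""
    orders = []
    for p, q in product(range(max_p + 1), range(max_q + 1)):
        # max_order caps the total number of AR and MA terms; None means no cap.
        if max_order is None or p + q <= max_order:
            orders.append((p, q))
    return orders


# Illustrative limits: the cap of 5 is an example value chosen for this sketch.
capped = candidate_orders(max_p=7, max_q=7, max_order=5)
uncapped = candidate_orders(max_p=7, max_q=7, max_order=None)
print(len(capped), len(uncapped))  # prints: 21 64
```

Removing the cap roughly triples the number of candidates in this example, which is one reason to raise `maxiter` with care and to expect longer run times once seasonal orders are added on top.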

In [15]:
model_auto = auto_arima(
    df_comp.ret_ftse[1:],
    exogenous=df_comp[['ret_spx', 'ret_dax', 'ret_nikkei']][1:],
    m=5,
    max_order=None,
    max_p=7, max_q=7, max_d=2,
    max_P=4, max_Q=4, max_D=2,
    maxiter=50,
    alpha=0.05,
    n_jobs=-1,
    trend='ct',
    information_criterion='oob',
    out_of_sample_size=int(len(df_comp) * 0.2),
)

In [16]:
model_auto.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 7252.0 |
| Model: | SARIMAX | Log Likelihood | -7296.567 |
| Date: | Sun, 31 Oct 2021 | AIC | 14605.134 |
| Time: | 21:02:05 | BIC | 14646.469 |
| Sample: | 0 | HQIC | 14619.352 |
| | - 7252 | | |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | -0.0017 | 0.016 | -0.107 | 0.915 | -0.034 | 0.030 |
| drift | -1.947e-06 | 4.26e-06 | -0.457 | 0.648 | -1.03e-05 | 6.4e-06 |
| x1 | 0.0938 | 0.005 | 17.878 | 0.000 | 0.083 | 0.104 |
| x2 | 0.5605 | 0.005 | 120.993 | 0.000 | 0.551 | 0.570 |
| x3 | 0.0732 | 0.004 | 18.541 | 0.000 | 0.066 | 0.081 |
| sigma2 | 0.4636 | 0.004 | 121.640 | 0.000 | 0.456 | 0.471 |

| | | | |
|---|---|---|---|
| Ljung-Box (Q): | 160.88 | Jarque-Bera (JB): | 22076.09 |
| Prob(Q): | 0.0 | Prob(JB): | 0.0 |
| Heteroskedasticity (H): | 0.48 | Skew: | 0.32 |
| Prob(H) (two-sided): | 0.0 | Kurtosis: | 11.52 |
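
One way to read this summary, under two assumptions: the bare "SARIMAX" label is taken to mean that no AR, MA or differencing terms were selected, and x1, x2, x3 are taken to follow the order in which the exogenous columns were passed (`ret_spx`, `ret_dax`, `ret_nikkei`). With $y_t$ as the FTSE return and $t$ as the time index (notation assumed here), the fitted model is then a regression on the other three markets plus a constant and a linear trend:

$$
y_t = -0.0017 - 1.947 \times 10^{-6} \, t + 0.0938 \, x_{1,t} + 0.5605 \, x_{2,t} + 0.0732 \, x_{3,t} + \epsilon_t,
\qquad \operatorname{Var}(\epsilon_t) = 0.4636
$$

The intercept and drift have p-values of 0.915 and 0.648, so neither is significant at the 5% level, while all three exogenous coefficients are.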
