# ARIMA

AR (autoregressive) - indicates that the evolving variable of interest is regressed on its own lagged (prior) values <br>
MA (moving average) - uses the dependency between an observation and a residual error from a moving average model applied to the lagged observations<br>
I (integrated) - indicates that the data values have been replaced with the difference between each value and the previous value. The order of integration is the number of times the data must be differenced to become stationary, so that the AR and MA components can be applied<br>
<br>
Non-seasonal ARIMA models are denoted as ARIMA(p,d,q) where parameters p, d and q are non-negative integers
* p - The number of lag observations included in the model
* d - The number of times the raw observations are differenced
* q - The size of the moving average window, also called the order of the moving average
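
To make the role of d concrete, here is a minimal sketch of differencing using NumPy's `np.diff`. The two toy series (`linear`, `quadratic`) are made up for illustration and are not part of the datasets used in this notebook. A linear trend becomes constant after one difference, while a quadratic trend needs two.

```python
import numpy as np

t = np.arange(10)

# Toy series with a linear trend: one round of differencing (d=1) removes it.
linear = 5 + 2 * t
linear_d1 = np.diff(linear, n=1)

# Toy series with a quadratic trend: one difference still trends, two (d=2) do not.
quadratic = t ** 2
quadratic_d1 = np.diff(quadratic, n=1)
quadratic_d2 = np.diff(quadratic, n=2)

print("linear, d=1:   ", linear_d1)
print("quadratic, d=1:", quadratic_d1)
print("quadratic, d=2:", quadratic_d2)
```

Each round of differencing shortens the series by one observation, which is one reason to keep d as small as possible.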

## Choosing ARIMA Orders

Selecting p, d and q 
* If the autocorrelation plot shows positive autocorrelation at the first lag (lag 1), it suggests using AR terms
* If the autocorrelation plot shows negative autocorrelation at the first lag, it suggests using MA terms
<br>
Using an AR-k model - typically when the partial autocorrelation plot shows a sharp drop after lag 'k'<br>
Using an MA model - typically when the partial autocorrelation plot shows a gradual decline instead<br>

**Identification of an AR model is often best done with the PACF**<br>
**Identification of an MA model is often best done with the ACF rather than the PACF**<br>
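
The rules of thumb above can be checked on simulated data. The sketch below is illustrative only: `sample_acf` and `sample_pacf` are hypothetical helper names written with plain NumPy (the PACF is obtained by solving the Yule-Walker equations), and the AR(1) and MA(1) coefficients are arbitrary choices rather than values from this notebook. In practice a library routine such as the ACF/PACF plots in statsmodels would normally be used instead.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom
                     for k in range(nlags + 1)])

def sample_pacf(x, nlags):
    """Sample partial autocorrelation via the Yule-Walker equations."""
    r = sample_acf(x, nlags)
    out = [1.0]
    for k in range(1, nlags + 1):
        # k x k Toeplitz matrix of autocorrelations r[|i - j|]
        R = np.array([[r[abs(i - j)] for j in range(k)] for i in range(k)])
        phi = np.linalg.solve(R, r[1 : k + 1])
        out.append(phi[-1])  # last coefficient of the fitted AR(k)
    return np.array(out)

rng = np.random.default_rng(0)
n = 2000
e = rng.standard_normal(n)

# Simulated AR(1): x[t] = 0.7 * x[t-1] + e[t]
ar = np.zeros(n)
for t in range(1, n):
    ar[t] = 0.7 * ar[t - 1] + e[t]

# Simulated MA(1): y[t] = e[t] - 0.8 * e[t-1]
ma = e.copy()
ma[1:] -= 0.8 * e[:-1]

acf_ar, pacf_ar = sample_acf(ar, 5), sample_pacf(ar, 5)
acf_ma, pacf_ma = sample_acf(ma, 5), sample_pacf(ma, 5)

print("AR(1) ACF :", np.round(acf_ar, 2))   # positive at lag 1, decays gradually
print("AR(1) PACF:", np.round(pacf_ar, 2))  # sharp drop after lag 1
print("MA(1) ACF :", np.round(acf_ma, 2))   # negative at lag 1, sharp drop after it
print("MA(1) PACF:", np.round(pacf_ma, 2))  # declines gradually
```

The AR series shows the sharp PACF cut-off after lag 1, while the MA series shows the cut-off in the ACF and a gradual decline in the PACF.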

### Using Grid Search to identify p, d, q and P, D, Q

The Pyramid ARIMA (pmdarima) library is designed to perform grid searches across multiple combinations of p, d, q and P, D, Q <br>
pmdarima uses the Akaike information criterion (AIC), which penalizes more complicated models, as a metric to compare the performance of various ARIMA-based models; a lower AIC is better <br>
* k - number of estimated parameters
* $\hat{L}$ - maximum value of the likelihood function for the model

$\mathrm{AIC} = 2k - 2\ln(\hat{L})$
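
As a worked check of the formula, the sketch below recomputes the AIC from the log-likelihood values reported in the two model summaries later in this notebook. The helper name `aic` is made up for illustration. It assumes that k counts every row of the coefficient table, including `sigma2`; that assumption is consistent with the reported numbers.

```python
def aic(k, log_likelihood):
    """AIC = 2k - 2 ln(L-hat); log_likelihood is ln(L-hat)."""
    return 2 * k - 2 * log_likelihood

# Female births model: parameters ar.L1, ma.L1, sigma2 -> k = 3
births_aic = aic(k=3, log_likelihood=-1226.537)

# Airline passengers model: parameters ma.L1, ar.S.L12, ar.S.L24, sigma2 -> k = 4
airline_aic = aic(k=4, log_likelihood=-505.589)

print(round(births_aic, 3), round(airline_aic, 3))
```

Both results agree with the AIC values shown in the summaries (2459.074 and 1019.178).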

In [1]:
import pandas as pd
import numpy as np  
%matplotlib inline
import matplotlib.pyplot as plt

# load non-stationary dataset
df1 = pd.read_csv('/Users/tanojudawattage/1_tanoj/0.00_Cloud_Computing_and_Streaming_Tech/Python_for_Time_Series_Files_JosePortilla/Data/airline_passengers.csv', parse_dates= True, index_col='Month')
df1.index.freq = 'MS'

# load stationary dataset
df2 = pd.read_csv('/Users/tanojudawattage/1_tanoj/0.00_Cloud_Computing_and_Streaming_Tech/Python_for_Time_Series_Files_JosePortilla/Data/DailyTotalFemaleBirths.csv', parse_dates= True, index_col='Date')
df2.index.freq = 'D'

Install the pmdarima package from the command line <br>
pip install pmdarima

In [2]:
from pmdarima import auto_arima

In [3]:
import warnings
warnings.filterwarnings("ignore")

In [4]:
help(auto_arima)

Help on function auto_arima in module pmdarima.arima.auto:

auto_arima(
    y,
    X=None,
    start_p=2,
    d=None,
    start_q=2,
    max_p=5,
    max_d=2,
    max_q=5,
    start_P=1,
    D=None,
    start_Q=1,
    max_P=2,
    max_D=1,
    max_Q=2,
    max_order=5,
    m=1,
    seasonal=True,
    stationary=False,
    information_criterion='aic',
    alpha=0.05,
    test='kpss',
    seasonal_test='ocsb',
    stepwise=True,
    n_jobs=1,
    start_params=None,
    trend=None,
    method='lbfgs',
    maxiter=50,
    offset_test_args=None,
    seasonal_test_args=None,
    error_action='trace',
    trace=False,
    random=False,
    random_state=None,
    n_fits=10,
    return_valid_fits=False,
    out_of_sample_size=0,
    scoring='mse',
    scoring_args=None,
    with_intercept='auto',
    sarimax_kwargs=None,
    **fit_args
)
    Automatically discover the optimal order for an ARIMA model.

    The auto-ARIMA process seeks to identify the most optimal
    parameters for an ``ARIMA``

### Evaluate p and q for Female Births Dataset

In [5]:
stepwise_fit = auto_arima(df2['Births'], 
                            start_p=0, start_q=0,
                            max_p=6, max_q=3,
                            seasonal=False,
                            trace=True) 

Performing stepwise search to minimize aic
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=2650.760, Time=0.05 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=2565.234, Time=0.04 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=2463.584, Time=0.06 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=2648.768, Time=0.01 sec
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=2460.154, Time=0.10 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept   : AIC=2461.271, Time=0.15 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=0.30 sec
 ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=2460.722, Time=0.12 sec
 ARIMA(2,1,0)(0,0,0)[0] intercept   : AIC=2536.154, Time=0.07 sec
 ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=2462.899, Time=0.38 sec
 ARIMA(1,1,1)(0,0,0)[0]             : AIC=2459.074, Time=0.04 sec
 ARIMA(0,1,1)(0,0,0)[0]             : AIC=2462.221, Time=0.02 sec
 ARIMA(1,1,0)(0,0,0)[0]             : AIC=2563.261, Time=0.02 sec
 ARIMA(2,1,1)(0,0,0)[0]             : AIC=2460.367, Time=0.06 sec
 ARIMA(1,1,2)(0,0,0)[0]             : 
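
The selection rule behind the trace is simply "keep the candidate with the lowest AIC". The sketch below applies that rule to the complete lines of the trace above; the final, truncated line is left out because its AIC is not shown. The dictionary layout and variable names are assumptions made for illustration, not pmdarima internals.

```python
import math

# ((p, d, q), with_intercept) -> AIC, copied from the complete trace lines.
trace = {
    ((0, 1, 0), True): 2650.760,
    ((1, 1, 0), True): 2565.234,
    ((0, 1, 1), True): 2463.584,
    ((0, 1, 0), False): 2648.768,
    ((1, 1, 1), True): 2460.154,
    ((2, 1, 1), True): 2461.271,
    ((1, 1, 2), True): math.inf,   # reported as AIC=inf in the trace
    ((0, 1, 2), True): 2460.722,
    ((2, 1, 0), True): 2536.154,
    ((2, 1, 2), True): 2462.899,
    ((1, 1, 1), False): 2459.074,
    ((0, 1, 1), False): 2462.221,
    ((1, 1, 0), False): 2563.261,
    ((2, 1, 1), False): 2460.367,
}

# Pick the candidate with the smallest AIC; inf entries can never win.
best_key = min(trace, key=trace.get)
best_order, best_intercept = best_key
print(best_order, "intercept" if best_intercept else "no intercept", trace[best_key])
```

Among the listed candidates the minimum is ARIMA(1,1,1) without an intercept, which is the model reported by `stepwise_fit.summary()`.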

In [6]:
# summary of best performing model
stepwise_fit.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 365.0 |
| Model: | SARIMAX(1, 1, 1) | Log Likelihood | -1226.537 |
| Date: | Fri, 05 Dec 2025 | AIC | 2459.074 |
| Time: | 13:28:45 | BIC | 2470.766 |
| Sample: | 01-01-1959 | HQIC | 2463.721 |
| | - 12-31-1959 | | |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ar.L1 | 0.1252 | 0.060 | 2.097 | 0.036 | 0.008 | 0.242 |
| ma.L1 | -0.9624 | 0.017 | -56.429 | 0.000 | -0.996 | -0.929 |
| sigma2 | 49.1512 | 3.250 | 15.122 | 0.000 | 42.781 | 55.522 |

| | | | |
|---|---|---|---|
| Ljung-Box (L1) (Q): | 0.04 | Jarque-Bera (JB): | 25.33 |
| Prob(Q): | 0.84 | Prob(JB): | 0.0 |
| Heteroskedasticity (H): | 0.96 | Skew: | 0.57 |
| Prob(H) (two-sided): | 0.81 | Kurtosis: | 3.6 |


### Evaluate p and q for Airline Passengers Dataset

In [7]:
stepwise_fit = auto_arima(df1['Thousands of Passengers'], 
                            start_p=0, start_q=0,
                            max_p=4, max_q=4,
                            seasonal=True, m=12,
                            trace=True)

Performing stepwise search to minimize aic
 ARIMA(0,1,0)(1,1,1)[12]             : AIC=1032.128, Time=0.16 sec
 ARIMA(0,1,0)(0,1,0)[12]             : AIC=1031.508, Time=0.02 sec
 ARIMA(1,1,0)(1,1,0)[12]             : AIC=1020.393, Time=0.08 sec
 ARIMA(0,1,1)(0,1,1)[12]             : AIC=1021.003, Time=0.11 sec
 ARIMA(1,1,0)(0,1,0)[12]             : AIC=1020.393, Time=0.03 sec
 ARIMA(1,1,0)(2,1,0)[12]             : AIC=1019.239, Time=0.19 sec
 ARIMA(1,1,0)(2,1,1)[12]             : AIC=inf, Time=1.08 sec
 ARIMA(1,1,0)(1,1,1)[12]             : AIC=1020.493, Time=0.22 sec
 ARIMA(0,1,0)(2,1,0)[12]             : AIC=1032.120, Time=0.15 sec
 ARIMA(2,1,0)(2,1,0)[12]             : AIC=1021.120, Time=0.23 sec
 ARIMA(1,1,1)(2,1,0)[12]             : AIC=1021.032, Time=0.29 sec
 ARIMA(0,1,1)(2,1,0)[12]             : AIC=1019.178, Time=0.20 sec
 ARIMA(0,1,1)(1,1,0)[12]             : AIC=1020.425, Time=0.08 sec
 ARIMA(0,1,1)(2,1,1)[12]             : AIC=inf, Time=1.00 sec
 ARIMA(0,1,1)(1,1,1)[12]     

In [8]:
stepwise_fit.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 144.0 |
| Model: | SARIMAX(0, 1, 1)x(2, 1, [], 12) | Log Likelihood | -505.589 |
| Date: | Fri, 05 Dec 2025 | AIC | 1019.178 |
| Time: | 13:41:57 | BIC | 1030.679 |
| Sample: | 01-01-1949 | HQIC | 1023.851 |
| | - 12-01-1960 | | |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ma.L1 | -0.3634 | 0.074 | -4.945 | 0.000 | -0.508 | -0.219 |
| ar.S.L12 | -0.1239 | 0.090 | -1.372 | 0.170 | -0.301 | 0.053 |
| ar.S.L24 | 0.1911 | 0.107 | 1.783 | 0.075 | -0.019 | 0.401 |
| sigma2 | 130.4480 | 15.527 | 8.402 | 0.000 | 100.016 | 160.880 |

| | | | |
|---|---|---|---|
| Ljung-Box (L1) (Q): | 0.01 | Jarque-Bera (JB): | 4.59 |
| Prob(Q): | 0.92 | Prob(JB): | 0.1 |
| Heteroskedasticity (H): | 2.7 | Skew: | 0.15 |
| Prob(H) (two-sided): | 0.0 | Kurtosis: | 3.87 |
