# __SARIMA(p,d,q)(P,D,Q)m__
# __Seasonal Autoregressive Integrated Moving Averages__

While ARIMA accepts the parameters $(p,d,q)$, SARIMA accepts an <i>additional</i> set of parameters $(P,D,Q)m$ that specifically describe the seasonal components of the model. Here $P$, $D$ and $Q$ represent the seasonal autoregressive, differencing and moving average orders, and $m$ represents the number of data points (rows) in each seasonal cycle (e.g. 12 months within one year).
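To make the roles of $D$ and $m$ more concrete, here is a minimal sketch of one round of seasonal differencing ($D=1$, $m=12$) on synthetic data. The helper name `seasonal_difference` and the sine-shaped pattern are illustrative assumptions, not part of statsmodels or of the CO₂ dataset.

```python
import numpy as np

def seasonal_difference(y, m):
    """Return y[t] - y[t-m] for t >= m (one round of seasonal differencing)."""
    y = np.asarray(y, dtype=float)
    return y[m:] - y[:-m]

# Synthetic monthly series: a pattern that repeats exactly every 12 points
m = 12
pattern = np.sin(2 * np.pi * np.arange(m) / m)
y = np.tile(pattern, 4)              # four identical "years"

diffed = seasonal_difference(y, m)
print(len(y), len(diffed))           # the first m points are lost
print(np.allclose(diffed, 0.0))      # a purely periodic pattern is removed
```

Each round of seasonal differencing costs $m$ observations, which is one reason $D$ is usually kept at 0 or 1.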

<b>NOTE:</b> The statsmodels implementation of SARIMA is called SARIMAX. The “X” added to the name means that the function also supports <i>exogenous</i> regressor variables. This is covered in the next notebook.

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline

from statsmodels.tsa.statespace.sarimax import SARIMAX

from statsmodels.graphics.tsaplots import plot_acf,plot_pacf # for determining (p,q) orders manually
from statsmodels.tsa.seasonal import seasonal_decompose      # for ETS plots (seasonal_decompose)
from pmdarima import auto_arima                              # for determining ARIMA orders

import matplotlib.pyplot as plt

#ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")

CO2 PPM - Trends in Atmospheric Carbon Dioxide. Data are sourced from the US Government’s Earth System Research Laboratory, Global Monitoring Division.

In [None]:
df = pd.read_csv('../data/co2_mm_mlo.csv')

### __Inspect the data, create a DatetimeIndex__

In [None]:
df.head()

We need to combine two integer columns (year and month) into a DatetimeIndex. We can do this by passing a dictionary into <tt>pandas.to_datetime()</tt> with year, month and day values.<br>
For more information visit https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
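As a quick check of this approach before applying it to the real file, the following sketch builds a toy frame with integer year and month columns (the values are illustrative) and assembles a datetime column in the same way.

```python
import pandas as pd

# Toy frame with the same kind of integer columns (illustrative values)
toy = pd.DataFrame({'year': [1958, 1958, 1958], 'month': [3, 4, 5]})

# 'year', 'month' and 'day' keys are all required; the scalar day is broadcast to every row
toy['date'] = pd.to_datetime({'year': toy['year'], 'month': toy['month'], 'day': 1})
print(toy)
```

Every row is assigned the first day of its month, which matches the month-start ('MS') frequency used for the index later on.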

In [None]:
# Combine the year and month columns into a single datetime column
df['date']=pd.to_datetime({'year': df['year'], 'month': df['month'], 'day': 1})

In [None]:
# Set "date" to be the index
df.set_index('date',inplace=True)
df.index.freq = 'MS'
df.head()

In [None]:
df.info()

### __Plot data__

In [None]:
plt.style.use('ggplot')

title = 'Monthly Mean CO₂ Levels (ppm) over Mauna Loa, Hawaii'
ylabel='parts per million' #no xlabel necessary

ax = df['interpolated'].plot(figsize=(12, 6),title=title, color='blue')
ax.autoscale(axis='x',tight=True)
ax.set(ylabel=ylabel);

### __Seasonal decompose__

In [None]:
result = seasonal_decompose(df['interpolated'], model='add')
result.plot();

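The seasonal component shows a clear pattern that repeats every 12 months, so seasonality exists and a seasonal model is appropriate.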
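For intuition about what an additive decomposition does, here is a simplified sketch on synthetic data: the trend is estimated with a centred moving average, and the seasonal component is the average of the detrended values at each position in the cycle. This is a textbook-style illustration under those assumptions, not the statsmodels implementation, and every variable name is made up for the example.

```python
import numpy as np

m = 12                                   # points per seasonal cycle
n = 10 * m                               # ten synthetic "years"
t = np.arange(n)

season = np.sin(2 * np.pi * np.arange(m) / m)   # pattern for one cycle
season -= season.mean()                          # force it to average to zero
y = 0.5 * t + season[t % m]              # linear trend + repeating seasonal pattern

# Trend: centred 2x12 moving average (half weight on the two end points)
weights = np.r_[0.5, np.ones(m - 1), 0.5] / m
trend = np.convolve(y, weights, mode='valid')    # covers t = m/2 ... n - m/2 - 1

# Seasonal component: average the detrended values position by position
half = m // 2
detrended = y[half:n - half] - trend
positions = t[half:n - half] % m
seasonal_est = np.array([detrended[positions == k].mean() for k in range(m)])

print(np.round(seasonal_est, 3))
```

Because the synthetic series is exactly "trend plus repeating pattern", the estimated seasonal component recovers the pattern that was put in; real data leave a residual as well.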

### __Run pmdarima.auto_arima to get recommended orders__
Prepare to wait for a while. This function fits SARIMA models for many different order combinations and compares their AIC values to find the best one.

__Note that, depending on the computing power of your computer, you might end up with slightly different SARIMA order recommendations, as your laptop checks a greater or smaller number of order combinations.__
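The comparison itself is simple: each candidate model gets an AIC score, and the lowest score wins. The sketch below uses the standard definition $AIC = 2k - 2\ln L$ with made-up log-likelihoods and parameter counts; the numbers are hypothetical and are not results from the CO₂ data.

```python
# AIC = 2k - 2*ln(L): k = number of estimated parameters, L = maximised likelihood
def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood

# Hypothetical candidates: (order, seasonal_order) -> (log-likelihood, n_params)
candidates = {
    ((0, 1, 1), (1, 0, 1, 12)): (-210.0, 4),
    ((0, 1, 1), (2, 0, 2, 12)): (-205.0, 6),
    ((1, 1, 1), (2, 0, 2, 12)): (-204.8, 7),
}

scores = {orders: aic(ll, k) for orders, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)

for orders, score in scores.items():
    print(orders, round(score, 1))
print('lowest AIC:', best)
```

Note how the third candidate fits slightly better (higher log-likelihood) but still loses, because the AIC penalises its extra parameter.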

In [None]:
# For SARIMA Orders we set seasonal=True and pass in an m value
auto_arima(df['interpolated'],seasonal=True, m=12).summary()

This provides an ARIMA order of (0,1,1) combined with a seasonal order of (2,0,2,12). The next step is to train and test the SARIMA(0,1,1)(2,0,2,12) model, evaluate it, and then produce a forecast of future values.

### __Train / test split__

In [None]:
len(df)

In [None]:
#we set one year for testing
train = df.iloc[:717]
test = df.iloc[717:]
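As an alternative to a hard-coded row number, the split point can be derived from the length of the data. The sketch below holds out the last 12 rows of a toy monthly frame; the frame, its `value` column and the names `train_toy` / `test_toy` are illustrative assumptions.

```python
import pandas as pd

# Toy monthly data: three years of month-start timestamps (illustrative values)
idx = pd.date_range('2000-01-01', periods=36, freq='MS')
toy = pd.DataFrame({'value': range(36)}, index=idx)

n_test = 12                          # hold out one year
train_toy = toy.iloc[:-n_test]       # everything except the last 12 rows
test_toy = toy.iloc[-n_test:]        # the last 12 rows

print(len(train_toy), len(test_toy))
print(train_toy.index[-1], test_toy.index[0])
```

For time series the split must stay chronological, so the test set is always the most recent block rather than a random sample.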

### __Fit SARIMA(0,1,1)(2,0,2,12) model__

In [None]:
model = SARIMAX(train['interpolated'],order=(0, 1, 1),seasonal_order=(2, 0, 2, 12))
results = model.fit()
results.summary()

In [None]:
#let's get the forecast
start=len(train)
end=len(train)+len(test)-1
predictions = results.predict(start=start, end=end, dynamic=False, typ='levels').rename('SARIMA(0,1,1)(2,0,2,12) Predictions')

Passing <tt>dynamic=False</tt> means that forecasts at each point are generated using the full history up to that point (all lagged values).
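The difference between non-dynamic and dynamic prediction can be sketched with a toy AR(1) model. The coefficient `phi` and the observations below are assumed purely for illustration; this is not how statsmodels computes its predictions internally, only the idea behind the flag.

```python
import numpy as np

phi = 0.8                                       # assumed AR(1) coefficient
y = np.array([1.0, 0.9, 0.5, 0.7, 0.2, 0.4])    # illustrative observations

# Non-dynamic (one-step-ahead): each prediction uses the observed previous value
one_step = phi * y[:-1]

# Dynamic: after the first step, each prediction is fed back in as the "previous value"
dynamic = np.empty(len(y) - 1)
prev = y[0]
for i in range(len(dynamic)):
    prev = phi * prev
    dynamic[i] = prev

print(np.round(one_step, 4))
print(np.round(dynamic, 4))
```

The two agree on the first step and then diverge. The distinction only matters where observations exist; beyond the end of the training data no observed lags are available, so forecasts there necessarily build on earlier forecasts.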

Passing <tt>typ='levels'</tt> predicts the levels of the original endogenous variables. If we'd used the default <tt>typ='linear'</tt> we would have seen linear predictions in terms of the differenced endogenous variables.
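The relationship between the two scales can be sketched as follows for a model with $d=1$: predictions of the differenced series are changes, and levels are recovered by adding the cumulative changes to the last observed value. All numbers here are hypothetical.

```python
import numpy as np

levels = np.array([400.0, 401.5, 402.1, 403.0])   # illustrative observed history
last_level = levels[-1]

# Hypothetical predictions on the differenced scale (month-to-month changes)
diff_forecast = np.array([0.8, -0.3, 0.5])

# Back on the original scale: last observed level plus accumulated changes
level_forecast = last_level + np.cumsum(diff_forecast)
print(level_forecast)
```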

More information on these arguments is available here: https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima_model.ARIMAResults.predict.html

In [None]:
#compare predictions to actuals (iterate by position rather than by integer label)
for pred, actual in zip(predictions, test['interpolated']):
    print(f"predicted={round(pred,5)}, expected={actual}")

In [None]:
# Plot predictions vs actuals

plt.style.use('ggplot')

title = 'Monthly Mean CO₂ Levels (ppm) over Mauna Loa, Hawaii'
ylabel='parts per million' #no xlabel needed

ax = test['interpolated'].plot(legend=True,figsize=(12, 6),title=title)
predictions.plot(legend=True)
ax.autoscale(axis='x',tight=True)
ax.set(ylabel=ylabel);

### __Model evaluation__

In [None]:
from sklearn.metrics import mean_squared_error

error = mean_squared_error(test['interpolated'], predictions)
print(f'SARIMA(0,1,1)(2,0,2,12) MSE Error: {error:11.10}')

In [None]:
from statsmodels.tools.eval_measures import rmse

error = rmse(test['interpolated'], predictions)
print(f'SARIMA(0,1,1)(2,0,2,12) RMSE Error: {error:11.10}')

In [None]:
# Express the RMSE relative to the mean of the test data
error = error / test.interpolated.mean()
error

In [None]:
mape = (sum(abs((test['interpolated'] - predictions)\
                /test['interpolated'])))*(100/len(test['interpolated']))
mape

__Remember that the MAPE is expressed as a percentage!__
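To see how the three metrics relate, here is a small worked example with plain NumPy on made-up numbers (the arrays are illustrative, not model output).

```python
import numpy as np

actual = np.array([100.0, 200.0, 400.0])        # illustrative true values
predicted = np.array([110.0, 190.0, 400.0])     # illustrative predictions

errors = actual - predicted
mse = np.mean(errors ** 2)                      # mean squared error (squared units)
rmse = np.sqrt(mse)                             # same units as the data
mape = np.mean(np.abs(errors / actual)) * 100   # unit-free, in percent

print(f'MSE={mse:.3f}, RMSE={rmse:.3f}, MAPE={mape:.2f}%')
```

The RMSE is easier to interpret than the MSE because it is on the scale of the data, while the MAPE allows comparison across series with different scales.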

### __Apply model to full dataset to forecast the future!__

In [None]:
model = SARIMAX(df['interpolated'],order=(0,1,1),seasonal_order=(2,0,2,12))
results = model.fit()
fcast = results.predict(len(df),len(df)+11,typ='levels').rename('SARIMA(0,1,1)(2,0,2,12) Forecast')
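The integer positions `len(df)` to `len(df)+11` correspond to the 12 months that follow the last observation. The sketch below shows that mapping on a toy month-start index; the dates are illustrative and not the actual range of the CO₂ file.

```python
import pandas as pd

# Toy month-start index standing in for the observed data (illustrative dates)
idx = pd.date_range('2000-01-01', periods=24, freq='MS')

# Positions len(idx) ... len(idx)+11 are the next 12 month starts
future = pd.date_range(idx[-1] + idx.freq, periods=12, freq='MS')
print(future[0], future[-1])
```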

In [None]:
# Plot predictions against known values
title = 'Monthly Mean CO₂ Levels (ppm) over Mauna Loa, Hawaii'
ylabel='parts per million'
xlabel=''

ax = df['interpolated'].plot(legend=True,figsize=(12, 6),title=title)
fcast.plot(legend=True)
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel);