<a href="https://colab.research.google.com/github/cagBRT/timeSeries/blob/main/TimeSeriesForecasting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install statsmodels

In [None]:
import statsmodels.api as sm
import pandas as pd

from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
from math import sqrt
from pandas import read_csv
from matplotlib import pyplot
from pandas.plotting import autocorrelation_plot

https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/

https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/


Forecasting a time series can be broadly divided into two types.

- **Univariate Time Series Forecasting**: use only the previous values of the time series to predict its future values.<br>

- **Multivariate Time Series Forecasting**: use predictors other than the series itself (a.k.a. exogenous variables) to forecast.



# **ARIMA** (AutoRegressive Integrated Moving Average)

This notebook focuses on a univariate time series forecasting method called ARIMA modeling.

ARIMA, short for ‘AutoRegressive Integrated Moving Average’, is a forecasting algorithm based on the idea that the information in the past values of the time series can alone be used to predict the future values.

ARIMA is actually a class of models that ‘explains’ a given time series based on its own past values, that is, its own lags and the lagged forecast errors, so that the resulting equation can be used to forecast future values.

**Any ‘non-seasonal’ time series that exhibits patterns and is not random white noise can be modeled with ARIMA models**.

An ARIMA model is characterized by 3 terms: p, d, q

where,

- p is the order of the AR term
- q is the order of the MA term
- d is the number of differencing operations required to make the time series stationary

**If a time series has seasonal patterns, then you need to add seasonal terms and it becomes SARIMA, short for ‘Seasonal ARIMA’.**

# **Step 1: Make the time series stationary**<br>
The first step to build an ARIMA model is to make the time series stationary.

Why?

Because the term ‘Auto Regressive’ in ARIMA means it is a linear regression model that uses its own lags as predictors. Linear regression models work best when the predictors are not correlated and are independent of each other.



**Use Differencing to make a time series stationary**<br>

The most common approach is to difference the series, that is, to subtract the previous value from the current value. Depending on the complexity of the series, more than one round of differencing may be needed.

The value of d, therefore, is the minimum number of differencing operations needed to make the series stationary. If the time series is already stationary, then d = 0.
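
As a minimal sketch of what differencing does, assume a made-up series with a quadratic trend (this is illustrative data, not the shampoo dataset). One round of differencing leaves a linear trend, and a second round leaves a constant, which would suggest trying d = 2 for a series like this one.

```python
import pandas as pd

# Hypothetical series with a quadratic trend (not the shampoo data).
y = pd.Series([1, 4, 9, 16, 25, 36], dtype=float)

# First difference: y[t] - y[t-1]. The first value has no predecessor,
# so the leading NaN is dropped.
d1 = y.diff().dropna()

# Second difference: difference the already-differenced series.
d2 = d1.diff().dropna()

print(d1.tolist())  # [3.0, 5.0, 7.0, 9.0, 11.0]
print(d2.tolist())  # [2.0, 2.0, 2.0, 2.0]
```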

‘p’ is the order of the ‘Auto Regressive’ (AR) term. It refers to the number of lags of Y to be used as predictors. And ‘q’ is the order of the ‘Moving Average’ (MA) term. It refers to the number of lagged forecast errors that should go into the ARIMA Model.
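
Putting the three terms together, a common way to write an ARIMA(p, d, q) model is shown here. The symbols are assumptions chosen for illustration and do not come from this notebook: $Y'_t$ is the series after differencing d times, $c$ is a constant, $\phi_i$ are the AR coefficients, $\theta_j$ are the MA coefficients, and $\varepsilon_t$ is the forecast error at time t.

$$
Y'_t = c + \phi_1 Y'_{t-1} + \dots + \phi_p Y'_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}
$$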

Shampoo Sales Dataset<br>
This dataset describes the monthly number of shampoo sales over a 3-year period.

The units are a sales count and there are 36 observations. The original dataset is credited to Makridakis, Wheelwright, and Hyndman (1998).


In [None]:
# Clone the entire repo.
!git clone -s https://github.com/cagBRT/timeSeries.git cloned-repo
%cd cloned-repo

In [None]:
from datetime import datetime  # needed by parser(); not imported above

def parser(x):
	return datetime.strptime('190'+x, '%Y-%m')

In [None]:
!cat shampoo.csv

In [None]:
series = read_csv('shampoo.csv', header=0, parse_dates=['Month'], index_col='Month')
series.head()

In [None]:
series.plot()
pyplot.show()

There is a clear trend. <br>
This suggests that the time series is not stationary and will require differencing to make it stationary, with a difference order of at least 1.

# **AutoCorrelation**

Autocorrelation represents the degree of similarity between a given time series and a lagged version of itself over successive time intervals. Autocorrelation measures the relationship between a variable's current value and its past values.
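
To make the definition concrete, here is a small sketch that computes the autocorrelation at a few lags with `Series.autocorr`, using a made-up trending series rather than the shampoo data. Note that `autocorrelation_plot` normalizes slightly differently (it uses the mean and variance of the whole series), so its values will not match these exactly.

```python
import pandas as pd

# Hypothetical trending series (not the shampoo data).
y = pd.Series([10, 12, 15, 19, 24, 30, 37, 45], dtype=float)

# Series.autocorr(lag=k) is the Pearson correlation between y[t] and y[t-k].
for lag in (1, 2, 3):
    print('lag %d: %.3f' % (lag, y.autocorr(lag=lag)))

# The same lag-1 value, computed explicitly from the shifted series.
lag1_manual = y.corr(y.shift(1))
print('lag 1 (manual): %.3f' % lag1_manual)
```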

Looking at the autocorrelation plot (produced by the cell below), we see a positive correlation over the first 10 to 12 lags, and it is significant for roughly the first 5 lags. <br>
A good starting point for the AR parameter of the model may therefore be 5.

Repeated from above:<BR>
An ARIMA model is characterized by 3 terms: p, d, q

where,

- p is the order of the AR term
- q is the order of the MA term
- d is the number of differencing operations required to make the time series stationary

In [None]:
autocorrelation_plot(series)
pyplot.show()

https://www.statsmodels.org/stable/index.html<br>

https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html


**Train the model on the time series**

In [None]:
# fit model
model = ARIMA(series, order=(5,1,0))
model_fit = model.fit()
# summary of fit model
print(model_fit.summary())

The “residuals” in a time series model are what is left over after fitting a model. <br>
For many (but not all) time series models, the residuals are equal to the difference between the observations and the corresponding fitted values.
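
In the simple case, that definition amounts to a subtraction. The numbers below are made up for illustration and are not output from the fitted model.

```python
import numpy as np

# Hypothetical observations and fitted values (illustrative numbers only).
observed = np.array([10.0, 12.0, 11.0, 15.0, 14.0])
fitted = np.array([9.0, 13.0, 11.0, 14.0, 15.0])

# Residual = observation minus the corresponding fitted value.
residuals = observed - fitted

print(residuals.tolist())  # [1.0, -1.0, 0.0, 1.0, -1.0]
print(residuals.mean())    # 0.0
```

A residual mean close to zero suggests the fitted values are not systematically too high or too low, which is what the summary statistics further down help to check.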

In [None]:
# line plot of residuals
residuals = pd.DataFrame(model_fit.resid)
residuals.plot()
pyplot.show()

In [None]:
# density plot of residuals
residuals.plot(kind='kde')
pyplot.show()

In [None]:
# summary stats of residuals
print(residuals.describe())

Note that although we used the entire dataset for the analysis above, ideally we would perform this analysis on just the training dataset when developing a predictive model.

Next, let’s look at how we can use the ARIMA model to make forecasts.<br>
The ARIMA model can be used to forecast future time steps.



# **Training an ARIMA model**

In [None]:
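# The dataset was already loaded above; the original tutorial's loading call is kept for reference.
#series = read_csv('shampoo.csv', header=0, index_col=0, parse_dates=True, squeeze=True, date_parser=parser)
# Convert the DatetimeIndex to a monthly PeriodIndex so the index carries an explicit frequency.
series.index = series.index.to_period('M')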
series

In [None]:
# split into train and test sets
X = series.values.ravel()  # flatten to a 1-D array of sales values
size = int(len(X) * 0.66)
train, test = X[0:size], X[size:len(X)]

#History contains all the sales values in the training set
history = [x for x in train]
#Predictions is an empty list
predictions = list()

In [None]:
predictions

In [None]:
history

In [None]:
# walk-forward validation
#Train the model on the past data and compare the predictions to the ground truth
for t in range(len(test)):
	model = ARIMA(history, order=(5,1,0))
	model_fit = model.fit()
	output = model_fit.forecast()
	yhat = output[0]
	predictions.append(yhat)
	obs = test[t]
	history.append(obs)
	print('predicted=%f, expected=%f' % (yhat, obs))
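
The walk-forward loop above can be easier to follow with the model stripped out. The sketch below keeps the same structure (forecast one step, record it, then reveal the true value) but swaps ARIMA for a persistence forecast that simply repeats the last observed value, and it runs on made-up numbers. It also computes the RMSE by hand, which is what `mean_squared_error` followed by `sqrt` does in the evaluation cell.

```python
from math import sqrt

# Hypothetical data; a persistence (naive) forecast stands in for ARIMA.
train = [10.0, 12.0, 13.0]
test = [15.0, 14.0, 18.0]

history = list(train)
predictions = []
for obs in test:
    yhat = history[-1]       # forecast = last observed value
    predictions.append(yhat)
    history.append(obs)      # reveal the true value before the next step

# Root mean squared error between the forecasts and the expected values.
rmse = sqrt(sum((p - o) ** 2 for p, o in zip(predictions, test)) / len(test))

print(predictions)     # [13.0, 15.0, 14.0]
print(round(rmse, 3))  # 2.646
```

A persistence forecast like this is also a handy baseline: an ARIMA configuration is only worth keeping if its RMSE beats it.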

In [None]:
# evaluate forecasts
rmse = sqrt(mean_squared_error(test, predictions))
# plot forecasts against actual outcomes
pyplot.plot(test, label='expected')
pyplot.plot(predictions, color='red', label='predicted')
pyplot.title('Test RMSE: %.3f' % rmse)
pyplot.legend()
pyplot.show()

The model could use further tuning of the p, d, and maybe even the q parameters.

**Assignment** <br>
Tune the model by adjusting the p, d, and q parameters.