# Assignment Lesson 9:  Forecasting
In this assignment, we will explore the Python package [statsmodels](http://www.statsmodels.org/stable/tsa.html) to forecast time series data. You will learn to use different time series modeling techniques for forecasting.
<br>
Original version found in MLEARN 510 Canvas. Updated and modified by Ernst Henle.
<br>
Copyright © 2024 by Ernst Henle 

# Learning Objectives:
- Decompose time series into autocorrelation, seasonality, trend, and noise. 
- Explain the effects of exponential smoothing models and differentiate them from other models.
- Apply and evaluate the results of an autoregressive model. 
- Apply and evaluate the results of a moving average model. 
- Apply and evaluate the results of an autoregressive integrated moving average model.
- Apply and evaluate the results of an ARIMA model for forecasting (time series prediction).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings('ignore')
# # Suppress the specific warnings
# warnings.filterwarnings("ignore", category=UserWarning, message="No frequency information was provided")
%matplotlib inline

## Air Passenger Dataset
This dataset provides monthly totals of international airline passengers from 1949 to 1960. You can find a copy of the dataset on [Kaggle](https://www.kaggle.com/rakannimer/air-passengers) or [R datasets](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/AirPassengers.html).
1. The file is read in as a dataframe
2. The first column of the file is read in as an index of the dataframe
3. The index datatype is parsed as a datetime (`parse_dates=True`)
4. The index column header ('Month') is removed
5. The value column is called 'airline passengers'
6. The dataframe (144 rows) is split into a training dataframe (first 130 rows) and a testing dataframe (last 14 rows)

In [None]:
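# Read the CSV, use the first column as the index, and parse the index as dates
df = pd.read_csv('../data/airline-passengers.csv', index_col=[0], parse_dates=True)
# Re-assign the raw index values, which removes the index name ('Month')
df.index = df.index.values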
display(df.head())

# split the data into train and test
train, test = df.iloc[:130, [0]], df.iloc[130:, [0]]
print(f'Data ({df.shape}) is split into training ({train.shape}) and testing ({test.shape}) dataframes')

# Remove original data to avoid accidental usage
df = None

# Present the data
plt.plot(train)
plt.plot(test)
plt.show()

## Question 1.1
Using the [seasonal_decompose](https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html) function from `statsmodels.tsa.seasonal`, apply additive decomposition to the training dataset and plot each component of the decomposition.

In [None]:
# Add Code here

# import function from statsmodels

# additive decomposition


## Question 1.2
Using the [seasonal_decompose](https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html) function from `statsmodels.tsa.seasonal`, apply multiplicative decomposition to the same training dataset and plot each component of the decomposition.

In [None]:
# Add Code here

# multiplicative decomposition


## Question 1.3
Determine the p-values of the [Augmented Dickey-Fuller test](https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.adfuller.html) for the residuals of both the additive and multiplicative decompositions.

In [None]:
# Augmented Dickey-Fuller tests on residuals

# Present P-values


## Question 1.4
Which decomposition makes more sense for this dataset? Why?
- Compare and discuss the two sets of decomposition plots
- Compare and discuss the two Augmented Dickey-Fuller test p-values
- Use the concept of stationarity to explain the value of the decompositions

### Add Discussion here:


## Question 2.1
- Apply the simple exponential smoothing technique ([SimpleExpSmoothing](https://www.statsmodels.org/stable/generated/statsmodels.tsa.holtwinters.SimpleExpSmoothing.html)) to the airline dataset.
- Find the value of the hyper-parameter `smoothing_level`, to within +/- 0.1, that gives the lowest RMSE.
- Report the prediction accuracy (RMSE) on the test dataset.
- Present the training, test, and predicted time series using the function `plotTrainTestPred`.

In [None]:
def plotTrainTestPred(train, test, pred):
    plt.plot(train['airline passengers'], label='train')
    plt.plot(test['airline passengers'], linewidth=3, label='test')
    plt.plot(pred, linestyle='--', label='predicted')
    plt.title('Compare Train, Test, and Predicted Time Series')
    plt.legend()
    plt.grid()
    plt.show()

In [None]:
warnings.filterwarnings('ignore')

# Create SimpleExpSmoothing object and train on training data
# Add code here


In [None]:
# Optimize SimpleExpSmoothing
# fit model, make predictions, determine error to find best smoothing_level

# Plot Accuracy (RMSE) vs smoothing level

# Plot training, testing, and predicted time series


## Question 2.2
Apply the Holt-Winters exponential smoothing (HWES) technique ([ExponentialSmoothing](https://www.statsmodels.org/stable/_modules/statsmodels/tsa/holtwinters/model.html)) to the airline dataset and report the prediction accuracy (RMSE) on the test dataset.
- Use the smoothing level from before.
- Use `trend` and `seasonal` hyper-parameters to improve model accuracy.

In [None]:
warnings.filterwarnings('ignore')

# Optimize ExponentialSmoothing
# fit model, make predictions, determine error to find best trend and seasonal parameters

# create ExponentialSmoothing object with different trend and seasonal hyper-parameters

# present RMSE

# Plot training, testing, and predicted time series


## Question 3
Apply an autoregressive (AR) model to the airline dataset and report the prediction accuracy (RMSE) on the test dataset. An AR model is a special case of the [ARIMA](https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html) model in which only the `p` parameter of `order=(p, d, q)` is used.
- The differencing order `d` of `order=(p, d, q)` is set to zero in AR.
- The moving-average order `q` of `order=(p, d, q)` is set to zero in AR.
- Find the lag `p` of `order=(p, d, q)` that minimizes RMSE. Try lags 10 through 22.

In [None]:
# AR optimization

# Determine best AR lag "p"; determine and present RMSE

# Plot training, testing, and predicted time series


## Question 4
Apply an autoregressive moving average (ARMA) model to the airline dataset and report the prediction accuracy (RMSE) on the test dataset. An ARMA model is a special case of [ARIMA](https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html) in which only the `p` and `q` parameters of `order=(p, d, q)` are used.
- Set `p` of `order=(p, d, q)` to the value that you found for the AR model.
- The differencing order `d` of `order=(p, d, q)` is set to zero in ARMA.
- Find the moving-average order `q` of `order=(p, d, q)` that minimizes RMSE. Try values 10 through 22.

In [None]:
# ARMA optimization

# Determine best MA lag "q" given the parameter p determined for the AR model
# present RMSE and q

# Plot training, testing, and predicted time series


## Question 5
Apply an autoregressive integrated moving average model ([ARIMA](https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html)) to the airline dataset and report the prediction accuracy (RMSE) on the test dataset. In an ARIMA model we need to set all three parameters `p`, `d`, and `q` of the `order=(p, d, q)` hyper-parameter:
- Set `p` of `order=(p, d, q)` to the value that you found for the AR model.
- Set `q` of `order=(p, d, q)` to the value that you found for the ARMA model.
- Optimize the ARIMA model by finding the `d` of `order=(p, d, q)` that minimizes RMSE. Try values 0, 1, and 2.

In [None]:
# ARIMA optimization

# fit ARIMA model, find best 'd', present 'd' and RMSE

# Plot training, testing, and predicted time series


## Question 6
After running through various time series models, summarize your findings. 

### Add Discussion here
