# QBUS3850 Lab 1 Tasks

This tutorial will cover reading in data using pandas, understanding how dates and times work in Python and implementing an expanding window. The dataset that we will use is the electricity dataset from lectures (this can be found on Canvas). We will evaluate forecasts from simple exponential smoothing, Holt's method and the Holt Winters' additive method.

## Data and data types

1. Read the data from the file into a variable called `df`. 
2. Check the type of each variable by running `df.dtypes`. Is the variable `SETTLEMENTDATE` a datetime or an object? 
3. If it is an object, convert it to a datetime.

The function `read_csv()` can be used to read in the data. Make sure that the file electricity.csv is in your working directory; otherwise, provide a full path to the file.

In [1]:
import pandas as pd
df = pd.read_csv('electricity.csv')
df.dtypes


REGION             object
SETTLEMENTDATE     object
TOTALDEMAND       float64
RRP               float64
PERIODTYPE         object
dtype: object

By default, `read_csv()` has read in `SETTLEMENTDATE` as the same type as `REGION`, i.e. as a string and not a datetime. Sometimes we can add the argument `parse_dates=True`, but that only tries to parse the index, so it will not work in this case. Instead, to coerce `SETTLEMENTDATE` to a datetime, run the following:

In [2]:
df['SETTLEMENTDATE'] = pd.to_datetime(df['SETTLEMENTDATE'])
df.dtypes

REGION                    object
SETTLEMENTDATE    datetime64[ns]
TOTALDEMAND              float64
RRP                      float64
PERIODTYPE                object
dtype: object

Note now that the data type of `SETTLEMENTDATE` is a datetime.
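
As an alternative to converting after the fact, `read_csv()` also accepts a list of column names in `parse_dates`. The sketch below is illustrative only: the two rows are made up, the name `sample` is hypothetical, and it assumes the timestamps are in a format pandas can parse.

```python
import io
import pandas as pd

# Two made-up rows in the same column layout as the lab file (values are illustrative only)
csv_text = """REGION,SETTLEMENTDATE,TOTALDEMAND,RRP,PERIODTYPE
NSW1,2021-04-01 00:30:00,7000.0,40.0,TRADE
NSW1,2021-04-01 01:00:00,6900.0,38.5,TRADE
"""

# Naming the column in parse_dates asks read_csv to parse it while reading
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['SETTLEMENTDATE'])
print(sample.dtypes)

# Once the column is a datetime, the difference between two rows is a Timedelta
step = sample['SETTLEMENTDATE'].iloc[1] - sample['SETTLEMENTDATE'].iloc[0]
print(step)
```

Working with a true datetime column is what makes the comparisons against `datetime.datetime` objects later in the lab possible.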

Let's define a helper function that trains and evaluates a given model.

In [3]:
#numpy needed to compute the squared errors
import numpy as np

def evaluate_model(model, hmax, test):
    """
    A helper function that evaluates a given model and returns the
    squared error at each horizon against a given test series.
    """
    #Fit model
    fit = model.fit()
    #Make forecasts for horizons 1,...,hmax
    fc = fit.forecast(hmax)
    #Compute squared error at each horizon
    return np.square(fc-test)
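
The helper only relies on the model exposing `fit()` and the fitted result exposing `forecast()`. As a rough illustration of that interface, the sketch below passes in a hypothetical `NaiveModel` class (not part of statsmodels, introduced here purely for illustration) that forecasts the last observed value at every horizon.

```python
import numpy as np

def evaluate_model(model, hmax, test):
    """Fit the model, forecast hmax steps and return the squared errors."""
    fit = model.fit()
    fc = fit.forecast(hmax)
    return np.square(fc - test)

class NaiveModel:
    """Hypothetical stand-in: forecasts the last observed value at every horizon."""
    def __init__(self, y):
        self.y = np.asarray(y, dtype=float)

    def fit(self):
        # Nothing to estimate, so the "fitted" object is the model itself
        return self

    def forecast(self, steps):
        return np.full(steps, self.y[-1])

train_toy = [1.0, 2.0, 3.0]       # made-up training values
test_toy = np.array([4.0, 5.0])   # made-up test values

# Forecasts are [3, 3], so the errors are -1 and -2 before squaring
sqerr_naive = evaluate_model(NaiveModel(train_toy), 2, test_toy)
print(sqerr_naive)
```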

### Forecast exercise

1. Using data from April 1, 00:30 to April 25, 00:00 as training data, generate forecasts for the next 6 hours (12 half-hour periods) using:
  - Simple Exponential Smoothing
  - Holt's Method
  - Holt Winters' additive method
2. Compute the squared error (i.e. $(y_{t+h}-\hat{y}_{t+h})^2$) for each method at each horizon
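
The train/test split in step 1 can be checked with the standard `datetime` module before touching the data. This is only a sketch: the list `stamps` is a hypothetical grid of half-hourly timestamps standing in for `SETTLEMENTDATE`, and the assumption that the series starts on April 1 at 00:30 comes from the task statement rather than from the file.

```python
import datetime

step = datetime.timedelta(minutes=30)
start = datetime.datetime(2021, 4, 1, 0, 30)

# Hypothetical grid: 30 days of half-hourly timestamps
stamps = [start + i * step for i in range(30 * 48)]

hmax = 12                                        # longest horizon in half-hour steps
endtrain = datetime.datetime(2021, 4, 25, 0, 0)  # last training timestamp
fullhor = hmax * step                            # a timedelta times an int is a timedelta

# Training data runs up to and including endtrain
train_stamps = [t for t in stamps if t <= endtrain]
# Test data is the hmax periods immediately after endtrain
test_stamps = [t for t in stamps if endtrain < t <= endtrain + fullhor]

print(len(train_stamps), len(test_stamps))
print(test_stamps[0], test_stamps[-1])
```

The strict inequality on the left of the test filter keeps the last training observation out of the test set.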


In [4]:
#Switch off warnings
import warnings
warnings.simplefilter('ignore')

#datetime needed to manipulate dates
import datetime 
#numpy needed to work with vectors
import numpy as np
#statsmodels needed for models
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, Holt, ExponentialSmoothing

#longest horizon in half hour steps
hmax = 12
#full horizon as a time interval
fullhor = datetime.timedelta(hours=hmax/2)
#construct end of training period as datetime
endtrain = datetime.datetime(2021,4,25,0,0)

#Filter training data
train = df[(df['SETTLEMENTDATE']<=endtrain)]
#Filter test data
test = df[(df['SETTLEMENTDATE']>endtrain) & (df['SETTLEMENTDATE']<=(endtrain+fullhor))]

train_totaldemand_np = train['TOTALDEMAND'].to_numpy()
test_totaldemand_np = test['TOTALDEMAND'].to_numpy()

#Simple Exponential Smoothing
#Specify model
model = SimpleExpSmoothing(train_totaldemand_np)
sqerr_ses = evaluate_model(model, hmax, test_totaldemand_np)

#Holt's Method (same steps as above)
model = Holt(train_totaldemand_np)
sqerr_holt = evaluate_model(model, hmax, test_totaldemand_np)

#Holt Winters' Method (same steps as above)
model = ExponentialSmoothing(train_totaldemand_np,trend='add',seasonal='add',seasonal_periods=48)
sqerr_hw = evaluate_model(model, hmax, test_totaldemand_np)

#Initialise Results data frame
res = pd.DataFrame({
    'h': range(1, hmax+1),
    'SES': sqerr_ses,
    'Holt': sqerr_holt,
    'HW': sqerr_hw,
})

print(res)




     h           SES           Holt            HW
0    1  3.610507e+04    3763.606711     78.704645
1    2  9.428790e+04    2530.230310   3591.170381
2    3  2.279140e+05    8564.493388   8883.444086
3    4  4.493736e+05   24774.036449   7417.397438
4    5  7.990632e+05   63933.554848   5127.934627
5    6  1.016857e+06   57237.695536   8988.582463
6    7  1.209457e+06   41009.150568   2983.450182
7    8  1.299197e+06   13105.742564   1451.263339
8    9  1.177645e+06    4657.589049   8954.805446
9   10  1.088774e+06   56688.456645  18350.298637
10  11  9.784837e+05  176778.391352  14736.249698
11  12  7.638818e+05  440533.625007  23593.396129


Some things to notice:

- Holt Winters has the lowest squared error one step ahead
- Holt Winters also has the lowest squared error 12 steps ahead; Holt has the lowest only at horizons 2, 3 and 9
- Simple exponential smoothing has the worst squared error at every horizon
- None of this is meaningful, since we are only looking at a single instance of forecasts.

3. Repeat the same exercise but for an expanding window that expands one half hour at a time. Do this over 192 windows (i.e. four days). This will take a minute or two to run.
4. Compute the Root Mean Square Error (RMSE), given by $RMSE=\sqrt{\frac{1}{192}\sum (y_{t+h}-\hat{y}_{t+h})^2}$, over all windows for each forecasting horizon.
5. Which method is the best at a one-step ahead horizon?
6. Which method is the best at a twelve-step ahead horizon?
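
The RMSE in step 4 can be accumulated as a running sum inside a loop over windows, or computed in one step from a matrix of squared errors. The sketch below uses made-up random numbers in place of real forecast errors (the array `sqerr` is hypothetical) to show that the two routes agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n_wind, hmax = 192, 12

# Made-up squared errors: one row per window, one column per horizon
sqerr = rng.random((n_wind, hmax))

# Running sum, as one would do inside a loop over windows
total = np.zeros(hmax)
for row in sqerr:
    total += row
rmse_loop = np.sqrt(total / n_wind)

# Vectorised equivalent: average over windows (axis 0), then take the square root
rmse_vec = np.sqrt(sqerr.mean(axis=0))

print(np.allclose(rmse_loop, rmse_vec))
```

Note that the square root is taken after averaging over windows, not inside the loop.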


In [5]:
#Switch off warnings
import warnings
warnings.simplefilter('ignore')

#set number of windows
n_wind=192
#define window increment
windowinc = datetime.timedelta(minutes=30)

#Create list of dates
datetime_list = [endtrain+i*windowinc for i in range(n_wind)]

#Initialise vectors that accumulate squared errors (converted to RMSE after the loop)
rmse_ses=np.zeros(hmax)
rmse_holt=np.zeros(hmax)
rmse_hw=np.zeros(hmax)

#Loop

for i in datetime_list:
    train = df[(df['SETTLEMENTDATE']<=i)]
    test = df[(df['SETTLEMENTDATE']>i) & (df['SETTLEMENTDATE']<=(i+fullhor))]
    
    train_totaldemand_np = train['TOTALDEMAND'].to_numpy()
    test_totaldemand_np = test['TOTALDEMAND'].to_numpy()

    #Simple Exponential Smoothing
    model = SimpleExpSmoothing(train_totaldemand_np)
    rmse_ses += evaluate_model(model, hmax, test_totaldemand_np)
    
    #Holt's Method
    model = Holt(train_totaldemand_np)
    rmse_holt += evaluate_model(model, hmax, test_totaldemand_np)

    #Holt-Winters Smoothing
    model = ExponentialSmoothing(train_totaldemand_np,trend='add',seasonal='add',seasonal_periods=48)
    rmse_hw += evaluate_model(model, hmax, test_totaldemand_np)


rmse_ses = np.sqrt(rmse_ses/n_wind)
rmse_holt = np.sqrt(rmse_holt/n_wind)
rmse_hw = np.sqrt(rmse_hw/n_wind)

#Initialise Results data frame
res = pd.DataFrame({
    'h': range(1, hmax+1),
    'SES': rmse_ses,
    'Holt': rmse_holt,
    'HW': rmse_hw,
})

print(res)

     h          SES         Holt           HW
0    1   225.410181   121.637894    79.098298
1    2   432.825043   289.433727   174.602173
2    3   624.453797   517.086586   300.297167
3    4   794.982706   779.362488   442.903622
4    5   943.366806  1064.788754   593.996161
5    6  1069.218372  1362.731069   743.764688
6    7  1173.787186  1667.923161   889.596493
7    8  1257.547522  1972.243714  1026.013996
8    9  1322.038006  2266.699075  1149.103118
9   10  1369.831946  2550.898687  1261.191304
10  11  1402.841879  2824.432033  1368.665138
11  12  1422.703148  3088.345673  1472.676263


- The best method one-step ahead is the Holt Winters Method.
- The best method twelve-steps ahead is Simple Exponential Smoothing.
- Holt performs reasonably well at short horizons but very poorly at medium to long horizons.
- Another thing to note is that we are not using all of the data. If we used all available data, there would be some windows towards the end for which the longer-horizon forecasts cannot be evaluated. This is not a major problem, but the denominator in the MSE would then differ across horizons (unlike here, where we can divide by 192 at every horizon to compute RMSE).
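
The last point can be made concrete with a toy example. In this sketch (all sizes and values are made up), five windows are evaluated at up to three horizons, and the final windows run out of future observations; missing errors are stored as `NaN` and each horizon is divided by its own count.

```python
import numpy as np

n_wind, hmax = 5, 3
n_future = 5  # made-up number of observations available after the first forecast origin

# One row per window, one column per horizon; NaN marks errors that cannot be evaluated
sqerr = np.full((n_wind, hmax), np.nan)
for w in range(n_wind):
    available = min(hmax, n_future - w)  # horizons that still have an actual value
    sqerr[w, :available] = 4.0           # made-up squared error

# Number of evaluable windows at each horizon (the denominator of the MSE)
counts = np.sum(~np.isnan(sqerr), axis=0)

# Sum the available squared errors, divide by the horizon-specific count, take the root
rmse = np.sqrt(np.nansum(sqerr, axis=0) / counts)

print(counts)
print(rmse)
```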

7. What is a major shortcoming of this evaluation? Hint: What happens on April 25th in Australia.

April 25th is Anzac Day, a major public holiday in Australia. In 2021 it fell on a Sunday, meaning the holiday was moved to the 26th. The evaluation period therefore includes a public holiday, and demand on such days is typically idiosyncratic.

## Additional Exercises (For those who finish quickly or as subsequent homework)

Modify the code above:

  1. To use a rolling rather than expanding window.
  2. To roll the window forward by 4 hours rather than half an hour.
  3. To use the mean absolute error $MAE=\frac{1}{192}\sum|y_{t+h}-\hat{y}_{t+h}|$ as an evaluation criterion.
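
One possible starting point for these modifications is sketched below. The helpers `window_bounds` and `mae` are hypothetical names, not part of the lab code, and the 24-day window width is an assumption chosen to match the length of the original training period.

```python
import datetime
import numpy as np

def window_bounds(first_end, n_wind, step, width=None):
    """Return (start, end) training bounds; start is None for an expanding window."""
    bounds = []
    for i in range(n_wind):
        end = first_end + i * step
        start = None if width is None else end - width
        bounds.append((start, end))
    return bounds

def mae(errors):
    """Mean absolute error per horizon; rows are windows, columns are horizons."""
    return np.mean(np.abs(errors), axis=0)

endtrain = datetime.datetime(2021, 4, 25, 0, 0)

# Rolling window: fixed width, moved forward 4 hours at a time
rolling = window_bounds(endtrain, n_wind=3,
                        step=datetime.timedelta(hours=4),
                        width=datetime.timedelta(days=24))
for start, end in rolling:
    print(start, end)

# Made-up forecast errors for two windows and two horizons
toy_errors = np.array([[1.0, -2.0], [-3.0, 4.0]])
print(mae(toy_errors))
```

With these bounds, the training data for a rolling window would be the rows with `start < SETTLEMENTDATE <= end`, whereas the expanding window drops the lower bound.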