## Establish Validation and Testing

### Naive Baseline Model

* Establish a similarity heuristic
* Define a simple model that predicts the average load for similar hours

### Model Validation
* Compare to MTLF using walk-forward validation
  * 24 hour windows
  * MAE and maximum absolute error
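The two metrics named above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the notebook's `show_error` helper (whose implementation is not shown here); the function name `error_summary` and the returned keys are assumptions, and "total error" is assumed to mean the sum of absolute errors.

```python
from typing import Dict, Sequence


def error_summary(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Summarize forecast error for paired hourly values (hypothetical helper)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # Absolute error for each hour.
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    total = sum(abs_errors)
    return {
        "mae": total / len(abs_errors),   # mean absolute error
        "max_error": max(abs_errors),     # worst single hour
        "total_error": total,             # assumed: sum of absolute errors
    }


# Three hours of made-up load values, purely for illustration.
summary = error_summary([10.0, 12.0, 9.0], [11.0, 12.0, 6.0])
print(summary)
```

MAE describes typical accuracy, while the maximum absolute error exposes the single worst hour, which an average can hide.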

In [10]:
import pandas as pd
from datetime import timedelta

data = pd.read_parquet('data/zone1.parquet')
hours = data.index.to_series()
first_hour = hours.iloc[0]
last_hour = hours.iloc[-1]

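# Hold out the final 365 days (8,760 hours) for testing.
test_start = last_hour - timedelta(days=364, hours=23)
test_end = last_hour
# The 365 days immediately before the test set are used for validation.
validation_start = test_start - timedelta(days=365)
validation_end = test_start - timedelta(hours=1)
# Everything earlier is training data.
train_start = first_hour
train_end = validation_start - timedelta(hours=1)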

train_data = data[train_start:train_end]
validation_data = data[validation_start:validation_end]
test_data = data[test_start:test_end]

train_data.head()

| hour | MSP | DayOfYear | HourEnding | IsBusinessHour | LRZ1 MTLF (MWh) | LRZ1 ActualLoad (MWh) |
|---|---|---|---|---|---|---|
| 2015-02-01 00:00:00-05:00 | 23.0 | 32 | 1 | 0 | 11099 | 11337.89 |
| 2015-02-01 01:00:00-05:00 | 21.02 | 32 | 2 | 0 | 10829 | 11014.87 |
| 2015-02-01 02:00:00-05:00 | 19.04 | 32 | 3 | 0 | 10565 | 10795.37 |
| 2015-02-01 03:00:00-05:00 | 19.04 | 32 | 4 | 0 | 10468 | 10714.42 |
| 2015-02-01 04:00:00-05:00 | 17.06 | 32 | 5 | 0 | 10432 | 10700.09 |
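The boundary arithmetic in the split is easy to get wrong by one hour, so a quick check can help. The sketch below repeats the same `timedelta` offsets with plain `datetime` values and confirms that the validation and test windows each cover 8,760 hours and sit back to back. The first and last timestamps are hypothetical stand-ins, since the actual last hour of the data is not shown, and the function names are illustrative. Label-based slicing on a pandas `DatetimeIndex` includes both endpoints, which is why each window ends one hour before the next begins.

```python
from datetime import datetime, timedelta


def split_bounds(first_hour: datetime, last_hour: datetime) -> dict:
    """Compute inclusive (start, end) bounds using the same offsets as the split above."""
    test_start = last_hour - timedelta(days=364, hours=23)
    validation_start = test_start - timedelta(days=365)
    one_hour = timedelta(hours=1)
    return {
        "train": (first_hour, validation_start - one_hour),
        "validation": (validation_start, test_start - one_hour),
        "test": (test_start, last_hour),
    }


def hours_in(bounds) -> int:
    """Count hourly timestamps in an inclusive (start, end) range."""
    start, end = bounds
    return (end - start) // timedelta(hours=1) + 1


# Hypothetical endpoints, for illustration only.
bounds = split_bounds(datetime(2015, 2, 1, 0), datetime(2019, 1, 31, 23))
for name, b in bounds.items():
    print(name, b[0], b[1], hours_in(b))
```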


### Similar Days Heuristic

A similar hour is a (potentially lagged) historical hour from a similar day. A
similar day is determined using similarity characteristics such as:
* weather,
* day of week,
* date,
* holidays,
* etc.

This practice follows the intuition that a day's load curve is unlikely to
differ greatly from every prior load curve: some earlier day with similar
conditions should have a similar shape. Rather than relying on a single day, a
regression procedure can combine several similar days.
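As a rough illustration of the heuristic, the sketch below picks the `k` prior days that share the target's day type (weekday or weekend) and have the closest mean temperature, then averages their load curves. Everything here is an assumption made for the example: the `DayRecord` structure, the use of mean temperature as the weather characteristic, and the plain average used in place of a regression procedure.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class DayRecord:
    day: date
    mean_temp: float          # hypothetical weather characteristic
    hourly_load: List[float]  # one load value per hour of the day


def is_weekend(d: date) -> bool:
    return d.weekday() >= 5  # Saturday is 5, Sunday is 6


def similar_day_forecast(history: List[DayRecord], target_day: date,
                         target_temp: float, k: int = 3) -> List[float]:
    """Average the load curves of the k most similar prior days."""
    # Only strictly earlier days of the same day type are candidates.
    candidates = [r for r in history
                  if r.day < target_day and is_weekend(r.day) == is_weekend(target_day)]
    if not candidates:
        raise ValueError("no comparable prior days")
    # Rank candidates by how close their temperature is to the target's.
    candidates.sort(key=lambda r: abs(r.mean_temp - target_temp))
    chosen = candidates[:k]
    n_hours = len(chosen[0].hourly_load)
    return [sum(r.hourly_load[h] for r in chosen) / len(chosen) for h in range(n_hours)]
```

Holidays, date proximity, and other characteristics from the list above could be added as extra filters or as terms in the ranking key.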

### Naive Similarity
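Our naive baseline uses `k`-nearest-neighbors regression: each hour's load is predicted from the `k` most similar preceding hours, where similarity is measured on the feature columns that remain after dropping the forecast and actual load columns.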

In [17]:
mtlf_col = 'LRZ1 MTLF (MWh)'
actual_col = 'LRZ1 ActualLoad (MWh)'
X = train_data.drop([mtlf_col, actual_col], axis=1)
y = train_data[actual_col]
Xvalid = validation_data.drop([mtlf_col, actual_col], axis=1)
yvalid = validation_data[actual_col]

In [18]:
from sklearn.neighbors import KNeighborsRegressor

n_neighbors = 8
knn = KNeighborsRegressor(n_neighbors, weights='distance')
yhat = knn.fit(X, y).predict(Xvalid)
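To make `weights='distance'` concrete, here is a dependency-free sketch of what a distance-weighted nearest-neighbor regressor computes for a single query point: the `k` closest training rows are found by Euclidean distance (the scikit-learn default metric), and their targets are averaged with weights equal to the inverse of each distance. The function name and the toy data are illustrative only, and tie-breaking among equidistant neighbors may differ from scikit-learn.

```python
import math
from typing import List, Sequence


def knn_distance_weighted(train_X: List[Sequence[float]], train_y: List[float],
                          query: Sequence[float], k: int) -> float:
    """Predict one value as the inverse-distance-weighted mean of the k nearest targets."""
    # Pair each training row's distance to the query with its target, keep the k nearest.
    nearest = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))[:k]
    # An exact match has zero distance; exact matches take all of the weight.
    exact = [y for d, y in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


# Toy one-feature example: the nearer neighbor (target 10) pulls the prediction toward it.
print(knn_distance_weighted([(0.0,), (2.0,), (10.0,)], [10.0, 20.0, 100.0], (0.5,), k=2))
```

Because the distance is Euclidean over raw feature values, a column with a wide numeric range such as `DayOfYear` may contribute more to the distance than a 0/1 flag such as `IsBusinessHour`; whether that matters for this baseline is not examined here.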

### Walk Forward Validation

Predict one day (24 hours) at a time, refitting the model on the training set plus every validation day that precedes the day being predicted.

In [44]:
from validation import show_error
import numpy as np

predictions = []
for d in range(365):  # one iteration per day of the validation year
    # Training set: the original training data plus the first d validation days.
    next_Xtrain = pd.concat([X, Xvalid.iloc[:24*d]])
    next_ytrain = pd.concat([y, yvalid.iloc[:24*d]])
    # Prediction window: the 24 hours of validation day d.
    next_Xpredict = Xvalid.iloc[24*d:24*(d+1)]
    # Refit on the expanded training set, then predict the next day.
    prediction = knn.fit(next_Xtrain, next_ytrain).predict(next_Xpredict)
    predictions.append(prediction)
# Stitch the daily predictions into one array aligned with yvalid.
yhat = np.concatenate(predictions)
show_error(yvalid, yhat)

'Mean Absolute Error = 694.0462069496332, Max Error = 3627.74177138688, Total Error = 6079844.772878793'
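The expanding-window pattern used in the walk-forward loop can be factored into a small generator that yields only row positions, which keeps the indexing separate from the model. This is a sketch under assumptions: the name `walk_forward_windows` and its tuple layout are hypothetical, and the window count is derived from the data length rather than hard-coded.

```python
from typing import Iterator, Tuple


def walk_forward_windows(n_train: int, n_valid: int,
                         horizon: int = 24) -> Iterator[Tuple[int, int, int]]:
    """Yield (rows_available_for_training, predict_start, predict_stop) per step.

    predict_start and predict_stop are positions within the validation set;
    rows_available_for_training counts the original training rows plus all
    validation rows that precede the prediction window.
    """
    for start in range(0, n_valid, horizon):
        stop = min(start + horizon, n_valid)  # the final window may be shorter
        yield n_train + start, start, stop


# Three days of hourly validation data after 100 training rows.
for step in walk_forward_windows(100, 72):
    print(step)
```

With 8,760 validation hours and a 24-hour horizon this yields 365 windows, matching the loop above.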

In [45]:
'MTLF ' + show_error(validation_data[actual_col], validation_data[mtlf_col])

'MTLF Mean Absolute Error = 222.98570091324197, Max Error = 2217.33, Total Error = 1953354.7399999993'

## Baseline

The MAE of the naive approach (about 694 MWh) is roughly 3.1 times that of the MTLF (about 223 MWh), and its maximum error (about 3,628 MWh) is roughly 1.6 times that of the MTLF (about 2,217 MWh).