## Establish Validation and Testing

### Naive Baseline Model

* Establish a similarity heuristic
* Define a simple model that predicts the average load for similar hours

### Model Validation
* Compare to MISO's medium-term load forecast (MTLF) using walk-forward validation
  * 24 hour windows
  * MAE and maximum absolute error
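
The walk-forward procedure outlined above can be sketched as follows. This is a minimal illustration rather than the notebook's implementation: the function name `walk_forward_errors` and its `start` and `window` parameters are assumptions, and `model` is any object with scikit-learn style `fit` and `predict` methods. The inputs are assumed to support positional slicing (for example NumPy arrays obtained with `.to_numpy()`).

```python
def walk_forward_errors(model, X, y, start, window=24):
    """Walk forward through the data in fixed-size windows (hypothetical helper).

    For each window, the model is refit on every observation that precedes
    the window and then predicts the window itself.
    Returns (mean absolute error, maximum absolute error).
    """
    if start >= len(y):
        raise ValueError("start must leave at least one observation to predict")

    abs_errors = []
    for t in range(start, len(y), window):
        stop = min(t + window, len(y))
        model.fit(X[:t], y[:t])            # only data observed before the window
        predictions = model.predict(X[t:stop])
        abs_errors.extend(
            abs(actual - predicted)
            for actual, predicted in zip(y[t:stop], predictions)
        )

    mae = sum(abs_errors) / len(abs_errors)
    return mae, max(abs_errors)
```

Refitting once per 24-hour window, instead of once per hour, keeps the cost manageable while still ensuring that no prediction uses information from its own window or later.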

In [None]:
test_data = 

### Similar Days Heuristic

A similar hour is a (potentially lagged) historical hour from a similar day. A
similar day is determined using similarity characteristics such as:
* weather,
* day of week,
* date,
* holidays,
* etc.

This practice follows the intuition that the load curve for a new day is likely
to resemble at least one previously observed load curve. A regression procedure
can combine several similar days rather than relying on a single one.
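
As a rough illustration of the heuristic, the sketch below averages the load of the most recent similar hours. The similarity rule used here (same hour of day and same day of week) is an assumption chosen for simplicity; weather and holidays are ignored. The function name `similar_hour_average` and the `(timestamp, load)` input format are hypothetical.

```python
from datetime import datetime, timedelta


def similar_hour_average(history, target, k=4):
    """Average the k most recent loads from hours similar to `target`.

    history: iterable of (datetime, load) pairs.
    An hour counts as similar when it precedes `target` and shares its
    hour of day and day of week (an assumed, simplified rule).
    """
    newest_first = sorted(history, key=lambda item: item[0], reverse=True)
    similar = [
        load
        for ts, load in newest_first
        if ts < target
        and ts.hour == target.hour
        and ts.weekday() == target.weekday()
    ]
    if not similar:
        raise ValueError("no similar hours found before the target hour")
    recent = similar[:k]
    return sum(recent) / len(recent)


# Synthetic example: three weeks of hourly data, predicting the same
# weekday and hour in the fourth week.
start = datetime(2021, 11, 1)
history = [
    (start + timedelta(hours=h), (h % 24) + 100 * (h // 168))
    for h in range(21 * 24)
]
estimate = similar_hour_average(history, start + timedelta(days=21, hours=5), k=2)
```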

### Naive Similarity
In our naive baseline model, we will simply use `k`-nearest neighbors from preceding hours.
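
To make the weighting concrete, the sketch below shows what a distance-weighted `k`-nearest-neighbors prediction computes for a single query point, using plain Python and Euclidean distance. It is an illustration of the idea, not scikit-learn's code; the function name `knn_distance_weighted` is hypothetical, and the handling of exact matches is an assumption intended to mirror the usual convention of averaging them.

```python
import math


def knn_distance_weighted(train_X, train_y, query, k):
    """Predict the inverse-distance-weighted mean of the k nearest targets."""
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]

    # If the query coincides with training points, average those targets
    # to avoid dividing by zero.
    exact = [y for d, y in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)

    weights = [1.0 / d for d, _ in nearest]
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, nearest))
    return weighted_sum / sum(weights)


# The two nearest neighbors of 2.0 are 1.0 and 3.0, both at distance 1,
# so their targets are weighted equally.
prediction = knn_distance_weighted([[0.0], [1.0], [3.0]], [10, 20, 40], [2.0], k=2)
```

Closer neighbors receive larger weights, so an hour that nearly matches the query dominates the prediction.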

In [38]:
# load existing dataset
from MISO_data import get_data, MISO_PREDICTION_COLUMN_NAME, TARGET_NAME

data = get_data()
data_train = data[data.index < '2021-12-1']
data_test = data.drop(data_train.index)

X = data_train.drop([MISO_PREDICTION_COLUMN_NAME, TARGET_NAME], axis=1)
y = data_train[TARGET_NAME]
Xtest = data_test.drop([MISO_PREDICTION_COLUMN_NAME, TARGET_NAME], axis=1)
yTest = data_test[TARGET_NAME]

In [77]:
from sklearn.neighbors import KNeighborsRegressor

n_neighbors = 8
knn = KNeighborsRegressor(n_neighbors, weights='distance')
yhat = knn.fit(X, y).predict(Xtest)

In [82]:
from validation import show_error

show_error(yTest, yhat)

# TODO: implement walk-forward validation. This should simulate doing
# knn.fit(pd.concat([X, last_hour]), pd.concat([y, y_last_hour])).predict(next_hour)

'Mean Absolute Error = 2621.884609475655, Max Error = 12656.180143081277, Total Error = 1536424.3811527349'

In [79]:
'MTLF ' + show_error(data_test[TARGET_NAME], data_test[MISO_PREDICTION_COLUMN_NAME])

'MTLF Mean Absolute Error = 1055.5164505119458, Max Error = 4932.380000000005, Total Error = 618532.6400000001'

## Baseline

The MAE of the naive approach is roughly 2.5 times that of the MTLF, and its maximum error is about 2.6 times that of the MTLF.
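
The ratios can be checked directly from the error values printed above. The variable names are illustrative only.

```python
# Error values copied from the show_error outputs above.
naive_mae, naive_max = 2621.884609475655, 12656.180143081277
mtlf_mae, mtlf_max = 1055.5164505119458, 4932.380000000005

mae_ratio = naive_mae / mtlf_mae
max_ratio = naive_max / mtlf_max
print(f"MAE ratio: {mae_ratio:.2f}, max error ratio: {max_ratio:.2f}")
```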