# DARE Deluxe Data Challenge I
DARE 2023

In this notebook, we will provide a template for the DARE Deluxe Data Challenge. The main steps covered here will be:

- Load the data
- Provide an overview of what is in the data
- Provide an example of a terribly performing baseline model
- Provide functions to quantify model predictive performance

In [None]:
colab = False
if colab:
    !git clone https://github.com/dare-centre/DDC-I
    import os
    os.chdir('DDC-I')
    !ls

In [None]:
# magic
%load_ext autoreload
%autoreload 2
%pdb 0
%matplotlib inline

## Imports and settings
Everything we need to get started.

In [None]:
import os, sys
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# plotting
from functions.plotting_functions import plot_model_fit
# helper
from functions.helper_functions import assess_model_prediction, inversescaler_pred_dict
# data
from functions.data_functions import load_hourly_data

## Load the data


In [None]:
train_val_x, train_val_y, test_x_all, test_y = load_hourly_data()

train_val_x.head()

In [None]:
# let's look at some properties of the data
print('Number of features: {}'.format(train_val_x.shape[1]))
print('Number of training observations: {}'.format(train_val_x.shape[0]))
print('Number of test observations: {}'.format(test_x_all.shape[0]))
print('Feature names: {}'.format(test_x_all.columns.tolist()))
print('Predict variable: {}'.format(train_val_y.columns.tolist()))

## Split our training data into training and validation sets
We can either split this randomly if we are just modelling as a plain regression problem (set `shuffle = True`), or we can split it sequentially if we are modelling as a time series problem (set `shuffle = False`).
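
As a minimal sketch of the difference between the two options, the toy example below splits a small hourly series both ways. The data, column names, and `random_state` value are illustrative assumptions and are not part of the challenge dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy hourly series (hypothetical data, not the challenge dataset)
idx = pd.date_range("2023-01-01", periods=10, freq=pd.Timedelta(hours=1))
x = pd.DataFrame({"feature": np.arange(10.0)}, index=idx)
y = pd.DataFrame({"target": 2.0 * np.arange(10.0)}, index=idx)

# Sequential split: the validation set is the final 20% of the series
tr_x, va_x, tr_y, va_y = train_test_split(x, y, test_size=0.2, shuffle=False)

# Random split: validation rows can come from anywhere in the series
tr_x_r, va_x_r, tr_y_r, va_y_r = train_test_split(
    x, y, test_size=0.2, shuffle=True, random_state=0
)

print("sequential validation times:", va_x.index.tolist())
print("random validation times:", va_x_r.index.tolist())
```

With `shuffle=False`, every validation timestamp comes after every training timestamp, which is usually what a time series model needs to avoid training on the future.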


In [None]:
# randomly or sequentially split the data into training and validation sets
val_split = False  # a validation set offers no benefit for our simple linear model
shuffle = True

if val_split:
    train_x_all, val_x_all, train_y, val_y = train_test_split(train_val_x, train_val_y, test_size=0.2, shuffle=shuffle)
else:
    train_x_all = train_val_x
    train_y = train_val_y
    val_x_all = None
    val_y = None

# create placeholders for our model predictions
train_y_pred = None
val_y_pred = None
test_y_pred = None


## Select predictors
We will use a very simple approach: compute each feature's absolute correlation with the target variable and keep the `top_n` most correlated features.
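
To illustrate correlation-based ranking on its own, here is a small sketch on synthetic data. The feature names (`strong`, `weak`, `noise`) and the data are made up for the example. It assumes a single target column, as in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): three features with different links to the target
rng = np.random.default_rng(0)
n = 2000
target = rng.normal(size=n)
features = pd.DataFrame({
    "strong": target + 0.1 * rng.normal(size=n),  # nearly a copy of the target
    "weak": target + 1.0 * rng.normal(size=n),    # noisy copy of the target
    "noise": rng.normal(size=n),                  # unrelated to the target
})
y = pd.DataFrame({"target": target})

# Absolute correlation of each feature with the target, highest first
target_col = y.columns[0]
corr_with_target = (
    pd.concat([features, y], axis=1)
    .corr()[target_col]
    .drop(target_col)  # the target correlates perfectly with itself
    .abs()
    .sort_values(ascending=False)
)
top_features = corr_with_target.head(2).index.tolist()
print(corr_with_target)
print(top_features)
```

Correlation only captures linear, one-feature-at-a-time relationships, so treat this ranking as a starting point rather than an optimal selection.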

In [None]:
top_n = 5 # select the top n features - use None for all features

if top_n is not None:
    ## YOUR CODE HERE
    # you can use your own code here to find the optimal features

    # example with a simple correlation approach:
    # rank features by absolute correlation with the target column
    target_col = train_y.columns[0]
    corr_df = pd.concat([train_x_all, train_y], axis=1).corr().abs()
    corr_to_target = corr_df[target_col].drop(target_col).sort_values(ascending=False)
    # print the top n predictors
    print(corr_to_target.head(top_n))
    top_n_features = corr_to_target.head(top_n).index.tolist()
    ## END CODE
else:
    # use all features
    top_n_features = train_x_all.columns.tolist()

# now subset the selected features and keep the time index of each set
train_x = train_x_all[top_n_features]
train_time = train_x_all.index
test_x = test_x_all[top_n_features]
test_time = test_x_all.index
if val_split:
    val_x = val_x_all[top_n_features]
    val_time = val_x_all.index
else:
    val_x = None
    val_time = None


## Scale the data if needed
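
As a brief sketch of the scaling pattern used in this section, the toy example below fits a `StandardScaler` on training targets only, reuses it on test targets, and maps the scaled values back to the original units. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy targets (hypothetical values), shaped as column vectors
train_y = np.array([[10.0], [20.0], [30.0]])
test_y = np.array([[40.0]])

scaler_y = StandardScaler()
train_y_scaled = scaler_y.fit_transform(train_y)  # mean and std come from training data only
test_y_scaled = scaler_y.transform(test_y)        # reuse the training mean and std

# Values in scaled units can be mapped back to the original units
recovered = scaler_y.inverse_transform(test_y_scaled)
print(train_y_scaled.ravel(), test_y_scaled.ravel(), recovered.ravel())
```

Fitting the scaler on the training data alone keeps information about the test set out of the preprocessing step.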

In [None]:
use_scaler = True

if use_scaler:
    # standardise the data for better performance
    scaler_x = StandardScaler()
    train_x = scaler_x.fit_transform(train_x)
    test_x = scaler_x.transform(test_x)
    if val_split:
        val_x = scaler_x.transform(val_x)

    scaler_y = StandardScaler()
    train_y = scaler_y.fit_transform(train_y)
    test_y = scaler_y.transform(test_y)
    if val_split:
        val_y = scaler_y.transform(val_y)

else:
    scaler_x = None
    scaler_y = None
    # easier if they're all numpy arrays from this point on
    train_x = train_x.values
    test_x = test_x.values
    if val_split:
        val_x = val_x.values
    train_y = train_y.values
    test_y = test_y.values
    if val_split:
        val_y = val_y.values

## Train a model
We will use a simple linear regression model as a baseline. You can implement your own model here.
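
Any estimator that exposes `fit` and `predict` can be dropped in. As a hedged sketch, the example below compares `LinearRegression` with `Ridge` on synthetic data. The data, the `alpha` value, and the choice of `Ridge` are assumptions made for illustration, not a recommendation for the challenge.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression problem (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=300)

# Hold out the last 100 rows for validation
X_train, X_val = X[:200], X[200:]
y_train, y_val = y[:200], y[200:]

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
}

# Fit each candidate and record its R^2 on the held-out rows
val_r2 = {}
for name, candidate in candidates.items():
    candidate.fit(X_train, y_train)
    val_r2[name] = candidate.score(X_val, y_val)

print(val_r2)
```

Comparing candidates like this needs a validation set, so set `val_split = True` in the notebook when trying anything beyond the baseline.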

In [None]:
import sklearn.linear_model as lm
from sklearn.neural_network import MLPRegressor

In [None]:
## YOUR CODE HERE

# fit a linear model to the data
model = lm.LinearRegression()
model.fit(train_x,train_y)

# fit a neural network to the data
# model = MLPRegressor(
#     hidden_layer_sizes=(100,20), max_iter=1000,
#     activation='relu', solver='adam',
# )
# model.fit(train_x,train_y.squeeze())

# predict on data
train_y_pred = model.predict(train_x)
test_y_pred = model.predict(test_x)
if val_x is not None:
    val_y_pred = model.predict(val_x)

## END CODE

## Assess model performance

When calling `assess_model_prediction`, leave `test=False` so that the test data remain unseen (honesty system!) until the end of the data challenge. This ensures that we don't overfit the model to the test data. For your final model run, set `test=True` to get the final assessment of your model performance.
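
The metrics computed inside `assess_model_prediction` are not shown in this notebook. As an illustrative stand-in, the sketch below computes two common regression metrics, RMSE and R², with `sklearn.metrics` on made-up values. It is not the helper's actual implementation.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy observed and predicted values (hypothetical)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

# Root mean squared error, in the same units as the target
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
# Coefficient of determination: 1 is perfect, 0 matches predicting the mean
r2 = r2_score(y_true, y_pred)

print(f"RMSE: {rmse:.4f}, R^2: {r2:.4f}")
```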

In [None]:
# construct the dict of data for plotting and metrics
data_dict = {
    'train_time': train_time,
    'train_y': train_y,
    'train_y_pred': train_y_pred,
    'test_time': test_time,
    'test_y': test_y,
    'test_y_pred': test_y_pred,
    'val_time': val_time,
    'val_y': val_y,
    'val_y_pred': val_y_pred,
}
predicted_data = inversescaler_pred_dict(data_dict, scaler=scaler_y)

In [None]:
# plot the model performance and get metrics
metrics = assess_model_prediction(predicted_data,test=False)
print('Model performance metrics:')
metrics