# Modelling

The objective of this notebook is to apply several methods to find the best model for the challenge problem.

## Libraries

In [1]:
import numpy as np 
import pandas as pd
from cnr_methods import get_simplified_data, transform_data, metric_cnr

from sklearn.model_selection import TimeSeriesSplit
from collections import deque
import xgboost as xgb

## Read Data

The data used here will eventually be the output of the Feature Engineering and Selection step (to be added later). For now, the original data is used.

In [2]:
# Initially using the Original Data
X = get_simplified_data()
X = X[X['Set']=='Train']
y = pd.read_csv('Data/Y_train.csv')

In [3]:
X['Time'] = pd.to_datetime(X['Time'],dayfirst=True)
X = X.set_index('Time')

For initial debugging, only one wind farm is considered.

In [4]:
WF = 'WF1'
X = X[X['WF']==WF]
y = y[y['ID'].isin(X['ID'])]

## Validation Scheme

Before proceeding to the hyperparameter search, a reliable way to measure model performance is needed. For this purpose, time series cross-validation is used: the "test" fold of each iteration serves as the validation data, and is therefore also the data on which early stopping is evaluated.
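As a minimal sketch of how this splitting behaves, the toy example below runs `TimeSeriesSplit` on 10 time-ordered samples. The array `samples` and the choice of 3 splits are assumptions made for illustration only; they are not the challenge data or the settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 10 time-ordered samples (hypothetical, for illustration only)
samples = np.arange(10)

tscv = TimeSeriesSplit(n_splits=3)
folds = [(train_idx.tolist(), val_idx.tolist())
         for train_idx, val_idx in tscv.split(samples)]

# Each fold trains on an expanding window and validates on the block that follows it
for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)
```

Every validation block comes strictly after its training window, so no fold is validated on rows that precede its training data.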

In [5]:
def validation_scheme(X, y, k_fold_splits, param, num_boost_round, early_stopping_rounds):
    """Return the best validation score of each time series split."""
    # Define the time series cross-validation (each fold trains on earlier rows only)
    tscv = TimeSeriesSplit(n_splits=k_fold_splits)
    non_features = ['ID', 'WF', 'Set']  # identifier columns, not model inputs

    scores = []
    for train_index, val_index in tscv.split(X):
        # Get the data of the split
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        dtrain = xgb.DMatrix(X_train.drop(non_features, axis=1), label=y_train['Production'])
        dval = xgb.DMatrix(X_val.drop(non_features, axis=1), label=y_val['Production'])

        # Train the model; the last entry of the watchlist drives early stopping
        watchlist = [(dtrain, 'train'), (dval, 'eval')]
        bst = xgb.train(param, dtrain, num_boost_round=num_boost_round, evals=watchlist,
                        feval=metric_cnr, early_stopping_rounds=early_stopping_rounds)
        scores.append(bst.best_score)
    return np.array(scores)

In [11]:
param = {'max_depth': 2, 'eta': 1, 'tree_method' : 'gpu_hist'}

In [12]:
scores = validation_scheme(X,y,k_fold_splits=8,param=param,num_boost_round=10,early_stopping_rounds=10)

[0]	train-rmse:1.64069	eval-rmse:1.33832	train-CAPE:68.31301	eval-CAPE:51.10481
Multiple eval metrics have been passed: 'eval-CAPE' will be used for early stopping.

Will train until eval-CAPE hasn't improved in 10 rounds.
[1]	train-rmse:1.30794	eval-rmse:1.36047	train-CAPE:49.81842	eval-CAPE:46.96684
[2]	train-rmse:1.24399	eval-rmse:1.32486	train-CAPE:45.24114	eval-CAPE:44.37091
[3]	train-rmse:1.21064	eval-rmse:1.33637	train-CAPE:43.32950	eval-CAPE:44.45235
[4]	train-rmse:1.17991	eval-rmse:1.30851	train-CAPE:41.72203	eval-CAPE:42.84243
[5]	train-rmse:1.12658	eval-rmse:1.79660	train-CAPE:40.10080	eval-CAPE:55.83307
[6]	train-rmse:1.08664	eval-rmse:1.83328	train-CAPE:38.94361	eval-CAPE:57.18132
[7]	train-rmse:1.05842	eval-rmse:1.96902	train-CAPE:37.57953	eval-CAPE:59.86401
[8]	train-rmse:1.04275	eval-rmse:1.92221	train-CAPE:36.48556	eval-CAPE:58.57725
[9]	train-rmse:1.01942	eval-rmse:1.90141	train-CAPE:36.33889	eval-CAPE:59.07350
[0]	train-rmse:1.45748	eval-rmse:1.54113	train-CAPE:55.92
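To make the early-stopping rule in the log above concrete, here is a simplified sketch in plain Python. It replays the `eval-CAPE` values printed for the first fold; the helper `best_round` and its `patience` argument are hypothetical names, and the logic is only an approximation of the rule, not XGBoost's actual implementation.

```python
def best_round(metric_values, patience):
    """Return (best_iteration, best_value, last_iteration_run) for a metric to minimize."""
    best_value = float("inf")
    best_iter = -1
    for i, value in enumerate(metric_values):
        if value < best_value:
            best_value, best_iter = value, i
        elif i - best_iter >= patience:
            # No improvement for `patience` rounds: stop here
            return best_iter, best_value, i
    return best_iter, best_value, len(metric_values) - 1

# eval-CAPE values copied from the first fold of the log above
eval_cape = [51.10481, 46.96684, 44.37091, 44.45235, 42.84243,
             55.83307, 57.18132, 59.86401, 58.57725, 59.07350]

full_run = best_round(eval_cape, patience=10)   # patience as large as the round budget
short_run = best_round(eval_cape, patience=3)   # hypothetical smaller patience
print(full_run, short_run)
```

With a patience of 10 and only 10 boosting rounds, the stopping condition can never trigger in this sketch, so all rounds run and only the best score is recorded. A smaller patience would cut training short a few rounds after the best iteration.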

In [13]:
scores

array([42.84243 , 50.470889, 78.238845, 42.27013 , 54.72114 , 38.346235,
       70.172557, 37.985143])
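The per-fold scores can be condensed into a few summary statistics before comparing hyperparameter settings. The sketch below copies the array printed above; the variable names are illustrative, and it assumes that a lower CAPE is better, as the early-stopping log suggests.

```python
import numpy as np

# Fold scores copied from the output above
scores = np.array([42.84243, 50.470889, 78.238845, 42.27013,
                   54.72114, 38.346235, 70.172557, 37.985143])

mean_score = scores.mean()          # average CAPE across folds
std_score = scores.std(ddof=1)      # sample standard deviation: fold-to-fold variability
worst_fold = int(scores.argmax())   # index of the fold with the highest CAPE

print(f"mean={mean_score:.3f}, std={std_score:.3f}, worst fold={worst_fold}")
```

A large spread between folds would indicate that performance depends strongly on which period is used for validation, which is worth keeping in mind when comparing parameter sets on the mean alone.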