# Modelling

The objective of this notebook is to search for and select the best model for the challenge problem.

## Libraries

In [1]:
import numpy as np 
import pandas as pd
from cnr_methods import get_simplified_data, transform_data, metric_cnr

from sklearn.model_selection import TimeSeriesSplit
from collections import deque
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
import xgboost as xgb

## Read Data

The data used here will eventually be the output of the Feature Engineering and Selection step (to be added later).

In [2]:
# Initially using the Original Data
full_data = get_simplified_data()
full_data['Time'] = pd.to_datetime(full_data['Time'],dayfirst=True)
full_data = full_data.set_index('Time')

In [3]:
X = full_data[full_data['Set']=='Train']
y = pd.read_csv('Data/Y_train.csv')

For initial debugging, only one wind farm is considered.

In [4]:
WF = 'WF1'
X = X[X['WF']==WF]
y = y[y['ID'].isin(X['ID'])]
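The cross-validation further down slices `X` and `y` by position, so both frames need to be in the same row order. The cell above only filters `y` by `ID`; it does not reorder it. A minimal sketch of an explicit alignment, using hypothetical toy frames (`X_toy`, `y_toy`, and the feature column `U` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frames that mimic the columns used in this notebook.
X_toy = pd.DataFrame(
    {"ID": [3, 1, 2], "WF": ["WF1", "WF1", "WF1"], "U": [0.5, 0.1, 0.3]},
    index=pd.to_datetime(
        ["2018-05-01 03:00", "2018-05-01 01:00", "2018-05-01 02:00"]
    ),
)
y_toy = pd.DataFrame({"ID": [1, 2, 3, 4], "Production": [0.2, 0.4, 0.6, 0.8]})

# Sort the features chronologically, then reorder the targets to the same IDs.
X_toy = X_toy.sort_index()
y_aligned = y_toy.set_index("ID").loc[X_toy["ID"].to_numpy()].reset_index()

# Row i of X_toy and row i of y_aligned now describe the same observation.
assert (X_toy["ID"].to_numpy() == y_aligned["ID"].to_numpy()).all()
```

If the two files are already stored in the same order, this step changes nothing; the final assertion alone is then a cheap sanity check.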

## Validation Scheme

Before proceeding to the hyperparameter search, a reliable way to measure model performance is needed. A time-series split cross-validation is used for this purpose: in each iteration, the "test" fold serves as the validation data, which is also used for early stopping.
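To make the fold layout concrete, the following sketch applies `TimeSeriesSplit` to 20 hypothetical, chronologically ordered samples. The sample count and the number of splits are chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 hypothetical samples, assumed to be sorted by time.
samples = np.arange(20)
tscv = TimeSeriesSplit(n_splits=4)

folds = list(tscv.split(samples))
for i, (train_idx, val_idx) in enumerate(folds):
    print(
        f"fold {i}: train {train_idx[0]}-{train_idx[-1]}, "
        f"validation {val_idx[0]}-{val_idx[-1]}"
    )
```

Each fold trains on everything that precedes its validation block, so the training window grows from fold to fold and no future observation leaks into training.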

In [5]:
k_fold_splits = 8
num_boost_round = 10
early_stopping_rounds = 1

In [6]:
def objective(param, k_fold_splits=k_fold_splits, num_boost_round=num_boost_round,
              early_stopping_rounds=early_stopping_rounds):
    """Return the mean validation CAPE over the time-series folds for one parameter set."""
    # Define Time Split Cross Validation
    tscv = TimeSeriesSplit(n_splits=k_fold_splits)

    # Set XGBoost for GPU
    param['tree_method'] = 'gpu_hist'

    progress = dict()  # Filled by xgb.train with the per-round metrics of the current fold
    scores = []  # Validation score of each fold, averaged at the end

    for train_index, val_index in tscv.split(X):
        # Get the Data of the Split (X and y must be in the same row order)
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        dtrain = xgb.DMatrix(X_train.drop(['ID', 'WF', 'Set'], axis=1), label=y_train['Production'])
        dval = xgb.DMatrix(X_val.drop(['ID', 'WF', 'Set'], axis=1), label=y_val['Production'])

        # Train the Model; early stopping watches the last metric (CAPE) on the last watchlist entry (eval)
        watchlist = [(dtrain, 'train'), (dval, 'eval')]
        bst = xgb.train(
            param, dtrain,
            num_boost_round=num_boost_round,
            evals=watchlist,
            feval=metric_cnr,
            early_stopping_rounds=early_stopping_rounds,
            evals_result=progress,
        )
        # Score of the last boosting round that was run for this fold
        scores.append(progress['eval']['CAPE'][-1])

    return {'loss': np.mean(scores), 'params': param, 'status': STATUS_OK}
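The `CAPE` key read from `progress` is produced by `metric_cnr`, whose implementation lives in `cnr_methods` and is not shown in this notebook. As a rough sketch only, assuming CAPE denotes the cumulated absolute percentage error, 100 · Σ|ŷ − y| / Σ y, a custom metric with the `(predictions, DMatrix) -> (name, value)` shape that `feval` expects could look like this. The names `cape` and `cape_feval` are hypothetical.

```python
import numpy as np


def cape(y_true, y_pred):
    """Cumulated absolute percentage error in percent (assumed definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.abs(y_pred - y_true).sum() / y_true.sum()


def cape_feval(preds, dmatrix):
    """Adapter for xgb.train(feval=...): labels are read from the DMatrix."""
    return "CAPE", cape(dmatrix.get_label(), preds)


# Absolute errors 0.5 + 0.0 + 1.0 over a total production of 6.0.
example_score = cape([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
```

Because the errors are summed before dividing, time steps with near-zero production do not blow up the score the way a per-sample percentage error would.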

## Hyperparameter Tuning

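For the hyperparameter tuning, the HyperOpt library will be used. It implements sequential model-based optimization, such as the Tree-structured Parzen Estimator (TPE), which searches the parameter space more efficiently than a grid or purely random search.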

### Domain Space

In [7]:
space = {
    'max_depth' : hp.randint('max_depth', 15),
    'subsample' : hp.uniform('subsample', 0, 1),
    'colsample_bytree' : hp.uniform('colsample_bytree', 0, 1),
    'colsample_bylevel' : hp.uniform('colsample_bylevel', 0, 1),
    'min_child_weight' : hp.uniform('min_child_weight', 0, 10),
    'lambda' : hp.uniform('lambda', 0, 1),
    'alpha' : hp.uniform('alpha', 0, 1),
    'eta' : hp.uniform('eta', 0, 1)
}
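Two details of this space are worth keeping in mind. `hp.randint('max_depth', 15)` draws integers from 0 to 14, and XGBoost treats `max_depth = 0` as "no depth limit". The sampling fractions are documented as lying in (0, 1], so draws very close to 0 leave almost no data per tree. A minimal sketch of a hypothetical helper, `to_xgb_params`, that maps raw draws to safer values is shown below; the floor of 0.05 is an arbitrary assumption.

```python
def to_xgb_params(sampled, min_fraction=0.05):
    """Map raw sampled values to values XGBoost accepts (hypothetical helper)."""
    params = dict(sampled)
    # Shift 0..14 to 1..15 so that 0 (unlimited depth) is never passed on.
    params["max_depth"] = int(sampled["max_depth"]) + 1
    # Clamp the sampling fractions into [min_fraction, 1.0].
    for key in ("subsample", "colsample_bytree", "colsample_bylevel"):
        params[key] = min(1.0, max(min_fraction, float(sampled[key])))
    return params


# Example with a hypothetical draw from the space above.
draw = {"max_depth": 0, "subsample": 0.0, "colsample_bytree": 0.5,
        "colsample_bylevel": 1.0, "eta": 0.3}
clean = to_xgb_params(draw)
```

Such a helper would be called at the top of `objective`, before the parameters reach `xgb.train`. Alternatively, the bounds can be tightened in the space definition itself.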

### Optimization Algorithm

In [8]:
tpe_algorithm = tpe.suggest
bayes_trials = Trials()

### Bayesian Optimization

In [9]:
MAX_EVALS = 10

In [10]:
best = fmin(fn=objective, space=space, algo=tpe_algorithm, max_evals=MAX_EVALS, trials=bayes_trials, rstate=np.random.RandomState(50))

 10%|█         | 1/10 [00:02<00:07,  1.20trial/s, best loss: 94.12825087499999]
[3]	train-rmse:1.83968	eval-rmse:1.12517	train-CAPE:64.48514	eval-CAPE:82.52834
[4]	train-rmse:1.66999	eval-rmse:1.05779	train-CAPE:58.86150	eval-CAPE:79.33852
[5]	train-rmse:1.52377	eval-rmse:1.00331	train-CAPE:53.96324	eval-CAPE:76.88695
[6]	train-rmse:1.39916	eval-rmse:0.97267	train-CAPE:50.19463	eval-CAPE:76.44528
[7]	train-rmse:1.29404	eval-rmse:0.96278	train-CAPE:46.95075	eval-CAPE:76.27942
[8]	train-rmse:1.19824	eval-rmse:0.95266	train-CAPE:43.74177	eval-CAPE:75.96969
[9]	train-rmse:1.10820	eval-
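After the search, `bayes_trials.results` holds one dictionary per evaluation, in the form returned by `objective` (`loss`, `params`, `status`). The sketch below shows how the best entry could be picked from such a list. It uses plain, hypothetical dictionaries instead of a real `Trials` object, and the loss values are made up.

```python
# Hypothetical results, shaped like the dictionaries returned by objective().
# 'ok' is the value of hyperopt's STATUS_OK constant.
results = [
    {"loss": 94.1, "params": {"max_depth": 3, "eta": 0.9}, "status": "ok"},
    {"loss": 76.0, "params": {"max_depth": 7, "eta": 0.2}, "status": "ok"},
    {"loss": 81.5, "params": {"max_depth": 12, "eta": 0.5}, "status": "ok"},
]

# Keep only successful evaluations and take the one with the lowest loss.
ok_results = [r for r in results if r["status"] == "ok"]
best_result = min(ok_results, key=lambda r: r["loss"])
best_params = best_result["params"]
```

Reading the parameters from the stored result is convenient because it contains exactly what was passed to the model, including anything the objective added, such as `tree_method`.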

## Generating Predictions

In [None]:
X_train = full_data[full_data['Set']=='Train']
X_test = full_data[full_data['Set']=='Test']

Loop over the wind farms.
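This step is not implemented yet. One possible shape for the loop is sketched below on a hypothetical miniature of `full_data`: the frame `toy`, the feature column `U`, and all values are made up. A per-farm mean stands in for the model; a tuned XGBoost model per wind farm would take its place.

```python
import pandas as pd

# Hypothetical miniature of full_data with the columns used in this notebook.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "WF": ["WF1", "WF1", "WF2", "WF2", "WF1", "WF2"],
    "Set": ["Train", "Train", "Train", "Train", "Test", "Test"],
    "U": [0.1, 0.3, 0.2, 0.4, 0.2, 0.3],
})
y_toy = pd.DataFrame({"ID": [1, 2, 3, 4], "Production": [1.0, 3.0, 2.0, 4.0]})

predictions = []
for wf, wf_data in toy.groupby("WF"):
    # Attach the targets to the training rows of this wind farm by ID.
    train = wf_data[wf_data["Set"] == "Train"].merge(y_toy, on="ID")
    test = wf_data[wf_data["Set"] == "Test"]
    # Placeholder model: the mean training production of this wind farm.
    wf_pred = pd.DataFrame({
        "ID": test["ID"].to_numpy(),
        "Production": train["Production"].mean(),
    })
    predictions.append(wf_pred)

# Combine the per-farm predictions into one frame ordered by ID.
submission = (
    pd.concat(predictions, ignore_index=True)
    .sort_values("ID")
    .reset_index(drop=True)
)
```

Handling each wind farm separately matches the debugging setup above, where `X` and `y` were restricted to a single `WF` value.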