# Modelling

The objective of this notebook is to find the best-performing model for the challenge problem, using time-series cross-validation and a hyperparameter search.

## Libraries

In [50]:
import numpy as np 
import cupy as cp
import pandas as pd
from cnr_methods import get_simplified_data, transform_data, metric_cnr

from sklearn.model_selection import TimeSeriesSplit
from collections import deque
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
import xgboost as xgb

## Read Data

The data used here is meant to be the output of the Feature Engineering and Selection step (to be added later). For now, the original data is loaded.

In [51]:
# Initially using the Original Data
full_data = get_simplified_data()
full_data['Time'] = pd.to_datetime(full_data['Time'],dayfirst=True)
full_data = full_data.set_index('Time')

In [52]:
X = full_data[full_data['Set']=='Train']
y = pd.read_csv('Data/Y_train.csv')

For initial debugging, only one wind farm is considered.

In [53]:
WF = 'WF1'
X = X[X['WF']==WF]
y = y[y['ID'].isin(X['ID'])]

## Validation Scheme

Before starting the hyperparameter search, a reliable way to measure model performance is needed. A time-series split cross-validation is used for this purpose: in each iteration, the "test" fold serves as the validation data, on which early stopping is performed.
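
To make the fold layout concrete, here is a minimal pure-Python sketch of an expanding-window split. The helper `expanding_window_splits` is hypothetical and written only for illustration; it is assumed to mirror the default behaviour of scikit-learn's `TimeSeriesSplit` (no `gap`, default `test_size`), which is what the notebook actually uses.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, val_indices) pairs in chronological order.

    Hypothetical helper for illustration: each validation fold has the same
    size, and the training window grows to include every earlier row.
    """
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("Too many splits for the number of samples")
    # The validation folds cover the last n_splits * fold_size rows
    first_val_start = n_samples - n_splits * fold_size
    for val_start in range(first_val_start, n_samples, fold_size):
        train_indices = list(range(val_start))
        val_indices = list(range(val_start, val_start + fold_size))
        yield train_indices, val_indices


folds = list(expanding_window_splits(n_samples=10, n_splits=4))
for train_indices, val_indices in folds:
    print("train:", train_indices, "validation:", val_indices)
```

Every validation fold lies strictly after its training rows, so the model is never validated on data older than the data it was trained on.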

In [54]:
k_fold_splits = 8
num_boost_round = 100
early_stopping_rounds = 10
numerical_features = X.drop(['ID','WF','Set'],axis=1).columns

In [55]:
def gpu_df(df,y,numerical_features):
    # Copy the selected feature columns to the GPU, then wrap them in a DMatrix
    gpu_matrix = cp.asarray(df[numerical_features])
    gpu_matrix = xgb.DMatrix(gpu_matrix,label=y)
    return gpu_matrix

In [56]:
def objective(param,
              numerical_features=numerical_features,
              k_fold_splits=k_fold_splits,
              num_boost_round=num_boost_round,
              early_stopping_rounds=early_stopping_rounds):
    # Define Time Split Cross Validation
    tscv = TimeSeriesSplit(n_splits=k_fold_splits)

    # Separating a Holdout Set (the most recent 1/8 of the rows)
    n_holdout = round(len(X)/8)
    X_holdout = X[-n_holdout:]
    y_holdout = y[-n_holdout:]
    dhold = gpu_df(X_holdout,y_holdout['Production'],numerical_features)

    X_cv = X[:-n_holdout]
    y_cv = y[:-n_holdout]

    # Set XGBoost for GPU
    param['tree_method'] = 'gpu_hist'

    # Booster carried over between folds; None on the first fold trains from scratch
    bst = None
    for train_index, val_index in tscv.split(X_cv):
        # Get the Data of the Split
        X_train, X_val = X_cv.iloc[train_index], X_cv.iloc[val_index]
        y_train, y_val = y_cv.iloc[train_index], y_cv.iloc[val_index]
        dtrain = gpu_df(X_train,y_train['Production'],numerical_features)
        dval = gpu_df(X_val,y_val['Production'],numerical_features)

        # Train the Model, continuing from the previous fold's booster (if any)
        watchlist = [(dtrain,'train'),(dval,'eval')]
        bst = xgb.train(param, dtrain, num_boost_round=num_boost_round, evals=watchlist,
                        feval=metric_cnr, early_stopping_rounds=early_stopping_rounds,
                        verbose_eval=False, xgb_model=bst)

    # Score the final booster on the holdout set; score[1] is the metric value
    preds = bst.predict(dhold,ntree_limit=bst.best_ntree_limit)
    score = metric_cnr(preds,dhold)
    return {'loss' : score[1], 'params' : param, 'status' : STATUS_OK}
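
The early stopping used inside `objective` can be illustrated with a simplified sketch. The function `best_round_with_patience` and the score list below are made up for illustration. This is not XGBoost's internal implementation, only the general idea of stopping once the validation metric has failed to improve for `early_stopping_rounds` consecutive rounds (a lower score is assumed to be better).

```python
def best_round_with_patience(val_scores, patience):
    """Return (best_round, best_score) under a simple early-stopping rule.

    Hypothetical helper: scans per-round validation scores and stops as soon
    as `patience` consecutive rounds have passed without a new best score.
    """
    best_score = float("inf")
    best_round = -1
    for round_idx, score in enumerate(val_scores):
        if score < best_score:
            # New best: remember it and reset the patience window
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop training
            break
    return best_round, best_score


# Made-up validation scores; the final 2.0 is never reached
scores = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 2.0]
best_round, best_score = best_round_with_patience(scores, patience=3)
print(best_round, best_score)
```

The round with the best validation score, not the last round trained, is the one kept for prediction, which is the role `bst.best_ntree_limit` plays in the cell above.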

## Hyperparameter Tuning

Hyperparameter tuning uses the HyperOpt library, which implements search strategies, such as the Tree-structured Parzen Estimator (TPE) used below, that aim to explore the parameter space more efficiently than an exhaustive search.
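
The contract between the search loop and the objective can be sketched without any dependencies. The example below swaps TPE for plain random sampling, so it is a stand-in rather than what HyperOpt does; `toy_objective`, `sample_params`, `random_search` and the made-up loss are all hypothetical. What it shares with the real setup is the shape of the objective's return value (a dict with `loss`, `params` and `status`) and the idea of keeping the trial with the lowest loss.

```python
import random

STATUS_OK = "ok"  # stand-in for hyperopt.STATUS_OK so the sketch has no dependencies


def toy_objective(params):
    # Made-up loss with its minimum at eta=0.3, subsample=0.8 (not a real model score)
    loss = (params["eta"] - 0.3) ** 2 + (params["subsample"] - 0.8) ** 2
    return {"loss": loss, "params": params, "status": STATUS_OK}


def sample_params(rng):
    # Draw each parameter uniformly from [0, 1)
    return {"eta": rng.random(), "subsample": rng.random()}


def random_search(objective, sampler, max_evals, seed):
    # Evaluate max_evals random candidates and keep the one with the lowest loss
    rng = random.Random(seed)
    trials = [objective(sampler(rng)) for _ in range(max_evals)]
    best_trial = min(trials, key=lambda trial: trial["loss"])
    return best_trial, trials


best_trial, trials = random_search(toy_objective, sample_params, max_evals=10, seed=50)
print(best_trial["params"], best_trial["loss"])
```

Roughly speaking, TPE replaces the uniform sampler with one that favours regions where earlier trials scored well, while the bookkeeping around it stays the same.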

### Domain Space

In [57]:
space = {
    'max_depth' : hp.randint('max_depth', 15),
    'subsample' : hp.uniform('subsample', 0, 1),
    'colsample_bytree' : hp.uniform('colsample_bytree', 0, 1),
    'colsample_bylevel' : hp.uniform('colsample_bylevel', 0, 1),
    'min_child_weight' : hp.uniform('min_child_weight', 0, 10),
    'lambda' : hp.uniform('lambda', 0, 1),
    'alpha' : hp.uniform('alpha', 0, 1),
    'eta' : hp.uniform('eta', 0, 1)
}

### Optimization Algorithm

In [58]:
tpe_algorithm = tpe.suggest
bayes_trials = Trials()

### Bayesian Optimization

In [59]:
MAX_EVALS = 10

In [60]:
best = fmin(fn=objective, space=space, algo=tpe_algorithm, max_evals=MAX_EVALS, trials=bayes_trials, rstate=np.random.RandomState(50))

100%|██████████| 10/10 [00:43<00:00,  4.34s/trial, best loss: 41.09568128893979]


## Generating Predictions

In [61]:
X_train = full_data[full_data['Set']=='Train']
X_test = full_data[full_data['Set']=='Test']

Loop over the wind farms.