# Pre-processing the data again 

We will do the prepprocessing of the data as done on the previous part.

In [4]:
# Supress unnecessary warnings so that presentation looks clean
import warnings
warnings.filterwarnings('ignore')

#importing the  necessary modules
import pandas                                      #to read and manipulate data
import zipfile                                     #to extract data
import numpy as np                                 #for matrix operations
#read will be imported as and when required
#read the train and test zip file
zip_ref = zipfile.ZipFile("train.csv.zip", 'r')    
zip_ref.extractall()                               
zip_ref.close()

train_data = pandas.read_csv("train.csv")

import copy
test_data = copy.deepcopy(train_data.iloc[150000:])
train_data = train_data.iloc[:150000]

y_true = test_data['loss']

ids = test_data['id']

target = train_data['loss']

#drop the unnecessary column id and loss from both train and test set.
train_data.drop(['id','loss'],1,inplace=True)
test_data.drop(['id','loss'],1,inplace=True)

shift = 200
target = np.log(target+shift)

#merging both the datasets to make single joined dataset
joined = pandas.concat([train_data, test_data],ignore_index = True)
del train_data,test_data                                         #deleting previous one to save memory.

cat_feature = [n for n in joined.columns if n.startswith('cat')]  #list of all the features containing categorical values

#factorizing them
for column in cat_feature:
    joined[column] = pandas.factorize(joined[column].values, sort=True)[0]
        
del cat_feature

#dividing the training data between training and testing set
train_data = joined.iloc[:150000,:]
test_data = joined.iloc[150000:,:]

In [8]:
#inporting additional files
from sklearn.metrics import mean_absolute_error
from bayes_opt import BayesianOptimization
import xgboost as xgb



In [9]:
#making the function to be used for different values of hyper-parameter
def xgb_evaluate(min_child_weight,colsample_bytree,max_depth,subsample,gamma,alpha):
    params['min_child_weight'] = int(min_child_weight)
    params['cosample_bytree'] = max(min(colsample_bytree, 1), 0)
    params['max_depth'] = int(max_depth)
    params['subsample'] = max(min(subsample, 1), 0)
    params['gamma'] = max(gamma, 0)
    params['alpha'] = max(alpha, 0)

    cv_result = xgb.cv(params, xgtrain, num_boost_round=num_rounds, nfold=5,seed=random_state,
             callbacks=[xgb.callback.early_stop(50)])
    
    #returning negative of cv result since, xgb.cv maximizes the score and we need to minimize the error
    return -cv_result['test-mae-mean'].values[-1]              
    

In [10]:
#making xgb matrix to be used as input
xgtrain = xgb.DMatrix(train_data, label=target)

In [16]:

#although after running a couple of iterations , it was seen that num_rounds crossed even 2000, So 
#to not take a risk, we set it to 3000
num_rounds = 3000
#random seed value to make the results replicable
random_state = 2016
num_iter = 25
init_points = 5
params = {'eta': 0.1,'silent': 1,'eval_metric': 'mae','verbose_eval': 2,'seed': random_state}

xgbBO = BayesianOptimization(xgb_evaluate, {'min_child_weight': (1, 20),
                                                'colsample_bytree': (0.1, 1),
                                                'max_depth': (5, 15),
                                                'subsample': (0.5, 1),
                                                'gamma': (0, 10),
                                                'alpha': (0, 10),
                                                })

#using bayesian optimization to maximize the passed score like accuracy
xgbBO.maximize(init_points=init_points, n_iter=num_iter)


"\n#although after running a couple of iterations , it was seen that num_rounds never crossed even 1000, but \n#to not take a risk, we set it to 3000\nnum_rounds = 3000\n#random seed value to make the results replicable\nrandom_state = 2016\nnum_iter = 25\ninit_points = 5\nparams = {'eta': 0.1,'silent': 1,'eval_metric': 'mae','verbose_eval': 2,'seed': random_state}\n\nxgbBO = BayesianOptimization(xgb_evaluate, {'min_child_weight': (1, 20),\n                                                'colsample_bytree': (0.1, 1),\n                                                'max_depth': (5, 15),\n                                                'subsample': (0.5, 1),\n                                                'gamma': (0, 10),\n                                                'alpha': (0, 10),\n                                                })\n\n#using bayesian optimization to maximize the passed score like accuracy\nxgbBO.maximize(init_points=init_points, n_iter=num_iter)\n"

## Running the task

Since, it is a computationally heavy task to run model so many times, i couldn't do it on my laptop even after running for hours, therefore, i used google cloud to run this code as a script. Even on a 8 core 50 gb ram instance of google cloud, you can get an idea the time it was taking to complete, I'm posting the results below.

Initialization
---------------------------------------------------------------------------------------------------------------------------
 Step |   Time |      Value |     alpha |   colsample_bytree |     gamma |   max_depth |   min_child_weight |   subsample | 
Multiple eval metrics have been passed: 'test-mae' will be used for early stopping.

Will train until test-mae hasn't improved in 50 rounds.
Stopping. Best iteration:
[995]   train-mae:0.353615+0.000552759  test-mae:0.37291+0.00241128

    1 | 38m51s |   -0.37291 |    4.4158 |             0.9119 |    2.4020 |     14.5118 |            10.6760 |      0.5503 | 
Multiple eval metrics have been passed: 'test-mae' will be used for early stopping.

Will train until test-mae hasn't improved in 50 rounds.
Stopping. Best iteration:
[566]   train-mae:0.379244+0.000735928  test-mae:0.38179+0.00192141

    2 | 10m22s |   -0.38179 |    7.4460 |             0.9433 |    9.5554 |      7.4645 |            18.5685 |      0.6715 | 
Multiple eval metrics have been passed: 'test-mae' will be used for early stopping.

Will train until test-mae hasn't improved in 50 rounds.
Stopping. Best iteration:
[1669]  train-mae:0.374388+0.00059465   test-mae:0.378342+0.0019866

    3 | 46m57s |   -0.37834 |    8.0450 |             0.5499 |    7.9558 |     12.3267 |            19.2943 |      0.9618 | 
Multiple eval metrics have been passed: 'test-mae' will be used for early stopping.

Will train until test-mae hasn't improved in 50 rounds.
Stopping. Best iteration:
[2728]  train-mae:0.371039+0.000594925  test-mae:0.376165+0.0022578

    4 | 31m21s |   -0.37616 |    9.9794 |             0.8270 |    3.5062 |      5.0242 |             7.3661 |      0.7791 | 
Multiple eval metrics have been passed: 'test-mae' will be used for early stopping.

Will train until test-mae hasn't improved in 50 rounds.
Stopping. Best iteration:
[1003]  train-mae:0.361415+0.000529711  test-mae:0.37304+0.00236389

    5 | 21m30s |   -0.37304 |    6.8519 |             0.8878 |    2.4687 |     10.0548 |             7.1863 |      0.9971 | 

it took quite some time as you can see from time column and after that the SSH connection was broken. Since, it did give an idea of what paratmeters to try like max_depth should be in range 10-15 and alpha less that 5, col_sample bytree around 0.8 and gamma around 2,min_child weight 10 or less,subsample, not so much so have to try that between 0.5 to 1
Next, we would use grid search to find exact parameters.

NOTE- this code is taken from Vladimir Iglovikov github repository. Link - https://github.com/fmfn/BayesianOptimization/blob/master/examples/xgboost_example.py

Minor modifications were also tried but this provides the best results.