# Model Training

After the data has been pre-processed and split into train and test sets, the train set can be used to train the model. In this step the algorithm learns a mapping from the features (the independent variables) to the output (the dependent variable). The cuML module provides many algorithms for classification problems.

## Random Forest Algorithm

Random Forest is an ensemble algorithm that consists of a group of decision trees. For classification, the predictions of the individual trees are combined, by majority vote or by averaging their predicted class probabilities, to give the prediction of the forest. I will be using the random forest classifier from the cuML module. I chose Random Forest because it is one of the strongest traditional machine learning algorithms, and on tabular data it can often match the results of neural networks. Adding more trees usually stabilises the predictions at the cost of more compute, while very deep trees are what tend to overfit. Usually, data scientists train multiple models and choose the best, so I will try multiple algorithms in this notebook.
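
To make the ensemble idea concrete, here is a minimal sketch of combining the outputs of several trees by majority vote. The three "trees" and their predictions are made up for illustration, and this is not how cuML implements it internally.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree class predictions into one prediction per sample.

    tree_predictions is a list with one inner list per tree; each inner
    list holds that tree's predicted class for every sample.
    """
    n_samples = len(tree_predictions[0])
    combined = []
    for i in range(n_samples):
        # Count how many trees voted for each class on sample i
        votes = Counter(tree[i] for tree in tree_predictions)
        # most_common(1) returns [(label, count)] for the top label
        combined.append(votes.most_common(1)[0][0])
    return combined


# Three hypothetical trees voting on four samples
trees = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
forest_prediction = majority_vote(trees)
print(forest_prediction)
```

With an odd number of trees and two classes there are no ties, which keeps the sketch simple.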

In [140]:
from cuml import RandomForestClassifier as cuRF

The hyper-parameters will be set and the random forest object will be created. The hyper-parameters and their default values are described below. The defaults were chosen by the developers after a lot of research and experimentation, so I will use them for the first experiment and tune them in further experiments.
* n_estimators: Number of trees in the forest, default is 100. I will go ahead with 100. Increasing this value makes training and prediction more resource hungry. The accuracy may increase, but the gains flatten out as more trees are added.
* max_depth: Maximum depth of each tree, default is 16. In sklearn the default is unlimited, but our data is huge and unlimited depth would make training very slow. So, I'll keep it at 16.
* n_bins: Number of bins used in split point calculation, default is 128. Most of the features are approximately normally distributed, so I will not increase it and will keep it at 128.
* n_streams: Number of CUDA streams used for parallel processing on the GPU, default is 4. I am using a GPU, so this will be fine.
* max_samples: Fraction of the input rows sampled for each tree, default is 1. Using the whole data for each tree might overfit but should improve accuracy. I will keep it at 1.
* split_criterion: Split criterion, default is 0 for Gini impurity. Gini and entropy can both be used for classification. Previous experiments have shown Gini to perform better.
* random_state: Seed used for the random number generator.
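
The two split criteria mentioned for split_criterion can be sketched in a few lines. This is a plain-Python illustration of the formulas, with helper names chosen here for the example; it is not cuML code.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


pure_node = [1, 1, 1, 1]   # only one class present
mixed_node = [0, 0, 1, 1]  # evenly split between two classes

print(gini_impurity(pure_node), gini_impurity(mixed_node))
print(entropy(pure_node), entropy(mixed_node))
```

Both measures are lowest for a pure node and highest for an evenly mixed one, so a tree prefers splits that reduce them the most.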

In [142]:
# cuml Random Forest params
cu_rf_params = {
    'n_estimators': 100,
    'max_depth': 16,
    'n_bins': 128,
    'n_streams': 4,
    'max_samples': 1,
    'split_criterion': 0,
    'random_state': 123
}
cu_rf = cuRF(**cu_rf_params)

Now, I will train the Random Forest classifier with the training dataset.

In [143]:
%%time
cu_rf.fit(X_train_scaled, y_train)

CPU times: user 1min 36s, sys: 288 ms, total: 1min 36s
Wall time: 26.5 s


RandomForestClassifier()

The predict method will now be run to get predictions on the test set.

In [145]:
%%time
# using the predict method on test set
y_pred = cu_rf.predict(X_test_scaled)

CPU times: user 3min 50s, sys: 231 ms, total: 3min 51s
Wall time: 3min 50s


Let's use the accuracy_score function to find the accuracy of the model.

In [146]:
print('Accuracy score: ', accuracy_score(y_test, y_pred))

Accuracy score:  0.7332103252410889


This is not a bad score. I will do some hyperparameter optimizations in the next section and see if there is any improvement.
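
For reference, the accuracy score is simply the fraction of test samples whose predicted class matches the true class. A minimal sketch on made-up labels (the function and variable names here are illustrative, not the library implementation):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


# Made-up labels: three of the five predictions are correct
y_true_demo = [1, 0, 1, 1, 0]
y_pred_demo = [1, 0, 0, 1, 1]
demo_accuracy = accuracy(y_true_demo, y_pred_demo)
print(demo_accuracy)
```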

### Hyper-parameter Optimisation

I will try different hyper-parameter settings to see if they give better results.

In [69]:
# cuml Random Forest params
cu_rf_params_2 = {
    'n_estimators': 500, # increase no. of trees to 500
    'max_depth': 10, # change to 10
    'n_bins': 180, # change to 180 as for bigger datasets increasing this value increases accuracy
    'n_streams': 4, # CUDA stream to use for parallel processing on GPU, default is 4
    'max_samples': 1, # Percentage of input data to be considered for each tree, default is 1
    'split_criterion': 0, # Split algorithm, default is 0 for gini impurity
    'random_state': 1234 # Seed used for Random Number Generator
}
cu_rf_2 = cuRF(**cu_rf_params_2)

In [None]:
%%time
cu_rf_2.fit(X_train_scaled, y_train)

CPU times: user 6min 15s, sys: 1.06 s, total: 6min 16s
Wall time: 1min 44s


RandomForestClassifier()

In [71]:
%%time
# using the predict method on test set
y_pred_2 = cu_rf_2.predict(X_test_scaled)

CPU times: user 11min 33s, sys: 564 ms, total: 11min 33s
Wall time: 11min 32s


In [72]:
print('Accuracy score: ', accuracy_score(y_test, y_pred_2))

Accuracy score:  0.71075439453125


The accuracy has dropped with this combination of parameters. Let's try one more combination. I will change max_depth back to the default of 16 and n_bins back to 128, but keep n_estimators at 500.

In [73]:
%%time
# cuml Random Forest params
cu_rf_params_3 = {
    'n_estimators': 500, # increase no. of trees to 500
    'max_depth': 16, # change to 16
    'n_bins': 128, # change to 128
    'n_streams': 4, # CUDA stream to use for parallel processing on GPU, default is 4
    'max_samples': 1, # Percentage of input data to be considered for each tree, default is 1
    'split_criterion': 0, # Split algorithm, default is 0 for gini impurity
    'random_state': 12345 # Seed used for Random Number Generator
}
cu_rf_3 = cuRF(**cu_rf_params_3)
cu_rf_3.fit(X_train_scaled, y_train)

CPU times: user 8min 2s, sys: 1.51 s, total: 8min 3s
Wall time: 2min 13s


RandomForestClassifier()

In [74]:
%%time
y_pred_3 = cu_rf_3.predict(X_test_scaled)

CPU times: user 22min 20s, sys: 1.04 s, total: 22min 21s
Wall time: 22min 19s


In [75]:
print('Accuracy score: ', accuracy_score(y_test, y_pred_3))

Accuracy score:  0.7337703108787537


The accuracy is almost the same as in the first experiment.
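
The manual trials above could also be organised as a small search loop. The sketch below is only an outline: `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a stand-in that returns a made-up number. In the notebook the scoring function would fit a cuRF model with the given parameters and return its accuracy, ideally on a separate validation split rather than the test set.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best one.

    param_grid maps a parameter name to a list of candidate values.
    score_fn takes a dict of parameters and returns a score (higher is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in scorer; replace with real training and evaluation."""
    depth_penalty = abs(params["max_depth"] - 16)
    trees_penalty = abs(params["n_estimators"] - 100) / 100
    return -depth_penalty - trees_penalty


grid = {"n_estimators": [100, 500], "max_depth": [10, 16]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```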

### Train Random Forest with PCA components

Now let's train the random forest algorithm with the PCA components and see the results. I will be using the same parameters as I used for the first experiment.

In [76]:
# initialise RF object
cu_rf_pca = cuRF(**cu_rf_params)

Fit the RF object on the PCA components.

In [77]:
%%time
cu_rf_pca.fit(components, y_train)

CPU times: user 1min 8s, sys: 240 ms, total: 1min 8s
Wall time: 19 s


RandomForestClassifier()

Training took less time because there are fewer features.

The test set is transformed with the same PCA so that the model can predict on it.

In [147]:
components_test = pca.transform(X_test_scaled)
components_test.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 2.103734 | 1.478275 | -1.150761 |
| 1 | -0.613508 | 0.66898 | -0.019162 |
| 2 | -0.155646 | 0.298515 | -1.125776 |
| 3 | 1.151846 | 0.549032 | 0.19088 |
| 4 | 1.414101 | -2.323358 | 1.706555 |


The predict method will be run on the test set generated by PCA

In [79]:
%%time
# using the predict method on test set
y_pred_pca = cu_rf_pca.predict(components_test)

CPU times: user 3min 10s, sys: 176 ms, total: 3min 11s
Wall time: 3min 10s


In [80]:
print('Accuracy score: ', accuracy_score(y_test, y_pred_pca))

Accuracy score:  0.581381618976593


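The accuracy has decreased with the PCA components. Only three components are kept here, so much of the information in the original features is likely lost. Random Forest does better with the full set of features, and it parallelises well enough on the GPU that training on all of them is still fast.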

## XGBoost

XGBoost stands for “Extreme Gradient Boosting”. It is also a powerful supervised machine learning algorithm. In classical machine learning, Random Forest and XGBoost are among the most powerful algorithms and give state-of-the-art results on tabular data, which can even be compared to neural networks. XGBoost can be used with RAPIDS for training on GPU. It also parallelizes well and can train on huge datasets efficiently.

In [74]:
import xgboost as xgb

### Convert cuDF data to DMatrix format

The loaded data is a cuDF dataframe. It should be converted to a DMatrix object that XGBoost can work with. We can instantiate an xgboost.DMatrix object by passing in the feature matrix as the first argument followed by the label vector using the label= keyword argument.

In [77]:
%%time
dtrain = xgb.DMatrix(X_train_scaled, label=y_train)
dvalidation = xgb.DMatrix(X_test_scaled, label=y_test)

CPU times: user 74.1 ms, sys: 36.5 ms, total: 111 ms
Wall time: 110 ms


The parameters will be set in the cell below. There is a huge list of parameters that can be tweaked to improve the performance of the model. I will be altering the most important ones that are relevant to our experiment.
* silent: Controls the verbosity of printed messages. I set it to 1, although the warning in the training output below shows that this version of XGBoost may not use it.
* tree_method: The tree construction algorithm. There are many possible values, but the two most important for our purpose are hist and gpu_hist. They perform better on larger datasets, and gpu_hist is the GPU implementation of hist. Since I am training on GPU, I will set it to gpu_hist.
* n_gpus: Number of GPUs to use. Change this to -1 to use all available GPUs or 0 to use the CPU. The same warning flags this parameter as possibly unused in this version.
* eval_metric: The training method evaluates the model at each round. I will be using the area under the curve (AUC) score to evaluate the model.
* objective: Since ours is a binary classification problem, I will set it to binary:logistic to output the class probabilities.
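
With the binary:logistic objective, the raw score summed over the trees is passed through the logistic (sigmoid) function to give a probability for the positive class. A small sketch of that mapping, with made-up raw scores:

```python
import math


def sigmoid(margin):
    """Map a raw score (margin) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-margin))


# Made-up raw scores: negative, zero and positive
for margin in (-2.0, 0.0, 2.0):
    print(margin, sigmoid(margin))
```

A raw score of 0 corresponds to a probability of exactly 0.5, which is why 0.5 is the natural cut-off when converting probabilities to classes.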

In [118]:
# instantiate params
params = {}

# general params
general_params = {'silent': 1}
params.update(general_params)

# booster params
n_gpus = 1  # change this to -1 to use all GPUs available or 0 to use the CPU
booster_params = {}

if n_gpus != 0:
    booster_params['tree_method'] = 'gpu_hist'
    booster_params['n_gpus'] = n_gpus   
params.update(booster_params)

# learning task params
learning_task_params = {}
learning_task_params['eval_metric'] = 'auc'
learning_task_params['objective'] = 'binary:logistic'
    
params.update(learning_task_params)
print(params)

{'silent': 1, 'tree_method': 'gpu_hist', 'n_gpus': 1, 'eval_metric': 'auc', 'objective': 'binary:logistic'}


The evaluation list is set with the train and validation data. num_round is kept at 100, which means 100 boosting rounds.

In [119]:
# model training settings
evallist = [(dvalidation, 'validation'), (dtrain, 'train')]
num_round = 100

In [120]:
%%time

bst = xgb.train(params, dtrain, num_round, evallist)

Parameters: { "n_gpus", "silent" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


[0]	validation-auc:0.74346	train-auc:0.74303
[1]	validation-auc:0.75482	train-auc:0.75451
[2]	validation-auc:0.76458	train-auc:0.76433
[3]	validation-auc:0.77001	train-auc:0.76988
[4]	validation-auc:0.77490	train-auc:0.77481
[5]	validation-auc:0.78058	train-auc:0.78058
[6]	validation-auc:0.78486	train-auc:0.78496
[7]	validation-auc:0.78731	train-auc:0.78747
[8]	validation-auc:0.78977	train-auc:0.78991
[9]	validation-auc:0.79189	train-auc:0.79205
[10]	validation-auc:0.79357	train-auc:0.79377
[11]	validation-auc:0.79512	train-auc:0.79533
[12]	validation-auc:0.79724	train-auc:0.79750
[13]	validation-auc:0.79834	train-auc:0.79860
[14]	validation-auc:0.79956	train-auc:0.799

In [121]:
bst_pred = bst.predict(dvalidation)

By default, the predictions made by XGBoost are probabilities. To get the accuracy score, I will convert them to binary class values by rounding them to 0 or 1.

In [122]:
bst_predictions = [round(value) for value in bst_pred]
bst_predictions = np.array(bst_predictions)
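
Rounding is equivalent to applying a 0.5 threshold, with one subtlety: Python's built-in round sends a plain float of exactly 0.5 to 0. An explicit threshold makes the rule visible and easy to change. A minimal sketch on made-up probabilities (the helper name is an illustration, not part of XGBoost):

```python
def to_class_labels(probabilities, threshold=0.5):
    """Return 1 where the probability reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up predicted probabilities
demo_probs = [0.12, 0.5, 0.73, 0.49]
demo_class_labels = to_class_labels(demo_probs)
print(demo_class_labels)

# Built-in round uses round-half-to-even, so exactly 0.5 becomes 0
print(round(0.5))
```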

The accuracy, printed below, is about 1 percentage point higher than with Random Forest (0.742 against 0.733). XGBoost has performed better than Random Forest on this data with this set of parameters.

In [123]:
print('Accuracy score: ', accuracy_score(y_test, bst_predictions))

Accuracy score:  0.742078959941864


One thing to note is that XGBoost has no built-in eval_metric called accuracy. The closest one is error, the classification error rate (1 - accuracy at a 0.5 threshold), and I used ROC AUC instead. The difference between the two is that accuracy is calculated on the predicted classes, while ROC AUC is calculated on the predicted scores, so it does not depend on the choice of threshold. Let's check the ROC AUC score of the model below.

In [124]:
print('ROC AUC score: ', roc_auc_score(y_test, bst_pred))

ROC AUC score:  0.8233452439308167
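
One way to read this number: ROC AUC is the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on made-up data. It compares every positive with every negative, so it is only meant for small examples and is not how roc_auc_score is implemented.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores: three of the four pairs are ranked correctly
demo_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(demo_auc)
```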


An ROC AUC of about 0.82 is a good result. I will discuss the results further in the evaluation notebook. Let's tune the hyperparameters and perform another experiment.

Most of the parameters are the same, with some additional changes.
* max_depth: Default value is 6, I will change it to 7. Larger values make the model more complex and more likely to overfit, which is why I increase it only to 7.
* reg_lambda: L2 regularization term on the weights, default is 1. Increasing this value makes the model more conservative. I will update it to 2.
* scale_pos_weight: This is useful for unbalanced classes. It controls the balance between positive and negative weights. Default is 1, I will update it to 2.
* gamma: Minimum loss reduction required to make a further split, default is 0. Larger values make the model more conservative. I will update it to 1.
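
A common starting point for scale_pos_weight is the ratio of negative to positive samples in the training labels, rather than a fixed guess. A minimal sketch on made-up labels (the helper name is hypothetical; in the notebook the counts would come from y_train):

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive samples, a common starting value."""
    n_positive = sum(1 for y in labels if y == 1)
    n_negative = sum(1 for y in labels if y == 0)
    if n_positive == 0:
        raise ValueError("no positive samples in labels")
    return n_negative / n_positive


# Made-up labels: six negatives and two positives
demo_train_labels = [0, 0, 0, 0, 0, 0, 1, 1]
demo_weight = suggested_scale_pos_weight(demo_train_labels)
print(demo_weight)
```

If the classes are roughly balanced this ratio is close to 1, in which case raising scale_pos_weight pushes predictions towards the positive class and can lower accuracy.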

In [131]:
params_2 = {
    'silent': 1, 'tree_method': 'gpu_hist', 'n_gpus': 1, 'eval_metric': 'auc', 
    'objective': 'binary:logistic', 'max_depth': 7, 'reg_lambda': 2, 'scale_pos_weight': 2, 
    'gamma': 1
}

In [132]:
%%time

bst_2 = xgb.train(params_2, dtrain, num_round, evallist)

Parameters: { "n_gpus", "silent" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


[0]	validation-auc:0.75312	train-auc:0.75312
[1]	validation-auc:0.75983	train-auc:0.75979
[2]	validation-auc:0.76449	train-auc:0.76444
[3]	validation-auc:0.76761	train-auc:0.76758
[4]	validation-auc:0.77079	train-auc:0.77077
[5]	validation-auc:0.77253	train-auc:0.77252
[6]	validation-auc:0.77430	train-auc:0.77431
[7]	validation-auc:0.77615	train-auc:0.77620
[8]	validation-auc:0.77786	train-auc:0.77790
[9]	validation-auc:0.77969	train-auc:0.77975
[10]	validation-auc:0.78109	train-auc:0.78117
[11]	validation-auc:0.78244	train-auc:0.78252
[12]	validation-auc:0.78397	train-auc:0.78405
[13]	validation-auc:0.78502	train-auc:0.78513
[14]	validation-auc:0.78667	train-auc:0.786

In [133]:
bst_pred_2 = bst_2.predict(dvalidation)

In [134]:
bst_predictions_2 = [round(value) for value in bst_pred_2]
bst_predictions_2 = np.array(bst_predictions_2)

In [135]:
print('Accuracy score: ', accuracy_score(y_test, bst_predictions_2))

Accuracy score:  0.7026330232620239


In [136]:
print('ROC AUC score: ', roc_auc_score(y_test, bst_pred_2))

ROC AUC score:  0.8179947137832642


Both the accuracy and the ROC AUC score have dropped with these changes. It is a complex dataset and more feature engineering is required to improve the performance.

### XGBoost with PCA components

Now, I will use the PCA components with XGBoost and check the performance.

In [148]:
dtrain_pca = xgb.DMatrix(components, label=y_train)
dvalidation_pca = xgb.DMatrix(components_test, label=y_test)

In [150]:
# model training settings
evallist_pca = [(dvalidation_pca, 'validation'), (dtrain_pca, 'train')]

In [151]:
%%time

bst_pca = xgb.train(params, dtrain_pca, num_round, evallist_pca)

Parameters: { "n_gpus", "silent" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


[0]	validation-auc:0.59165	train-auc:0.59142
[1]	validation-auc:0.59369	train-auc:0.59349
[2]	validation-auc:0.59533	train-auc:0.59512
[3]	validation-auc:0.59667	train-auc:0.59643
[4]	validation-auc:0.59732	train-auc:0.59708
[5]	validation-auc:0.59753	train-auc:0.59732
[6]	validation-auc:0.59781	train-auc:0.59764
[7]	validation-auc:0.59807	train-auc:0.59801
[8]	validation-auc:0.59817	train-auc:0.59813
[9]	validation-auc:0.59844	train-auc:0.59843
[10]	validation-auc:0.59852	train-auc:0.59852
[11]	validation-auc:0.59855	train-auc:0.59857
[12]	validation-auc:0.59861	train-auc:0.59866
[13]	validation-auc:0.59868	train-auc:0.59875
[14]	validation-auc:0.59872	train-auc:0.598

In [152]:
bst_pred_pca = bst_pca.predict(dvalidation_pca)

In [153]:
print('ROC AUC score: ', roc_auc_score(y_test, bst_pred_pca))

ROC AUC score:  0.5987499952316284


XGBoost with PCA components did not perform well. Both XGBoost and Random Forest behave the same way here and perform better with the full, higher-dimensional data.