# Classification with Cross-Validation (CV) using the xgboost algorithm

The first problem to tackle is defining the metric we want to maximize (or minimize, if it is a loss); we will use it to select the best model(s) and hyperparameters. To understand the mechanism, we will first code the validation loop by hand, and then use the sklearn functions that do it automatically.

Although we are only interested in the metric on the validation set, it is useful to compare it with the metric on the training set: a training score much higher than the validation score is a sign of overfitting.
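To make the metric concrete, here is a minimal sketch of the binary F1 score (the harmonic mean of precision and recall) and of the train/validation comparison described above. The function name `f1_binary` and the toy labels are illustrative assumptions, not part of the notebook; in practice `sklearn.metrics.f1_score` does the same job.

```python
def f1_binary(y_true, y_pred):
    """F1 score for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are 0 (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, for illustration only
y_train_true = [1, 1, 0, 0, 0, 1]
y_train_pred = [1, 1, 0, 0, 0, 1]   # perfect on the training data
y_val_true = [1, 1, 0, 0]
y_val_pred = [1, 0, 0, 0]           # misses one positive on validation

f1_train = f1_binary(y_train_true, y_train_pred)
f1_val = f1_binary(y_val_true, y_val_pred)
gap = f1_train - f1_val  # a large positive gap hints at overfitting
print(f"F1 train: {f1_train:.2f} | F1 val: {f1_val:.2f} | gap: {gap:.2f}")
```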

In [1]:
%load_ext autoreload
%autoreload 2

In [60]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os
import scipy
import sklearn
import xgboost as xgb
from xgboost_utils import F1_eval
# plt.style.use('fivethirtyeight')
sns.set_style("whitegrid")
sns.set_context("notebook")
DATA_PATH = '../data/'

VAL_SPLITS = 4

In [61]:
from plot_utils import plot_confusion_matrix
from cv_utils import run_cv_f1
from cv_utils import plot_cv_roc
from cv_utils import plot_cv_roc_prc
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.metrics import f1_score  # used in the hand-written CV loops below

For this part of the project, we will work only with the training set, which we will split again into train and validation subsets to perform the hyperparameter tuning.

We will save the test set for the final part, once the hyperparameters have been tuned.

In [62]:
df = pd.read_csv(os.path.join(DATA_PATH,'df_train.csv'))
df.drop(columns= df.columns[0:2],inplace=True)
df.head()

| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V24 | V25 | V26 | V27 | V28 | Class | TimeScaled | TimeSin | TimeCos | AmountBC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.829392 | 1.118573 | 0.926038 | 1.163686 | 0.009824 | 0.527347 | 0.17337 | 0.723997 | -0.638939 | -0.162923 | ... | -0.298908 | -0.060301 | -0.217935 | 0.291312 | 0.120779 | 0 | 0.460069 | -0.480989 | 0.876727 | 3.195062 |
| 1 | -2.814527 | 1.613321 | 0.654307 | 0.581821 | 0.399491 | 0.73004 | 0.456233 | -2.464347 | 0.654797 | 2.248682 | ... | -0.329526 | -0.307374 | -0.440007 | -2.135657 | 0.011041 | 0 | 0.266395 | -0.204567 | -0.978853 | 3.125269 |
| 2 | 2.105028 | -0.7004 | -1.338043 | -0.596395 | -0.395217 | -0.75505 | -0.276951 | -0.291562 | -0.965418 | 1.107179 | ... | -0.278137 | -0.040685 | 0.789267 | -0.066054 | -0.069956 | 0 | 0.762303 | -0.153992 | -0.988072 | 3.421235 |
| 3 | 2.205839 | -1.023897 | -1.270137 | -0.950174 | -0.868712 | -0.975492 | -0.475464 | -0.280564 | 0.503713 | 0.448173 | ... | -0.041177 | 0.089158 | 1.105794 | -0.066285 | -0.079881 | 0 | 0.87974 | -0.998227 | 0.059524 | 1.072145 |
| 4 | 2.02709 | -0.778666 | -1.552755 | -0.558679 | 0.020939 | -0.026071 | -0.20781 | -0.124288 | -0.635953 | 0.817757 | ... | 0.033477 | -0.157992 | -0.606327 | -0.003931 | -0.039868 | 0 | 0.821649 | -0.783558 | -0.621319 | 3.97149 |


## Using xgboost in Python

In [None]:
cv = StratifiedShuffleSplit(n_splits=VAL_SPLITS,test_size=0.15,random_state=0)
clf = xgb.sklearn.XGBClassifier(n_jobs=-1,eval_metric='auc',verbosity=1)

# In case we want to select a subset of features
df_ = df[['Class','V9','V14','V16']]
X = df_.drop(columns='Class').to_numpy()
y = df_['Class'].to_numpy()

metrics = cross_val_score(clf,X,y,cv=cv,scoring='f1',n_jobs=-1)
print('F1 value (Val): {:.2f} ± {:.2f}'.format(
                np.mean(metrics),
                np.std(metrics, ddof=1)
            ))
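As a small illustration of why `StratifiedShuffleSplit` is used here, the sketch below checks that each validation split keeps roughly the same fraction of positives as the full label vector, which matters when the positive class is rare. The toy arrays (`X_toy`, `y_toy`) and the 2% positive rate are assumptions made up for this example; they are not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy imbalanced labels: 20 positives out of 1000 samples (illustrative only)
y_toy = np.array([0] * 980 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

sss = StratifiedShuffleSplit(n_splits=4, test_size=0.15, random_state=0)

val_rates = []
for idx_t, idx_v in sss.split(X_toy, y_toy):
    # Fraction of positives in this validation split
    val_rates.append(y_toy[idx_v].mean())

print("Overall positive rate:", y_toy.mean())
print("Validation positive rates:", np.round(val_rates, 4))
```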

In [65]:
cv = StratifiedShuffleSplit(n_splits=VAL_SPLITS,test_size=0.15,random_state=0)
clf = xgb.sklearn.XGBClassifier(n_jobs=-1,verbosity=1)
# In case we want to select a subset of features
df_ = df[['Class','V9','V14','V16']]
X = df_.drop(columns='Class').to_numpy()
y = df_['Class'].to_numpy()

for idx_t,idx_v in cv.split(X,y):
    X_train = X[idx_t]
    y_train = y[idx_t]
    X_val = X[idx_v]
    y_val = y[idx_v]
    clf.fit(X_train,y_train,eval_metric=F1_eval,eval_set=[(X_val,y_val)],early_stopping_rounds=25)
    break

[0]	validation_0-error:0.000829	validation_0-f1_err:0.294118
Multiple eval metrics have been passed: 'validation_0-f1_err' will be used for early stopping.

Will train until validation_0-f1_err hasn't improved in 25 rounds.
[1]	validation_0-error:0.000829	validation_0-f1_err:0.294118
[2]	validation_0-error:0.000829	validation_0-f1_err:0.27619
[3]	validation_0-error:0.000802	validation_0-f1_err:0.271028
[4]	validation_0-error:0.000802	validation_0-f1_err:0.271028
[5]	validation_0-error:0.000802	validation_0-f1_err:0.259259
[6]	validation_0-error:0.000802	validation_0-f1_err:0.259259
[7]	validation_0-error:0.000774	validation_0-f1_err:0.259259
[8]	validation_0-error:0.000802	validation_0-f1_err:0.259259
[9]	validation_0-error:0.000774	validation_0-f1_err:0.259259
[10]	validation_0-error:0.000774	validation_0-f1_err:0.259259
[11]	validation_0-error:0.000746	validation_0-f1_err:0.245283
[12]	validation_0-error:0.000719	validation_0-f1_err:0.245283
[13]	validation_0-error:0.000719	validatio

In [43]:
# get_n_splits is a method and must be called; it returns the configured number of splits (1 here)
StratifiedShuffleSplit(n_splits=1,test_size=0.15,random_state=0).get_n_splits()

In [47]:
cv = StratifiedShuffleSplit(n_splits=VAL_SPLITS,test_size=0.15,random_state=0)
clf = xgb.sklearn.XGBClassifier(n_jobs=-1,verbosity=1)

# Use all features (select a subset of columns here if needed)
df_ = df
X = df_.drop(columns='Class').to_numpy()
y = df_['Class'].to_numpy()

# We create two empty lists to save the metrics at each fold for train and validation.
metrics = []
metrics_train = []
# Loop over the different validation folds
for i,(idx_t, idx_v) in enumerate(cv.split(X,y)):
    X_train = X[idx_t]
    y_train = y[idx_t]
    X_val = X[idx_v]
    y_val = y[idx_v]
    
    clf.fit(X_train,y_train,eval_metric=F1_eval,eval_set=[(X_val,y_val)],early_stopping_rounds=25)
    
    y_pred = clf.predict(X_val)
    metric = f1_score(y_val,y_pred)
    metrics.append(metric)
    
    y_t_pred = clf.predict(X_train)
    metric_train = f1_score(y_train,y_t_pred)
    metrics_train.append(metric_train)
    
    print('{}-fold / {} completed!'.format(i+1,cv.get_n_splits()))
    
metric_mean = np.mean(metrics)
metric_std = np.std(metrics, ddof=1)
metric_t_mean = np.mean(metrics_train)
metric_t_std = np.std(metrics_train, ddof=1)
print('Metric value (Train): {:.2f} ± {:.2f}'.format(metric_t_mean,metric_t_std))
print('Metric value (Val): {:.2f} ± {:.2f}'.format(metric_mean,metric_std))

[0]	validation_0-error:0.000857	validation_0-f1_err:0.263158
Multiple eval metrics have been passed: 'validation_0-f1_err' will be used for early stopping.

Will train until validation_0-f1_err hasn't improved in 25 rounds.
[1]	validation_0-error:0.000829	validation_0-f1_err:0.263158
[2]	validation_0-error:0.000829	validation_0-f1_err:0.263158
[3]	validation_0-error:0.000829	validation_0-f1_err:0.256637
[4]	validation_0-error:0.000802	validation_0-f1_err:0.256637
[5]	validation_0-error:0.000802	validation_0-f1_err:0.25
[6]	validation_0-error:0.000802	validation_0-f1_err:0.254545
[7]	validation_0-error:0.000802	validation_0-f1_err:0.252174
[8]	validation_0-error:0.000802	validation_0-f1_err:0.245614
[9]	validation_0-error:0.000802	validation_0-f1_err:0.243243
[10]	validation_0-error:0.000802	validation_0-f1_err:0.239669
[11]	validation_0-error:0.000802	validation_0-f1_err:0.233333
[12]	validation_0-error:0.000719	validation_0-f1_err:0.232143
[13]	validation_0-error:0.000719	validation_0

[65]	validation_0-error:0.000663	validation_0-f1_err:0.207207
[66]	validation_0-error:0.000663	validation_0-f1_err:0.207207
[67]	validation_0-error:0.000663	validation_0-f1_err:0.207207
[68]	validation_0-error:0.000663	validation_0-f1_err:0.207207
Stopping. Best iteration:
[43]	validation_0-error:0.000663	validation_0-f1_err:0.207207

2-fold / 4 completed!
[0]	validation_0-error:0.000553	validation_0-f1_err:0.181818
Multiple eval metrics have been passed: 'validation_0-f1_err' will be used for early stopping.

Will train until validation_0-f1_err hasn't improved in 25 rounds.
[1]	validation_0-error:0.000553	validation_0-f1_err:0.171171
[2]	validation_0-error:0.000553	validation_0-f1_err:0.171171
[3]	validation_0-error:0.000553	validation_0-f1_err:0.171171
[4]	validation_0-error:0.000553	validation_0-f1_err:0.168142
[5]	validation_0-error:0.000525	validation_0-f1_err:0.168142
[6]	validation_0-error:0.000525	validation_0-f1_err:0.160714
[7]	validation_0-error:0.000525	validation_0-f1_err

[40]	validation_0-error:0.000608	validation_0-f1_err:0.155172
[41]	validation_0-error:0.000608	validation_0-f1_err:0.155172
[42]	validation_0-error:0.000608	validation_0-f1_err:0.155172
[43]	validation_0-error:0.000608	validation_0-f1_err:0.155172
[44]	validation_0-error:0.00058	validation_0-f1_err:0.155172
[45]	validation_0-error:0.00058	validation_0-f1_err:0.155172
[46]	validation_0-error:0.000553	validation_0-f1_err:0.155172
[47]	validation_0-error:0.00058	validation_0-f1_err:0.155172
[48]	validation_0-error:0.00058	validation_0-f1_err:0.155172
[49]	validation_0-error:0.000553	validation_0-f1_err:0.155172
[50]	validation_0-error:0.000553	validation_0-f1_err:0.155172
[51]	validation_0-error:0.000553	validation_0-f1_err:0.155172
[52]	validation_0-error:0.000553	validation_0-f1_err:0.155172
[53]	validation_0-error:0.000553	validation_0-f1_err:0.152542
[54]	validation_0-error:0.000553	validation_0-f1_err:0.159664
[55]	validation_0-error:0.000553	validation_0-f1_err:0.155172
[56]	validat

In [59]:
cv = StratifiedShuffleSplit(n_splits=VAL_SPLITS,test_size=0.15,random_state=0)
clf = xgb.sklearn.XGBClassifier(n_jobs=-1,verbosity=0)

# Use all features (select a subset of columns here if needed)
df_ = df
X = df_.drop(columns='Class').to_numpy()
y = df_['Class'].to_numpy()

# We create two empty lists to save the metrics at each fold for train and validation.
metrics = []
metrics_train = []
# Loop over the different validation folds
for i,(idx_t, idx_v) in enumerate(cv.split(X,y)):
    X_train = X[idx_t]
    y_train = y[idx_t]
    X_val = X[idx_v]
    y_val = y[idx_v]
    
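    # Up-weight the rare positive class: negatives get weight 1,
    # positives get 1 + 1/(fraction of positives in y_train)
    clf.fit(X_train,y_train,
            eval_metric=F1_eval,
            eval_set=[(X_val,y_val)],
            early_stopping_rounds=25,
            sample_weight=1 + y_train/y_train.mean())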
    
    y_pred = clf.predict(X_val)
    metric = f1_score(y_val,y_pred)
    metrics.append(metric)
    
    y_t_pred = clf.predict(X_train)
    metric_train = f1_score(y_train,y_t_pred)
    metrics_train.append(metric_train)
    
    print('{}-fold / {} completed!'.format(i+1,cv.get_n_splits()))
    
metric_mean = np.mean(metrics)
metric_std = np.std(metrics, ddof=1)
metric_t_mean = np.mean(metrics_train)
metric_t_std = np.std(metrics_train, ddof=1)
print('Metric value (Train): {:.2f} ± {:.2f}'.format(metric_t_mean,metric_t_std))
print('Metric value (Val): {:.2f} ± {:.2f}'.format(metric_mean,metric_std))

[0]	validation_0-error:0.058132	validation_0-f1_err:0.502825
Multiple eval metrics have been passed: 'validation_0-f1_err' will be used for early stopping.

Will train until validation_0-f1_err hasn't improved in 25 rounds.
[1]	validation_0-error:0.040192	validation_0-f1_err:0.362319
[2]	validation_0-error:0.029522	validation_0-f1_err:0.362319
[3]	validation_0-error:0.02709	validation_0-f1_err:0.263158
[4]	validation_0-error:0.028665	validation_0-f1_err:0.252174
[5]	validation_0-error:0.026482	validation_0-f1_err:0.263158
[6]	validation_0-error:0.021921	validation_0-f1_err:0.263158
[7]	validation_0-error:0.020456	validation_0-f1_err:0.263158
[8]	validation_0-error:0.020373	validation_0-f1_err:0.233645
[9]	validation_0-error:0.02181	validation_0-f1_err:0.233645
[10]	validation_0-error:0.01841	validation_0-f1_err:0.226415
[11]	validation_0-error:0.019709	validation_0-f1_err:0.226415
[12]	validation_0-error:0.017083	validation_0-f1_err:0.226415
[13]	validation_0-error:0.01736	validation_0

[1]	validation_0-error:0.026565	validation_0-f1_err:0.396226
[2]	validation_0-error:0.02427	validation_0-f1_err:0.396226
[3]	validation_0-error:0.026565	validation_0-f1_err:0.255814
[4]	validation_0-error:0.026039	validation_0-f1_err:0.179487
[5]	validation_0-error:0.031955	validation_0-f1_err:0.179487
[6]	validation_0-error:0.029578	validation_0-f1_err:0.179487
[7]	validation_0-error:0.023496	validation_0-f1_err:0.179487
[8]	validation_0-error:0.023413	validation_0-f1_err:0.179487
[9]	validation_0-error:0.020787	validation_0-f1_err:0.179487
[10]	validation_0-error:0.020981	validation_0-f1_err:0.179487
[11]	validation_0-error:0.020566	validation_0-f1_err:0.179487
[12]	validation_0-error:0.018327	validation_0-f1_err:0.157895
[13]	validation_0-error:0.020179	validation_0-f1_err:0.157895
[14]	validation_0-error:0.018272	validation_0-f1_err:0.168142
[15]	validation_0-error:0.019709	validation_0-f1_err:0.168142
[16]	validation_0-error:0.018272	validation_0-f1_err:0.168142
[17]	validation_0-

[47]	validation_0-error:0.013324	validation_0-f1_err:0.140351
[48]	validation_0-error:0.013158	validation_0-f1_err:0.145299
[49]	validation_0-error:0.012964	validation_0-f1_err:0.137931
[50]	validation_0-error:0.013296	validation_0-f1_err:0.137931
[51]	validation_0-error:0.013407	validation_0-f1_err:0.137931
[52]	validation_0-error:0.013434	validation_0-f1_err:0.137931
[53]	validation_0-error:0.013434	validation_0-f1_err:0.137931
[54]	validation_0-error:0.013103	validation_0-f1_err:0.145299
[55]	validation_0-error:0.012633	validation_0-f1_err:0.159664
[56]	validation_0-error:0.012522	validation_0-f1_err:0.152542
[57]	validation_0-error:0.012605	validation_0-f1_err:0.147826
[58]	validation_0-error:0.012522	validation_0-f1_err:0.152542
[59]	validation_0-error:0.012163	validation_0-f1_err:0.145299
[60]	validation_0-error:0.012025	validation_0-f1_err:0.145299
Stopping. Best iteration:
[35]	validation_0-error:0.013213	validation_0-f1_err:0.130435

4-fold / 4 completed!
Metric value (Train):
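The `sample_weight` expression used in the last loop, `1 + y_train/y_train.mean()`, can be unpacked with a small worked example. For 0/1 labels, `y_train.mean()` is the fraction of positives, so negatives receive weight 1 and positives receive 1 plus the inverse of that fraction. The toy label vector below (2% positives) is an assumption chosen for illustration, not the notebook's data.

```python
import numpy as np

# Toy labels: 2 positives out of 100 samples (illustrative only)
y_train_toy = np.array([0] * 98 + [1] * 2)

weights = 1 + y_train_toy / y_train_toy.mean()

neg_w = weights[y_train_toy == 0][0]  # weight of every negative sample
pos_w = weights[y_train_toy == 1][0]  # weight of every positive sample

# Total weight carried by each class
total_neg = weights[y_train_toy == 0].sum()
total_pos = weights[y_train_toy == 1].sum()

print(f"negative weight: {neg_w:.1f}, positive weight: {pos_w:.1f}")
print(f"total negative weight: {total_neg:.1f}, total positive weight: {total_pos:.1f}")
```

With this scheme the positives together carry slightly more than half of the total weight (their total is the number of positives plus the number of samples), so the rare class is no longer dominated by the majority class during training.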

## Using sklearn function for CV (Preferred)

Although it is easy to build the CV loop by hand, and doing so lets us customize the metric and the outputs, a plain `for` loop runs the folds sequentially. We could parallelize it ourselves with the `multiprocessing` library, but `sklearn` already provides utilities for this.

These utilities can be found on sklearn's [webpage](https://scikit-learn.org/stable/modules/classes.html#model-validation). Some of them are:
* [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) : Returns the validation score for a given metric.
* [cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) : Evaluates one or more metrics by cross-validation and also records fit/score times. It can also return the train metric.
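As a rough sketch of how `cross_validate` reports both train and validation scores, the example below runs it on synthetic data. To keep it runnable without xgboost, it uses `LogisticRegression` as a stand-in classifier and `make_classification` for the data; both are assumptions for illustration, and an `XGBClassifier` could be passed in the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate

# Synthetic imbalanced data (illustrative only, not the notebook's dataset)
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)

cv_toy = StratifiedShuffleSplit(n_splits=4, test_size=0.15, random_state=0)
clf_toy = LogisticRegression(max_iter=1000)  # stand-in for XGBClassifier

# return_train_score=True adds 'train_score' next to 'test_score';
# pass n_jobs=-1 to evaluate the splits in parallel on all cores
scores = cross_validate(
    clf_toy, X_toy, y_toy, cv=cv_toy, scoring="f1", return_train_score=True
)

for split in ("train", "test"):
    vals = scores[f"{split}_score"]
    print("F1 ({}): {:.2f} ± {:.2f}".format(split, np.mean(vals), np.std(vals, ddof=1)))
```

Note that `cross_validate` labels the validation scores as `test_score`, even though they come from the validation splits defined by `cv`.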

In [51]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate