In this notebook, we use Dask clusters hosted on Saturn Cloud to spin up multiple GPUs and speed up training.
The code cells below cover hyperparameter search, cross-validation, and building the final model.
We will parallelize the hyperparameter search by distributing the 10 cross-validation folds across multiple GPU workers.

We will use the GBDT classifier from CatBoost to train this binary classification model. Since all of our features are categorical, CatBoost is a good fit for the following reasons:
     1. CatBoost handles categorical features natively, without manual encoding.
     2. It converts categorical features to numeric values using statistics derived from the target and from combinations of
     multiple categorical features.
     3. One-hot encoding is applied by default to categorical features with only a few distinct values.
     4. It can train models on the GPU natively, which makes training on large datasets much faster.
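
To make point 2 more concrete, here is a minimal sketch of a smoothed target statistic for a single categorical column. It is a simplified illustration and not CatBoost's actual implementation (CatBoost computes ordered statistics over random permutations and also combines features); the function name `target_statistic`, the `prior_weight` parameter, and the toy data are all hypothetical.

```python
from collections import defaultdict

def target_statistic(categories, targets, prior_weight=1.0):
    """Replace each category with a smoothed mean of the binary target.

    Hypothetical helper for illustration only; CatBoost's own encoding is
    ordered, so each row only uses rows that precede it in a permutation.
    """
    prior = sum(targets) / len(targets)  # global positive rate
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1
    # Smoothed mean: rare categories are pulled toward the global prior
    return [
        (sums[cat] + prior * prior_weight) / (counts[cat] + prior_weight)
        for cat in categories
    ]

# Toy data (made up), using column names from the crime dataset
primary_type = ["THEFT", "THEFT", "BATTERY", "THEFT", "BATTERY"]
arrest = [1, 0, 0, 1, 0]
encoded = target_statistic(primary_type, arrest)
print(encoded)
```

Each category is mapped to a number between 0 and 1 that reflects how often it co-occurs with an arrest, which is the kind of numeric signal a tree can split on.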
     

In [1]:
# import dask libraries to connect to dask clusters on saturn cloud
from dask_saturn import SaturnCluster
from dask.distributed import Client

cluster = SaturnCluster()
client = Client(cluster)
client

INFO:dask-saturn:Cluster is ready
INFO:dask-saturn:Registering default plugins
INFO:dask-saturn:Success!


Client
  Scheduler: tcp://d-dkark-dinesh-ml-e7cccc4d496b430cbdc6b8eb801eab0c.main-namespace:8786
  Dashboard: https://d-dkark-dinesh-ml-e7cccc4d496b430cbdc6b8eb801eab0c.community.saturnenterprise.io
Cluster
  Workers: 3
  Cores: 12
  Memory: 43.31 GiB


In [None]:
# install packages not available in the image

!pip install -q scikit-learn optuna

As the client summary above shows, we have 3 workers available (due to resource restrictions), although 10 workers would be preferable given that we will be iterating over 10 folds. Even so, using 3 GPU workers is much faster than using a single machine.

In [14]:
# import relevant packages
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier, Pool
from catboost.utils import eval_metric

import optuna
from optuna.samplers import TPESampler


We saved the imputed dataframe in the first stage; we will now use it to build a binary classification model.

In [99]:
# read the crime data file with imputed values from the first stage
model_df = pd.read_csv("imputed_df.csv", index_col=0)
model_df

Unnamed: 0,primary_type,location_description,district,ward,domestic,arrest,year,month,day,weekday,hour
0,BATTERY,STREET,2.0,3.0,False,False,2005.0,1.0,1.0,5.0,1.0
1,WEAPONS VIOLATION,RESIDENCE,5.0,9.0,False,False,2005.0,1.0,1.0,5.0,1.0
2,CRIMINAL DAMAGE,CHURCH / SYNAGOGUE / PLACE OF WORSHIP,2.0,3.0,False,False,2005.0,1.0,1.0,5.0,1.0
3,THEFT,DEPARTMENT STORE,1.0,42.0,False,True,2005.0,1.0,1.0,5.0,1.0
4,THEFT,BAR OR TAVERN,18.0,42.0,False,False,2005.0,1.0,1.0,5.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
999995,OTHER OFFENSE,STREET,2.0,3.0,False,False,2008.0,12.0,31.0,2.0,12.0
999996,BATTERY,STREET,1.0,42.0,True,False,2008.0,12.0,31.0,2.0,12.0
999997,MOTOR VEHICLE THEFT,STREET,7.0,17.0,False,False,2008.0,12.0,31.0,2.0,12.0
999998,BURGLARY,RESIDENCE,6.0,17.0,False,False,2008.0,12.0,31.0,2.0,12.0


Now we will do some pre-processing and build train and test sets using train_test_split.
We stratify by the target column so that the share of arrests is the same in the train and test splits.

In [19]:
X = model_df.drop('arrest', axis = 1)
y = model_df.arrest.apply(lambda x: 1 if x else 0)

In [20]:
# cast the numeric-looking columns from float to integer
int_cols = ['district', 'ward', 'year', 'month', 'day', 'weekday', 'hour']
X = X.astype({col: np.int64 for col in int_cols})

In [21]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    stratify=y,
                                                    test_size=0.25,
                                                    random_state=21)
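
Conceptually, stratifying means sampling the test rows separately within each class so that both splits keep the same class mix. The sketch below illustrates that idea with NumPy on made-up data; it is not scikit-learn's implementation, and `y_demo` and the 30% positive rate are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(21)
# Synthetic binary target with roughly 30% positives (made-up data)
y_demo = (rng.random(10_000) < 0.3).astype(int)

test_idx = []
for cls in (0, 1):
    cls_idx = np.flatnonzero(y_demo == cls)
    rng.shuffle(cls_idx)
    n_test = int(round(0.25 * len(cls_idx)))  # 25% of each class goes to test
    test_idx.extend(cls_idx[:n_test])

test_mask = np.zeros(len(y_demo), dtype=bool)
test_mask[test_idx] = True

# The positive share is nearly identical in all three sets
print("full :", y_demo.mean())
print("train:", y_demo[~test_mask].mean())
print("test :", y_demo[test_mask].mean())
```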

We should compare the distribution of target values next:

In [100]:
y.value_counts()

0    706652
1    293348
Name: arrest, dtype: int64

Now let's look at the distribution of target values in the training set. The share of arrests (about 29.3%) matches the full dataset, so the stratified split worked as intended.

In [22]:
y_train.value_counts()

0    529989
1    220011
Name: arrest, dtype: int64

We will now scatter the large dataframes into distributed memory as futures so that the data is readily available for the workers to fetch.

In [53]:
# create dask futures for features and target dataframes
X_train_f, y_train_f = client.scatter([X_train, y_train])

The function below fits the CatBoost classifier using Logloss as the loss function and AUC as the evaluation metric, which the overfitting detector monitors. As a base parameter, we set early_stopping_rounds to 50, so training stops after 50 iterations without improvement in the evaluation metric (i.e., AUC on the eval_set). We set task_type to GPU to train on the GPU.

The function takes the train and validation indices for one fold as inputs and returns the evaluation metric for that fold.

In [None]:
# function to fit the model with parameters from the search space
def fit_reg(train_idx, val_idx, X_data, y_data, cat_cols, params):
    # X_data and y_data are passed in as scattered futures; Dask resolves them
    # to the actual dataframes on the worker, so we index those rather than
    # the notebook-level X_train and y_train
    train_x, val_x = X_data.iloc[train_idx], X_data.iloc[val_idx]
    train_y, val_y = y_data.iloc[train_idx], y_data.iloc[val_idx]

    train_pool = Pool(data=train_x,
                      label=train_y,
                      cat_features=cat_cols)

    val_pool = Pool(data=val_x,
                    label=val_y,
                    cat_features=cat_cols)

    model = CatBoostClassifier(
        **params,
        loss_function='Logloss',
        eval_metric='AUC',
        task_type='GPU',
        early_stopping_rounds=50,
        random_seed=21,
    )

    model.fit(train_pool,
              eval_set=val_pool,
              verbose=0)
    # column 1 holds the predicted probability of the positive class (arrest)
    y_preds = model.predict_proba(val_pool)
    return eval_metric(val_pool.get_label(), y_preds[:, 1], 'AUC')
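
The early stopping rule described above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea (stop once the validation metric has not improved for a fixed number of iterations), not CatBoost's internal overfitting detector; the function name and the validation curve are made up.

```python
def best_iteration_with_early_stopping(val_scores, patience=50):
    """Return (best_iteration, stopped_at) for a metric where higher is better."""
    best_score = float("-inf")
    best_iter = -1
    for i, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            return best_iter, i  # no improvement for `patience` iterations
    return best_iter, len(val_scores) - 1

# Made-up validation AUC curve: improves, then slowly degrades
curve = [0.80, 0.85, 0.88, 0.879, 0.878, 0.877, 0.876]
print(best_iteration_with_early_stopping(curve, patience=3))  # (2, 5)
```

With a patience of 3, training stops at iteration 5 and the best iteration is 2; with the notebook's value of 50, this short curve would simply run to the end.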
  

The objective function below is what Optuna will optimize during hyperparameter tuning. We use TPESampler as the sampler: a Bayesian optimization approach (Tree-structured Parzen Estimator) that models promising and unpromising parameter values with Gaussian Mixture Models (GMMs).

The search ranges are chosen to limit overfitting and underfitting while keeping computation fast. We will search for optimal values of the following parameters:
    1. iterations: number of trees (speed and overfitting)
    2. depth: depth of each tree (speed)
    3. learning_rate: size of the gradient step (training time)
    4. l2_leaf_reg: coefficient of the L2 regularization term of the loss function (overfitting)
    5. max_ctr_complexity: maximum number of features that can be combined (ideal for categorical features)
    6. random_strength: amount of randomness added when scoring splits (helps avoid overfitting)
    7. bagging_temperature: used in the Bayesian bootstrap to assign random weights to objects
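
Parameters that span orders of magnitude, such as learning_rate and bagging_temperature, are usually searched on a log scale so that small values get as much attention as large ones. The sketch below shows what a log-uniform draw looks like in plain Python; it only illustrates the distribution and says nothing about how Optuna's TPE sampler picks values. The helper name `sample_loguniform` is hypothetical.

```python
import math
import random

def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(21)
draws = [sample_loguniform(0.01, 0.1, rng) for _ in range(10_000)]

# About half of the draws fall below the geometric midpoint sqrt(0.01 * 0.1),
# whereas a plain uniform draw would put only about a quarter of them there
midpoint = math.sqrt(0.01 * 0.1)
print(sum(d < midpoint for d in draws) / len(draws))
```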

We will create k-fold training and validation splits and submit the fit_reg function, together with the futures and the other arguments, to the workers. We then gather the result from each worker and collect the AUC scores in a list. The objective function returns the mean AUC over the 10 folds for each trial, which the hyperparameter search algorithm tries to maximize.

In [70]:
# Objective function
def objective(trial):
    params = {
        'iterations': trial.suggest_int('iterations', 50, 500),
        'depth': trial.suggest_int('depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1, log=True),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1, 10),
        'max_ctr_complexity': trial.suggest_int('max_ctr_complexity', 0, 8),
        'random_strength': trial.suggest_int('random_strength', 0, 100),
        'bagging_temperature': trial.suggest_float('bagging_temperature', 0.01, 100.00, log=True),
    }
    
    cat_cols = X_train.columns.values.tolist()
    
    kf = KFold(n_splits=10, random_state=21, shuffle=True)
    # submit all folds first so they can run on the workers in parallel,
    # then block once to gather every fold's score
    futures = [
        client.submit(fit_reg, train_idx, val_idx, X_train_f, y_train_f, cat_cols, params)
        for train_idx, val_idx in kf.split(X_train)
    ]
    scores = client.gather(futures)

    return np.mean(scores)
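
When distributing folds, the order of submitting and gathering matters: gathering right after each submit waits for one fold to finish before the next one starts, while submitting every fold first lets several workers run at once. The sketch below shows the pattern with the standard library's ThreadPoolExecutor standing in for the Dask client (an assumption made so the example runs anywhere); `score_fold` is a hypothetical placeholder for fit_reg that returns a made-up score.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def score_fold(fold_id, params):
    """Stand-in for fit_reg: returns a made-up score instead of training a model."""
    return 0.90 - 0.001 * fold_id + 0.0001 * params["depth"]

params = {"depth": 5}

with ThreadPoolExecutor(max_workers=3) as pool:  # 3 workers, as in the cluster above
    # Submit every fold first so the workers can run them concurrently...
    futures = [pool.submit(score_fold, fold_id, params) for fold_id in range(10)]
    # ...and only then block to collect the results
    scores = [f.result() for f in futures]

print(mean(scores))
```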


Now let's test whether the parallelization above works with 2 trials.
We create an Optuna study with direction set to maximize and run the optimization using the objective function defined above.

Each trial reports the mean AUC over the 10 folds and the hyperparameter values used.

In [71]:
%%time
# Create study
study = optuna.create_study(direction = "maximize", sampler = TPESampler(seed=21))

# Run optimization
study.optimize(objective, n_trials=2)

[I 2021-12-04 23:39:24,953] A new study created in memory with name: no-name-91c856a7-4d38-40ef-b88c-510501eb4b6a


[0.8822942820490333]
[0.8860948296898893]
[0.8837665439758037]
[0.8833926371325325]
[0.8858184499041227]
[0.8850700331292951]
[0.8867382604521217]
[0.8838457248995057]
[0.881783118059049]


[I 2021-12-04 23:39:56,618] Trial 0 finished with value: 0.884491967761954 and parameters: {'iterations': 71, 'depth': 5, 'learning_rate': 0.05259765072681896, 'l2_leaf_reg': 1.1945462492435481, 'max_ctr_complexity': 1, 'random_strength': 5, 'bagging_temperature': 0.16184063577845234}. Best is trial 0 with value: 0.884491967761954.


[0.8861157983281878]
[0.8997025552862772]
[0.9005105391985607]
[0.8999762494773915]
[0.8993866475301515]
[0.9009166195144634]
[0.9013334567596435]
[0.9025292279095771]
[0.9007857510217169]
[0.898742410585051]


[I 2021-12-04 23:45:41,459] Trial 1 finished with value: 0.9005266370004759 and parameters: {'iterations': 349, 'depth': 5, 'learning_rate': 0.03833463003091129, 'l2_leaf_reg': 1.626138591513405, 'max_ctr_complexity': 7, 'random_strength': 13, 'bagging_temperature': 0.051582055717860724}. Best is trial 1 with value: 0.9005266370004759.


[0.9013829127219252]
CPU times: user 53.3 s, sys: 14.5 s, total: 1min 7s
Wall time: 6min 16s


In [73]:
# Run optimization for 20 more trials
study.optimize(objective, n_trials=20)

[0.9011729156564899]
[0.9018962350792717]
[0.9014146963471952]
[0.9005806076586915]
[0.9018249518327144]
[0.9024457151839516]
[0.9037405898695046]
[0.9017146452603527]
[0.8997667605758828]


[I 2021-12-04 23:56:21,865] Trial 2 finished with value: 0.9016973496699554 and parameters: {'iterations': 273, 'depth': 9, 'learning_rate': 0.057404222025053464, 'l2_leaf_reg': 9.734366128142419, 'max_ctr_complexity': 6, 'random_strength': 38, 'bagging_temperature': 0.43139322253545526}. Best is trial 2 with value: 0.9016973496699554.


[0.9024163792355009]
[0.9015933477018165]
[0.9022101008550821]
[0.9017145111822122]
[0.9009555759951295]
[0.902471915960919]
[0.9029795084828871]
[0.904093156355288]
[0.9024258757117025]
[0.9005551532272226]


[I 2021-12-05 00:02:20,727] Trial 3 finished with value: 0.9021780505362212 and parameters: {'iterations': 371, 'depth': 5, 'learning_rate': 0.07146655899479473, 'l2_leaf_reg': 9.21847572650625, 'max_ctr_complexity': 6, 'random_strength': 52, 'bagging_temperature': 0.04691838567431964}. Best is trial 3 with value: 0.9021780505362212.


[0.902781359889953]
[0.8845754536983939]
[0.8891435316632558]
[0.8868992585933315]
[0.8840251928010967]
[0.883145368754376]
[0.8882647602670274]
[0.8870291958643496]
[0.88590179822283]
[0.8804368552939432]


[I 2021-12-05 00:05:07,279] Trial 4 finished with value: 0.8853731846944018 and parameters: {'iterations': 184, 'depth': 5, 'learning_rate': 0.021242839930699108, 'l2_leaf_reg': 5.141263004136785, 'max_ctr_complexity': 4, 'random_strength': 21, 'bagging_temperature': 15.829722723991496}. Best is trial 3 with value: 0.9021780505362212.


[0.8843104317854131]
[0.8990367673412136]
[0.8996448826289598]
[0.8992313958030472]
[0.8986364148479303]
[0.9001128298406438]
[0.9003659548530877]
[0.9015170895667662]
[0.8996083662515053]
[0.8977888081860577]


[I 2021-12-05 00:14:34,087] Trial 5 finished with value: 0.8996559769207003 and parameters: {'iterations': 377, 'depth': 9, 'learning_rate': 0.018514918495483337, 'l2_leaf_reg': 6.533422400335912, 'max_ctr_complexity': 5, 'random_strength': 39, 'bagging_temperature': 0.4379421178920733}. Best is trial 3 with value: 0.9021780505362212.


[0.9006172598877917]
[0.8963946060012367]
[0.8976093825309067]
[0.8963766835803988]
[0.8965605133225553]
[0.8972084257791366]
[0.8982412066821379]
[0.9000504000977192]
[0.8975235749897177]
[0.895612611129396]


[I 2021-12-05 00:21:49,376] Trial 6 finished with value: 0.8965075756105974 and parameters: {'iterations': 415, 'depth': 7, 'learning_rate': 0.07912984929177935, 'l2_leaf_reg': 5.410290025441162, 'max_ctr_complexity': 7, 'random_strength': 86, 'bagging_temperature': 14.248874933718481}. Best is trial 3 with value: 0.9021780505362212.


[0.8894983519927693]
[0.901170462295483]
[0.9016834277674414]
[0.9012818261165797]
[0.9004205867227141]
[0.9020879213511809]
[0.9023420687019428]
[0.9035870104888828]
[0.9016543859953872]
[0.8999294531160223]


[I 2021-12-05 00:32:10,973] Trial 7 finished with value: 0.9016673725394773 and parameters: {'iterations': 450, 'depth': 8, 'learning_rate': 0.02847018620606998, 'l2_leaf_reg': 3.7532364830996996, 'max_ctr_complexity': 7, 'random_strength': 14, 'bagging_temperature': 0.40288954161165436}. Best is trial 3 with value: 0.9021780505362212.


[0.9025165828391368]
[0.8978615815940882]
[0.8994200312947438]
[0.8981762214790522]
[0.8976900798355565]
[0.899448136167836]
[0.8992817548439009]
[0.9007292211143144]
[0.898807097589148]
[0.8969097746302274]


[I 2021-12-05 00:35:58,717] Trial 8 finished with value: 0.8988370781215881 and parameters: {'iterations': 154, 'depth': 7, 'learning_rate': 0.07696053142870848, 'l2_leaf_reg': 3.9496743137127104, 'max_ctr_complexity': 7, 'random_strength': 80, 'bagging_temperature': 4.600177517331423}. Best is trial 3 with value: 0.9021780505362212.


[0.9000468826670134]
[0.8961294553519024]
[0.8973144418450717]
[0.8967823436395318]
[0.8957677436684408]
[0.8975461251008143]
[0.898019983425401]
[0.898369627890917]
[0.8972071910518995]
[0.8953402288712289]


[I 2021-12-05 00:39:37,606] Trial 9 finished with value: 0.8970024755625474 and parameters: {'iterations': 160, 'depth': 9, 'learning_rate': 0.029349532672828978, 'l2_leaf_reg': 7.651978542856805, 'max_ctr_complexity': 4, 'random_strength': 87, 'bagging_temperature': 6.436609590712361}. Best is trial 3 with value: 0.9021780505362212.


[0.8975476147802671]
[0.8850923437206392]
[0.8880102307740498]
[0.8876591778976887]
[0.8858295722471742]
[0.8878423248943362]
[0.8874578720834368]
[0.8879554118133774]
[0.886526645186868]
[0.8844987661464973]


[I 2021-12-05 00:42:49,649] Trial 10 finished with value: 0.886851077072025 and parameters: {'iterations': 282, 'depth': 3, 'learning_rate': 0.013090387784708538, 'l2_leaf_reg': 9.84226789779026, 'max_ctr_complexity': 2, 'random_strength': 64, 'bagging_temperature': 0.010986262784020914}. Best is trial 3 with value: 0.9021780505362212.


[0.8876384259561829]
[0.9028935472567374]
[0.9035901650481188]
[0.9033325894575696]
[0.9024281015093848]
[0.9035972613016092]
[0.9044698990764766]
[0.9051113865947898]
[0.903852203113506]
[0.9021398185211356]


[I 2021-12-05 00:52:17,340] Trial 11 finished with value: 0.9035761295689717 and parameters: {'iterations': 293, 'depth': 10, 'learning_rate': 0.09902880913698733, 'l2_leaf_reg': 9.863411715860835, 'max_ctr_complexity': 5, 'random_strength': 45, 'bagging_temperature': 0.044726461496793435}. Best is trial 11 with value: 0.9035761295689717.


[0.9043463238103879]
[0.9007749232560915]
[0.9011429641104574]
[0.9007264942482811]
[0.9001316850338663]
[0.9014203698174754]
[0.9023246477001909]
[0.9032774938919738]
[0.901602573884735]
[0.8997188500405655]


[I 2021-12-05 00:57:37,145] Trial 12 finished with value: 0.901313352967577 and parameters: {'iterations': 498, 'depth': 3, 'learning_rate': 0.09824595908646311, 'l2_leaf_reg': 8.00353650056507, 'max_ctr_complexity': 3, 'random_strength': 57, 'bagging_temperature': 0.011876438696668499}. Best is trial 11 with value: 0.9035761295689717.


[0.9020135276921316]
[0.6794090083864393]
[0.6848143667530306]
[0.6821839541534489]
[0.6837527713681888]
[0.6857386576902342]
[0.6811840790604009]
[0.6822119120567418]
[0.6841607370269629]
[0.6813831780248305]


[I 2021-12-05 01:00:18,740] Trial 13 finished with value: 0.6827870575832728 and parameters: {'iterations': 303, 'depth': 10, 'learning_rate': 0.05495045923296133, 'l2_leaf_reg': 8.287211772668302, 'max_ctr_complexity': 5, 'random_strength': 40, 'bagging_temperature': 93.43605104023345}. Best is trial 11 with value: 0.9035761295689717.


[0.6830319113124506]
[0.9017882814374784]
[0.9025756849279056]
[0.9020492441177952]
[0.9011189456976982]
[0.9026050140563914]
[0.9031969001811815]
[0.9041158049457425]
[0.9024123578785994]
[0.9008370247404186]


[I 2021-12-05 01:05:16,354] Trial 14 finished with value: 0.9023821473438568 and parameters: {'iterations': 243, 'depth': 6, 'learning_rate': 0.09851236219369892, 'l2_leaf_reg': 9.016982383135149, 'max_ctr_complexity': 8, 'random_strength': 70, 'bagging_temperature': 0.056962030663622105}. Best is trial 11 with value: 0.9035761295689717.


[0.9031222154553558]
[0.9014393868166257]
[0.902346730457719]
[0.9019738837807989]
[0.9009933914034686]
[0.9024373823314693]
[0.9030823645025708]
[0.9040246732904622]
[0.9022060366458263]
[0.9006373611435974]


[I 2021-12-05 01:10:03,043] Trial 15 finished with value: 0.902214257332097 and parameters: {'iterations': 231, 'depth': 6, 'learning_rate': 0.09996022059147683, 'l2_leaf_reg': 7.319287207042267, 'max_ctr_complexity': 8, 'random_strength': 71, 'bagging_temperature': 0.07226853560813466}. Best is trial 11 with value: 0.9035761295689717.


[0.903001362948431]
[0.8970785978849709]
[0.8980618166023262]
[0.8975071573007987]
[0.896863101797123]
[0.8984642154873227]
[0.8988583667795956]
[0.8995371725835078]
[0.8980631079961716]
[0.896443870041391]


[I 2021-12-05 01:13:29,251] Trial 16 finished with value: 0.8979813058714624 and parameters: {'iterations': 87, 'depth': 10, 'learning_rate': 0.040901413502549805, 'l2_leaf_reg': 8.788531155678182, 'max_ctr_complexity': 8, 'random_strength': 71, 'bagging_temperature': 1.620192814398018}. Best is trial 11 with value: 0.9035761295689717.


[0.8989356522414169]
[0.8919745343985441]
[0.8941063530293177]
[0.8933932910073745]
[0.8925766964331605]
[0.8947004993128136]
[0.8946809840672564]
[0.8955393711643626]
[0.8938216956856259]
[0.8915949936123974]


[I 2021-12-05 01:15:54,635] Trial 17 finished with value: 0.8936878814226805 and parameters: {'iterations': 223, 'depth': 6, 'learning_rate': 0.06809065938052149, 'l2_leaf_reg': 6.60851368744403, 'max_ctr_complexity': 0, 'random_strength': 29, 'bagging_temperature': 0.02814502549634434}. Best is trial 11 with value: 0.9035761295689717.


[0.8944903955159518]
[0.8984370608900761]
[0.8994182218479696]
[0.8992238417446656]
[0.8981263392663971]
[0.8997632599322839]
[0.9000740652921744]
[0.9014576547980663]
[0.8993065219524824]
[0.8978512411053924]


[I 2021-12-05 01:21:55,985] Trial 18 finished with value: 0.8993910552896096 and parameters: {'iterations': 323, 'depth': 8, 'learning_rate': 0.04345414855512785, 'l2_leaf_reg': 8.875071082118854, 'max_ctr_complexity': 3, 'random_strength': 49, 'bagging_temperature': 0.16549058284818846}. Best is trial 11 with value: 0.9035761295689717.


[0.9002523460665881]
[0.8892994902736198]
[0.8912441098070047]
[0.8902822759611426]
[0.8899244046529635]
[0.8914221563321941]
[0.8910003880853146]
[0.892670272490796]
[0.8908232136994224]
[0.8879208986160585]


[I 2021-12-05 01:25:53,144] Trial 19 finished with value: 0.8906224284557824 and parameters: {'iterations': 247, 'depth': 4, 'learning_rate': 0.011717295742002982, 'l2_leaf_reg': 6.702777469099536, 'max_ctr_complexity': 5, 'random_strength': 98, 'bagging_temperature': 1.2461429312831218}. Best is trial 11 with value: 0.9035761295689717.


[0.8916370746393077]
[0.8986369598840844]
[0.899543427809972]
[0.8992974519708745]
[0.8985555545328219]
[0.900076368956261]
[0.9002577223193445]
[0.9019533067318684]
[0.899522974316256]
[0.8979681180681712]


[I 2021-12-05 01:29:30,666] Trial 20 finished with value: 0.8996309861695024 and parameters: {'iterations': 122, 'depth': 8, 'learning_rate': 0.0885529564249055, 'l2_leaf_reg': 9.872596990037772, 'max_ctr_complexity': 6, 'random_strength': 63, 'bagging_temperature': 0.13378459350565755}. Best is trial 11 with value: 0.9035761295689717.


[0.900497977105371]
[0.9011733873904659]
[0.9019690454941827]
[0.9015165572582776]
[0.9005758946751182]
[0.9023057509053053]
[0.9026194002248668]
[0.9037222730469179]
[0.9018609543704879]
[0.9002381616965216]


[I 2021-12-05 01:34:03,688] Trial 21 finished with value: 0.9018745897561129 and parameters: {'iterations': 211, 'depth': 6, 'learning_rate': 0.09929375041920775, 'l2_leaf_reg': 7.419679047846429, 'max_ctr_complexity': 8, 'random_strength': 73, 'bagging_temperature': 0.0681445643077473}. Best is trial 11 with value: 0.9035761295689717.


[0.9027644724989857]


In [74]:
# select the best trial parameters from the study
tuned_params = study.best_trial.params
print(tuned_params)

{'iterations': 293, 'depth': 10, 'learning_rate': 0.09902880913698733, 'l2_leaf_reg': 9.863411715860835, 'max_ctr_complexity': 5, 'random_strength': 45, 'bagging_temperature': 0.044726461496793435}


Now that we have the tuned parameters, we will use them to fit 5-fold cross-validation models and obtain out-of-fold predictions for the training data. We can then compare performance on these predictions with performance on the holdout sample (the test data created at the beginning with train_test_split, which was not used in the hyperparameter search).

The cross_val_predict function below trains the classifier with the tuned parameters on each of the 5 folds and returns the predictions for each held-out fold. Once this process completes, we will have predicted values for the whole training sample, produced by 5 different models.

In [55]:
# function to predict the training target using k-fold cross validation and the parameters tuned with Bayesian optimization above

def cross_val_predict(folds, tuned_params):
      
    kf = KFold(n_splits=folds, random_state=21, shuffle=True)
    preds = []
    true = []
    n = 0

    for train_idx, val_idx in kf.split(X_train):
        n +=1
        print("Training fold #{}".format(n))
        train_x, val_x = X_train.iloc[train_idx], X_train.iloc[val_idx]
        train_y, val_y = y_train.iloc[train_idx], y_train.iloc[val_idx]

        cat_cols = X_train.columns.values.tolist()

        train_pool = Pool(data = train_x,
                    label = train_y,
                    cat_features = cat_cols)

        val_pool = Pool(data = val_x,
                    label = val_y,
                    cat_features = cat_cols)

        model = CatBoostClassifier(
            **tuned_params,
            loss_function= 'Logloss',
            #eval_metric='AUC',
            task_type='GPU',
            #early_stopping_rounds=50,
            random_seed=21,
            )
    
        model.fit(train_pool, 
                #eval_set=val_pool,
                verbose=50)
        y_preds = model.predict_proba(val_pool)
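        # note: this averages both class columns of predict_proba, which always gives 0.5;
        # np.mean(y_preds[:, 1]) would give the mean predicted arrest probability instead
        print("The mean prediction is: {}".format(np.mean(y_preds)))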
        print("AUC score is: {}".format(eval_metric(val_pool.get_label(), y_preds[:,1], 'AUC')))
        preds.append(y_preds[:,1])
        true.append(val_y)
        
    return np.vstack(preds), np.vstack(true)
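
The key property of out-of-fold prediction is that every training row is scored exactly once, by a model that never saw that row. The NumPy sketch below demonstrates this with a trivial stand-in "model" that predicts the positive rate of its training rows; the data and the stand-in model are made up for illustration and are unrelated to CatBoost.

```python
import numpy as np

rng = np.random.default_rng(21)
y_demo = rng.integers(0, 2, size=20)       # made-up binary labels
indices = rng.permutation(len(y_demo))     # shuffle once, like shuffle=True
folds = np.array_split(indices, 5)

oof_pred = np.full(len(y_demo), np.nan)
for val_idx in folds:
    train_idx = np.setdiff1d(indices, val_idx)
    # Stand-in "model": predict the positive rate of the training rows
    oof_pred[val_idx] = y_demo[train_idx].mean()

# Every row received exactly one prediction from a model that never saw it
print(np.isnan(oof_pred).sum())  # 0
```

Writing predictions back by index, as done here, also keeps them aligned with the original row order, which is convenient when folds are not all the same size.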


In [56]:
%%time
train_data_preds, train_data_true = cross_val_predict(folds=5, tuned_params=tuned_params)

Training fold #1
0:	learn: 0.6070195	total: 125ms	remaining: 36.5s
50:	learn: 0.3163955	total: 6.68s	remaining: 31.7s
100:	learn: 0.3118849	total: 12.8s	remaining: 24.4s
150:	learn: 0.3084177	total: 19s	remaining: 17.9s
200:	learn: 0.3027622	total: 25.7s	remaining: 11.8s
250:	learn: 0.2992880	total: 32.3s	remaining: 5.4s
292:	learn: 0.2968334	total: 37.9s	remaining: 0us
The mean prediction is: 0.5
AUC score is: [0.9028937406793534]
Training fold #2
0:	learn: 0.6074046	total: 127ms	remaining: 37s
50:	learn: 0.3157671	total: 6.57s	remaining: 31.2s
100:	learn: 0.3113085	total: 12.9s	remaining: 24.4s
150:	learn: 0.3082973	total: 19.1s	remaining: 18s
200:	learn: 0.3030123	total: 25.7s	remaining: 11.8s
250:	learn: 0.2992543	total: 32.5s	remaining: 5.43s
292:	learn: 0.2966949	total: 38s	remaining: 0us
The mean prediction is: 0.4999999999999999
AUC score is: [0.9025781930450302]
Training fold #3
0:	learn: 0.6074503	total: 125ms	remaining: 36.5s
50:	learn: 0.3167052	total: 6.55s	remaining: 31.1

In [72]:
train_data_preds = np.concatenate(train_data_preds, axis = 0)
train_data_true = np.concatenate(train_data_true, axis = 0)

Finally, we fit the final model on the full training sample and use it to predict values for the holdout (the test sample, which was not used in the hyperparameter search or in cross_val_predict).

In [70]:
# create train/test pools for catboost classifier

cat_cols = X.columns.values.tolist()

train_data = Pool(data = X_train,
                  label = y_train,
                  cat_features = cat_cols)

test_data = Pool(data = X_test,
                    label = y_test,
                    cat_features = cat_cols)


In [65]:
final_model = CatBoostClassifier(
            **tuned_params,
            loss_function= 'Logloss',
            task_type='GPU',
            random_seed=21)

In [66]:
# fit the final model 
final_model.fit(train_data,
                verbose=10)

0:	learn: 0.6072652	total: 147ms	remaining: 43s
10:	learn: 0.3514783	total: 1.58s	remaining: 40.6s
20:	learn: 0.3263344	total: 3.18s	remaining: 41.2s
30:	learn: 0.3203455	total: 4.7s	remaining: 39.7s
40:	learn: 0.3181676	total: 6.23s	remaining: 38.3s
50:	learn: 0.3163639	total: 7.82s	remaining: 37.1s
60:	learn: 0.3154713	total: 9.43s	remaining: 35.9s
70:	learn: 0.3147162	total: 10.9s	remaining: 34.1s
80:	learn: 0.3139279	total: 12.4s	remaining: 32.5s
90:	learn: 0.3130938	total: 14.1s	remaining: 31.2s
100:	learn: 0.3123880	total: 15.7s	remaining: 29.9s
110:	learn: 0.3117915	total: 17.1s	remaining: 28.1s
120:	learn: 0.3112879	total: 18.7s	remaining: 26.6s
130:	learn: 0.3106964	total: 20.3s	remaining: 25.1s
140:	learn: 0.3101486	total: 21.7s	remaining: 23.3s
150:	learn: 0.3092162	total: 23.2s	remaining: 21.8s
160:	learn: 0.3085636	total: 24.6s	remaining: 20.2s
170:	learn: 0.3073701	total: 26.2s	remaining: 18.7s
180:	learn: 0.3062996	total: 27.8s	remaining: 17.2s
190:	learn: 0.3052282	tota

<catboost.core.CatBoostClassifier at 0x7f8545fa68e0>

In [71]:
# Get predictions for test data

test_data_preds = final_model.predict_proba(test_data)

Let's compute some performance metrics on the training and test predictions to evaluate our model.
We will use the following evaluation metrics, which are commonly used for classification problems:
 1. AUC
 2. Logloss
 3. Accuracy
 4. F1 Score

In [92]:
# Compute AUC score

auc_train = eval_metric(train_data_true, train_data_preds, 'AUC')
auc_test = eval_metric(test_data.get_label(), test_data_preds[:,1], 'AUC')
print("Train AUC:{}".format(auc_train))
print("Test AUC:{}".format(auc_test))

Train AUC:[0.9033115522547814]
Test AUC:[0.9037158947782884]


The train and test AUCs are both greater than 0.90, which means our model is good at discriminating cases that result in an arrest from those that do not.
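
One way to read an AUC value is as the probability that a randomly chosen arrest case receives a higher score than a randomly chosen non-arrest case. The sketch below computes AUC directly from that definition on a tiny made-up example; it is quadratic in the number of rows and meant only as an illustration, not as a replacement for a library implementation.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ordered correctly
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Because AUC depends only on the ranking of the scores, it is unaffected by where a classification threshold is placed.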

We used Logloss as our loss function, so the model was trained to minimize it; lower values are better. Beyond that, the raw Logloss values are hard to interpret on their own.

In [95]:
# Logloss

Logloss_train = eval_metric(train_data_true, train_data_preds, 'Logloss')
Logloss_test = eval_metric(test_data.get_label(), test_data_preds[:,1], 'Logloss')
print("Train Logloss:{}".format(Logloss_train))
print("Test Logloss:{}".format(Logloss_test))


Train Logloss:[0.6636908444307708]
Test Logloss:[0.6639550261296439]
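
One way to give a raw Logloss value some context is to compare it with a naive baseline that always predicts the overall arrest rate. The sketch below computes Logloss from probabilities in plain Python and evaluates that baseline using the class counts shown earlier; the helper `log_loss` is written here for illustration and is not the CatBoost function used above.

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# Baseline: always predict the overall arrest rate from y.value_counts()
base_rate = 293348 / 1_000_000
baseline = -(base_rate * math.log(base_rate)
             + (1 - base_rate) * math.log(1 - base_rate))
print(round(baseline, 3))
```

The baseline comes out at roughly 0.605. A model with informative probabilities should score below it, so a reported value above the baseline is worth investigating, for example by checking whether the metric function expects probabilities or raw scores (this possible cause is an assumption, not something verified here).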


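The reported accuracy is low. Note, however, that 0.293348 is exactly the share of arrests in the data, which is the accuracy obtained by predicting an arrest for every case. This suggests that eval_metric is thresholding the supplied probabilities as if they were raw scores, so the figure probably does not reflect the model's real accuracy.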

In [97]:
# Accuracy

Accuracy_train = eval_metric(train_data_true, train_data_preds, 'Accuracy')
Accuracy_test = eval_metric(test_data.get_label(), test_data_preds[:,1], 'Accuracy')
print("Train Accuracy:{}".format(Accuracy_train))
print("Test Accuracy:{}".format(Accuracy_test))

Train Accuracy:[0.293348]
Test Accuracy:[0.293348]


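The reported F1 score is also low. It equals the F1 obtained when every case is predicted as an arrest (precision of about 0.293 and recall of 1), which suggests the probabilities are being thresholded as raw scores, so this value should be read with caution.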

In [103]:
# F1

F1_train = eval_metric(train_data_true, train_data_preds, 'F1')
F1_test = eval_metric(test_data.get_label(), test_data_preds[:,1], 'F1')
print("Train F1:{}".format(F1_train))
print("Test F1:{}".format(F1_test))

Train F1:[0.45362578362513417]
Test F1:[0.45362578362513417]
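
As a sanity check on the accuracy and F1 values above, we can compute what a trivial classifier that predicts an arrest for every case would score on the training set, using the class counts from y_train.value_counts(). The interpretation in the closing note (that the metric function thresholds its input at 0, as it would for raw scores) is an assumption and has not been verified here.

```python
# Training-set class counts from y_train.value_counts()
n_neg, n_pos = 529989, 220011

# Confusion counts if every case is predicted as an arrest
tp, fp, fn, tn = n_pos, n_neg, 0, 0

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy)
print(f1)
```

Both numbers agree with the reported metrics, which is what we would expect if all probabilities (being greater than 0) were counted as positive predictions. Thresholding the probabilities at 0.5 before computing accuracy and F1 would be one way to check this.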


Looking at the evaluation metrics, the training and test values are very close, which means the model predicts about as well on the unseen test data as it does on the out-of-fold training data. We can conclude that the model does not overfit the training dataset.