In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import catboost as ctb

pd.set_option('display.max_columns', 100)

## Train Model, get best Hyperparameters and get Test Predictions


This notebook can be used to train the model and get the predictions to the default condition of the samples given.


Steps:

1. Load data previously preprocessed
2. Create model 
3. Hyperparameter tuning
4. Test Prediction gathering

### Import Data

In [5]:
cat_features = ['account_status', 'account_worst_status_0_3m', 'account_worst_status_12_24m', 'account_worst_status_3_6m', 'account_worst_status_6_12m', 'merchant_category', 'merchant_group', 'name_in_email', 'status_last_archived_0_24m', 'status_2nd_last_archived_0_24m', 'status_3rd_last_archived_0_24m', 'status_max_archived_0_6_months', 'status_max_archived_0_12_months', 'status_max_archived_0_24_months', 'worst_status_active_inv']
cols_to_drop = ['default', 'uuid']

train_df = pd.read_csv('../dataset/train.csv')
test_df = pd.read_csv('../dataset/test.csv')

train_df[cat_features] = train_df[cat_features].astype(str)
test_df[cat_features] = test_df[cat_features].astype(str)


### Split Data

In [6]:
X_train = train_df.drop(cols_to_drop, axis=1)
y_train = train_df.default

### Calculate Weight Classes for unbalanced data

In [7]:
pos_class_multiplier = len([x for x in y_train if x == 0])/len([x for x in y_train if x == 1])
round(pos_class_multiplier, 1)

68.9

### Catboost
Catboost is a gradient boosting tree machine model algorithm. It can be used for regression, classification and ranking purpose. We will use it as a binary classification model.

**Model properties**:
- Being it a Gradient Boosting tree model, the data do not need to be normalized.
- Also, being catboost able to handle categorical variables, we will let it learn the best representation to the categories, taking into account what we observed in our dataset in the exploration part (the cardinality of categories, etc.)

**Metric choice**:

Being this a default problem, we have to carefully decide for which metric to optimize, especially because the data has unbalanced classes. In Default prediction problems, we are interested in optimizing the metrics of recall (number of correctly classified defaults divided by the real defaults), hence to capture as many real defaults as possible. 

Of course, we would like to optimize as well the precision for the class(number of correctly classified defaults divided by the predicted defaults), but on a smaller extent. 
- Since we are interested on the **probability** of default as well, as primary metric to be optimized we wil use **AUC (Area Under the Curve).**
- As second metric, however, we use the **Recall**. This means, that a very small improvements of AUC (less than 0.01% inferior to the best AUC), we will prioritize **Recall**.



In [8]:
prop = {
    'iterations': 20,
    'learning_rate': 0.2,
    'max_depth': 5
}

# prop_hyperparams_tuning = 

In [9]:
model = ctb.CatBoostClassifier(**prop)

model.fit(X_train, y_train, cat_features=[X_train.columns.tolist().index(x) for x in cat_features],)

0:	learn: 0.3608925	total: 85.3ms	remaining: 1.62s
1:	learn: 0.2110240	total: 113ms	remaining: 1.01s
2:	learn: 0.1402432	total: 139ms	remaining: 787ms
3:	learn: 0.1025079	total: 163ms	remaining: 652ms
4:	learn: 0.0857504	total: 187ms	remaining: 560ms
5:	learn: 0.0771777	total: 208ms	remaining: 484ms
6:	learn: 0.0709211	total: 230ms	remaining: 428ms
7:	learn: 0.0673333	total: 257ms	remaining: 385ms
8:	learn: 0.0639148	total: 279ms	remaining: 340ms
9:	learn: 0.0619321	total: 307ms	remaining: 307ms
10:	learn: 0.0606679	total: 335ms	remaining: 274ms
11:	learn: 0.0595902	total: 363ms	remaining: 242ms
12:	learn: 0.0590011	total: 390ms	remaining: 210ms
13:	learn: 0.0585283	total: 412ms	remaining: 177ms
14:	learn: 0.0581004	total: 436ms	remaining: 145ms
15:	learn: 0.0577268	total: 459ms	remaining: 115ms
16:	learn: 0.0573642	total: 481ms	remaining: 84.9ms
17:	learn: 0.0567881	total: 505ms	remaining: 56.1ms
18:	learn: 0.0562943	total: 536ms	remaining: 28.2ms
19:	learn: 0.0559334	total: 562ms	rem

<catboost.core.CatBoostClassifier at 0x7f8db15a9c40>

In [10]:
y_preds = model.predict(test_df.drop(cols_to_drop, axis=1))
y_proba = model.predict_proba(test_df.drop(cols_to_drop, axis=1))

In [11]:
[x for x in y_preds if x ==1]

[1.0, 1.0, 1.0]

### Hyperparameter Tuning

In order to reduce Overfitting, we need to perform Hyperparameter Tuning.
For this task, we decide to implement a Random Validator class iterating over a distribution of values over certain parameters. The random validator is based on a **Cross-Validation over N=10 folds.**

Below we create the class able to run Hyperparameter Tuning of the model

In [15]:
import gc
from random import randint
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import recall_score, make_scorer, roc_auc_score, f1_score, classification_report
from sklearn.datasets import make_classification

class RandomValidator:
    """
    searches the parameter space randomly.
    params can be specified directly in the recommender class that I want to validate 
    (check out one that has those already specified)
    convention: range in tuple -> consider all the possible value in between
                range in list  -> consider just the values in the list
    """

    def __init__(self, train_df, labels, fixed_params, searching_params, num_folds=10, num_iterations=10, export=True, cat_features=[], granularity=1):
        self.train_df = train_df
        self.labels = labels
        self.fixed_params = fixed_params
        self.searching_params = searching_params
        self.num_iterations = num_iterations
        self.num_folds = num_folds
        self.export = export
        self.cat_features = cat_features
        self.granularity = granularity
        self.best_params = None
        
        # initialize crossvalidation
        
        # create dataset
        
        
    
    def update_parameter(self, param, value):
        print('Not implemented yet')
        
    def run_cv_iteration(self, model, skf):
        sum_auc = 0
        sum_recall = 0
        sum_f1 = 0
        
        for train_index, test_index in skf.split(self.train_df, self.labels):
            X_train, X_test = self.train_df.iloc[train_index], self.train_df.iloc[test_index]
            y_train, y_test = self.labels[train_index], self.labels[test_index]
            
            # train & measure

            model.fit(X_train, y_train, cat_features=[X_train.columns.tolist().index(x) for x in cat_features], verbose = 0)
            
            # Predict & eval
            y_pred = model.predict(X_test)
            y_proba = model.predict_proba(X_test)[:, 1]
            
            auc = roc_auc_score(y_test, y_proba)
            recall = recall_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred, average='weighted')
            
            sum_auc += auc
            sum_recall += recall
            sum_f1 += f1
            
            # print(classification_report(y_test, y_pred))
        
        auc_avg = sum_auc/self.num_folds
        recall_avg = sum_recall/self.num_folds
        f1_avg = sum_f1/self.num_folds
        
        print('### FINAL CV average SCORES ###')
        print('AUC avg:', round(auc_avg, 3))
        print('recall at fold:', round(recall_avg, 3))
        print('f1 avg:', round(f1_avg, 3))
        
        return auc_avg, recall_avg
        
    def sample(self, obj):
        if type(obj) == tuple:
            low_bound = obj[0]
            upp_bound = obj[1]
            rand_numb = randint(0, self.granularity)
            step = (upp_bound - low_bound)/self.granularity
            return low_bound + step*rand_numb
        elif type(obj) == list:
            return obj[randint(0, len(obj)-1)] 


    def get_best_params(self):
        return self.best_params
        
        
    def run(self):
        
        # prepare the cross-validation procedure
        skf = StratifiedKFold(n_splits=self.num_folds, shuffle=False, random_state=42)
        
        dict_results = {'params': [], 'auc': [], 'recall': []}
        
        for i in range(self.num_iterations):
            
            # sample a random parameter from the dictionary
            sampled_params = {}
            for key, value in self.searching_params.items():
                sampled_params[key] = self.sample(value)

            params_dict = {**self.fixed_params, **sampled_params}
            
            print('Validating with params_dict', params_dict)
            
            model = ctb.CatBoostClassifier(**params_dict)
            
            auc_avg, recall_avg = self.run_cv_iteration(model, skf)
            
            dict_results['params'].append(params_dict)
            dict_results['auc'].append(auc_avg)
            dict_results['recall'].append(recall_avg)
            
            if (auc_avg >= 0.999 * np.max(dict_results['auc'])):
                # get idx of list with max auc
                idxs_max_auc = [i for i in range(len(dict_results['auc'])) if auc_avg >= 0.999*dict_results['auc'][i]]
                
                recalls_lists = [dict_results['recall'][i] for i in idxs_max_auc]
                
                if recall_avg == np.max(recalls_lists):
                    print('#### Found new best configuration #### ')
                    self.best_params = params_dict
            
            
            #if self.export != None:
            #    self.export.check_if_export(score, params_dict)



We declare the parameters to be searched, and the fixed parameters

In [16]:
searching_params = {
    'max_depth': [7], 
    'reg_lambda': [96], 
    'one_hot_max_size': [35], 
    'model_size_reg': [1], 
    'max_ctr_complexity': [3],
    'learning_rate': [0.015, 0.02, 0.025, 0.03, 0.035],
    'iterations': [200, 225, 250, 275, 300, 350, 400]
    #'simple_ctr': ['Counter', 'Buckets'],
    #'combinations_ctr': ['Counter', 'Buckets'] 
}

fixed_params = {
    'class_weights': (1, pos_class_multiplier)
}

In [17]:
ctb_cv_validator = RandomValidator(X_train, y_train, fixed_params = fixed_params, searching_params = searching_params, num_folds=5, num_iterations=3, granularity=10)

ctb_cv_validator.run()



Validating with params_dict {'class_weights': (1, 68.85714285714286), 'max_depth': 7, 'reg_lambda': 96, 'one_hot_max_size': 35, 'model_size_reg': 1, 'max_ctr_complexity': 3, 'learning_rate': 0.035, 'iterations': 350}
### FINAL CV average SCORES ###
AUC avg: 0.91
recall at fold: 0.816
f1 avg: 0.896
#### Found new best configuration #### 
Validating with params_dict {'class_weights': (1, 68.85714285714286), 'max_depth': 7, 'reg_lambda': 96, 'one_hot_max_size': 35, 'model_size_reg': 1, 'max_ctr_complexity': 3, 'learning_rate': 0.035, 'iterations': 400}
### FINAL CV average SCORES ###
AUC avg: 0.909
recall at fold: 0.808
f1 avg: 0.9
Validating with params_dict {'class_weights': (1, 68.85714285714286), 'max_depth': 7, 'reg_lambda': 96, 'one_hot_max_size': 35, 'model_size_reg': 1, 'max_ctr_complexity': 3, 'learning_rate': 0.03, 'iterations': 400}
### FINAL CV average SCORES ###
AUC avg: 0.91
recall at fold: 0.819
f1 avg: 0.895
#### Found new best configuration #### 


### Train and save final model

After performing Hyperparameter tuning, we observe that the best iteration has been performed over the hyperparameters:
    {
        learning_rate: 0.035,
        iterations: 250,
        max_depth: 7
    }

Thus, we are able to train our final model fitting the entire training dataset and we store the dataset as a pkl file ready for the deployment

In [18]:
# Fit the best model configuration found
best_params = ctb_cv_validator.get_best_params()

final_model = ctb.CatBoostClassifier(**best_params)

final_model.fit(X_train, y_train, cat_features=[X_train.columns.tolist().index(x) for x in cat_features],)

0:	learn: 0.6776589	total: 34.1ms	remaining: 13.6s
1:	learn: 0.6619712	total: 64ms	remaining: 12.7s
2:	learn: 0.6486929	total: 93.1ms	remaining: 12.3s
3:	learn: 0.6349551	total: 119ms	remaining: 11.8s
4:	learn: 0.6238422	total: 148ms	remaining: 11.7s
5:	learn: 0.6117459	total: 180ms	remaining: 11.8s
6:	learn: 0.6026692	total: 206ms	remaining: 11.6s
7:	learn: 0.5931476	total: 239ms	remaining: 11.7s
8:	learn: 0.5842677	total: 272ms	remaining: 11.8s
9:	learn: 0.5757203	total: 306ms	remaining: 11.9s
10:	learn: 0.5676024	total: 337ms	remaining: 11.9s
11:	learn: 0.5591030	total: 367ms	remaining: 11.9s
12:	learn: 0.5519810	total: 397ms	remaining: 11.8s
13:	learn: 0.5454695	total: 429ms	remaining: 11.8s
14:	learn: 0.5390338	total: 462ms	remaining: 11.9s
15:	learn: 0.5341243	total: 494ms	remaining: 11.9s
16:	learn: 0.5283867	total: 523ms	remaining: 11.8s
17:	learn: 0.5232529	total: 553ms	remaining: 11.7s
18:	learn: 0.5163119	total: 584ms	remaining: 11.7s
19:	learn: 0.5117854	total: 613ms	remain

165:	learn: 0.3532898	total: 5.03s	remaining: 7.09s
166:	learn: 0.3529576	total: 5.06s	remaining: 7.06s
167:	learn: 0.3526801	total: 5.09s	remaining: 7.03s
168:	learn: 0.3523101	total: 5.12s	remaining: 7s
169:	learn: 0.3521711	total: 5.15s	remaining: 6.96s
170:	learn: 0.3518436	total: 5.18s	remaining: 6.94s
171:	learn: 0.3517113	total: 5.21s	remaining: 6.9s
172:	learn: 0.3512645	total: 5.24s	remaining: 6.88s
173:	learn: 0.3510171	total: 5.27s	remaining: 6.85s
174:	learn: 0.3508730	total: 5.3s	remaining: 6.81s
175:	learn: 0.3507751	total: 5.32s	remaining: 6.78s
176:	learn: 0.3503695	total: 5.36s	remaining: 6.75s
177:	learn: 0.3500557	total: 5.38s	remaining: 6.71s
178:	learn: 0.3498229	total: 5.41s	remaining: 6.68s
179:	learn: 0.3494684	total: 5.45s	remaining: 6.66s
180:	learn: 0.3492766	total: 5.48s	remaining: 6.63s
181:	learn: 0.3490437	total: 5.51s	remaining: 6.6s
182:	learn: 0.3486843	total: 5.54s	remaining: 6.57s
183:	learn: 0.3483478	total: 5.57s	remaining: 6.54s
184:	learn: 0.3481

330:	learn: 0.3196497	total: 9.94s	remaining: 2.07s
331:	learn: 0.3196469	total: 9.96s	remaining: 2.04s
332:	learn: 0.3193371	total: 9.99s	remaining: 2.01s
333:	learn: 0.3188650	total: 10s	remaining: 1.98s
334:	learn: 0.3185912	total: 10.1s	remaining: 1.95s
335:	learn: 0.3184781	total: 10.1s	remaining: 1.92s
336:	learn: 0.3182470	total: 10.1s	remaining: 1.89s
337:	learn: 0.3179318	total: 10.1s	remaining: 1.86s
338:	learn: 0.3177060	total: 10.2s	remaining: 1.83s
339:	learn: 0.3173543	total: 10.2s	remaining: 1.8s
340:	learn: 0.3170735	total: 10.2s	remaining: 1.77s
341:	learn: 0.3168249	total: 10.3s	remaining: 1.74s
342:	learn: 0.3166285	total: 10.3s	remaining: 1.71s
343:	learn: 0.3164182	total: 10.3s	remaining: 1.68s
344:	learn: 0.3161501	total: 10.4s	remaining: 1.65s
345:	learn: 0.3156999	total: 10.4s	remaining: 1.62s
346:	learn: 0.3155028	total: 10.4s	remaining: 1.59s
347:	learn: 0.3153130	total: 10.4s	remaining: 1.56s
348:	learn: 0.3151041	total: 10.5s	remaining: 1.53s
349:	learn: 0.3

<catboost.core.CatBoostClassifier at 0x7f8d59ab7f10>

In [19]:
# Get prediction probabilities of the test entries
test_preds = model.predict_proba(test_df[X_train.columns])[:, 1]

prediction_df = pd.DataFrame({'uuid': test_df.uuid, 'pd': test_preds})

# save prediction file
prediction_df.to_csv('../output/predictions.csv', index=False)

# display predictions
prediction_df.head()

Unnamed: 0,uuid,pd
0,6f6e6c6a-2081-4e6b-8eb3-4fd89b54b2d7,0.005663
1,f6f6d9f3-ef2b-4329-a388-c6a687f27e70,0.021215
2,e9c39869-1bc5-4375-b627-a2df70b445ea,0.002246
3,6beb88a3-9641-4381-beb6-c9a208664dd0,0.012482
4,bb89b735-72fe-42a4-ba06-d63be0f4ca36,0.050293


### Save Model as a Pickle file


In [20]:
import pickle

with open('final_model.pkl', 'wb') as f:
    pickle.dump(final_model, f)

with open('final_model.pkl', 'rb') as f:
    model_loaded = pickle.load(f)
    
model_loaded

<catboost.core.CatBoostClassifier at 0x7f8dc1bdd3d0>

In [21]:
model_loaded.get_params()

{'iterations': 400,
 'learning_rate': 0.03,
 'model_size_reg': 1,
 'max_ctr_complexity': 3,
 'class_weights': (1, 68.85714285714286),
 'one_hot_max_size': 35,
 'max_depth': 7,
 'reg_lambda': 96}