# Mastering Gradient Boosting with CatBoost

In this tutorial we will use the Amazon Employee Access Challenge dataset from a [Kaggle](https://www.kaggle.com) competition for our experiments. [Here](https://www.kaggle.com/c/amazon-employee-access-challenge/data) is the link to the challenge that we will be exploring.

## Libraries installation

In [None]:
#!pip install --user --upgrade catboost
#!pip install --user --upgrade ipywidgets
#!pip install shap
#!pip install scikit-learn
#!jupyter nbextension enable --py widgetsnbextension

In [1]:
#import modules

import os
import pandas as pd
import numpy as np
np.set_printoptions(precision=4)

import catboost
print(catboost.__version__) #catboost version

1.0.3


## Reading the data

In [2]:
from catboost.datasets import amazon

# If you have "URLError: SSL: CERTIFICATE_VERIFY_FAILED" uncomment next two lines:
# import ssl
# ssl._create_default_https_context = ssl._create_unverified_context

# If you have any other error:
# Download datasets from http://bit.ly/2ZUXTSv and uncomment next line:
# train_df = pd.read_csv('train.csv', sep=',', header='infer')

(train_df, test_df) = amazon()

In [3]:
train_df.head()

| | ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | ROLE_FAMILY | ROLE_CODE |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 | 117906 | 290919 | 117908 |
| 1 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 | 118536 | 308574 | 118539 |
| 2 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 | 267952 | 19721 | 117880 |
| 3 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 | 240983 | 290919 | 118322 |
| 4 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 | 123932 | 19793 | 119325 |


The `ACTION` column is the one we want to predict: whether a given person will be granted a given role.

All the features are numeric, but they are really hashes, so we will treat them as categorical features.

## Exploring the data

Label values extraction

In [4]:
y = train_df.ACTION #target vector
X = train_df.drop('ACTION', axis=1) #features vector

Categorical features declaration

In [5]:
cat_features = list(range(0, X.shape[1]))
print(cat_features)

[0, 1, 2, 3, 4, 5, 6, 7, 8]


Looking at the label balance in the dataset

In [6]:
print('Labels: {}'.format(set(y)))
print('Zero count = {}, One count = {}'.format(len(y) - sum(y), sum(y)))

Labels: {0, 1}
Zero count = 1897, One count = 30872


In [7]:
#another way to count target classes

y.value_counts()

1    30872
0     1897
Name: ACTION, dtype: int64

The target is highly imbalanced: about 94% of the rows belong to class 1.
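
With this much imbalance, plain accuracy is a weak yardstick, because a model that always answers 1 already scores very high. The following is a minimal sketch of that arithmetic, using the counts printed above; the name `majority_baseline_accuracy` is introduced here only for illustration.

```python
# Class counts copied from the output above.
zero_count = 1897
one_count = 30872
total = zero_count + one_count

# Share of the positive class in the training data.
positive_share = one_count / total

# A trivial model that always predicts the majority class is right exactly
# when the label is the majority class, so its accuracy equals that share.
majority_baseline_accuracy = max(zero_count, one_count) / total

print('Positive share: {:.4f}'.format(positive_share))
print('Always-predict-1 accuracy: {:.4f}'.format(majority_baseline_accuracy))
```

This is one reason the later sections also track Logloss and AUC rather than accuracy alone.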

# Training the first model

In [8]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(iterations=100) #instantiate the CatBoost class
model.fit(X, y, cat_features=cat_features, verbose=10) #fit the model

Learning rate set to 0.377604
0:	learn: 0.4528598	total: 182ms	remaining: 18s
10:	learn: 0.1744186	total: 343ms	remaining: 2.77s
20:	learn: 0.1676119	total: 503ms	remaining: 1.89s
30:	learn: 0.1652446	total: 658ms	remaining: 1.46s
40:	learn: 0.1633644	total: 847ms	remaining: 1.22s
50:	learn: 0.1621892	total: 1s	remaining: 966ms
60:	learn: 0.1609164	total: 1.18s	remaining: 753ms
70:	learn: 0.1594572	total: 1.35s	remaining: 551ms
80:	learn: 0.1585876	total: 1.5s	remaining: 353ms
90:	learn: 0.1573593	total: 1.67s	remaining: 165ms
99:	learn: 0.1566977	total: 1.81s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x21cdeaca760>

The `cat_features` parameter takes either the indices or the names of the categorical features.

In [9]:
from catboost import CatBoostClassifier
model_1 = CatBoostClassifier(iterations=100) #instantiate the CatBoost class
model_1.fit(X, y, cat_features=list(train_df.drop('ACTION', axis=1).columns), verbose=10) #fit the model

Learning rate set to 0.377604
0:	learn: 0.4528598	total: 11ms	remaining: 1.09s
10:	learn: 0.1744186	total: 157ms	remaining: 1.27s
20:	learn: 0.1676119	total: 361ms	remaining: 1.36s
30:	learn: 0.1652446	total: 521ms	remaining: 1.16s
40:	learn: 0.1633644	total: 684ms	remaining: 984ms
50:	learn: 0.1621892	total: 843ms	remaining: 810ms
60:	learn: 0.1609164	total: 1.07s	remaining: 687ms
70:	learn: 0.1594572	total: 1.28s	remaining: 523ms
80:	learn: 0.1585876	total: 1.47s	remaining: 345ms
90:	learn: 0.1573593	total: 1.62s	remaining: 160ms
99:	learn: 0.1566977	total: 1.77s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x21cded3d220>

Passing the column names works as well: the training log is identical to the one obtained with indices.

In [10]:
#for every instance we get the probability of each class

model.predict_proba(X)

array([[0.0098, 0.9902],
       [0.0101, 0.9899],
       [0.0579, 0.9421],
       ...,
       [0.0118, 0.9882],
       [0.1891, 0.8109],
       [0.0235, 0.9765]])

# Working with dataset

There are several ways to pass a dataset to training: using `X, y` (the initial matrix) or using the `Pool` class.
`Pool` is the class that stores the dataset. In the next few blocks we'll explore the ways to create a `Pool` object.

Use the `Pool` class if the dataset has more than just `X` and `y` (for example, sample weights or groups), or if the dataset is large and takes a long time to read into Python.

In [11]:
from catboost import Pool
pool = Pool(data=X, label=y, cat_features=cat_features)

## Split your data into train and validation

In [12]:
from sklearn.model_selection import train_test_split

data = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_validation, y_train, y_validation = data

#train pool
train_pool = Pool(
    data=X_train, 
    label=y_train, 
    cat_features=cat_features
)

#validation pool
validation_pool = Pool(
    data=X_validation, 
    label=y_validation, 
    cat_features=cat_features
)

## Selecting the objective function

Possible options for binary classification:

`Logloss` for binary target.

`CrossEntropy` for probabilities in target.
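
To make the difference between the two options concrete, here is a hedged sketch in plain Python. It assumes both objectives share the usual negative log-likelihood formula and differ only in what the target holds (0/1 labels versus probabilities); the function `logloss` and the sample numbers are illustrative and are not CatBoost's own code.

```python
from math import log

def logloss(targets, probabilities, eps=1e-15):
    """Mean negative log-likelihood; targets may be 0/1 labels or probabilities in [0, 1]."""
    total = 0.0
    for t, p in zip(targets, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * log(p) + (1 - t) * log(1 - p))
    return total / len(targets)

# Hypothetical targets and predictions, for illustration only.
hard_labels = [1, 1, 0, 1]
soft_labels = [0.9, 0.8, 0.1, 1.0]
predicted = [0.95, 0.7, 0.2, 0.99]

print('Hard 0/1 targets:', round(logloss(hard_labels, predicted), 4))
print('Probability targets:', round(logloss(soft_labels, predicted), 4))
```

A prediction of 0.5 for a positive label costs `log(2)`, which is a handy reference point when reading the training logs below.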

In [13]:
model = CatBoostClassifier(
    iterations=5,
    learning_rate=0.1,
    # loss_function='CrossEntropy'
)
model.fit(train_pool, eval_set=validation_pool, verbose=False)

print('Model is fitted: {}'.format(model.is_fitted()))
print('Model params:\n{}'.format(model.get_params()))

Model is fitted: True
Model params:
{'iterations': 5, 'learning_rate': 0.1}


## Stdout of the training

In [17]:
model = CatBoostClassifier(
    iterations=15,
#     verbose=5,
)
model.fit(train_pool, eval_set=validation_pool);

Learning rate set to 0.441257
0:	learn: 0.4226231	test: 0.4217069	best: 0.4217069 (0)	total: 13.7ms	remaining: 192ms
1:	learn: 0.3157972	test: 0.3136469	best: 0.3136469 (1)	total: 30.9ms	remaining: 201ms
2:	learn: 0.2631196	test: 0.2603395	best: 0.2603395 (2)	total: 46ms	remaining: 184ms
3:	learn: 0.2334650	test: 0.2294580	best: 0.2294580 (3)	total: 60.1ms	remaining: 165ms
4:	learn: 0.2077060	test: 0.2017327	best: 0.2017327 (4)	total: 75.2ms	remaining: 150ms
5:	learn: 0.1961364	test: 0.1883112	best: 0.1883112 (5)	total: 89.6ms	remaining: 134ms
6:	learn: 0.1879266	test: 0.1794018	best: 0.1794018 (6)	total: 105ms	remaining: 120ms
7:	learn: 0.1841218	test: 0.1743149	best: 0.1743149 (7)	total: 115ms	remaining: 100ms
8:	learn: 0.1814626	test: 0.1698731	best: 0.1698731 (8)	total: 129ms	remaining: 86ms
9:	learn: 0.1785403	test: 0.1650335	best: 0.1650335 (9)	total: 143ms	remaining: 71.6ms
10:	learn: 0.1771678	test: 0.1634002	best: 0.1634002 (10)	total: 158ms	remaining: 57.3ms
11:	learn: 0.1762

In [18]:
model_1 = CatBoostClassifier(
    iterations=50,
    verbose=10,
)
model_1.fit(train_pool, eval_set=validation_pool)

Learning rate set to 0.26136
0:	learn: 0.5154456	test: 0.5149541	best: 0.5149541 (0)	total: 15ms	remaining: 733ms
10:	learn: 0.1946935	test: 0.1875751	best: 0.1875751 (10)	total: 148ms	remaining: 526ms
20:	learn: 0.1749604	test: 0.1626191	best: 0.1626191 (20)	total: 293ms	remaining: 405ms
30:	learn: 0.1716988	test: 0.1592033	best: 0.1592033 (30)	total: 447ms	remaining: 274ms
40:	learn: 0.1699723	test: 0.1579053	best: 0.1579053 (40)	total: 590ms	remaining: 129ms
49:	learn: 0.1688302	test: 0.1568128	best: 0.1568070 (48)	total: 714ms	remaining: 0us

bestTest = 0.1568069994
bestIteration = 48

Shrink model to first 49 iterations.


<catboost.core.CatBoostClassifier at 0x21cdeaca910>

In [19]:
model_2 = CatBoostClassifier(
    iterations=500,
    verbose=10,
)
model_2.fit(train_pool, eval_set=validation_pool)

Learning rate set to 0.095993
0:	learn: 0.5834608	test: 0.5827359	best: 0.5827359 (0)	total: 27.8ms	remaining: 13.9s
10:	learn: 0.2286866	test: 0.2231498	best: 0.2231498 (10)	total: 261ms	remaining: 11.6s
20:	learn: 0.1784534	test: 0.1657012	best: 0.1657012 (20)	total: 783ms	remaining: 17.9s
30:	learn: 0.1668992	test: 0.1496669	best: 0.1496669 (30)	total: 1.53s	remaining: 23.1s
40:	learn: 0.1614484	test: 0.1424481	best: 0.1424481 (40)	total: 2.12s	remaining: 23.7s
50:	learn: 0.1595253	test: 0.1406902	best: 0.1406902 (50)	total: 2.65s	remaining: 23.4s
60:	learn: 0.1568892	test: 0.1391132	best: 0.1391132 (60)	total: 3.36s	remaining: 24.2s
70:	learn: 0.1555521	test: 0.1384083	best: 0.1383827 (69)	total: 4.05s	remaining: 24.5s
80:	learn: 0.1538197	test: 0.1376592	best: 0.1376441 (79)	total: 4.68s	remaining: 24.2s
90:	learn: 0.1526466	test: 0.1370450	best: 0.1370450 (90)	total: 5.4s	remaining: 24.3s
100:	learn: 0.1515331	test: 0.1366341	best: 0.1366341 (100)	total: 5.95s	remaining: 23.5s
11

<catboost.core.CatBoostClassifier at 0x21cdeacaee0>

As you can see, when `learning_rate` is not set, CatBoost chooses it automatically based on the dataset and the number of iterations: about 0.441 for 15 iterations, 0.261 for 50, and 0.096 for 500.

## Metrics calculation and graph plotting

By default, CatBoost selects `Logloss` as the loss function for binary classification.

You can specify additional metrics to evaluate and plot with `custom_loss`.

In [20]:
model_3 = CatBoostClassifier(
    iterations=50,
    learning_rate=0.5,
    custom_loss=['AUC', 'Accuracy'] #new metrics
)

model_3.fit(
    train_pool,
    eval_set=validation_pool,
    verbose=False, #suppress the training log
    plot=True #draw the interactive metric chart
);

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

## Model comparison

Each model writes its training log to its own `train_dir`, so the metrics of several trained models can be plotted together for comparison.

In [21]:
#model with high learning rate: overfitting
model_4 = CatBoostClassifier(
    learning_rate=0.7,
    iterations=100,
    train_dir='learing_rate_0.7'
)

#Model with low learning rate: underfitting
model_5 = CatBoostClassifier(
    learning_rate=0.01,
    iterations=100,
    train_dir='learing_rate_0.01'
)

#fit model_4 and model_5
model_4.fit(train_pool, eval_set=validation_pool, verbose=20)
model_5.fit(train_pool, eval_set=validation_pool, verbose=20);

0:	learn: 0.3264513	test: 0.3248170	best: 0.3248170 (0)	total: 13.6ms	remaining: 1.35s
20:	learn: 0.1688825	test: 0.1574182	best: 0.1573949 (16)	total: 301ms	remaining: 1.13s
40:	learn: 0.1632884	test: 0.1582531	best: 0.1571533 (23)	total: 588ms	remaining: 846ms
60:	learn: 0.1584388	test: 0.1573279	best: 0.1569712 (52)	total: 873ms	remaining: 558ms
80:	learn: 0.1544282	test: 0.1583794	best: 0.1569712 (52)	total: 1.17s	remaining: 275ms
99:	learn: 0.1510415	test: 0.1583995	best: 0.1569712 (52)	total: 1.47s	remaining: 0us

bestTest = 0.1569712214
bestIteration = 52

Shrink model to first 53 iterations.
0:	learn: 0.6853769	test: 0.6853610	best: 0.6853610 (0)	total: 15.5ms	remaining: 1.53s
20:	learn: 0.5575578	test: 0.5568257	best: 0.5568257 (20)	total: 303ms	remaining: 1.14s
40:	learn: 0.4678112	test: 0.4663769	best: 0.4663769 (40)	total: 566ms	remaining: 815ms
60:	learn: 0.4029225	test: 0.4011544	best: 0.4011544 (60)	total: 787ms	remaining: 503ms
80:	learn: 0.3551621	test: 0.3530433	best:

In [22]:
#Plot metrics

from catboost import MetricVisualizer
MetricVisualizer(['learing_rate_0.7', 'learing_rate_0.01']).start()

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

## Best iteration

In [23]:
model_6 = CatBoostClassifier(
    iterations=100,
#     use_best_model=False
)
model_6.fit(
    train_pool,
    eval_set=validation_pool,
    verbose=False,
    plot=True
);

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

In [27]:
print('Tree count: ' + str(model_6.tree_count_))

Tree count: 82


In [25]:
model_7 = CatBoostClassifier(
    iterations=100,
    use_best_model=False
)
model_7.fit(
    train_pool,
    eval_set=validation_pool,
    verbose=False,
    plot=True
);

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

In [28]:
print('Tree count: ' + str(model_7.tree_count_))

Tree count: 100
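
The two tree counts differ because, with an evaluation set and `use_best_model` left at its default, the model is cut back to the iteration with the best validation score (compare the earlier log line "bestIteration = 48" followed by "Shrink model to first 49 iterations"). The sketch below illustrates that selection rule on a hypothetical loss curve; `best_iteration` is a helper written for this example, not a CatBoost function.

```python
def best_iteration(validation_losses):
    """Return (index of the lowest validation loss, number of trees to keep)."""
    best_index = min(range(len(validation_losses)), key=validation_losses.__getitem__)
    # Iterations are numbered from 0, so keeping iterations 0..best_index
    # means keeping best_index + 1 trees.
    return best_index, best_index + 1

# Hypothetical validation curve: it improves, then starts to overfit.
losses = [0.42, 0.31, 0.26, 0.22, 0.20, 0.21, 0.23]
index, trees_kept = best_iteration(losses)
print('Best iteration: {}, trees kept: {}'.format(index, trees_kept))
```

Under this reading, a tree count of 82 would correspond to a best iteration of 81, while `use_best_model=False` keeps all 100 trees.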


## Cross-validation

In [34]:
from catboost import cv


#params definition
params = {
    'loss_function': 'Logloss',
    'iterations': 80,
    'custom_loss': 'AUC',
    'learning_rate': 0.5,
}

#cross-validation process
cv_data = cv(
    params = params,
    pool = train_pool,
    fold_count=5,
    shuffle=True,
    partition_random_seed=0,
    plot=True,
    verbose=False
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

Training on fold [0/5]

bestTest = 0.1628504126
bestIteration = 68

Training on fold [1/5]

bestTest = 0.1608272543
bestIteration = 77

Training on fold [2/5]

bestTest = 0.1694535356
bestIteration = 12

Training on fold [3/5]

bestTest = 0.1569498918
bestIteration = 29

Training on fold [4/5]

bestTest = 0.1644437541
bestIteration = 30



In [35]:
cv_data.head(10)

| | iterations | test-Logloss-mean | test-Logloss-std | train-Logloss-mean | train-Logloss-std | test-AUC-mean | test-AUC-std |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.305831 | 2e-05 | 0.30581 | 1.4e-05 | 0.496175 | 0.014042 |
| 1 | 1 | 0.234755 | 0.00102 | 0.235427 | 0.000274 | 0.569789 | 0.030202 |
| 2 | 2 | 0.196036 | 0.002694 | 0.202469 | 0.003171 | 0.761543 | 0.025878 |
| 3 | 3 | 0.182656 | 0.001401 | 0.1914 | 0.001531 | 0.801902 | 0.008434 |
| 4 | 4 | 0.175531 | 0.001515 | 0.185466 | 0.002237 | 0.817244 | 0.008916 |
| 5 | 5 | 0.172046 | 0.00215 | 0.182478 | 0.002121 | 0.823333 | 0.006377 |
| 6 | 6 | 0.170193 | 0.001849 | 0.18045 | 0.00246 | 0.828977 | 0.007246 |
| 7 | 7 | 0.168321 | 0.002187 | 0.17814 | 0.002177 | 0.834106 | 0.010527 |
| 8 | 8 | 0.167416 | 0.002498 | 0.176438 | 0.002126 | 0.835628 | 0.011008 |
| 9 | 9 | 0.166474 | 0.00301 | 0.175041 | 0.002259 | 0.837496 | 0.011698 |


In [36]:
best_value = np.min(cv_data['test-Logloss-mean'])
best_iter = np.argmin(cv_data['test-Logloss-mean'])

print('Best validation Logloss score, not stratified: {:.4f}±{:.4f} on step {}'.format(
    best_value,
    cv_data['test-Logloss-std'][best_iter],
    best_iter)
)

Best validation Logloss score, not stratified: 0.1636±0.0046 on step 35
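
The `-mean` and `-std` columns summarize the folds at each iteration, and the best step is the iteration where the mean test loss is lowest. Here is a small sketch of that aggregation on hypothetical fold curves. Whether CatBoost reports a population or a sample standard deviation is not stated above, so the use of `pstdev` is an assumption.

```python
from statistics import mean, pstdev

# Hypothetical per-fold test Logloss curves (3 folds, 4 iterations each).
fold_curves = [
    [0.31, 0.24, 0.20, 0.21],
    [0.30, 0.23, 0.19, 0.20],
    [0.32, 0.25, 0.21, 0.20],
]

# Transpose so that each entry holds the values of all folds at one iteration.
per_iteration = list(zip(*fold_curves))
means = [mean(values) for values in per_iteration]
stds = [pstdev(values) for values in per_iteration]  # assumption: population std

# The best step is the iteration with the lowest mean test loss.
best_iter = min(range(len(means)), key=means.__getitem__)
print('Best mean Logloss: {:.4f}±{:.4f} on step {}'.format(
    means[best_iter], stds[best_iter], best_iter))
```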


In [37]:
from catboost import cv

params = {
    'loss_function': 'Logloss',
    'iterations': 80,
    'custom_loss': 'AUC',
    'learning_rate': 0.5,
}

cv_data_1 = cv(
    params = params,
    pool = train_pool,
    fold_count=5,
    shuffle=True,
    partition_random_seed=0,
    plot=True,
    stratified=True, #stratified target
    verbose=False
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

Training on fold [0/5]

bestTest = 0.1628504126
bestIteration = 68

Training on fold [1/5]

bestTest = 0.1608272543
bestIteration = 77

Training on fold [2/5]

bestTest = 0.1694535356
bestIteration = 12

Training on fold [3/5]

bestTest = 0.1569498918
bestIteration = 29

Training on fold [4/5]

bestTest = 0.1644437541
bestIteration = 30



In [38]:
best_value = cv_data_1['test-Logloss-mean'].min()
best_iter = cv_data_1['test-Logloss-mean'].values.argmin()

print('Best validation Logloss score, stratified: {:.4f}±{:.4f} on step {}'.format(
    best_value,
    cv_data_1['test-Logloss-std'][best_iter],
    best_iter)
)

Best validation Logloss score, stratified: 0.1636±0.0046 on step 35
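
Stratified sampling keeps the class ratio roughly the same in every fold, which matters for a target as imbalanced as `ACTION`. The two runs above report identical numbers, which is consistent with `cv` already stratifying by default for `Logloss`; treat that as an assumption to check against the documentation of your CatBoost version. The sketch below shows the idea in plain Python; `stratified_folds` is a hypothetical helper, not part of CatBoost.

```python
import random
from collections import defaultdict

def stratified_folds(labels, fold_count, seed=0):
    """Assign each sample index to a fold so that every fold keeps the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(fold_count)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of one class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % fold_count].append(index)
    return folds

# Toy labels with the same kind of imbalance as ACTION: 10 zeros, 90 ones.
labels = [0] * 10 + [1] * 90
for fold in stratified_folds(labels, fold_count=5):
    zeros = sum(1 for index in fold if labels[index] == 0)
    print('fold size = {}, zeros = {}'.format(len(fold), zeros))
```

Without stratification, a fold could end up with very few zeros purely by chance, which makes per-fold metrics such as AUC noisier.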


## Sklearn Grid Search

In [39]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "learning_rate": [0.001, 0.01, 0.5],
}

clf = CatBoostClassifier(
    iterations=20, 
    cat_features=cat_features, 
    verbose=20
)
grid_search = GridSearchCV(clf, param_grid=param_grid, cv=3)
results = grid_search.fit(X_train, y_train)
results.best_estimator_.get_params()

0:	learn: 0.6923673	total: 12.3ms	remaining: 234ms
19:	learn: 0.6778431	total: 190ms	remaining: 0us
0:	learn: 0.6923682	total: 9ms	remaining: 171ms
19:	learn: 0.6778558	total: 150ms	remaining: 0us
0:	learn: 0.6923682	total: 9.43ms	remaining: 179ms
19:	learn: 0.6778568	total: 163ms	remaining: 0us
0:	learn: 0.6853838	total: 11.6ms	remaining: 220ms
19:	learn: 0.5629769	total: 168ms	remaining: 0us
0:	learn: 0.6853928	total: 11.3ms	remaining: 215ms
19:	learn: 0.5630657	total: 191ms	remaining: 0us
0:	learn: 0.6853925	total: 8.77ms	remaining: 167ms
19:	learn: 0.5630556	total: 163ms	remaining: 0us
0:	learn: 0.3972934	total: 9.18ms	remaining: 174ms
19:	learn: 0.1766459	total: 237ms	remaining: 0us
0:	learn: 0.3977266	total: 8.71ms	remaining: 165ms
19:	learn: 0.1774892	total: 250ms	remaining: 0us
0:	learn: 0.3977128	total: 8.72ms	remaining: 166ms
19:	learn: 0.1733642	total: 263ms	remaining: 0us
0:	learn: 0.3971379	total: 17ms	remaining: 324ms
19:	learn: 0.1717590	total: 293ms	remaining: 0us


{'iterations': 20,
 'learning_rate': 0.5,
 'verbose': 20,
 'cat_features': [0, 1, 2, 3, 4, 5, 6, 7, 8]}

With only 20 iterations (a low value), the best estimator is the one with the highest learning rate.

## Overfitting Detector

In [40]:
model_with_early_stop = CatBoostClassifier(
    iterations=200,
    learning_rate=0.5,
    early_stopping_rounds=20
)

model_with_early_stop.fit(
    train_pool,
    eval_set=validation_pool,
    verbose=False,
    plot=True
);

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

In [41]:
print(model_with_early_stop.tree_count_)

20
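
The idea behind `early_stopping_rounds` can be sketched as follows: training stops once the validation metric has not improved for the given number of iterations, and the best iteration seen so far is kept. This is a simplified illustration on a hypothetical curve, not CatBoost's implementation; the function name and the `maximize` flag are introduced for this example (a metric such as AUC is better when higher, Logloss when lower).

```python
def train_with_early_stopping(validation_metric, patience, maximize=False):
    """Scan a metric curve and stop once it has not improved for `patience` iterations.

    Returns (best_iteration, last_iteration_run).
    """
    best_value = None
    best_iteration = -1
    for iteration, value in enumerate(validation_metric):
        improved = (
            best_value is None
            or (value > best_value if maximize else value < best_value)
        )
        if improved:
            best_value, best_iteration = value, iteration
        elif iteration - best_iteration >= patience:
            return best_iteration, iteration
    return best_iteration, len(validation_metric) - 1

# Hypothetical validation Logloss: best at iteration 2, then it degrades.
curve = [0.30, 0.25, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27]
print(train_with_early_stopping(curve, patience=3))
```

With `learning_rate=0.5` the validation loss bottoms out early, so most of the 200 requested iterations are never run.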


### Overfitting Detector with eval metric

In [42]:
model_with_early_stop = CatBoostClassifier(
    eval_metric='AUC',
    iterations=200,
    learning_rate=0.5,
    early_stopping_rounds=20
)
model_with_early_stop.fit(
    train_pool,
    eval_set=validation_pool,
    verbose=False,
    plot=True
);

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

In [43]:
print(model_with_early_stop.tree_count_)

51


## Model predictions

In [44]:
model = CatBoostClassifier(iterations=200, learning_rate=0.03)
model.fit(train_pool, verbose=50);

0:	learn: 0.6569432	total: 28.8ms	remaining: 5.72s
50:	learn: 0.1958365	total: 1.13s	remaining: 3.3s
100:	learn: 0.1664598	total: 2.83s	remaining: 2.77s
150:	learn: 0.1595830	total: 5.02s	remaining: 1.63s
199:	learn: 0.1554346	total: 7.34s	remaining: 0us


In [46]:
#first method: predict the class directly

print(model.predict(X_validation))

[1 1 1 ... 1 1 1]


In [47]:
#second method: predict the probability of belonging to each class

print(model.predict_proba(X_validation))

[[0.0277 0.9723]
 [0.0189 0.9811]
 [0.01   0.99  ]
 ...
 [0.0327 0.9673]
 [0.0555 0.9445]
 [0.0224 0.9776]]


In [48]:
#here you see the raw sum over the trees: the leaf value of each tree is taken and added up. Note that these values are not between 0 and 1

raw_pred = model.predict(
    X_validation,
    prediction_type='RawFormulaVal'
)

print(raw_pred)

[3.5575 3.9472 4.5956 ... 3.3858 2.8338 3.7772]


In [50]:
#passing the raw leaf sums through the sigmoid function gives the probabilities

from numpy import exp

sigmoid = lambda x: 1 / (1 + exp(-x))

probabilities = sigmoid(raw_pred)

print(probabilities)

[0.9723 0.9811 0.99   ... 0.9673 0.9445 0.9776]
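
The relationship also works in the other direction: the logit function maps a probability back to the raw value, and a probability above 0.5 is the same condition as a raw value above 0. The sketch below checks this with the first raw values printed above; it assumes the default 0.5 decision threshold, and `logit` is a helper defined only for this example.

```python
from math import exp, log

def sigmoid(x):
    return 1 / (1 + exp(-x))

def logit(p):
    """Inverse of the sigmoid: maps a probability back to a raw value."""
    return log(p / (1 - p))

# Raw values copied from the RawFormulaVal output above.
raw_values = [3.5575, 3.9472, 4.5956]
for raw in raw_values:
    probability = sigmoid(raw)
    # Assumption: default threshold of 0.5, which is equivalent to raw > 0.
    predicted_class = int(raw > 0)
    print('raw = {:.4f} -> p = {:.4f} -> class {}'.format(raw, probability, predicted_class))

# Round trip: logit(sigmoid(x)) recovers x up to floating-point error.
print(round(logit(sigmoid(3.5575)), 4))
```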


## Select decision boundary

![](https://habrastorage.org/webt/y4/1q/yq/y41qyqfm9mcerp2ziys48phpjia.png)

In [None]:
import matplotlib.pyplot as plt
from catboost.utils import get_roc_curve
from catboost.utils import get_fpr_curve
from catboost.utils import get_fnr_curve

curve = get_roc_curve(model, validation_pool)
(fpr, tpr, thresholds) = curve

(thresholds, fpr) = get_fpr_curve(curve=curve)
(thresholds, fnr) = get_fnr_curve(curve=curve)

In [None]:
plt.figure(figsize=(16, 8))
style = {'alpha':0.5, 'lw':2}

plt.plot(thresholds, fpr, color='blue', label='FPR', **style)
plt.plot(thresholds, fnr, color='green', label='FNR', **style)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.grid(True)
plt.xlabel('Threshold', fontsize=16)
plt.ylabel('Error Rate', fontsize=16)
plt.title('FPR-FNR curves', fontsize=20)
plt.legend(loc="lower left", fontsize=16);

In [None]:
from catboost.utils import select_threshold

print(select_threshold(model, validation_pool, FNR=0.01))
print(select_threshold(model, validation_pool, FPR=0.01))
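
To see what such a threshold choice trades off, here is a plain-Python sketch that computes FPR and FNR at a given threshold and then searches for the highest threshold that keeps FNR within a bound. It is not how `select_threshold` is implemented; the data, the helper names, and the `>=` convention for predicting class 1 are all assumptions made for illustration.

```python
def error_rates(labels, scores, threshold):
    """Return (FPR, FNR) when predicting 1 for scores >= threshold."""
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    negatives = labels.count(0)
    positives = labels.count(1)
    return fp / negatives, fn / positives

def highest_threshold_with_fnr(labels, scores, max_fnr):
    """Pick the largest candidate threshold whose FNR does not exceed max_fnr."""
    for threshold in sorted(set(scores), reverse=True):
        if error_rates(labels, scores, threshold)[1] <= max_fnr:
            return threshold
    return min(scores)  # fallback; at the lowest score FNR is already 0

# Hypothetical validation labels and predicted probabilities of class 1.
labels = [0, 0, 0, 1, 1, 1, 1, 1]
scores = [0.10, 0.40, 0.80, 0.35, 0.60, 0.70, 0.90, 0.95]

for threshold in (0.3, 0.5, 0.75):
    fpr, fnr = error_rates(labels, scores, threshold)
    print('threshold = {:.2f}: FPR = {:.2f}, FNR = {:.2f}'.format(threshold, fpr, fnr))
print(highest_threshold_with_fnr(labels, scores, max_fnr=0.2))
```

Raising the threshold lowers FPR and raises FNR, which is the crossing pattern the FPR-FNR plot above is meant to show.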

## Metric evaluation on a new dataset

In [None]:
metrics = model.eval_metrics(
    data=validation_pool,
    metrics=['Logloss','AUC'],
    ntree_start=0,
    ntree_end=0,
    eval_period=1,
    plot=True
)

In [None]:
print('AUC values:\n{}'.format(np.array(metrics['AUC'])))

## Feature importances

### Prediction values change

The default feature importance for binary classification is `PredictionValueChange`: how much, on average, the prediction changes when the feature value changes.
These importances are non-negative.
They are normalized to sum to 100, so each value can be read as a percentage of the total importance.

In [None]:
np.array(model.get_feature_importance(prettified=True))
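
Normalization simply rescales the raw scores to a fixed total so that they can be compared as shares. A minimal sketch, with made-up raw scores and a total of 100 chosen so the values read directly as percentages (the helper `normalize` and the numbers are assumptions; only the feature names come from the dataset):

```python
def normalize(raw_importances, total=100.0):
    """Scale non-negative scores so that they sum to `total`."""
    raw_sum = sum(raw_importances.values())
    return {name: total * value / raw_sum for name, value in raw_importances.items()}

# Hypothetical raw scores, for illustration only.
raw = {'RESOURCE': 3.0, 'MGR_ID': 2.0, 'ROLE_DEPTNAME': 4.0, 'ROLE_CODE': 1.0}
shares = normalize(raw)
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print('{:<14}{:5.1f}'.format(name, share))
```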

### Loss function change

The non-default `LossFunctionChange` importance approximates how much the optimized loss function will change if the value of the feature changes.
These importances may be negative if the feature has a bad influence on the loss function.
They are not normalized; the absolute value of an importance has the same scale as the optimized loss value.
To calculate this importance you need to pass a dataset (here `train_pool`) as an argument.

In [None]:
np.array(model.get_feature_importance(
    train_pool, 
    'LossFunctionChange', 
    prettified=True
))

### Shap values

In [None]:
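#use rows of X_validation so the predictions match the SHAP values computed on validation_pool
print(model.predict_proba(X_validation.iloc[[1]]))
print(model.predict_proba(X_validation.iloc[[91]]))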

In [None]:
shap_values = model.get_feature_importance(
    validation_pool, 
    'ShapValues'
)
expected_value = shap_values[0,-1]
shap_values = shap_values[:,:-1]
print(shap_values.shape)
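
The slicing above relies on the layout of the returned matrix: one column per feature plus a last column holding the expected value, so that the expected value plus a row's contributions adds up to the raw prediction for that row. The sketch below illustrates this additivity with exact Shapley values on a toy function. It uses a single baseline row for "absent" features, which is a simplification and not the tree-specific algorithm CatBoost uses; `shapley_values` and `toy_model` are hypothetical.

```python
from itertools import permutations

def shapley_values(predict, row, baseline):
    """Exact Shapley values of `predict` at `row`, with absent features set to `baseline`."""
    n = len(row)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for feature in order:
            current[feature] = row[feature]      # reveal one more feature
            value = predict(current)
            totals[feature] += value - previous  # its marginal contribution
            previous = value
    return [total / len(orderings) for total in totals]

# Hypothetical raw-score model with an interaction between features 0 and 1.
def toy_model(x):
    return 0.5 + 2.0 * x[0] + 1.0 * x[1] * x[0] - 0.5 * x[2]

row = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
contributions = shapley_values(toy_model, row, baseline)
expected_value = toy_model(baseline)

print('Contributions:', contributions)
print('expected_value + sum:', expected_value + sum(contributions))
print('Model output:', toy_model(row))
```

The interaction term is split evenly between the two features involved, and the contributions sum exactly to the difference between the model output and the baseline output.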

In [None]:
row = X_validation.iloc[[1]]  #same row as the SHAP values, which were computed on validation_pool
proba = model.predict_proba(row)[0]
raw = model.predict(row, prediction_type='RawFormulaVal')[0]
print('Probabilities', proba)
print('Raw formula value %.4f' % raw)
print('Probability from raw value %.4f' % sigmoid(raw))

In [None]:
import shap

shap.initjs()
shap.force_plot(expected_value, shap_values[1,:], X_validation.iloc[1,:])

In [None]:
row = X_validation.iloc[[91]]  #same row as the SHAP values, which were computed on validation_pool
proba = model.predict_proba(row)[0]
raw = model.predict(row, prediction_type='RawFormulaVal')[0]
print('Probabilities', proba)
print('Raw formula value %.4f' % raw)
print('Probability from raw value %.4f' % sigmoid(raw))

In [None]:
import shap
shap.initjs()
shap.force_plot(expected_value, shap_values[91,:], X_validation.iloc[91,:])

In [None]:
shap.summary_plot(shap_values, X_validation)

## Snapshotting

In [None]:
#!rm 'catboost_info/snapshot.bkp'

model = CatBoostClassifier(
    iterations=100,
    save_snapshot=True,
    snapshot_file='snapshot.bkp',
    snapshot_interval=1
)

model.fit(train_pool, eval_set=validation_pool, verbose=10);

## Saving the model

In [None]:
model = CatBoostClassifier(iterations=10)
model.fit(train_pool, eval_set=validation_pool, verbose=False)
model.save_model('catboost_model.bin')
model.save_model('catboost_model.json', format='json')

In [None]:
model.load_model('catboost_model.bin')
print(model.get_params())
print(model.learning_rate_)

## Hyperparameter tuning

In [None]:
tuned_model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.03,
    depth=6,
    l2_leaf_reg=3,
    random_strength=1,
    bagging_temperature=1
)

tuned_model.fit(
    X_train, y_train,
    cat_features=cat_features,
    verbose=False,
    eval_set=(X_validation, y_validation),
    plot=True
);

# Speeding up the training

In [None]:
fast_model = CatBoostClassifier(
    boosting_type='Plain',
    rsm=0.5,
    one_hot_max_size=50,
    leaf_estimation_iterations=1,
    max_ctr_complexity=1,
    iterations=100,
    learning_rate=0.3,
    bootstrap_type='Bernoulli',
    subsample=0.5
)
fast_model.fit(
    X_train, y_train,
    cat_features=cat_features,
    verbose=False,
    eval_set=(X_validation, y_validation),
    plot=True
);

# Reducing model size

In [None]:
small_model = CatBoostClassifier(
    learning_rate=0.03,
    iterations=500,
    model_size_reg=50,
    max_ctr_complexity=1,
    ctr_leaf_count_limit=100
)
small_model.fit(
    X_train, y_train,
    cat_features=cat_features,
    verbose=False,
    eval_set=(X_validation, y_validation),
    plot=True
);