# CatBoost basics

For this homework we will use the Amazon Employee Access Challenge dataset from a [Kaggle](https://www.kaggle.com) competition for our experiments. The data can be downloaded [here](https://www.kaggle.com/c/amazon-employee-access-challenge/data).

As a result of this tutorial you need to provide a TSV file with answers.
There are 17 questions in this tutorial. The resulting TSV file should consist of 17 lines; each line should contain the question number and the answer to it, separated by a tab. Questions are numbered from 1 to 17.
See an example of the resulting file here.
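
The file format described above can be produced with the standard `csv` module. The following is only a minimal sketch: the answer values are placeholders, and an in-memory buffer stands in for a real file such as `answers.tsv` (a hypothetical name).

```python
import csv
import io

# Hypothetical answers keyed by question number (values are placeholders).
answers = {1: 'answer_1', 2: 'answer_2', 3: 'answer_3'}

buffer = io.StringIO()  # stands in for open('answers.tsv', 'w', newline='')
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
for question in sorted(answers):
    # One line per question: number, tab, answer.
    writer.writerow([question, answers[question]])

tsv_text = buffer.getvalue()
print(tsv_text)
```

With all 17 answers in the dictionary, the same loop would write the 17 required lines.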

## Libraries installation

First you need to install the libraries. To do that run:

`pip install catboost
 pip install shap
 pip install ipywidgets
 jupyter nbextension enable --py widgetsnbextension`

In [1]:
import sys
!{sys.executable} -m pip install catboost
!{sys.executable} -m pip install ipywidgets
!{sys.executable} -m jupyter nbextension enable --py widgetsnbextension
# !{sys.executable} -m pip install shap # ERROR can't install

You are using pip version 9.0.1, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


## Reading the data

Let's first download the data and put it into the folder `amazon`. Alternatively, as in the cells below, `catboost.datasets.amazon()` loads the train and test parts as pandas DataFrames.

In [2]:
import pandas as pd
import numpy as np
np.set_printoptions(precision=4)
import catboost
from catboost import datasets
from catboost import *

from grader import Grader

In [3]:
train_df, test_df = catboost.datasets.amazon()
train_df.head()

Unnamed: 0,ACTION,RESOURCE,MGR_ID,ROLE_ROLLUP_1,ROLE_ROLLUP_2,ROLE_DEPTNAME,ROLE_TITLE,ROLE_FAMILY_DESC,ROLE_FAMILY,ROLE_CODE
0,1,39353,85475,117961,118300,123472,117905,117906,290919,117908
1,1,17183,1540,117961,118343,123125,118536,118536,308574,118539
2,1,36724,14457,118219,118220,117884,117879,267952,19721,117880
3,1,36135,5396,117961,118343,119993,118321,240983,290919,118322
4,1,42680,5905,117929,117930,119569,119323,123932,19793,119325


In [4]:
train_df.to_csv('../train.csv')
test_df.to_csv('../test.csv')

In [5]:
grader = Grader()

## Preparing your data

Label values extraction

In [6]:
y = train_df.ACTION
X = train_df.drop('ACTION', axis=1)

Categorical features declaration

In [7]:
cat_features = list(range(0, X.shape[1]))
print(cat_features)

[0, 1, 2, 3, 4, 5, 6, 7, 8]


Now it makes sense to analyze the dataset.
First you need to calculate how many positive and negative objects are present in the train dataset.

**Question 1:**

How many negative objects are present in the train dataset X?

In [8]:
# A negative object is one whose label ACTION is 0; feature values are not involved.
zero_count = int((y == 0).sum())
grader.submit_tag('negative_samples', zero_count)

Current answer for task negative_samples is: 1897


**Question 2:**

How many positive objects are present in the train dataset X?

In [9]:
# A positive object is one whose label ACTION is 1.
one_count = int((y == 1).sum())
grader.submit_tag('positive_samples', one_count)

Current answer for task positive_samples is: 30872


In [10]:
print('Zero count = ' + str(zero_count) + ', One count = ' + str(one_count))

Zero count = 1897, One count = 30872


Now for every feature you need to calculate the number of unique values.

**Question 3:**
    
How many unique values does the feature RESOURCE have?

In [11]:
unique_vals_for_RESOURCE = train_df['RESOURCE'].nunique()
grader.submit_tag('resource_unique_values', unique_vals_for_RESOURCE)

Current answer for task resource_unique_values is: 7518
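
To get the count for every feature at once, `DataFrame.nunique()` can be applied to the whole table. The sketch below uses a small made-up table in place of `X`, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for the feature table X (values are made up for illustration).
toy_X = pd.DataFrame({
    'RESOURCE': [39353, 17183, 39353, 36135],
    'MGR_ID': [85475, 1540, 1540, 1540],
})

# DataFrame.nunique() returns one count of distinct values per column.
unique_counts = toy_X.nunique()
print(unique_counts)
```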


Now we can create a `Pool` object. This type is used for datasets in CatBoost. You can also pass a numpy array or a DataFrame directly, but working with the `Pool` class is the most efficient way in terms of memory and speed. We recommend creating a `Pool` from a file if your data is on disk, or from `FeaturesData` if you use numpy.

In [12]:
import numpy as np
from catboost import Pool

pool1 = Pool(data=X, label=y, cat_features=cat_features)
pool2 = Pool(data='../train.csv', delimiter=',', has_header=True)
pool3 = Pool(data=X, cat_features=cat_features)

print('Dataset shape')
print('dataset 1:' + str(pool1.shape) + '\ndataset 2:' + str(pool2.shape)  + '\ndataset 3:' + str(pool3.shape))

print('\n')
print('Column names')
print('dataset 1: ')
print(pool1.get_feature_names()) 
print('\ndataset 2:')
print(pool2.get_feature_names())
print('\ndataset 3:')
print(pool3.get_feature_names())

Dataset shape
dataset 1:(32769, 9)
dataset 2:(32769, 10)
dataset 3:(32769, 9)


Column names
dataset 1: 
['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']

dataset 2:
['ACTION', 'RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']

dataset 3:
['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']


## Split your data into train and validation

When training your model, you will have to detect overfitting and select the best parameters. To do that you need a validation dataset.
Normally you would use a random split, for example
`train_test_split` from `sklearn.model_selection`.
For the purpose of this homework, however, the train part will be the first 80% of the data and the evaluation part will be the last 20%.

In [13]:
train_count = int(X.shape[0] * 0.8)

X_train = X.iloc[:train_count,:]
y_train = y[:train_count]
X_validation = X.iloc[train_count:, :]
y_validation = y[train_count:]
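
For comparison, this is roughly what the random split mentioned above could look like outside this homework. It is a sketch on made-up data; the `stratify` argument is an optional addition that keeps the class ratio equal in both parts, which can help with unbalanced labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for X and y (made up for illustration).
rng = np.random.RandomState(0)
toy_X = rng.randint(0, 5, size=(100, 3))
toy_y = np.array([1] * 90 + [0] * 10)

# 80% train, 20% validation; stratify keeps the class ratio in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    toy_X, toy_y, train_size=0.8, random_state=0, stratify=toy_y
)
print(X_tr.shape, X_val.shape, y_val.mean())
```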

## Train your model

Now we will train our first model.

In [14]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=5,
    random_seed=0,
    learning_rate=0.1
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent'
)
print('Model is fitted: ' + str(model.is_fitted()))
print('Model params:')
print(model.get_params())

Model is fitted: True
Model params:
{'random_seed': 0, 'loss_function': 'Logloss', 'learning_rate': 0.1, 'iterations': 5}


## Stdout of the training

During training, stdout shows the value of the loss function on each iteration, or on each k-th iteration (`verbose=3` below prints every third one).
You can also see how much time has passed since the start of the training and how much time is left.

In [15]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=15,
    verbose=3
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
)

0:	learn: 0.3007996	test: 0.3044268	best: 0.3044268 (0)	total: 62ms	remaining: 867ms
3:	learn: 0.1800237	test: 0.1674121	best: 0.1674121 (3)	total: 546ms	remaining: 1.5s
6:	learn: 0.1706950	test: 0.1549520	best: 0.1549520 (6)	total: 917ms	remaining: 1.05s
9:	learn: 0.1672391	test: 0.1495040	best: 0.1495040 (9)	total: 1.23s	remaining: 614ms
12:	learn: 0.1645499	test: 0.1487789	best: 0.1487789 (12)	total: 1.64s	remaining: 252ms
14:	learn: 0.1630092	test: 0.1469375	best: 0.1469375 (14)	total: 1.94s	remaining: 0us

bestTest = 0.1469374586
bestIteration = 14



<catboost.core.CatBoostClassifier at 0x7f5ee1f57860>

## Random seed

If you don't specify `random_seed`, CatBoost chooses the seed itself (in the run below it turned out to be 0).
After the training has finished you can look at the value of the random seed that was used.
If you train again with this `random_seed`, you will get the same results.

In [16]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=5
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
)

0:	learn: 0.3007996	test: 0.3044268	best: 0.3044268 (0)	total: 55.5ms	remaining: 222ms
1:	learn: 0.2161146	test: 0.2152075	best: 0.2152075 (1)	total: 208ms	remaining: 312ms
2:	learn: 0.1879597	test: 0.1797290	best: 0.1797290 (2)	total: 348ms	remaining: 232ms
3:	learn: 0.1800237	test: 0.1674121	best: 0.1674121 (3)	total: 492ms	remaining: 123ms
4:	learn: 0.1732668	test: 0.1581682	best: 0.1581682 (4)	total: 649ms	remaining: 0us

bestTest = 0.1581682309
bestIteration = 4



<catboost.core.CatBoostClassifier at 0x7f5ee1f57518>

In [17]:
random_seed = model.random_seed_
print('Used random seed = ' + str(random_seed))
model = CatBoostClassifier(
    iterations=5,
    random_seed=random_seed
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
)

Used random seed = 0
0:	learn: 0.3007996	test: 0.3044268	best: 0.3044268 (0)	total: 55.3ms	remaining: 221ms
1:	learn: 0.2161146	test: 0.2152075	best: 0.2152075 (1)	total: 211ms	remaining: 317ms
2:	learn: 0.1879597	test: 0.1797290	best: 0.1797290 (2)	total: 315ms	remaining: 210ms
3:	learn: 0.1800237	test: 0.1674121	best: 0.1674121 (3)	total: 452ms	remaining: 113ms
4:	learn: 0.1732668	test: 0.1581682	best: 0.1581682 (4)	total: 618ms	remaining: 0us

bestTest = 0.1581682309
bestIteration = 4



<catboost.core.CatBoostClassifier at 0x7f5f1777bac8>

Try training 10 models with the parameters below and calculate the mean and the standard deviation of the Logloss error on the validation dataset.

**Question 4:**

What is the mean value of the Logloss metric on the validation dataset (X_validation, y_validation) after training `CatBoostClassifier` 10 times with different random seeds in the following way:

`model = CatBoostClassifier(
    iterations=300,
    learning_rate=0.1,
    random_seed={my_random_seed}
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
)
`

In [18]:
metrics = []
for random_seed in range(10):
    print('Used random seed = ' + str(random_seed))
    model = CatBoostClassifier(
        iterations=300,
        learning_rate=0.1,
        random_seed=random_seed
    )
    model.fit(
        X_train, y_train,
        cat_features=cat_features,
        eval_set=(X_validation, y_validation),
        verbose=50
    )
    metrics.append(model.get_best_score()['validation_0']['Logloss'])

Used random seed = 0
0:	learn: 0.5790122	test: 0.5797377	best: 0.5797377 (0)	total: 77.5ms	remaining: 23.2s
50:	learn: 0.1606281	test: 0.1432391	best: 0.1432391 (50)	total: 6.86s	remaining: 33.5s
100:	learn: 0.1555506	test: 0.1394226	best: 0.1394226 (100)	total: 12.9s	remaining: 25.5s
150:	learn: 0.1515049	test: 0.1384309	best: 0.1384019 (148)	total: 20.3s	remaining: 20.1s
200:	learn: 0.1486424	test: 0.1381162	best: 0.1381162 (200)	total: 27.7s	remaining: 13.6s
250:	learn: 0.1467162	test: 0.1379267	best: 0.1379267 (250)	total: 35.2s	remaining: 6.88s
299:	learn: 0.1452004	test: 0.1379528	best: 0.1377092 (262)	total: 42.9s	remaining: 0us

bestTest = 0.1377092211
bestIteration = 262

Shrink model to first 263 iterations.
Used random seed = 1
0:	learn: 0.5785828	test: 0.5796621	best: 0.5796621 (0)	total: 109ms	remaining: 32.7s
50:	learn: 0.1617489	test: 0.1457523	best: 0.1457523 (50)	total: 6.63s	remaining: 32.4s
100:	learn: 0.1542598	test: 0.1402208	best: 0.1401744 (97)	total: 13.4s	remai

In [19]:
import numpy
metrics = numpy.array(metrics)
mean = numpy.mean(metrics, axis=0)
grader.submit_tag('logloss_mean', mean)

Current answer for task logloss_mean is: 0.138180847126


**Question 5:**

What is the standard deviation of it?

In [20]:
stddev = (numpy.std(metrics, axis=0))
grader.submit_tag('logloss_std', stddev)

Current answer for task logloss_std is: 0.000777781825619
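
A side note on the standard deviation: `numpy.std` divides by `n` by default (`ddof=0`), while the sample standard deviation divides by `n - 1` (`ddof=1`). The sketch below shows the difference on made-up scores; which variant the grader expects is not stated here, so this is only an illustration.

```python
import numpy as np

# Hypothetical validation Logloss values from five runs (made up for illustration).
scores = np.array([0.1377, 0.1390, 0.1381, 0.1375, 0.1386])

mean = scores.mean()
std_population = scores.std()    # ddof=0, the numpy default: divides by n
std_sample = scores.std(ddof=1)  # divides by n - 1 instead of n

print('mean = {:.5f}'.format(mean))
print('std (ddof=0) = {:.5f}, std (ddof=1) = {:.5f}'.format(std_population, std_sample))
```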


## Metrics calculation and graph plotting

When experimenting in a Jupyter notebook you can see graphs of different metrics during training.
To do that, pass the `plot=True` parameter to `fit`.

In [21]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=50,
    random_seed=63,
    learning_rate=0.1,
    custom_loss=['Accuracy']
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent',
#     plot=True
)

<catboost.core.CatBoostClassifier at 0x7f5ee182e6a0>

**Question 6:**

What is the value of the accuracy metric on the evaluation dataset after training with parameters `iterations=50`, `random_seed=63`, `learning_rate=0.1`?

In [22]:
accuracy = 0
grader.submit_tag('accuracy_6', accuracy)

Current answer for task accuracy_6 is: 0


## Model comparison

In [23]:
model1 = CatBoostClassifier(
    learning_rate=0.5,
    iterations=1000,
    random_seed=64,
    train_dir='learning_rate_0.5',
    custom_loss = ['Accuracy']
)

model2 = CatBoostClassifier(
    learning_rate=0.05,
    iterations=1000,
    random_seed=64,
    train_dir='learning_rate_0.05',
    custom_loss = ['Accuracy']
)
model1.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    verbose=100
)
model2.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    verbose=100
)

0:	learn: 0.3007996	test: 0.3044268	best: 0.3044268 (0)	total: 66.9ms	remaining: 1m 6s
100:	learn: 0.1416460	test: 0.1404156	best: 0.1402309 (96)	total: 16s	remaining: 2m 22s
200:	learn: 0.1340506	test: 0.1447274	best: 0.1399893 (106)	total: 32.3s	remaining: 2m 8s
300:	learn: 0.1281182	test: 0.1478018	best: 0.1399893 (106)	total: 47.4s	remaining: 1m 50s
400:	learn: 0.1213678	test: 0.1506548	best: 0.1399893 (106)	total: 1m 3s	remaining: 1m 34s
500:	learn: 0.1178627	test: 0.1527704	best: 0.1399893 (106)	total: 1m 17s	remaining: 1m 17s
600:	learn: 0.1145490	test: 0.1531613	best: 0.1399893 (106)	total: 1m 33s	remaining: 1m 1s
700:	learn: 0.1121429	test: 0.1548145	best: 0.1399893 (106)	total: 1m 48s	remaining: 46.1s
800:	learn: 0.1087987	test: 0.1576137	best: 0.1399893 (106)	total: 2m 3s	remaining: 30.6s
900:	learn: 0.1064078	test: 0.1590852	best: 0.1399893 (106)	total: 2m 18s	remaining: 15.2s
999:	learn: 0.1037092	test: 0.1610907	best: 0.1399893 (106)	total: 2m 33s	remaining: 0us

bestTest

<catboost.core.CatBoostClassifier at 0x7f5ee182e0f0>

In [24]:
# from catboost import MetricVisualizer
# MetricVisualizer(['learning_rate_0.05', 'learning_rate_0.5']).start()

**Question 7:**

Try training these models for 1000 iterations. Which model gives the better best Accuracy on the validation dataset?
By best accuracy we mean the accuracy on the best iteration, which might not be the last iteration.

In [25]:
best_model_name = 'learning_rate_0.05'
grader.submit_tag('best_model_name', best_model_name)

Current answer for task best_model_name is: learning_rate_0.05


## Best iteration

If a validation dataset is present, then after training the model is shrunk to the number of trees at which it achieved the best evaluation metric value on the validation dataset.
By default the evaluation metric is the optimized metric, but you can set the evaluation metric to some other metric.
In the example below the evaluation metric is `Accuracy`.

In [26]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=100,
    random_seed=63,
    learning_rate=0.5,
    eval_metric='Accuracy'
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent',
#     plot=True
)

<catboost.core.CatBoostClassifier at 0x7f5ee1f57cf8>

In [27]:
print('Tree count: ' + str(model.tree_count_))

Tree count: 72


If you don't want the model to be shrunk, you can set `use_best_model=False`.

In [28]:
model = CatBoostClassifier(
    iterations=100,
    random_seed=63,
    learning_rate=0.5,
    eval_metric='Accuracy',
    use_best_model=False
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent',
#     plot=True
)

<catboost.core.CatBoostClassifier at 0x7f5ee1f57eb8>

**Question 8:**
    
What will be the number of trees in the resulting model after training with a validation dataset with parameters `iterations=100`, `learning_rate=0.5`, `eval_metric='Accuracy'` and `use_best_model=False`?

In [29]:
tree_count = model.tree_count_
grader.submit_tag('num_trees', tree_count)

Current answer for task num_trees is: 100


## Cross-validation

The next functionality you need to know about is cross-validation.
For unbalanced datasets stratified cross-validation can be useful.

In [30]:
from catboost import cv

params = {}
params['loss_function'] = 'Logloss'
params['iterations'] = 80
params['custom_loss'] = 'AUC'
params['random_seed'] = 63
params['learning_rate'] = 0.5

cv_data = cv(
    params = params,
    pool = Pool(X, label=y, cat_features=cat_features),
    fold_count=5,
    inverted=False,
    shuffle=True,
    partition_random_seed=0,
#     plot=True,
    stratified=True,
    verbose=False
)

Cross-validation returns specified metric values on every iteration (or every k-th iteration, if you specify so)

In [31]:
print(cv_data[0:4])

   test-AUC-mean  test-AUC-std  test-Logloss-mean  test-Logloss-std  \
0       0.500000      0.000000           0.302197          0.000080   
1       0.625621      0.122336           0.222651          0.014472   
2       0.799508      0.012871           0.179930          0.004739   
3       0.824558      0.013151           0.165090          0.003799   

   train-AUC-mean  train-AUC-std  train-Logloss-mean  train-Logloss-std  
0        0.499984       0.000017            0.302203           0.000050  
1        0.614679       0.109875            0.225825           0.010991  
2        0.758325       0.022924            0.190024           0.004146  
3        0.781285       0.017559            0.178807           0.003176  


Let's look at the mean value and standard deviation of Logloss for cv on the best iteration.

In [32]:
best_value = np.min(cv_data['test-Logloss-mean'])
best_iter = np.argmin(cv_data['test-Logloss-mean'])

print('Best validation Logloss score, stratified: {:.4f}±{:.4f} on step {}'.format(
    best_value,
    cv_data['test-Logloss-std'][best_iter],
    best_iter)
)

Best validation Logloss score, stratified: 0.1409±0.0056 on step 65


**Question 9:**

Try running stratified cross-validation with the same parameters. What will be the mean of the Logloss metric on test for the stratified cross-validation on the best iteration?

In [33]:
mean_on_best_iteration = np.min(cv_data['test-Logloss-mean'])
grader.submit_tag('mean_logloss_cv', mean_on_best_iteration)

Current answer for task mean_logloss_cv is: 0.140862080899


**Question 10:**

Try running stratified cross-validation with the same parameters. What will be the standard deviation of Logloss metric of the stratified cross-validation on the best iteration?

In [34]:
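# The std must be taken at the iteration with the best mean Logloss,
# not as the minimum of the std column.
best_iter = np.argmin(cv_data['test-Logloss-mean'])
std_on_best_iteration = cv_data['test-Logloss-std'][best_iter]
grader.submit_tag('logloss_std_1', std_on_best_iteration)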
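
The mean and the standard deviation reported for the best iteration should come from the same row of the cv table. A minimal sketch on a made-up table (the column names follow the cv output printed earlier, the numbers are invented) shows why taking the minimum of the std column is a different quantity:

```python
import pandas as pd

# Toy stand-in for the DataFrame returned by cv (numbers are made up).
toy_cv = pd.DataFrame({
    'test-Logloss-mean': [0.30, 0.22, 0.15, 0.14, 0.16],
    'test-Logloss-std':  [0.0001, 0.0140, 0.0047, 0.0056, 0.0060],
})

best_iter = toy_cv['test-Logloss-mean'].idxmin()         # row with the lowest mean
mean_at_best = toy_cv.loc[best_iter, 'test-Logloss-mean']
std_at_best = toy_cv.loc[best_iter, 'test-Logloss-std']  # std from the same row

print('best iteration:', best_iter)
print('{:.4f}±{:.4f}'.format(mean_at_best, std_at_best))
print('smallest std over all iterations:', toy_cv['test-Logloss-std'].min())
```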


## Overfitting detector

A useful feature of the library is the overfitting detector.
Let's try training the model with early stopping.

In [35]:
model_with_early_stop = CatBoostClassifier(
    iterations=200,
    random_seed=63,
    learning_rate=0.5,
    od_type='Iter',
    od_wait=20,
    eval_metric = 'AUC'
)
model_with_early_stop.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent',
#     plot=True
)

<catboost.core.CatBoostClassifier at 0x7f5ee182e320>

**Question 11:**

Now try training the model with the same parameters and with the overfitting detector, but with `eval_metric='AUC'`.
What will be the number of iterations after which the training will stop?
(Not the number of trees in the resulting model, but the number of iterations that the algorithm will perform before stopping.)

In [36]:
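# Number of iterations actually performed = length of the recorded validation metric history.
evals = model_with_early_stop.get_evals_result()
iterations_count = len(next(v for k, v in evals.items() if k.startswith('validation'))['AUC'])
grader.submit_tag('iterations_overfitting', iterations_count)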


## Snapshotting

If you train for a long time, for example for several hours, you need to save snapshots.
Otherwise, if your laptop or server reboots, you will lose all the progress.
To do that you need to set `save_snapshot=True` and specify the `snapshot_file` parameter.
Try running the code below and interrupting the kernel after a short time.
Then try running the same cell again.
The training will resume from the iteration at which it was interrupted.
Note that all additional files are written by default into the `catboost_info` directory. This can be changed using the `train_dir` parameter. So the snapshot file will be there.

In [37]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=40,
    save_snapshot=True,
    snapshot_file='snapshot.bkp',
    random_seed=43
)
model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    logging_level='Verbose'
)


bestTest = 0.1429637809
bestIteration = 39



<catboost.core.CatBoostClassifier at 0x7f5ee182e8d0>

## Model predictions

There are multiple ways to make predictions.
The easiest one is to call `predict` or `predict_proba`.
You can also make predictions using C++ code. For that see the [documentation](https://tech.yandex.com/catboost/doc/dg/concepts/c-plus-plus-api-docpage/).

In [38]:
print(model.predict_proba(data=X_validation))

[[ 0.0159  0.9841]
 [ 0.0157  0.9843]
 [ 0.0059  0.9941]
 ..., 
 [ 0.0071  0.9929]
 [ 0.3818  0.6182]
 [ 0.0263  0.9737]]


In [39]:
print(model.predict(data=X_validation))

[ 1.  1.  1. ...,  1.  1.  1.]


For binary classification the raw model output is not necessarily a value in `[0,1]`; it is an arbitrary real number. To get the probability from this value you need to apply the sigmoid function to it.

In [40]:
raw_pred = model.predict(data=X_validation, prediction_type='RawFormulaVal')
print(raw_pred)

[ 4.1255  4.1387  5.122  ...,  4.9439  0.4819  3.6114]


In [41]:
import math
def sigmoid(x):
    return 1 / (1 + math.exp(-x))
probabilities = [sigmoid(x) for x in raw_pred]
print(np.array(probabilities))

[ 0.9841  0.9843  0.9941 ...,  0.9929  0.6182  0.9737]
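
A note on the sigmoid helper: `math.exp(-x)` raises `OverflowError` when `x` is a large negative number, and the list comprehension is slow on big arrays. A vectorized, numerically stable variant could look like the sketch below; the first two inputs are raw values taken from the printed output, the rest are added for illustration.

```python
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid for a 1-D array of raw scores."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) <= 1, so it cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, exp(x) < 1, so it cannot overflow either.
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

raw = np.array([4.1255, 0.4819, 0.0, -1000.0])
probabilities = sigmoid(raw)
print(np.round(probabilities, 4))
```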


## Staged prediction

CatBoost also supports staged prediction: a prediction for each object on each iteration (or on each k-th iteration). This can be used if you want to calculate the values of some custom metric using the predictions.

In [42]:
predictions_gen = model.staged_predict_proba(data=X_validation, ntree_start=0, ntree_end=5, eval_period=1)
for iteration, predictions in enumerate(predictions_gen):
    print('Iteration ' + str(iteration) + ', predictions:')
    print(predictions)

Iteration 0, predictions:
[[ 0.228  0.772]
 [ 0.228  0.772]
 [ 0.228  0.772]
 ..., 
 [ 0.228  0.772]
 [ 0.228  0.772]
 [ 0.228  0.772]]
Iteration 1, predictions:
[[ 0.1121  0.8879]
 [ 0.1273  0.8727]
 [ 0.1121  0.8879]
 ..., 
 [ 0.1121  0.8879]
 [ 0.221   0.779 ]
 [ 0.1526  0.8474]]
Iteration 2, predictions:
[[ 0.0599  0.9401]
 [ 0.0686  0.9314]
 [ 0.0599  0.9401]
 ..., 
 [ 0.0599  0.9401]
 [ 0.3378  0.6622]
 [ 0.0833  0.9167]]
Iteration 3, predictions:
[[ 0.0424  0.9576]
 [ 0.0486  0.9514]
 [ 0.0424  0.9576]
 ..., 
 [ 0.0424  0.9576]
 [ 0.3799  0.6201]
 [ 0.0594  0.9406]]
Iteration 4, predictions:
[[ 0.0267  0.9733]
 [ 0.0435  0.9565]
 [ 0.0267  0.9733]
 ..., 
 [ 0.0379  0.9621]
 [ 0.3549  0.6451]
 [ 0.0531  0.9469]]




## Metric evaluation on a new dataset

You can also calculate metrics directly after training.

In [43]:
# metrics = model.eval_metrics(data=pool1, metrics=['Logloss','AUC'], plot=True)
metrics = model.eval_metrics(data=pool1, metrics=['Logloss','AUC'])

In [44]:
print('AUC values:')
print(np.array(metrics['AUC']))

AUC values:
[ 0.4999  0.6183  0.878   0.9372  0.9332  0.9456  0.9505  0.9542  0.9548
  0.9562  0.9561  0.9578  0.9665  0.9667  0.9668  0.9668  0.9703  0.9706
  0.9719  0.9716  0.9726  0.9729  0.9728  0.9729  0.973   0.973   0.973
  0.9744  0.9743  0.9753  0.9782  0.9782  0.9782  0.9782  0.9782  0.9782
  0.9781  0.9781  0.9779  0.9818]


**Question 12:**

Now train a model in the following way:

`
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    random_seed=43
)
model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    logging_level='Verbose'
)
`

What will be the AUC value at iteration 550 if the metrics are evaluated on the initial dataset X?

In [45]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    random_seed=43
)
model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
#     logging_level='Verbose',
    verbose=100
)

0:	learn: 0.6335962	test: 0.6340206	best: 0.6340206 (0)	total: 159ms	remaining: 2m 38s
100:	learn: 0.1609553	test: 0.1447964	best: 0.1447964 (100)	total: 15.8s	remaining: 2m 20s
200:	learn: 0.1558555	test: 0.1411513	best: 0.1411513 (200)	total: 31.3s	remaining: 2m 4s
300:	learn: 0.1509529	test: 0.1391401	best: 0.1391292 (297)	total: 49.4s	remaining: 1m 54s
400:	learn: 0.1482021	test: 0.1391781	best: 0.1389002 (376)	total: 1m 10s	remaining: 1m 44s
500:	learn: 0.1461119	test: 0.1389435	best: 0.1389002 (376)	total: 1m 28s	remaining: 1m 27s
600:	learn: 0.1447044	test: 0.1389096	best: 0.1388514 (578)	total: 1m 44s	remaining: 1m 9s
700:	learn: 0.1433795	test: 0.1388408	best: 0.1387642 (659)	total: 2m	remaining: 51.6s
800:	learn: 0.1420680	test: 0.1387766	best: 0.1386987 (763)	total: 2m 17s	remaining: 34.1s
900:	learn: 0.1412555	test: 0.1388566	best: 0.1386987 (763)	total: 2m 34s	remaining: 16.9s
999:	learn: 0.1400826	test: 0.1389358	best: 0.1386987 (763)	total: 2m 50s	remaining: 0us

bestTes

<catboost.core.CatBoostClassifier at 0x7f5ee182e5c0>

In [46]:
metrics = model.eval_metrics(data=pool1, metrics=['AUC'])
auc_value = np.array(metrics['AUC'][550])
grader.submit_tag('auc_550', auc_value)

Current answer for task auc_550 is: 0.9849756977745989


## Feature importances

Now we will learn how to understand which features are the most important ones. Let's first train a model that does not use feature combinations. To forbid feature combinations, set `max_ctr_complexity=1`. This speeds up the training considerably, but it reduces the resulting quality.

In [47]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=300,
    max_ctr_complexity=1,
    random_seed=43
)
model.fit(
    X, y,
    cat_features=cat_features,
    verbose=50
)

0:	learn: 0.5443376	total: 113ms	remaining: 33.9s
50:	learn: 0.1711369	total: 5.54s	remaining: 27s
100:	learn: 0.1671705	total: 11s	remaining: 21.6s
150:	learn: 0.1649220	total: 16.9s	remaining: 16.7s
200:	learn: 0.1632912	total: 22.9s	remaining: 11.3s
250:	learn: 0.1622900	total: 28.9s	remaining: 5.65s
299:	learn: 0.1613767	total: 34.8s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7f5ee182e1d0>

Let's see which features are most important for the model without feature combinations.

In [48]:
importances = model.get_feature_importance(prettified=True)
print(importances)

[('MGR_ID', 32.298927496642214), ('RESOURCE', 18.516094148023672), ('ROLE_FAMILY_DESC', 12.987791972755684), ('ROLE_ROLLUP_2', 9.009528339227415), ('ROLE_DEPTNAME', 8.502603783338843), ('ROLE_CODE', 6.465296066682523), ('ROLE_FAMILY', 5.144914966956651), ('ROLE_TITLE', 5.076292845877568), ('ROLE_ROLLUP_1', 1.9985503804954317)]


**Question 13:**

Try training the model without the restriction of combinations, with other parameters set to the same values.
What will be top 3 most important features for this model?

In [49]:
model = CatBoostClassifier(
    iterations=300,
    max_ctr_complexity=4,
    random_seed=43
)
model.fit(
    X, y,
    cat_features=cat_features,
    verbose=50
)

0:	learn: 0.5454508	total: 147ms	remaining: 44.1s
50:	learn: 0.1562605	total: 7.14s	remaining: 34.9s
100:	learn: 0.1493646	total: 14.8s	remaining: 29.1s
150:	learn: 0.1451569	total: 23.1s	remaining: 22.8s
200:	learn: 0.1432905	total: 31.1s	remaining: 15.3s
250:	learn: 0.1417401	total: 39.1s	remaining: 7.63s
299:	learn: 0.1402619	total: 47s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7f5ee182e4e0>

In [50]:
importances = model.get_feature_importance(prettified=True)
print(importances)

[('RESOURCE', 24.73509920590257), ('MGR_ID', 17.449161787258667), ('ROLE_DEPTNAME', 15.316223709876839), ('ROLE_ROLLUP_2', 11.490154799409593), ('ROLE_TITLE', 10.71183545081703), ('ROLE_FAMILY_DESC', 8.946143168072846), ('ROLE_FAMILY', 4.379723768290924), ('ROLE_CODE', 3.772023536810539), ('ROLE_ROLLUP_1', 3.199634573560977)]


In [51]:
top3 = ['RESOURCE', 'MGR_ID', 'ROLE_DEPTNAME']
grader.submit_tag('feature_importance_top3', top3)

Current answer for task feature_importance_top3 is: ['RESOURCE', 'MGR_ID', 'ROLE_DEPTNAME']
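
Instead of typing the names by hand, the top three can be selected programmatically. The sketch below assumes the importances are available as `(name, value)` pairs, as in the printed output above (values rounded here); the return type of `prettified=True` may differ in other CatBoost versions.

```python
# (name, importance) pairs copied from the printed output above, rounded.
importances = [
    ('RESOURCE', 24.74), ('MGR_ID', 17.45), ('ROLE_DEPTNAME', 15.32),
    ('ROLE_ROLLUP_2', 11.49), ('ROLE_TITLE', 10.71), ('ROLE_FAMILY_DESC', 8.95),
    ('ROLE_FAMILY', 4.38), ('ROLE_CODE', 3.77), ('ROLE_ROLLUP_1', 3.20),
]

# Sort by importance, largest first, and keep the names of the first three.
ranked = sorted(importances, key=lambda pair: pair[1], reverse=True)
top3 = [name for name, _ in ranked[:3]]
print(top3)
```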


## Shap values

Let's train the model one more time.

In [52]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=300,
    max_ctr_complexity=1,
    random_seed=43
)
model.fit(
    X, y,
    cat_features=cat_features,
    verbose=50
)

0:	learn: 0.5443376	total: 138ms	remaining: 41.2s
50:	learn: 0.1711369	total: 5.19s	remaining: 25.3s
100:	learn: 0.1671705	total: 10.4s	remaining: 20.5s
150:	learn: 0.1649220	total: 15.8s	remaining: 15.6s
200:	learn: 0.1632912	total: 21.3s	remaining: 10.5s
250:	learn: 0.1622900	total: 26.9s	remaining: 5.25s
299:	learn: 0.1613767	total: 32.3s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7f5ee1828860>

The library provides a way to understand which features are important for a given object.
Let's take a look at the whole dataset X and analyze the influence of different features on the objects from this dataset.
We will now calculate importances for each object. After that we will visualize these importances.

In [53]:
pool1 = Pool(data=X, label=y, cat_features=cat_features)
shap_values = model.get_feature_importance(data=pool1, fstr_type='ShapValues', verbose=10000)
print(shap_values.shape)

Processing trees...
128/300 trees processed	passed time: 43.3ms	remaining time: 58.2ms sec
300/300 trees processed	passed time: 127ms	remaining time: 0us sec
Processing documents...
128/32769 documents processed	passed time: 5.51ms	remaining time: 1.41s sec
10112/32769 documents processed	passed time: 256ms	remaining time: 573ms sec
20096/32769 documents processed	passed time: 499ms	remaining time: 315ms sec
30080/32769 documents processed	passed time: 758ms	remaining time: 67.8ms sec
(32769, 10)


Let's look at the prediction of the model for the 0-th object. The raw prediction is not a probability; to calculate the probability from the raw prediction you need to calculate sigmoid(raw_prediction).

In [54]:
test_objects = [X.iloc[0:1]]

for obj in test_objects:
    print('Probability of class 1 = {:.4f}'.format(model.predict_proba(obj)[0][1]))
    print('Formula raw prediction = {:.4f}'.format(model.predict(obj, prediction_type='RawFormulaVal')[0]))
    print('\n')

Probability of class 1 = 0.9899
Formula raw prediction = 4.5822




The sum of all SHAP values is equal to the resulting raw formula prediction.
We can see on the graph output below that there is a base value, which is the same for all objects.
Almost all the features have a positive influence on this object. The biggest step to the right is due to the feature `MGR_ID`.

In [55]:
# import shap
# shap.initjs()
# shap.force_plot(shap_values[0,:], X.iloc[0,:])
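
The additive property can also be checked without a plot. As far as the CatBoost documentation describes, each row of the `ShapValues` output holds one contribution per feature followed by the base (expected) value in the last position, which would explain the shape `(32769, 10)` for 9 features. The sketch below uses a made-up row with three features; the numbers are hypothetical and are not taken from the trained model.

```python
# Hypothetical SHAP row for one object: three feature contributions
# followed by the base value (assumed to be stored in the last position).
feature_names = ['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1']
shap_row = [0.5, 1.25, -0.25, 3.0]

contributions = dict(zip(feature_names, shap_row[:-1]))
base_value = shap_row[-1]

# The raw prediction is the base value plus all feature contributions.
raw_prediction = base_value + sum(contributions.values())
print('base value =', base_value)
print('raw prediction =', raw_prediction)
```

With the real `shap_values` array, the same split would be `shap_values[i, :-1]` for the contributions and `shap_values[i, -1]` for the base value, under the layout assumption stated above.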

** Question 14: **

What is the most important feature for the 91-th object?

In [56]:
most_important_feature = 'FEATURE_NAME'
grader.submit_tag('most_important', most_important_feature)

Current answer for task most_important is: FEATURE_NAME


** Question 15: **

Does it have positive or negative influence? Answer 1 if positive and -1 if negative.

In [57]:
influence_sign = 0
grader.submit_tag('shap_influence', influence_sign)

Current answer for task shap_influence is: 0


You can also view aggregated information about the influences on the whole dataset.

In [58]:
# shap.summary_plot(shap_values, X)

From this graph you can see that the values of the MGR_ID and RESOURCE features have a large negative impact for many objects.
You can also see that RESOURCE has the largest positive impact for many objects.
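
If the plot is not available, one simple numeric aggregate is the mean absolute contribution per feature. This is only a rough sketch on made-up rows: the helper `mean_abs_contribution` and the numbers are hypothetical, and each row is assumed to end with the base value.

```python
feature_names = ['RESOURCE', 'MGR_ID']

# Made-up rows in the form [RESOURCE, MGR_ID, base value].
shap_rows = [
    [0.5, -1.0, 3.0],
    [-1.5, 0.5, 3.0],
    [1.0, -3.0, 3.0],
]


def mean_abs_contribution(rows, names):
    """Average the absolute contribution of each feature over all rows."""
    totals = [0.0] * len(names)
    for row in rows:
        # Skip the trailing base value: only the first len(names) entries
        # are feature contributions.
        for i, value in enumerate(row[:len(names)]):
            totals[i] += abs(value)
    return {name: total / len(rows) for name, total in zip(names, totals)}


importance = mean_abs_contribution(shap_rows, feature_names)
print(importance)
```

Unlike the summary plot, this aggregate discards the sign, so it shows how strongly a feature influences predictions but not in which direction.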

## Saving the model

You can save your model as a binary file. It is also possible to save the model as Python or C++ code.
If you save the model as a binary file, you can later look at the parameters with which the model was trained, including learning_rate and random_seed, which are set automatically if you don't specify them.

In [59]:
my_best_model = CatBoostClassifier(iterations=10)
my_best_model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    verbose=False
)
my_best_model.save_model('catboost_model.bin')

In [60]:
my_best_model.load_model('catboost_model.bin')
print(my_best_model.get_params())
print(my_best_model.random_seed_)
print(my_best_model.learning_rate_)

{'loss_function': 'Logloss', 'iterations': 10, 'logging_level': 'Silent', 'verbose': 0}
0
0.5


## Hyperparameter tuning

You can tune the parameters to get better speed or better quality.
Here is the list of parameters that are important for speed and accuracy.

### Training speed

Here is the list of parameters that are important for speeding up the training.
Note that changing these parameters might decrease the quality.
1. iterations + learning rate
By default we train for 1000 iterations. You can decrease this number, but if you do, you need to increase the learning rate so that the process converges. By default the learning rate is set depending on the number of iterations and on your dataset, so you can simply use the default. If you want to tune it, keep in mind that the more iterations you have, the lower the learning rate should be.

2. boosting_type
By default we use Ordered boosting for smaller datasets where we want to fight overfitting. This is expensive in terms of computations. You can set boosting_type to Plain to disable this.

3. bootstrap_type
By default we sample weights from an exponential distribution. Sampling from a Bernoulli distribution is faster. To enable it, use bootstrap_type='Bernoulli' together with subsample={some value < 1}.

4. one_hot_max_size
By default we use one-hot encoding only for categorical features with a small number of distinct values. For all other categorical features we calculate statistics. This is expensive, whereas one-hot encoding is cheap, so you can speed up the training by setting one_hot_max_size to a larger value.

5. rsm
This parameter is very important because it speeds up the training and does not affect the quality. You should definitely use it, but only if you have hundreds of features.
If you have only a few features, it's better not to use this parameter.
If you have many features, the rule is the following: decrease rsm, for example to rsm=0.1. With this value the training needs more iterations to converge, usually about 20% more, but each iteration will be 10x faster. So the overall training will be faster even though the resulting model will have more trees.

6. leaf_estimation_iterations
This parameter is responsible for calculating leaf values after the tree structure has been selected.
If you have only a few features, for example 8 or 10, this step becomes the bottleneck.
The default value for this parameter depends on the training objective. You can try setting it to 1 or 5; if you have few features, this might speed up the training.

7. max_ctr_complexity
By default CatBoost generates categorical feature combinations in a greedy way.
This is time-consuming. You can disable it by setting max_ctr_complexity=1, or allow only combinations of 2 features by setting max_ctr_complexity=2.
This will speed up the training only if you have categorical features.

8. border_count
If you are training the model on GPU, you can try decreasing border_count. This is the number of splits considered for each feature. By default it's set to 128, but you can try setting it to 32. In many cases this will not degrade the quality of the model and will speed up the training considerably.
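
The rsm rule of thumb from item 5 can be turned into a quick back-of-the-envelope estimate. The sketch below only restates the arithmetic from the text (about 20% more iterations, each iteration 10x faster); the function name `estimated_speedup` is hypothetical, and real timings depend on the dataset and hardware.

```python
def estimated_speedup(extra_iterations_fraction, per_iteration_speedup):
    """Estimate the overall training speedup.

    extra_iterations_fraction: additional iterations needed to converge,
        as a fraction of the original count (0.2 means 20% more).
    per_iteration_speedup: how many times faster a single iteration is.
    """
    relative_iterations = 1.0 + extra_iterations_fraction
    relative_time = relative_iterations / per_iteration_speedup
    return 1.0 / relative_time


# Numbers quoted in the text for rsm=0.1.
speedup = estimated_speedup(0.2, 10.0)
print('Estimated overall speedup: {:.1f}x'.format(speedup))
```

Under these assumptions the training takes 1.2 / 10 = 0.12 of the original time, which is roughly an 8.3x speedup.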

In [61]:
from catboost import CatBoostClassifier
fast_model = CatBoostClassifier(
    random_seed=63,
    iterations=150,
    learning_rate=0.01,
    boosting_type='Plain',
    bootstrap_type='Bernoulli',
    subsample=0.5,
    one_hot_max_size=20,
    rsm=0.5,
    leaf_estimation_iterations=5,
    max_ctr_complexity=1,
    border_count=32)

fast_model.fit(
    X_train, y_train,
    cat_features=cat_features,
    logging_level='Silent',
#     plot=True
)

<catboost.core.CatBoostClassifier at 0x7f5ee17a8c50>

** Question 16: **

Try tuning the speed of the algorithm. What is the maximum speedup you can get by changing these parameters without decreasing the AUC on the best iteration on the eval dataset, compared to the AUC on the best iteration after training with default parameters and random seed = 0?
The answer should be a number; for example, 2.7 means you got a 2.7 times speedup.
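
One possible way to measure the speedup is to time both training runs and divide the baseline time by the tuned time. The sketch below assumes this definition; the helpers `timed` and `speedup` are hypothetical, and the lambda is only a stand-in for a real `fit` call.

```python
import time


def timed(train_fn):
    """Run train_fn once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    train_fn()
    return time.perf_counter() - start


def speedup(baseline_seconds, tuned_seconds):
    """Return how many times faster the tuned run is than the baseline."""
    return baseline_seconds / tuned_seconds


# Stand-in workload; in the notebook this would wrap a model.fit(...) call.
elapsed = timed(lambda: sum(range(1000)))

# Hypothetical timings: a 27 s baseline and a 10 s tuned run.
print(round(speedup(27.0, 10.0), 1))
```

Wall-clock timings vary between runs, so it may be worth repeating each measurement a few times before reporting a number.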

In [62]:
speedup = 0
grader.submit_tag('speedup', speedup)

Current answer for task speedup is: 0


### Accuracy

The parameters listed below are important for getting the best model quality. Try changing these parameters to improve the quality of the resulting model.

In [63]:
tunned_model = CatBoostClassifier(
    random_seed=63,
    iterations=1000,
    learning_rate=0.03,
    l2_leaf_reg=3,
    bagging_temperature=1,
    random_strength=1,
    one_hot_max_size=2,
    leaf_estimation_method='Newton',
    depth=6
)
tunned_model.fit(
    X_train, y_train,
    cat_features=cat_features,
    logging_level='Silent',
    eval_set=(X_validation, y_validation),
#     plot=True
)

<catboost.core.CatBoostClassifier at 0x7f5ee17a8278>

** Question 17: **

Try tuning these parameters to make the AUC on the eval dataset as large as possible. What is the maximum AUC value you have reached?
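
As a reminder of what is being maximized, AUC can be read as the fraction of (positive, negative) pairs that the model ranks correctly, with ties counted as half. The sketch below computes it by brute force on a tiny made-up example; the function `pairwise_auc` is hypothetical and is not how CatBoost computes the metric internally.

```python
def pairwise_auc(labels, scores):
    """Compute AUC as the share of correctly ordered positive/negative pairs."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError('AUC needs at least one object of each class')

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0   # pair ranked correctly
            elif pos_score == neg_score:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities for four objects.
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

The double loop is quadratic in the number of objects, so this is suitable only for illustration on small samples.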

In [64]:
final_auc = 0
grader.submit_tag('final_auc', final_auc)

Current answer for task final_auc is: 0


In [None]:
STUDENT_EMAIL = ''  # EMAIL HERE
STUDENT_TOKEN = ''  # TOKEN HERE
grader.status()

In [None]:
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)