Based on the [official CatBoost tutorials](https://github.com/catboost/tutorials/tree/master)

### CatBoost basics

In [21]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from catboost import CatBoostClassifier, Pool, metrics, cv
from sklearn.metrics import accuracy_score
from catboost import datasets


np.set_printoptions(precision=4)

In [10]:
train_df, test_df = datasets.titanic()

train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


*Feature Preparation*

First, let's check how many missing values we have:

In [11]:
null_value_stats = train_df.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]

Age         177
Cabin       687
Embarked      2
dtype: int64

As we can see, `Age`, `Cabin` and `Embarked` have missing values, so let's fill them with a number far outside their distributions. That way the model can easily distinguish missing entries from real values and take them into account:

In [12]:
train_df.fillna(-999, inplace=True)
test_df.fillna(-999, inplace=True)
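
To see the sentinel idea on a small scale, here is a minimal sketch on a hypothetical toy frame (the `toy` data and the `"missing"` token are assumptions made for illustration and are not part of the tutorial). It fills a numeric column with `-999` and, as one possible variant, a string column with an explicit text token:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic dataset.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Cabin": ["C85", None, "C123"],
})

filled = toy.copy()
# Numeric sentinel far below any real age.
filled["Age"] = filled["Age"].fillna(-999)
# Assumed variant: a text token keeps the column purely textual.
filled["Cabin"] = filled["Cabin"].fillna("missing")

print(filled)
print("Remaining nulls:", int(filled.isnull().sum().sum()))
```

The tutorial itself uses a single `fillna(-999)` for every column; the split above only makes the two column kinds explicit.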

In [13]:
X = train_df.drop('Survived', axis=1)
y = train_df.Survived

Note that our features are of different types: some are numeric, some are categorical, and some are plain strings, which normally need special handling (for example, encoding with a bag-of-words representation).

But in our case we can treat these string features as categorical ones; all the heavy lifting is done inside CatBoost. How cool is that? :)

In [18]:
print(X.dtypes)

# Treat every non-float column (integer and string alike) as categorical
categorical_features_indices = np.where(X.dtypes != float)[0]
print(f"Num categorical features: {len(categorical_features_indices)}")

PassengerId      int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
Num categorical features: 9
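
The selection rule above (every non-float column counts as categorical) can be checked on a small frame. The sketch below uses a hypothetical three-column `toy` frame and `pd.api.types.is_float_dtype` as an assumed, equivalent way to express the same test:

```python
import pandas as pd

# Hypothetical toy frame with one integer, one string and one float column.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Fare": [7.25, 71.2833, 7.925],
})

# Positions of all columns whose dtype is not a float dtype.
cat_idx = [
    i for i, dtype in enumerate(toy.dtypes)
    if not pd.api.types.is_float_dtype(dtype)
]
cat_names = [toy.columns[i] for i in cat_idx]

print(cat_idx, cat_names)
```

Note that integer columns such as `Pclass` end up in the categorical list under this rule, which is also why the tutorial reports 9 categorical features out of 11.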


In [20]:
X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.75, random_state=42)

X_test = test_df

*Model Training*

Now let's create the model itself. We will use the default parameters, as they provide a really good baseline almost all the time. The only thing we specify here is the `custom_loss` parameter, which lets us track accuracy, the metric of this competition, alongside logloss, which is smoother on a dataset of this size.

In [25]:
model = CatBoostClassifier(
    custom_loss=[metrics.Accuracy()],
    random_seed=42,
    logging_level='Silent'
)

In [26]:
model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_validation, y_validation),
    # logging_level='Verbose',  # you can uncomment this for text output
    plot=True
);

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))
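
To make the "logloss is smoother than accuracy" remark concrete, here is a small sketch with hypothetical labels and predicted probabilities (all numbers are made up for illustration). Both sets of predictions classify every example correctly, so accuracy cannot tell them apart, while logloss rewards the more confident one:

```python
import numpy as np

# Hypothetical labels and two sets of predicted probabilities for class 1.
y_true = np.array([1, 0, 1, 0])
p_cautious = np.array([0.6, 0.4, 0.6, 0.4])
p_confident = np.array([0.9, 0.1, 0.9, 0.1])

def accuracy(y, p):
    # Fraction of examples whose thresholded prediction matches the label.
    return float(np.mean((p >= 0.5).astype(int) == y))

def logloss(y, p):
    # Mean negative log-likelihood of the labels under the probabilities.
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

for name, p in [("cautious", p_cautious), ("confident", p_confident)]:
    print(name, accuracy(y_true, p), round(logloss(y_true, p), 4))
```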

*Model Cross-Validation*

It is good to validate your model, but cross-validating it is even better. And with plots, too! So without further ado:

In [27]:
cv_params = model.get_params()
cv_params.update({
    'loss_function': metrics.Logloss()
})
cv_data = cv(
    Pool(X, y, cat_features=categorical_features_indices),
    cv_params,
    plot=True
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

In [29]:
cv_data

| | iterations | test-Logloss-mean | test-Logloss-std | train-Logloss-mean | train-Logloss-std | test-Accuracy-mean | test-Accuracy-std | train-Accuracy-mean | train-Accuracy-std |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.676936 | 0.001133 | 0.676477 | 0.003152 | 0.794613 | 0.003367 | 0.798541 | 0.020778 |
| 1 | 1 | 0.660661 | 0.000697 | 0.659381 | 0.003172 | 0.795735 | 0.030365 | 0.812009 | 0.010286 |
| 2 | 2 | 0.646543 | 0.001920 | 0.645228 | 0.004168 | 0.803591 | 0.028636 | 0.812009 | 0.014119 |
| 3 | 3 | 0.632857 | 0.003376 | 0.631048 | 0.004247 | 0.804714 | 0.026725 | 0.812570 | 0.012179 |
| 4 | 4 | 0.619750 | 0.004936 | 0.617523 | 0.005041 | 0.803591 | 0.026153 | 0.813692 | 0.011459 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 995 | 0.454106 | 0.056041 | 0.122992 | 0.010388 | 0.820426 | 0.016947 | 0.976992 | 0.003504 |
| 996 | 996 | 0.454059 | 0.056043 | 0.122951 | 0.010359 | 0.820426 | 0.016947 | 0.976992 | 0.003504 |
| 997 | 997 | 0.453869 | 0.055734 | 0.122882 | 0.010412 | 0.820426 | 0.016947 | 0.976992 | 0.003504 |
| 998 | 998 | 0.453916 | 0.055660 | 0.122825 | 0.010381 | 0.820426 | 0.016947 | 0.976992 | 0.003504 |


In [30]:
print('Best validation accuracy score: {:.2f}±{:.2f} on step {}'.format(
    np.max(cv_data['test-Accuracy-mean']),
    cv_data['test-Accuracy-std'][np.argmax(cv_data['test-Accuracy-mean'])],
    np.argmax(cv_data['test-Accuracy-mean'])
))

Best validation accuracy score: 0.83±0.02 on step 528
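
The same "best step" lookup can be written with pandas' `idxmax`. The sketch below runs on a hypothetical frame, `toy_cv`, whose column names mirror the cross-validation output above but whose values are invented for illustration:

```python
import pandas as pd

# Hypothetical values shaped like the columns of the cv output above.
toy_cv = pd.DataFrame({
    "iterations": [0, 1, 2, 3],
    "test-Accuracy-mean": [0.79, 0.81, 0.83, 0.82],
    "test-Accuracy-std": [0.01, 0.02, 0.02, 0.03],
})

# Row with the highest mean test accuracy across folds.
best_row = toy_cv.loc[toy_cv["test-Accuracy-mean"].idxmax()]
best_step = int(best_row["iterations"])

summary = "{:.2f}±{:.2f} on step {}".format(
    best_row["test-Accuracy-mean"],
    best_row["test-Accuracy-std"],
    best_step,
)
print(summary)
```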


*Model Applying*

In [31]:
predictions = model.predict(X_test)
predictions_probs = model.predict_proba(X_test)
print(predictions[:10])
print(predictions_probs[:10])

[0 0 0 0 1 0 1 0 1 0]
[[0.8501 0.1499]
 [0.7579 0.2421]
 [0.8753 0.1247]
 [0.8782 0.1218]
 [0.2907 0.7093]
 [0.8921 0.1079]
 [0.337  0.663 ]
 [0.7877 0.2123]
 [0.3932 0.6068]
 [0.9494 0.0506]]
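
The two outputs above are related: for a binary problem, the predicted class is the column with the larger probability. The sketch below takes three probability rows copied from the printed output and assumes the default 0.5 decision boundary:

```python
import numpy as np

# Rows 0, 4 and 6 of the predict_proba output printed above.
probs = np.array([
    [0.8501, 0.1499],
    [0.2907, 0.7093],
    [0.337, 0.663],
])

# Class with the highest probability in each row.
classes = probs.argmax(axis=1)
# Equivalent for two classes: threshold the class-1 probability at 0.5.
thresholded = (probs[:, 1] > 0.5).astype(int)

print(classes, thresholded)
```

These values agree with entries 0, 4 and 6 of the `predict` output shown above.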


**More features**

Let's define some params and create a `Pool` for convenience. It stores all the information about the dataset (features, labels, categorical feature indices, weights and much more).

In [43]:
params = {
    'iterations': 500,
    'learning_rate': 0.1,
    'eval_metric': metrics.Accuracy(),
    'random_seed': 42,
    'logging_level': 'Silent',
    'use_best_model': False
}
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
validate_pool = Pool(X_validation, y_validation, cat_features=categorical_features_indices)

*Early Stopping*

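If you have a validation set, it is almost always easier and better to use early stopping. In addition to improving quality, it also saves time. The second cell below enables the overfitting detector with `od_type='Iter'` and `od_wait=40`, which stops training once the evaluation metric has not improved for 40 iterations.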

In [44]:
%%time
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=validate_pool);

CPU times: user 4.59 s, sys: 2.23 s, total: 6.81 s
Wall time: 1.25 s


In [45]:
%%time
earlystop_params = params.copy()
earlystop_params.update({
    'od_type': 'Iter',
    'od_wait': 40
})
earlystop_model = CatBoostClassifier(**earlystop_params)
earlystop_model.fit(train_pool, eval_set=validate_pool);

CPU times: user 456 ms, sys: 232 ms, total: 688 ms
Wall time: 165 ms


In [47]:
print('Simple model tree count: {}'.format(model.tree_count_))
print('Simple model validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, model.predict(X_validation))
))
print('')

print('Early-stopped model tree count: {}'.format(earlystop_model.tree_count_))
print('Early-stopped model validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, earlystop_model.predict(X_validation))
))

Simple model tree count: 500
Simple model validation accuracy: 0.7848

Early-stopped model tree count: 64
Early-stopped model validation accuracy: 0.8117
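
The stopping rule can be illustrated without CatBoost. The function below is a simplified sketch of an iteration-based detector, not CatBoost's actual implementation; `early_stop_iteration` and the `scores` list are hypothetical names and data:

```python
def early_stop_iteration(scores, wait):
    """Return (stop_index, best_index) for a higher-is-better metric.

    Training stops at the first iteration that is `wait` steps past
    the best score seen so far without having improved on it.
    """
    best_index = 0
    for i, score in enumerate(scores):
        if score > scores[best_index]:
            best_index = i
        elif i - best_index >= wait:
            return i, best_index
    # The metric kept improving often enough: no early stop.
    return len(scores) - 1, best_index

# Hypothetical validation accuracy per iteration.
scores = [0.70, 0.75, 0.80, 0.79, 0.78, 0.80, 0.77, 0.76]
stop_at, best_at = early_stop_iteration(scores, wait=3)
print(stop_at, best_at)
```

In this toy run the best score appears at iteration 2 and training stops at iteration 5, so the last two iterations are never computed.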


*Using Baseline*

It is possible to use pre-training results (a baseline) for training.

In [48]:
current_params = params.copy()
current_params.update({
    'iterations': 10
})
model = CatBoostClassifier(**current_params).fit(X_train, y_train, categorical_features_indices)
# Get baseline (only with prediction_type='RawFormulaVal')
baseline = model.predict(X_train, prediction_type='RawFormulaVal')
# Fit new model
model.fit(X_train, y_train, categorical_features_indices, baseline=baseline);
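
To see why the baseline must be taken as `RawFormulaVal`, recall that for binary classification with logloss the raw value is a log-odds score, and the probability is its sigmoid. The sketch below assumes that the baseline acts as a starting raw score to which the newly fitted trees add their own contribution; all numbers are hypothetical:

```python
import numpy as np

def sigmoid(x):
    # Map a raw log-odds score to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw scores from a first model (the baseline).
raw_baseline = np.array([-1.0, 0.0, 2.0])
# Hypothetical corrections learned by a second model on top of it.
raw_correction = np.array([0.5, 0.0, -0.5])

# Raw scores add up; probabilities do not.
combined_raw = raw_baseline + raw_correction
combined_probs = sigmoid(combined_raw)

print(combined_raw, combined_probs.round(4))
```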

*User Defined Objective Function*

It is possible to create your own objective function. Let's create a logloss objective function.

In [53]:
class LoglossObjective(object):
    def calc_ders_range(self, approxes, targets, weights):
        # approxes, targets, weights are indexed containers of floats
        # (containers which have only __len__ and __getitem__ defined).
        # weights parameter can be None.
        #
        # To understand what these parameters mean, assume that there is
        # a subset of your dataset that is currently being processed.
        # approxes contains current predictions for this subset,
        # targets contains target values you provided with the dataset.
        #
        # This function should return a list of pairs (der1, der2), where
        # der1 is the first derivative of the loss function with respect
        # to the predicted value, and der2 is the second derivative.
        #
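        # In our case, logloss is defined by the following formula
        # (written as a log-likelihood, i.e. the negative of the usual
        # logloss, which is why der2 below is negative):
        # target * log(sigmoid(approx)) + (1 - target) * log(1 - sigmoid(approx))
        # where sigmoid(x) = 1 / (1 + e^(-x)).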
        
        assert len(approxes) == len(targets)
        if weights is not None:
            assert len(weights) == len(approxes)
        
        result = []
        for index in range(len(targets)):
            e = np.exp(approxes[index])
            p = e / (1 + e)
            der1 = (1 - p) if targets[index] > 0.0 else -p
            der2 = -p * (1 - p)

            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]

            result.append((der1, der2))
        return result
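
The derivatives used in `calc_ders_range` can be checked numerically. The sketch below is independent of CatBoost: it compares the closed forms `der1 = target - p` and `der2 = -p * (1 - p)` against central finite differences of the log-likelihood. The helper names are hypothetical:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_likelihood(approx, target):
    # target * log(p) + (1 - target) * log(1 - p), with p = sigmoid(approx)
    p = sigmoid(approx)
    return target * math.log(p) + (1 - target) * math.log(1 - p)

def analytic_ders(approx, target):
    # Closed-form first and second derivatives with respect to approx.
    p = sigmoid(approx)
    return target - p, -p * (1 - p)

def numeric_ders(approx, target, h=1e-4):
    # Central finite differences for the same two derivatives.
    f = lambda a: log_likelihood(a, target)
    d1 = (f(approx + h) - f(approx - h)) / (2 * h)
    d2 = (f(approx + h) - 2 * f(approx) + f(approx - h)) / (h * h)
    return d1, d2

for approx, target in [(0.3, 1), (-1.2, 0), (2.0, 1)]:
    print(analytic_ders(approx, target), numeric_ders(approx, target))
```

For `target = 1` the first derivative reduces to `1 - p`, and for `target = 0` to `-p`, which matches the branch in the class above.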

In [54]:
model = CatBoostClassifier(
    iterations=10,
    random_seed=42, 
    loss_function=LoglossObjective(), 
    eval_metric=metrics.Logloss()
)
# Fit model
model.fit(train_pool)
# Only prediction_type='RawFormulaVal' is allowed with custom `loss_function`
preds_raw = model.predict(X_test, prediction_type='RawFormulaVal')

0:	learn: 0.6827074	total: 354ms	remaining: 3.19s
1:	learn: 0.6723302	total: 356ms	remaining: 1.43s
2:	learn: 0.6619449	total: 357ms	remaining: 834ms
3:	learn: 0.6521466	total: 360ms	remaining: 540ms
4:	learn: 0.6435227	total: 361ms	remaining: 361ms
5:	learn: 0.6353848	total: 362ms	remaining: 242ms
6:	learn: 0.6277210	total: 364ms	remaining: 156ms
7:	learn: 0.6210282	total: 364ms	remaining: 91ms
8:	learn: 0.6141958	total: 365ms	remaining: 40.6ms
9:	learn: 0.6073236	total: 367ms	remaining: 0us


*Feature Importances*

Sometimes it is very important to understand which features made the greatest contribution to the final result. For this, the CatBoost model has a `get_feature_importance` method.

In [56]:
model = CatBoostClassifier(iterations=50, random_seed=42, logging_level='Silent').fit(train_pool)
feature_importances = model.get_feature_importance(train_pool)
feature_names = X_train.columns
for score, name in sorted(zip(feature_importances, feature_names), reverse=True):
    print('{}: {}'.format(name, score))

Sex: 59.00409201426859
Pclass: 16.34088716974706
Ticket: 6.028107169932189
Cabin: 3.834724220256021
Fare: 3.7129696679343884
Age: 3.4844512041824807
Parch: 3.3780897403558634
Embarked: 2.3139994072899537
SibSp: 1.9026794060334498
PassengerId: 0.0
Name: 0.0
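
The scores printed above add up to 100, so each one can be read as a percentage share. The sketch below uses the printed values rounded to four decimals and finds how many of the top features are needed to cover 90% of the total; the 90% cut-off is an arbitrary choice for illustration:

```python
# Importances copied from the output above, rounded to four decimals.
importances = {
    "Sex": 59.0041, "Pclass": 16.3409, "Ticket": 6.0281, "Cabin": 3.8347,
    "Fare": 3.7130, "Age": 3.4845, "Parch": 3.3781, "Embarked": 2.3140,
    "SibSp": 1.9027, "PassengerId": 0.0, "Name": 0.0,
}

total = sum(importances.values())

# Walk down the ranking until the cumulative share reaches the cut-off.
top_features, cumulative = [], 0.0
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    top_features.append(name)
    cumulative += score
    if cumulative >= 90.0:
        break

print(round(total, 2), top_features)
```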


*Eval Metrics*

CatBoost has an `eval_metrics` method that calculates the given metrics on a given dataset.

In [57]:
model = CatBoostClassifier(iterations=50, random_seed=42, logging_level='Silent').fit(train_pool)
eval_metrics = model.eval_metrics(validate_pool, [metrics.AUC()], plot=True)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))
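
As a reminder of what the AUC metric measures, it can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The pairwise sketch below is a plain-Python illustration on hypothetical labels and scores, not the routine CatBoost uses:

```python
def pairwise_auc(labels, scores):
    # Split scores by class.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count positive-negative pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores.
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Three of the four positive-negative pairs are ordered correctly here, giving an AUC of 0.75.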

*Learning Processes Comparison*

You can also compare the learning processes of different models on a single plot.

In [59]:
model1 = CatBoostClassifier(iterations=100, depth=1, train_dir='model_depth_1/', logging_level='Silent')
model1.fit(train_pool, eval_set=validate_pool)
model2 = CatBoostClassifier(iterations=100, depth=5, train_dir='model_depth_5/', logging_level='Silent')
model2.fit(train_pool, eval_set=validate_pool);

In [60]:
from catboost import MetricVisualizer
widget = MetricVisualizer(['model_depth_1', 'model_depth_5'])
widget.start()

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

**Parameters Tuning**

While you can always select the optimal number of iterations (boosting steps) using cross-validation and learning-curve plots, it is also important to tune some of the model parameters. We pay special attention to `l2_leaf_reg` and `learning_rate`.

We'll select these parameters using the `hyperopt` package.

In [62]:
# !pip install hyperopt

In [64]:
import hyperopt

def hyperopt_objective(params):
    model = CatBoostClassifier(
        l2_leaf_reg=int(params['l2_leaf_reg']),
        learning_rate=params['learning_rate'],
        iterations=500,
        eval_metric=metrics.Accuracy(),
        random_seed=42,
        verbose=False,
        loss_function=metrics.Logloss(),
    )
    
    cv_data = cv(
        Pool(X, y, cat_features=categorical_features_indices),
        model.get_params(),
        logging_level='Silent',
    )
    best_accuracy = np.max(cv_data['test-Accuracy-mean'])
    
    return 1 - best_accuracy # as hyperopt minimises

In [85]:
# params_space = {
#     'l2_leaf_reg': hyperopt.hp.qloguniform('l2_leaf_reg', 0, 2, 1),
#     'learning_rate': hyperopt.hp.uniform('learning_rate', 1e-3, 5e-1),
# }

# trials = hyperopt.Trials()

# best = hyperopt.fmin(
#     hyperopt_objective,
#     space=params_space,
#     algo=hyperopt.tpe.suggest,
#     max_evals=50,
#     trials=trials,
#     rstate=np.random.default_rng(123)
# )

# print(best)
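
To get a feel for the commented-out search space without installing `hyperopt`, the sketch below draws samples with NumPy. It assumes the documented behaviour of `hp.qloguniform(label, low, high, q)`, which returns a value like `round(exp(uniform(low, high)) / q) * q`; the helper `sample_qloguniform` is a hypothetical stand-in, not part of `hyperopt`:

```python
import numpy as np

rng = np.random.default_rng(123)

def sample_qloguniform(low, high, q, size):
    # Stand-in for hp.qloguniform: quantised exp of a uniform draw.
    return np.round(np.exp(rng.uniform(low, high, size)) / q) * q

# Same bounds as in the commented-out params_space above.
l2_leaf_reg = sample_qloguniform(0, 2, 1, size=1000)
learning_rate = rng.uniform(1e-3, 5e-1, size=1000)

print("l2_leaf_reg values:", np.unique(l2_leaf_reg))
print("learning_rate range:", learning_rate.min(), learning_rate.max())
```

With these bounds `l2_leaf_reg` takes integer values from 1 to 7, since `exp(0) = 1` and `exp(2)` is about 7.39.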

---

More tutorials in the [repo](https://github.com/catboost/tutorials/tree/master)