# Tutorial on basic functions

For classification you will need to use `CatBoostClassifier`.

### Binary classification

There are several `loss_function`s to use:

* `Logloss`,
* `CrossEntropy`,
* or a custom objective function.

In [1]:
from catboost import CatBoostClassifier

In [2]:
# Initialize data
cat_features = [0,1,2] # indices for cat_features
train_data = [["a","b",1,4,5,6],["a","b",4,5,6,7],["c","d",30,40,50,60]] # cat_features should be string or int
train_labels = [1,1,-1]
test_data = [["a","b",2,4,6,8],["a","d",1,4,50,60]]

In [3]:
# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=2, learning_rate=1, depth=2, loss_function='Logloss')
# Fit model
model.fit(train_data, train_labels, cat_features)

<catboost.core._CatBoostBase at 0x112f14c50>

In [4]:
# Get predicted classes
preds_class = model.predict(test_data)
# Get predicted probabilities for each class
preds_proba = model.predict_proba(test_data)
# Get predicted RawFormulaVal
preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')
print(preds_class)

[ 1.  1.]
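
The three prediction types are related. As a minimal sketch, assuming that for `Logloss` the `RawFormulaVal` output is a log-odds score and that the positive-class probability is its sigmoid, the conversion can be written in plain Python. The helper names `sigmoid` and `raw_to_class` and the raw values are illustrative only; they are not part of the CatBoost API and not model output.

```python
import math

def sigmoid(x):
    # Logistic function, written so that exp() never overflows.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def raw_to_class(raw, threshold=0.5):
    # Hypothetical helper: 1 for the positive class, 0 otherwise.
    return 1 if sigmoid(raw) > threshold else 0

raw_values = [-2.0, 0.0, 3.0]  # made-up raw scores
probabilities = [sigmoid(r) for r in raw_values]
classes = [raw_to_class(r) for r in raw_values]
print(classes)  # [0, 0, 1]
```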


For multiclass classification, use `loss_function='MultiClass'`.

In [5]:
from sklearn.datasets import load_iris
iris = load_iris()

In [6]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

In [7]:
# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=100, learning_rate=1, depth=6, loss_function='MultiClass', classes_count=3)
# Fit model
model.fit(x_train, y_train)

<catboost.core._CatBoostBase at 0x115d7c890>

In [8]:
from sklearn.metrics import accuracy_score, log_loss
# Get predicted classes
preds_class = model.predict(x_test)
print("accuracy = {}".format(accuracy_score(y_test, preds_class)))
# Get predicted probabilities for each class
preds_proba = model.predict_proba(x_test)
print("log_loss = {}".format(log_loss(y_test, preds_proba)))

accuracy = 1.0
log_loss = 0.015072778287


You can also use the results of a previously trained model as a baseline.

Only raw values (`RawFormulaVal`, not probabilities or classes) can be set as the baseline for the new model.
The shape of the baseline depends on the machine learning problem being solved:

* Classification: a two-dimensional array with `shape = (length of data, number of classes)`
* Regression: a one-dimensional array.
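
To make the expected shapes concrete, the following sketch builds placeholder baselines from plain lists. The helper `shape_of` and the zero values are illustrative assumptions; in practice the values would come from a previous model's `RawFormulaVal` predictions.

```python
def shape_of(array):
    # Hypothetical helper: shape of a list or of a list of equal-length lists.
    if array and isinstance(array[0], (list, tuple)):
        return (len(array), len(array[0]))
    return (len(array),)

n_objects, n_classes = 5, 3

# Classification: one raw value per object and per class.
classification_baseline = [[0.0] * n_classes for _ in range(n_objects)]
# Regression: one raw value per object.
regression_baseline = [0.0] * n_objects

print(shape_of(classification_baseline))  # (5, 3)
print(shape_of(regression_baseline))      # (5,)
```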

In [9]:
# Get baseline (only with prediction_type='RawFormulaVal')
baseline = model.predict(x_train, prediction_type='RawFormulaVal')
# Fit new model
model.fit(x_train, y_train, baseline=baseline)

<catboost.core._CatBoostBase at 0x115d7c890>

### Using object weights

The weight for each object in the input data can be set as a one-dimensional array-like object whose length equals the number of objects.
The weight is used for calculating the final values of the trees. By default, it is set to 1 for all objects.
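
One possible way to choose weights, shown here only as an illustration, is to weight each object inversely to the frequency of its class. The helper name `inverse_frequency_weights` and the toy labels are hypothetical; the result is a one-dimensional list with one weight per object, which is the form described above.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    # Hypothetical helper: rarer classes get proportionally larger weights.
    counts = Counter(labels)
    n_objects = float(len(labels))
    n_classes = len(counts)
    return [n_objects / (n_classes * counts[label]) for label in labels]

labels = [0, 0, 0, 1]  # made-up labels
weights = inverse_frequency_weights(labels)
print(len(weights) == len(labels))  # True
```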

In [10]:
import numpy as np
weight = np.random.random(len(x_train))
model.fit(x_train, y_train, sample_weight=weight)

<catboost.core._CatBoostBase at 0x115d7c890>

### Using best model

If the `use_best_model` parameter is set, the number of trees saved in the resulting model is determined as follows:

1. Build the number of trees defined by the training parameters.
2. Identify the iteration with the optimal loss function value.
3. Discard all trees built after this iteration.

The `eval_set` parameter is required by the `fit` method when best model mode is on.
`eval_set` should have the same values as the training dataset for the following parameters:

* `cat_features`
* `baseline` (if set).
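
The selection rule can be sketched in plain Python. This is only an illustration, under the assumption that a lower metric value is better; `losses_per_iteration` and `best_tree_count` are made-up names and numbers, not CatBoost output or API.

```python
def best_tree_count(losses_per_iteration):
    # Index of the smallest evaluation loss (the first one wins on ties),
    # converted to a tree count: the iteration at index i uses i + 1 trees.
    best_index = min(range(len(losses_per_iteration)),
                     key=lambda i: losses_per_iteration[i])
    return best_index + 1

losses_per_iteration = [0.90, 0.55, 0.40, 0.42, 0.47]  # made-up values
trees_to_keep = best_tree_count(losses_per_iteration)
print(trees_to_keep)  # 3
```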

In [11]:
eval_set = (x_test, y_test)
model.fit(x_train, y_train, use_best_model=True, eval_set=eval_set)

<catboost.core._CatBoostBase at 0x115d7c890>

### Using `staged_predict`

The values of the model can be output for each i-th tree of the model by taking into consideration only the trees in the range [1;i].
This feature is implemented via the `staged_predict` method (available in the `CatBoostClassifier` and `CatBoostRegressor` classes).
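
Conceptually, the staged value for tree i is built from the outputs of trees 1 through i. A minimal sketch with made-up per-tree contributions for a single object (not produced by CatBoost, and ignoring any bias or scaling the real model may apply):

```python
from itertools import accumulate

# Hypothetical raw contribution of each tree for one object.
tree_outputs = [0.5, -0.25, 0.125, 0.0625]

# Running sums: element i uses only the first i + 1 trees.
staged_raw_values = list(accumulate(tree_outputs))
print(staged_raw_values)  # [0.5, 0.25, 0.375, 0.4375]
```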

In [12]:
staged_predictions = model.staged_predict(x_test)
# Equivalent to calling predict() with `ntree_limit` in a loop
staged_predictions = []
for i in range(1, model.tree_count_ + 1):
    staged_predictions.append(model.predict(x_test, ntree_limit=i))

### Custom objective function

A custom objective function can be used by passing a Python object as the value of the `loss_function` parameter. In this case the objective is always maximized.
Depending on the machine learning problem, the Python object should define one of the following methods:

* `calc_ders_range(approxes, targets, weights)`
* `calc_ders_multi(approxes, target, weight)` (for multiclassification)

In [13]:
import math
class LoglossObjective(object):
    def calc_ders_range(self, approxes, targets, weights):
        # approxes, targets, weights are indexed containers of floats
        # (containers which have only __len__ and __getitem__ defined).
        # weights parameter can be None.
        #
        # To understand what these parameters mean, assume that there is
        # a subset of your dataset that is currently being processed.
        # approxes contains current predictions for this subset,
        # targets contains target values you provided with the dataset.
        #
        # This function should return a list of pairs (der1, der2), where
        # der1 is the first derivative of the loss function with respect
        # to the predicted value, and der2 is the second derivative.
        #
        # In our case, logloss is defined by the following formula
        # (written as a log-likelihood, since the objective is maximized):
        # target * log(sigmoid(approx)) + (1 - target) * log(1 - sigmoid(approx))
        # where sigmoid(x) = 1 / (1 + e^(-x)).
        assert len(approxes) == len(targets)
        if weights is not None:
            assert len(weights) == len(approxes)

        result = []
        for index in range(len(targets)):
            e = math.exp(approxes[index])
            p = e / (1 + e)
            der1 = (1 - p) if targets[index] > 0.0 else -p
            der2 = -p * (1 - p)

            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]

            result.append((der1, der2))

        return result
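
A quick way to validate hand-written derivatives is to compare them with finite differences. The sketch below does this for the Logloss objective, written as a log-likelihood (the quantity being maximized), for a single object with weight 1. The function names and the step size `h` are illustrative choices, not part of CatBoost.

```python
import math

def log_likelihood(approx, target):
    p = 1.0 / (1.0 + math.exp(-approx))
    return target * math.log(p) + (1 - target) * math.log(1 - p)

def analytic_ders(approx, target):
    # Same expressions as in the objective class above.
    p = 1.0 / (1.0 + math.exp(-approx))
    der1 = (1 - p) if target > 0.0 else -p
    der2 = -p * (1 - p)
    return der1, der2

def numeric_ders(approx, target, h=1e-4):
    # Central finite differences for the first and second derivative.
    f_plus = log_likelihood(approx + h, target)
    f_zero = log_likelihood(approx, target)
    f_minus = log_likelihood(approx - h, target)
    der1 = (f_plus - f_minus) / (2 * h)
    der2 = (f_plus - 2 * f_zero + f_minus) / (h * h)
    return der1, der2

for target in (0.0, 1.0):
    for approx in (-1.5, 0.0, 2.0):
        a1, a2 = analytic_ders(approx, target)
        n1, n2 = numeric_ders(approx, target)
        assert abs(a1 - n1) < 1e-6
        assert abs(a2 - n2) < 1e-5
```

If an assertion fails after changing the objective, the analytic derivatives and the formula no longer agree.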

In [14]:
model = CatBoostClassifier(random_seed=0, loss_function=LoglossObjective(), eval_metric="Logloss")
# Fit model
model.fit(x_train, y_train)
# Only prediction_type='RawFormulaVal' is allowed with custom `loss_function`
preds_raw = model.predict(x_test, prediction_type='RawFormulaVal') 

### Custom metric function for overfitting detector and best model selection

To set a custom metric function for overfitting detector and best model selection, create an object that implements the following interface:

In [15]:
class CustomMetric(object):
    def get_final_error(self, error, weight):
        return 0.0

    def is_max_optimal(self):
        return True

    def evaluate(self, approxes, target, weight):
        # approxes - list of list-like objects (one object per approx dimension)
        # target - list-like object
        # weight - list-like object, can be None
        return 0.0, 0.0

The following is an example of an implementation of the Logloss function:

In [16]:
import math
class LoglossMetric(object):
    def get_final_error(self, error, weight):
        return error / (weight + 1e-38)

    def is_max_optimal(self):
        return True

    def evaluate(self, approxes, target, weight):
        # approxes is a list of indexed containers
        # (containers with only __len__ and __getitem__ defined),
        # one container per approx dimension.
        # Each container contains floats.
        # weight is a one-dimensional indexed container.
        # target is a one-dimensional indexed container of floats.
        
        # weight parameter can be None.
        # Returns pair (error, weights sum)
        
        assert len(approxes) == 1
        assert len(target) == len(approxes[0])

        approx = approxes[0]

        error_sum = 0.0
        weight_sum = 0.0

        for i in range(len(approx)):
            w = 1.0 if weight is None else weight[i]
            weight_sum += w
            error_sum += w * (target[i] * approx[i] - math.log(1 + math.exp(approx[i])))

        return error_sum, weight_sum
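
`math.exp` raises `OverflowError` for large arguments (roughly above 709), so `math.log(1 + math.exp(approx[i]))` can fail on very confident predictions. One possible hardening, offered as a sketch rather than as part of the original example, is a numerically stable form of the same term. The helper names `stable_log1p_exp` and `object_log_likelihood` are hypothetical.

```python
import math

def stable_log1p_exp(x):
    # log(1 + exp(x)) without overflow:
    # max(x, 0) + log(1 + exp(-|x|)) is algebraically the same value.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def object_log_likelihood(target, approx):
    # Per-object term used in the metric above, with the stable helper.
    return target * approx - stable_log1p_exp(approx)

print(stable_log1p_exp(1000.0))  # 1000.0
```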


In [17]:
model = CatBoostClassifier(iterations=5, random_seed=0, eval_metric=LoglossMetric())
model.fit(x_train, y_train, use_best_model=True, eval_set=(x_test, y_test))

<catboost.core._CatBoostBase at 0x115d7cb50>

Pass the created object to the `eval_metric` parameter:

In [18]:
model = CatBoostClassifier(eval_metric=CustomMetric())
model.fit(x_train, y_train, use_best_model=True, eval_set=(x_test, y_test))

<catboost.core._CatBoostBase at 0x115d7c650>

Use cross validation:

In [19]:
# From catboost
from catboost import cv, Pool
pool = Pool(x_train, y_train)
params = {'iterations': 100, 'depth': 2, 'loss_function': 'MultiClass', 'classes_count': 3}
scores = cv(params, pool)

`cv` returns the following structure:

In [20]:
scores.keys()

['MultiClass_test_stddev',
 'MultiClass_test_avg',
 'MultiClass_train_stddev',
 'MultiClass_train_avg']

Each array contains the loss value at each iteration (similar to `staged_predict`).

Cross validation can also be run through `sklearn`:

In [21]:
# From sklearn
from sklearn.model_selection import cross_val_score

model = CatBoostClassifier(iterations=400, loss_function='MultiClass', custom_loss='Accuracy')
scores = cross_val_score(model, x_train, y_train, scoring='accuracy', n_jobs=-1)

CatBoost also supports `GridSearchCV` from `sklearn`:

In [22]:
from sklearn.model_selection import GridSearchCV

# Initialize
model = CatBoostClassifier(loss_function='MultiClass')
# Init params dict
params = {'iterations': [10, 20, 30], 'learning_rate': [0.01, 0.03, 0.1], 'depth': [4, 6, 8]}
# Get GridSearchCV
model = GridSearchCV(model, params)
model.fit(x_train, y_train)
print(sorted(model.cv_results_.keys()))

['mean_fit_time', 'mean_score_time', 'mean_test_score', 'mean_train_score', 'param_depth', 'param_iterations', 'param_learning_rate', 'params', 'rank_test_score', 'split0_test_score', 'split0_train_score', 'split1_test_score', 'split1_train_score', 'split2_test_score', 'split2_train_score', 'std_fit_time', 'std_score_time', 'std_test_score', 'std_train_score']
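
The size of the search follows directly from the parameter grid. The sketch below enumerates the combinations with `itertools.product`; the fold count of 3 is read off the `split0` to `split2` keys in the output above and otherwise depends on the `sklearn` version and the `cv` setting in use.

```python
from itertools import product

params = {'iterations': [10, 20, 30],
          'learning_rate': [0.01, 0.03, 0.1],
          'depth': [4, 6, 8]}

# One dict per point of the grid, built in a fixed parameter order.
names = sorted(params)
combinations = [dict(zip(names, values))
                for values in product(*(params[name] for name in names))]

n_folds = 3  # assumption taken from the split0..split2 keys
print(len(combinations))            # 27
print(len(combinations) * n_folds)  # 81
```

Each combination is fitted once per fold, so even this small grid trains 81 models before the final refit.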


Save model to file:

In [23]:
model = CatBoostClassifier(iterations=400, loss_function='MultiClass', custom_loss='Accuracy')
model.fit(x_train, y_train)
model.save_model('model')

Load model:

In [24]:
model.load_model('model')

<catboost.core._CatBoostBase at 0x115d7c610>

Copy model:

In [25]:
model_copy = model.copy()

Get `feature_importance`:

In [26]:
# To get the feature_importance_ attribute, pass `calc_feature_importance=True` to the model
model = CatBoostClassifier(iterations=400, loss_function='MultiClass', calc_feature_importance=True)
model.fit(x_train, y_train)
print(model.feature_importance_)

[2.972921562304851, 5.810795080778263, 45.1154229990345, 46.10086035788239]


Get `feature_importances` for a different pool:

In [27]:
model.get_feature_importance(x_test, y_test)

[2.9038241071085666, 7.283882675993855, 43.69750305687429, 46.11479016002329]

Get model params:

In [28]:
# params used for fitting the model
model.get_params()

{'calc_feature_importance': True,
 'depth': 6,
 'eval_metric': u'MultiClass',
 'feature_border_type': u'MinEntropy',
 'gradient_iterations': 1,
 'has_time': False,
 'iterations': 400,
 'l2_leaf_reg': 3,
 'leaf_estimation_method': u'Newton',
 'learning_rate': 0.03,
 'loss_function': u'MultiClass',
 'name': u'experiment',
 'random_seed': 118685519094188,
 'random_strength': 1,
 'rsm': 1,
 'store_all_simple_ctr': False,
 'thread_count': 4,
 'use_best_model': False,
 'verbose': False}

Get model init params:

In [29]:
# params used for initializing the model
model.get_init_params()

{'calc_feature_importance': True,
 'depth': 6,
 'feature_border_type': 'MinEntropy',
 'has_time': False,
 'iterations': 400,
 'l2_leaf_reg': 3,
 'learning_rate': 0.03,
 'loss_function': 'MultiClass',
 'name': 'experiment',
 'random_strength': 1,
 'rsm': 1,
 'store_all_simple_ctr': False,
 'use_best_model': False,
 'verbose': False}