# CatBoost Feature Importance Tutorial

#### It is often important to understand which features contributed most to a model's result. For this purpose, a CatBoost model provides the get_feature_importance method.

In [3]:
import numpy as np
from catboost import CatBoost, Pool, datasets
from sklearn.model_selection import train_test_split

#### First, let's prepare the dataset:

In [4]:
train_df, _ = datasets.higgs()

In [5]:
X, y = np.array(train_df.drop(0, axis=1))[:1000], np.array(train_df[0])[:1000]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
train_pool = Pool(X_train, y_train)
test_pool = Pool(X_test, y_test)

#### Let's train CatBoost:

In [6]:
cb = CatBoost({'iterations': 20, 'verbose': False, 'random_seed': 42, 'grow_policy': 'Lossguide'})
cb.fit(train_pool);

#### CatBoost provides several types of feature importance. One of them is PredictionDiff: for a pair of objects, it returns a vector with the contribution of each feature to the difference between their RawFormulaVal predictions.

#### Let's find two objects in the test data whose predicted labels are incorrect:

In [16]:
prediction = np.argmax(cb.predict(X_test, prediction_type='Probability'), axis=1)

In [20]:
wrong_prediction_idxs = np.arange(prediction.size)[y_test != prediction]
test_pool_slice = test_pool.slice(wrong_prediction_idxs[:2])
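
The two steps above (turning class probabilities into labels, then collecting the positions where the label disagrees with the ground truth) can be sketched on a toy array. The values of `proba` and `y_true` below are made up for illustration and are not taken from the Higgs data or from any model. `np.flatnonzero` is used here as a shorter equivalent of indexing `np.arange` with a boolean mask.

```python
import numpy as np

# Toy class probabilities for 5 objects (columns: P(class 0), P(class 1)).
# Illustrative numbers only, not model output.
proba = np.array([
    [0.90, 0.10],
    [0.30, 0.70],
    [0.60, 0.40],
    [0.20, 0.80],
    [0.55, 0.45],
])
y_true = np.array([0, 0, 1, 1, 0])

# Predicted label = index of the column with the highest probability.
predicted = np.argmax(proba, axis=1)

# Positions where the predicted label disagrees with the true label.
wrong_idxs = np.flatnonzero(y_true != predicted)

# The np.arange(...)[mask] form used in the cell above gives the same indices.
same_as_arange = np.arange(predicted.size)[y_true != predicted]

print(predicted)
print(wrong_idxs)
```

Taking the first two entries of such an index array is what selects the pair of objects passed to `test_pool.slice`.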

#### Let's calculate PredictionDiff for these two objects:

In [37]:
prediction_diff = cb.get_feature_importance(type='PredictionDiff', data=test_pool_slice)

for feature_id, diff in np.ndenumerate(prediction_diff):
    if diff > 0.:
        print('{}: {}'.format(feature_id[0], diff))

22: 0.590958854452
25: 0.706977071538


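#### As you can see, feature 25 makes the largest positive contribution (about 0.71, versus about 0.59 for feature 22) to the difference between the predictions for these two objects, so it is the feature that most strongly separates them.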
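
When more than a couple of features have positive contributions, it can help to rank them instead of reading the printed list. The sketch below is a minimal illustration, assuming the PredictionDiff result can be flattened to one value per feature. The helper name `top_features` is hypothetical, and `stand_in` is not real model output: it holds only the two values printed above, with every other entry set to 0 as a placeholder, and its length of 28 is an assumption about the number of feature columns.

```python
import numpy as np

def top_features(diff, k=2):
    """Return (feature_index, value) pairs for the k largest contributions."""
    flat = np.asarray(diff, dtype=float).ravel()
    # argsort is ascending, so reverse it and keep the first k positions.
    order = np.argsort(flat)[::-1][:k]
    return [(int(i), float(flat[i])) for i in order]

# Stand-in for prediction_diff (assumption): only the two printed values are
# real; all other entries are placeholders set to 0.
stand_in = np.zeros(28)
stand_in[22] = 0.590958854452
stand_in[25] = 0.706977071538

for feature_id, value in top_features(stand_in):
    print('{}: {:.3f}'.format(feature_id, value))
```

In the notebook itself, the same helper could be called as `top_features(prediction_diff)` in place of the stand-in array.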