# CatBoost Object Importance Tutorial

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/catboost/tutorials/blob/master/model_analysis/object_importance_tutorial.ipynb)

#### This tutorial shows how to use object importance to detect noisy objects in your training dataset.

In [1]:
import numpy as np
from catboost import CatBoost, Pool, datasets
from sklearn.model_selection import train_test_split

#### First, let's load the dataset:

In [2]:
train_df, _ = datasets.amazon()
X, y = np.array(train_df.drop(['ACTION'], axis=1)), np.array(train_df.ACTION)
cat_features = np.arange(9) # indices of categorical features

X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.25, random_state=42)
train_pool = Pool(X_train, y_train, cat_features=cat_features)
validation_pool = Pool(X_validation, y_validation, cat_features=cat_features)

print(train_pool.shape, validation_pool.shape)

(24576, 9) (8193, 9)


#### Let's train CatBoost on the clean data and take a look at the quality:

In [3]:
cb = CatBoost({'iterations': 100, 'verbose': False, 'random_seed': 42})
cb.fit(train_pool);
print(cb.eval_metrics(validation_pool, ['RMSE'])['RMSE'][-1])

0.2157984851490331


#### Let's inject random noise into 10% of training labels:

In [4]:
np.random.seed(42)
perturbed_idxs = np.random.choice(len(y_train), size=int(len(y_train) * 0.1), replace=False)
y_train_noisy = y_train.copy()
y_train_noisy[perturbed_idxs] = 1 - y_train_noisy[perturbed_idxs]

train_pool_noisy = Pool(X_train, y_train_noisy, cat_features=cat_features)
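A quick sanity check can confirm that the injection step flips exactly the intended share of labels. The following is a minimal sketch on synthetic 0/1 labels. `flip_binary_labels` and `y_demo` are hypothetical names introduced here for illustration; they are not part of the notebook or of CatBoost.

```python
import numpy as np

def flip_binary_labels(y, fraction, seed):
    """Return a copy of y with `fraction` of its 0/1 labels flipped, plus the flipped indices."""
    rng = np.random.RandomState(seed)
    flipped_idxs = rng.choice(len(y), size=int(len(y) * fraction), replace=False)
    y_noisy = y.copy()
    y_noisy[flipped_idxs] = 1 - y_noisy[flipped_idxs]
    return y_noisy, flipped_idxs

# Synthetic stand-in for y_train (assumption: labels are 0/1 integers).
y_demo = np.array([0, 1] * 50)
y_demo_noisy, flipped = flip_binary_labels(y_demo, fraction=0.1, seed=42)

# Every flipped position must differ from the original, and no other position may.
n_changed = int((y_demo != y_demo_noisy).sum())
print(n_changed, len(flipped))
```

The same comparison, `(y_train != y_train_noisy).sum()`, could be run on the notebook's own arrays.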

#### Now let's train CatBoost on the noisy data and take a look at the quality again:

In [5]:
cb.fit(train_pool_noisy);
print(cb.eval_metrics(validation_pool, ['RMSE'])['RMSE'][-1])

0.25915746122622113


#### Now let's randomly sample 500 validation objects (because computing object importance on the entire validation dataset can take a long time) and calculate the importance of the train objects for these validation objects:

In [6]:
np.random.seed(42)
test_idx = np.random.choice(np.arange(y_validation.shape[0]), size=500, replace=False)
validation_pool_sampled = Pool(X_validation[test_idx], y_validation[test_idx], cat_features=cat_features)

indices, scores = cb.get_object_importance(
    validation_pool_sampled,
    train_pool_noisy,
    importance_values_sign='Positive' # A positive value means that the optimized metric
                                      # value increases because of the given train object,
                                      # so here we get the indices of harmful train objects.
)
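Because the flipped labels are known in this experiment, one way to judge the ranking is to check what share of the top-ranked train objects were actually perturbed. The sketch below assumes that the returned indices are ordered from most to least harmful. `precision_at_k` and the toy arrays are hypothetical, introduced only for illustration.

```python
import numpy as np

def precision_at_k(ranked_indices, noisy_indices, k):
    """Fraction of the first k ranked train indices that are known to be noisy."""
    top_k = np.asarray(ranked_indices)[:k]
    return float(np.isin(top_k, noisy_indices).mean())

# Toy data (assumption): a ranking of train indices, most harmful first,
# and the indices whose labels were actually flipped.
ranked_demo = np.array([7, 2, 5, 0, 9, 1])
noisy_demo = np.array([2, 7, 9])

p_at_2 = precision_at_k(ranked_demo, noisy_demo, k=2)
p_at_4 = precision_at_k(ranked_demo, noisy_demo, k=4)
print(p_at_2, p_at_4)
```

On the notebook's data the analogous call would be `precision_at_k(indices, perturbed_idxs, 2000)`; its value is not reported in this tutorial.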

#### Finally, in a loop, let's remove the noisy objects in batches, retrain the model, and see how the quality on the validation dataset improves:

In [8]:
def train_and_print_score(train_indices, remove_object_count):
    cb.fit(X_train[train_indices], y_train_noisy[train_indices], cat_features=cat_features)
    metric_value = cb.eval_metrics(validation_pool, ['RMSE'])['RMSE'][-1]
    s = 'RMSE on validation dataset when {} harmful objects from train are dropped: {}'
    print(s.format(remove_object_count, metric_value))

batch_size = 250
train_indices = np.full(X_train.shape[0], True)
train_and_print_score(train_indices, 0)
for batch_start_index in range(0, 2000, batch_size):
    train_indices[indices[batch_start_index:batch_start_index + batch_size]] = False
    train_and_print_score(train_indices, batch_start_index + batch_size)

RMSE on validation dataset when 0 harmful objects from train are dropped: 0.25915746122622113
RMSE on validation dataset when 250 harmful objects from train are dropped: 0.25601149050939825
RMSE on validation dataset when 500 harmful objects from train are dropped: 0.25158044983631966
RMSE on validation dataset when 750 harmful objects from train are dropped: 0.24570533776587475
RMSE on validation dataset when 1000 harmful objects from train are dropped: 0.24171376432589384
RMSE on validation dataset when 1250 harmful objects from train are dropped: 0.23716221792112202
RMSE on validation dataset when 1500 harmful objects from train are dropped: 0.23352830055657348
RMSE on validation dataset when 1750 harmful objects from train are dropped: 0.23035731488436903
RMSE on validation dataset when 2000 harmful objects from train are dropped: 0.2275943109556251
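The loop above relies on a boolean mask that starts as all `True` and is switched to `False` for one batch of ranked indices at a time, so removals accumulate across iterations. A minimal sketch of that mechanism on toy data follows. The names `keep`, `ranked`, and `remaining` are hypothetical and the ranking is made up.

```python
import numpy as np

n_rows = 10
ranked = np.array([7, 2, 5, 0])   # toy ranking, most harmful first (assumption)
batch_size = 2

keep = np.full(n_rows, True)      # start by keeping every train row
remaining = []
for start in range(0, len(ranked), batch_size):
    # Mark the next batch of ranked rows as dropped; earlier drops stay in effect.
    keep[ranked[start:start + batch_size]] = False
    remaining.append(int(keep.sum()))

kept_rows = np.flatnonzero(keep)  # row numbers that survive all batches
print(remaining, kept_rows)
```

Indexing `X_train[keep]` with such a mask selects only the surviving rows, which is what `train_and_print_score` does on each pass.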


#### To summarize, we have the following RMSE values on the validation dataset:

|Training data|RMSE on the validation dataset|
|-|-|
|Clean train dataset | 0.215798485149|
|Noisy train dataset (10% of labels flipped) | 0.259157461226|
|Purified train dataset (2000 most harmful objects dropped) | 0.227594310956|
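To put the three numbers in perspective, it can help to express the improvement as a share of the quality lost to label noise. The short calculation below uses only the RMSE values from the table. The variable names are introduced here for illustration.

```python
# RMSE values copied from the table above.
clean_rmse = 0.215798485149
noisy_rmse = 0.259157461226
purified_rmse = 0.227594310956

degradation = noisy_rmse - clean_rmse    # quality lost to label noise
recovered = noisy_rmse - purified_rmse   # quality regained by dropping 2000 objects
recovered_fraction = recovered / degradation

print(f"{recovered_fraction:.1%} of the noise-induced RMSE increase was recovered")
```

With these values, dropping the 2000 most harmful objects recovers roughly 73% of the degradation, although the purified model still falls short of the one trained on clean labels.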

#### So now you can try to clean your train dataset of noisy objects and get better quality!