# Object Importance Tutorial

This tutorial shows how to detect noisy objects, such as mislabeled training examples, in your dataset using CatBoost object importance.

In [1]:
import numpy as np
from catboost import CatBoost, Pool, datasets
from sklearn.model_selection import train_test_split

First, let's load the dataset:

In [2]:
train_df, _ = datasets.amazon()

X, y = np.array(train_df.drop(['ACTION'], axis=1)), np.array(train_df.ACTION)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
cat_features = np.arange(9)

print(X_train.shape, X_test.shape)

((24576, 9), (8193, 9))


Next, let's inject label noise by flipping 30% of the training labels, chosen at random:

In [3]:
np.random.seed(42)
perturbed_idxs = np.random.choice(len(y_train), size=int(len(y_train) * 0.3), replace=False)
y_train[perturbed_idxs] = 1 - y_train[perturbed_idxs]

train_pool = Pool(X_train, y_train, cat_features=cat_features)
test_pool = Pool(X_test, y_test, cat_features=cat_features)
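The flip `1 - y` maps 0 to 1 and 1 to 0, and sampling without replacement means every chosen label is flipped exactly once. A minimal stdlib-only sketch on toy labels illustrates this; the helper `flip_labels` and the toy data are hypothetical and not part of the notebook or of CatBoost:

```python
import random


def flip_labels(labels, fraction, seed=42):
    """Return (noisy_labels, flipped_positions) for binary 0/1 labels.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    n_flip = int(len(labels) * fraction)
    # random.sample draws without replacement, so positions are unique.
    flipped = rng.sample(range(len(labels)), n_flip)
    noisy = list(labels)
    for i in flipped:
        noisy[i] = 1 - noisy[i]  # 0 -> 1, 1 -> 0
    return noisy, flipped


toy_labels = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
noisy_labels, flipped_positions = flip_labels(toy_labels, 0.3)

# Count how many labels actually differ from the originals.
changed = sum(a != b for a, b in zip(toy_labels, noisy_labels))
print(changed, len(flipped_positions))  # prints: 3 3
```

The same kind of check can be applied to the real data by keeping a copy of `y_train` before the flip and comparing it with the noisy version.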

Now let's train CatBoost on the noisy data:

In [4]:
cb = CatBoost({'iterations': 100, 'verbose': False, 'random_seed': 42})
cb.fit(train_pool);

Now let's sample 500 random test objects and calculate the importance of each train object for these test objects:

In [5]:
np.random.seed(42)
test_idx = np.random.choice(np.arange(y_test.shape[0]), size=500, replace=False)
test_pool_sampled = Pool(X_test[test_idx], y_test[test_idx], cat_features=cat_features)

indices, scores = cb.get_object_importance(
    test_pool_sampled,
    train_pool,
    importance_values_sign='Positive' # Positive values mean that the optimized metric
                                      # value increases because of the given train objects.
)
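Because the noise in this tutorial is injected deliberately, one way to sanity-check the ranking is to measure what share of the top-ranked train objects actually had their labels flipped. The sketch below is only an illustration: `fraction_perturbed` is a hypothetical helper (not a CatBoost function), and the toy lists stand in for `indices` and `perturbed_idxs`.

```python
def fraction_perturbed(ranked_indices, perturbed, k):
    """Share of the first k ranked indices that appear in `perturbed`.

    Hypothetical helper for illustration; accepts lists or NumPy arrays.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    perturbed_set = {int(i) for i in perturbed}
    top_k = [int(i) for i in ranked_indices[:k]]
    if not top_k:
        return 0.0
    hits = sum(1 for i in top_k if i in perturbed_set)
    return hits / len(top_k)


# Toy stand-ins: a ranking of train objects and the set of flipped positions.
toy_ranked = [7, 2, 9, 4, 0]
toy_perturbed = [2, 7, 5]
print(fraction_perturbed(toy_ranked, toy_perturbed, 4))  # prints: 0.5
```

In this notebook the analogous call would be `fraction_perturbed(indices, perturbed_idxs, 2000)`; a value well above 0.3 (the overall noise rate) would suggest that the ranking concentrates on the flipped labels.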

Finally, in a loop, let's remove noisy objects in batches, retrain the model, and see how the quality on the test dataset improves:

In [6]:
def train_and_print_score(train_indices, remove_object_count):
    # Refit the model on the train objects selected by the boolean mask.
    cb.fit(X_train[train_indices], y_train[train_indices], cat_features=cat_features)
    # Take the RMSE on the full test pool after the last iteration.
    metric_value = cb.eval_metrics(test_pool, ['RMSE'])['RMSE'][-1]
    s = 'RMSE on test object when {} harmful objects from train are dropped: {}'
    s = s.format(remove_object_count, metric_value)
    print(s)

batch_size = 250
# Boolean mask over train objects: True means the object is kept.
train_indices = np.full(X_train.shape[0], True)
train_and_print_score(train_indices, 0)
for batch_start_index in range(0, 2000, batch_size):
    # Drop the next batch, in the order returned by get_object_importance.
    train_indices[indices[batch_start_index:batch_start_index + batch_size]] = False
    train_and_print_score(train_indices, batch_start_index + batch_size)

RMSE on test object when 0 harmful objects from train are dropped: 0.378438571394
RMSE on test object when 250 harmful objects from train are dropped: 0.37313097885
RMSE on test object when 500 harmful objects from train are dropped: 0.370256643651
RMSE on test object when 750 harmful objects from train are dropped: 0.364329880946
RMSE on test object when 1000 harmful objects from train are dropped: 0.35813052939
RMSE on test object when 1250 harmful objects from train are dropped: 0.354199434733
RMSE on test object when 1500 harmful objects from train are dropped: 0.347473535051
RMSE on test object when 1750 harmful objects from train are dropped: 0.343148120213
RMSE on test object when 2000 harmful objects from train are dropped: 0.338461459648
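To put the log above in perspective, the relative change in RMSE between the first and the last run can be computed directly from the printed values. This is plain arithmetic on the two numbers copied from the output, not an additional experiment:

```python
# Values copied from the log above.
rmse_before = 0.378438571394  # 0 harmful objects dropped
rmse_after = 0.338461459648   # 2000 harmful objects dropped

relative_reduction = (rmse_before - rmse_after) / rmse_before
print('Relative RMSE reduction: {:.1%}'.format(relative_reduction))
# prints: Relative RMSE reduction: 10.6%
```

So dropping the 2000 top-ranked train objects (about 8% of the 24576 train objects) lowers the test RMSE by roughly a tenth in this run.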
