## Practical Assignment 6. Gradient Boosting ~~by Hand~~

Congratulations! This is the final exercise of the course. Bring all your diligence, patience, and experience to bear on it.  
By now you know a great deal about machine learning, so trying out different algorithms and new libraries and applying them to a real problem should not be difficult.

__Task 1. (0.5 points)__

We will use data from the [Home Credit Default Risk](https://www.kaggle.com/c/home-credit-default-risk/data) competition.  

* Load the **application_train.csv** table;
* Store the target column in Y;
* Drop unneeded columns (use the data description for this);
* Determine the column types and fill in missing values; any strategy is acceptable;
* Split the data 70:30 with random_state=0.

Because the classes are heavily imbalanced, the quality metric used throughout is the area under the precision-recall curve.
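
As a rough illustration of the metric, average precision (the step-wise estimate of the area under the precision-recall curve) can be computed by hand. The sketch below is a minimal NumPy version; the function name `average_precision` and the toy labels and scores are invented for the example, and it assumes there are no tied scores.

```python
import numpy as np

def average_precision(y_true, y_score):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score))      # sort by descending score
    hits = y_true[order]
    tp = np.cumsum(hits)                          # true positives at each cut-off
    precision = tp / np.arange(1, len(hits) + 1)
    # Recall only increases at positive examples, so AP is the mean
    # precision taken at the positions of the positives.
    return precision[hits == 1].mean()

# Toy data, invented for illustration
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(y_true, y_score)
print(round(ap, 4))
```

Unlike accuracy or ROC AUC, this value depends directly on the share of positives among the top-ranked objects, which is why it suits a heavily imbalanced target.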

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
X_bare = pd.read_csv("application_train.csv")

In [3]:
X_bare.head(10)

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
5,100008,0,Cash loans,M,N,Y,0,99000.0,490495.5,27517.5,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,1.0
6,100009,0,Cash loans,F,Y,Y,1,171000.0,1560726.0,41301.0,...,0,0,0,0,0.0,0.0,0.0,1.0,1.0,2.0
7,100010,0,Cash loans,M,Y,Y,0,360000.0,1530000.0,42075.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
8,100011,0,Cash loans,F,N,Y,0,112500.0,1019610.0,33826.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
9,100012,0,Revolving loans,M,N,Y,0,135000.0,405000.0,20250.0,...,0,0,0,0,,,,,,


In [4]:
for column in X_bare:
    if X_bare[column].dtype == object:
        X_bare[column] = X_bare[column].fillna('')
    elif X_bare[column].dtype == int:
        X_bare[column] = X_bare[column].fillna(0)
    else:
        X_bare[column] = X_bare[column].fillna(0.)
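
The same per-type fill can also be written without an explicit loop by selecting columns with `select_dtypes`. This is only a sketch on a toy frame (the column names and values are invented); it mirrors the strategy above of empty strings for text columns and zeros for numeric ones.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration
df = pd.DataFrame({
    "gender": ["M", None, "F"],
    "income": [100.0, np.nan, 300.0],
    "children": [0, 1, 2],
})

num_cols = df.select_dtypes(include="number").columns
text_cols = df.select_dtypes(exclude="number").columns

df[text_cols] = df[text_cols].fillna("")  # text columns: empty string
df[num_cols] = df[num_cols].fillna(0)     # numeric columns: zero

print(df.isna().sum().sum())
```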

In [5]:
Y = X_bare.TARGET.values
X = X_bare.drop(['TARGET', 'SK_ID_CURR'], axis=1)

In [6]:
X_num = X.loc[:, X.dtypes != object]
num_columns = X_num.columns

In [7]:
x_train, x_test, y_train, y_test = train_test_split(np.array(X_num), Y, test_size=0.3, random_state=0)

__Task 2. (1.5 points)__

We will also use two implementations of gradient boosting: [LightGBM](https://lightgbm.readthedocs.io/en/stable/Python-API.html) and [Catboost](https://catboost.ai/en/docs/). Study them on your own and install them with the commands:  
`!pip install lightgbm`  
`!pip install catboost`  
Train the LightGBM and Catboost implementations of gradient boosting on the numeric features without tuning any parameters. 
Why is there a noticeable difference in quality? 

In this and all subsequent experiments, measure the training time of the models.
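
One possible way to keep the timing code out of every cell is a small context manager. The sketch below is an assumption about how this could be organized, not part of the assignment: the names `timer` and `timings` are hypothetical, and the summation stands in for a call such as `model.fit(...)`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label, results):
    """Store the wall-clock time of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}
with timer("toy workload", timings):
    total = sum(i * i for i in range(100_000))  # stand-in for model.fit(...)

print(timings)
```

`time.perf_counter` is a monotonic clock intended for measuring intervals, so it is a reasonable choice for comparing training times across runs.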

In [8]:
#! pip install lightgbm catboost --user 

In [9]:
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import average_precision_score
from time import time

In [10]:
lgbm_model = LGBMClassifier()
lgbm_model.fit(x_train, y_train)
y_lgbm = lgbm_model.predict_proba(x_test)[:, 1]
print('lgbm classifier score:', average_precision_score(y_test, y_lgbm))

[LightGBM] [Info] Number of positive: 17485, number of negative: 197772
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.053833 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 11115
[LightGBM] [Info] Number of data points in the train set: 215257, number of used features: 99
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.081228 -> initscore=-2.425771
[LightGBM] [Info] Start training from score -2.425771
lgbm classifier score: 0.23159230793933594


In [11]:
cat_model = CatBoostClassifier(task_type='GPU', loss_function='Logloss')
cat_model.fit(x_train, y_train, verbose=False)
y_cat = cat_model.predict_proba(x_test)[:, 1]
print('catboost classifier score:', average_precision_score(y_test, y_cat))

CatBoostError: C:/Go_Agent/pipelines/BuildMaster/catboost.git/catboost/cuda/cuda_lib/cuda_base.h:281: CUDA error 35: CUDA driver version is insufficient for CUDA runtime version

__Task 3. (2 points)__

Using CV=3, find the optimal parameters of the algorithms by varying:

* tree depth;
* number of trees;
* learning rate;
* the objective being optimized.

Analyze how depth and number of trees relate to each other depending on the algorithm.
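
Before launching a search it can help to estimate its cost. The sketch below enumerates a hypothetical grid (the values are chosen only for illustration) and counts how many models a full search with 3-fold cross-validation would train.

```python
from itertools import product

# Hypothetical grid, chosen only to illustrate the cost of a full search
grid = {
    "depth": [4, 6, 8],
    "n_trees": [100, 500, 1000],
    "learning_rate": [0.01, 0.1],
}
cv_folds = 3

# Every combination of values is one candidate configuration
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
n_fits = len(combos) * cv_folds
print(len(combos), "combinations,", n_fits, "model fits")
```

`GridSearchCV` additionally refits the best configuration once on the full training set by default. Because the cost grows multiplicatively with every added parameter, searching in stages (a coarse grid first, then a finer one around the best point) is a common compromise.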

In [None]:
#!pip install --upgrade catboost
#!pip install xgboost


In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
params = {
    'learning_rate': [10 ** x for x in range(-3, 1)],  # learning rate, log scale
    'loss_function': ['CrossEntropy'],
    'depth': range(4, 8)
}

cat_model = CatBoostClassifier(iterations=100, verbose=False)
gs = GridSearchCV(cat_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train, y_train)


In [None]:
gs.best_params_

In [None]:
# Search on a logarithmic scale

params = {
    'learning_rate': [10 ** x for x in range(-3, 1)]
}

cat_model = CatBoostClassifier(task_type='GPU', iterations=1000, verbose=False, 
                               loss_function='CrossEntropy', depth=6)
gs = GridSearchCV(cat_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train, y_train)

In [None]:
gs.best_params_

In [None]:
# Search on a linear scale within the range found

params = {
    'learning_rate': [0.01, 0.04, 0.08]
}

cat_model = CatBoostClassifier(task_type='GPU', iterations=1000, verbose=False, 
                               loss_function='CrossEntropy', depth=6)
gs = GridSearchCV(cat_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train, y_train)

In [None]:
# Evaluate quality on the test set using the values found

start = time()
cat_model = CatBoostClassifier(task_type='GPU', loss_function='CrossEntropy', depth=6, 
                              iterations=1000, learning_rate=0.04)
cat_model.fit(x_train, y_train, verbose=False)
y_cat = cat_model.predict_proba(x_test)[:, 1]
end = time()
print('catboost classifier score:', average_precision_score(y_test, y_cat))
print('time:', end - start)

In [None]:
gs.best_params_

In [None]:
params = {
    'learning_rate': [10 ** x for x in range(-3, 1)]
}
lgbm_model = LGBMClassifier( n_estimators=1000, metric='binary_logloss', max_depth=5)
gs = GridSearchCV(lgbm_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train, y_train)

In [None]:
gs.best_params_

In [None]:
params = {
    'learning_rate': [0.01, 0.04, 0.08]
}

lgbm_model = LGBMClassifier( n_estimators=1000, metric='binary_logloss', max_depth=5)
gs = GridSearchCV(lgbm_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train, y_train)

In [None]:
gs.best_params_

In [None]:
start = time()
lgbm_model = LGBMClassifier(n_estimators=1000, metric='binary_logloss', max_depth=5, 
                            learning_rate=0.01)
lgbm_model.fit(x_train, y_train)
y_lgbm = lgbm_model.predict_proba(x_test)[:, 1]
end = time()
print('lgbm classifier score:', average_precision_score(y_test, y_lgbm))
print('time:', end - start)

__Task 4. (3.5 points)__

Add the categorical features to the numeric ones in the following ways:

* as OHE features;
* as smoothed counters.

Loops are not allowed when computing the counters. 

Tune the parameters of each algorithm on the resulting datasets. How does the training time change depending on the encoding method? Compare the results with the built-in methods for handling categorical features. 

In [None]:
# Convert the categorical variables to numeric ones (one-hot encoding)

one_hot_X = pd.get_dummies(X, drop_first=True)
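
To make the effect of `drop_first=True` concrete, here is a toy example (the frame is invented for illustration). A column with k levels becomes k-1 indicator columns, and numeric columns pass through unchanged.

```python
import pandas as pd

# Toy frame, invented for illustration
df = pd.DataFrame({"color": ["r", "g", "b"], "x": [1, 2, 3]})

# Levels are sorted ("b", "g", "r"); the first one is dropped
encoded = pd.get_dummies(df, drop_first=True)
print(list(encoded.columns))
```

Encoding the whole table before splitting, as in the cell above, keeps the train and test parts on the same set of columns.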

In [None]:
x_train_ohe, x_test_ohe = train_test_split(one_hot_X, test_size=0.3, random_state=0)

In [None]:
params = {
    'learning_rate': [10 ** x for x in range(-3, 1)],
    'loss_function': ['CrossEntropy'],
    'depth': range(4, 8)
}

cat_model = CatBoostClassifier(task_type='GPU', iterations=100, verbose=False)
gs = GridSearchCV(cat_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train_ohe, y_train)

In [None]:
gs.best_params_

In [None]:
params = {
    'depth': range(7, 10)
}

cat_model = CatBoostClassifier(task_type='GPU', iterations=100, verbose=False, 
                              learning_rate=0.1, loss_function='CrossEntropy')
gs = GridSearchCV(cat_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train_ohe, y_train)

In [None]:
gs.best_params_

In [None]:
params = {
    'learning_rate': [10 ** x for x in range(-3, 0)]
}

cat_model = CatBoostClassifier(task_type='GPU', iterations=1000, verbose=False, 
                              depth=8, loss_function='CrossEntropy')
gs = GridSearchCV(cat_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train_ohe, y_train)

In [None]:
gs.best_params_

In [None]:
params = {
    'learning_rate': [0.01, 0.04, 0.08]
}

cat_model = CatBoostClassifier(task_type='GPU', iterations=1000, verbose=False, 
                              depth=8, loss_function='CrossEntropy')
gs = GridSearchCV(cat_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train_ohe, y_train)

In [None]:
gs.best_params_

In [None]:
start = time()
cat_model = CatBoostClassifier(task_type='GPU', loss_function='CrossEntropy', depth=8, 
                              iterations=1000, learning_rate=0.04)
cat_model.fit(x_train_ohe, y_train, verbose=False)
y_cat = cat_model.predict_proba(x_test_ohe)[:, 1]
end = time()
print('catboost classifier score:', average_precision_score(y_test, y_cat))
print('time:', end - start)

In [None]:
params = {
    'learning_rate': [10 ** x for x in range(-3, 1)],
    'metric': ['binary_logloss'],
    'max_depth': range(4, 8)
}

lgbm_model = LGBMClassifier(n_estimators=100)
gs = GridSearchCV(lgbm_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train_ohe, y_train)


In [None]:
gs.best_params_

In [None]:
params = {
    'learning_rate': [10 ** x for x in range(-3, 0)]
}

lgbm_model = LGBMClassifier(n_estimators=1000, max_depth=6, metric='binary_logloss')
gs = GridSearchCV(lgbm_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train_ohe, y_train)

In [None]:
gs.best_params_

In [None]:
params = {
    'learning_rate': [0.01, 0.04, 0.08]
}

lgbm_model = LGBMClassifier(n_estimators=1000, max_depth=6, metric='binary_logloss')
gs = GridSearchCV(lgbm_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train_ohe, y_train)


In [None]:
gs.best_params_

In [None]:
start = time()
lgbm_model = LGBMClassifier(n_estimators=1000, max_depth=6, metric='binary_logloss',
                           learning_rate=0.01)
lgbm_model.fit(x_train_ohe, y_train)
y_lgbm = lgbm_model.predict_proba(x_test_ohe)[:, 1]
end = time()
print('lgbm classifier score:', average_precision_score(y_test, y_lgbm))
print('time:', end - start)

In [None]:
X_counts = X_bare.copy()

for column in X_bare:
    if(X_bare[column].dtypes == object):
        X_counts[column] = X_bare[column].map((X_bare.groupby(column)['TARGET'].sum() + 1) / 
                              (X_bare.groupby(column).size() + 1))
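
The cell above computes the counters from all rows, including those that later end up in the test split, so the target can leak into the features. A hedged alternative is to fit the counters on the training rows only and map them onto both parts. The sketch below uses a different smoothing from the one above: it shrinks toward the global mean with a hypothetical strength `alpha`, that is `(sum + alpha * prior) / (count + alpha)`. The function name and toy data are invented for illustration.

```python
import pandas as pd

def smoothed_counter(train_cat, train_target, apply_cat, alpha=10.0):
    """Mean-target encoding fitted on train rows, smoothed toward the global mean."""
    prior = train_target.mean()
    stats = train_target.groupby(train_cat).agg(["sum", "count"])
    encoding = (stats["sum"] + alpha * prior) / (stats["count"] + alpha)
    # Categories unseen in train fall back to the prior
    return apply_cat.map(encoding).fillna(prior)

# Toy data, invented for illustration
train = pd.DataFrame({"cat": ["a", "a", "b", "b", "b"], "y": [1, 0, 0, 0, 1]})
test = pd.DataFrame({"cat": ["a", "b", "c"]})

test_enc = smoothed_counter(train["cat"], train["y"], test["cat"], alpha=1.0)
print(test_enc.tolist())
```

The computation relies on `groupby` and `map` only, so it involves no loop over rows or categories.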

In [None]:
X_counts = np.array(X_counts.drop(['TARGET', 'SK_ID_CURR'], axis=1))

In [None]:
x_train_counts, x_test_counts = train_test_split(X_counts, test_size=0.3, random_state=0)

In [None]:
params = {
    'learning_rate': [10 ** x for x in range(-3, 1)],
    'loss_function': ['CrossEntropy'],
    'depth': range(4, 8)
}

cat_model = CatBoostClassifier(task_type='GPU', iterations=100, verbose=False)
gs = GridSearchCV(cat_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train_counts, y_train)

In [None]:
gs.best_params_

In [None]:
params = {
    'depth': range(7, 10)
}

cat_model = CatBoostClassifier(task_type='GPU', iterations=100, verbose=False,
                              learning_rate=0.1, loss_function='CrossEntropy')
gs = GridSearchCV(cat_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train_counts, y_train)

In [None]:
gs.best_params_

In [None]:
params = {
    'learning_rate': [10 ** x for x in range(-3, 0)]
}

cat_model = CatBoostClassifier(task_type='GPU', iterations=1000, verbose=False,
                              depth=7, loss_function='CrossEntropy')
gs = GridSearchCV(cat_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train_counts, y_train)

In [None]:
gs.best_params_

In [None]:
params = {
    'learning_rate': [0.01, 0.04, 0.08]
}

cat_model = CatBoostClassifier(task_type='GPU', iterations=1000, verbose=False,
                              depth=7, loss_function='CrossEntropy')
gs = GridSearchCV(cat_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train_counts, y_train)

In [None]:
gs.best_params_

In [None]:
start = time()
cat_model = CatBoostClassifier(task_type='GPU', loss_function='CrossEntropy', depth=7, 
                              iterations=1000, learning_rate=0.04)
cat_model.fit(x_train_counts, y_train, verbose=False)
y_cat = cat_model.predict_proba(x_test_counts)[:, 1]
end = time()
print('catboost classifier score:', average_precision_score(y_test, y_cat))
print('time:', end - start)

In [None]:
params = {
    'learning_rate': [10 ** x for x in range(-3, 1)],
    'metric': ['binary_logloss'],
    'max_depth': range(4, 8)
}

lgbm_model = LGBMClassifier(n_estimators=100)
gs = GridSearchCV(lgbm_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train_counts, y_train)

In [None]:
gs.best_params_

In [None]:
params = {
    'learning_rate': [10 ** x for x in range(-3, 1)]
}

lgbm_model = LGBMClassifier(n_estimators=1000, metric='binary_logloss', max_depth=5)
gs = GridSearchCV(lgbm_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train_counts, y_train)

In [None]:
gs.best_params_

In [None]:
params = {
    'learning_rate': [0.01, 0.04]
}

lgbm_model = LGBMClassifier(n_estimators=1000, metric='binary_logloss', max_depth=5)
gs = GridSearchCV(lgbm_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train_counts, y_train)

In [None]:
gs.best_params_

In [None]:
start = time()
lgbm_model = LGBMClassifier(n_estimators=1000, max_depth=5, metric='binary_logloss',
                           learning_rate=0.04)
lgbm_model.fit(x_train_counts, y_train)
y_lgbm = lgbm_model.predict_proba(x_test_counts)[:, 1]
end = time()
print('lgbm classifier score:', average_precision_score(y_test, y_lgbm))
print('time:', end - start)

In [None]:
obj_matrix = (X.dtypes == object).values
cat_list = np.arange(obj_matrix.size)[obj_matrix]

In [None]:
x_train_all, x_test_all = train_test_split(X.values, test_size=0.3,
                                                            random_state=0)

In [None]:
params = {
    'learning_rate': [10 ** x for x in range(-2, 0)],
    'loss_function': ['CrossEntropy'],
    'depth': range(6, 9)
}

cat_model = CatBoostClassifier(task_type='GPU', iterations=100, verbose=False, 
                               cat_features=cat_list)
gs = GridSearchCV(cat_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train_all, y_train)

In [None]:
gs.best_params_

In [None]:
params = {
    'depth': range(8, 10)
}

cat_model = CatBoostClassifier(task_type='GPU', iterations=100, verbose=False, 
                               cat_features=cat_list, loss_function='CrossEntropy', learning_rate=0.1)
gs = GridSearchCV(cat_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train_all, y_train)

In [None]:
gs.best_params_

In [None]:
params = {
    'learning_rate': [10 ** x for x in range(-2, 0)]
}

cat_model = CatBoostClassifier(task_type='GPU', iterations=1000, verbose=False, 
                               cat_features=cat_list, loss_function='CrossEntropy',depth=8)
gs = GridSearchCV(cat_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_train_all, y_train)

In [None]:
gs.best_params_

In [None]:
start = time()
cat_model = CatBoostClassifier(task_type='GPU', loss_function='CrossEntropy', depth=8, 
                              iterations=1000, learning_rate=0.01, cat_features=cat_list)
cat_model.fit(x_train_all, y_train, verbose=False)
y_cat = cat_model.predict_proba(x_test_all)[:, 1]
end = time()
print('catboost classifier score:', average_precision_score(y_test, y_cat))
print('time:', end - start)

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
X_int_cat = X.copy()

for column in X:
    if(X[column].dtypes == object):
        le = LabelEncoder()
        X_int_cat[column] = le.fit_transform(X[column])

In [None]:
x_int_cat_train, x_int_cat_test = train_test_split(X_int_cat, test_size=0.3, random_state=0)

In [None]:
params = {
    'learning_rate': [10 ** x for x in range(-3, 0)],
    'metric': ['binary_logloss'],
    'max_depth': range(4, 8)
}

lgbm_model = LGBMClassifier(n_estimators=100)
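# `cat_features` is CatBoost's spelling and is not among LightGBM's documented aliases;
# LightGBM's sklearn API takes the categorical columns via fit(categorical_feature=...)
gs = GridSearchCV(lgbm_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_int_cat_train, y_train, categorical_feature=cat_list.tolist())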

In [None]:
gs.best_params_

In [None]:
params = {
    'max_depth': range(7, 9)
}

lgbm_model = LGBMClassifier(n_estimators=100, learning_rate=0.1, metric='binary_logloss')
# Categorical columns are passed at fit time via LightGBM's categorical_feature
gs = GridSearchCV(lgbm_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_int_cat_train, y_train, categorical_feature=cat_list.tolist())

In [None]:
gs.best_params_

In [None]:
params = {
    'learning_rate': [0.01, 0.1]
}

lgbm_model = LGBMClassifier(n_estimators=1000, max_depth=7, metric='binary_logloss')
# Categorical columns are passed at fit time via LightGBM's categorical_feature
gs = GridSearchCV(lgbm_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_int_cat_train, y_train, categorical_feature=cat_list.tolist())

In [None]:
gs.best_params_

In [None]:
params = {
    'learning_rate': [0.01, 0.04]
}

lgbm_model = LGBMClassifier(n_estimators=1000, max_depth=7, metric='binary_logloss')
# Categorical columns are passed at fit time via LightGBM's categorical_feature
gs = GridSearchCV(lgbm_model, params, cv=3, scoring='average_precision', verbose=0)
gs.fit(x_int_cat_train, y_train, categorical_feature=cat_list.tolist())

In [None]:
gs.best_params_

In [None]:
start = time()
lgbm_model = LGBMClassifier(n_estimators=1000, max_depth=7, metric='binary_logloss',
                           learning_rate=0.04)
# Categorical columns are passed at fit time via LightGBM's categorical_feature
lgbm_model.fit(x_int_cat_train, y_train, categorical_feature=cat_list.tolist())
y_lgbm = lgbm_model.predict_proba(x_int_cat_test)[:, 1]
end = time()
print('lgbm classifier score:', average_precision_score(y_test, y_lgbm))
print('time:', end - start)


__Task 5. (1 point)__

Implement blending of the models tuned in the previous task and compare the quality.

In [None]:
cat_model_ohe = CatBoostClassifier(task_type='CPU', loss_function='CrossEntropy', depth=8, 
                              iterations=1000, learning_rate=0.04)
cat_model_ohe.fit(x_train_ohe, y_train, verbose=False)
y_cat_ohe = cat_model_ohe.predict_proba(x_test_ohe)[:, 1]

In [None]:
lgbm_model_ohe = LGBMClassifier(n_estimators=1000, max_depth=6, metric='binary_logloss',
                           learning_rate=0.01)
lgbm_model_ohe.fit(x_train_ohe, y_train)
y_lgbm_ohe = lgbm_model_ohe.predict_proba(x_test_ohe)[:, 1]

In [None]:
cat_model_counts = CatBoostClassifier(task_type='CPU', loss_function='CrossEntropy', depth=7, 
                              iterations=1000, learning_rate=0.04)
cat_model_counts.fit(x_train_counts, y_train, verbose=False)
y_cat_counts = cat_model_counts.predict_proba(x_test_counts)[:, 1]

In [None]:
lgbm_model_counts = LGBMClassifier(n_estimators=1000, max_depth=5, metric='binary_logloss',
                           learning_rate=0.04)
lgbm_model_counts.fit(x_train_counts, y_train)
y_lgbm_counts = lgbm_model_counts.predict_proba(x_test_counts)[:, 1]

In [None]:
y_blend = (y_cat_ohe + y_lgbm_ohe + y_cat_counts + y_lgbm_counts) / 4.0
print('blending score:', average_precision_score(y_test, y_blend))
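
The cell above averages the four models with equal weights. A weighted variant is a small generalization; the sketch below shows it on toy probabilities from three hypothetical models (the function name `blend` and all numbers are invented for illustration).

```python
import numpy as np

def blend(predictions, weights=None):
    """Weighted average of per-model probability vectors (equal weights by default)."""
    return np.average(np.vstack(predictions), axis=0, weights=weights)

# Toy probabilities from three hypothetical models
p1 = np.array([0.2, 0.8, 0.5])
p2 = np.array([0.4, 0.6, 0.7])
p3 = np.array([0.0, 1.0, 0.6])

equal = blend([p1, p2, p3])
weighted = blend([p1, p2, p3], weights=[0.5, 0.25, 0.25])
print(equal, weighted)
```

If the weights are tuned, it is safer to choose them on a separate validation split rather than on the test set, otherwise the reported score is optimistic.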

__Task 6. (1.5 points)__

In Task 3 you tuned hyperparameters for LightGBM and CatBoost on the numeric features. Visualize the feature importances computed by these algorithms as a horizontal bar plot (sort the features by decreasing importance and label the y-axis with the feature names).

For each of the two algorithms, remove the unimportant features (the bar plot usually shows clearly the importance threshold at which the tail of unimportant features begins) and train the same model on the resulting data. Does quality drop much when the features the model considers unimportant are removed?
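
Selecting features by an importance threshold can be sketched as follows. The importances, feature names, and the cut-off value are all hypothetical; in practice the threshold would be read off the bar plot.

```python
import pandas as pd

# Hypothetical importances, invented for illustration
importances = pd.Series(
    {"f_income": 30.0, "f_age": 25.0, "f_credit": 40.0, "f_flag_a": 0.5, "f_flag_b": 0.0}
).sort_values(ascending=False)

threshold = 1.0  # assumed cut-off read from the bar plot
keep = importances[importances >= threshold].index.tolist()
drop = importances[importances < threshold].index.tolist()
print(keep, drop)
```

With the importances held in a named `Series`, `importances.sort_values().plot.barh()` draws the horizontal bar plot with feature names on the y-axis and the most important feature at the top.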

In [None]:
cat_model = CatBoostClassifier(task_type='CPU', loss_function='CrossEntropy', depth=6, 
                              iterations=1000, learning_rate=0.04)
cat_model.fit(x_train, y_train, verbose=False)
feat_imp = pd.DataFrame({'imp': cat_model.feature_importances_, 'col': num_columns})
feat_imp = feat_imp.sort_values(by=['imp'])

In [None]:
plt.figure(figsize=(10, 25))
plt.title("Feature importances")
plt.barh(range(X_num.shape[1]), feat_imp['imp'],
       color="r", align="center")
plt.yticks(range(X_num.shape[1]), feat_imp['col'])  # one tick per numeric feature
plt.ylim([-1, X_num.shape[1]])
plt.show()

In [None]:
X_num_drop = X_num.drop(feat_imp['col'][:15], axis=1)

In [None]:
x_train_drop, x_test_drop = train_test_split(X_num_drop, test_size=0.3, random_state=0)

In [None]:
cat_model = CatBoostClassifier(task_type='CPU', loss_function='CrossEntropy', depth=6, 
                              iterations=1000, learning_rate=0.04)
cat_model.fit(x_train_drop, y_train, verbose=False)
y_cat = cat_model.predict_proba(x_test_drop)[:, 1]
print('score:', average_precision_score(y_test, y_cat))

In [None]:
lgbm_model = LGBMClassifier(n_estimators=1000, metric='binary_logloss', max_depth=5, 
                            learning_rate=0.01)
lgbm_model.fit(x_train, y_train)
feat_imp_lgbm = pd.DataFrame({'imp': lgbm_model.feature_importances_, 'col': num_columns})
feat_imp_lgbm = feat_imp_lgbm.sort_values(by=['imp'])

In [None]:
plt.figure(figsize=(10, 25))
plt.title("Feature importances")
plt.barh(range(X_num.shape[1]), feat_imp_lgbm['imp'],
       color="r", align="center")
plt.yticks(range(X_num.shape[1]), feat_imp_lgbm['col'])  # one tick per numeric feature
plt.ylim([-1, X_num.shape[1]])
plt.show()

In [None]:
X_num_drop_lgbm = X_num.drop(feat_imp_lgbm['col'][:18], axis=1)

In [None]:
x_train_drop_lgbm, x_test_drop_lgbm = train_test_split(X_num_drop_lgbm, test_size=0.3, random_state=0)

In [None]:
lgbm_model = LGBMClassifier(n_estimators=1000, metric='binary_logloss', max_depth=5, 
                            learning_rate=0.01)
lgbm_model.fit(x_train_drop_lgbm, y_train)
y_lgbm = lgbm_model.predict_proba(x_test_drop_lgbm)[:, 1]
print('score:', average_precision_score(y_test, y_lgbm))