# Task

We want to train lightgbm and catboost in two ways: using their built-in handling of categorical features, and after preprocessing those features with our own methods. Then we compare the resulting quality.

We will use cross-validation to assess quality.

To keep things simple, we do not tune hyperparameters, and to get a more reliable estimate we do not use early_stopping.


After that we will save the test predictions for the four selected variants (each model with and without our preprocessing), and also save the best model.

# Imports

In [21]:
import os
import logging
from pprint import pprint

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated

from module.prepare_data import (
    BinFeaturesTransformer,
    CatFeaturesTransformer,
    DropTransformer,
)
from module.model import (
    OurLossCBObjective,
    OurLossCBMetric,
    our_loss_function,
    our_loss_lgbm_objective,
    CatboostWrapper,
    LightgbmWrapper,
    save_pipeline,
    load_pipeline,
)

In [2]:
# MODE = 'DEBUG'
MODE = 'FULL'

logging.basicConfig(
    level=('DEBUG' if MODE == 'DEBUG' else 'INFO'),
    format='%(asctime)s %(levelname)s:%(module)s %(message)s'
)

In [3]:
logger = logging.getLogger()

# Data loading

In [4]:
DATA_PATH = '../data'
TRAIN_FILE = 'TRAIN_DATA.csv'
TEST_FILE = 'TEST_DATA.csv'

In [5]:
train_data = pd.read_csv(os.path.join(DATA_PATH, TRAIN_FILE))

In [6]:
train_data.head()

Unnamed: 0,row_id,target,feat_0,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,...,feat_120,feat_121,feat_122,feat_123,feat_124,feat_125,feat_126,feat_127,feat_128,feat_129
0,2,10024.24,A,D,A,A,A,0.6984,A,A,...,A,L,A,BI,A,0.464228,T,A,A,B
1,3,5887.65,A,A,A,A,A,0.24564,A,A,...,A,K,A,BI,A,0.330514,P,A,A,A
2,4,21015.57,A,D,A,A,A,0.28768,A,A,...,A,G,A,BI,A,0.40028,T,A,A,A
3,6,6251.75,A,D,A,A,A,0.34987,A,A,...,A,A,A,BI,A,0.438385,T,A,A,A
4,7,1899.61,A,D,A,A,B,0.49462,A,A,...,A,I,A,BI,A,0.485918,T,A,A,B


In [7]:
train_data = train_data[:10000] if MODE == 'DEBUG' else train_data

In [8]:
len(train_data)

143121

# Preparing the pipelines

We consider the following options:
* Should binary features be converted to numbers?
* Should we preprocess categorical features ourselves?
* If not, should we pass them to the model as categorical or drop them altogether?
* Which model should we use?
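
The options above, together with the choice of loss (`our` or `default`), span a grid of pipelines. As a rough sketch, the size of that grid can be checked with `itertools.product`; the labels are the ones used in the pipeline names of this notebook, and nothing here trains a model.

```python
from itertools import product

# Option labels as used in the pipeline names of this notebook.
bin_types = ['num', 'cat']
prepare_types = ['exp', 'alpha_0', 'alpha_1', 'alpha_10', 'alpha_100', 'no', 'drop']
model_types = ['cb', 'lgbm']
metric_types = ['our', 'default']

name_pattern = 'bin:{},prepare:{},model:{},metric:{}'
grid_names = [
    name_pattern.format(*combo)
    for combo in product(bin_types, prepare_types, model_types, metric_types)
]

print(len(grid_names))  # 2 * 7 * 2 * 2 = 56
print(grid_names[0])    # bin:num,prepare:exp,model:cb,metric:our
```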

In [9]:
bin_as_num_features = BinFeaturesTransformer(bin_as_numeric=True).get_features()
bin_as_cat_features = BinFeaturesTransformer(bin_as_numeric=False).get_features()

In [10]:
pipeline_name_pattern = 'bin:{},prepare:{},model:{},metric:{}'
bin_features_params = {
    'num': {
        'bin_as_numeric': True,
        'cat_features': bin_as_num_features['cat_features'],
    },
    'cat': {
        'bin_as_numeric': False,
        'cat_features': bin_as_cat_features['cat_features'],
    },
}
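# NOTE: for the non-expanding variants the key suffix is not the alpha value itself:
# 'alpha_1' uses alpha=10, 'alpha_10' uses alpha=100, 'alpha_100' uses alpha=1000.
prepare_params = {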
    'exp': {
        'expanding': True,
        'alpha': 0,
    },
    'alpha_0': {
        'alpha': 0,
        'expanding': False,
    },
    'alpha_1': {
        'alpha': 10,
        'expanding': False,
    },
    'alpha_10': {
        'alpha': 100,
        'expanding': False,
    },
    'alpha_100': {
        'alpha': 1000,
        'expanding': False,
    },
    'no': {},
    'drop': {}
}
models = ['cb', 'lgbm']
metrics = ['our', 'default']
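
The `alpha` and `expanding` options are passed to `CatFeaturesTransformer`, whose implementation lives in `module.prepare_data` and is not shown here. One common reading of such parameters is mean target encoding, with `alpha` as a smoothing strength toward the global mean and `expanding` as a variant that uses only the preceding rows. The sketch below illustrates that interpretation only; the function names and formulas are assumptions, not the module's actual code.

```python
from collections import defaultdict


def smoothed_target_encoding(categories, targets, alpha):
    """Encode each category as (sum + alpha * global_mean) / (count + alpha)."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return [
        (sums[cat] + alpha * global_mean) / (counts[cat] + alpha)
        for cat in categories
    ]


def expanding_target_encoding(categories, targets):
    """Encode each row with the mean target of earlier rows of the same category."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    encoded = []
    for cat, y in zip(categories, targets):
        # Rows with no history fall back to the global mean.
        encoded.append(sums[cat] / counts[cat] if counts[cat] else global_mean)
        sums[cat] += y
        counts[cat] += 1
    return encoded


cats = ['A', 'A', 'B', 'A', 'B']
ys = [10.0, 20.0, 30.0, 60.0, 50.0]
print(smoothed_target_encoding(cats, ys, alpha=0))
print(expanding_target_encoding(cats, ys))
```

With `alpha=0` the first function reduces to the plain per-category mean; larger values pull rare categories toward the global mean.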

In [11]:
pipelines = dict()
for bin_type in bin_features_params:
    for prepare_type in prepare_params:
        for model in models:
            for metric in metrics:
                name = pipeline_name_pattern.format(bin_type,
                                                    prepare_type,
                                                    model,
                                                    metric)
                bin_params = bin_features_params[bin_type].copy()
                prep_params = prepare_params[prepare_type]
                pipeline_list = []
                pipeline_list.append(('bin_transform',
                                      BinFeaturesTransformer(bin_as_numeric=bin_params.pop('bin_as_numeric'))))
                model_params = {'n_estimators': 100}
                if MODE == 'DEBUG':
                    model_params['verbose'] = 1
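                if prepare_type == 'drop':
                    # Drop the categorical columns entirely.
                    pipeline_list.append(('drop_transform',
                                          DropTransformer(drop_columns=bin_params['cat_features'])))
                elif prepare_type == 'no':
                    # No preprocessing of ours: pass cat_features to the model itself.
                    model_params.update(bin_params)
                else:
                    # Encode the categorical columns with our own transformer.
                    pipeline_list.append(('cat_transform',
                                          CatFeaturesTransformer(**{**bin_params, **prep_params})))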
                if model == 'cb':
                    if metric == 'our':
                        model_params.update({
                            'loss_function': OurLossCBObjective(),
                            'eval_metric': OurLossCBMetric(),
                        })
                    pipeline_list.append(('regressor',
                                          CatboostWrapper(**model_params)))
                else:
                    if metric == 'our':
                        model_params.update({
                            'objective': our_loss_lgbm_objective,
                        })
                    pipeline_list.append(('regressor',
                                          LightgbmWrapper(**model_params)))
                pipelines[name] = Pipeline(pipeline_list)

In [12]:
len(pipelines)

56

In [13]:
pprint(list(pipelines.keys()))

['bin:num,prepare:exp,model:cb,metric:our',
 'bin:num,prepare:exp,model:cb,metric:default',
 'bin:num,prepare:exp,model:lgbm,metric:our',
 'bin:num,prepare:exp,model:lgbm,metric:default',
 'bin:num,prepare:alpha_0,model:cb,metric:our',
 'bin:num,prepare:alpha_0,model:cb,metric:default',
 'bin:num,prepare:alpha_0,model:lgbm,metric:our',
 'bin:num,prepare:alpha_0,model:lgbm,metric:default',
 'bin:num,prepare:alpha_1,model:cb,metric:our',
 'bin:num,prepare:alpha_1,model:cb,metric:default',
 'bin:num,prepare:alpha_1,model:lgbm,metric:our',
 'bin:num,prepare:alpha_1,model:lgbm,metric:default',
 'bin:num,prepare:alpha_10,model:cb,metric:our',
 'bin:num,prepare:alpha_10,model:cb,metric:default',
 'bin:num,prepare:alpha_10,model:lgbm,metric:our',
 'bin:num,prepare:alpha_10,model:lgbm,metric:default',
 'bin:num,prepare:alpha_100,model:cb,metric:our',
 'bin:num,prepare:alpha_100,model:cb,metric:default',
 'bin:num,prepare:alpha_100,model:lgbm,metric:our',
 'bin:num,prepare:alpha_100,model:lgbm,m

# Cross-validation

We will validate on seven folds.
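
For reference, when `cv` is an integer and the estimator is a regressor, scikit-learn's `cross_val_score` uses unshuffled `KFold`, in which the first `n_samples % n_splits` folds receive one extra sample. Assuming the wrappers used here are treated as regressors, the validation fold sizes for the 143121 training rows can be sketched as:

```python
def kfold_sizes(n_samples, n_splits):
    """Fold sizes for unshuffled K-fold: the first n_samples % n_splits folds get one extra row."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


sizes = kfold_sizes(143121, 7)
print(sizes)       # six folds of 20446 rows and one of 20445
print(sum(sizes))  # 143121
```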

In [14]:
scorer = make_scorer(our_loss_function, greater_is_better=False)
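
`make_scorer(..., greater_is_better=False)` negates the loss so that larger scores are always better, which is why the cross-validation scores come out negative and are sign-flipped before being read as losses. A dependency-free sketch of that convention follows, with a hypothetical `toy_loss` standing in for `our_loss_function` (whose definition is not shown here):

```python
def toy_loss(y_true, y_pred):
    """Hypothetical stand-in loss: mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def negated_scorer(loss):
    """Mimic the sign convention of a scorer built with greater_is_better=False."""
    def score(y_true, y_pred):
        return -loss(y_true, y_pred)
    return score


scorer_sketch = negated_scorer(toy_loss)
score = scorer_sketch([1.0, 2.0, 3.0], [1.0, 4.0, 2.0])
print(score)   # negative: this is what cross-validation reports
print(-score)  # flipped back to the loss value
```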

In [15]:
pipelines_scores = dict()
n_folds = 7

In [16]:
for name in tqdm_notebook(pipelines):
    pipelines_scores[name] = cross_val_score(
        estimator=pipelines[name],
        X=train_data[train_data.columns[2:]],
        y=train_data['target'],
        scoring=scorer,
        n_jobs=-1,
        cv=n_folds,
    )

In [17]:
scores_df = -pd.DataFrame(pipelines_scores)  # the scorer returns negated losses; flip the sign back

In [18]:
scores_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
"bin:num,prepare:exp,model:cb,metric:our",7.0,2723.525,30.214774,2690.815,2700.037,2713.438,2744.31,2771.73
"bin:num,prepare:exp,model:cb,metric:default",7.0,1273785.0,97186.068195,1158812.0,1221359.0,1226264.0,1321857.0,1444987.0
"bin:num,prepare:exp,model:lgbm,metric:our",7.0,2725.612,20.887954,2702.406,2706.794,2726.786,2742.556,2751.39
"bin:num,prepare:exp,model:lgbm,metric:default",7.0,1226540.0,68778.010103,1144889.0,1164473.0,1244009.0,1276138.0,1315657.0
"bin:num,prepare:alpha_0,model:cb,metric:our",7.0,2752.941,31.088917,2715.471,2727.547,2754.355,2775.736,2794.193
"bin:num,prepare:alpha_0,model:cb,metric:default",7.0,1263716.0,100006.899397,1117587.0,1198391.0,1268033.0,1330064.0,1403482.0
"bin:num,prepare:alpha_0,model:lgbm,metric:our",7.0,2724.017,22.270102,2699.706,2703.539,2725.047,2742.627,2751.036
"bin:num,prepare:alpha_0,model:lgbm,metric:default",7.0,1207495.0,58912.976285,1141269.0,1154280.0,1217154.0,1246090.0,1293306.0
"bin:num,prepare:alpha_1,model:cb,metric:our",7.0,2745.216,32.7245,2707.549,2714.66,2756.087,2766.434,2790.687
"bin:num,prepare:alpha_1,model:cb,metric:default",7.0,1320064.0,180275.085221,1185870.0,1209692.0,1281181.0,1322704.0,1708605.0


In [19]:
scores_mean_df = (
    pd
    .DataFrame(scores_df.T.mean(axis=1))
    .reset_index(drop=False)
    .rename(columns={'index': 'name', 0: 'mean_score'}))

In [20]:
columns = ['bin', 'prepare', 'model', 'metric']

In [21]:
def get_pipeline_type(name, level):
    pipeline_types = [level_value.split(':')
                      for level_value in name.split(',')]
    pipeline_types = {level_value[0]: level_value[1]
                      for level_value in pipeline_types}
    return pipeline_types.get(level)

In [22]:
for col in columns:
    scores_mean_df[col] = scores_mean_df['name'].apply(lambda x: get_pipeline_type(x, col))

#### Loss function

In [23]:
scores_mean_df.groupby(by='metric')['mean_score'].describe()

metric,count,mean,std,min,25%,50%,75%,max
default,28.0,1314880.0,201651.662458,1207495.0,1217793.0,1260472.0,1285672.0,2059432.0
our,28.0,2737.169,15.139612,2710.41,2724.43,2732.881,2748.138,2771.265


We see that when the models optimize MSE instead of our metric, the results are far worse (because of the heavy tail): the mean score is about 1.31 million, versus about 2737 with our loss.
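
To see why a heavy tail hurts a squared-error objective, recall that the best constant prediction under MSE is the mean, while under absolute error it is the median; a few very large targets pull the mean far from the bulk of the data. The sketch below uses made-up numbers and absolute error as a stand-in, since `our_loss_function` is not shown in this notebook.

```python
from statistics import mean, median

# Made-up targets with one very large value, for illustration only.
targets = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0, 100000.0]


def mean_abs_error(values, prediction):
    """Average absolute deviation of the values from a constant prediction."""
    return sum(abs(v - prediction) for v in values) / len(values)


mse_optimal = mean(targets)    # constant that minimizes squared error
mae_optimal = median(targets)  # constant that minimizes absolute error

print(mse_optimal, mean_abs_error(targets, mse_optimal))
print(mae_optimal, mean_abs_error(targets, mae_optimal))
```

The single outlier drags the MSE-optimal constant far above five of the six targets, so its absolute error is noticeably larger than that of the median.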

In [32]:
scores_mean_best_df = scores_mean_df[scores_mean_df['metric']=='our'].copy()

#### Handling binary features

In [33]:
scores_mean_best_df.groupby(by='bin')['mean_score'].describe()

bin,count,mean,std,min,25%,50%,75%,max
cat,14.0,2737.060348,15.581563,2710.41017,2724.514965,2737.774911,2745.63509,2769.531453
num,14.0,2737.276967,15.272088,2723.52526,2724.15017,2730.028778,2748.656592,2771.264858


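No difference: the group means (2737.06 and 2737.28) differ by far less than the standard deviation within each group (about 15).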

#### Model and feature preparation

In [34]:
scores_mean_best_df.groupby(by='model')['mean_score'].describe()

model,count,mean,std,min,25%,50%,75%,max
cb,14.0,2746.196518,15.349337,2710.41017,2744.81128,2746.9656,2749.931475,2771.264858
lgbm,14.0,2728.140796,8.056149,2723.758005,2724.18724,2724.517892,2727.466756,2753.994763


In [35]:
scores_mean_best_df.groupby(by='prepare')['mean_score'].describe()

prepare,count,mean,std,min,25%,50%,75%,max
alpha_0,4.0,2736.847609,14.805317,2724.017091,2724.397082,2735.216397,2747.666924,2752.94055
alpha_1,4.0,2734.676007,11.938445,2724.165421,2724.425384,2734.661336,2744.911959,2745.215932
alpha_10,4.0,2735.836401,13.71175,2723.758005,2724.305833,2734.650828,2746.181395,2750.285943
alpha_100,4.0,2734.983621,12.707793,2724.145087,2724.225792,2733.460663,2744.218492,2748.868073
drop,4.0,2755.491895,20.411584,2727.176505,2747.290199,2761.763108,2769.964804,2771.264858
exp,4.0,2721.777636,7.755624,2710.41017,2720.246487,2724.568434,2726.099583,2727.563507
no,4.0,2740.567432,8.877403,2732.88105,2732.881155,2740.45167,2748.137948,2748.485337


In [36]:
scores_mean_best_df.sort_values(by='mean_score', ascending=True)

Unnamed: 0,name,mean_score,bin,prepare,model,metric
28,"bin:cat,prepare:exp,model:cb,metric:our",2710.41017,cat,exp,cb,our
0,"bin:num,prepare:exp,model:cb,metric:our",2723.52526,num,exp,cb,our
14,"bin:num,prepare:alpha_10,model:lgbm,metric:our",2723.758005,num,alpha_10,lgbm,our
6,"bin:num,prepare:alpha_0,model:lgbm,metric:our",2724.017091,num,alpha_0,lgbm,our
18,"bin:num,prepare:alpha_100,model:lgbm,metric:our",2724.145087,num,alpha_100,lgbm,our
10,"bin:num,prepare:alpha_1,model:lgbm,metric:our",2724.165421,num,alpha_1,lgbm,our
46,"bin:cat,prepare:alpha_100,model:lgbm,metric:our",2724.252694,cat,alpha_100,lgbm,our
42,"bin:cat,prepare:alpha_10,model:lgbm,metric:our",2724.488442,cat,alpha_10,lgbm,our
38,"bin:cat,prepare:alpha_1,model:lgbm,metric:our",2724.512038,cat,alpha_1,lgbm,our
34,"bin:cat,prepare:alpha_0,model:lgbm,metric:our",2724.523746,cat,alpha_0,lgbm,our


The spread is small and comparable to the variation between folds. Still, we pick the best variant.
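
A quick arithmetic check of this claim, using only numbers from the tables above: the gap between the best and the tenth-best mean score, set against the across-fold standard deviations reported for two of the pipelines.

```python
# Mean scores copied from the sorted table above (best and tenth-best rows).
best_mean = 2710.41017
tenth_mean = 2724.523746

# Across-fold standard deviations copied from scores_df.describe().T.
fold_std = {
    'bin:num,prepare:exp,model:cb,metric:our': 30.214774,
    'bin:num,prepare:exp,model:lgbm,metric:our': 20.887954,
}

gap = tenth_mean - best_mean
print(round(gap, 2))
for name, std in fold_std.items():
    # True means the gap between pipelines is smaller than the fold-to-fold spread.
    print(name, gap < std)
```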

In [37]:
scores_mean_best_df['with_preproc'] = scores_mean_best_df['prepare'].apply(lambda x: x not in ['no', 'drop'])

In [44]:
best_pipelines = (
    scores_mean_best_df
    .groupby(by=['model', 'with_preproc'], as_index=False)
    .apply(lambda x: x.sort_values(by='mean_score').head(1)))

In [50]:
best_pipelines

Unnamed: 0,Unnamed: 1,name,mean_score,bin,prepare,model,metric,with_preproc
0,20,"bin:num,prepare:no,model:cb,metric:our",2748.022152,num,no,cb,our,False
1,28,"bin:cat,prepare:exp,model:cb,metric:our",2710.41017,cat,exp,cb,our,True
2,26,"bin:num,prepare:drop,model:lgbm,metric:our",2727.176505,num,drop,lgbm,our,False
3,14,"bin:num,prepare:alpha_10,model:lgbm,metric:our",2723.758005,num,alpha_10,lgbm,our,True


# Getting predictions
So, we select the 4 best models: lgbm and cb, each with and without preprocessing.

In [14]:
final_pipelines = {
    'cb_with_preproc': Pipeline([
        ('bin_transform', BinFeaturesTransformer(bin_as_numeric=False)),
        ('cat_transform', CatFeaturesTransformer(
            cat_features=bin_as_cat_features['cat_features'],
            expanding=True,
            alpha=0,
        )),
        ('regressor', CatboostWrapper(
            n_estimators=100,
            loss_function=OurLossCBObjective(),
            eval_metric=OurLossCBMetric(),
        )),
    ]),
    'cb': Pipeline([
        ('bin_transform', BinFeaturesTransformer(bin_as_numeric=True)),
        ('regressor', CatboostWrapper(
            n_estimators=100,
            cat_features=bin_as_num_features['cat_features'],
            loss_function=OurLossCBObjective(),
            eval_metric=OurLossCBMetric(),
        )),
    ]),
    'lgbm_with_preproc': Pipeline([
        ('bin_transform', BinFeaturesTransformer(bin_as_numeric=True)),
        ('cat_transform', CatFeaturesTransformer(
            cat_features=bin_as_num_features['cat_features'],
            expanding=False,
            alpha=10,  # NOTE: the CV pipeline labelled 'alpha_10' was built with alpha=100 in prepare_params
        )),
        ('regressor', LightgbmWrapper(
            n_estimators=100,
            objective=our_loss_lgbm_objective,
        )),
    ]),
    'lgbm': Pipeline([
        ('bin_transform', BinFeaturesTransformer(bin_as_numeric=True)),
        ('drop_transform', DropTransformer(
            drop_columns=bin_as_num_features['cat_features']
        )),
        ('regressor', LightgbmWrapper(
            n_estimators=100,
            objective=our_loss_lgbm_objective,
        )),
    ]),
}

In [15]:
for name in final_pipelines:
    logger.info(f'Training of {name}')
    final_pipelines[name].fit(
        X=train_data[train_data.columns[2:]],
        y=train_data['target'],
    )

2020-05-16 22:57:26,938 INFO:<ipython-input-15-c052d415860f> Training of cb_with_preproc


0:	learn: 2917.2320868	total: 186ms	remaining: 18.4s
1:	learn: 2883.9070921	total: 323ms	remaining: 15.8s
2:	learn: 2870.8287182	total: 460ms	remaining: 14.9s
3:	learn: 2859.5051747	total: 591ms	remaining: 14.2s
4:	learn: 2850.9599050	total: 731ms	remaining: 13.9s
5:	learn: 2843.7570431	total: 862ms	remaining: 13.5s
6:	learn: 2837.0634626	total: 996ms	remaining: 13.2s
7:	learn: 2831.5042679	total: 1.13s	remaining: 13s
8:	learn: 2823.4010622	total: 1.26s	remaining: 12.8s
9:	learn: 2818.5723454	total: 1.4s	remaining: 12.6s
10:	learn: 2814.1595940	total: 1.53s	remaining: 12.4s
11:	learn: 2809.7272746	total: 1.66s	remaining: 12.2s
12:	learn: 2806.4218708	total: 1.79s	remaining: 12s
13:	learn: 2802.3048802	total: 1.93s	remaining: 11.9s
14:	learn: 2797.6381890	total: 2.06s	remaining: 11.7s
15:	learn: 2794.9436923	total: 2.2s	remaining: 11.5s
16:	learn: 2791.3134629	total: 2.33s	remaining: 11.4s
17:	learn: 2788.8661155	total: 2.46s	remaining: 11.2s
18:	learn: 2785.7323201	total: 2.6s	remainin

2020-05-16 22:58:01,298 INFO:<ipython-input-15-c052d415860f> Training of cb


99:	learn: 2691.9039336	total: 13.5s	remaining: 0us
0:	learn: 9501.3643520	total: 269ms	remaining: 26.6s
1:	learn: 9226.5072558	total: 546ms	remaining: 26.8s
2:	learn: 8959.2895905	total: 771ms	remaining: 24.9s
3:	learn: 8707.4267903	total: 1.01s	remaining: 24.3s
4:	learn: 8462.4326075	total: 1.26s	remaining: 24s
5:	learn: 8231.2437359	total: 1.51s	remaining: 23.6s
6:	learn: 8013.2934327	total: 1.75s	remaining: 23.2s
7:	learn: 7802.0284362	total: 2s	remaining: 22.9s
8:	learn: 7599.1926659	total: 2.23s	remaining: 22.5s
9:	learn: 7407.3837333	total: 2.47s	remaining: 22.2s
10:	learn: 7222.3932921	total: 2.71s	remaining: 21.9s
11:	learn: 7041.6461978	total: 2.95s	remaining: 21.6s
12:	learn: 6871.6553295	total: 3.19s	remaining: 21.4s
13:	learn: 6708.7915668	total: 3.43s	remaining: 21.1s
14:	learn: 6552.8192237	total: 3.67s	remaining: 20.8s
15:	learn: 6402.5944795	total: 3.9s	remaining: 20.5s
16:	learn: 6259.2924178	total: 4.13s	remaining: 20.2s
17:	learn: 6121.3914221	total: 4.36s	remaining

2020-05-16 22:58:50,086 INFO:<ipython-input-15-c052d415860f> Training of lgbm_with_preproc


99:	learn: 2869.9723082	total: 23.2s	remaining: 0us


New categorical_feature is []
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))
2020-05-16 22:59:39,225 INFO:<ipython-input-15-c052d415860f> Training of lgbm
New categorical_feature is []
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))


In [16]:
test_data = pd.read_csv(os.path.join(DATA_PATH, TEST_FILE))

In [17]:
test_data.head()

Unnamed: 0,row_id,feat_0,feat_1,feat_2,feat_3,feat_4,feat_5,feat_6,feat_7,feat_8,...,feat_120,feat_121,feat_122,feat_123,feat_124,feat_125,feat_126,feat_127,feat_128,feat_129
0,0,A,D,I,A,A,0.36083,A,B,A,...,A,B,A,BI,B,0.298041,T,A,A,B
1,1,A,E,E,A,A,0.5245,A,A,A,...,B,H,A,BI,B,0.66248,R,A,A,B
2,5,A,A,E,A,A,0.49462,A,A,A,...,A,O,A,AB,A,0.677861,P,A,A,D
3,13,A,D,A,A,A,0.82252,A,A,A,...,A,L,A,BI,A,0.586522,T,A,A,B
4,14,A,A,A,A,A,0.64027,A,A,A,...,A,I,A,AB,A,0.284869,P,A,A,B


In [18]:
PREDICTIONS_PATH = '../predictions/'

In [19]:
for name in final_pipelines:
    prediction_file = os.path.join(PREDICTIONS_PATH, f'{name}_test_prediction.csv')
    prediction_df = test_data[['row_id']].copy()
    prediction_df['prediction'] = final_pipelines[name].predict(X=test_data[test_data.columns[1:]])
    prediction_df.to_csv(prediction_file, index=False)

In [22]:
MODELS_PATH = '../models/'

In [23]:
name = 'cb_with_preproc'

save_pipeline(
    final_pipelines[name],
    os.path.join(MODELS_PATH, f'{name}_model.pkl.gz') 
)

2020-05-16 23:02:59,540 INFO:model Pipeline saved as ../models/cb_with_preproc_model.pkl.gz


In [24]:
restored_pipeline = load_pipeline(os.path.join(MODELS_PATH, f'{name}_model.pkl.gz'))

2020-05-16 23:03:18,255 INFO:model Pipeline loaded from ../models/cb_with_preproc_model.pkl.gz


In [30]:
restored_prediction = restored_pipeline.predict(X=test_data[test_data.columns[1:]])
original_prediction = final_pipelines[name].predict(X=test_data[test_data.columns[1:]])
assert (restored_prediction == original_prediction).all()
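
`save_pipeline` and `load_pipeline` come from `module.model` and are not shown here. Judging only by the `.pkl.gz` extension, one plausible implementation is a gzip-compressed pickle. The helper names below are hypothetical, and the sketch uses a plain dictionary in place of a fitted pipeline.

```python
import gzip
import os
import pickle
import tempfile


def save_object(obj, path):
    """Hypothetical helper: pickle an object into a gzip-compressed file."""
    with gzip.open(path, 'wb') as f:
        pickle.dump(obj, f)


def load_object(path):
    """Hypothetical helper: load an object saved by save_object."""
    with gzip.open(path, 'rb') as f:
        return pickle.load(f)


original = {'n_estimators': 100, 'steps': ['bin_transform', 'regressor']}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_model.pkl.gz')
    save_object(original, path)
    restored = load_object(path)

print(restored == original)
```

As in the cell above, comparing the restored object with the original is a cheap way to confirm that serialization lost nothing.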

So, we have saved the predictions of our models, as well as the one model that proved most effective in cross-validation.