# Introduction

## Task description
[source](https://contest.yandex.ru/contest/20144/problems/)

### RUS

Every user who posts an apartment rental listing wants to know how long it will take to rent the property out. This information is needed first of all on the listing submission form, which noticeably limits the set of available features. You are asked to build a model that predicts the exposure time of listings on Yandex.Realty (Яндекс.Недвижимость).
To build the target variable, the exposure time is split into several classes, each assigned an integer: "less than 7 days" (1), "7-14 days" (2), "15-30 days" (3), "30-70 days" (4), "more than 70 days" (5).
The metric used to score solutions is defined as follows:
$$
metric = -\frac{1}{l}\sum\limits_{i=1}^l \left(\exp\left(|prediction_i - target_i|\right) - 1\right)
$$
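
For reference, the formula can be evaluated directly. The sketch below is a plain-Python illustration only; the function name `exposition_metric` is hypothetical and is not part of the contest materials.

```python
import math


def exposition_metric(predictions, targets):
    """Mean of -(exp(|prediction - target|) - 1) over all objects."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    total = sum(math.exp(abs(p - t)) - 1 for p, t in zip(predictions, targets))
    return -total / len(targets)


# A perfect prediction scores 0; every miss lowers the score exponentially.
print(exposition_metric([1, 2, 3], [1, 2, 3]))
print(exposition_metric([1, 2, 3], [1, 2, 5]))
```

Because the penalty grows exponentially with the distance between classes, a prediction that is off by two classes costs e^2 - 1 ≈ 6.39, compared with e - 1 ≈ 1.72 for an off-by-one error.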

When a solution is submitted, the metric is computed on the public part of the test set. The final value will be determined on the hidden part of the test set to rule out overfitting. This value will be published after the hackathon ends.
Only the last submission is scored. Make sure that, when the hackathon ends, your last submission contains exactly the prediction for which you want the final metric value to be computed on the hidden part of the test set.

## Dataset description

### RUS

* **id** - id of the apartment rental listing
* **building_id** - id of the building
* **unified_address** - address of the building
* **building_series_id** - id of the building series
* **build_year** - year the building was built
* **site_id** - id of the residential complex (for recently built buildings)
* **parking** - type of parking at the building
* **expect_demolition** - the building is part of the renovation program and is awaiting demolition
* **flats_count** - number of apartments in the building
* **building_type** - wall type of the building
* **main_image** - part of the URL of the listing's main photo; prepending "http:" gives the full URL (http://avatars.mds.yandex.net/get-realty/1702013/add.d352c435e10b2c47092f43a332bedb13.realty-api-vos/main/)
* **latitude, longitude** - coordinates of the building
* **total_area** - area of the apartment in sq. m
* **ceiling_height** - ceiling height in the apartment
* **rooms** - number of rooms in the apartment
* **floors_total** - number of floors in the building
* **floor** - floor of the apartment
* **living_area** - living area of the apartment
* **kitchen_area** - kitchen area
* **is_apartment** - the flat is legally registered as apartments (апартаменты)
* **studio** - the apartment is a studio
* **has_elevator** - whether the building has an elevator
* **day** - first day of the listing's exposure
* **balcony** - type and presence of a balcony
* **renovation** - quality of the renovation
* **locality_name** - name of the locality
* **price** - rental price of the apartment for 1 month
* **target** - exposure time class that the model must learn to predict
* **target_string** - string representation of the exposure time

# Notebook setup

## Initial constants

In [2]:
PROJECT_NAME = "hack_the_realty_exposition"
MOUNT_DIR = '/content/drive'  # Used only when running in Colab
VALIDATE_RATIO = 0.2

## Additional installs

In [3]:
!pip install catboost



## Libraries

In [4]:
import os

from datetime import datetime

from collections import Counter

import math

import numpy as np

import pandas as pd

import catboost as ctb
from catboost import CatBoostRegressor, Pool
from catboost.utils import get_roc_curve

from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import roc_auc_score

from plotly.subplots import make_subplots
import plotly.graph_objects as go

import matplotlib.pyplot as plt

%matplotlib inline

## Handling the Google Colab case

### Importing libraries

In [5]:
try:
    from google.colab import files, drive

    USE_COLAB = True
except ImportError:
    # google.colab is only importable inside a Colab runtime
    USE_COLAB = False

if USE_COLAB:
    print("Don't forget to avoid disconnections:")
    print("""
function ClickConnect(){
    console.log("Clicking"); 
    document.querySelector("colab-connect-button").click() 
}
setInterval(ClickConnect,60000)
    """)

Don't forget to avoid disconnections:

function ClickConnect(){
    console.log("Clicking"); 
    document.querySelector("colab-connect-button").click() 
}
setInterval(ClickConnect,60000)
    


### Mounting Google Drive

In [6]:
if USE_COLAB:
    drive.mount(MOUNT_DIR)
    DRIVE_DIR = os.path.join(MOUNT_DIR, 'My Drive')
    print(f"Drive directory is {DRIVE_DIR}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Drive directory is /content/drive/My Drive


## Declaring the working directory

Uses Google Drive when running in Google Colab

In [7]:
PROJECT_DIR = os.path.join(DRIVE_DIR, 'projects', PROJECT_NAME) if USE_COLAB else './'
WORK_DIR = '/content' if USE_COLAB else PROJECT_DIR
print(f"Project directory is {PROJECT_DIR}")
print(f"Working directory is {WORK_DIR}")

Project directory is /content/drive/My Drive/projects/hack_the_realty_exposition
Working directory is /content


# Data processing

The following variables must be defined by the end of this section:

* **X_train_pure**: pd.DataFrame\
id | building_series_id | site_id | parking | build_year | expect_demolition | ~~main_image~~ | latitude | ~~total_area~~ | ceiling_height | rooms | floors_total | living_area | floor | is_apartment | building_id | has_elevator | studio | ~~unified_address~~ | area | kitchen_area | ~~day~~ | longitude | price | flats_count | building_type | balcony | locality_name | renovation | *Month* | *Day* | *DayOfWeek* | *Season*

* **y_train**: pd.Series\
id | target
* **X_valid_pure**: pd.DataFrame\
id | building_series_id | site_id | parking | build_year | expect_demolition | latitude | total_area | ceiling_height | rooms | floors_total | living_area | floor | is_apartment | building_id | has_elevator | studio | unified_address | area | kitchen_area | day | longitude | price | flats_count | building_type | balcony | locality_name | renovation 
* **y_valid**: pd.Series\
id | target
* **X_test_pure**: pd.DataFrame\
id | building_series_id | site_id | parking | build_year | expect_demolition | latitude | total_area | ceiling_height | rooms | floors_total | living_area | floor | is_apartment | building_id | has_elevator | studio | unified_address | area | kitchen_area | day | longitude | price | flats_count | building_type | balcony | locality_name | renovation 

## Loading

Loading the train/test datasets into dataframes

The following two variables must be defined by the end of this section:
* train_dataset: pd.DataFrame
* test_dataset: pd.DataFrame

### Declaring paths

In [8]:
src_data_dir_path = os.path.join(PROJECT_DIR, 'data')
dist_data_dir = os.path.join(WORK_DIR, 'data')

data_archive_path = os.path.join(src_data_dir_path, 'E.zip')

### Unpacking the archive

In [9]:
!unzip -o "$data_archive_path" -d "$dist_data_dir" > /dev/null

### Reading

In [10]:
train_file_path = os.path.join(dist_data_dir, 'E', 'exposition_train.tsv')
test_file_path = os.path.join(dist_data_dir, 'E', 'exposition_test.tsv')

train_dataset = pd.read_csv(train_file_path, index_col='id', sep='\t')
test_dataset = pd.read_csv(test_file_path, index_col='id', sep='\t')

## Visualization

### Viewing the first rows

In [11]:
train_dataset.head()

id,building_series_id,site_id,target,parking,target_string,build_year,expect_demolition,main_image,latitude,total_area,ceiling_height,rooms,floors_total,living_area,floor,is_apartment,building_id,has_elevator,studio,unified_address,area,kitchen_area,day,longitude,price,flats_count,building_type,balcony,locality_name,renovation
5677548107212057955,1564812,0,1,OPEN,LESS_7,2005,False,//avatars.mds.yandex.net/get-realty/903734/add...,55.645313,105.0,3.0,3,20,50.0,14,False,7969879732878112812,True,False,"Россия, Москва, Пролетарский проспект, 7",105.0,15.0,2018-07-15,37.65749,95000,407,MONOLIT,BALCONY,Москва,EURO
155646401125694364,1564812,0,2,CLOSED,7_14,2010,False,//avatars.mds.yandex.net/get-realty/1702013/ad...,55.537102,40.0,3.0,1,3,0.0,1,False,7667415960903930340,False,False,"Россия, Москва, посёлок Первомайское, Централь...",40.0,10.0,2019-01-18,37.155632,25000,40,MONOLIT,UNKNOWN,посёлок Первомайское,COSMETIC_DONE
9186198458182518100,663302,0,2,OPEN,7_14,1995,False,//avatars.mds.yandex.net/get-realty/924080/add...,55.662956,37.599998,2.64,0,17,0.0,4,False,7166215405310646476,True,True,"Россия, Москва, улица Намёткина, 13к1",37.599998,0.0,2018-04-24,37.555466,26000,472,PANEL,LOGGIA,Москва,GOOD
10844743366553352344,1564812,0,2,OPEN,7_14,2018,False,//avatars.mds.yandex.net/get-realty/1521999/ad...,55.669151,80.0,0.0,3,27,49.0,23,False,2039402855860137453,True,False,"Россия, Московская область, Одинцово, Верхне-П...",80.0,20.0,2019-02-19,37.285,35000,156,PANEL,UNKNOWN,Одинцово,GOOD
3712912186792420056,1564812,0,3,UNKNOWN,14_30,2004,False,//avatars.mds.yandex.net/get-realty/50286/f5c8...,55.828518,100.0,3.0,3,4,0.0,3,False,4638454967482853510,True,False,"Россия, Москва, улица Рословка, 12к1",100.0,0.0,2017-08-08,37.361897,80000,31,MONOLIT,UNKNOWN,Москва,EURO


In [12]:
test_dataset.head()

id,building_series_id,site_id,parking,build_year,expect_demolition,main_image,latitude,total_area,ceiling_height,rooms,floors_total,living_area,floor,is_apartment,building_id,has_elevator,studio,unified_address,area,kitchen_area,day,public,longitude,price,flats_count,building_type,balcony,locality_name,renovation
13762887891614807236,663294,0,UNKNOWN,1971,False,//avatars.mds.yandex.net/get-realty/1900763/ad...,55.795704,36.0,2.64,1,12,0.0,10,False,1470199257376425951,True,False,"Россия, Москва, Стрелецкая улица, 8",36.0,0.0,2020-01-25,True,37.602478,40000,80,PANEL,UNKNOWN,Москва,UNKNOWN
14654451946329972059,712125,0,UNKNOWN,1986,False,//avatars.mds.yandex.net/get-realty/1583116/ad...,55.605583,40.0,2.48,1,16,20.0,9,False,2797222858463679248,True,False,"Россия, Москва, Ясеневая улица, 36/2",40.0,10.0,2019-11-19,True,37.743679,25000,222,PANEL,LOGGIA,Москва,COSMETIC_DONE
17449292585625593873,0,0,UNKNOWN,2014,False,//avatars.mds.yandex.net/get-realty/2124710/ad...,55.92556,25.0,0.0,0,16,12.0,16,False,3589071626016088122,True,True,"Россия, Московская область, Королёв, улица Мар...",25.0,0.0,2020-01-11,True,37.862965,19000,179,MONOLIT,LOGGIA,Королёв,COSMETIC_DONE
15597282206699587329,0,0,UNKNOWN,2001,False,//avatars.mds.yandex.net/get-realty/2958378/ad...,55.432522,42.0,0.0,1,10,20.0,4,False,7320605563669842994,True,False,"Россия, Московская область, Подольск, Комсомол...",42.0,10.0,2020-01-27,True,37.544224,20000,0,PANEL,LOGGIA,Подольск,COSMETIC_DONE
3718201047023531068,1564812,0,UNKNOWN,2019,False,//avatars.mds.yandex.net/get-realty/2732616/ad...,55.91753,73.300003,2.8,3,16,45.799999,6,False,3672377569291994545,True,False,"Россия, Московская область, Химки, улица Герма...",73.300003,10.2,2020-03-04,False,37.411098,68000,0,MONOLIT,TWO_LOGGIA,Химки,EURO


### Examining individual fields

* total_area and area

In [13]:
assert len(train_dataset[train_dataset.total_area != train_dataset.area]) == 0
assert len(test_dataset[test_dataset.total_area != test_dataset.area]) == 0

print("Fields total_area and area are duplicates")

Fields total_area and area are duplicates


* day

In [14]:
print(train_dataset['day'].min())
print(train_dataset['day'].max())

2017-01-01
2019-10-31


In [15]:
print(test_dataset['day'].min())
print(test_dataset['day'].max())

2019-11-01
2020-03-31


## Final processing


The following variables must be defined by the end of this section:

* **X_train_pure**: pd.DataFrame

* **y_train**: pd.Series

* **X_valid_pure**: pd.DataFrame

* **y_valid**: pd.Series

* **X_test_pure**: pd.DataFrame


### Extracting the sample and the target


In [16]:
train_dataset = train_dataset.sort_values(by='day')

partition = int(len(train_dataset) * (1 - VALIDATE_RATIO))
train_train_dataset = train_dataset.iloc[:partition]
train_valid_dataset = train_dataset.iloc[partition:]

X_train_train_pure = train_train_dataset.drop('target', axis=1)
X_train_train_pure = X_train_train_pure.drop('target_string', axis=1)
y_train_train = train_train_dataset['target']

X_train_valid_pure = train_valid_dataset.drop('target', axis=1)
X_train_valid_pure = X_train_valid_pure.drop('target_string', axis=1)
y_train_valid = train_valid_dataset['target']
X_test_pure = test_dataset
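
The cell above splits the data chronologically: after sorting by `day`, the earliest 80% of rows go to training and the most recent 20% to validation, so no validation listing is older than a training one. This mirrors the test set, whose dates all follow the training period. A minimal pure-Python sketch of the same idea on made-up rows (the helper name `chronological_split` and the sample data are assumptions for illustration):

```python
def chronological_split(rows, validate_ratio, key):
    """Sort rows by key and cut off the most recent share for validation."""
    ordered = sorted(rows, key=key)
    partition = int(len(ordered) * (1 - validate_ratio))
    return ordered[:partition], ordered[partition:]


# Made-up listings: (day, price). ISO dates sort correctly as plain strings.
rows = [
    ("2019-01-18", 25000),
    ("2017-08-08", 80000),
    ("2018-07-15", 95000),
    ("2018-04-24", 26000),
    ("2019-02-19", 35000),
]
train_rows, valid_rows = chronological_split(rows, 0.2, key=lambda r: r[0])
```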

### Adding extra features and dropping unneeded ones

In [17]:
def add_time_features(X):
    datetime_series = pd.to_datetime(X['day'], format='%Y-%m-%d')
    X['Month'] = pd.DatetimeIndex(datetime_series).month
    X['Day'] = pd.DatetimeIndex(datetime_series).day
    X['DayOfWeek'] = pd.DatetimeIndex(datetime_series).dayofweek
    X['Season'] = (X['Month'] % 12 + 3) // 3
    return X

def drop_columns(X):
    dropped_columns = [
        'day', 'total_area', 'main_image', 'unified_address'
    ]
    return X.drop(dropped_columns, axis=1)

def prepare_sample(X):
    X = X.copy()
    X = add_time_features(X)
    X = drop_columns(X)
    return X
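
The `Season` expression `(Month % 12 + 3) // 3` maps December through February to 1, March through May to 2, June through August to 3, and September through November to 4. The sketch below reproduces the four time features with the standard library instead of pandas; the function name `time_features` is hypothetical. Both pandas `dayofweek` and `date.weekday()` count Monday as 0.

```python
from datetime import date


def time_features(day):
    """Return (month, day of month, day of week, season) for an ISO date string."""
    d = date.fromisoformat(day)
    # 1 = Dec-Feb, 2 = Mar-May, 3 = Jun-Aug, 4 = Sep-Nov
    season = (d.month % 12 + 3) // 3
    return d.month, d.day, d.weekday(), season


# Dates taken from the first rows of the test set shown earlier.
for day in ("2020-01-25", "2019-11-19", "2020-03-04"):
    print(day, time_features(day))
```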

In [18]:
X_train_train = prepare_sample(X_train_train_pure)
X_train_valid = prepare_sample(X_train_valid_pure)
X_test = prepare_sample(X_test_pure)

In [19]:
X_train_valid.columns

Index(['building_series_id', 'site_id', 'parking', 'build_year',
       'expect_demolition', 'latitude', 'ceiling_height', 'rooms',
       'floors_total', 'living_area', 'floor', 'is_apartment', 'building_id',
       'has_elevator', 'studio', 'area', 'kitchen_area', 'longitude', 'price',
       'flats_count', 'building_type', 'balcony', 'locality_name',
       'renovation', 'Month', 'Day', 'DayOfWeek', 'Season'],
      dtype='object')

# Training experiments

The following variables must be defined by the end of this section:

* **y_test_predicted**: pd.Series \
Id | Target
* **SUBMISSTION_FILE_NAME**: string\
Name of the file to save the submission to

## Catboost

### Declaring the loss function

In [20]:
class LossFunction(object):
    def calc_ders_range(self, approxes, targets, weights):
        assert len(approxes) == len(targets)
        if weights is not None:
            assert len(weights) == len(approxes)

        result = []
        for index in range(len(targets)):
            # Sign of (approx - target); +1 is used when they are equal
            sign = -1 if approxes[index] < targets[index] else 1
            # exp(|target - approx|) is the second-order term ...
            der2 = math.exp(abs(targets[index] - approxes[index]))
            # ... and, with the sign attached, the first-order term
            der1 = der2 * sign

            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]

            result.append((der1, der2))
        return result
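
The pair computed for each object equals the first and second derivatives, with respect to the prediction, of `exp(|approx - target|) - 1`, that is, of the competition metric with its sign flipped (away from the kink at `approx == target`). The sketch below checks this numerically with finite differences; the names `loss` and `ders` are hypothetical, and `ders` simply repeats the unweighted arithmetic of `calc_ders_range`.

```python
import math


def loss(approx, target):
    """Per-object loss exp(|approx - target|) - 1 (the metric with the sign flipped)."""
    return math.exp(abs(approx - target)) - 1


def ders(approx, target):
    """The (der1, der2) pair as computed in calc_ders_range, without weights."""
    sign = -1 if approx < target else 1
    der2 = math.exp(abs(target - approx))
    return der2 * sign, der2


approx, target, h = 2.5, 4.0, 1e-4
# Central finite differences for the first and second derivatives
numeric_der1 = (loss(approx + h, target) - loss(approx - h, target)) / (2 * h)
numeric_der2 = (loss(approx + h, target) - 2 * loss(approx, target)
                + loss(approx - h, target)) / h ** 2
der1, der2 = ders(approx, target)
print(der1, numeric_der1)
print(der2, numeric_der2)
```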

In [21]:
class EvalMetric(LossFunction):
    def is_max_optimal(self):
        return True

    def evaluate(self, approxes, target, weight):
        assert len(approxes) == 1
        assert len(target) == len(approxes[0])

        approx = approxes[0]

        error_sum = 0.0
        weight_sum = 0.0

        for i in range(len(approx)):
            w = 1.0 if weight is None else weight[i]
            weight_sum += w
            error = -(math.exp(abs(approx[i] - target[i])) - 1)
            # Weight each object's contribution so the sum matches weight_sum
            error_sum += w * error

        return error_sum, weight_sum
    
    def get_final_error(self, error, weight):
        return error / (weight + 1e-38)

### Declaring the pool and parameters

In [49]:
X_test.head()

id,building_series_id,site_id,parking,build_year,expect_demolition,latitude,ceiling_height,rooms,floors_total,living_area,floor,is_apartment,building_id,has_elevator,studio,area,kitchen_area,public,longitude,price,flats_count,building_type,balcony,locality_name,renovation,Month,Day,DayOfWeek,Season
13762887891614807236,663294,0,UNKNOWN,1971,False,55.795704,2.64,1,12,0.0,10,False,1470199257376425951,True,False,36.0,0.0,True,37.602478,40000,80,PANEL,UNKNOWN,Москва,UNKNOWN,1,25,5,1
14654451946329972059,712125,0,UNKNOWN,1986,False,55.605583,2.48,1,16,20.0,9,False,2797222858463679248,True,False,40.0,10.0,True,37.743679,25000,222,PANEL,LOGGIA,Москва,COSMETIC_DONE,11,19,1,4
17449292585625593873,0,0,UNKNOWN,2014,False,55.92556,0.0,0,16,12.0,16,False,3589071626016088122,True,True,25.0,0.0,True,37.862965,19000,179,MONOLIT,LOGGIA,Королёв,COSMETIC_DONE,1,11,5,1
15597282206699587329,0,0,UNKNOWN,2001,False,55.432522,0.0,1,10,20.0,4,False,7320605563669842994,True,False,42.0,10.0,True,37.544224,20000,0,PANEL,LOGGIA,Подольск,COSMETIC_DONE,1,27,0,1
3718201047023531068,1564812,0,UNKNOWN,2019,False,55.91753,2.8,3,16,45.799999,6,False,3672377569291994545,True,False,73.300003,10.2,False,37.411098,68000,0,MONOLIT,TWO_LOGGIA,Химки,EURO,3,4,2,2


In [22]:
cat_features = [
    'building_series_id', 'site_id', 'parking', 'expect_demolition',
    'is_apartment', 'building_id', 'has_elevator', 'studio', 'building_type',
    'balcony', 'locality_name', 'renovation'
]
ignored_features = [
]
train_train_pool = Pool(
    X_train_train,
    y_train_train,
    cat_features=cat_features
)
train_valid_pool = Pool(
    X_train_valid,
    y_train_valid,
    cat_features=cat_features
)
params = {
    'iterations': 200,
    'learning_rate': 0.02,
    'eval_metric': EvalMetric(),
    'loss_function': LossFunction(),
#     'random_seed': 113,
    # 'logging_level': 'Silent',
    'use_best_model': False,
    'ignored_features': ignored_features
}

### Training with validation

In [23]:
%%time
model = CatBoostRegressor(**params)
model.fit(train_train_pool, eval_set=train_valid_pool)

0:	learn: -43.9252061	test: -43.1081960	best: -43.1081960 (0)	total: 1.73s	remaining: 5m 43s
1:	learn: -43.0356195	test: -42.2347624	best: -42.2347624 (1)	total: 3.39s	remaining: 5m 35s
2:	learn: -42.1636534	test: -41.3786537	best: -41.3786537 (2)	total: 4.95s	remaining: 5m 25s
3:	learn: -41.3089461	test: -40.5395164	best: -40.5395164 (3)	total: 6.58s	remaining: 5m 22s
4:	learn: -40.4711668	test: -39.7169766	best: -39.7169766 (4)	total: 8.14s	remaining: 5m 17s
5:	learn: -39.6499752	test: -38.9107176	best: -38.9107176 (5)	total: 9.77s	remaining: 5m 15s
6:	learn: -38.8450393	test: -38.1204179	best: -38.1204179 (6)	total: 11.3s	remaining: 5m 12s
7:	learn: -38.0560497	test: -37.3457669	best: -37.3457669 (7)	total: 12.9s	remaining: 5m 8s
8:	learn: -37.2826815	test: -36.5864640	best: -36.5864640 (8)	total: 14.5s	remaining: 5m 7s
9:	learn: -36.5246267	test: -35.8421968	best: -35.8421968 (9)	total: 16.1s	remaining: 5m 5s
10:	learn: -35.7815848	test: -35.1126693	best: -35.1126693 (10)	total: 17

### Visualizing the results

In [29]:
def visualize_learning_results(metrics):
    """
    Plot one graph for each metric

    :param metrics: dict mapping a metric name to a dict of curves keyed by dataset type
    """

    n = len(metrics)
    
    fig = make_subplots(
        rows=n,
        cols=1,
        subplot_titles=list(metrics.keys())
    )

    for i, (metric_name, curves) in enumerate(metrics.items()):
        for dataset_type, curve in curves.items():
            m = len(curve)
            fig.add_trace(
                go.Scatter(
                    x=np.arange(m),
                    y=curve,
                    mode='lines',
                    name=f'{dataset_type} {metric_name}'
                ),
                row=i + 1,
                col=1
            )

    fig.update_layout(
        title_text="Learning results",
        width=297. * 3,
        height=210. * 3 * n
    )
    fig.show()

In [30]:
eval_results = model.get_evals_result()

In [31]:
eval_results = model.get_evals_result()
metrics = dict(
    Loss=dict(
        validation=eval_results['validation']['EvalMetric'],
        learn=eval_results['learn']['EvalMetric']
    ),
    Metric=dict(
        validation=eval_results['validation']['EvalMetric']
    )
)

In [32]:
visualize_learning_results(metrics)

### Declaring the final parameters

In [42]:
cat_features = [
    'building_series_id', 'site_id', 'parking', 'expect_demolition',
    'is_apartment', 'building_id', 'has_elevator', 'studio', 'building_type',
    'balcony', 'locality_name', 'renovation'
]
ignored_features = [
]
train_pool = Pool(
    pd.concat([X_train_train, X_train_valid]),
    pd.concat([y_train_train, y_train_valid]),
    cat_features=cat_features
)
params = {
    'iterations': 100,
    'learning_rate': 0.04,
    'eval_metric': EvalMetric(),
    'loss_function': LossFunction(),
#     'random_seed': 113,
    # 'logging_level': 'Silent',
    'use_best_model': False,
    'ignored_features': ignored_features
}
n_models = 10
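
With `random_seed` commented out, every model in the ensemble is built from identical parameters, so the runs may well produce identical models, in which case averaging them adds nothing. If the intent of `n_models` is to average over differently seeded runs (an assumption; the notebook does not say), one option is to derive a separate parameter dict per model. The sketch uses plain dicts with placeholder values; `make_seeded_params` is a hypothetical helper, and `random_seed` is the parameter name that appears commented out above.

```python
def make_seeded_params(base_params, n_models, first_seed=0):
    """Return n_models copies of base_params, each with its own random_seed."""
    return [
        dict(base_params, random_seed=first_seed + i)
        for i in range(n_models)
    ]


# Placeholder base parameters; the custom loss and metric are omitted here.
base_params = {'iterations': 100, 'learning_rate': 0.04}
params_per_model = make_seeded_params(base_params, n_models=10)
```

Each dict could then be passed to its own `CatBoostRegressor(**model_params)` inside the training loop.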

### Training

In [43]:
%%time
models = []
for i in range(n_models):
    model = CatBoostRegressor(**params)
    model.fit(train_pool)
    models.append(model)

0:	learn: -42.8754417	total: 1.75s	remaining: 2m 52s
1:	learn: -41.1550451	total: 3.41s	remaining: 2m 47s
2:	learn: -39.5021085	total: 5.05s	remaining: 2m 43s
3:	learn: -37.9139968	total: 6.87s	remaining: 2m 44s
4:	learn: -36.3881553	total: 8.41s	remaining: 2m 39s
5:	learn: -34.9221332	total: 10s	remaining: 2m 37s
6:	learn: -33.5135981	total: 11.8s	remaining: 2m 37s
7:	learn: -32.1602908	total: 13.5s	remaining: 2m 35s
8:	learn: -30.8600506	total: 15.3s	remaining: 2m 34s
9:	learn: -29.6107845	total: 16.9s	remaining: 2m 31s
10:	learn: -28.4105080	total: 18.4s	remaining: 2m 29s
11:	learn: -27.2573042	total: 20s	remaining: 2m 26s
12:	learn: -26.1493069	total: 21.8s	remaining: 2m 25s
13:	learn: -25.0847571	total: 23.5s	remaining: 2m 24s
14:	learn: -24.0619535	total: 25.1s	remaining: 2m 22s
15:	learn: -23.0792600	total: 26.6s	remaining: 2m 19s
16:	learn: -22.1350861	total: 28.4s	remaining: 2m 18s
17:	learn: -21.2279325	total: 30.1s	remaining: 2m 16s
18:	learn: -20.3563492	total: 31.6s	remain

### Getting predictions

In [68]:
y_test_predicted = None
X_test_test = X_test.drop('public', axis=1)
for model in models:
    predicted = model.predict(X_test_test)
    y_test_cur_predicted = pd.Series(
        predicted,
        index=X_test_test.index,
        name='target'
    ).sort_index()
    
    if y_test_predicted is None:
        y_test_predicted = y_test_cur_predicted
    else:
        y_test_predicted += y_test_cur_predicted

y_test_predicted /= n_models
y_test_predicted = y_test_predicted.round().astype('int64')
SUBMISSTION_FILE_NAME = 'exposition_sample_submission.tsv'
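
The loop above averages the per-model predictions and rounds the mean to the nearest integer class. Because the model is a regressor, a rounded mean can in principle fall outside the valid class range 1..5. Clipping is not part of the notebook and is shown here only as a possible safeguard. A pure-Python sketch with made-up numbers (`average_and_round` is a hypothetical helper):

```python
def average_and_round(predictions_per_model, low=1, high=5):
    """Average predictions across models, round, and clip to [low, high]."""
    n_models = len(predictions_per_model)
    averaged = [sum(column) / n_models for column in zip(*predictions_per_model)]
    return [min(high, max(low, round(value))) for value in averaged]


# Made-up predictions of three models for four listings.
predictions_per_model = [
    [0.2, 2.4, 3.9, 5.8],
    [0.4, 2.2, 4.2, 5.6],
    [0.3, 2.6, 4.1, 6.0],
]
classes = average_and_round(predictions_per_model)
```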

# Submitting the results

## Determining the path

In [74]:
(X_test.public).sum() / len(X_test.public)

0.7594535763123378

In [69]:
submission_folder_path = os.path.join(PROJECT_DIR, 'submissions')
file_path = os.path.join(submission_folder_path, SUBMISSTION_FILE_NAME)
print(f"File will be saved to {file_path}")

File will be saved to /content/drive/My Drive/projects/hack_the_realty_exposition/submissions/exposition_sample_submission.tsv


## Saving

In [70]:
y_test_predicted.to_csv(file_path, sep='\t')

# Unit testing


## Importing libraries

In [None]:
import unittest

## Declaring the test class

In [None]:
class TestNotebook(unittest.TestCase):
    def test_add(self):
        self.assertEqual(2 + 2, 4)
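
The placeholder test above can be complemented by checks on the notebook's own logic. As one possible sketch, the test case below verifies basic properties of the competition metric on a standalone reimplementation. The names `metric_value` and `TestMetric` are hypothetical, and the function mirrors the unweighted per-object expression used in `EvalMetric.evaluate`.

```python
import math
import unittest


def metric_value(approx, target):
    """Unweighted mean of -(exp(|approx - target|) - 1)."""
    errors = [-(math.exp(abs(a - t)) - 1) for a, t in zip(approx, target)]
    return sum(errors) / len(errors)


class TestMetric(unittest.TestCase):
    def test_perfect_prediction_scores_zero(self):
        self.assertEqual(metric_value([1, 3, 5], [1, 3, 5]), 0)

    def test_error_is_symmetric(self):
        self.assertAlmostEqual(metric_value([2], [4]), metric_value([4], [2]))

    def test_larger_miss_scores_lower(self):
        self.assertLess(metric_value([1], [4]), metric_value([3], [4]))
```

The same `unittest.main(argv=[''], verbosity=2, exit=False)` call used for the placeholder test would pick this class up as well.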

## Running the tests

In [None]:
unittest.main(argv=[''], verbosity=2, exit=False)

test_add (__main__.TestNotebook) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.002s

OK


<unittest.main.TestProgram at 0x7f929c006dd8>