# Tutorial 1: Basics

<img src="../../imgs/LightAutoML_logo_big.png" alt="LightAutoML logo" style="width:100%;"/>

The official LightAutoML GitHub repository is [here](https://github.com/AILab-MLTools/LightAutoML).


In this tutorial you will learn how to:
* run LightAutoML training on tabular data
* obtain feature importances and reports
* configure resource usage in LightAutoML

## 0. Prerequisites

### 0.0. Install LightAutoML

In [1]:
#!pip install -U lightautoml

### 0.1. Import libraries

Here we import the libraries used in this notebook:
- Standard Python libraries for working with the OS, JSON, HTTP requests, etc.
- Essential Python data science libraries such as NumPy, pandas and scikit-learn
- LightAutoML modules: AutoML presets, the task module and the report generation module

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
# Standard python libraries
import os
import json
import requests

# Essential DS libraries
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, recall_score, precision_score, classification_report, \
    mean_absolute_error
from sklearn.model_selection import train_test_split
from copy import deepcopy

# LightAutoML presets, task and report generation
from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.tasks import Task
from lightautoml.report.report_deco import ReportDeco
from tqdm.notebook import tqdm
from tqdm import notebook
from natsort import natsort_keygen

### 0.2. Constants

Here we set up some parameters used in the notebook:
- `N_THREADS` - number of vCPUs for LightAutoML model creation
- `N_FOLDS` - number of folds in LightAutoML inner CV
- `RANDOM_STATE` - random seed for better reproducibility
- `TEST_SIZE` - holdout data part size
- `TIMEOUT` - time limit in seconds for model training
- `TARGET_NAME` - target column name in the dataset

In [3]:
N_THREADS = 4
N_FOLDS = 5
RANDOM_STATE = 42
TEST_SIZE = 0.1
TIMEOUT = 3000
TARGET_NAME = 'target'

### 0.3. Imported models setup

For better reproducibility, fix the NumPy random seed:

In [4]:
np.random.seed(RANDOM_STATE)

### 0.4. Data loading
Let's check the data we have:

In [5]:
data = pd.read_csv('dataset_hackaton_ksg__v2__23062023__1710_GMT3.csv', sep=';', index_col=0,
                  parse_dates = ["ДатаНачалаЗадачи", "ДатаОкончанияЗадачи", "ДатаначалаБП0", "ДатаокончанияБП0", "date_report"])
tasks = pd.read_excel('Перечень задач 1.xlsx')


In [6]:
data = data[(~((data['ПроцентЗавершенияЗадачи']==0) & ((data['ДатаНачалаЗадачи'] - data['date_report']).dt.days > 8)))] \
    .sort_values(['obj_key', 'Кодзадачи', 'НазваниеЗадачи', 'date_report', 'ДатаОкончанияЗадачи', 'ПроцентЗавершенияЗадачи']) \
    .drop_duplicates(['obj_key', 'Кодзадачи', 'НазваниеЗадачи', 'date_report', 'ДатаОкончанияЗадачи'], keep='last') \
    .dropna(subset=['ДатаОкончанияЗадачи'])
data["target"] = (data.groupby(["obj_key", "Кодзадачи", "НазваниеЗадачи"])["ДатаОкончанияЗадачи"].transform(pd.Series.diff).dt.days).shift(-1)
data['binary_target'] = (data['target'] > 0).astype(int)
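To make the target construction concrete, here is a minimal sketch on a toy frame. The column `end_date` and all values are made up for illustration, and `groupby(...).diff()` is used as an equivalent of `transform(pd.Series.diff)`. Within each task, `diff` gives the change of the planned end date relative to the previous report, and `shift(-1)` moves that change one row up, so each report is labelled with the shift observed at the next report. Because the first `diff` of every group is `NaN`, the last report of each group ends up without a target, provided the frame is sorted by group and report date.

```python
import pandas as pd

# Toy data: two tasks (A, B) observed in consecutive weekly reports.
toy = pd.DataFrame({
    "obj_key": ["A", "A", "A", "B", "B"],
    "date_report": pd.to_datetime(
        ["2023-01-01", "2023-01-08", "2023-01-15", "2023-01-01", "2023-01-08"]),
    "end_date": pd.to_datetime(
        ["2023-03-01", "2023-03-01", "2023-03-11", "2023-04-01", "2023-03-29"]),
}).sort_values(["obj_key", "date_report"])

# Change of the end date (in days) relative to the previous report of the same task.
shift_days = toy.groupby("obj_key")["end_date"].diff().dt.days

# Label each report with the change seen at the NEXT report.
toy["target"] = shift_days.shift(-1)
# Rows with a missing target get 0 here; the notebook drops such rows later.
toy["binary_target"] = (toy["target"] > 0).astype(int)

print(toy[["obj_key", "date_report", "target", "binary_target"]])
```

In this toy frame only the second report of task A is a positive example: its end date moves by 10 days at the following report.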

In [7]:
train = data[data.date_report < np.datetime64('2023-05-19')].dropna(subset=['target'])
test = data[data.date_report == np.datetime64('2023-04-24')].dropna(subset=['target'])
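A common way to build a date-based hold-out is to keep it disjoint from the training window, so that the score is not computed on rows the model has already seen. The sketch below uses a hypothetical helper `split_by_date` (not part of the original notebook) and an illustrative cutoff; the right cutoff for this dataset is an assumption left to the reader.

```python
import numpy as np
import pandas as pd

def split_by_date(df, cutoff, date_col="date_report"):
    """Return (train, holdout): rows strictly before `cutoff` and rows on/after it."""
    cutoff = np.datetime64(cutoff)
    train_part = df[df[date_col] < cutoff]
    holdout_part = df[df[date_col] >= cutoff]
    return train_part, holdout_part

# Toy frame with four report dates.
toy = pd.DataFrame({
    "date_report": pd.to_datetime(
        ["2023-04-24", "2023-05-08", "2023-05-19", "2023-05-29"]),
    "target": [0.0, 3.0, 1.0, 7.0],
})

train_part, holdout_part = split_by_date(toy, "2023-05-19")
print(len(train_part), len(holdout_part))
```

With this construction no report date can appear in both parts.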

In [8]:
attr = pd.read_csv('data_mgz_attributes__24062023__1000_GMT3.csv',sep=';', parse_dates=['date_report'], index_col=0) \
    .rename(columns={'Код ДС': 'obj_key'})

In [9]:
check = pd.read_excel('Тест.xlsx').dropna(subset=['Кодзадачи'])
check['date_report'] = np.datetime64('2023-06-19')
check

Unnamed: 0,№ п/п,obj_prg,obj_subprg,obj_key,obj_shortName,Кодзадачи,НазваниеЗадачи,ПроцентЗавершенияЗадачи,ДатаНачалаЗадачи,ДатаОкончанияЗадачи,ДатаначалаБП0,ДатаокончанияБП0,Статуспоэкспертизе,Экспертиза,date_report
0,36,Образование,Дошкольные учреждения,020-0684,"ДОУ на 125, ТПУ ""Мневники""",1,Предпроектные работы,0.0,2020-11-03,2022-02-01,2020-11-03,2021-12-29,,,2023-06-19
1,35,Образование,Дошкольные учреждения,019-0589,"ДОУ на 225, ТПУ ""Мневники""",1,Предпроектные работы,0.0,2020-11-03,2022-05-16,2020-11-03,2021-12-29,,,2023-06-19
2,61,Образование,Общеобразовательные учреждения,019-0594,"Школа на 800, ТПУ ""Мневники""",1,Предпроектные работы,0.0,2021-05-04,2021-12-15,2021-05-04,2021-12-15,,,2023-06-19
3,89,Культура,Культурные центры,021-0458,"КСЦ ""Печатники"", Полбина",1,Предпроектные работы,100.0,2021-10-12,2023-05-29,2021-10-12,2023-05-12,,,2023-06-19
4,34,Образование,Дошкольные учреждения,017-0520,"ДОУ на 350, ул. 6-я Радиальная и ул. Дуговая",1,Предпроектные работы,100.0,2018-11-01,2022-02-15,2018-11-01,2022-02-15,,,2023-06-19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64646,80,Образование,Общеобразовательные учреждения,022-0522,"Школа на 600 мест, ул. Новохохловская",9.6.2,Акт приемки законченного строительством объект...,0.0,2025-08-14,2025-08-14,2025-08-11,2025-08-11,,,2023-06-19
64647,84,Образование,Общеобразовательные учреждения,022-0527,"ЗОО на 375 мест, р-н Северное Измайлово, мкр. ...",9.6.2,Акт приемки законченного строительством объект...,0.0,2024-07-12,2024-07-12,2024-08-07,2024-08-07,,,2023-06-19
64648,83,Образование,Общеобразовательные учреждения,022-0526,"ЗОО на 700 мест, ул. 11-я Парковая",9.6.2,КC-11 оформлена,0.0,2025-08-11,2025-08-11,2025-08-11,2025-08-11,,,2023-06-19
64649,79,Образование,Общеобразовательные учреждения,022-0354,"Школа 1175, ул. 6-я Радиальная и Дуговая",9.6.2,КC-11 оформлена,0.0,2025-07-21,2025-07-21,2025-07-21,2025-07-21,,,2023-06-19


In [11]:
def add_features(df, attr, historical_data=None):
    data = deepcopy(df)
    for col in ["ДатаНачалаЗадачи", "ДатаОкончанияЗадачи", "ДатаначалаБП0", "ДатаокончанияБП0", "date_report"]:
        data[col] = pd.to_datetime(data[col])
    data = pd.merge(data, attr, on=['obj_key', 'date_report'], how='left')
    data = data.dropna(subset=['Кодзадачи'])
    data["Статуспоэкспертизе"] = data["Статуспоэкспертизе"].fillna(0)
    data["состояние площадки"] = data["состояние площадки"].fillna("Свободна, передана")
    data["НазваниеЗадачи"] = data["НазваниеЗадачи"].fillna("unknow")
    data["Экспертиза"] = data["Экспертиза"].fillna("unknow")
    data["diff_start"] = (((data["ДатаначалаБП0"] - data["ДатаНачалаЗадачи"]).dt.days).fillna(0)).astype(np.int32)
    data["diff_end"] = (((data["ДатаокончанияБП0"] - data["ДатаОкончанияЗадачи"]).dt.days).fillna(0)).astype(np.int32)
    data["early_start"] = (data["diff_start"] > 0).astype(np.uint8)
    data["start_on_time"] = (data["diff_start"] == 0).astype(np.uint8)
    data["late_end"] = (data["diff_end"] < 0).astype(np.uint8)
    data["end_on_time"] = (data["diff_end"] == 0).astype(np.uint8)
    # The "is_known_*" flags are 1 when the attribute is present (not NaN)
    data["is_known_Генпроектировщик"] = data["Генпроектировщик"].notna().astype(np.uint8)
    data["is_known_Генподрядчик"] = data["Генподрядчик"].notna().astype(np.uint8)
    data["is_known_Площадь"] = data["Площадь"].notna().astype(np.uint8)
    data["encoded_amount_Площадь"] = np.digitize(
        data["Площадь"],
        bins=[
            data["Площадь"].quantile(q)
            for q in np.arange(0.0, 1.0, 0.1)
        ]
    )
    data["is_known_Кол-во рабочих"] = data["Кол-во рабочих"].notna().astype(np.uint8)
    data["encoded_amount_Кол-во рабочих"] = np.digitize(
        data["Кол-во рабочих"],
        bins=[
            data["Кол-во рабочих"].quantile(q)
            for q in np.arange(0.0, 1.0, 0.1)
        ]
    )
    data["specific_area1"] = data["Площадь"] / data["Кол-во рабочих"]
    data.loc[data['specific_area1'] == np.inf, 'specific_area1'] = 0.0
    data["specific_area2"] = data["Площадь"] / data["Генподрядчик"]
    data.loc[data['specific_area2'] == np.inf, 'specific_area2'] = 0.0

    data["speed1"] = 100 / (data["ДатаокончанияБП0"] - data["ДатаначалаБП0"]).dt.days.fillna(0)
    data["speed2"] = 100 / (data["ДатаОкончанияЗадачи"] - data["ДатаНачалаЗадачи"]).dt.days.fillna(0)
    data["spped3"] = data["ПроцентЗавершенияЗадачи"] / (data["date_report"] - data["ДатаНачалаЗадачи"]).dt.days.fillna(0)
    data["reserve"] = (data["ДатаОкончанияЗадачи"] - data["date_report"]).dt.days
    data["no_reserve"] = (data["reserve"] < 0).astype(np.int8)
    
    data["time_task_plan"] = (((data["ДатаокончанияБП0"] - data["ДатаначалаБП0"]).dt.days).fillna(0)).astype('int')
    data["time_task_fact"] = (((data["ДатаОкончанияЗадачи"] - data["ДатаНачалаЗадачи"]).dt.days).fillna(0)).astype('int')
    data["diff_time_plan_fact"] = data["time_task_fact"] - data["time_task_plan"]

    # Compute the mean planned duration per stage (task code and name) and compare each value with that mean
    avg_time_task_plan = data.groupby(["Кодзадачи", "НазваниеЗадачи"])["time_task_plan"].mean().reset_index().rename(columns={"time_task_plan": "avg_time_task_plan"})
    data = data.merge(avg_time_task_plan[["Кодзадачи", "НазваниеЗадачи", "avg_time_task_plan"]], how='left', on=["Кодзадачи", "НазваниеЗадачи"])
    data["diff_avg_time_plan"] = data["time_task_plan"] - data["avg_time_task_plan"]

    # Compute the mean actual duration per stage (task code and name) and compare each value with that mean
    avg_time_task_fact = data.groupby(["Кодзадачи", "НазваниеЗадачи"])["time_task_fact"].mean().reset_index().rename(columns={"time_task_fact": "avg_time_task_fact"})
    data = data.merge(avg_time_task_fact, how='left', on=["Кодзадачи", "НазваниеЗадачи"])
    data["diff_avg_time_fact"] = data["time_task_fact"] - data["avg_time_task_fact"]

    # Compute the mean actual-minus-planned duration per stage (task code and name) and compare each value with that mean
    avg_time_task_fact = data.groupby(["Кодзадачи", "НазваниеЗадачи"])["diff_time_plan_fact"].mean().reset_index().rename(columns={"diff_time_plan_fact": "avg_diff_time_plan_fact"})
    data = data.merge(avg_time_task_fact, how='left', on=["Кодзадачи", "НазваниеЗадачи"])
    data["diff_avg_time_plan_fact"] = data["diff_time_plan_fact"] - data["avg_diff_time_plan_fact"]
    
    # Normalize the task duration by the object's area (how much time is allotted for that volume of work);
    # note that the code below uses the actual duration, time_task_fact.
    # The larger the project, the more time it needs. If too little time was allotted initially, a delay is likely.
    data["time_by_square"] = (data["time_task_fact"]+1)/(data["Площадь"].fillna(0)+1)
    data["time_by_worker"] = (data["time_task_fact"]+1)/(data["Кол-во рабочих"].fillna(0)+1)

    # Drop the temporary columns
    data = data.drop(columns=['time_task_plan', 'time_task_fact' ,'avg_time_task_fact', 'avg_time_task_plan', 'diff_time_plan_fact', 'avg_diff_time_plan_fact'])
    if historical_data is None:
        data["просрок"] = (data.groupby(["obj_key", "Кодзадачи", "НазваниеЗадачи"])["ДатаОкончанияЗадачи"].transform(pd.Series.diff).dt.days)
        data['cumsum_просрок'] = data.groupby(["obj_key", "Кодзадачи", "НазваниеЗадачи"])["просрок"].transform(pd.Series.cumsum).fillna(0)
    else:
        full = pd.concat([data, historical_data]).sort_values(["obj_key", "Кодзадачи", "НазваниеЗадачи",'date_report'])
        full["просрок"] = (full.groupby(["obj_key", "Кодзадачи", "НазваниеЗадачи"])["ДатаОкончанияЗадачи"].transform(pd.Series.diff).dt.days).fillna(0)
        full['cumsum_просрок'] = full.groupby(["obj_key", "Кодзадачи", "НазваниеЗадачи"])["просрок"].transform(pd.Series.cumsum).fillna(0)
        data = full[full.date_report.isin(data.date_report.unique())]
#         data = data.merge(full, on=["obj_key", "Кодзадачи", "НазваниеЗадачи"], how="left")
    return data
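The `encoded_amount_*` features in `add_features` bin a numeric column by its own deciles with `np.digitize`. A minimal sketch of that step on made-up values (the series `area` is illustrative and has no missing entries; how `NaN` should be binned is a separate decision that this sketch does not address):

```python
import numpy as np
import pandas as pd

# Ten illustrative area values.
area = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

# Bin edges: the 0th, 10th, ..., 90th percentiles of the observed values.
edges = [area.quantile(q) for q in np.arange(0.0, 1.0, 0.1)]

# For each value, np.digitize returns the index i with edges[i-1] <= value < edges[i].
codes = np.digitize(area, bins=edges)
print(codes)
```

Since the smallest edge is the minimum of the column, every observed value gets a code between 1 and 10.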


In [12]:
train_data = add_features(train, attr)
test_data = add_features(test, attr, historical_data=train.drop(columns=['target', 'binary_target']))
check_data = add_features(check, attr, historical_data=train.drop(columns=['target', 'binary_target']))
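The `historical_data` argument matters for the cumulative delay features: a new report only gets a meaningful cumulative value if the earlier reports of the same task are present when the differences are computed. A simplified sketch of that idea, with made-up column names (`end_date`, `delay`, `cum_delay`) and a single toy task:

```python
import pandas as pd

# Earlier reports of task A (the "history").
history = pd.DataFrame({
    "obj_key": ["A", "A"],
    "date_report": pd.to_datetime(["2023-01-01", "2023-01-08"]),
    "end_date": pd.to_datetime(["2023-03-01", "2023-03-06"]),
})
# The new report we want features for.
new_report = pd.DataFrame({
    "obj_key": ["A"],
    "date_report": pd.to_datetime(["2023-01-15"]),
    "end_date": pd.to_datetime(["2023-03-10"]),
})

# Put history and the new report on one timeline per task.
full = (pd.concat([new_report, history], ignore_index=True)
          .sort_values(["obj_key", "date_report"])
          .reset_index(drop=True))

# Per-report shift of the end date, then its running sum within the task.
full["delay"] = full.groupby("obj_key")["end_date"].diff().dt.days.fillna(0)
full["cum_delay"] = full.groupby("obj_key")["delay"].cumsum()

# Keep only the rows that belong to the new report date.
result = full[full["date_report"].isin(new_report["date_report"].unique())]
print(result[["obj_key", "date_report", "delay", "cum_delay"]])
```

Without the history, the single new row would have no previous report to compare with, and both values would be 0.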

In [13]:
set(train_data.columns) - set(check_data.columns)

{'binary_target', 'target'}

In [14]:
train_data.columns

Index(['№ п/п', 'obj_prg', 'obj_subprg', 'obj_key', 'Кодзадачи',
       'НазваниеЗадачи', 'ПроцентЗавершенияЗадачи', 'ДатаНачалаЗадачи',
       'ДатаОкончанияЗадачи', 'ДатаначалаБП0', 'ДатаокончанияБП0',
       'Статуспоэкспертизе', 'Экспертиза', 'date_report', 'target',
       'binary_target', 'состояние площадки', 'Площадь', 'Генпроектировщик',
       'Генподрядчик', 'Кол-во рабочих', 'diff_start', 'diff_end',
       'early_start', 'start_on_time', 'late_end', 'end_on_time',
       'is_known_Генпроектировщик', 'is_known_Генподрядчик',
       'is_known_Площадь', 'encoded_amount_Площадь', 'is_known_Кол-во рабочих',
       'encoded_amount_Кол-во рабочих', 'specific_area1', 'specific_area2',
       'speed1', 'speed2', 'spped3', 'reserve', 'no_reserve',
       'diff_avg_time_plan', 'diff_avg_time_fact', 'diff_avg_time_plan_fact',
       'time_by_square', 'time_by_worker', 'просрок', 'cumsum_просрок'],
      dtype='object')

1) difference between `ДатаначалаБП0` and `ДатаНачалаЗадачи`
2) difference between `ДатаокончанияБП0` and `ДатаОкончанияЗадачи`
3) whether the work started ahead of schedule
4) whether the work started exactly on schedule
5) whether the work finished later than the set deadline
6) whether the work finished exactly on schedule
7) whether the general designer is known
8) whether there is exactly one general designer
9) whether the general contractor is known
10) whether the general contractor is missing
11) whether there is more than one general contractor
12) number of general designers, encoded by decile
13) number of general contractors, encoded by decile
14) whether the area is known
15) area, encoded by decile
16) whether the number of workers is known
17) number of workers, encoded by decile
18) dummy features for the site condition (+5)
23) specific area 1: area covered per worker
24) specific area 2: area covered per general contractor
25) percentage of task completion per day (from the current report to the next one, i.e. the task completion speed)
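The ratio features at the end of the list (items 23 to 25) divide by quantities that can be zero or missing, which produces `inf` or `NaN`. `add_features` resets infinite values to 0 for the specific-area features; the hypothetical helper `safe_ratio` below sketches the same idea in one place. Using 0 as the fill value is an assumption, not something the notebook prescribes for every ratio.

```python
import numpy as np
import pandas as pd

def safe_ratio(num, den, fill=0.0):
    """Element-wise num / den with inf and NaN results replaced by `fill`."""
    ratio = num / den
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(fill)

# Toy example: completion percentage divided by days elapsed.
percent_done = pd.Series([100.0, 50.0, 10.0])
days_elapsed = pd.Series([0.0, 5.0, np.nan])

speed = safe_ratio(percent_done, days_elapsed)
print(speed.tolist())
```

Here the zero-day and missing-day rows both end up with a speed of 0 instead of `inf` or `NaN`.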

## 1. Task definition

### 1.1. Task type

First we need to create a ```Task``` object, the class that defines which task the LightAutoML model should solve, with a specific loss and metric if necessary (more info can be found [here](https://lightautoml.readthedocs.io/en/latest/pages/modules/generated/lightautoml.tasks.base.Task.html#lightautoml.tasks.base.Task) in our documentation).

The following task types are available:

- ```'binary'``` - for binary classification.

- ```'reg'``` - for regression.

- ```'multiclass'``` - for multiclass classification.

- ```'multi:reg'``` - for multiple regression.

- ```'multilabel'``` - for multi-label classification.

In this example we will consider a binary classification:

In [13]:
task = Task('binary')

Note: only the logloss loss is available for the binary task, and it is the default loss. The default metric for binary classification is ROC-AUC. See more info about available and default losses and metrics [here](https://lightautoml.readthedocs.io/en/latest/pages/modules/generated/lightautoml.tasks.base.Task.html#lightautoml.tasks.base.Task).

**Depending on the task, you can and should choose exactly those metrics and losses that you want and need to optimize.**

### 1.2. Feature roles setup

To solve the task, we need to set up column roles. LightAutoML can automatically infer the types and roles of data columns, but you can also specify them directly through the dictionary parameter ```roles``` when training the AutoML model (see section 2, "Classification"). Any role can be specified by a string with its name: the key in the dictionary is the name of the role, and the value is a column name or a list of the names of the corresponding columns in the dataset. The **only role you must set up is the** ```'target'``` **role** (the column with the target variable), everything else (```'drop', 'numeric', 'category', 'group', 'weights'``` etc.) is up to the user:

In [14]:
roles = {
    'target': 'binary_target',
    'drop': [ 'target'],
    'datetime': ["ДатаНачалаЗадачи",'ДатаОкончанияЗадачи', "ДатаначалаБП0", "ДатаокончанияБП0", "date_report"],
}

You can also optionally specify the following roles:

- ```'numeric'``` - numerical feature

- ```'category'``` - categorical feature

- ```'text'``` - text data

- ```'datetime'``` - features with date and time 

- ```'date'``` - features with date only

- ```'group'``` - features by which the data can be divided into groups and which can be taken into account for group k-fold validation (so the same group is not represented in both testing and training sets)

- ```'drop'``` - features to drop, they will not be used in model building

- ```'weights'``` - object weights for the loss and metric

- ```'path'``` - image file paths (for CV tasks)

- ```'treatment'``` - object group in uplift modelling tasks: treatment or control

Note: role name can be written in any case. Also it is possible to pass individual objects of role classes with specific arguments instead of strings with role names for specific tasks and more optimal pipeline construction ([more details](https://github.com/sb-ai-lab/LightAutoML/blob/master/lightautoml/dataset/roles.py)).

For example, to set the date role, you can use the ```DatetimeRole``` class. 
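Since a `roles` dictionary refers to columns by name, a typo in a column name is easy to miss. The sketch below is a hypothetical pre-flight check (the helper `missing_role_columns` is not part of LightAutoML) that lists the role columns absent from a dataframe before it is passed to training; the toy frame and the `start_date` name are made up.

```python
import pandas as pd

def missing_role_columns(df, roles):
    """Return the column names mentioned in `roles` that are absent from `df`."""
    missing = []
    for role, cols in roles.items():
        # A role may be given as a single column name or as a list of names.
        names = [cols] if isinstance(cols, str) else list(cols)
        missing.extend(name for name in names if name not in df.columns)
    return missing

toy = pd.DataFrame({
    "binary_target": [0, 1],
    "target": [0.0, 3.0],
    "date_report": pd.to_datetime(["2023-01-01", "2023-01-08"]),
})
toy_roles = {
    "target": "binary_target",
    "drop": ["target"],
    "datetime": ["date_report", "start_date"],  # 'start_date' does not exist in toy
}

print(missing_role_columns(toy, toy_roles))  # ['start_date']
```

An empty list means every column named in the roles is present.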

### 1.3. LightAutoML model creation - TabularAutoML preset

In [15]:
automl = TabularAutoML(
    task = task, 
    timeout = TIMEOUT,
    cpu_limit = N_THREADS+2,
    general_params = {"use_algos": [['lgb']]},
    reader_params = {'n_jobs': N_THREADS+2, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
)

## 2. Classification

To run AutoML training, use the ```fit_predict``` method.

Main arguments:

- `train_data` - dataset to train.
- `roles` - column roles dict.
- `verbose` - controls the verbosity: the higher the value, the more messages:
    - `<1`: messages are not displayed;
    - `>=1`: the computation process for layers is displayed;
    - `>=2`: the information about folds processing is also displayed;
    - `>=3`: the hyperparameters optimization process is also displayed;
    - `>=4`: the training process for every algorithm is displayed.

Note: out-of-fold predictions are calculated during training and returned by the `fit_predict` method.
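For readers unfamiliar with the term, an out-of-fold prediction for a row comes from a model that was fitted on the folds not containing that row. The sketch below illustrates the concept with scikit-learn on synthetic data; it is a generic illustration, not LightAutoML's internal implementation, and the model and dataset are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic binary classification data.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each row is predicted by the model fitted on the other four folds.
oof = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=cv, method="predict_proba"
)[:, 1]

print(oof.shape, roc_auc_score(y, oof))
```

Because no row is scored by a model that saw it during fitting, a metric computed on such predictions is an estimate of generalization quality rather than of training fit.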

In [16]:
%%time
out_of_fold_predictions = automl.fit_predict(train_data, roles = roles, verbose = 4)

[08:40:33] Stdout logging level is DEBUG.
[08:40:33] Copying TaskTimer may affect the parent PipelineTimer, so copy will create new unlimited TaskTimer
[08:40:33] Task: binary

[08:40:33] Start automl preset with listed constraints:
[08:40:33] - time: 3000.00 seconds
[08:40:33] - CPU: 6 cores
[08:40:33] - memory: 16 GB

[08:40:33] Train data shape: (479083, 47)

[08:40:42] Feats was rejected during automatic roles guess: []
[08:40:42] Layer 1 train process start. Time left 2991.72 secs
[08:40:44] Training until validation scores don't improve for 100 rounds
[08:40:47] [100]	valid's auc: 0.994779
[08:40:50] [200]	valid's auc: 0.996364
[08:40:54] [300]	valid's auc: 0.996902
[08:40:58] [400]	valid's auc: 0.997119
[08:41:01] [500]	valid's auc: 0.997253
[08:41:05] [600]	valid's auc: 0.997317
[08:41:08] [700]	valid's auc: 0.997367
[08:41:12] [800]	valid's auc: 0.997412
[08:41:15] [900]	valid's auc: 0.997432
[08:41:19] [1000]	valid's auc: 0.99746
[08:41:23] [1100]	valid's auc:

In [19]:
print(roc_auc_score(train_data['binary_target'], out_of_fold_predictions.data[:,0]))
print(classification_report(train_data['binary_target'], out_of_fold_predictions.data[:,0]>0.01))

0.99530511297723
              precision    recall  f1-score   support

           0       1.00      0.94      0.97    329776
           1       0.60      0.99      0.74     30372

    accuracy                           0.94    360148
   macro avg       0.80      0.96      0.86    360148
weighted avg       0.96      0.94      0.95    360148
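The report above is computed at a very low decision threshold (`> 0.01`), which favours recall of the positive class (0.99) over its precision (0.60). The following sketch, on made-up labels and scores, shows how the threshold moves this trade-off; `precision_recall_at` is a hypothetical helper written for illustration.

```python
import numpy as np

def precision_recall_at(y_true, proba, threshold):
    """Precision and recall of the positive class for a given threshold."""
    pred = proba > threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
proba = np.array([0.001, 0.02, 0.005, 0.3, 0.9, 0.04, 0.6, 0.008])

for thr in (0.01, 0.5):
    print(thr, precision_recall_at(y_true, proba, thr))
```

On this toy data, raising the threshold from 0.01 to 0.5 increases precision from 0.4 to 0.5 and halves recall from 2/3 to 1/3.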



In [22]:
test_pred = automl.predict(test_data).data[:,0]
print(roc_auc_score(test_data['binary_target'], test_pred))
print(classification_report(test_data['binary_target'], test_pred>0.01))

0.971191049328297
              precision    recall  f1-score   support

         0.0       0.99      0.92      0.95     26802
         1.0       0.56      0.91      0.69      3051

    accuracy                           0.92     29853
   macro avg       0.78      0.91      0.82     29853
weighted avg       0.95      0.92      0.93     29853



In [17]:
import pickle
with open('classifier.pkl', 'wb') as handle:
    pickle.dump(automl, handle)

In [19]:
with open('classifier.pkl', 'rb') as handle:
    model = pickle.load(handle)

In [20]:
test_pred = model.predict(test_data).data[:,0]
check_pred = model.predict(check_data).data[:,0]

After training, the logs show the full progress, the final scores, the weights assigned to the models in the final prediction, etc.

Note that if the timeout is too small, `fit_predict` may return a model with fewer fold models than `N_FOLDS`. The final log line then reports something like `0.25685 * (3 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM)`, i.e. only 3 out of 5 LightGBM models. To fix this, set a bigger timeout so that LightAutoML trains all the models.

## 3. Regression

In [21]:
task = Task('reg', loss='mae', metric='mae')
roles = {
    'target': 'target',
    'drop': ['binary_target'],
    'datetime': ["ДатаНачалаЗадачи",'ДатаОкончанияЗадачи', "ДатаначалаБП0", "ДатаокончанияБП0", "date_report"],
}

In [22]:
automl = TabularAutoML(
    task = task, 
    timeout = TIMEOUT * 10,
    cpu_limit = N_THREADS+2,
    general_params = {"use_algos": [['lgb','cb']]},
    reader_params = {'n_jobs': N_THREADS+2, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
)

In [23]:
train_data['pred_prev'] = out_of_fold_predictions.data[:,0]
train_reg = train_data[train_data.binary_target!=0]
out_of_fold_reg = automl.fit_predict(train_reg, roles = roles, verbose = 4)

[08:50:07] Stdout logging level is DEBUG.
[08:50:07] Task: reg

[08:50:07] Start automl preset with listed constraints:
[08:50:07] - time: 30000.00 seconds
[08:50:07] - CPU: 6 cores
[08:50:07] - memory: 16 GB

[08:50:07] Train data shape: (40917, 48)

[08:50:13] Feats was rejected during automatic roles guess: []
[08:50:13] Layer 1 train process start. Time left 29994.31 secs
[08:50:13] Training until validation scores don't improve for 200 rounds
[08:50:14] [100]	valid's l1: 15.6463
[08:50:14] [200]	valid's l1: 13.1984
[08:50:14] [300]	valid's l1: 11.9365
[08:50:15] [400]	valid's l1: 11.3192
[08:50:15] [500]	valid's l1: 10.8776
[08:50:15] [600]	valid's l1: 10.4852
[08:50:16] [700]	valid's l1: 10.1474
[08:50:16] [800]	valid's l1: 9.88724
[08:50:16] [900]	valid's l1: 9.66502
[08:50:17] [1000]	valid's l1: 9.4438
[08:50:17] [1100]	valid's l1: 9.19582
[08:50:17] [1200]	valid's l1: 8.99743
[08:50:17] Did not meet early stopping. Best iteration is:
[1200]	valid's l1: 8.99743


[08:51:01] 0:	learn: 29.7774254	test: 29.8883971	best: 29.8883971 (0)	total: 4.69ms	remaining: 9.37s
[08:51:02] 100:	learn: 14.1535035	test: 14.1546205	best: 14.1546205 (100)	total: 454ms	remaining: 8.54s
[08:51:02] 200:	learn: 12.7795024	test: 12.7786842	best: 12.7786842 (200)	total: 917ms	remaining: 8.2s
[08:51:03] 300:	learn: 11.7815390	test: 11.7711714	best: 11.7711714 (300)	total: 1.4s	remaining: 7.88s
[08:51:03] 400:	learn: 11.1932192	test: 11.2003812	best: 11.2003812 (400)	total: 1.88s	remaining: 7.5s
[08:51:04] 500:	learn: 10.7835314	test: 10.8163406	best: 10.8163406 (500)	total: 2.35s	remaining: 7.03s
[08:51:04] 600:	learn: 10.4814266	test: 10.5510762	best: 10.5510762 (600)	total: 2.83s	remaining: 6.6s
[08:51:05] 700:	learn: 10.2203392	test: 10.3153896	best: 10.3153896 (700)	total: 3.29s	remaining: 6.09s
[08:51:05] 800:	learn: 9.9265774	test: 10.0611564	best: 10.0611564 (800)	total: 3.75s	remaining: 5.62s
[08:51:06] 900:	learn: 9.7305053	test: 9.8691564	best: 9.8691564 (900)	t

[08:51:37] 1200:	learn: 8.9483374	test: 9.7551700	best: 9.7551700 (1200)	total: 5.9s	remaining: 3.93s
[08:51:38] 1300:	learn: 8.7779486	test: 9.6157912	best: 9.6157912 (1300)	total: 6.36s	remaining: 3.42s
[08:51:38] 1400:	learn: 8.6666037	test: 9.5285810	best: 9.5285810 (1400)	total: 6.81s	remaining: 2.91s
[08:51:38] 1500:	learn: 8.5551842	test: 9.4462510	best: 9.4462510 (1500)	total: 7.27s	remaining: 2.42s
[08:51:39] 1600:	learn: 8.4408910	test: 9.3603041	best: 9.3602805 (1599)	total: 7.73s	remaining: 1.93s
[08:51:39] 1700:	learn: 8.3025568	test: 9.2597999	best: 9.2597999 (1700)	total: 8.22s	remaining: 1.45s
[08:51:40] 1800:	learn: 8.1922676	test: 9.1681809	best: 9.1681809 (1800)	total: 8.69s	remaining: 961ms
[08:51:40] 1900:	learn: 8.1218560	test: 9.1212908	best: 9.1209378 (1898)	total: 9.19s	remaining: 479ms
[08:51:41] 1999:	learn: 8.0108166	test: 9.0187932	best: 9.0187932 (1999)	total: 9.67s	remaining: 0us
[08:51:41] bestTest = 9.018793151
[08:51:41] bestIteration = 1999
[08:51:41]

In [31]:
threshold = 0.08
test_data['pred_prev'] = test_pred
test_reg = test_data[test_data.binary_target!=0]
test_valid_reg = test_data[test_data['pred_prev'] > threshold]
pred_test_reg = automl.predict(test_reg).data
pred_test_valid_reg = automl.predict(test_valid_reg).data

In [32]:
print('train mae {}'.format(mean_absolute_error(train_reg.target, out_of_fold_reg.data)))
print('valid filter 0 mae {}'.format(mean_absolute_error(test_reg.target, pred_test_reg)))
print('test filter {} mae {}'.format(threshold, mean_absolute_error(test_valid_reg.target, pred_test_valid_reg)))

train mae 6.622193717051493
valid filter 0 mae 14.255385861401479
test filter 0.1 mae 16.918317801952362


In [24]:
import pickle
with open('regressor.pkl', 'wb') as handle:
    pickle.dump(automl, handle)

In [25]:
with open('regressor.pkl', 'rb') as handle:
    model = pickle.load(handle)

In [26]:
check_data['pred_prev'] = check_pred
reg_check_pred = model.predict(check_data).data[:,0]

In [27]:
check['Кол-во Дней'] = reg_check_pred.astype(int)
check.loc[check_data['pred_prev'] < 0.08, 'Кол-во Дней'] = 0
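The final answer combines the two models: the regressor's estimate is kept only where the classifier's probability of a delay reaches the threshold, and set to 0 otherwise. A minimal sketch of that rule on plain NumPy arrays (the helper `combine_predictions` and the toy values are assumptions for illustration); working with positionally aligned arrays sidesteps any mismatch between dataframe indexes.

```python
import numpy as np

def combine_predictions(delay_proba, delay_days, threshold=0.08):
    """Keep the predicted number of days only where the delay probability >= threshold."""
    days = np.asarray(delay_days).astype(int)  # truncates the fractional part
    return np.where(np.asarray(delay_proba) < threshold, 0, days)

# Toy classifier probabilities and regressor outputs for four rows.
proba = np.array([0.01, 0.50, 0.09, 0.07])
days = np.array([12.7, 3.2, 8.9, 20.1])

print(combine_predictions(proba, days))  # [0 3 8 0]
```

Both inputs must describe the same rows in the same order for the result to be meaningful.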

In [29]:
check.to_excel('Тест_with_result_full.xlsx',
            index=False, columns=['obj_key', 'Кодзадачи', 'НазваниеЗадачи', 'Кол-во Дней'])