Evraz AI Challenge
---

October 29-31, 2021

**Track 1: Blow the metal through Data Science**.

Develop a model that predicts the carbon content and the temperature of the hot metal (pig iron) during the metal blowing process.

https://hackathon.evraz.com/

**Source data**

For this task you are given data on the blowing of hot metal in the shop:

- **produv** – Main blowing parameters: instantaneous oxygen flow rate and lance position (tilt)
- **lom** – Scrap is charged into the lance together with the hot metal as part of the process. The table contains the weight and type of scrap used in each melt
- **plavki** – Basic information about the melt: melt characteristics (steel grade, casting direction) and equipment
- **sip** – Bulk additives used in the process
- **chronom** – Timing: start and end times of the various operations during the melt
- **chugun** – Chemical composition and characteristics of the hot metal
- **gas** – Information on the off-gas analysis
- **target** – Target values


- [task description](https://russianhackers.notion.site/1-Data-Science-4cc89ba42de1429bbac316f59bf07a3b)
- [data attribute reference](https://russianhackers.notion.site/a685453e4fde41a098d9ad704d906e21?v=c482eaeb8f3143d58763b4b9008f1fec)

### Light Auto ML

Light Auto ML House prices regression: https://www.kaggle.com/alexryzhkov/lightautoml-houseprices-love

Light Auto ML Titanic classification: https://towardsdatascience.com/lightautoml-preset-usage-tutorial-2cce7da6f936 and https://www.kaggle.com/alexryzhkov/lightautoml-extreme-short-titanic-solution

In [12]:
import pandas as pd
import numpy as np
from pathlib import Path  # standard library; the pathlib2 backport is not needed on Python 3
import matplotlib.pyplot as plt
from typing import List, Tuple, Optional
import re
import lightgbm as lgb
from datetime import datetime

from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.tasks import Task

## Functions

# Data loading

In [2]:
path = Path('../../../data/2021_evraz')

In [3]:
target_train = pd.read_pickle(path.joinpath('target_train_all.pkl'))
print(target_train.shape)
target_train.head(3)

(2063, 167)


Unnamed: 0,NPLV,TST,C,VES,T_x,SI,MN,S,P,CR,...,NMSYP_Dolomsyr,NMSYP_Ugol_TO,NMSYP_FLUMAG,NMSYP_FlusFOMI,NMSYP_agl_ofl_s,NMSYP_dolom_syr,NMSYP_izvotsev,NMSYP_izv_ZOI,NMSYP_izv_otsev,NMSYP_koks_25_40
0,510008,1690,0.06,263700.0,1396.0,0.44,0.22,0.023,0.097,0.03,...,0,2950,2960,980,0,0,0,14080,0,0
1,510009,1683,0.097,264500.0,1419.0,0.68,0.2,0.017,0.087,0.02,...,0,2930,0,960,0,0,1060,18830,0,0
2,510010,1662,0.091,263800.0,1384.0,0.56,0.26,0.017,0.096,0.03,...,0,2990,2960,1050,0,0,990,16080,0,0


In [4]:
test = pd.read_pickle(path.joinpath('test_all.pkl'))
print(test.shape)
test.head(3)

(780, 165)


Unnamed: 0,NPLV,VES,T_x,SI,MN,S,P,CR,NI,CU,...,NMSYP_Dolomsyr,NMSYP_Ugol_TO,NMSYP_FLUMAG,NMSYP_FlusFOMI,NMSYP_agl_ofl_s,NMSYP_dolom_syr,NMSYP_izvotsev,NMSYP_izv_ZOI,NMSYP_izv_otsev,NMSYP_koks_25_40
0,512324,240100.0,1355.0,0.46,0.33,0.027,0.079,0.01,0.01,0.02,...,0,1310,1670,0,0,0,0,13960,0,0
1,512327,266400.0,1390.0,0.3,0.33,0.032,0.099,0.01,0.0,0.0,...,0,0,0,0,0,0,0,15290,0,50
2,512328,270200.0,1373.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,20010,0,1400


In [5]:
target_train.columns

Index(['NPLV', 'TST', 'C', 'VES', 'T_x', 'SI', 'MN', 'S', 'P', 'CR',
       ...
       'NMSYP_Dolomsyr', 'NMSYP_Ugol_TO', 'NMSYP_FLUMAG', 'NMSYP_FlusFOMI',
       'NMSYP_agl_ofl_s', 'NMSYP_dolom_syr', 'NMSYP_izvotsev', 'NMSYP_izv_ZOI',
       'NMSYP_izv_otsev', 'NMSYP_koks_25_40'],
      dtype='object', length=167)
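
The shapes above differ by two columns (167 vs 165), which is consistent with the test set lacking the two targets, `TST` and `C`. A minimal sketch of how that could be checked is shown below. The frames `train_toy` and `test_toy` are hypothetical stand-ins holding only a few columns, with values copied from the first rows printed above; in the notebook the same set difference would be applied to `target_train.columns` and `test.columns`.

```python
import pandas as pd

# Hypothetical toy stand-ins for target_train and test (only a few columns).
train_toy = pd.DataFrame({
    "NPLV": [510008, 510009],
    "TST": [1690, 1683],
    "C": [0.06, 0.097],
    "VES": [263700.0, 264500.0],
})
test_toy = pd.DataFrame({
    "NPLV": [512324],
    "VES": [240100.0],
})

# Columns present in one frame but not in the other.
only_in_train = sorted(set(train_toy.columns) - set(test_toy.columns))
only_in_test = sorted(set(test_toy.columns) - set(train_toy.columns))

print("only in train:", only_in_train)  # only in train: ['C', 'TST']
print("only in test:", only_in_test)    # only in test: []
```

If `only_in_test` were non-empty on the real data, the model would see features at prediction time that it never saw during training, so this is worth checking before fitting.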

https://towardsdatascience.com/lightautoml-preset-usage-tutorial-2cce7da6f936

In `general_params`, `use_algos` is a list of lists: this notation defines ML pipelines with as many levels of algorithms as you want. For example, `[['linear_l2', 'lgb', 'cb'], ['lgb_tuned', 'cb']]` stands for 3 algorithms on the first level and 2 on the second. Once the second level is fully trained, the predictions of its 2 algorithms are combined by weighted averaging to produce the final prediction. The full set of parameters (not only the general ones) that can be passed to customize TabularAutoML can be found in its [YAML config](https://github.com/sberbank-ai-lab/LightAutoML/blob/master/lightautoml/automl/presets/tabular_config.yml).

Base algorithms currently available for `use_algos` in `general_params`:
- Linear model (`linear_l2`)
- LightGBM model with expert params based on the dataset (`lgb`)
- LightGBM with params tuned by Optuna (`lgb_tuned`)
- CatBoost model with expert params (`cb`)
- CatBoost with params tuned by Optuna (`cb_tuned`)
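
To make the nested-list notation concrete, here is a small sketch in plain Python that reads a `use_algos` value level by level. The helper `describe_use_algos` and the constant `KNOWN_ALGOS` are hypothetical names introduced for illustration only; they are not part of LightAutoML, and the set of names is simply the list above.

```python
from typing import List

# Algorithm names taken from the list above (not queried from LightAutoML).
KNOWN_ALGOS = {"linear_l2", "lgb", "lgb_tuned", "cb", "cb_tuned"}


def describe_use_algos(use_algos: List[List[str]]) -> List[str]:
    """Return one readable line per level of a nested use_algos list."""
    lines = []
    for level, algos in enumerate(use_algos, start=1):
        # Catch typos early, before a long AutoML run is started.
        unknown = [name for name in algos if name not in KNOWN_ALGOS]
        if unknown:
            raise ValueError(f"Unknown algorithm name(s) on level {level}: {unknown}")
        lines.append(f"level {level}: {len(algos)} algorithm(s): {', '.join(algos)}")
    return lines


two_level = [["linear_l2", "lgb", "cb"], ["lgb_tuned", "cb"]]
summary = describe_use_algos(two_level)
for line in summary:
    print(line)
# level 1: 3 algorithm(s): linear_l2, lgb, cb
# level 2: 2 algorithm(s): lgb_tuned, cb
```

A single-level configuration such as `[["linear_l2", "lgb_tuned"]]` still needs the outer list, which is an easy detail to miss.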

In [13]:
# Temperature model (TST): one level with a linear model and an Optuna-tuned LightGBM
automl_t = TabularUtilizedAutoML(task=Task('reg', metric='mse'),
                                 timeout=600,  # 600 seconds = 10 minutes
                                 cpu_limit=4,  # Optimal for Kaggle kernels
                                 general_params={'use_algos': [['linear_l2', 'lgb_tuned']]})

# Carbon model (C): two-level stack, 3 algorithms on level 1 and 2 on level 2
automl_c = TabularUtilizedAutoML(task=Task('reg', metric='mse'),
                                 timeout=600,  # 600 seconds = 10 minutes
                                 cpu_limit=4,  # Optimal for Kaggle kernels
                                 general_params={'use_algos': [['linear_l2', 'lgb', 'cb'], ['lgb_tuned', 'cb']]})

In [14]:
# Fit each model and get out-of-fold predictions; each model drops the other target column
oof_pred_t = automl_t.fit_predict(target_train, roles={'target': 'TST', 'drop': 'C'})
oof_pred_c = automl_c.fit_predict(target_train, roles={'target': 'C', 'drop': 'TST'})

In [15]:
test_pred_t = automl_t.predict(test)
test_pred_c = automl_c.predict(test)

In [16]:
pd.DataFrame({'NPLV': test['NPLV'],
              'TST':test_pred_t.data[:, 0], 
              'C': test_pred_c.data[:, 0]
             })

Unnamed: 0,NPLV,TST,C
0,512324,1660.386475,0.052478
1,512327,1662.991943,0.078322
2,512328,1658.475098,0.109283
3,512331,1655.760498,0.085248
4,512333,1662.438110,0.109976
...,...,...,...
775,513369,1662.905273,0.076655
776,513370,1664.241577,0.115325
777,513371,1667.691284,0.091230
778,513372,1670.125854,0.091940


In [17]:
# get the current date and time
now = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# build the file name with the date and time
# file_name = f'../../data/kaggle/gb_competitive_data_analysis/lgb_predictions_{now}.csv'
file_name = f'lama_l2_lgb_cb_optuna_{now}.csv'
print('File name: ', file_name)

# save to csv
pd.DataFrame({'NPLV': test['NPLV'],
              'TST':test_pred_t.data[:, 0], 
              'C': test_pred_c.data[:, 0]
             }).to_csv(file_name, index=False, encoding='utf-8')
print('\n File saved to disk!')

File name:  lama_l2_lgb_cb_optuna_2021-11-01_21-45-58.csv

 File saved to disk!


In [11]:
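def metric(answers, user_csv):
    """Mean hit rate over both targets: |error| < 0.02 for C and < 20 for TST."""

    # 1 where the carbon prediction is within 0.02 of the true value, else 0
    delta_c = np.abs(np.array(answers['C']) - np.array(user_csv['C']))
    hit_rate_c = np.int64(delta_c < 0.02)

    # 1 where the temperature prediction is within 20 of the true value, else 0
    delta_t = np.abs(np.array(answers['TST']) - np.array(user_csv['TST']))
    hit_rate_t = np.int64(delta_t < 20)

    N = np.size(answers['C'])

    # average of the two per-target hit rates
    return np.sum(hit_rate_c + hit_rate_t) / 2 / N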
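
As a worked example of how the score is formed, the sketch below applies the same thresholds to four made-up melts. The function `hit_rate` restates the logic of `metric()` above so that the block runs on its own, and the frames `answers_toy` and `preds_toy` contain hypothetical values chosen only to make the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd


def hit_rate(answers: pd.DataFrame, user_csv: pd.DataFrame) -> float:
    """Restatement of metric() above: mean of the C and TST hit rates."""
    hit_c = np.abs(answers["C"].to_numpy() - user_csv["C"].to_numpy()) < 0.02
    hit_t = np.abs(answers["TST"].to_numpy() - user_csv["TST"].to_numpy()) < 20
    return float((hit_c.sum() + hit_t.sum()) / 2 / len(answers))


# Hypothetical ground truth and predictions for four melts.
answers_toy = pd.DataFrame({
    "C": [0.06, 0.10, 0.09, 0.05],
    "TST": [1690, 1683, 1662, 1650],
})
preds_toy = pd.DataFrame({
    "C": [0.07, 0.13, 0.09, 0.08],    # errors 0.01, 0.03, 0.00, 0.03 -> 2 hits
    "TST": [1680, 1660, 1665, 1655],  # errors 10, 23, 3, 5 -> 3 hits
})

score = hit_rate(answers_toy, preds_toy)
print(score)  # 0.625, i.e. (2 + 3) / 2 / 4
```

Because the score only counts hits inside fixed tolerances, a model trained on MSE is not optimizing it directly; comparing the out-of-fold predictions against the true `TST` and `C` with this function is one way to estimate the score before submitting.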