Evraz AI Challenge
---

October 29-31, 2021

**Track 1: Blow the Metal with Data Science**.

Develop a model that predicts the carbon content and temperature of the pig iron during the oxygen blowing process.

https://hackathon.evraz.com/

**Input data**

For this task you are given data on pig iron blowing in the shop:

- **produv** – Main blowing parameters: instantaneous oxygen flow rate and lance position (tilt)
- **lom** – Scrap is charged together with the pig iron as part of the process. The table lists the weight and type of scrap used in each heat
- **plavki** – Basic information about each heat: heat characteristics (steel grade, casting direction) and equipment
- **sip** – Bulk additives used in the process
- **chronom** – Timing: start and end times of the operations performed during a heat
- **chugun** – Chemical composition and characteristics of the pig iron
- **gas** – Off-gas analysis data
- **target** – Target values


- [task description](https://russianhackers.notion.site/1-Data-Science-4cc89ba42de1429bbac316f59bf07a3b)
- [data attributes](https://russianhackers.notion.site/a685453e4fde41a098d9ad704d906e21?v=c482eaeb8f3143d58763b4b9008f1fec)
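
Most of the tables above hold several rows per heat, so a common first step is to collapse each one to a single row per heat and join the result onto the heat-level table. The sketch below shows that pattern on tiny synthetic frames. `NPLV` is the heat identifier that appears in the data; the column names `O2_flow` and `grade` and all values are hypothetical, and this is not necessarily how the pickled frames used later in this notebook were built.

```python
import pandas as pd

# Hypothetical blowing time series: several rows per heat (NPLV).
produv = pd.DataFrame({
    "NPLV": [1, 1, 1, 2, 2],
    "O2_flow": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# Hypothetical heat-level table: one row per heat.
plavki = pd.DataFrame({"NPLV": [1, 2], "grade": ["A", "B"]})

# Collapse the time series to one row per heat.
produv_agg = (
    produv.groupby("NPLV")["O2_flow"]
    .agg(["mean", "max", "sum"])
    .add_prefix("O2_flow_")
    .reset_index()
)

# A left join keeps every heat, even one without blowing records.
features = plavki.merge(produv_agg, on="NPLV", how="left")
print(features)
```

The same groupby-then-merge step could be repeated for each multi-row table before the result is passed to a model.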

### LightAutoML

LightAutoML house prices regression: https://www.kaggle.com/alexryzhkov/lightautoml-houseprices-love

LightAutoML Titanic classification: https://towardsdatascience.com/lightautoml-preset-usage-tutorial-2cce7da6f936 and https://www.kaggle.com/alexryzhkov/lightautoml-extreme-short-titanic-solution

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
from typing import List, Tuple, Optional
import re
import lightgbm as lgb
from datetime import datetime

from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

## Functions

# Loading the data

In [2]:
path = Path('../../../data/2021_evraz')

In [3]:
target_train = pd.read_pickle(path.joinpath('target_train_with_gas_wo_sip.pkl'))
print(target_train.shape)
target_train.head(3)

(2063, 156)


Unnamed: 0,NPLV,TST,C,VES,T_x,SI,MN,S,P,CR,...,CO,CO2,H2,N2,O2,O2_pressure,T_y,Tfurmy1,Tfurmy2,V_y
0,510008,1690,0.06,263700.0,1396.0,0.44,0.22,0.023,0.097,0.03,...,41565.325339,34936.083312,768.890272,156085.786997,20685.819848,34191.508527,1297695.0,0.0,0.0,554980600.0
1,510009,1683,0.097,264500.0,1419.0,0.68,0.2,0.017,0.087,0.02,...,45281.138686,46447.033896,644.923314,255833.503217,43381.103374,55089.194445,1484196.0,0.0,0.0,857147900.0
2,510010,1662,0.091,263800.0,1384.0,0.56,0.26,0.017,0.096,0.03,...,42363.861283,36527.960575,898.578333,179821.062543,25108.381611,40258.212162,1406451.0,0.0,0.0,619007500.0


In [4]:
test = pd.read_pickle(path.joinpath('test_with_gas_wo_sip.pkl'))
print(test.shape)
test.head(3)

(780, 154)


Unnamed: 0,NPLV,VES,T_x,SI,MN,S,P,CR,NI,CU,...,CO,CO2,H2,N2,O2,O2_pressure,T_y,Tfurmy1,Tfurmy2,V_y
0,512324,240100.0,1355.0,0.46,0.33,0.027,0.079,0.01,0.01,0.02,...,33000.85845,39568.252094,491.99342,266109.431543,42125.515482,57597.302491,1470844.0,101464.968297,115924.772553,809378500.0
1,512327,266400.0,1390.0,0.3,0.33,0.032,0.099,0.01,0.0,0.0,...,42393.120691,44885.938358,336.973542,213638.833867,25216.409921,52825.081793,1304937.0,85182.232605,97366.753531,762839100.0
2,512328,270200.0,1373.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,68213.928782,63900.041993,721.388411,379327.183482,56372.427382,92308.091414,1698940.0,147624.03396,164204.12123,1331792000.0


In [5]:
target_train.columns

Index(['NPLV', 'TST', 'C', 'VES', 'T_x', 'SI', 'MN', 'S', 'P', 'CR',
       ...
       'CO', 'CO2', 'H2', 'N2', 'O2', 'O2_pressure', 'T_y', 'Tfurmy1',
       'Tfurmy2', 'V_y'],
      dtype='object', length=156)
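
The training frame has two more columns than the test frame (156 vs. 154). A quick check can confirm that the extra columns are only the targets `TST` and `C`. The helper `column_diff` and the toy frames below are hypothetical stand-ins; in the notebook the call would be `column_diff(target_train, test)`.

```python
import pandas as pd

def column_diff(train: pd.DataFrame, test: pd.DataFrame):
    """Return (columns only in train, columns only in test), each sorted."""
    only_train = sorted(set(train.columns) - set(test.columns))
    only_test = sorted(set(test.columns) - set(train.columns))
    return only_train, only_test

# Toy stand-ins for target_train and test (values are made up).
toy_train = pd.DataFrame({"NPLV": [1], "TST": [1650], "C": [0.05], "VES": [260000.0]})
toy_test = pd.DataFrame({"NPLV": [2], "VES": [250000.0]})

only_train, only_test = column_diff(toy_train, toy_test)
print(only_train, only_test)
```

If anything other than the two targets shows up in either list, the feature sets differ and should be aligned before predicting.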

In [6]:
automl_t = TabularAutoML(task = Task('reg', metric = 'mse'))
automl_c = TabularAutoML(task = Task('reg', metric = 'mse'))

In [7]:
oof_pred_t = automl_t.fit_predict(target_train,  roles = {'target': 'TST', 'drop': 'C'})
oof_pred_c = automl_c.fit_predict(target_train,  roles = {'target': 'C', 'drop': 'TST'})

In [8]:
test_pred_t = automl_t.predict(test)
test_pred_c = automl_c.predict(test)

In [9]:
pd.DataFrame({'NPLV': test['NPLV'],
              'TST':test_pred_t.data[:, 0], 
              'C': test_pred_c.data[:, 0]
             })

Unnamed: 0,NPLV,TST,C
0,512324,1655.895630,0.055032
1,512327,1661.177612,0.075187
2,512328,1654.020264,0.106143
3,512331,1653.269165,0.128139
4,512333,1658.896362,0.090677
...,...,...,...
775,513369,1662.528809,0.091954
776,513370,1663.140259,0.130896
777,513371,1666.706177,0.108258
778,513372,1671.117920,0.113524


In [10]:
# get the current date and time
now = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# build the file name with the timestamp
file_name = f'lama_{now}.csv'
print('File name: ', file_name)

# save to CSV
pd.DataFrame({'NPLV': test['NPLV'],
              'TST':test_pred_t.data[:, 0], 
              'C': test_pred_c.data[:, 0]
             }).to_csv(file_name, index=False, encoding='utf-8')
print('\n File saved to disk!')

File name:  lama_2021-10-31_00-42-38.csv

 File saved to disk!
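
Before uploading, it can help to sanity-check the submission frame. The sketch below is a hypothetical helper, `validate_submission`, run on made-up values. It assumes the expected layout is the `NPLV`, `TST`, `C` column order used above, which the source does not state as an official requirement.

```python
import numpy as np
import pandas as pd

def validate_submission(sub: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Raise ValueError if the submission frame looks malformed."""
    # Assumed layout: the column order used when building the CSV above.
    if list(sub.columns) != ["NPLV", "TST", "C"]:
        raise ValueError(f"unexpected columns: {list(sub.columns)}")
    # Length is checked first so the element-wise comparison is always valid.
    if len(sub) != len(expected_ids) or not (
        sub["NPLV"].to_numpy() == expected_ids.to_numpy()
    ).all():
        raise ValueError("NPLV values do not match the test set")
    if not np.isfinite(sub[["TST", "C"]].to_numpy(dtype=float)).all():
        raise ValueError("predictions contain NaN or infinite values")

# Toy data (made-up values).
ids = pd.Series([512324, 512327])
good = pd.DataFrame({"NPLV": ids, "TST": [1655.9, 1661.2], "C": [0.055, 0.075]})
validate_submission(good, ids)  # passes silently

bad = good.assign(C=[0.055, np.nan])
error = None
try:
    validate_submission(bad, ids)
except ValueError as exc:
    error = str(exc)
print(error)
```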


In [11]:
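def metric(answers, user_csv):
    """Score: mean of the hit rates for C and TST.

    A carbon prediction is a hit if |error| < 0.02,
    a temperature prediction is a hit if |error| < 20.
    """
    # per-heat hits for carbon content
    delta_c = np.abs(np.array(answers['C']) - np.array(user_csv['C']))
    hit_rate_c = np.int64(delta_c < 0.02)

    # per-heat hits for temperature
    delta_t = np.abs(np.array(answers['TST']) - np.array(user_csv['TST']))
    hit_rate_t = np.int64(delta_t < 20)

    N = np.size(answers['C'])

    # average of the two hit rates
    return np.sum(hit_rate_c + hit_rate_t) / 2 / N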
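
To make the scoring rule concrete, here is a small worked example. So that the block runs on its own, it re-implements the same rule as `metric` above under a hypothetical name, `hit_rate_score`. The answer and prediction values are made up.

```python
import numpy as np
import pandas as pd

def hit_rate_score(answers: pd.DataFrame, preds: pd.DataFrame) -> float:
    """Mean of the C hit rate (|error| < 0.02) and the TST hit rate (|error| < 20)."""
    hit_c = np.abs(answers["C"].to_numpy() - preds["C"].to_numpy()) < 0.02
    hit_t = np.abs(answers["TST"].to_numpy() - preds["TST"].to_numpy()) < 20
    return float((hit_c.mean() + hit_t.mean()) / 2)

# Made-up ground truth and predictions for four heats.
answers = pd.DataFrame({"C": [0.05, 0.10, 0.20, 0.08], "TST": [1650, 1660, 1670, 1680]})
preds = pd.DataFrame({"C": [0.06, 0.15, 0.21, 0.08], "TST": [1655, 1700, 1675, 1650]})

# C: 3 of 4 within 0.02; TST: 2 of 4 within 20 -> (0.75 + 0.5) / 2
score = hit_rate_score(answers, preds)
print(score)  # 0.625
```

Because each target contributes only a hit or a miss, the score rewards getting predictions inside the tolerance band and is indifferent to how small the error is within it, which differs from the `mse` objective used for training above.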