The cells below train an ensemble of machine learning models with the lightautoml framework. For it to work correctly, Python 3.7 is required.
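
A minimal sketch of a guard for the version requirement stated above. The helper name `is_supported` and the constant `REQUIRED` are hypothetical, and the check only warns rather than stopping the notebook; other interpreter versions may work with other lightautoml releases.

```python
import sys

# Assumption: the requirement is exactly the major.minor pair named in the note above.
REQUIRED = (3, 7)


def is_supported(version_info=None, required=REQUIRED):
    """Return True when the interpreter's major.minor pair equals `required`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(required)


if not is_supported():
    print(
        f"Warning: running Python {sys.version_info.major}.{sys.version_info.minor}, "
        f"but this notebook was written for {REQUIRED[0]}.{REQUIRED[1]}"
    )
```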

Import the required libraries

In [182]:
import os
import numpy as np 
import pandas as pd 
import glob
# from ipyplot import plot_images
# import cv2
import torch
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
import seaborn as sns

from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.tasks import Task
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from lightautoml.report.report_deco import ReportDeco

Load the pre-built training dataset

In [185]:
data = pd.read_csv("meta_wag_final.csv", index_col=0)
data.replace(['no_data'], -1, inplace=True)
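
To illustrate what the sentinel replacement above does, here is a small sketch on a toy frame. The values are made up, and the extra `pd.to_numeric` step is an assumption about what may be useful afterwards, not something the notebook itself does.

```python
import pandas as pd

# Toy frame standing in for meta_wag_final.csv; the values are invented.
# The 'tara' column is object-typed because it mixes numbers-as-text with 'no_data'.
toy = pd.DataFrame({
    "tara": pd.Series(["230", "no_data", "245"], dtype=object),
    "gruz": [690, 700, 660],
})

# Same call as in the cell above, but returning a new frame instead of editing in place.
cleaned = toy.replace(["no_data"], -1)

# The sentinel is gone, but the column still mixes strings and an integer,
# so an explicit numeric conversion is a reasonable extra step.
cleaned["tara"] = pd.to_numeric(cleaned["tara"])
print(cleaned.dtypes)
```

Note that -1 is only a safe placeholder if it cannot occur as a real value in the affected columns.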

Drop some of the features and create two training sets: one with the monthly target, the other with the 10-day target

In [187]:
data_m = data.drop(columns=["target_day", "wagnum", 'month', 'date_build', 'srok_sl', 'model', 'date_pl_rem', 'plan_date'])
data_d = data.drop(columns=["target_month", "wagnum", 'month', 'date_build', 'srok_sl', 'model', 'date_pl_rem', 'plan_date'])
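
The two `drop` calls above differ only in which target they remove. One possible way to avoid repeating the shared column list is sketched below on a toy frame; the helper `drop_for_target` and the constant `COMMON_DROP` are hypothetical names.

```python
import pandas as pd

# Columns removed from both training sets (copied from the cell above).
COMMON_DROP = ["wagnum", "month", "date_build", "srok_sl", "model", "date_pl_rem", "plan_date"]


def drop_for_target(df, other_target, common_drop=COMMON_DROP):
    """Return a copy of `df` without the shared columns and without the other target."""
    return df.drop(columns=[other_target, *common_drop])


# Toy frame with the same column names and one illustrative feature ('tara').
toy = pd.DataFrame({col: [0, 1] for col in COMMON_DROP + ["target_month", "target_day", "tara"]})

toy_m = drop_for_target(toy, other_target="target_day")    # keeps target_month
toy_d = drop_for_target(toy, other_target="target_month")  # keeps target_day

print(list(toy_m.columns))  # ['target_month', 'tara']
print(list(toy_d.columns))  # ['target_day', 'tara']
```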

Final feature set

In [188]:
data_m.columns

Index(['ost_prob_x', 'manage_type', 'rod_id', 'reestr_state', 'target_month',
       'prob_diff', 'gruz', 'cnsi_gruz_capacity', 'cnsi_volumek', 'tara',
       'zavod_build', 'cnsi_probeg_dr', 'cnsi_probeg_kr', 'kuzov', 'telega',
       'tormoz', 'tipvozd', 'tippogl', 'norma_km', 'ownertype', 'lifespan',
       'lefttime', 'kod_vrab_x', 'neis1_kod', 'neis2_kod', 'neis3_kod',
       'mod1_kod', 'mod2_kod', 'mod3_kod', 'mod4_kod', 'mod5_kod', 'mod6_kod',
       'mod7_kod', 'road_id_send', 'gr_probeg', 'por_probeg', 'st_id_send_x',
       'rem_count', 'date_kap', 'date_dep', 'kod_vrab_y', 'id_road_disl',
       'st_id_dest', 'id_road_dest', 'st_id_send_y', 'id_road_send',
       'ost_prob_y', 'isload', 'fr_id', 'last_fr_id', 'distance', 'days_load',
       'fr_changes'],
      dtype='object')

In [189]:
N_THREADS = 1
N_FOLDS = 5
RANDOM_STATE = 42
TEST_SIZE = 0.2
TIMEOUT = 300
TARGET_NAME_MONTH = 'target_month'
TARGET_NAME_DAY = 'target_day'

In [190]:
np.random.seed(RANDOM_STATE)
torch.set_num_threads(N_THREADS)

Two models are trained

Model 1: monthly target

In [191]:
roles = {
    'target': TARGET_NAME_MONTH,
    
}

task = Task(
        name = 'binary',
        metric = lambda y_true, y_pred: f1_score(y_true, (y_pred > 0.5)*1)
)

automl_m = TabularAutoML(
    task = task,
    timeout = TIMEOUT,
    cpu_limit = N_THREADS,
    reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE}
)
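
The metric passed to `Task` above cuts the predicted probabilities at 0.5 and then computes F1 for the positive class. The pure-Python sketch below spells out that computation for illustration only; `f1_at_threshold` is a hypothetical helper, not part of lightautoml or scikit-learn.

```python
def f1_at_threshold(y_true, y_prob, threshold=0.5):
    """F1 for the positive class after cutting probabilities at `threshold`."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: F1 is 0 (scikit-learn's default also yields 0 here).
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Two of three positives are found, with one false alarm: F1 = 4 / 6.
mixed = f1_at_threshold([0, 0, 1, 1, 1], [0.1, 0.6, 0.4, 0.7, 0.9])

# Every probability stays below the threshold, so nothing is predicted positive.
all_negative = f1_at_threshold([0, 1, 1], [0.2, 0.3, 0.4])

print(round(mixed, 4), all_negative)
```

The second case is one plausible explanation for a score of exactly 0.0: a model whose probabilities never exceed 0.5 predicts no positives at all.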

In [192]:
oof_pred = automl_m.fit_predict(
    data_m,
    roles = roles,
    verbose=4
)

[09:06:26] Stdout logging level is DEBUG.
[09:06:26] Task: binary

[09:06:26] Start automl preset with listed constraints:
[09:06:26] - time: 300.00 seconds
[09:06:26] - CPU: 1 cores
[09:06:26] - memory: 16 GB

[09:06:26] Train data shape: (194615, 53)

[09:06:35] Feats was rejected during automatic roles guess: []
[09:06:36] Layer 1 train process start. Time left 290.02 secs
[09:06:44] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[09:06:44] Training params: {'tol': 1e-06, 'max_iter': 100, 'cs': [1e-05, 5e-05, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000], 'early_stopping': 2, 'categorical_idx': [57, 58, 59, 60, 61, 62, 63], 'embed_sizes': array([30, 30, 23, 16, 16, 21, 11]), 'data_size': 64}
[09:06:44] ===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====
[09:06:46] Linear model: C = 1e-05 score = 0.0
[09:06:46] Linear model: C = 5e-05 score = 0.012448132780082987

Model 2: 10-day target

In [193]:
roles = {
    'target': TARGET_NAME_DAY,
}

automl_d = TabularAutoML(
    task = task,
    timeout = TIMEOUT,
    cpu_limit = N_THREADS,
    reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE}
)

oof_pred_d = automl_d.fit_predict(
    data_d,
    roles = roles,
    verbose=4
)

[09:08:57] Stdout logging level is DEBUG.
[09:08:57] Task: binary

[09:08:57] Start automl preset with listed constraints:
[09:08:57] - time: 300.00 seconds
[09:08:57] - CPU: 1 cores
[09:08:57] - memory: 16 GB

[09:08:57] Train data shape: (194615, 53)

[09:09:06] Feats was rejected during automatic roles guess: []
[09:09:06] Layer 1 train process start. Time left 290.78 secs
[09:09:14] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[09:09:14] Training params: {'tol': 1e-06, 'max_iter': 100, 'cs': [1e-05, 5e-05, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000], 'early_stopping': 2, 'categorical_idx': [57, 58, 59, 60, 61, 62], 'embed_sizes': array([30, 30, 30, 23, 16, 16]), 'data_size': 63}
[09:09:14] ===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====
[09:09:15] Linear model: C = 1e-05 score = 0.0
[09:09:16] Linear model: C = 5e-05 score = 0.0
[09:09:16] Linear model: C = 0.0001 score = 0.028125
[09:09:17] Linear model: C = 0.0005 score = 0.27319587628865977
[09:09:17] Linear model: C = 0.001 score = 0.31396786155747836
[09:09:18] Linear model: C = 0.005 score = 0.3444976076555024
[09:09:18] Linear model: C = 0.01 score = 0.34

Load the features used to predict the March targets

In [199]:
data_inf = pd.read_csv("feb_features.csv", index_col=0)
data_inf.replace(['no_data'], -1, inplace=True)

In [201]:
data_inf_ = data_inf.drop(columns=["wagnum", 'date_pl_rem'])
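
The training frames drop more columns than the inference frame does, so it can be worth confirming that the two feature sets line up before predicting. The sketch below is one way to compare column names; `compare_columns` is a hypothetical helper and the example names are only a small subset chosen for illustration.

```python
def compare_columns(train_columns, inference_columns, targets=()):
    """Return (missing, extra) feature names relative to the training columns."""
    target_set = set(targets)
    expected = [c for c in train_columns if c not in target_set]
    expected_set = set(expected)
    present = set(inference_columns)
    missing = [c for c in expected if c not in present]           # needed but absent
    extra = [c for c in inference_columns if c not in expected_set]  # present but unused in training
    return missing, extra


# Illustrative column lists; in the notebook one might pass
# data_m.columns and data_inf_.columns with targets=[TARGET_NAME_MONTH].
train_cols = ["ost_prob_x", "tara", "distance", "target_month"]
inference_cols = ["ost_prob_x", "tara", "month"]

missing, extra = compare_columns(train_cols, inference_cols, targets=["target_month"])
print("missing at inference:", missing)
print("not seen in training:", extra)
```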

Inference with both models

In [202]:
pred_inf_m = automl_m.predict(data_inf_)
pred_inf_d = automl_d.predict(data_inf_)

Build the submission for the website

In [203]:
df_pred = pd.DataFrame({
    "wagnum": data_inf.wagnum.values,
    "target_month": (pred_inf_m.data[:, 0] > 0.5) * 1,
    "target_day": (pred_inf_d.data[:, 0] > 0.5) * 1
})

In [204]:
sub_df = pd.read_csv("y_predict_submit_example.csv")

In [205]:
# Take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
sub_df = sub_df[["wagnum"]].copy()
sub_df["month"] = pd.to_datetime('2023-03-01')

In [206]:
final_df = sub_df.merge(df_pred, how="left", on="wagnum")
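
A left merge keeps every wagon from the submission template, so any `wagnum` without a prediction ends up with NaN targets (and the target columns become floats). The sketch below shows one way to detect and handle that on toy data; treating a missing prediction as class 0 is an assumption, not something the notebook states.

```python
import pandas as pd

# Toy stand-ins for sub_df and df_pred; wagon 3 has no prediction.
sub = pd.DataFrame({"wagnum": [1, 2, 3]})
pred = pd.DataFrame({"wagnum": [1, 2], "target_month": [1, 0], "target_day": [0, 0]})

merged = sub.merge(pred, how="left", on="wagnum")

target_cols = ["target_month", "target_day"]

# Count the rows for which at least one target is missing after the merge.
n_missing = int(merged[target_cols].isna().any(axis=1).sum())
print(f"rows without a prediction: {n_missing}")

# Assumption: a missing prediction is treated as the negative class (0).
merged[target_cols] = merged[target_cols].fillna(0).astype(int)
print(merged)
```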

In [207]:
final_df.to_csv("preds.csv", index=False)

In [155]:
final_df

       wagnum       month  target_month  target_day
0       33361  2023-03-01             0           0
1       33364  2023-03-01             0           0
2       33366  2023-03-01             0           0
3       33358  2023-03-01             0           0
4       33349  2023-03-01             0           0
...       ...         ...           ...         ...
33702   17621  2023-03-01             0           0
33703   25045  2023-03-01             0           0
33704   27156  2023-03-01             0           0
33705   21361  2023-03-01             0           0
