Copy the template to gisto

**Building a LightAutoML (LAMA) model**
1. Generate additional features based on what you know about the data.
2. Run tabular AutoML with a very large time budget (10 h) so that it goes through all the steps.
3. Inspect the final result in the logs: which models were used and how long the run took. This shows how long each config inside the utilized preset will take.
4. Plan a run with only the models you need and a known time budget.
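
Steps 3 and 4 can be sketched as a small planning helper. This is only an illustration: the timings are made-up placeholders rather than real LightAutoML log output, and `plan_timeout` is a hypothetical helper, not part of the library. It adds up the per-model durations read from the logs and applies a safety margin.

```python
import math

# Hypothetical per-model timings (seconds), read manually from the logs of the
# long exploratory run. Placeholder numbers, not real output.
observed_seconds = {
    "linear_l2": 120.0,
    "lgb": 900.0,
    "lgb_tuned": 5400.0,
    "cb": 1500.0,
    "cb_tuned": 9000.0,
}


def plan_timeout(timings, wanted_algos, safety_margin_pct=20):
    """Return a timeout in whole seconds that covers only the wanted models."""
    missing = [name for name in wanted_algos if name not in timings]
    if missing:
        raise KeyError(f"no timing recorded for: {missing}")
    total = sum(timings[name] for name in wanted_algos)
    # Add the safety margin and round up to a whole second.
    return math.ceil(total * (100 + safety_margin_pct) / 100)


wanted = ["linear_l2", "lgb", "lgb_tuned"]
timeout = plan_timeout(observed_seconds, wanted)
print(f"planned timeout: {timeout} s for {wanted}")
```

The resulting value could then be passed as `timeout`, with `wanted` as the single level in `use_algos`.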

[LightAutoML vs Titanic: 80% accuracy in several lines of code](https://towardsdatascience.com/lightautoml-preset-usage-tutorial-2cce7da6f936)

In [None]:
#!pip install -U lightautoml

In [None]:
#!pip install autowoe

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path  # standard library; the pathlib2 backport is not needed on Python 3
import matplotlib.pyplot as plt

import logging

import lightautoml
from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.pipelines.selection.importance_based import ImportanceCutoffSelector, ModelBasedImportanceEstimator
from lightautoml.tasks import Task
from lightautoml.report import ReportDeco as LamaReportDeco  # aliased so autowoe's ReportDeco below does not shadow it

from autowoe import AutoWoE, ReportDeco
#logging.basicConfig(format='[%(asctime)s] (%(levelname)s): %(message)s', level=logging.INFO)

**There are 3 different `task types`**:
- `binary` - for binary classification.
- `reg` - for regression.
- `multiclass` - for multiclass classification.

Available **`losses` for binary task**:
- `logloss` - (used by default) Standard logistic loss.

Available **`losses` for regression task**:
- `mse` - (used by default) Mean Squared Error.
- `mae` - Mean Absolute Error.
- `mape` - Mean Absolute Percentage Error.
- `rmsle` - Root Mean Squared Log Error.
- `huber` - Huber loss, required params: a - threshold between MAE and MSE losses.
- `fair` - Fair loss, required params: c - sets smoothness.
- `quantile` - Quantile loss, required params: q - sets quantile.

Available **`losses` for multi-classification task**:
- `crossentropy` - (used by default) Standard cross-entropy function.
- `f1` - Optimizes the F1-Macro score; currently available for LightGBM and NN models. Here we implicitly assume that the prediction lies not in the set {0, 1} but in the interval [0, 1].

Available **`metrics` for binary task**:
- `auc` - (used by default) ROC-AUC score.
- `accuracy` - Accuracy score (uses argmax prediction).
- `logloss` - Standard logistic loss.

Available **`metrics` for regression task**:
- `mse` - (used by default) Mean Squared Error.
- `mae` - Mean Absolute Error.
- `mape` - Mean Absolute Percentage Error.
- `rmsle` - Root Mean Squared Log Error.
- `huber` - Huber loss, required params: a - threshold between MAE and MSE losses.
- `fair` - Fair loss, required params: c - sets smoothness.
- `quantile` - Quantile loss, required params: q - sets quantile.

Available **`metrics` for multi-classification task**:
- `crossentropy` - (used by default) Standard cross-entropy loss.
- `auc` - ROC-AUC of each class against the rest.
- `auc_mu` - AUC-Mu, a multi-class extension of the standard AUC for binary classification. In short, the mean of n_classes * (n_classes - 1) / 2 binary AUCs. More info: http://proceedings.mlr.press/v97/kleiman19a/kleiman19a.pdf

[Source](https://lightautoml.readthedocs.io/en/latest/generated/lightautoml.tasks.base.Task.html)
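
To make the parameterized regression losses above more concrete, here is a minimal pure-Python sketch of a few of them. These are the usual textbook definitions. LightAutoML's internal implementations may differ in scaling or details, and the function names are illustrative only. The point is to show what the `a` (Huber threshold) and `q` (quantile) parameters control.

```python
import math


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmsle(y_true, y_pred):
    # Root of the mean squared difference of log(1 + value); needs values > -1.
    sq = [(math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))


def huber(y_true, y_pred, a=1.0):
    # Quadratic (MSE-like) for |error| <= a, linear (MAE-like) beyond it.
    def one(err):
        err = abs(err)
        return 0.5 * err * err if err <= a else a * (err - 0.5 * a)

    return sum(one(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def quantile_loss(y_true, y_pred, q=0.5):
    # Pinball loss: under-prediction is weighted by q, over-prediction by 1 - q.
    def one(err):
        return q * err if err >= 0 else (q - 1.0) * err

    return sum(one(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values: two exact predictions, one over- and one under-prediction.
y_true = [1.0, 2.0, 3.0, 10.0]
y_pred = [1.0, 2.0, 5.0, 4.0]

for name, value in [
    ("mse", mse(y_true, y_pred)),
    ("mae", mae(y_true, y_pred)),
    ("rmsle", rmsle(y_true, y_pred)),
    ("huber(a=1)", huber(y_true, y_pred, a=1.0)),
    ("quantile(q=0.9)", quantile_loss(y_true, y_pred, q=0.9)),
]:
    print(f"{name}: {value:.4f}")
```

With `q = 0.5` the quantile loss is half the MAE, and a larger `q` penalizes under-prediction more heavily.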

In [None]:
task = Task('reg', metric = 'mse')

**Available presets**
- **TabularAutoML**
- **TabularUtilizedAutoML** - preset for TIMEOUT utilization (tries to spend as much of the time budget as possible within the TIMEOUT boundary)

**Base algorithms** currently available for `use_algos` in `general_params`:
- Linear model (called `linear_l2`)
- LightGBM model with expert params based on dataset (`lgb`)
- LightGBM with tuned params using Optuna (`lgb_tuned`)
- CatBoost model with expert params (`cb`)
- CatBoost with params from Optuna (`cb_tuned`)

As you can see, `use_algos` is a **list of lists**: this notation lets you build ML pipelines with as many levels of algorithms as you want. For example, `[['linear_l2', 'lgb', 'cb'], ['lgb_tuned', 'cb']]` stands for 3 algorithms on the first level and 2 on the second. Once the second level is fully trained, the predictions of its 2 algorithms are combined by weighted averaging to construct the final prediction. The full set of parameters (not only the general ones) available for customizing TabularAutoML can be found in its [YAML config](https://github.com/sberbank-ai-lab/LightAutoML/blob/master/lightautoml/automl/presets/tabular_config.yml).
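
The nested-list notation and the final weighted averaging can be illustrated without the library. In this sketch the predictions and weights are made up and `weighted_average` is a hypothetical helper. LightAutoML selects the blending weights itself, so this only shows the arithmetic of the last step.

```python
use_algos = [["linear_l2", "lgb", "cb"], ["lgb_tuned", "cb"]]

# Each inner list is one level of the pipeline.
for level, algos in enumerate(use_algos, start=1):
    print(f"level {level}: {len(algos)} algorithms -> {algos}")


def weighted_average(predictions, weights):
    """Blend per-model prediction lists using one weight per model."""
    if len(predictions) != len(weights):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    n_rows = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions)) / total
        for i in range(n_rows)
    ]


# Made-up predictions of the two last-level models for three rows.
last_level_preds = [[10.0, 20.0, 30.0], [14.0, 16.0, 30.0]]
blended = weighted_average(last_level_preds, weights=[0.75, 0.25])
print(blended)
```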

In [None]:
automl = TabularUtilizedAutoML(task = task, 
                               timeout = 600, # 600 seconds = 10 minutes
                               cpu_limit = 8, # Optimal for Kaggle kernels
                               general_params = {'use_algos': [['linear_l2', 'lgb', 'lgb_tuned']]}
                              )

# automl = TabularAutoML(task = task, 
#                                timeout = 600, # 600 seconds = 10 minutes
#                                cpu_limit = 8, # Optimal for Kaggle kernels
#                                general_params = {'use_algos': [['linear_l2', 'lgb', 'lgb_tuned']]}
#                               )

**`verbose` – Controls the verbosity: the higher, the more messages**. 
- `<1` : messages are not displayed; 
- `>=1` : the computation process for layers is displayed; 
- `>=2` : the information about folds processing is also displayed; 
- `>=3` : the hyperparameters optimization process is also displayed; 
- `>=4` : the training process for every algorithm is displayed;
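
The thresholds are cumulative: each level adds messages on top of the previous ones. A tiny sketch of that rule follows. The function name and the message labels are illustrative and not part of LightAutoML.

```python
# Thresholds copied from the list above; labels are shortened descriptions.
VERBOSE_LEVELS = [
    (1, "layer computation progress"),
    (2, "fold processing"),
    (3, "hyperparameter optimization"),
    (4, "training process of every algorithm"),
]


def shown_messages(verbose):
    """Return the message groups displayed for a given verbose value."""
    return [name for threshold, name in VERBOSE_LEVELS if verbose >= threshold]


for level in (0, 2, 4):
    print(level, shown_messages(level))
```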

In [None]:
# 'Qж м3сут' is the target column; add a 'drop' entry with a list of column names to exclude features
oof_pred = automl.fit_predict(train, roles = {'target': 'Qж м3сут'}, verbose = 2)

In [None]:
#logging.info('oof_pred:\n{}\nShape = {}'.format(oof_pred, oof_pred.shape))

In [None]:
test_pred = automl.predict(test.drop(['Qж м3сут'], axis=1))

In [None]:
# Fast feature importances calculation
fast_fi = automl.get_feature_scores('fast')
fast_fi.set_index('Feature')['Importance'].plot.bar(figsize = (30, 10), grid = True);

In [None]:
# Accurate feature importances calculation (permutation importances) - can take a long time to calculate
accurate_fi = automl.get_feature_scores('accurate', train, silent = False)
accurate_fi.set_index('Feature')['Importance'].plot.bar(figsize = (30, 10), grid = True);
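
For intuition about what permutation importance measures, here is a minimal library-free sketch of the general technique. It is not a description of how `get_feature_scores('accurate', ...)` is implemented internally. The toy model, the data, and all helper names are assumptions made for illustration.

```python
import random


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def toy_model(row):
    # Stand-in for a fitted model: uses only feature "x1" and ignores "noise".
    return 3.0 * row["x1"]


def permutation_importance(rows, target, predict, features, seed=0):
    """Increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    base = mse(target, [predict(r) for r in rows])
    scores = {}
    for feat in features:
        shuffled = [r[feat] for r in rows]
        rng.shuffle(shuffled)
        # Copy every row, replacing only the shuffled feature.
        permuted = [dict(r, **{feat: v}) for r, v in zip(rows, shuffled)]
        scores[feat] = mse(target, [predict(r) for r in permuted]) - base
    return scores


# Made-up data: the target depends on "x1" only.
rows = [{"x1": float(i), "noise": float((i * 7) % 5)} for i in range(20)]
target = [3.0 * r["x1"] for r in rows]

scores = permutation_importance(rows, target, toy_model, ["x1", "noise"])
print(scores)
```

Shuffling a feature the model relies on degrades the metric, while shuffling an ignored feature leaves it unchanged. Every feature requires a fresh round of predictions, which is why the accurate mode can be slow on large datasets.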