# Core Concepts of Machine Learning

__Task author: N.V. Blokhin (NVBlokhin@fa.ru)__

Materials:
* https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
* http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
* https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
* https://contrib.scikit-learn.org/category_encoders/
* https://scikit-learn.org/stable/modules/model_evaluation.html
* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
* http://scikit-learn.org/stable/modules/cross_validation.html
* https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection

## Problems to work through together

1\. Load the dataset from the file `possum.csv` as a `pd.DataFrame`. Solve a classification problem with the `sex` column as the target.

In [None]:
# pip install category_encoders

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler
import category_encoders as ce

In [None]:
import pandas as pd

data = pd.read_csv("possum.csv").drop(columns=["case"]).fillna(0)
data.head(2)

,site,Pop,sex,age,hdlngth,skullw,totlngth,taill,footlgth,earconch,eye,chest,belly
0,1,Vic,m,8.0,94.1,60.4,89.0,36.0,74.5,54.5,15.2,28.0,36.0
1,1,Vic,f,6.0,92.5,57.6,91.5,36.5,72.5,51.2,16.0,28.5,33.0


In [None]:
X = data.drop(columns=["sex"])
y = data["sex"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    train_size=0.8,
    random_state=41,
)

In [None]:
X_train.head(2)

,site,Pop,age,hdlngth,skullw,totlngth,taill,footlgth,earconch,eye,chest,belly
24,1,Vic,3.0,95.8,58.5,91.5,35.5,72.3,51.6,14.9,31.0,35.0
42,2,Vic,2.0,90.0,55.5,81.0,32.0,72.0,49.4,13.4,29.0,31.0


In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
t = ColumnTransformer(
    [
        ("pop_label", ce.OrdinalEncoder(), ["Pop"]),
        ("site_ohe", OneHotEncoder(), ["site"]),
    ],
    remainder=MinMaxScaler()
).fit(X_train)

In [None]:
t.transform(X_train)

array([[1.        , 1.        , 0.        , ..., 0.39583333, 0.9       ,
        0.66666667],
       [1.        , 0.        , 1.        , ..., 0.08333333, 0.7       ,
        0.4       ],
       [2.        , 0.        , 0.        , ..., 0.8125    , 0.4       ,
        0.56666667],
       ...,
       [1.        , 1.        , 0.        , ..., 0.58333333, 0.5       ,
        0.46666667],
       [1.        , 0.        , 1.        , ..., 0.39583333, 0.35      ,
        0.73333333],
       [2.        , 0.        , 0.        , ..., 0.20833333, 0.7       ,
        0.9       ]])

In [None]:
# t.fit(X_test)  -- do not do this: fit the transformer on the training set only
t.transform(X_test)

array([[ 2.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  1.        ,  0.33333333,  0.32524272,
         0.28169014,  0.3255814 ,  0.54545455,  0.19886364,  0.25352113,
        -0.04166667,  0.2       ,  0.4       ],
       [ 2.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  0.77777778,  0.45631068,
         0.45070423,  0.55813953,  0.54545455,  0.28977273,  0.1971831 ,
         0.        ,  0.5       ,  0.6       ],
       [ 1.        ,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.22222222,  0.62135922,
         0.57746479,  0.6744186 ,  0.36363636,  0.63636364,  0.75352113,
         0.25      ,  0.8       ,  0.63333333],
       [ 2.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  0.33333333,  0.69902913,
         0.45774648,  0.6744186 ,  0.59090909,  0.153

In [None]:
sex_map = {"f": 0, "m": 1}  # named sex_map to avoid shadowing the built-in `map`

In [None]:
X_train_t = t.transform(X_train)
X_test_t = t.transform(X_test)

y_train_t = y_train.map(sex_map)
y_test_t = y_test.map(sex_map)

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = RandomForestClassifier().fit(X_train_t, y_train_t)
rf.score(X_train_t, y_train_t)

1.0

2\. Evaluate the quality of the trained model using cross-validation.

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
scores = cross_val_score(
    RandomForestClassifier(),
    X_train_t,
    y_train_t,
    cv=5,
)

In [None]:
scores.mean(), scores.std()

(0.6147058823529412, 0.023065275207879588)
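
In the cell above, cross-validation runs on `X_train_t`, which was transformed by a `ColumnTransformer` fitted on the whole training set, so each validation fold has already influenced the scaling statistics. One way to avoid this is to put the preprocessing and the model into a single `Pipeline`. The sketch below uses a small synthetic frame instead of `possum.csv`; the column names `length` and `width` and the forest settings are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in for the possum data (illustrative columns only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "site": rng.integers(1, 4, size=60),
    "length": rng.normal(90, 5, size=60),
    "width": rng.normal(57, 3, size=60),
})
target = rng.integers(0, 2, size=60)

pipe = Pipeline([
    ("prep", ColumnTransformer(
        [("site_ohe", OneHotEncoder(handle_unknown="ignore"), ["site"])],
        remainder=MinMaxScaler(),
    )),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# The preprocessing step is re-fitted on the training part of every fold.
pipe_scores = cross_val_score(pipe, demo, target, cv=5)
print(pipe_scores.mean(), pipe_scores.std())
```

Because the labels here are random, the scores themselves carry no meaning; the point is only where the transformer gets fitted.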

3\. Find the optimal model hyperparameters using grid search.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
grid = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [1, 2, 3, None],
}

In [None]:
grid = GridSearchCV(
    RandomForestClassifier(),
    param_grid=grid,
    scoring="accuracy",
).fit(X_train_t, y_train_t)

In [None]:
grid.best_params_, grid.best_score_

({'max_depth': None, 'n_estimators': 200}, 0.6625)

In [None]:
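# Note: these hyperparameters are set by hand and differ from grid.best_params_ above;
# grid.best_estimator_ already holds the best configuration refitted on the training set.
rf = RandomForestClassifier(max_depth=3, n_estimators=50).fit(X_train_t, y_train_t)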
rf.score(X_test_t, y_test_t)

0.5238095238095238
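
With the default `refit=True`, `GridSearchCV` retrains the best configuration on the full training set and exposes it as `best_estimator_`, so the winning model can be scored on the test set without retyping its hyperparameters. A minimal sketch on synthetic data (the dataset and the small grid are assumptions chosen to keep the example quick):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [20, 50], "max_depth": [2, None]},
    scoring="accuracy",
    cv=3,
).fit(Xtr, ytr)

# With refit=True (the default) the best configuration is retrained on all of Xtr.
best_rf = search.best_estimator_
test_accuracy = best_rf.score(Xte, yte)
print(search.best_params_, test_accuracy)
```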

## Problems for independent work

<p class="task" id="1"></p>

1\. Load the dataset from the file `Walmart.csv` as a `pd.DataFrame`. Convert the `Temperature` column to numeric. Convert the `IsHoliday` column to a numeric column containing the values 0 and 1, after first examining the values present in that column.

In [68]:
import pandas as pd

In [69]:
df = pd.read_csv('Walmart.csv')
df.head()

,Date,Weekly_Sales,Temperature,Fuel_Price,CPI,Unemployment,StoreId,IsHoliday
0,05-02-2010,1643690.9,42.31°C,2.572,211.096358,8.106,c4ca4238a0b923820dcc509a6f75849b,0
1,12-02-2010,1641957.44,38.51°C,2.548,211.24217,8.106,c4ca4238a0b923820dcc509a6f75849b,Y
2,19-02-2010,1611968.17,39.93°C,2.514,211.289143,8.106,c4ca4238a0b923820dcc509a6f75849b,N
3,26-02-2010,1409727.59,46.63°C,2.561,211.319643,8.106,c4ca4238a0b923820dcc509a6f75849b,n
4,05-03-2010,1554806.68,46.5°C,2.625,211.350143,8.106,c4ca4238a0b923820dcc509a6f75849b,0


In [70]:
df['Temperature'] = df['Temperature'].apply(lambda x: float(x[:-2]))

In [71]:
df['IsHoliday'].unique()

array(['0', 'Y', 'N', 'n', '-', 'no', 'No', 'y', 'Yes', '1', 'yes'],
      dtype=object)

In [72]:
df['IsHoliday'] = df['IsHoliday'].apply(lambda x: 0 if x in ['0', 'N', 'n', '-','no','No'] else 1)

In [73]:
df.head()

,Date,Weekly_Sales,Temperature,Fuel_Price,CPI,Unemployment,StoreId,IsHoliday
0,05-02-2010,1643690.9,42.31,2.572,211.096358,8.106,c4ca4238a0b923820dcc509a6f75849b,0
1,12-02-2010,1641957.44,38.51,2.548,211.24217,8.106,c4ca4238a0b923820dcc509a6f75849b,1
2,19-02-2010,1611968.17,39.93,2.514,211.289143,8.106,c4ca4238a0b923820dcc509a6f75849b,0
3,26-02-2010,1409727.59,46.63,2.561,211.319643,8.106,c4ca4238a0b923820dcc509a6f75849b,0
4,05-03-2010,1554806.68,46.5,2.625,211.350143,8.106,c4ca4238a0b923820dcc509a6f75849b,0
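
The conversion above treats every value outside the "negative" list as a holiday, so an unexpected label would silently become 1. A more defensive variant, sketched below on a three-row toy frame, normalizes the case, maps values through an explicit dictionary, and raises an error on anything unmapped. The dictionary `holiday_map` is an assumption that mirrors the choice made above of treating `'-'` as 0.

```python
import pandas as pd

raw = pd.DataFrame({
    "Temperature": ["42.31°C", "38.51°C", "46.5°C"],
    "IsHoliday": ["0", "Y", "no"],
})

# Strip the unit suffix and convert the remainder to float.
temperature = raw["Temperature"].str.replace("°C", "", regex=False).astype(float)

# Explicit mapping over lower-cased values; anything unexpected becomes NaN.
holiday_map = {"0": 0, "n": 0, "no": 0, "-": 0, "1": 1, "y": 1, "yes": 1}
is_holiday = raw["IsHoliday"].str.strip().str.lower().map(holiday_map)

# Fail loudly instead of silently treating unknown labels as holidays.
if is_holiday.isna().any():
    raise ValueError("unmapped IsHoliday values found")
is_holiday = is_holiday.astype(int)
print(temperature.tolist(), is_holiday.tolist())
```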


<p class="task" id="2"></p>

2\. Split the dataset into training and test sets in a 70/30 ratio for a regression task. Create several versions of the training and test sets by choosing different preprocessing algorithms: encoding of non-numeric data, feature scaling, and so on. Note that every encoder must be fitted on the training set only, the statistics used for scaling must be computed on the training set only, and so on.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer


In [74]:
X = df.drop(columns=['Weekly_Sales'])
y = df['Weekly_Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [None]:
cat = ['StoreId', 'IsHoliday']
num = ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment']

In [None]:
t1 = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat),
        ('num', StandardScaler(), num)
    ]).fit(X_train)

t2 = ColumnTransformer(
    transformers=[
        ('cat', OrdinalEncoder(), cat),
        ('num', MinMaxScaler(), num)
    ]).fit(X_train)

In [None]:
x_train_1 = t1.transform(X_train)
x_test_1 = t1.transform(X_test)
x_train_2 = t2.transform(X_train)
x_test_2 = t2.transform(X_test)
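
`t2` uses `OrdinalEncoder()` with default settings, which raises an error at transform time if the test set contains a category that never appeared in the training set. If that can happen, one option is to reserve a sentinel code for unseen categories. The sketch below uses invented toy frames; the sentinel value `-1` is an arbitrary choice.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train_cat = pd.DataFrame({"StoreId": ["a", "b", "c"]})
test_cat = pd.DataFrame({"StoreId": ["b", "z"]})  # "z" never appears in training

# Unseen categories are encoded as -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train_cat)
encoded_test = enc.transform(test_cat)
print(encoded_test.ravel())
```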

<p class="task" id="3"></p>

3\. Predict the `Weekly_Sales` column using the `sklearn` package. Demonstrate several different models and the values of the main regression metrics (MAE, MSE, RMSE, MAPE). Present the result as a table whose rows are the different combinations of model and dataset version (give these combinations names and use them as the index) and whose columns are the metrics on the training and test sets (a two-level column index). Sort the table in descending order of any metric of your choice on the test set.

In [78]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error


In [79]:
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree Regressor': DecisionTreeRegressor(random_state=42),
    'Random Forest Regressor': RandomForestRegressor(random_state=42)
}


In [79]:
data = [
    ('One-Hot + StandardScaler', t1, x_train_1, x_test_1),
    ('Ordinal + MinMaxScaler', t2, x_train_2, x_test_2)
]

In [80]:
results_list = []


for model_name, model in models.items():
    for t_name, t, X_train_t, X_test_t in data:

        model.fit(X_train_t, y_train)


        train_preds = model.predict(X_train_t)
        test_preds = model.predict(X_test_t)


        train_mae = mean_absolute_error(y_train, train_preds)
        test_mae = mean_absolute_error(y_test, test_preds)
        train_mse = mean_squared_error(y_train, train_preds)
        test_mse = mean_squared_error(y_test, test_preds)
        # RMSE as the square root of MSE; the `squared=False` argument is deprecated in recent scikit-learn releases.
        train_rmse = train_mse ** 0.5
        test_rmse = test_mse ** 0.5
        train_mape = mean_absolute_percentage_error(y_train, train_preds)
        test_mape = mean_absolute_percentage_error(y_test, test_preds)


        results_list.append({
            'Model': model_name,
            'Preprocessing': t_name,
            'Train MAE': train_mae,
            'Test MAE': test_mae,
            'Train MSE': train_mse,
            'Test MSE': test_mse,
            'Train RMSE': train_rmse,
            'Test RMSE': test_rmse,
            'Train MAPE': train_mape,
            'Test MAPE': test_mape
        })


results = pd.DataFrame(results_list)


results.sort_values(by='Test RMSE', ascending=True, inplace=True)
results.set_index(['Model', 'Preprocessing'], inplace=True)
results


Model,Preprocessing,Train MAE,Test MAE,Train MSE,Test MSE,Train RMSE,Test RMSE,Train MAPE,Test MAPE
Random Forest Regressor,One-Hot + StandardScaler,28279.632704,77525.3761,3086802000.0,21611750000.0,55558.99625,147009.352702,0.025226,0.068106
Random Forest Regressor,Ordinal + MinMaxScaler,28595.587313,78563.843921,2979643000.0,21940610000.0,54586.105538,148123.619659,0.025758,0.069749
Linear Regression,One-Hot + StandardScaler,90743.52879,93555.736998,25008670000.0,26700110000.0,158141.294466,163401.687837,0.087711,0.088776
Decision Tree Regressor,Ordinal + MinMaxScaler,0.0,99416.364593,0.0,37173450000.0,0.0,192804.167575,0.0,0.087505
Decision Tree Regressor,One-Hot + StandardScaler,0.0,98994.00216,0.0,38590450000.0,0.0,196444.5213,0.0,0.08616
Linear Regression,Ordinal + MinMaxScaler,464703.299879,471701.904555,308973600000.0,313990200000.0,555853.934607,560348.324096,0.656809,0.664749
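
The task statement asks for a two-level column index and a descending sort, while the table above has flat column names and is sorted in ascending order. A possible way to reshape such a table is sketched below on a toy frame; the numbers and the names `model A` and `model B` are invented for illustration.

```python
import pandas as pd

flat = pd.DataFrame(
    {
        "Train MAE": [10.0, 12.0],
        "Test MAE": [15.0, 14.0],
        "Train RMSE": [20.0, 22.0],
        "Test RMSE": [30.0, 28.0],
    },
    index=pd.Index(["model A", "model B"], name="Combination"),
)

# Split "Train MAE" into ("Train", "MAE") and build a two-level column index.
two_level = flat.copy()
two_level.columns = pd.MultiIndex.from_tuples(
    [tuple(name.split(" ", 1)) for name in flat.columns],
    names=["Split", "Metric"],
)

# Sort by a test-set metric in descending order.
two_level = two_level.sort_values(by=("Test", "RMSE"), ascending=False)
print(two_level)
```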


<p class="task" id="4"></p>

4\. Repeat the solution to problem 3, using cross-validation to assess model quality on the training set. When presenting the result as a table, give the metric values as strings of the form "mean±std".

In [81]:
from sklearn.model_selection import cross_val_score

results_cv_list = []

num_folds = 5

for model_name, model in models.items():
    for t_name, t, X_train_t, _ in data:

        cv_scores_mae = cross_val_score(model, X_train_t, y_train, cv=num_folds, scoring='neg_mean_absolute_error')
        cv_scores_mse = cross_val_score(model, X_train_t, y_train, cv=num_folds, scoring='neg_mean_squared_error')
        cv_scores_rmse = cross_val_score(model, X_train_t, y_train, cv=num_folds, scoring='neg_root_mean_squared_error')
        cv_scores_mape = cross_val_score(model, X_train_t, y_train, cv=num_folds, scoring='neg_mean_absolute_percentage_error')

        cv_scores_mae = -cv_scores_mae
        cv_scores_mse = -cv_scores_mse
        cv_scores_rmse = -cv_scores_rmse
        cv_scores_mape = -cv_scores_mape


        mean_cv_mae = cv_scores_mae.mean()
        std_cv_mae = cv_scores_mae.std()

        mean_cv_mse = cv_scores_mse.mean()
        std_cv_mse = cv_scores_mse.std()

        mean_cv_rmse = cv_scores_rmse.mean()
        std_cv_rmse = cv_scores_rmse.std()

        mean_cv_mape = cv_scores_mape.mean()
        std_cv_mape = cv_scores_mape.std()

        results_cv_list.append({
            'Model': model_name,
            'Preprocessing': t_name,
            'CV MAE': f'{mean_cv_mae:.2f}±{std_cv_mae:.2f}',
            'CV MSE': f'{mean_cv_mse:.2f}±{std_cv_mse:.2f}',
            'CV RMSE': f'{mean_cv_rmse:.2f}±{std_cv_rmse:.2f}',
            'CV MAPE': f'{mean_cv_mape:.2f}±{std_cv_mape:.2f}'
        })


results_cv = pd.DataFrame(results_cv_list)


results_cv.set_index(['Model', 'Preprocessing'], inplace=True)
results_cv


Model,Preprocessing,CV MAE,CV MSE,CV RMSE,CV MAPE
Linear Regression,One-Hot + StandardScaler,92187.24±756.69,25715428165.12±1984110257.53,160242.56±6144.12,0.09±0.00
Linear Regression,Ordinal + MinMaxScaler,465378.64±4995.45,309745383759.33±11241476289.06,556456.07±10100.89,0.66±0.02
Decision Tree Regressor,One-Hot + StandardScaler,98763.16±4627.58,37530175087.98±5638729818.75,193165.69±14737.39,0.09±0.00
Decision Tree Regressor,Ordinal + MinMaxScaler,99165.33±4072.03,37431662077.96±6017940088.17,192835.11±15693.37,0.09±0.00
Random Forest Regressor,One-Hot + StandardScaler,78551.91±1890.27,23501532500.02±3356907824.83,152939.70±10534.72,0.07±0.00
Random Forest Regressor,Ordinal + MinMaxScaler,79483.54±2107.08,23294758413.47±3527295804.48,152213.67±11214.20,0.07±0.00
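
The loop above calls `cross_val_score` once per metric, so every model is refitted four times per dataset version. `cross_validate` accepts several scorers at once and fits each fold only once. A minimal sketch on synthetic regression data (the dataset and the choice of `LinearRegression` are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X_reg, y_reg = make_regression(n_samples=120, n_features=4, noise=10.0, random_state=0)

scoring = {
    "MAE": "neg_mean_absolute_error",
    "MSE": "neg_mean_squared_error",
    "RMSE": "neg_root_mean_squared_error",
}
cv_out = cross_validate(LinearRegression(), X_reg, y_reg, cv=5, scoring=scoring)

# Scorers are negated losses, so flip the sign before summarizing.
summary = {
    name: f"{(-cv_out[f'test_{name}']).mean():.2f}±{(-cv_out[f'test_{name}']).std():.2f}"
    for name in scoring
}
print(summary)
```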


<p class="task" id="5"></p>

5\. Split the dataset into training and test sets in a 70/30 ratio, preserving the distribution of the `IsHoliday` column, for a classification task. Create several versions of the training and test sets by choosing different preprocessing algorithms: encoding of non-numeric data, feature scaling, and so on. Note that every encoder must be fitted on the training set only, the statistics used for scaling must be computed on the training set only, and so on.

In [83]:
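# The target is now IsHoliday; stratify on it to preserve the class balance in both parts.
X = df.drop(columns=['IsHoliday'])
y = df['IsHoliday']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)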


cat = ['StoreId']
num = ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment']


t1 = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat),
        ('num', StandardScaler(), num)
    ]).fit(X_train)

t2 = ColumnTransformer(
    transformers=[
        ('cat', OrdinalEncoder(), cat),
        ('num', MinMaxScaler(), num)
    ]).fit(X_train)


x_train_1 = t1.transform(X_train)
x_test_1 = t1.transform(X_test)

x_train_2 = t2.transform(X_train)
x_test_2 = t2.transform(X_test)
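
A quick way to confirm that `stratify` kept the class distribution is to compare the share of positive labels before and after the split. The sketch below uses a synthetic imbalanced label vector (7% positives, an assumed figure) rather than the Walmart data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 synthetic labels, 70 of them positive.
labels = np.array([1] * 70 + [0] * 930)
features = np.arange(1000).reshape(-1, 1)

_, _, lab_train, lab_test = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=42
)

# The three shares should be nearly identical when stratification is applied.
print(labels.mean(), lab_train.mean(), lab_test.mean())
```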

<p class="task" id="6"></p>

6\. Predict the `IsHoliday` column using the `sklearn` package. Demonstrate several different models and the values of the main classification metrics (Accuracy, Precision, Recall, F1, AUC ROC). Present the result as a table whose rows are the different combinations of model and dataset version (give these combinations names and use them as the index) and whose columns are the metrics on the training and test sets (a two-level column index). Sort the table in descending order of any metric of your choice on the test set.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

In [None]:
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree Classifier': DecisionTreeClassifier(random_state=42),
    'Random Forest Classifier': RandomForestClassifier(random_state=42)
}


In [84]:
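results_list = []

# Dataset versions from problem 5, keyed by name: (train features, test features).
data = {
    'One-Hot + StandardScaler': (x_train_1, x_test_1),
    'Ordinal + MinMaxScaler': (x_train_2, x_test_2),
}

for model_name, model in models.items():
    for t_name, (train_data, test_data) in data.items():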

        model.fit(train_data, y_train)


        train_preds = model.predict(train_data)
        test_preds = model.predict(test_data)

        accuracy_train = accuracy_score(y_train, train_preds)
        accuracy_test = accuracy_score(y_test, test_preds)
        precision_train = precision_score(y_train, train_preds)
        precision_test = precision_score(y_test, test_preds)
        recall_train = recall_score(y_train, train_preds)
        recall_test = recall_score(y_test, test_preds)
        f1_train = f1_score(y_train, train_preds)
        f1_test = f1_score(y_test, test_preds)
        roc_auc_train = roc_auc_score(y_train, model.predict_proba(train_data)[:, 1])
        roc_auc_test = roc_auc_score(y_test, model.predict_proba(test_data)[:, 1])

        results_list.append({
            'Model': model_name,
            'Dataset': t_name,
            'Train Accuracy': accuracy_train,
            'Test Accuracy': accuracy_test,
            'Train Precision': precision_train,
            'Test Precision': precision_test,
            'Train Recall': recall_train,
            'Test Recall': recall_test,
            'Train F1': f1_train,
            'Test F1': f1_test,
            'Train ROC AUC': roc_auc_train,
            'Test ROC AUC': roc_auc_test
        })


results_df = pd.DataFrame(results_list)


results_df.set_index(['Model', 'Dataset'], inplace=True)
results_df.sort_values(by='Test Accuracy', ascending=False, inplace=True)


results_df




Model,Dataset,Train Accuracy,Test Accuracy,Train Precision,Test Precision,Train Recall,Test Recall,Train F1,Test F1,Train ROC AUC,Test ROC AUC
Decision Tree Classifier,Ordinal + MinMaxScaler,1.0,0.948731,1.0,0.62,1.0,0.688889,1.0,0.652632,1.0,0.828576
Random Forest Classifier,Ordinal + MinMaxScaler,0.999778,0.946142,1.0,0.860465,0.996825,0.274074,0.99841,0.41573,1.0,0.955655
Decision Tree Classifier,One-Hot + StandardScaler,1.0,0.93941,1.0,0.557692,1.0,0.644444,1.0,0.597938,1.0,0.803013
Logistic Regression,One-Hot + StandardScaler,0.930062,0.930088,0.0,0.0,0.0,0.0,0.0,0.0,0.705672,0.653126
Logistic Regression,Ordinal + MinMaxScaler,0.930062,0.930088,0.0,0.0,0.0,0.0,0.0,0.0,0.686722,0.653275
Random Forest Classifier,One-Hot + StandardScaler,0.999778,0.927499,1.0,0.333333,0.996825,0.037037,0.99841,0.066667,1.0,0.830685
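
The Logistic Regression rows in the table above combine an accuracy of about 0.93 with precision, recall, and F1 of exactly 0. That pattern appears when a model predicts only the majority class on imbalanced data. The sketch below reproduces it with hand-made labels (93 negatives and 7 positives, an assumed split) and shows the `zero_division` parameter, which sets the value reported when a metric is undefined and suppresses the corresponding warning.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = np.array([0] * 93 + [1] * 7)
y_pred = np.zeros(100, dtype=int)  # a model that always predicts the majority class

acc = accuracy_score(y_true, y_pred)
# No positive predictions were made, so precision is undefined; report it as 0.
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred, zero_division=0)
print(acc, prec, rec)
```

For this reason, accuracy alone is a weak criterion for ranking models on this target; recall, F1, or ROC AUC say more about how well holidays are detected.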


<p class="task" id="7"></p>

7\. Repeat problem 6, using a grid search over hyperparameters to improve the models' metrics. When presenting the result as a table, give the best hyperparameters in the model-name column in the form "LogisticRegression(C=1, class_weight=None)".
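
The label format requested here can be produced from a `best_params_` dictionary with a small helper. `describe_model` below is a hypothetical function written for this illustration, not part of scikit-learn; in a grid-search loop the class name could be taken from `type(best_model).__name__`.

```python
def describe_model(estimator_name, best_params):
    """Hypothetical helper: render 'Name(key=value, ...)' from a best_params_ dict."""
    args = ", ".join(f"{key}={value!r}" for key, value in best_params.items())
    return f"{estimator_name}({args})"


label = describe_model("LogisticRegression", {"C": 1, "class_weight": None})
print(label)  # LogisticRegression(C=1, class_weight=None)
```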

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
models = {
    'Logistic Regression': (LogisticRegression(random_state=42), {'C': [0.01, 0.1, 1, 10], 'class_weight': [None, 'balanced']}),
    'Decision Tree Classifier': (DecisionTreeClassifier(random_state=42), {'max_depth': [None, 10, 20, 30], 'min_samples_split': [2, 5, 10]}),
    'Random Forest Classifier': (RandomForestClassifier(random_state=42), {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20, 30], 'min_samples_split': [2, 5, 10]})
}

In [85]:
results_list = []


for model_name, (model, param_grid) in models.items():

    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=3)

    for t_name, (train_data, test_data) in data.items():

        grid_search.fit(train_data, y_train)
        best_model = grid_search.best_estimator_


        train_preds = best_model.predict(train_data)
        test_preds = best_model.predict(test_data)


        accuracy_train = accuracy_score(y_train, train_preds)
        accuracy_test = accuracy_score(y_test, test_preds)
        precision_train = precision_score(y_train, train_preds)
        precision_test = precision_score(y_test, test_preds)
        recall_train = recall_score(y_train, train_preds)
        recall_test = recall_score(y_test, test_preds)
        f1_train = f1_score(y_train, train_preds)
        f1_test = f1_score(y_test, test_preds)
        roc_auc_train = roc_auc_score(y_train, best_model.predict_proba(train_data)[:, 1])
        roc_auc_test = roc_auc_score(y_test, best_model.predict_proba(test_data)[:, 1])


        results_list.append({
            'Model': f'{model_name}({grid_search.best_params_})',
            'Dataset': t_name,
            'Train Accuracy': accuracy_train,
            'Test Accuracy': accuracy_test,
            'Train Precision': precision_train,
            'Test Precision': precision_test,
            'Train Recall': recall_train,
            'Test Recall': recall_test,
            'Train F1': f1_train,
            'Test F1': f1_test,
            'Train ROC AUC': roc_auc_train,
            'Test ROC AUC': roc_auc_test
        })


results_grid_df = pd.DataFrame(results_list)

results_grid_df.set_index(['Model', 'Dataset'], inplace=True)
results_grid_df.sort_values(by='Test Accuracy', ascending=False, inplace=True)


results_grid_df


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Model,Dataset,Train Accuracy,Test Accuracy,Train Precision,Test Precision,Train Recall,Test Recall,Train F1,Test F1,Train ROC AUC,Test ROC AUC
"Decision Tree Classifier({'max_depth': None, 'min_samples_split': 5})",Ordinal + MinMaxScaler,0.992451,0.950285,0.976271,0.638298,0.914286,0.666667,0.944262,0.652174,0.999438,0.843799
"Decision Tree Classifier({'max_depth': 20, 'min_samples_split': 10})",One-Hot + StandardScaler,0.97913,0.946142,0.929961,0.622047,0.75873,0.585185,0.835664,0.603053,0.993052,0.839837
"Random Forest Classifier({'max_depth': None, 'min_samples_split': 2, 'n_estimators': 200})",Ordinal + MinMaxScaler,1.0,0.945106,1.0,0.837209,1.0,0.266667,1.0,0.404494,1.0,0.96055
"Random Forest Classifier({'max_depth': 20, 'min_samples_split': 5, 'n_estimators': 100})",One-Hot + StandardScaler,0.941829,0.930606,1.0,0.538462,0.168254,0.051852,0.288043,0.094595,0.997244,0.832508
"Logistic Regression({'C': 0.01, 'class_weight': None})",One-Hot + StandardScaler,0.930062,0.930088,0.0,0.0,0.0,0.0,0.0,0.0,0.687281,0.660171
"Logistic Regression({'C': 0.01, 'class_weight': None})",Ordinal + MinMaxScaler,0.930062,0.930088,0.0,0.0,0.0,0.0,0.0,0.0,0.649121,0.581679


<p class="task" id="8"></p>

8\. Plot ROC curves for all the models trained in problem 7. Draw them on a single set of axes, and add axis labels and a legend.
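
One possible shape for such a plot is sketched below. It is not a solution for the Walmart models: it trains two small classifiers on synthetic imbalanced data (all names and settings here are assumptions) and overlays their ROC curves using `roc_curve` and matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, weights=[0.9], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

fitted = {
    "LogisticRegression": LogisticRegression(max_iter=1000).fit(Xtr, ytr),
    "DecisionTree": DecisionTreeClassifier(max_depth=3, random_state=0).fit(Xtr, ytr),
}

fig, ax = plt.subplots()
aucs = {}
for name, clf in fitted.items():
    # ROC curves need scores, so use the predicted probability of class 1.
    proba = clf.predict_proba(Xte)[:, 1]
    fpr, tpr, _ = roc_curve(yte, proba)
    aucs[name] = roc_auc_score(yte, proba)
    ax.plot(fpr, tpr, label=f"{name} (AUC={aucs[name]:.2f})")

ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend()
```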

<p class="task" id="9"></p>

9\. Using any of the trained models, predict the `IsHoliday` column for the test set and save the result as a CSV file in the following format:

```
id,isHoliday
1,0
2,1
...
```
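
A file in this format can be written with `DataFrame.to_csv`. In the sketch below the predictions are placeholder values, the file name `predictions.csv` is hypothetical, and numbering the ids from 1 is an assumption based on the sample above.

```python
import numpy as np
import pandas as pd

# Placeholder predictions; in the notebook these would come from model.predict(...).
test_predictions = np.array([0, 1, 0, 0])

submission = pd.DataFrame({
    "id": np.arange(1, len(test_predictions) + 1),
    "isHoliday": test_predictions,
})
# index=False keeps the DataFrame index out of the file.
submission.to_csv("predictions.csv", index=False)

with open("predictions.csv") as fh:
    written = fh.read()
print(written)
```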

## Feedback
- [ ] I would like feedback on my solution