##### This is the final assignment for the course "Learning from Labeled Data".

In it you will compare logistic regression and random forest on different feature sets. The data is the Adult Data Set from the UCI repository. The task is to predict whether a person earns more than $50,000 a year from features such as sex, education, race and others. A detailed description is available at: https://archive.ics.uci.edu/ml/datasets/Adult.

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

### Loading the data

Throughout the assignment we use __train_data.csv__ both to train the models and to run cross-validation, and __test_data.csv__ as the hold-out set.

In [2]:
column_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
                "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
                "hours-per-week", "native-country", "class"]

In [3]:
train = pd.read_csv("train_data.csv", sep=", ", header=None, engine="python", names=column_names)

In [4]:
train

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [5]:
test = pd.read_csv("test_data.csv", sep=", ", header=None, engine="python", names=column_names)

In [6]:
test

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16276,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
16277,64,?,321403,HS-grad,9,Widowed,?,Other-relative,Black,Male,0,0,40,United-States,<=50K
16278,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
16279,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K


### 1. Missing values

About 7% of the rows in the training set have missing values (a question mark __'?'__ appears in place of the field value). Which features contain missing values in the training and test sets? Since all of them occur in categorical features, there is no need to impute them for now; they can simply be treated as one more category.

Check for NaN

In [7]:
train.isnull().any()

age               False
workclass         False
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation        False
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country    False
class             False
dtype: bool

Find the columns that contain the ? character

In [8]:
train.apply(lambda x: x.astype('str').str.contains(r'\?')).any()

age               False
workclass          True
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation         True
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country     True
class             False
dtype: bool

Missing values occur in the columns workclass, occupation and native-country.

In [9]:
# Rows that contain '?' in at least one column
train[(train.values.ravel() == '?').reshape(train.shape).any(axis=1)].shape

(2399, 15)

In [10]:
print("Share of rows with missing values:", round(2399 / 32561 * 100), '%')

Share of rows with missing values: 7 %
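The same share can be computed without hard-coding the row counts. The sketch below uses a small made-up frame `df` in place of `train` (its values are illustrative only) and assumes that missing values are marked with the exact string `'?'`, as in this dataset.

```python
import pandas as pd

# Toy frame standing in for `train`; '?' marks a missing categorical value.
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "Private"],
    "occupation": ["Adm-clerical", "?", "Sales", "?"],
})

# Element-wise comparison with '?'; numeric columns simply yield False.
is_missing = df.eq("?")

missing_per_column = is_missing.sum()        # number of '?' cells in each column
rows_with_missing = is_missing.any(axis=1)   # True for rows with at least one '?'
share = rows_with_missing.mean()             # fraction of affected rows

print(missing_per_column)
print(f"Share of rows with missing values: {share:.0%}")
```

Applied to the real data, `train.eq("?")` would replace `df.eq("?")`.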


### 2. Training on continuous features

In this section, train the models on the continuous features only ("continuous" in the data description). Train a logistic regression (linear_model.LogisticRegression) and a random forest (ensemble.RandomForestClassifier) from sklearn. For the former, tune $penalty$ and $C$ over the range $[10^{-6}, 10^{6}]$ (powers of $10$ with an exponent step of $1$, starting at $-6$); for the latter, with the number of trees fixed at 50, tune $max\_depth$ over $[2, 14]$ with a step of 2 and $min\_samples\_split$ over the set $\{1, 2, 4, 8\}$. Use AUC-ROC as the target quality metric and stratified 5-fold cross-validation as the validation scheme. Which parameters turned out to be optimal?

Note that the target variable in the dataset is a string, so it first has to be converted to a binary value. Also remember to scale the data with StandardScaler from the preprocessing module.
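One possible way to set up this search is sketched below. It is not the approach used in the cells that follow: it wraps the scaler and the classifier in a `Pipeline`, so the scaler is refit on the training part of every fold instead of once on the full training set. The synthetic data from `make_classification`, the names `pipe` and `search`, and the reduced grid of exponents are all assumptions made to keep the example small and fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six continuous features and the binary target.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           random_state=0)

# Scaling lives inside the pipeline, so it is refit within each CV fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),  # liblinear supports l1 and l2
])

# Reduced grid (exponents -3..3) for speed; the task asks for -6..6.
param_grid = {
    "clf__C": [10.0 ** p for p in range(-3, 4)],
    "clf__penalty": ["l1", "l2"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Parameters of a pipeline step are addressed as `<step name>__<parameter>`, which is why the grid keys start with `clf__`.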

Column types

In [11]:
train.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
class             object
dtype: object

Select the continuous features

In [12]:
X_train_int = train.select_dtypes(include=['int64'])
X_test_int = test.select_dtypes(include=['int64'])

In [13]:
X_train_int.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
0,39,77516,13,2174,0,40
1,50,83311,13,0,0,13
2,38,215646,9,0,0,40
3,53,234721,7,0,0,40
4,28,338409,13,0,0,40


Check the target variable

In [14]:
pd.unique(train['class'])

array(['<=50K', '>50K'], dtype=object)

Encode the target variable

In [15]:
le = LabelEncoder()

In [16]:
y_train = le.fit_transform(train['class'])
y_test = le.transform(test['class'])

In [17]:
y_train = pd.Series(y_train, name='class')
y_test = pd.Series(y_test, name='class')

Feature scaling

In [17]:
scaler = StandardScaler()

In [18]:
X_train_int_std = scaler.fit_transform(X_train_int)
X_test_int_std = scaler.transform(X_test_int)

In [19]:
X_train_int_std = pd.DataFrame(X_train_int_std, columns=X_train_int.columns)
X_test_int_std = pd.DataFrame(X_test_int_std, columns=X_test_int.columns)

In [20]:
X_train_int_std.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
0,0.030671,-1.063611,1.134739,0.148453,-0.21666,-0.035429
1,0.837109,-1.008707,1.134739,-0.14592,-0.21666,-2.222153
2,-0.042642,0.245079,-0.42006,-0.14592,-0.21666,-0.035429
3,1.057047,0.425801,-1.197459,-0.14592,-0.21666,-0.035429
4,-0.775768,1.408176,1.134739,-0.14592,-0.21666,-0.035429


List the hyperparameters of logistic regression

In [21]:
lr = LogisticRegression()
lr.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

Hyperparameter grid

In [22]:
param_grid_lr = [
  {'C': [10**p for p in range(-6, 7)], 'penalty': ['l1', 'l2']}
]

Training logistic regression

The parameter cv=5 requests 5-fold cross-validation; because the estimator is a classifier, GridSearchCV uses StratifiedKFold rather than plain KFold.
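What stratification means here can be seen on a toy target. The sketch below builds a made-up, imbalanced label vector and checks the share of positives in every test fold; the array sizes are chosen only so that the classes divide evenly into five folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy target: 75 negatives and 25 positives.
y = np.array([0] * 75 + [1] * 25)
X = np.zeros((len(y), 1))  # feature values do not influence the split

# For a classifier, an integer cv is resolved to StratifiedKFold without shuffling.
skf = StratifiedKFold(n_splits=5)

# Share of positives in each test fold.
fold_positive_rates = [float(y[test_idx].mean()) for _, test_idx in skf.split(X, y)]
print(fold_positive_rates)
```

Every fold keeps the overall 25% share of positives, which matters for a metric such as AUC-ROC when the classes are imbalanced.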

In [24]:
log_reg = GridSearchCV(estimator=LogisticRegression(solver='liblinear'), param_grid=param_grid_lr, scoring='roc_auc', cv=5)

In [25]:
log_reg.fit(X=X_train_int_std, y=y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='liblinear',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid=[{'C': [1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1, 10,
                                100, 1000, 10000, 100000, 1000000],
                          'penalty': ['l1', 'l2']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=0)

In [26]:
log_reg.best_params_

{'C': 0.1, 'penalty': 'l2'}

For logistic regression the best parameters are C = 0.1 with L2 regularization.

In [27]:
log_reg.best_score_

0.8316016479529169

Hyperparameter grid for the random forest

In [28]:
param_grid_rf = [
  {'max_depth': [i for i in range(2, 15, 2)], 'min_samples_split': [2, 4, 8]}
]
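The grid above leaves out the value 1 that the task lists for $min\_samples\_split$. A likely reason is that scikit-learn requires an integer `min_samples_split` to be at least 2, since a node with a single sample cannot be split. The sketch below checks this on synthetic data; the data and the flag name `rejected` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem; only the parameter validation matters here.
X, y = make_classification(n_samples=100, random_state=0)

try:
    # An integer min_samples_split below 2 is rejected when fit is called.
    RandomForestClassifier(n_estimators=5, min_samples_split=1).fit(X, y)
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```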

Training the random forest

In [29]:
rand_for = GridSearchCV(estimator=RandomForestClassifier(n_estimators=50), param_grid=param_grid_rf, scoring='roc_auc', cv=5)

In [30]:
rand_for.fit(X=X_train_int_std, y=y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=50, n_jobs=None,
                                              oob_score=False,
                                              random

In [31]:
rand_for.best_params_

{'max_depth': 12, 'min_samples_split': 4}

For the random forest the best parameters are: maximum depth = 12, minimum number of samples required to split a node = 4.

In [32]:
rand_for.best_score_

0.8638313690735071

Compute accuracy, precision, recall, F1-score and AUC-ROC on the hold-out set for the optimal models. What values do you get? Which algorithm is better?
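The five metrics can also be collected in one call, so each model predicts only once. The helper `evaluate` below is a hypothetical convenience function, not part of the notebook, and the synthetic data and the plain `LogisticRegression` model are stand-ins for the real features and tuned estimators.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)


def evaluate(model, X, y):
    """Return the five hold-out metrics for a fitted binary classifier."""
    y_pred = model.predict(X)               # hard labels for threshold-based metrics
    y_score = model.predict_proba(X)[:, 1]  # positive-class probability for AUC-ROC
    return {
        "accuracy": accuracy_score(y, y_pred),
        "precision": precision_score(y, y_pred),
        "recall": recall_score(y, y_pred),
        "f1": f1_score(y, y_pred),
        "roc_auc": roc_auc_score(y, y_score),
    }


# Synthetic data: the first 300 rows for training, the last 100 as a hold-out set.
X, y = make_classification(n_samples=400, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X[:300], y[:300])
metrics = evaluate(model, X[300:], y[300:])
print(metrics)
```

A fitted `GridSearchCV` with `refit=True` exposes `predict` and `predict_proba` of its best estimator, so the same helper should accept `log_reg` and `rand_for` directly.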

Logistic regression

In [33]:
accuracy_score(y_test, log_reg.predict(X_test_int_std))

0.8132178613107303

In [34]:
precision_score(y_test, log_reg.predict(X_test_int_std))

0.6871222687122269

In [35]:
recall_score(y_test, log_reg.predict(X_test_int_std))

0.3842953718148726

In [36]:
f1_score(y_test, log_reg.predict(X_test_int_std))

0.4929131232282808

In [51]:
roc_auc_score(y_test, log_reg.predict_proba(X_test_int_std)[:,1])

0.8255017510712491

Random forest

In [38]:
accuracy_score(y_test, rand_for.predict(X_test_int_std))

0.8379092193354216

In [39]:
precision_score(y_test, rand_for.predict(X_test_int_std))

0.7700223713646532

In [40]:
recall_score(y_test, rand_for.predict(X_test_int_std))

0.44747789911596464

In [41]:
f1_score(y_test, rand_for.predict(X_test_int_std))

0.5660253247821081

In [52]:
roc_auc_score(y_test, rand_for.predict_proba(X_test_int_std)[:,1])

0.8616730555832607

The random forest is better: it scores higher on all five metrics.

### 3. Categorical features as is

Now add the categorical features to the continuous ones, converting them to numbers with LabelEncoder from the preprocessing module. Re-tune the parameters of logistic regression and random forest as in the previous section. How did the quality of the models on the test set change? How can you explain this?

Select the categorical features

In [53]:
X_train_cat = train.select_dtypes(include=['object']).iloc[:, :-1]
X_test_cat = test.select_dtypes(include=['object']).iloc[:, :-1]

In [54]:
X_train_cat.head()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba


Check that the feature values in the training and test sets match

In [55]:
X_train_cat.describe()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country
count,32561,32561,32561,32561,32561,32561,32561,32561
unique,9,16,7,15,6,5,2,42
top,Private,HS-grad,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States
freq,22696,10501,14976,4140,13193,27816,21790,29170


In [56]:
X_train_cat['native-country'].unique()

array(['United-States', 'Cuba', 'Jamaica', 'India', '?', 'Mexico',
       'South', 'Puerto-Rico', 'Honduras', 'England', 'Canada', 'Germany',
       'Iran', 'Philippines', 'Italy', 'Poland', 'Columbia', 'Cambodia',
       'Thailand', 'Ecuador', 'Laos', 'Taiwan', 'Haiti', 'Portugal',
       'Dominican-Republic', 'El-Salvador', 'France', 'Guatemala',
       'China', 'Japan', 'Yugoslavia', 'Peru',
       'Outlying-US(Guam-USVI-etc)', 'Scotland', 'Trinadad&Tobago',
       'Greece', 'Nicaragua', 'Vietnam', 'Hong', 'Ireland', 'Hungary',
       'Holand-Netherlands'], dtype=object)

In [57]:
X_test_cat.describe()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country
count,16281,16281,16281,16281,16281,16281,16281,16281
unique,9,16,7,15,6,5,2,41
top,Private,HS-grad,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States
freq,11210,5283,7403,2032,6523,13946,10860,14662


In [58]:
X_test_cat['native-country'].unique()

array(['United-States', '?', 'Peru', 'Guatemala', 'Mexico',
       'Dominican-Republic', 'Ireland', 'Germany', 'Philippines',
       'Thailand', 'Haiti', 'El-Salvador', 'Puerto-Rico', 'Vietnam',
       'South', 'Columbia', 'Japan', 'India', 'Cambodia', 'Poland',
       'Laos', 'England', 'Cuba', 'Taiwan', 'Italy', 'Canada', 'Portugal',
       'China', 'Nicaragua', 'Honduras', 'Iran', 'Scotland', 'Jamaica',
       'Ecuador', 'Yugoslavia', 'Hungary', 'Hong', 'Greece',
       'Trinadad&Tobago', 'Outlying-US(Guam-USVI-etc)', 'France'],
      dtype=object)

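The test set has fewer distinct values of native-country: Holand-Netherlands is absent. Every value present in the test set also occurs in the training set, so an encoder fitted on the training data will not raise errors on the test data.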
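The direction of the mismatch matters. The sketch below shows both cases on a few made-up values: a test set that is a subset of the training categories encodes without problems, while a value never seen during `fit` makes `LabelEncoder.transform` raise a `ValueError`. The value `"Atlantis"` is hypothetical and does not occur in the dataset.

```python
from sklearn.preprocessing import LabelEncoder

train_values = ["Cuba", "United-States", "Holand-Netherlands"]
test_values = ["United-States", "Cuba"]  # subset of the training categories

enc = LabelEncoder().fit(train_values)

# Safe case: every test value was seen during fit.
codes = [int(c) for c in enc.transform(test_values)]
print(codes)

# Unsafe case: a category absent from the training data.
try:
    enc.transform(["Atlantis"])  # hypothetical unseen value
    unseen_failed = False
except ValueError:
    unseen_failed = True

print(unseen_failed)
```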

Encode the categorical features

In [59]:
cat_encoder = LabelEncoder()
X_train_cat_enc = X_train_cat.copy()
X_test_cat_enc = X_test_cat.copy()

In [60]:
for col in X_train_cat.columns:
    X_train_cat_enc[col] = cat_encoder.fit_transform(X_train_cat[col])
    X_test_cat_enc[col] = cat_encoder.transform(X_test_cat[col])

In [61]:
X_train_cat_enc.head()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,7,9,4,1,1,4,1,39
1,6,9,2,4,0,4,1,39
2,4,11,0,6,1,4,1,39
3,4,1,2,6,0,2,1,39
4,4,9,2,10,5,2,0,5


Combine the categorical and continuous features

In [62]:
X_train_int_std_enc = X_train_int_std.join(X_train_cat_enc)
X_test_int_std_enc = X_test_int_std.join(X_test_cat_enc)

In [63]:
X_train_int_std_enc.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,0.030671,-1.063611,1.134739,0.148453,-0.21666,-0.035429,7,9,4,1,1,4,1,39
1,0.837109,-1.008707,1.134739,-0.14592,-0.21666,-2.222153,6,9,2,4,0,4,1,39
2,-0.042642,0.245079,-0.42006,-0.14592,-0.21666,-0.035429,4,11,0,6,1,4,1,39
3,1.057047,0.425801,-1.197459,-0.14592,-0.21666,-0.035429,4,1,2,6,0,2,1,39
4,-0.775768,1.408176,1.134739,-0.14592,-0.21666,-0.035429,4,9,2,10,5,2,0,5


Training logistic regression

In [64]:
log_reg.fit(X=X_train_int_std_enc, y=y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='liblinear',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid=[{'C': [1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1, 10,
                                100, 1000, 10000, 100000, 1000000],
                          'penalty': ['l1', 'l2']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=0)

In [65]:
log_reg.best_params_

{'C': 1, 'penalty': 'l2'}

In [66]:
log_reg.best_score_

0.8538838320967315

Training the forest

In [67]:
rand_for.fit(X=X_train_int_std_enc, y=y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=50, n_jobs=None,
                                              oob_score=False,
                                              random

In [68]:
rand_for.best_params_

{'max_depth': 14, 'min_samples_split': 2}

In [69]:
rand_for.best_score_

0.9170091349959073

Logistic regression metrics

In [70]:
accuracy_score(y_test, log_reg.predict(X_test_int_std_enc))

0.8244579571279406

In [71]:
precision_score(y_test, log_reg.predict(X_test_int_std_enc))

0.7013039934800326

In [72]:
recall_score(y_test, log_reg.predict(X_test_int_std_enc))

0.44747789911596464

In [73]:
f1_score(y_test, log_reg.predict(X_test_int_std_enc))

0.5463492063492064

In [74]:
roc_auc_score(y_test, log_reg.predict_proba(X_test_int_std_enc)[:,1])

0.8509484995403033

Random forest metrics

In [75]:
accuracy_score(y_test, rand_for.predict(X_test_int_std_enc))

0.8636447392666298

In [76]:
precision_score(y_test, rand_for.predict(X_test_int_std_enc))

0.7780437756497948

In [77]:
recall_score(y_test, rand_for.predict(X_test_int_std_enc))

0.5915236609464378

In [78]:
f1_score(y_test, rand_for.predict(X_test_int_std_enc))

0.672082717872969

In [79]:
roc_auc_score(y_test, rand_for.predict_proba(X_test_int_std_enc)[:,1])

0.9158716537644216

Quality improved compared with training without the categorical features. This means that the target depends on the categorical features; once they are included, the model picks up these dependencies and gains predictive power. The random forest, however, improved much more than logistic regression (test AUC-ROC 0.862 to 0.916 versus 0.826 to 0.851). This may be because LabelEncoder produces an ordered feature: a tree can isolate individual codes with a few threshold splits, whereas a linear model has to treat the arbitrary codes as a numeric scale.
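The order that LabelEncoder introduces is simply the sorted order of the category strings. The sketch below encodes a handful of education levels taken from the dataset and prints the resulting codes; the variable name `mapping` is illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# A few education levels, listed here from lowest to highest attainment.
education = ["11th", "HS-grad", "Bachelors", "Masters", "Doctorate"]

enc = LabelEncoder().fit(education)

# classes_ holds the categories in sorted order; position i is encoded as i.
mapping = {str(label): int(code)
           for label, code in zip(enc.classes_, enc.transform(enc.classes_))}
print(mapping)
```

Here HS-grad receives a larger code than Doctorate, so the codes carry no meaningful scale. A linear model with a single coefficient on this column is forced to treat them as one.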

### 4. Binary (one-hot) encoding of categorical features

Now replace the categorical features from the previous section with their binary (one-hot) encoded versions. Re-tune the model parameters once more and check the quality on the test set. How did the quality change relative to the previous section? How can you explain this?

One-hot encode the categorical features

In [80]:
ohe = OneHotEncoder(sparse=False)

In [81]:
X_train_cat_ohe = pd.DataFrame(ohe.fit_transform(X=X_train_cat))
X_test_cat_ohe = pd.DataFrame(ohe.transform(X=X_test_cat))

In [82]:
X_train_cat_ohe.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,92,93,94,95,96,97,98,99,100,101
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Join the continuous and one-hot encoded features

In [83]:
X_train_int_std_ohe = X_train_int_std.join(X_train_cat_ohe)
X_test_int_std_ohe = X_test_int_std.join(X_test_cat_ohe)

In [84]:
X_train_int_std_ohe.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,0,1,2,3,...,92,93,94,95,96,97,98,99,100,101
0,0.030671,-1.063611,1.134739,0.148453,-0.21666,-0.035429,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.837109,-1.008707,1.134739,-0.14592,-0.21666,-2.222153,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,-0.042642,0.245079,-0.42006,-0.14592,-0.21666,-0.035429,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,1.057047,0.425801,-1.197459,-0.14592,-0.21666,-0.035429,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,-0.775768,1.408176,1.134739,-0.14592,-0.21666,-0.035429,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Training logistic regression

In [85]:
log_reg.fit(X=X_train_int_std_ohe, y=y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='liblinear',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid=[{'C': [1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1, 10,
                                100, 1000, 10000, 100000, 1000000],
                          'penalty': ['l1', 'l2']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=0)

In [86]:
log_reg.best_params_

{'C': 1, 'penalty': 'l2'}

In [87]:
log_reg.best_score_

0.9068726854359965

Training the forest

In [88]:
rand_for.fit(X=X_train_int_std_ohe, y=y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=50, n_jobs=None,
                                              oob_score=False,
                                              random

In [89]:
rand_for.best_params_

{'max_depth': 14, 'min_samples_split': 2}

In [90]:
rand_for.best_score_

0.9148612803529078

Logistic regression metrics

In [91]:
accuracy_score(y_test, log_reg.predict(X_test_int_std_ohe))

0.8530188563356059

In [92]:
precision_score(y_test, log_reg.predict(X_test_int_std_ohe))

0.7301235350015838

In [93]:
recall_score(y_test, log_reg.predict(X_test_int_std_ohe))

0.5993239729589184

In [94]:
f1_score(y_test, log_reg.predict(X_test_int_std_ohe))

0.6582893045837498

In [95]:
roc_auc_score(y_test, log_reg.predict_proba(X_test_int_std_ohe)[:,1])

0.9054543950957877

Random forest metrics

In [96]:
accuracy_score(y_test, rand_for.predict(X_test_int_std_ohe))

0.8613721515877403

In [97]:
precision_score(y_test, rand_for.predict(X_test_int_std_ohe))

0.7879666545849946

In [98]:
recall_score(y_test, rand_for.predict(X_test_int_std_ohe))

0.5652626105044202

In [99]:
f1_score(y_test, rand_for.predict(X_test_int_std_ohe))

0.6582891748675247

In [100]:
roc_auc_score(y_test, rand_for.predict_proba(X_test_int_std_ohe)[:,1])

0.9136269809457436

The quality of logistic regression improved compared with LabelEncoder encoding, while that of the random forest decreased slightly.
One-hot encoding of categorical features works poorly for trees, especially when a feature has many levels. It makes the feature matrix sparser, and the tree treats every binary column as an independent variable. Splitting a node on a sparse binary feature reduces the spread of the targets only slightly, so the tree tends to pick continuous variables closer to the root. As a result, the random forest gains less over the continuous-only baseline with continuous + one-hot features than with continuous + LabelEncoder features.
Logistic regression, in turn, improves because the artificially introduced ordering of the categorical values has been removed.
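The effect of the artificial order on a linear model can be reproduced on a toy problem. In the sketch below, which uses made-up data, only the middle category "B" is positive. With integer codes the model sees a single monotone feature and cannot single out the middle value, while with one-hot columns each category gets its own coefficient. Scores are computed on the training data purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Three categories; only the middle one (in sorted order) is positive.
cats = np.array(["A", "B", "C"] * 50)
y = (cats == "B").astype(int)

# Integer codes 0, 1, 2 in a single column.
X_label = LabelEncoder().fit_transform(cats).reshape(-1, 1).astype(float)
# One indicator column per category (sparse matrix, accepted by LogisticRegression).
X_onehot = OneHotEncoder().fit_transform(cats.reshape(-1, 1))

# Training-set AUC-ROC for each encoding.
auc_label = roc_auc_score(
    y, LogisticRegression().fit(X_label, y).predict_proba(X_label)[:, 1])
auc_onehot = roc_auc_score(
    y, LogisticRegression().fit(X_onehot, y).predict_proba(X_onehot)[:, 1])

print(auc_label, auc_onehot)
```

The label-encoded model should land near chance level, whereas the one-hot model separates the classes completely. A tree would not be affected in the same way, since two threshold splits on the integer code are enough to isolate the middle category.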