# Data preparation

## Importing the data

In [1]:
import pandas as pd
dataset = pd.read_csv('../data/adult_preprocessed.csv')
dataset.head()

Unnamed: 0,age,fnlwgt,education-num,sex,capital-gain,capital-loss,hours-per-week,salary,workclass_Federal-gov,workclass_Local-gov,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,39,77516,13,0,2174,0,40,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,0,13,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,0,40,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,0,40,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,1,0,0,40,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Anomalies

Based on the results obtained in the file 1_visualization, the parameters *hours-per-week* and *capital-gain* contain anomalies. Let us check what happens to the data if we discard the values lying more than 3\*IQR below the first quartile Q1 or above the third quartile Q3.

In [2]:
first_quartile = dataset['hours-per-week'].describe()['25%']
third_quartile = dataset['hours-per-week'].describe()['75%']

iqr = third_quartile - first_quartile

dataset[(dataset['hours-per-week'] > (first_quartile - 3 * iqr)) &
            (dataset['hours-per-week'] < (third_quartile + 3 * iqr))].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24444 entries, 0 to 30161
Data columns (total 88 columns):
 #   Column                                     Non-Null Count  Dtype
---  ------                                     --------------  -----
 0   age                                        24444 non-null  int64
 1   fnlwgt                                     24444 non-null  int64
 2   education-num                              24444 non-null  int64
 3   sex                                        24444 non-null  int64
 4   capital-gain                               24444 non-null  int64
 5   capital-loss                               24444 non-null  int64
 6   hours-per-week                             24444 non-null  int64
 7   salary                                     24444 non-null  int64
 8   workclass_Federal-gov                      24444 non-null  int64
 9   workclass_Local-gov                        24444 non-null  int64
 10  workclass_Private                          244

For *hours-per-week* this discards a significant part of the data (24444 of 30162 rows remain). Moreover, we cannot rule out that people really do work that many hours. Let us check the second parameter.

In [3]:
first_quartile = dataset['capital-gain'].describe()['25%']
third_quartile = dataset['capital-gain'].describe()['75%']

iqr = third_quartile - first_quartile

dataset[(dataset['capital-gain'] > (first_quartile - 3 * iqr)) &
            (dataset['capital-gain'] < (third_quartile + 3 * iqr))].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 88 columns):
 #   Column                                     Non-Null Count  Dtype
---  ------                                     --------------  -----
 0   age                                        0 non-null      int64
 1   fnlwgt                                     0 non-null      int64
 2   education-num                              0 non-null      int64
 3   sex                                        0 non-null      int64
 4   capital-gain                               0 non-null      int64
 5   capital-loss                               0 non-null      int64
 6   hours-per-week                             0 non-null      int64
 7   salary                                     0 non-null      int64
 8   workclass_Federal-gov                      0 non-null      int64
 9   workclass_Local-gov                        0 non-null      int64
 10  workclass_Private                          0 non-null      int

For the second parameter this discards all of the data.
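
An empty result like this is what one would expect when both quartiles of a column coincide: the IQR is then zero and the strict inequalities can never both hold. The sketch below reproduces the effect on made-up data. The helper `iqr_filter` and the `toy` frame are illustrative and not part of the notebook.

```python
import pandas as pd

def iqr_filter(df, column, k=3.0):
    """Keep rows whose value lies strictly inside (Q1 - k*IQR, Q3 + k*IQR)."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    mask = (df[column] > q1 - k * iqr) & (df[column] < q3 + k * iqr)
    return df[mask]

# Toy data (hypothetical): almost everyone reports zero capital gain.
toy = pd.DataFrame({
    "hours-per-week": [40, 40, 45, 38, 40, 50, 99, 40],
    "capital-gain":   [0, 0, 0, 0, 0, 0, 0, 15000],
})

kept_hours = iqr_filter(toy, "hours-per-week")  # drops only the 99-hour row
kept_gain = iqr_filter(toy, "capital-gain")     # Q1 == Q3 == 0, so nothing survives
print(len(kept_hours), len(kept_gain))  # prints: 7 0
```

Switching to non-strict bounds (`>=` and `<=`) would keep the zero-gain rows in this toy case, but whether that is appropriate for the real column is a modelling decision.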

## Splitting the data

First, split the data into the set of object descriptions (features) and the set of labels:

In [2]:
X = dataset.drop(columns='salary').values
y = dataset['salary'].values

Split the data into training and test sets:

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)

Check the class proportions of the split:

In [4]:
from collections import Counter
print(Counter(y_train).values())
print(Counter(y_test).values())

dict_values([18123, 6006])
dict_values([4531, 1502])
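
Printing only `Counter(...).values()` hides which count belongs to which class. The sketch below turns the counts reported above into class shares. The mapping of counts to labels is an assumption: the test-set mapping follows the `support` column of the classification reports, and the training-set mapping is assumed to match it.

```python
# Class counts reported above (label assignment is an assumption, see the text).
train_counts = {0: 18123, 1: 6006}
test_counts = {0: 4531, 1: 1502}

def class_shares(counts):
    """Return each class's share of the total number of samples."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

train_shares = class_shares(train_counts)
test_shares = class_shares(test_counts)
test_fraction = sum(test_counts.values()) / (
    sum(train_counts.values()) + sum(test_counts.values())
)

print({k: round(v, 4) for k, v in train_shares.items()})
print({k: round(v, 4) for k, v in test_shares.items()})
print(round(test_fraction, 4))
```

Both sets contain roughly 75% of class 0, and the test set holds about 20% of the rows, which is consistent with `test_size=0.20` and `stratify=y`.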


Standardize the features (zero mean, unit variance), fitting the scaler on the training set only:

In [5]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
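
To make the effect of fitting on the training set only more concrete, here is a small sketch on random data (the `_demo` arrays are made up). The test features are transformed with the training mean and standard deviation, so their own mean and variance are close to, but not exactly, 0 and 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler_demo = StandardScaler().fit(X_train_demo)   # statistics come from the training set
X_train_scaled = scaler_demo.transform(X_train_demo)
X_test_scaled = scaler_demo.transform(X_test_demo)

# The same transformation written out by hand (population std, ddof=0).
manual = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3), X_test_scaled.std(axis=0).round(3))
```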

# Classification

## KNN

Train the KNN algorithm and check its classification accuracy:

In [9]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[4092  439]
 [ 652  850]]
              precision    recall  f1-score   support

           0       0.86      0.90      0.88      4531
           1       0.66      0.57      0.61      1502

    accuracy                           0.82      6033
   macro avg       0.76      0.73      0.75      6033
weighted avg       0.81      0.82      0.81      6033
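
The figures in the report can be recovered by hand from the confusion matrix. A short worked example for class 1, using the scikit-learn layout in which rows are true classes and columns are predicted classes:

```python
# Confusion matrix reported above: [[TN, FP], [FN, TP]].
tn, fp = 4092, 439
fn, tp = 652, 850

precision_1 = tp / (tp + fp)                 # share of predicted 1s that are correct
recall_1 = tp / (tp + fn)                    # share of true 1s that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.66 recall=0.57 f1=0.61 accuracy=0.82
```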



Search for the best hyperparameters:

In [18]:
from sklearn.model_selection import GridSearchCV

from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

parameters = {'n_neighbors':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}

knn = KNeighborsClassifier()

clf = GridSearchCV(knn, parameters)

clf.fit(X_train, y_train)

clf.best_params_

{'n_neighbors': 9}
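
`best_params_` reports only the winner, and the margin over the other candidates can be small. The sketch below lists the mean cross-validated accuracy for every `n_neighbors`. It uses synthetic data from `make_classification` rather than the Adult dataset, and the `_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": list(range(1, 11))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# One entry per candidate, in the order the grid was evaluated.
ks = [params["n_neighbors"] for params in search.cv_results_["params"]]
mean_scores = search.cv_results_["mean_test_score"]
for k, score in zip(ks, mean_scores):
    print(f"k={k:2d} mean accuracy={score:.3f}")
print("best:", search.best_params_, round(search.best_score_, 3))
```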

Perform cross-validation:

In [8]:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10)
for train_index, test_index in skf.split(X, y):
  X_train, X_test = X[train_index], X[test_index]
  y_train, y_test = y[train_index], y[test_index]
  clf = KNeighborsClassifier(n_neighbors=9)
  clf.fit(X_train, y_train)
  y_pred = clf.predict(X_test)
  print(classification_report(y_test, y_pred))
  print(clf.score(X_test, y_test))

              precision    recall  f1-score   support

           0       0.80      0.95      0.87      2266
           1       0.64      0.27      0.38       751

    accuracy                           0.78      3017
   macro avg       0.72      0.61      0.62      3017
weighted avg       0.76      0.78      0.75      3017

0.7812396420285052
              precision    recall  f1-score   support

           0       0.80      0.95      0.87      2266
           1       0.66      0.27      0.39       751

    accuracy                           0.78      3017
   macro avg       0.73      0.61      0.63      3017
weighted avg       0.76      0.78      0.75      3017

0.7835598276433543
              precision    recall  f1-score   support

           0       0.80      0.95      0.87      2266
           1       0.66      0.27      0.38       750

    accuracy                           0.78      3016
   macro avg       0.73      0.61      0.62      3016
weighted avg       0.76      0.78   

The values obtained on different folds are approximately equal, which suggests that the results are stable with respect to how the data is split.

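The grid search prefers k = 9 to k = 5. The fold accuracies above (about 0.78) are not directly comparable with the 0.82 obtained for k = 5, because the folds are fitted on the unscaled `X`, whereas the hold-out experiment used standardized features.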
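
The cross-validation loop above fits on the raw `X`, which matters for a distance-based method such as KNN. One way to keep scaling inside each fold is to wrap the scaler and the classifier in a pipeline and summarize the folds with a mean and standard deviation. This is a sketch under assumptions: the data is synthetic, and `shuffle=True` with a fixed `random_state` is a choice made here, not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly 75% / 25%).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.75, 0.25], random_state=0
)

# The scaler is re-fitted on the training part of every fold.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=9))
skf_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_val_score(model, X_demo, y_demo, cv=skf_demo, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A small standard deviation across folds supports the stability claim more directly than visually comparing ten printed reports.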

## DTC

Train the DTC algorithm and check its classification accuracy:

In [23]:
from sklearn import tree

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[3960  571]
 [ 576  926]]
              precision    recall  f1-score   support

           0       0.87      0.87      0.87      4531
           1       0.62      0.62      0.62      1502

    accuracy                           0.81      6033
   macro avg       0.75      0.75      0.75      6033
weighted avg       0.81      0.81      0.81      6033



Search for the best hyperparameters:

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

parameters = {'max_depth': range(1,11),'max_features': range(4,19)}

dtc = tree.DecisionTreeClassifier()

clf = GridSearchCV(dtc, parameters)

clf.fit(X_train, y_train)

clf.best_params_

{'max_depth': 10, 'max_features': 17}
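
The size of this search can be checked with `ParameterGrid`. The fold count of 5 is an assumption: it is the `GridSearchCV` default in scikit-learn 0.22 and later, and the notebook does not state its version.

```python
from sklearn.model_selection import ParameterGrid

parameters = {"max_depth": list(range(1, 11)), "max_features": list(range(4, 19))}

n_candidates = len(ParameterGrid(parameters))  # 10 depths x 15 feature counts
n_folds = 5                                    # assumed default of GridSearchCV
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # prints: 150 750
```

Note that the selected `max_depth` of 10 is the largest value in the grid, so a wider depth range might be worth trying.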

Perform cross-validation:

In [24]:
skf = StratifiedKFold(n_splits=10)
for train_index, test_index in skf.split(X, y):
  X_train, X_test = X[train_index], X[test_index]
  y_train, y_test = y[train_index], y[test_index]
  clf = tree.DecisionTreeClassifier(max_depth = 10, max_features = 17)
  clf.fit(X_train, y_train)
  y_pred = clf.predict(X_test)
  print(classification_report(y_test, y_pred))
  print(clf.score(X_test, y_test))

              precision    recall  f1-score   support

           0       0.87      0.86      0.87      2266
           1       0.60      0.61      0.60       751

    accuracy                           0.80      3017
   macro avg       0.73      0.74      0.74      3017
weighted avg       0.80      0.80      0.80      3017

0.8004640371229699
              precision    recall  f1-score   support

           0       0.87      0.87      0.87      2266
           1       0.61      0.61      0.61       751

    accuracy                           0.81      3017
   macro avg       0.74      0.74      0.74      3017
weighted avg       0.81      0.81      0.81      3017

0.8070931388796818
              precision    recall  f1-score   support

           0       0.86      0.86      0.86      2266
           1       0.59      0.59      0.59       750

    accuracy                           0.80      3016
   macro avg       0.73      0.73      0.73      3016
weighted avg       0.80      0.80   

The values obtained on different folds are approximately equal, which suggests that the results are stable with respect to how the data is split.

In the folds shown, accuracy is about 0.80 to 0.81, roughly on par with the 0.81 obtained with the default parameters.

## NB

Train the NB algorithm and check its classification accuracy:

In [31]:
from sklearn.naive_bayes import GaussianNB

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

clf = GaussianNB()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[1918 2613]
 [  58 1444]]
              precision    recall  f1-score   support

           0       0.97      0.42      0.59      4531
           1       0.36      0.96      0.52      1502

    accuracy                           0.56      6033
   macro avg       0.66      0.69      0.55      6033
weighted avg       0.82      0.56      0.57      6033



Search for the best hyperparameters:

In [37]:
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

parameters = {'var_smoothing': np.logspace(0,-9, num=100)}

gnb = GaussianNB()

clf = GridSearchCV(gnb, parameters)

clf.fit(X_train, y_train)

clf.best_params_

{'var_smoothing': 0.0657933224657568}
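
For orientation, `np.logspace(0, -9, num=100)` spans the values from 1 down to 1e-9 on a logarithmic scale, and the lower end equals the `GaussianNB` default for `var_smoothing`. The sketch below locates the selected value in that grid:

```python
import numpy as np

grid = np.logspace(0, -9, num=100)   # 100 values from 10**0 down to 10**-9
best = 0.0657933224657568            # value reported by the grid search above

idx = int(np.argmin(np.abs(grid - best)))
print(grid[0], grid[-1])
print(idx, grid[idx])
```

The selected value sits at index 13 of 100, that is, near the large end of the grid and several orders of magnitude above the default.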

Perform cross-validation:

In [32]:
skf = StratifiedKFold(n_splits=10)
for train_index, test_index in skf.split(X, y):
  X_train, X_test = X[train_index], X[test_index]
  y_train, y_test = y[train_index], y[test_index]
  clf = GaussianNB(var_smoothing = 0.0657933224657568)
  clf.fit(X_train, y_train)
  y_pred = clf.predict(X_test)
  print(classification_report(y_test, y_pred))
  print(clf.score(X_test, y_test))

              precision    recall  f1-score   support

           0       0.81      0.94      0.87      2266
           1       0.65      0.31      0.42       751

    accuracy                           0.79      3017
   macro avg       0.73      0.63      0.65      3017
weighted avg       0.77      0.79      0.76      3017

0.7875372886973815
              precision    recall  f1-score   support

           0       0.81      0.95      0.87      2266
           1       0.66      0.31      0.42       751

    accuracy                           0.79      3017
   macro avg       0.73      0.63      0.65      3017
weighted avg       0.77      0.79      0.76      3017

0.7888631090487239
              precision    recall  f1-score   support

           0       0.80      0.94      0.87      2266
           1       0.63      0.30      0.41       750

    accuracy                           0.78      3016
   macro avg       0.72      0.62      0.64      3016
weighted avg       0.76      0.78   

The values obtained on different folds are approximately equal, which suggests that the results are stable with respect to how the data is split.

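In the folds shown, accuracy is about 0.79, compared with 0.56 for the default `var_smoothing` on the hold-out split. The two runs differ both in the smoothing value and in feature scaling (the folds use the unscaled `X`), so the gain cannot be attributed to `var_smoothing` alone.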

## SVM

Train the SVM algorithm and check its classification accuracy:

In [39]:
from sklearn import svm

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

clf = svm.SVC()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[4201  330]
 [ 617  885]]
              precision    recall  f1-score   support

           0       0.87      0.93      0.90      4531
           1       0.73      0.59      0.65      1502

    accuracy                           0.84      6033
   macro avg       0.80      0.76      0.78      6033
weighted avg       0.84      0.84      0.84      6033



The hyperparameter search was not performed for this algorithm because of its excessively long running time.
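
One possible way to keep such a search affordable, shown here only as a sketch, is to sample a fixed number of candidates with `RandomizedSearchCV` instead of evaluating a full grid, optionally on a subsample of the training data. The data is synthetic, and the parameter ranges, `n_iter=10` and `cv=3` are illustrative assumptions rather than recommended settings.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Small synthetic stand-in for a subsample of the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Log-uniform ranges are an assumption, not taken from the notebook.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-4, 1e0),
}

search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions,
    n_iter=10,        # only 10 sampled candidates instead of a full grid
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

The cost is fixed by `n_iter` times the number of folds, so the budget can be chosen up front.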

Perform cross-validation:

In [40]:
skf = StratifiedKFold(n_splits=10)
for train_index, test_index in skf.split(X, y):
  X_train, X_test = X[train_index], X[test_index]
  y_train, y_test = y[train_index], y[test_index]
  clf = svm.SVC()
  clf.fit(X_train, y_train)
  y_pred = clf.predict(X_test)
  print(classification_report(y_test, y_pred))
  print(clf.score(X_test, y_test))

              precision    recall  f1-score   support

           0       0.78      1.00      0.88      2266
           1       0.98      0.16      0.27       751

    accuracy                           0.79      3017
   macro avg       0.88      0.58      0.57      3017
weighted avg       0.83      0.79      0.73      3017

0.7891945641365595
              precision    recall  f1-score   support

           0       0.78      1.00      0.88      2266
           1       0.95      0.16      0.28       751

    accuracy                           0.79      3017
   macro avg       0.87      0.58      0.58      3017
weighted avg       0.83      0.79      0.73      3017

0.7898574743122307
              precision    recall  f1-score   support

           0       0.78      1.00      0.88      2266
           1       0.97      0.15      0.26       750

    accuracy                           0.79      3016
   macro avg       0.88      0.57      0.57      3016
weighted avg       0.83      0.79   

The values obtained on different folds are approximately equal, which suggests that the results are stable with respect to how the data is split.

## LR

Train the LR algorithm and check its classification accuracy:

In [9]:
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

clf = LogisticRegression()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[4206  325]
 [ 601  901]]
              precision    recall  f1-score   support

           0       0.87      0.93      0.90      4531
           1       0.73      0.60      0.66      1502

    accuracy                           0.85      6033
   macro avg       0.80      0.76      0.78      6033
weighted avg       0.84      0.85      0.84      6033



Search for the best hyperparameters:

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

parameters = {'C': [100, 10, 1.0, 0.1, 0.01]}

lr = LogisticRegression()

clf = GridSearchCV(lr, parameters)

clf.fit(X_train, y_train)

clf.best_params_

{'C': 100}

Perform cross-validation:

In [24]:
skf = StratifiedKFold(n_splits=10)
for train_index, test_index in skf.split(X, y):
  X_train, X_test = X[train_index], X[test_index]
  y_train, y_test = y[train_index], y[test_index]
  clf = LogisticRegression()
  clf.fit(X_train, y_train)
  y_pred = clf.predict(X_test)
  print(classification_report(y_test, y_pred))
  print(clf.score(X_test, y_test))

              precision    recall  f1-score   support

           0       0.80      0.96      0.87      2266
           1       0.70      0.27      0.39       751

    accuracy                           0.79      3017
   macro avg       0.75      0.61      0.63      3017
weighted avg       0.77      0.79      0.75      3017

0.7895260192243951
              precision    recall  f1-score   support

           0       0.80      0.97      0.87      2266
           1       0.72      0.26      0.39       751

    accuracy                           0.79      3017
   macro avg       0.76      0.62      0.63      3017
weighted avg       0.78      0.79      0.75      3017

0.7918462048392443
              precision    recall  f1-score   support

           0       0.80      0.96      0.87      2266
           1       0.69      0.26      0.38       750

    accuracy                           0.79      3016
   macro avg       0.74      0.61      0.63      3016
weighted avg       0.77      0.79   

The values obtained on different folds are approximately equal, which suggests that the results are stable with respect to how the data is split.

## Conclusions on the performance of the algorithms on this model

SVC showed the best results.
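
To put the hold-out results side by side, the sketch below recomputes accuracy and the class-1 F1 score from the confusion matrices reported in the sections above. Each model was evaluated on its own random split (no `random_state` is fixed), so small differences between models may not be meaningful.

```python
# Hold-out confusion matrices from the sections above: [[TN, FP], [FN, TP]].
matrices = {
    "KNN": [[4092, 439], [652, 850]],
    "DTC": [[3960, 571], [576, 926]],
    "NB":  [[1918, 2613], [58, 1444]],
    "SVM": [[4201, 330], [617, 885]],
    "LR":  [[4206, 325], [601, 901]],
}

def summarize(cm):
    """Return (accuracy, F1 for class 1) for a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    return accuracy, f1_pos

summary = {name: summarize(cm) for name, cm in matrices.items()}
for name, (acc, f1) in sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{name:4s} accuracy={acc:.3f} f1(class 1)={f1:.3f}")
```

On these splits SVM and LR land within about half a percentage point of each other, well ahead of KNN, DTC and NB with default settings.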

In [60]:
import abcqwe
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

test = abcqwe(X_train, X_test, y_train, y_test, 5)
len(y_train)

IndentationError: expected an indented block (abcqwe.py, line 7)