Load the data and import the required libraries

In [1]:
import numpy as np
import pandas as pd

wine_train = pd.read_csv('winequality-red.csv', sep=";")
wine_train.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


Split the data into a feature matrix and a vector of target labels

In [40]:
X = wine_train.drop(['quality'], axis='columns').values
y = wine_train['quality'].values

Use a hold-out set: split the provided data into training and validation parts (80/20), stratified by quality so that both parts keep the same class proportions.

In [41]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, shuffle=True, random_state=9, test_size=0.2, stratify=y)
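
The effect of `stratify=y` can be checked on a small example. The sketch below uses synthetic, imbalanced labels (an assumption for illustration, not the wine data) and compares class shares in the two parts; `class_shares` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (illustrative only, not the wine data):
# 80% of class 0, 15% of class 1, 5% of class 2.
y_demo = np.array([0] * 160 + [1] * 30 + [2] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=True, random_state=9, stratify=y_demo
)

def class_shares(labels):
    """Return the fraction of samples in each class, ordered by class label."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

train_shares = class_shares(y_tr)
val_shares = class_shares(y_va)
print("train:", train_shares)
print("val:  ", val_shares)
```

Without stratification, a rare class can end up almost entirely in one part, which makes the validation accuracy less representative.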

Prepare the feature matrix: standardize the features, fitting the scaler on the training part only

In [42]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_val = ss.transform(X_val)
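
As a minimal sketch of what the scaler does, the example below checks on synthetic data (an assumption, not the wine features) that `transform` on the validation part applies the mean and standard deviation learned from the training part.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features for illustration only.
rng = np.random.default_rng(0)
X_tr_demo = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
X_va_demo = rng.normal(loc=10.0, scale=3.0, size=(20, 2))

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr_demo)  # learns mean/std from the training part
X_va_scaled = scaler.transform(X_va_demo)      # reuses the training statistics

# Manual z-score with the training statistics.
mu = X_tr_demo.mean(axis=0)
sigma = X_tr_demo.std(axis=0)  # population std (ddof=0), which StandardScaler uses
manual_va = (X_va_demo - mu) / sigma

print(np.allclose(X_va_scaled, manual_va))
```

Fitting on the training part only keeps information about the validation rows out of the preprocessing step.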

In [71]:
from sklearn.metrics import accuracy_score, confusion_matrix

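def print_prediction_assessment(predictions, y_val):
    """Print the accuracy (in percent) and the confusion matrix of the predictions."""
    acc = accuracy_score(y_val, predictions)
    # Rows are true classes, columns are predicted classes
    cm = confusion_matrix(y_val, predictions)
    print('ACCURACY:    ', acc * 100,'%')
    print(cm)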

Build a model using the k-nearest neighbors algorithm.

In [72]:
from sklearn.neighbors import KNeighborsClassifier

knnc = KNeighborsClassifier(algorithm='auto')
knnc = knnc.fit(X_train, y_train)

predictions = knnc.predict(X_val)
print_prediction_assessment(predictions, y_val)

ACCURACY:     59.375 %
[[ 0  1  1  0  0  0]
 [ 0  1  5  5  0  0]
 [ 1  1 96 37  1  0]
 [ 0  1 36 79 12  0]
 [ 0  0  1 25 14  0]
 [ 0  0  0  1  2  0]]
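
The reported accuracy can be recovered from the confusion matrix: correct predictions lie on the diagonal, so accuracy is the trace divided by the total count. The sketch below copies the matrix printed above and also derives per-class recall, which shows how poorly the rare classes at both ends are recognized.

```python
import numpy as np

# Confusion matrix copied from the KNN output above
# (rows: true classes, columns: predicted classes, in sorted label order).
cm = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 5, 5, 0, 0],
    [1, 1, 96, 37, 1, 0],
    [0, 1, 36, 79, 12, 0],
    [0, 0, 1, 25, 14, 0],
    [0, 0, 0, 1, 2, 0],
])

accuracy = np.trace(cm) / cm.sum()               # correct / total
recall_per_class = np.diag(cm) / cm.sum(axis=1)  # correct / true count per class

print("accuracy, %:", accuracy * 100)
print("recall per class:", np.round(recall_per_class, 3))
```

The two middle classes account for 264 of the 320 validation samples, so overall accuracy is dominated by them.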


Tune the number of neighbors with 5-fold cross-validation.

In [78]:
from sklearn.model_selection import GridSearchCV, cross_val_score

knn_params = {'n_neighbors': range(1,20)}
knn_grid = GridSearchCV(knnc, knn_params, cv=5, n_jobs=-1, verbose=True)

knn_grid.fit(X_train, y_train)
knn_grid.best_params_

Fitting 5 folds for each of 19 candidates, totalling 95 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   32.0s
[Parallel(n_jobs=-1)]: Done  95 out of  95 | elapsed:   33.8s finished


{'n_neighbors': 1}

In [79]:
# 59% accuracy on cross-validation and 67% on the hold-out set
knn_grid.best_score_ * 100

59.030492572322125

In [80]:
accuracy_score(y_val, knn_grid.predict(X_val)) * 100

66.875
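
In the search above, the scaler was fitted on all of `X_train` before cross-validation, so the scaling statistics also include the rows that each fold uses for validation. The effect is usually small, but one way to avoid it is to put the scaler and the classifier into a `Pipeline`, so that scaling is refitted inside every fold. The sketch below shows this variant on synthetic data (an assumption, not the wine data); the step names `scale` and `knn` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),     # refitted on the training part of each fold
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"knn__n_neighbors": range(1, 20)}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_demo, y_demo)

best_k = grid.best_params_["knn__n_neighbors"]
print(best_k, grid.best_score_)
```

Here `best_score_` is the mean accuracy over the five folds for the best candidate, and `grid.predict` uses the pipeline refitted on all the data passed to `fit`.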

Build a model using the decision tree algorithm. Also apply adaptive boosting (AdaBoost) to a decision tree.

In [81]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

print("DecisionTreeClassifier, criterion=entropy")
dt1 = DecisionTreeClassifier(criterion='entropy', random_state=12345)
dt1 = dt1.fit(X_train, y_train)

predictions1 = dt1.predict(X_val)
print_prediction_assessment(predictions1, y_val)

print("DecisionTreeClassifier, criterion=gini")
dt2 = DecisionTreeClassifier(criterion='gini', random_state=1337)
dt2 = dt2.fit(X_train, y_train)

predictions2 = dt2.predict(X_val)
print_prediction_assessment(predictions2, y_val)

print("AdaBoostClassifier")
abdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=11, random_state=228),
                          algorithm="SAMME",
                          random_state=228)

abdt.fit(X_train, y_train)
pred = abdt.predict(X_val)
print_prediction_assessment(pred, y_val)

DecisionTreeClassifier, criterion=entropy
ACCURACY:     62.5 %
[[ 0  0  2  0  0  0]
 [ 1  1  5  4  0  0]
 [ 1  3 90 37  5  0]
 [ 2  2 24 86 12  2]
 [ 0  0  2 15 23  0]
 [ 0  0  0  1  2  0]]
DecisionTreeClassifier, criterion=gini
ACCURACY:     65.0 %
[[ 0  0  1  1  0  0]
 [ 1  3  2  4  0  1]
 [ 0  4 94 33  4  1]
 [ 0  5 25 87 11  0]
 [ 0  1  2 10 24  3]
 [ 0  0  0  2  1  0]]
AdaBoostClassifier
ACCURACY:     71.875 %
[[  0   0   2   0   0   0]
 [  1   0   5   5   0   0]
 [  0   0 110  25   1   0]
 [  0   0  18 101   9   0]
 [  0   0   0  20  19   1]
 [  0   0   0   2   1   0]]


Tune the maximum tree depth with 5-fold cross-validation.

In [82]:
from sklearn.model_selection import GridSearchCV, cross_val_score

tree_params1 = {'max_depth': range(1,50)}
tree_grid1 = GridSearchCV(dt1, tree_params1, cv=5, verbose=True)
tree_grid1.fit(X_train, y_train)
tree_grid1.best_params_

Fitting 5 folds for each of 49 candidates, totalling 245 fits


[Parallel(n_jobs=1)]: Done 245 out of 245 | elapsed:    8.8s finished


{'max_depth': 17}

In [86]:
# 60% accuracy on cross-validation and 62.5% on the hold-out set
tree_grid1.best_score_ * 100

59.73416731821736

In [87]:
accuracy_score(y_val, tree_grid1.predict(X_val)) * 100

62.5

In [88]:
tree_params2 = {'max_depth': range(1,50)}
tree_grid2 = GridSearchCV(dt2, tree_params2, cv=5, verbose=True)
tree_grid2.fit(X_train, y_train)
tree_grid2.best_params_

Fitting 5 folds for each of 49 candidates, totalling 245 fits


[Parallel(n_jobs=1)]: Done 245 out of 245 | elapsed:    4.5s finished


{'max_depth': 16}

In [91]:
# 58% accuracy on cross-validation and 68.125% on the hold-out set
tree_grid2.best_score_ * 100

58.483189992181394

In [92]:
accuracy_score(y_val, tree_grid2.predict(X_val)) * 100

68.125
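
A note on the search range: once `max_depth` exceeds the depth that an unconstrained tree reaches on the data, larger values no longer change the fitted tree, so part of a wide range such as `range(1, 50)` may consist of equivalent candidates. The sketch below illustrates this on synthetic data (an assumption, not the wine data) using `get_depth()`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=0
)

# Depth reached when the tree is grown without a depth limit.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
full_depth = full_tree.get_depth()

# A limit above that depth does not constrain the tree.
capped_tree = DecisionTreeClassifier(
    max_depth=full_depth + 10, random_state=0
).fit(X_demo, y_demo)

print(full_depth, capped_tree.get_depth())
```

Checking the unconstrained depth first is one way to choose an upper bound for the grid.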

Build a model using the random forest algorithm.

In [95]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_depth=16, random_state=100)
rf = rf.fit(X_train, y_train)

predictions4 = rf.predict(X_val)
print_prediction_assessment(predictions4, y_val)

ACCURACY:     69.6875 %
[[  0   1   0   1   0   0]
 [  1   2   5   3   0   0]
 [  0   0 110  23   3   0]
 [  0   0  30  92   6   0]
 [  0   0   2  19  19   0]
 [  0   0   0   1   2   0]]


Tune the number of trees with 5-fold cross-validation.

In [96]:
tree_params4 = {'n_estimators': range(50,100)}
tree_grid4 = GridSearchCV(rf, tree_params4, cv=5, verbose=True)
tree_grid4.fit(X_train, y_train)
tree_grid4.best_params_

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=1)]: Done 250 out of 250 | elapsed:  2.6min finished


{'n_estimators': 96}

In [97]:
# 68% accuracy on cross-validation and 73% on the hold-out set
tree_grid4.best_score_ * 100

67.94370602032838

In [98]:
accuracy_score(y_val, tree_grid4.predict(X_val)) * 100

73.125

Build a model using the gradient boosting algorithm.

In [107]:
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(random_state=228)
gbm.fit(X_train, y_train)

pred = gbm.predict(X_val)
print_prediction_assessment(pred, y_val)

ACCURACY:     69.6875 %
[[  0   1   1   0   0   0]
 [  0   2   6   3   0   0]
 [  0   1 113  22   0   0]
 [  0   0  26  91   9   2]
 [  0   0   2  20  17   1]
 [  0   0   0   1   2   0]]


Tune the number of trees with 5-fold cross-validation.

In [109]:
gbm_params = {'n_estimators': range(100,120)}
gbm_grid = GridSearchCV(gbm, gbm_params, cv=5, n_jobs=-1, verbose=True)
gbm_grid.fit(X_train, y_train)
gbm_grid.best_params_

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   34.8s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  1.5min finished


{'n_estimators': 106}

In [112]:
# 65% accuracy on cross-validation and 70% on the hold-out set
gbm_grid.best_score_ * 100

64.97263487099296

In [113]:
accuracy_score(y_val, gbm_grid.predict(X_val)) * 100

70.0
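
To compare the tuned models side by side, the scores reported above can be collected into one table. The sketch below copies the numbers from the cell outputs (cross-validation scores rounded to two decimals); the column names are arbitrary.

```python
import pandas as pd

# Accuracy in percent, copied from the outputs of the tuned models above.
results = pd.DataFrame(
    [
        ("KNN (n_neighbors=1)", 59.03, 66.875),
        ("Decision tree, entropy (max_depth=17)", 59.73, 62.5),
        ("Decision tree, gini (max_depth=16)", 58.48, 68.125),
        ("Random forest (n_estimators=96)", 67.94, 73.125),
        ("Gradient boosting (n_estimators=106)", 64.97, 70.0),
    ],
    columns=["model", "cv_accuracy", "holdout_accuracy"],
)

# Sort so that the best hold-out score comes first.
results = results.sort_values("holdout_accuracy", ascending=False).reset_index(drop=True)
best_model = results.loc[0, "model"]

print(results.to_string(index=False))
```

On these numbers the random forest leads on both cross-validation and the hold-out set. With only 320 validation samples, differences of a few percentage points between models should be read with caution.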