## Dataset

The dataset was adapted from the Wine Quality Dataset (https://archive.ics.uci.edu/ml/datasets/Wine+Quality).

### Attribute Information:

For more information, read [Cortez et al., 2009: http://dx.doi.org/10.1016/j.dss.2009.05.016].

Input variables (based on physicochemical tests):

    1 - fixed acidity 
    2 - volatile acidity 
    3 - citric acid 
    4 - residual sugar 
    5 - chlorides 
    6 - free sulfur dioxide 
    7 - total sulfur dioxide 
    8 - density 
    9 - pH 
    10 - sulphates 
    11 - alcohol

Output variable (based on sensory data):

    12 - quality (0: normal wine, 1: good wine)
    
## Problem statement
Predict the quality of a wine given its input variables. Use AUC (area under the receiver operating characteristic curve) as the evaluation metric.

First, let's load and explore the dataset.

In [1]:
import numpy as np
import pandas as pd
np.random.seed(42)  # call the function; assigning to it would overwrite it and seed nothing

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression




from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import roc_auc_score

from sklearn.ensemble import VotingClassifier

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [2]:
data = pd.read_csv("whitewine.csv")
data.head()

|   | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 0 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 0 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 0 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 0 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 0 |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4715 entries, 0 to 4714
Data columns (total 12 columns):
fixed_acidity           4715 non-null float64
volatile_acidity        4715 non-null float64
citric_acid             4715 non-null float64
residual_sugar          4715 non-null float64
chlorides               4715 non-null float64
free_sulfur_dioxide     4715 non-null float64
total_sulfur_dioxide    4715 non-null float64
density                 4715 non-null float64
pH                      4715 non-null float64
sulphates               4715 non-null float64
alcohol                 4715 non-null float64
quality                 4715 non-null int64
dtypes: float64(11), int64(1)
memory usage: 442.1 KB


In [4]:
data["quality"].value_counts()

0    3655
1    1060
Name: quality, dtype: int64

Note that the dataset is imbalanced: 3655 normal wines (about 78%) against 1060 good wines (about 22%).

## Questions and Code

**[1]. Split the given data using stratified sampling into 2 subsets: training (80%) and test (20%) sets. Use random_state = 42. [1 point]**

In [5]:
from sklearn.model_selection import train_test_split
X = data.drop(columns=['quality'])  # features only; leaves `data` itself unmodified
y = data['quality']                 # binary target (0: normal wine, 1: good wine)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

In [6]:
print(X_train.shape)
print(y_train.shape)

print(X_test.shape)
print(y_test.shape)

(3772, 11)
(3772,)
(943, 11)
(943,)
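
One way to confirm that `stratify=y` preserved the class balance is to compare the positive rate in each subset. The sketch below uses a synthetic label vector as a stand-in for `whitewine.csv` (an assumption, chosen to have roughly the same 78/22 split); the names `X_demo` and `y_demo` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine labels: 78% class 0, 22% class 1 (assumption).
rng = np.random.default_rng(42)
y_demo = np.array([0] * 780 + [1] * 220)
X_demo = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both subsets keep (almost exactly) the original positive rate.
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(f"positive rate: train={train_pos_rate:.3f}, test={test_pos_rate:.3f}")
```

On the notebook's own split, the equivalent check would be `y_train.mean()` and `y_test.mean()`.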


**[2]. Use ``GridSearchCV`` and ``Pipeline`` to tune hyper-parameters for 3 different classifiers including ``KNeighborsClassifier``, ``LogisticRegression`` and ``svm.SVC`` and report the corresponding AUC values on the training and test sets. Note that a scaler may need to be inserted into each pipeline. [6 points]**

In [7]:
KNN_parameters = {'knn__n_neighbors': (1,3,5,7,10), 'knn__weights':('uniform', 'distance')}
LR_parameters = {'lr__C':[0.5, 1.0, 2.0, 3.0, 4.5, 6.0], 'lr__solver':('lbfgs', 'liblinear','saga')}
SVM_parameters = {'svm__C': [0.5, 1.0, 2.0, 3.0, 4.5, 6.0], 'svm__kernel':('linear', 'sigmoid', 'rbf')}


SVM_Pipeline = Pipeline(steps = [('scale', StandardScaler()), ('svm', SVC(probability=True))])
KNN_Pipeline = Pipeline(steps = [('scale', StandardScaler()), ('knn', KNeighborsClassifier())])
LR_Pipeline =  Pipeline(steps = [('scale', StandardScaler()), ('lr', LogisticRegression(max_iter=1000))])
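
The keys in the parameter grids follow the `<step name>__<parameter name>` convention of `Pipeline`. As a small illustration (using a throwaway pipeline named `demo_pipeline`, which is not part of the notebook), the valid names can be listed with `get_params`:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Throwaway pipeline with the same step names as KNN_Pipeline above.
demo_pipeline = Pipeline(steps=[("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Every tunable name has the form "<step name>__<parameter name>".
knn_param_names = sorted(
    name for name in demo_pipeline.get_params() if name.startswith("knn__")
)
print(knn_param_names)

# set_params accepts the same names that GridSearchCV reads from param_grid.
demo_pipeline.set_params(knn__n_neighbors=7, knn__weights="distance")
n_neighbors_after = demo_pipeline.get_params()["knn__n_neighbors"]
```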

In [8]:
KNN_GSCV = GridSearchCV(KNN_Pipeline, param_grid = KNN_parameters, cv=3, scoring='roc_auc')
KNN_GSCV.fit(X_train, y_train)

train_Pred = KNN_GSCV.predict(X_train)
test_Pred = KNN_GSCV.predict(X_test)

knn_train_auc = roc_auc_score(y_train, train_Pred)
print("KNN training score= " + str(knn_train_auc))
knn_test_auc = roc_auc_score(y_test, test_Pred)
print("KNN test score= " + str(knn_test_auc))

KNN training score= 1.0
KNN test score= 0.8127339132230338
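
Note that `predict` returns hard 0/1 labels. An AUC computed from hard labels has a single operating point, which makes it equal to balanced accuracy; a ranking-based AUC is usually computed from `predict_proba` (or `decision_function`) instead. The following is a minimal sketch on synthetic data (an assumption, not the wine data; the fixed `n_neighbors=7` is arbitrary) that puts the two calls side by side.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem standing in for the wine data (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.78, 0.22], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=7))])
pipe.fit(X_tr, y_tr)

hard_labels = pipe.predict(X_te)              # 0/1 class labels
pos_scores = pipe.predict_proba(X_te)[:, 1]   # estimated P(class 1)

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_scores = roc_auc_score(y_te, pos_scores)
bal_acc = balanced_accuracy_score(y_te, hard_labels)

# auc_from_labels coincides with balanced accuracy; auc_from_scores uses the full ranking.
print(auc_from_labels, bal_acc, auc_from_scores)
```

On the notebook's own objects the probability-based call would be `roc_auc_score(y_test, KNN_GSCV.predict_proba(X_test)[:, 1])`; the figures printed above come from the hard-label variant.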


In [9]:
LR_GSCV = GridSearchCV(LR_Pipeline, param_grid = LR_parameters, cv=3, scoring='roc_auc')
LR_GSCV.fit(X_train, y_train)

train_Pred = LR_GSCV.predict(X_train)
test_Pred = LR_GSCV.predict(X_test)

lr_train_auc = roc_auc_score(y_train, train_Pred)
print("LR training score= " + str(lr_train_auc))
lr_test_auc = roc_auc_score(y_test, test_Pred)
print("LR test score= " + str(lr_test_auc))

LR training score= 0.6133712864259351
LR test score= 0.5962722298221614


In [10]:
SVM_GSCV = GridSearchCV(SVM_Pipeline, param_grid = SVM_parameters, cv=3, scoring='roc_auc')
SVM_GSCV.fit(X_train, y_train)

train_Pred = SVM_GSCV.predict(X_train)
test_Pred = SVM_GSCV.predict(X_test)

svm_train_auc = roc_auc_score(y_train, train_Pred)
print("SVM training score= " + str(svm_train_auc))
svm_test_auc = roc_auc_score(y_test, test_Pred)
print("SVM test score= " + str(svm_test_auc))

SVM training score= 0.7504533076942932
SVM test score= 0.6845978628397388


In [11]:
ave_train = (svm_train_auc + knn_train_auc + lr_train_auc) / 3
ave_test = (svm_test_auc + knn_test_auc + lr_test_auc) / 3
print("Average training score = " + str(ave_train))
print("Average test score = " + str(ave_test))


Average training score = 0.7879415313734094
Average test score = 0.6978680019616448
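
A fitted `GridSearchCV` object also exposes the selected hyper-parameters (`best_params_`) and the mean cross-validated score of the best candidate (`best_score_`), which can be reported next to the training and test AUC values. A small sketch on synthetic data (an assumption; the reduced grid is for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.78, 0.22], random_state=42
)

pipe = Pipeline([("scale", StandardScaler()), ("lr", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, param_grid={"lr__C": [0.5, 1.0, 2.0]}, cv=3, scoring="roc_auc")
grid.fit(X_demo, y_demo)

best_params = grid.best_params_   # e.g. {"lr__C": ...}
best_cv_auc = grid.best_score_    # mean AUC over the 3 folds for the best candidate
print(best_params, best_cv_auc)
```

In the notebook, the same attributes are available on `KNN_GSCV`, `LR_GSCV` and `SVM_GSCV`.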


**[3]. Train a soft ``VotingClassifier`` whose estimators are the three tuned pipelines obtained from [2]. Report the AUC values on the training and test sets. Comment on the performance of the ensemble model. [2 points]**

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier

In [12]:
vc = VotingClassifier(estimators=[('lr', LR_GSCV), ('knn', KNN_GSCV), ('svm',SVM_GSCV)], voting='soft')
vc.fit(X_train, y_train)

train_preds = vc.predict(X_train)
train_auc = roc_auc_score(train_preds, y_train)
print("Voting Classifier training auc = " + (str(train_auc)))
test_preds = vc.predict(X_test)
test_auc = roc_auc_score(test_preds, y_test)
print("Voting Classifier test auc = " + (str(test_auc)))

Voting Classifier training auc = 0.9692451107221673
Voting Classifier test auc = 0.8578478352248844
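
The signature is `roc_auc_score(y_true, y_score)`, and the result is not symmetric in its two arguments. The cell above passes the predictions first, so its numbers are not directly comparable with those in [2]. A toy example with hand-made labels (hypothetical values, chosen only to make the arithmetic easy to follow) shows the difference:

```python
import numpy as np
from sklearn.metrics import precision_score, roc_auc_score

# Hand-made labels: 2 true positives in the data, the classifier flags only 1 sample.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0])

# Documented order: (TPR + TNR) / 2 = (1/2 + 6/6) / 2 = 0.75
auc_correct = roc_auc_score(y_true, y_pred)

# Swapped order: the predictions play the role of the ground truth, so the result
# is (precision + negative predictive value) / 2 = (1 + 6/7) / 2 = 13/14.
auc_swapped = roc_auc_score(y_pred, y_true)

precision = precision_score(y_true, y_pred)          # 1/1
npv = precision_score(y_true, y_pred, pos_label=0)   # 6/7
print(auc_correct, auc_swapped, (precision + npv) / 2)
```

A cautious classifier that rarely predicts the positive class can therefore look much better under the swapped call than under the documented one.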


On these numbers, the Voting Classifier outperforms all three individual models. The test AUC rises from the best individual value of 0.81 (KNN) to 0.86, and the ensemble reduces the overfitting seen in the KNN model (training AUC of 1.0). The remaining gap between the training AUC (0.97) and the test AUC (0.86) is not large enough to suggest severe overfitting. The comparison is not exact, however, because the ensemble cell passes the predictions as the first argument of `roc_auc_score`, whereas the cells in [2] pass the true labels first. Soft voting also gave a higher AUC than hard voting, which is attributed to the tuning of the individual models by the grid search.
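
For reference, soft voting predicts from the average of the members' class probabilities. The sketch below checks this by hand on synthetic data, with two untuned members instead of the three tuned pipelines (both are assumptions made to keep the example short).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the wine features (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

vc_demo = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)), ("knn", KNeighborsClassifier())],
    voting="soft",
)
vc_demo.fit(X_demo, y_demo)

# Average the fitted members' probabilities by hand ...
manual_proba = np.mean([est.predict_proba(X_demo) for est in vc_demo.estimators_], axis=0)
# ... and compare with what the ensemble reports.
ensemble_proba = vc_demo.predict_proba(X_demo)
same = np.allclose(manual_proba, ensemble_proba)
print(same)
```

Hard voting, by contrast, counts class labels and discards how confident each member is.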


**[4]. Redo [3] with a sensible set of ``weights`` for the estimators. Comment on the performance of the ensemble model in this case. [1 point]**

In [13]:
weights = [1,4,2]
weighted_vc = VotingClassifier(estimators=[('lr', LR_GSCV), ('knn', KNN_GSCV), ('svm',SVM_GSCV)], weights=weights, voting='soft')
weighted_vc.fit(X_train, y_train)

train_preds = weighted_vc.predict(X_train)
train_auc = roc_auc_score(train_preds, y_train)
print("Voting Classifier training auc = " + (str(train_auc)))
test_preds = weighted_vc.predict(X_test)
test_auc = roc_auc_score(test_preds, y_test)
print("Voting Classifier test auc = " + (str(test_auc)))

Voting Classifier training auc = 1.0
Voting Classifier test auc = 0.8531247705074538


The weighted voting classifier reaches a higher training AUC than the unweighted one (1.0 versus 0.97), but its test AUC is marginally lower (0.853 versus 0.858). The gap between the two scores is therefore wider, which suggests that the weighted model overfits the training data more. The weights were chosen so that the KNN model, which had the best test AUC in [2], has the strongest influence, while the logistic regression model has the smallest weight and thus the least influence on the ensemble. The weighted model does not perform better than the unweighted model. It does improve on the KNN model alone, matching its training AUC while raising the test AUC (0.85 versus 0.81). Giving the KNN model even heavier weights did not improve the scores, while lower weights narrowed the difference between the training and test scores but also lowered both scores overall.
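
To see how the weights `[1, 4, 2]` shift a soft vote towards the KNN member, consider a single sample with made-up member probabilities (the values below are hypothetical and not taken from the fitted models):

```python
import numpy as np

# Hypothetical P(good wine) for one sample, in the order of the estimators list:
# lr, knn, svm.
member_probs = np.array([0.2, 0.9, 0.4])
weights = np.array([1, 4, 2])

unweighted = np.average(member_probs)                 # (0.2 + 0.9 + 0.4) / 3
weighted = np.average(member_probs, weights=weights)  # (0.2 + 3.6 + 0.8) / 7

print(f"unweighted={unweighted:.3f}, weighted={weighted:.3f}")
# prints: unweighted=0.500, weighted=0.657
```

With equal weights the sample sits exactly on the 0.5 decision boundary; with the KNN weight of 4 it is pulled well onto the positive side, which is consistent with the weighted ensemble inheriting more of the KNN model's behaviour on the training set.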

**[5][*Optional - for bonus points only*]. Use the ``VotingClassifier`` with ``GridSearchCV`` to tune the hyper-parameters of the individual estimators. The parameter grid should be a combination of those in [2]. Report the AUC values on the training and test sets. Comment on the performance of the ensemble model. [2 points]**

Documentation: https://scikit-learn.org/stable/modules/ensemble.html#using-the-votingclassifier-with-gridsearchcv
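
A possible starting point for [5] is sketched below. It is not a worked answer: it runs on synthetic data, uses a much smaller grid than the one in [2], and the helper `scaled_pipeline` is a hypothetical name introduced here. The point it illustrates is the parameter naming: when a pipeline is nested inside a `VotingClassifier`, each grid key takes the form `<estimator name>__<step name>__<parameter>`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.78, 0.22], random_state=42
)

def scaled_pipeline(name, estimator):
    """Hypothetical helper: wrap an estimator in a scaler + estimator pipeline."""
    return Pipeline(steps=[("scale", StandardScaler()), (name, estimator)])

ensemble = VotingClassifier(
    estimators=[
        ("lr", scaled_pipeline("lr", LogisticRegression(max_iter=1000))),
        ("knn", scaled_pipeline("knn", KNeighborsClassifier())),
        ("svm", scaled_pipeline("svm", SVC(probability=True, random_state=42))),
    ],
    voting="soft",
)

# Keys: <name in VotingClassifier>__<step name in Pipeline>__<parameter>.
param_grid = {
    "lr__lr__C": [0.5, 2.0],
    "knn__knn__n_neighbors": [3, 7],
    "svm__svm__C": [0.5, 2.0],
}

search = GridSearchCV(ensemble, param_grid=param_grid, cv=3, scoring="roc_auc")
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_cv_auc = search.best_score_
print(best_params, best_cv_auc)
```

Because the grid is the Cartesian product of the individual grids, combining the full grids from [2] multiplies the number of candidates, so the runtime grows quickly.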