# Red Wine Quality Prediction 

authors: 

## Summary 

## Introduction 

## EDA

### Imports

In [1]:
import sys
from hashlib import sha1

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler

### Part 1: Dataset Description
The dataset used for this analysis is the "winequality-red.csv" file from the <a href="https://archive.ics.uci.edu/dataset/186/wine+quality">UC Irvine Machine Learning Repository</a>, which was originally described in <a href="http://www3.dsi.uminho.pt/pcortez/wine/">Decision Support Systems, Elsevier</a>. The dataset contains physicochemical properties (features) of red vinho verde wine samples from the north of Portugal, along with an associated wine quality score from 0 (worst) to 10 (best).

### Part 2: Data Summary
In this section the df.info() and df.describe() commands will be used to obtain a preliminary overview of the dataset. Areas of particular interest are row and column counts, target variable type, feature types, and null entries. 

In [2]:
# Import the dataset:
df = pd.read_csv('data/winequality-red.csv', sep = ';')

# Visualize the first 5 rows:
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [3]:
# Determine the size of the dataset:
print(df.shape)

# Determine the value counts for each unique target (quality) value:
print(df.quality.unique())
df.quality.value_counts().sort_index()

(1599, 12)
[5 6 7 4 8 3]


quality
3     10
4     53
5    681
6    638
7    199
8     18
Name: count, dtype: int64

There are 11 feature columns representing physicochemical characteristics of the wines, such as fixed acidity, residual sugar, chlorides, and density, and 1599 rows (observations). The target is the quality column, a set of ordinal values that ranges from 3 to 8 in this dataset. Most observations have an "average" quality between 5 and 7 (1518 of 1599, about 95%), with only 63 observations below a score of 5 and 18 above a score of 7. The dataset is therefore imbalanced.
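
To make the imbalance concrete, the class shares can be computed directly from the value counts printed above. The following is a minimal sketch in plain Python; the `counts` dictionary is copied from that output, and the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

total = sum(counts.values())

# Share of each quality score in the full dataset.
shares = {score: n / total for score, n in counts.items()}

# Share of the "average" scores 5 to 7, and of the single most common score.
mid_share = sum(shares[s] for s in (5, 6, 7))
majority_score = max(counts, key=counts.get)
majority_share = shares[majority_score]

for score, share in shares.items():
    print(f"quality {score}: {share:.1%}")
print(f"scores 5-7: {mid_share:.1%}; most common score {majority_score}: {majority_share:.1%}")
```

The share of the most common score is also roughly what a most-frequent-class baseline would achieve on the full dataset, which gives a reference point for the model scores reported later.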

In [4]:
# View feature dtypes and counts, null values:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


All entries in each of the 11 feature columns above are non-null, so no imputation is required. All feature columns are of type float64 and the target column is of type int64. Because the numeric target "quality" represents an ordered wine quality category (on a scale from 0 to 10), this problem could be treated as either a classification or a regression problem. Lastly, because the features are numeric and differ in scale, they will be standardized with `StandardScaler()` during preprocessing, to prevent features of large relative magnitude from dominating features of smaller relative magnitude.
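
The effect of standardization can be illustrated on two columns of very different magnitude. The sketch below is plain Python rather than the pipeline used later in this notebook; it applies the transformation that `StandardScaler` performs by default (subtract the mean, divide by the population standard deviation) to the first five `total sulfur dioxide` and `density` values shown by `df.head()`. The helper name `standardize` is hypothetical.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


# First five rows from df.head() above.
total_sulfur_dioxide = [34.0, 67.0, 54.0, 60.0, 34.0]
density = [0.9978, 0.9968, 0.9970, 0.9980, 0.9978]

z_tsd = standardize(total_sulfur_dioxide)
z_density = standardize(density)

# After scaling, both columns have mean ~0 and standard deviation ~1,
# so neither dominates a distance- or margin-based model by magnitude alone.
print([round(z, 2) for z in z_tsd])
print([round(z, 2) for z in z_density])
```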

In [5]:
# View summary statistics:
df.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


The summary statistics show a discrepancy in magnitude between the features, which will be addressed with scaling during preprocessing. Several features also appear to have outliers, including fixed acidity (mean of 8.32, max of 15.90), residual sugar (mean of 2.54, max of 15.50), free sulfur dioxide (mean of 15.87, max of 72.00), and total sulfur dioxide (mean of 46.47, max of 289.00). The outliers will be explored further in subsequent analysis, as they may impact model prediction accuracy.
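
One common way to check the outlier claim is Tukey's 1.5 × IQR rule, which flags values above Q3 + 1.5 × (Q3 − Q1). The sketch below applies it to the quartiles and maxima copied from the `df.describe()` output above. The rule is a stand-in heuristic chosen for illustration, not a method used elsewhere in this notebook, and the helper name `upper_fence` is hypothetical.

```python
# (25%, 75%, max) copied from the df.describe() output above.
summary = {
    "fixed acidity": (7.1, 9.2, 15.9),
    "residual sugar": (1.9, 2.6, 15.5),
    "free sulfur dioxide": (7.0, 21.0, 72.0),
    "total sulfur dioxide": (22.0, 62.0, 289.0),
}


def upper_fence(q1, q3, k=1.5):
    """Upper Tukey fence: Q3 + k * IQR."""
    return q3 + k * (q3 - q1)


for name, (q1, q3, max_value) in summary.items():
    fence = upper_fence(q1, q3)
    flag = "beyond fence" if max_value > fence else "within fence"
    print(f"{name}: fence={fence:.2f}, max={max_value} ({flag})")
```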

## Methods

### EDA Visualization

In [6]:
import pandas as pd 
import numpy as np
import altair as alt
import matplotlib.pyplot as plt


In [7]:
# read data
wine_df = pd.read_csv('data/winequality-red.csv', sep = ';')



We first observe the distribution of the features using their summary statistics, histograms, and box plots. The majority of features have a skewed distribution, and many contain outliers. Volatile acidity, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, and sulphates all have very extreme outliers. Linear models, K-NN, and SVMs can be sensitive to outliers, whereas decision trees are generally more robust to outliers in the features. Omitting outliers may therefore improve the performance of the sensitive models.
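
A quick, rough indication of right skew is a mean noticeably larger than the median. The sketch below compares the two for a few features, using values copied from the summary statistics in this notebook. It is a heuristic only and not a substitute for the histograms; the variable names are illustrative.

```python
# (mean, median) pairs copied from the describe() output.
stats = {
    "residual sugar": (2.538806, 2.2),
    "chlorides": (0.087467, 0.079),
    "free sulfur dioxide": (15.874922, 14.0),
    "total sulfur dioxide": (46.467792, 38.0),
    "sulphates": (0.658149, 0.62),
    "pH": (3.311113, 3.31),
}

# Relative gap between mean and median; larger positive values hint at a long right tail.
gaps = {name: (mean - median) / median for name, (mean, median) in stats.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean exceeds median by {gap:.1%}")
```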

In [8]:
wine_df.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


In [9]:
# Features 
feature_df = wine_df.drop('quality', axis=1)

feature_names = list(feature_df.columns)


alt.Chart(feature_df).mark_bar().encode(
     alt.X(alt.repeat()).type('quantitative').bin(maxbins=40),
     y='count()',
).properties(
    width=200,
    height=200
).repeat(
    feature_names, 
    columns=4
)

In [10]:
# Box plots 
alt.data_transformers.enable('vegafusion')

alt.Chart(feature_df).mark_boxplot().encode(
     alt.X(alt.repeat()).type('quantitative')
).properties(
    width=200,
    height=200
).repeat(
    feature_names, 
    columns=4
)


Next we look at potential correlations between the features using a correlation matrix. The correlation matrix ranges from dark green to indicate a strong positive correlation, white to indicate no correlation and dark pink to indicate a strong negative correlation. The features with the strongest correlations are summarized in the table below.
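
For reference, each entry of the correlation matrix is a Pearson correlation coefficient, which is the default method of pandas `DataFrame.corr`. A minimal sketch of that calculation in plain Python follows. The function name `pearson` is hypothetical, and the inputs are only the first five rows of the dataset, so the result will not equal the full-data value.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# First five rows of the dataset (fixed acidity and pH).
fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4]
ph = [3.51, 3.20, 3.26, 3.16, 3.51]

r = pearson(fixed_acidity, ph)
print(f"r = {r:.3f}")
```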

In [11]:

feature_df.corr(numeric_only=True).style.background_gradient('PiYG')


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed acidity | 1.0 | -0.256131 | 0.671703 | 0.114777 | 0.093705 | -0.153794 | -0.113181 | 0.668047 | -0.682978 | 0.183006 | -0.061668 |
| volatile acidity | -0.256131 | 1.0 | -0.552496 | 0.001918 | 0.061298 | -0.010504 | 0.07647 | 0.022026 | 0.234937 | -0.260987 | -0.202288 |
| citric acid | 0.671703 | -0.552496 | 1.0 | 0.143577 | 0.203823 | -0.060978 | 0.035533 | 0.364947 | -0.541904 | 0.31277 | 0.109903 |
| residual sugar | 0.114777 | 0.001918 | 0.143577 | 1.0 | 0.05561 | 0.187049 | 0.203028 | 0.355283 | -0.085652 | 0.005527 | 0.042075 |
| chlorides | 0.093705 | 0.061298 | 0.203823 | 0.05561 | 1.0 | 0.005562 | 0.0474 | 0.200632 | -0.265026 | 0.37126 | -0.221141 |
| free sulfur dioxide | -0.153794 | -0.010504 | -0.060978 | 0.187049 | 0.005562 | 1.0 | 0.667666 | -0.021946 | 0.070377 | 0.051658 | -0.069408 |
| total sulfur dioxide | -0.113181 | 0.07647 | 0.035533 | 0.203028 | 0.0474 | 0.667666 | 1.0 | 0.071269 | -0.066495 | 0.042947 | -0.205654 |
| density | 0.668047 | 0.022026 | 0.364947 | 0.355283 | 0.200632 | -0.021946 | 0.071269 | 1.0 | -0.341699 | 0.148506 | -0.49618 |
| pH | -0.682978 | 0.234937 | -0.541904 | -0.085652 | -0.265026 | 0.070377 | -0.066495 | -0.341699 | 1.0 | -0.196648 | 0.205633 |
| sulphates | 0.183006 | -0.260987 | 0.31277 | 0.005527 | 0.37126 | 0.051658 | 0.042947 | 0.148506 | -0.196648 | 1.0 | 0.093595 |
| alcohol | -0.061668 | -0.202288 | 0.109903 | 0.042075 | -0.221141 | -0.069408 | -0.205654 | -0.49618 | 0.205633 | 0.093595 | 1.0 |


In [12]:
# Features with stronger correlations (|correlation coeff| > 0.6)
corr_df = feature_df.corr().unstack().reset_index()
corr_df.columns = ['feature_1', 'feature_2', 'correlation']
corr_top = corr_df[corr_df['correlation'].abs() > 0.6]
# drop self-correlations; copy so the column assignment below does not act on a view
corr_top = corr_top[corr_top['feature_1'] != corr_top['feature_2']].copy()

# get unique pairings
corr_top[['feature_1', 'feature_2']] = corr_top[['feature_1', 'feature_2']].apply(sorted, axis=1, result_type='expand')
unique_corr_df = corr_top[~corr_top.duplicated(subset=['feature_1', 'feature_2'])]

plots= []
for i in range(unique_corr_df.shape[0]):
    plot = alt.Chart(wine_df).mark_point(opacity=0.3).encode(
            alt.X(unique_corr_df.iloc[i,0]),
            alt.Y(unique_corr_df.iloc[i,1])
            ).properties(
                height = 200, 
                width = 200, 
            )
    plots.append(plot)

# adjust scale and plot 
grid_plot = alt.vconcat(
    alt.hconcat(plots[0], plots[1].encode(alt.X(unique_corr_df.iloc[1,0], scale=alt.Scale(domain=(0.95, 1.05))), 
                                          alt.Y(unique_corr_df.iloc[1,1]))),
    alt.hconcat(plots[2], plots[3])
)

# display results
display(unique_corr_df)

grid_plot.properties(
   title = 'Features with the Strongest Correlations'
)

| | feature_1 | feature_2 | correlation |
|---|---|---|---|
| 2 | citric acid | fixed acidity | 0.671703 |
| 7 | density | fixed acidity | 0.668047 |
| 8 | fixed acidity | pH | -0.682978 |
| 61 | free sulfur dioxide | total sulfur dioxide | 0.667666 |


The most strongly correlated feature pairs are: fixed acidity and citric acid; fixed acidity and density; fixed acidity and pH; and total sulfur dioxide and free sulfur dioxide. All have a positive correlation except pH and fixed acidity, which are negatively correlated. To reduce model dimensionality and redundancy, we may wish to remove one feature from a highly correlated pair.
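
One simple way to shortlist candidates for removal is to count how often each feature appears in a strongly correlated pair. The sketch below does this for the four pairs listed above. It only suggests candidates and is not a selection procedure used in this notebook; the variable names are illustrative.

```python
from collections import Counter

# Strongly correlated pairs (|r| > 0.6) copied from the table above.
strong_pairs = [
    ("citric acid", "fixed acidity", 0.671703),
    ("density", "fixed acidity", 0.668047),
    ("fixed acidity", "pH", -0.682978),
    ("free sulfur dioxide", "total sulfur dioxide", 0.667666),
]

# Count appearances of each feature across the strong pairs.
appearances = Counter()
for feature_1, feature_2, _ in strong_pairs:
    appearances.update([feature_1, feature_2])

for feature, count in appearances.most_common():
    print(f"{feature}: appears in {count} strong pair(s)")
```

Fixed acidity appears in three of the four pairs, so it overlaps with the largest number of other features.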

## Results 

In [13]:
import sys
from hashlib import sha1

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [14]:
df = pd.read_csv('data/winequality-red.csv', sep = ';')



In [18]:
has_na = df.isna().any().any()
has_na

False

In [19]:
X = df.drop(columns = ['quality'])
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 522)


In [20]:
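# Note: Ridge is a regressor, so its default cross-validation score is R^2,
# whereas the classifiers below are scored by accuracy.
models1 = {
    "dummy": DummyClassifier(random_state=522),
    "KNN": KNeighborsClassifier(),
    "RBF SVM": SVC(random_state=123),
    "Ridge model": Ridge(),
    "linear SVC": SVC(kernel="linear"),
    "decision tree": DecisionTreeClassifier(),
}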

In [21]:
results1 = []

# Cross-validate each model inside a pipeline so that scaling is fit on the training folds only
for name, model in models1.items():
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_validate(pipeline, X_train, y_train, return_train_score=True, n_jobs=-1)
    # average the per-fold scores and timings
    results1.append({
        'model': name,
        'test_score': np.mean(scores['test_score']),
        'train_score': np.mean(scores['train_score']),
        'fit_time': np.mean(scores['fit_time']),
        'score_time': np.mean(scores['score_time'])
    })

results_df1 = pd.DataFrame(results1)
results_df1.set_index('model', inplace=True)
results_df1

| model | test_score | train_score | fit_time | score_time |
|---|---|---|---|---|
| dummy | 0.440575 | 0.440572 | 0.0 | 0.0 |
| KNN | 0.551381 | 0.699952 | 0.003199 | 0.00552 |
| RBF SVM | 0.605898 | 0.67694 | 0.023098 | 0.008416 |
| Ridge model | 0.332069 | 0.356684 | 0.01608 | 0.000624 |
| linear SVC | 0.570179 | 0.596963 | 0.028464 | 0.002535 |
| decision tree | 0.588877 | 1.0 | 0.012523 | 0.0 |


In [22]:
pipe_ridge = make_pipeline(StandardScaler(), Ridge())
pipe_ridge.fit(X_train, y_train)

coeffs = pipe_ridge.named_steps["ridge"].coef_

# Label each coefficient with its feature name from X_train
coeff_df = pd.DataFrame(data=coeffs, index=X_train.columns, columns=["Coefficients"])
sorted_coeff_df = coeff_df.sort_values(by="Coefficients", ascending=False)

sorted_coeff_df

| | Coefficients |
|---|---|
| alcohol | 0.315652 |
| sulphates | 0.153658 |
| free sulfur dioxide | 0.052899 |
| fixed acidity | 0.045094 |
| residual sugar | 0.0349 |
| density | -0.019654 |
| citric acid | -0.046625 |
| chlorides | -0.057411 |
| pH | -0.063352 |
| total sulfur dioxide | -0.117559 |


In [23]:
# Drop four features with small Ridge coefficients (|coefficient| of roughly 0.05 or less):
# free sulfur dioxide, residual sugar, density, citric acid.
# Note: fixed acidity (0.045) is also below 0.05 but is kept, while free sulfur dioxide (0.053) is slightly above it.
X = df.drop(columns=['quality', 'free sulfur dioxide', 'residual sugar', 'density', 'citric acid'])
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=522)
results1 = []

for name, model in models1.items():
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_validate(pipeline, X_train, y_train, return_train_score=True, n_jobs=-1)
    # average the per-fold scores and timings
    results1.append({
        'model': name,
        'test_score': np.mean(scores['test_score']),
        'train_score': np.mean(scores['train_score']),
        'fit_time': np.mean(scores['fit_time']),
        'score_time': np.mean(scores['score_time'])
    })

results_df1 = pd.DataFrame(results1)
results_df1.set_index('model', inplace=True)
results_df1

| model | test_score | train_score | fit_time | score_time |
|---|---|---|---|---|
| dummy | 0.440575 | 0.440572 | 0.000985 | 0.00022 |
| KNN | 0.560314 | 0.703306 | 0.001208 | 0.017289 |
| RBF SVM | 0.588057 | 0.652142 | 0.037086 | 0.016497 |
| Ridge model | 0.332776 | 0.350687 | 0.006246 | 0.0 |
| linear SVC | 0.555898 | 0.574396 | 0.039479 | 0.001107 |
| decision tree | 0.588917 | 1.0 | 0.013539 | 0.000627 |


In [24]:
# Refit Ridge on the reduced feature set and inspect its coefficients
pipe_ridge = make_pipeline(StandardScaler(), Ridge())
pipe_ridge.fit(X_train, y_train)

coeffs = pipe_ridge.named_steps["ridge"].coef_

# Label each coefficient with its feature name from X_train
coeff_df = pd.DataFrame(data=coeffs, index=X_train.columns, columns=["Coefficients"])
sorted_coeff_df = coeff_df.sort_values(by="Coefficients", ascending=False)

sorted_coeff_df

| | Coefficients |
|---|---|
| alcohol | 0.32322 |
| sulphates | 0.148581 |
| fixed acidity | 0.004924 |
| pH | -0.064149 |
| chlorides | -0.064908 |
| total sulfur dioxide | -0.081622 |
| volatile acidity | -0.187494 |


### Analysis

In this study, we used several machine learning models to predict the quality of wine from its chemical properties: a dummy baseline, K-Nearest Neighbors (KNN), a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel, Ridge regression, a linear Support Vector Classifier (SVC), and a decision tree. Each model was evaluated with 5-fold cross-validation on the training set, using Python with pandas and scikit-learn. Ridge is a regressor, so its score is R² rather than accuracy and is not directly comparable with the classifier scores.

The initial cross-validation scores of each model, taken from the first results table above, were as follows:

- Dummy model: 0.440575
- KNN: 0.551381
- RBF SVM: 0.605898
- Ridge model: 0.332069
- Linear SVC: 0.570179
- Decision tree: 0.588877

To streamline the feature set, we examined the Ridge coefficients and excluded four features with small coefficients (around 0.05 or less in magnitude): 'free sulfur dioxide', 'residual sugar', 'density', and 'citric acid'. The updated cross-validation scores were:

- Dummy model: 0.440575 (unchanged)
- KNN: 0.560314
- RBF SVM: 0.588057
- Ridge model: 0.332776
- Linear SVC: 0.555898
- Decision tree: 0.588917

The new coefficients for the remaining variables were:

- Alcohol: 0.323220
- Sulphates: 0.148581
- Fixed acidity: 0.004924
- pH: -0.064149
- Chlorides: -0.064908
- Total sulfur dioxide: -0.081622
- Volatile acidity: -0.187494

With the reduced feature set, the KNN score improved slightly (0.551 to 0.560), the decision tree and Ridge scores were essentially unchanged, and the RBF SVM and linear SVC scores decreased (0.606 to 0.588 and 0.570 to 0.556, respectively). The data used for this study encompass various physicochemical properties of wine, such as acidity, sulfur dioxide levels, and alcohol content, which are believed to influence its quality.

Based on these results, we focus on the decision tree, linear SVC, and RBF SVM models for hyperparameter optimization.
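
When comparing the scores above, note that scikit-learn's default `score` is accuracy for classifiers but the coefficient of determination (R²) for regressors such as `Ridge`, so the Ridge value is on a different scale. The sketch below computes both metrics by hand on a small, made-up set of labels and predictions (hypothetical values, not taken from the dataset) to show how they differ. The function names `accuracy` and `r_squared` are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical quality labels and predictions, for illustration only.
y_true = [5, 6, 5, 7, 6, 5]
y_class_pred = [5, 6, 6, 6, 6, 5]            # discrete class predictions
y_reg_pred = [5.2, 5.8, 5.4, 6.5, 5.9, 5.1]  # continuous regression output

print(f"accuracy of class predictions: {accuracy(y_true, y_class_pred):.3f}")
print(f"R^2 of regression predictions: {r_squared(y_true, y_reg_pred):.3f}")
```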

### Hyperparameter Tuning
Three models are tuned: decision tree, linear SVC, and RBF SVM. These were the three classifiers with the highest cross-validation accuracy on the full feature set (0.589, 0.570, and 0.606, respectively).

In [25]:
from sklearn.model_selection import GridSearchCV
import altair as alt
import matplotlib.pyplot as plt

In [26]:
#Decision Tree Tuning
pipe_dt = make_pipeline(StandardScaler(), DecisionTreeClassifier())

dt_param_grid = {
    "decisiontreeclassifier__criterion": ['gini', 'entropy'],
    "decisiontreeclassifier__max_depth": 2 ** np.arange(8)
}

dt_gs = GridSearchCV(pipe_dt, param_grid=dt_param_grid, n_jobs=-1, return_train_score=True)

In [27]:
dt_gs.fit(X_train, y_train)

In [28]:
dt_df = (pd.DataFrame(dt_gs.cv_results_)[
    [
        "mean_test_score",
        "mean_train_score",
        "param_decisiontreeclassifier__criterion",
        "param_decisiontreeclassifier__max_depth",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index())

In [29]:
dt_df.head(1)

| rank_test_score | mean_test_score | mean_train_score | param_decisiontreeclassifier__criterion | param_decisiontreeclassifier__max_depth | mean_fit_time |
|---|---|---|---|---|---|
| 1 | 0.596961 | 1.0 | gini | 128 | 0.010281 |


In [30]:
plot = alt.Chart(dt_df, title="Validation Score for Different Parameters for Decision Tree").mark_line().encode(x=alt.X('param_decisiontreeclassifier__max_depth', title='max_depth'), 
                                    y=alt.Y('mean_test_score', title='Validation Score').scale(zero=False),
                                    color=alt.Color('param_decisiontreeclassifier__criterion', title='criterion'))
plot + alt.Chart(dt_df.head(1)).mark_text(dy=-5).encode(
    x='param_decisiontreeclassifier__max_depth',
    y="mean_test_score",
    text=alt.value('Max'))

In [31]:
#Linear SVC Tuning
pipe_lsvc = make_pipeline(StandardScaler(), SVC(kernel = 'linear'))

lsvc_param_grid = {
    "svc__C": [0.001, 0.01, 0.1, 1.0, 10, 100, 1000]
}

lsvc_gs = GridSearchCV(pipe_lsvc, param_grid=lsvc_param_grid, n_jobs=-1, return_train_score=True)

In [32]:
lsvc_gs.fit(X_train, y_train)

In [33]:
lsvc_df = (pd.DataFrame(lsvc_gs.cv_results_)[
    [
        "mean_test_score",
        "mean_train_score",
        "param_svc__C",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index())

In [34]:
lsvc_df.head(1)

| rank_test_score | mean_test_score | mean_train_score | param_svc__C | mean_fit_time |
|---|---|---|---|---|
| 1 | 0.557707 | 0.568813 | 0.1 | 0.023819 |


In [35]:
plot = alt.Chart(lsvc_df, title="Validation Score for Different Parameters for Linear SVM").mark_line().encode(
    x=alt.X('param_svc__C', title='C (log scale)', scale=alt.Scale(type='log')),
    y=alt.Y('mean_test_score', title='Validation Score').scale(zero=False))

plot + alt.Chart(lsvc_df.head(1)).mark_text(dy=-5).encode(
    x='param_svc__C',
    y="mean_test_score",
    text=alt.value('Max'))

In [36]:
#RBF SVM Tuning
pipe_rbf = make_pipeline(StandardScaler(), SVC(random_state=123))

rbf_param_grid = {
    "svc__gamma": [0.001, 0.01, 0.1, 1.0, 10, 100],
    "svc__C": [0.001, 0.01, 0.1, 1.0, 10, 100]
}

rbf_gs = GridSearchCV(pipe_rbf, param_grid=rbf_param_grid, n_jobs=-1, return_train_score=True)

In [37]:
rbf_gs.fit(X_train, y_train)

In [38]:
rbf_df = (pd.DataFrame(rbf_gs.cv_results_)[
    [
        "mean_test_score",
        "mean_train_score",
        "param_svc__gamma",
        "param_svc__C",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index())

In [39]:
rbf_df.head(1)

| rank_test_score | mean_test_score | mean_train_score | param_svc__gamma | param_svc__C | mean_fit_time |
|---|---|---|---|---|---|
| 1 | 0.621092 | 0.841597 | 1.0 | 1.0 | 0.097549 |


In [40]:
plot = alt.Chart(rbf_df, title="Validation Score for Different Parameters for RBF SVM").mark_line().encode(
    x=alt.X('param_svc__C', title='C (log scale)', scale=alt.Scale(type='log')),
    y=alt.Y('mean_test_score', title='Validation Score').scale(zero=False),
    color=alt.Color('param_svc__gamma:O', title='gamma').scale(scheme='tableau20')
)

plot + alt.Chart(rbf_df.head(1)).mark_text(dy=-5).encode(
    x='param_svc__C',
    y="mean_test_score",
    text=alt.value('Max'))

Based on the above findings, the best parameters in terms of validation score are: the gini criterion with a `max_depth` of 128 for the decision tree, a `C` of 0.1 for the linear SVM, and a `gamma` and `C` of 1.0 each for the RBF SVM. We now refit these three best models on the training set and assess their performance on the test set.
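
The size of each search can be worked out from the grids defined above. The sketch below enumerates the parameter combinations with `itertools.product` and multiplies by five folds, assuming `GridSearchCV`'s default 5-fold cross-validation (no `cv` argument is passed in the notebook). The dictionary keys are shortened, illustrative names rather than the pipeline parameter names.

```python
from itertools import product

# Parameter grids mirroring the ones defined in the tuning cells.
grids = {
    "decision tree": {
        "criterion": ["gini", "entropy"],
        "max_depth": [2 ** i for i in range(8)],  # 1, 2, 4, ..., 128
    },
    "linear SVC": {"C": [0.001, 0.01, 0.1, 1.0, 10, 100, 1000]},
    "RBF SVM": {
        "gamma": [0.001, 0.01, 0.1, 1.0, 10, 100],
        "C": [0.001, 0.01, 0.1, 1.0, 10, 100],
    },
}

n_folds = 5  # assumed: the GridSearchCV default when cv is not specified

fit_counts = {}
for model, grid in grids.items():
    combinations = list(product(*grid.values()))
    fit_counts[model] = len(combinations) * n_folds
    print(f"{model}: {len(combinations)} combinations, {fit_counts[model]} cross-validation fits")
```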

In [41]:
best_dt_pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(criterion='gini', max_depth=128))
best_lsvm_pipe = make_pipeline(StandardScaler(), SVC(kernel='linear', C=0.1))
best_rbf_pipe = make_pipeline(StandardScaler(), SVC(gamma=1.0, C=1.0))

best_dt_pipe.fit(X_train, y_train)
best_lsvm_pipe.fit(X_train, y_train)
best_rbf_pipe.fit(X_train, y_train)

In [42]:
best_dt_pipe.score(X_test, y_test)

0.5958333333333333

In [43]:
best_lsvm_pipe.score(X_test, y_test)

0.6145833333333334

In [44]:
best_rbf_pipe.score(X_test, y_test)

0.6666666666666666

Among the three models, RBF SVM performs best on the test set, with an accuracy of 0.667, compared with 0.615 for the linear SVM and 0.596 for the decision tree.
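
Because accuracy is the fraction of correct predictions, the test scores above can be translated into counts. The sketch assumes the test set has 480 rows, which is what a 30% split of 1599 rows rounds up to; that size is inferred here rather than printed by the notebook, and the variable names are illustrative.

```python
from math import ceil

n_rows = 1599
test_size = 0.3
n_test = ceil(n_rows * test_size)  # assumed test-set size: 480

# Test accuracies copied from the outputs above.
test_scores = {
    "decision tree": 0.5958333333333333,
    "linear SVC": 0.6145833333333334,
    "RBF SVM": 0.6666666666666666,
}

# Convert each accuracy into an approximate number of correct predictions.
correct = {model: round(score * n_test) for model, score in test_scores.items()}

for model, n_correct in correct.items():
    print(f"{model}: {n_correct} of {n_test} test wines classified correctly")
```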

## Discussion 

## References 

1. https://archive.ics.uci.edu/dataset/186/wine+quality
2. http://www3.dsi.uminho.pt/pcortez/wine/