# Red Wine Quality Prediction 

authors: 

## Summary 

## Introduction 

Red wines have a long history that can be traced back to the ancient Greeks. Today they are more accessible to the average person than ever, and the industry is estimated to be worth around 109.5 billion USD. Yet despite this ubiquity, most people can barely tell a good wine from a bad one, to the point where trained professionals (sommeliers) are needed to judge the difference. In this project, we use machine learning algorithms to predict the quality of a wine from the physical and chemical properties of the liquid. Such a model could give manufacturers and suppliers a more robust understanding of wine quality based on measurable properties.

## EDA

In [70]:
import sys
from hashlib import sha1
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import altair as alt
from sklearn.linear_model import Ridge
from sklearn.metrics import ConfusionMatrixDisplay 

### Dataset Description
The dataset used for the analysis is the "winequality-red.csv" file from the <a href=https://archive.ics.uci.edu/dataset/186/wine+quality>UC Irvine Machine Learning Repository</a>, which was originally referenced from <a href=http://www3.dsi.uminho.pt/pcortez/wine/>Decision Support Systems, Elsevier</a>. The dataset contains physicochemical properties (features) of red vinho verde wine samples from the north of Portugal, along with an associated wine quality score from 0 (worst) to 10 (best).

In [2]:
# Import the dataset:
df = pd.read_csv('data/winequality-red.csv', sep = ';')

# Visualize the first 5 rows:
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [3]:
# This section is to be deleted after the report is compiled.
# Determine the size of the dataset:
print(df.shape)

# Determine the value counts for each unique target (quality) value:
print(df.quality.unique())
df.quality.value_counts().sort_index()

(1599, 12)
[5 6 7 4 8 3]


quality
3     10
4     53
5    681
6    638
7    199
8     18
Name: count, dtype: int64

There are 11 feature columns representing physicochemical characteristics of the wines, such as fixed acidity, residual sugar, chlorides, and density. The dataset has 1599 rows (observations). The target is the quality column, a set of ordinal values that range from 3 to 8 in this data, although scores could go as low as 0 or as high as 10. Most observations have an "average" quality between 5 and 7, with very few observations below a score of 5 or above a score of 7, so the dataset is imbalanced.
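To make the imbalance concrete, the following minimal sketch turns the counts printed above into class shares. The counts are copied by hand from the `value_counts()` output; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts per quality score, copied from the value_counts() output above.
quality_counts = pd.Series(
    {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}, name="count"
)

# Share of each quality score in the full dataset.
quality_share = quality_counts / quality_counts.sum()
print(quality_share.round(3))

# Combined share of the two most common scores (5 and 6).
middle_share = quality_counts.loc[[5, 6]].sum() / quality_counts.sum()
print(f"Scores 5 and 6 together: {middle_share:.1%}")
```

With the real DataFrame, `df.quality.value_counts(normalize=True).sort_index()` should give the same shares directly.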

In [4]:
# View feature dtypes and counts, null values:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


None of the 11 feature columns contains null values, so no imputation is required. All feature columns have dtype float64 and the target column has dtype int64. Because the numeric target "quality" represents an ordered wine quality category (possible scores range from 0 to 10), the problem could be framed as either classification or regression; we treat it as a classification problem for our models. Finally, since the features are numeric and differ widely in magnitude, we scale them with StandardScaler() so that features with large values do not dominate features with small values.
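As a small illustration of why scaling matters here, the sketch below standardizes two columns with very different magnitudes. The values are taken from the first four rows shown earlier (total sulfur dioxide and density); the array name `X_small` is only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns from the first rows of the dataset:
# total sulfur dioxide (tens) and density (close to 1).
X_small = np.array(
    [
        [34.0, 0.9978],
        [67.0, 0.9968],
        [54.0, 0.9970],
        [60.0, 0.9980],
    ]
)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_small)

# After scaling, each column has mean 0 and unit (population) standard deviation,
# so neither column dominates distance-based models such as kNN or an RBF SVM.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```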

In [5]:
# View summary statistics:
df.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


### Visualization

We first examine the distribution of each feature using the summary statistics and histograms. Most features have a skewed distribution, and many contain outliers. Volatile acidity, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, and sulphates all have extreme outliers.
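One rough way to check the outlier claim numerically is Tukey's 1.5 × IQR rule, which is our choice for this sketch rather than a method used in the notebook. The quartiles and maxima below are copied from the `df.describe()` output; the column names `upper_fence` and `max_beyond_fence` are illustrative.

```python
import pandas as pd

# 25%, 75% and max values copied from the df.describe() output above.
summary = pd.DataFrame(
    {
        "q1": [0.39, 1.9, 0.07, 7.0, 22.0, 0.55],
        "q3": [0.64, 2.6, 0.09, 21.0, 62.0, 0.73],
        "max": [1.58, 15.5, 0.611, 72.0, 289.0, 2.0],
    },
    index=[
        "volatile acidity",
        "residual sugar",
        "chlorides",
        "free sulfur dioxide",
        "total sulfur dioxide",
        "sulphates",
    ],
)

# Tukey's rule: values above q3 + 1.5 * IQR are flagged as outliers.
summary["upper_fence"] = summary["q3"] + 1.5 * (summary["q3"] - summary["q1"])
summary["max_beyond_fence"] = summary["max"] > summary["upper_fence"]
print(summary)
```

For all six features the maximum lies well beyond the upper fence, which is consistent with the long right tails in the histograms.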

In [6]:
# Features 
feature_df = df.drop('quality', axis=1)

feature_names = list(feature_df.columns)

alt.Chart(feature_df).mark_bar().encode(
    alt.X(alt.repeat()).type('quantitative').bin(maxbins=40),
    y='count()',
).properties(
    width=150,
    height=150
).repeat(
    feature_names, 
    columns=3
)

## Methods 

Our method for model selection uses 5-fold cross-validation and hyperparameter tuning on several models: logistic regression, decision tree, kNN, and SVM with an RBF kernel. We use validation accuracy as our metric. We first use a dummy classifier to establish a baseline.

In [48]:
X = df.drop(columns = ['quality'])
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 522)

In [49]:
#Baseline
baseline = DummyClassifier(random_state=522)
pd.DataFrame(cross_validate(baseline, X_train, y_train, return_train_score=True))

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.000997,0.000973,0.441964,0.440223
1,0.000998,0.000991,0.4375,0.441341
2,0.001049,0.000959,0.4375,0.441341
3,0.000978,0.000997,0.441964,0.440223
4,0.001003,0.00103,0.443946,0.439732


In [77]:
pd.DataFrame(cross_validate(baseline, X_train, y_train, return_train_score=True)).mean()

fit_time       0.000601
score_time     0.000600
test_score     0.440575
train_score    0.440572
dtype: float64

As we can see, the baseline obtains an accuracy of around 0.44.
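The baseline score can be understood as the share of the most frequent class: with its default `strategy="prior"`, `DummyClassifier` always predicts the majority label. The sketch below reproduces this on the full-dataset counts shown in the EDA section, using a placeholder feature matrix. The result differs slightly from 0.44 because the notebook scores on folds of the 70% training split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Quality counts for the full dataset, copied from the EDA output.
counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

# Rebuild a label vector with those counts; features are irrelevant to the dummy.
y_all = np.repeat(list(counts.keys()), list(counts.values()))
X_placeholder = np.zeros((len(y_all), 1))

dummy = DummyClassifier()  # default strategy="prior": always predicts the majority class
dummy.fit(X_placeholder, y_all)

accuracy = dummy.score(X_placeholder, y_all)
majority_share = max(counts.values()) / len(y_all)
print(f"dummy accuracy: {accuracy:.4f}, majority share: {majority_share:.4f}")
```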

In [50]:
pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('LR', LogisticRegression(random_state=522))])
pipe_dt = Pipeline([('scl', StandardScaler()),
                    ('DT', DecisionTreeClassifier(random_state=522))])
pipe_knn = Pipeline([('scl', StandardScaler()),
                    ('KNN', KNeighborsClassifier())])
pipe_svm = Pipeline([('scl', StandardScaler()),
                     ('SVM', SVC(random_state=522))])

In [51]:
lr_param_grid = [{'LR__C': [0.001, 0.01, 0.1, 1.0, 10, 100, 1000],
                  'LR__class_weight': ['balanced', None]}]
dt_param_grid = [{'DT__criterion': ['gini', 'entropy'],
                  'DT__max_depth': 2 ** np.arange(8),
                  'DT__class_weight': ['balanced', None]}]
knn_param_grid = [{'KNN__n_neighbors': [1, 2, 3, 4, 5, 6]}]
svm_param_grid = [{'SVM__C': [0.001, 0.01, 0.1, 1.0, 10, 100, 1000],
                   'SVM__gamma': [0.001, 0.01, 0.1, 1.0, 10, 100, 1000],
                   'SVM__class_weight': ['balanced', None]}]
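To get a feel for the cost of each search, the following sketch counts the candidate settings in each grid with scikit-learn's `ParameterGrid`. It assumes the default of 5 folds in `GridSearchCV`, which matches the 5-fold cross-validation described above; the dictionary `param_grids` is a restatement of the grids for this example only.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

c_values = [0.001, 0.01, 0.1, 1.0, 10, 100, 1000]

# Same grids as above, keyed by a short model name for this example.
param_grids = {
    "LR": {"LR__C": c_values, "LR__class_weight": ["balanced", None]},
    "DT": {
        "DT__criterion": ["gini", "entropy"],
        "DT__max_depth": 2 ** np.arange(8),
        "DT__class_weight": ["balanced", None],
    },
    "KNN": {"KNN__n_neighbors": [1, 2, 3, 4, 5, 6]},
    "SVM": {
        "SVM__C": c_values,
        "SVM__gamma": c_values,
        "SVM__class_weight": ["balanced", None],
    },
}

n_folds = 5  # assumed: the GridSearchCV default for cv

# Number of hyperparameter combinations per model.
n_candidates = {name: len(ParameterGrid(grid)) for name, grid in param_grids.items()}
for name, n in n_candidates.items():
    print(f"{name}: {n} candidates, {n * n_folds} cross-validation fits")
```

The SVM grid is by far the largest, which is worth keeping in mind when comparing fit times across models.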

In [52]:
lr_grid_search = GridSearchCV(estimator=pipe_lr, param_grid=lr_param_grid, n_jobs=-1, return_train_score=True)

dt_grid_search = GridSearchCV(estimator=pipe_dt, param_grid=dt_param_grid, n_jobs=-1, return_train_score=True)

knn_grid_search = GridSearchCV(estimator=pipe_knn, param_grid=knn_param_grid, n_jobs=-1, return_train_score=True)

svm_grid_search = GridSearchCV(estimator=pipe_svm, param_grid=svm_param_grid, n_jobs=-1, return_train_score=True)

In [53]:
# Fit each grid search on the training data:
grids = [lr_grid_search, dt_grid_search, knn_grid_search, svm_grid_search]
for grid in grids:
    grid.fit(X_train, y_train)

In [64]:
lr_df = (pd.DataFrame(lr_grid_search.cv_results_)[
    [
        "mean_test_score",
        "mean_train_score",
        "param_LR__C",
        "param_LR__class_weight",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index())
lr_df.head(1)

rank_test_score,mean_test_score,mean_train_score,param_LR__C,param_LR__class_weight,mean_fit_time
1,0.586275,0.592495,0.1,None,0.013251


In [65]:
dt_df = (pd.DataFrame(dt_grid_search.cv_results_)[
    [
        "mean_test_score",
        "mean_train_score",
        "param_DT__criterion",
        "param_DT__max_depth",
        "param_DT__class_weight",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index())
dt_df.head(1)

rank_test_score,mean_test_score,mean_train_score,param_DT__criterion,param_DT__max_depth,param_DT__class_weight,mean_fit_time
1,0.59335,0.993743,gini,16,None,0.015172


In [66]:
knn_df = (pd.DataFrame(knn_grid_search.cv_results_)[
    [
        "mean_test_score",
        "mean_train_score",
        "param_KNN__n_neighbors",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index())
knn_df.head(1)

rank_test_score,mean_test_score,mean_train_score,param_KNN__n_neighbors,mean_fit_time
1,0.59781,1.0,1,0.003778


In [68]:
svm_df = (pd.DataFrame(svm_grid_search.cv_results_)[
    [
        "mean_test_score",
        "mean_train_score",
        "param_SVM__C",
        "param_SVM__gamma",
        "param_SVM__class_weight",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index())
svm_df.head(1)

rank_test_score,mean_test_score,mean_train_score,param_SVM__C,param_SVM__gamma,param_SVM__class_weight,mean_fit_time
1,0.613053,0.76787,1000,0.01,None,0.265334
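The best row of each search can be collected into one table for comparison. This is a minimal sketch with the scores copied by hand from the outputs above; the column `train_minus_validation` is an illustrative addition that measures the gap between training and validation accuracy.

```python
import pandas as pd

# Best mean scores per model, copied from the cross-validation outputs above.
results = pd.DataFrame(
    {
        "model": ["baseline", "logistic regression", "decision tree", "kNN", "SVM RBF"],
        "mean_test_score": [0.440575, 0.586275, 0.59335, 0.59781, 0.613053],
        "mean_train_score": [0.440572, 0.592495, 0.993743, 1.0, 0.76787],
    }
).set_index("model")

# A large gap between training and validation accuracy suggests overfitting.
results["train_minus_validation"] = (
    results["mean_train_score"] - results["mean_test_score"]
)

results = results.sort_values("mean_test_score", ascending=False)
print(results.round(3))
```

With the fitted objects available, the same numbers could be read from each search's `best_score_` and `cv_results_` instead of being typed in.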


## Discussion 

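Based on the cross-validation accuracies obtained, none of the machine learning models we tried predicts wine quality well. The best model, the SVM with an RBF kernel, reaches a mean validation accuracy of about 0.61, compared with about 0.44 for the baseline, while the decision tree and kNN models overfit heavily (mean training accuracy of 0.99 and 1.0 against validation accuracy of about 0.59 and 0.60). These features 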

## References 

1. https://archive.ics.uci.edu/dataset/186/wine+quality
2. http://www3.dsi.uminho.pt/pcortez/wine/
3. https://www.thebusinessresearchcompany.com/report/red-wine-global-market-report#:~:text=The%20global%20red%20wine%20market%20size%20grew%20from%20%24102.97%20billion,least%20in%20the%20short%20term.