# Wine Quality Project

### Introduction

This project details the steps taken to create a classifier that predicts the quality of a wine from a set of measurements taken from the wine. The algorithm implemented is a support vector machine, but we will explore the results of other types of models as well. The model is set up as a binary classifier that determines whether a wine is high quality or not. It could be adapted to be a multiclass classifier as well.

The goal of this project is to create a wine predictor that lets people input data measured from a wine and determine whether it is a high quality wine. The quality scores used in training were taken from the median score given by 3 different wine experts. People may have different opinions about the taste of a wine, but this model could be used to determine the best price of a wine and differentiate it from other similar wines. Using this model could be a cheaper alternative to bringing in wine experts to judge a new wine type.

### Data

The data was taken from the UCI Machine Learning Repository Wine Quality Data Set (https://archive.ics.uci.edu/ml/datasets/Wine+Quality). The target attribute is the `quality` field, and we will use feature selection to narrow down the other attributes. There are 6497 total wine samples. A detailed description of the data can be found in the "winequality.names" file included in this project folder.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
red = pd.read_csv('winequality-red.csv', delimiter=';')
print(red.dtypes)
red.quality.hist()

In [None]:
white = pd.read_csv('winequality-white.csv', delimiter=';')
print(white.dtypes)
white.quality.hist()

We notice that neither dataset has many samples of very high or very low quality wines; most scores are in the middle. For our case, we will treat any wine with a quality of 7 or above as high quality and any wine with a quality of 6 or below as not high quality. We will keep the two data sets separate for now, since different things may determine quality in red and white wines, but we will experiment with combining them later on.
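If the two data sets are combined later, one option is to stack them and keep track of the wine type in an extra column. The following is a minimal sketch under assumptions: the tiny frames stand in for `red` and `white`, and the `is_red` indicator column is a hypothetical name that is not part of the original data.

```python
import pandas as pd

# Tiny stand-in frames; in the notebook these would be `red` and `white`.
red_demo = pd.DataFrame({"alcohol": [9.4, 10.5], "pH": [3.51, 3.20], "quality": [5, 7]})
white_demo = pd.DataFrame({"alcohol": [8.8, 12.1], "pH": [3.00, 3.25], "quality": [6, 8]})

# `is_red` is a hypothetical indicator column so a model can still
# distinguish the two wine types after the rows are stacked.
combined = pd.concat(
    [red_demo.assign(is_red=1), white_demo.assign(is_red=0)],
    ignore_index=True,
)
print(combined)
```

Keeping the indicator lets a single model learn type-specific effects, at the cost of assuming both wine types share the same feature columns.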

#### Data Cleaning

We will need to re-encode the quality field as a binary class: 1 for high quality and 0 for not high quality. We will also check whether there are any null values in the data that need to be removed.

In [None]:
print(f"Null Values: {red.isna().sum().sum() + white.isna().sum().sum()}")

# Set value of quality to 1 if score was 7 or greater, 0 otherwise
red.quality = np.where(red.quality >= 7, 1, 0)
white.quality = np.where(white.quality >= 7, 1, 0)

red_x = red.drop('quality', axis=1)
red_y = red.quality
white_x = white.drop('quality', axis=1)
white_y = white.quality
print(red_y.sum())

#### Exploratory Data Analysis

In this section we will inspect the data and perform principal component analysis to reduce the number of features in our data.

We notice from the correlation matrix that the red wine has a high negative correlation between `pH`, `fixed acidity`, and `citric acid`. This is unsurprising since all of these are related to the acidic levels of the wine. There is also a strong positive correlation between `fixed acidity` and `density` as well as `free sulfur dioxide` and `total sulfur dioxide`. Therefore we will drop the `fixed acidity`, `citric acid`, and `free sulfur dioxide` fields.

For the white wine we see from the correlation matrix that there is a correlation between `alcohol` and `density` as well as `density` and `pH`. Similar to the red wines there is also correlation between `free sulfur dioxide` and `total sulfur dioxide`. Thus we will drop `density` and `free sulfur dioxide`. This is a surprising result as I would have expected a similar set of correlated features between the two types of wines.

In [None]:
# Explore Red data set
sns.heatmap(red_x.corr())
sns.pairplot(red_x, diag_kind='kde')
red_x.drop(['fixed acidity', 'citric acid', 'free sulfur dioxide'], axis=1, inplace=True)

In [None]:
# Explore White data set
sns.heatmap(white_x.corr())
sns.pairplot(white_x, diag_kind='kde')
white_x.drop(['density', 'free sulfur dioxide'], axis=1, inplace=True)
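The correlated pairs above were read off the heatmaps by eye. As a rough cross-check, the same pairs can be listed programmatically. This is a minimal sketch: the helper name `correlated_pairs`, the 0.6 threshold, and the synthetic demo frame are all assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.6):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    corr = df.corr()
    # Keep only the upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    strong = pairs[pairs.abs() >= threshold]
    return [(a, b, round(float(r), 3)) for (a, b), r in strong.items()]

# Synthetic stand-in data: "b" is nearly a copy of "a", "c" is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
print(correlated_pairs(demo))
```

In the notebook, the helper would need to be called on `red_x` or `white_x` before the `drop` calls, since those remove the columns in place.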

### Models

The model used is a support vector machine for binary classification (high quality or not).

In [None]:
from sklearn.svm import SVC

# Fit a linear-kernel SVM on the red wine features as a first model
svc = SVC(kernel='linear')
svc.fit(red_x, red_y)
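The cell above fits the SVM on the raw measurements. Support vector machines are sensitive to feature scale, and the wine features span very different ranges (for example `total sulfur dioxide` versus `density`), so standardizing first may help both accuracy and training time. The sketch below shows one way to do this with a scikit-learn pipeline. It runs on synthetic data from `make_classification` as a stand-in for `red_x` and `red_y`, so the printed score says nothing about the wine data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for (red_x, red_y).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

# The scaler is fit inside each CV fold, so no test-fold statistics leak in.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scores = cross_val_score(model, X_demo, y_demo, cv=3)
print(scores.mean())
```

Putting the scaler inside the pipeline, rather than scaling the whole data set up front, keeps the cross-validation estimate honest.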

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, GridSearchCV

params = {
    "C": np.logspace(-5, 5, num=10, base=2),
    "gamma": np.logspace(-5, 5, num=10, base=2),
    #"kernel": ["rbf", "sigmoid"]
    #"degree": range(0, 10)
}
red_grid = GridSearchCV(SVC(), param_grid=params, cv=3)
red_grid.fit(red_x, red_y)

print(red_grid.best_params_)
print(red_grid.best_score_)

#linear
# {'C': 0.03125, 'gamma': 0.03125}
# 0.864290181363352

# {'C': 1.4697344922755986, 'gamma': 0.3149802624737183, 'kernel': 'rbf'}
# 0.8667917448405253
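The recorded output above lists a `gamma` value for the linear run, although the linear kernel does not use `gamma`, so those fits repeat the same model. One way to avoid this is to pass `GridSearchCV` a list of grids, one per kernel. The following is a sketch on synthetic data (an assumption, standing in for `red_x` and `red_y`) with a smaller grid than the notebook uses so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for (red_x, red_y).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

c_values = np.logspace(-5, 5, num=4, base=2)
gamma_values = np.logspace(-5, 5, num=4, base=2)

# One grid per kernel: gamma is only searched where it has an effect.
param_grid = [
    {"kernel": ["linear"], "C": c_values},
    {"kernel": ["rbf"], "C": c_values, "gamma": gamma_values},
]
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=3)
grid.fit(X_demo, y_demo)

print(grid.best_params_)
print(grid.best_score_)
```

Here the search evaluates 4 linear and 16 RBF candidates instead of 32. Setting `n_jobs=-1` on `GridSearchCV` runs the fits in parallel, which may also shorten long searches.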

In [None]:
from matplotlib.colors import Normalize

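# GridSearchCV iterates over sorted parameter names with the last one ("gamma")
# varying fastest, so the flat scores reshape to rows of C and columns of gamma.
scores = np.asarray(red_grid.cv_results_["mean_test_score"]).reshape(
    len(red_grid.param_grid["C"]), len(red_grid.param_grid["gamma"])
)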

plt.figure(figsize=(10, 8))
plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
plt.imshow(scores, interpolation='nearest', cmap=plt.cm.hot,
            norm=Normalize(vmin=0.2, vmax=0.92))
plt.xlabel('gamma')
plt.ylabel('C')
plt.colorbar()
plt.xticks(np.arange(len(red_grid.param_grid["gamma"])), red_grid.param_grid["gamma"], rotation=45)
plt.yticks(np.arange(len(red_grid.param_grid["C"])), red_grid.param_grid["C"])
plt.title('Validation accuracy')
plt.show()

#### Results and Analysis

I ran through several iterations of GridSearch, which combines hyperparameter tuning with cross validation, to help select the best model. Unfortunately, each training run took a long time to complete, so I was only able to iterate over a few different combinations. I found that the linear kernel performed better than the polynomial kernel, and a C value of 32 was selected as the best in all the models.
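Because high quality wines are a minority class, accuracy on its own can look better than the model really is. A quick sanity check is to compare against a classifier that always predicts the majority class. The sketch below does this on synthetic, imbalanced data (an assumption, not the wine data) and also reports balanced accuracy, which is 0.5 for a constant prediction in a binary problem.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in with roughly 15% positives, similar in spirit to
# a "high quality" minority class.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=0
)

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "svc": SVC(kernel="rbf"),
}

results = {}
for name, model in models.items():
    results[name] = {
        metric: cross_val_score(model, X_demo, y_demo, cv=3, scoring=metric).mean()
        for metric in ("accuracy", "balanced_accuracy")
    }
    print(name, results[name])
```

If a tuned model's accuracy is close to the baseline's, the model may be predicting mostly the majority class, and a metric such as balanced accuracy gives a clearer picture.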

#### Conclusion

The model that was created was the result of hyperparameter tuning and domain knowledge. The data