# Predicting Wine Quality

People often use price as a proxy for a wine's value, but a truly good wine goes deeper than that. With this dataset, we explore the measured characteristics of each wine and attempt to predict both its quality score and whether the wine is considered "good" by the general public.

In [82]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import requests
import io
from sklearn.neural_network import MLPClassifier
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score
from sklearn import ensemble

In [4]:
wine = pd.read_csv('winequality.csv')

In [7]:
wine.head()

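| Unnamed: 0 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | good | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 | red |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 0 | red |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 0 | red |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 0 | red |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 | red |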


In [8]:
wine.good.value_counts()

0    5220
1    1277
Name: good, dtype: int64
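Because `good` is imbalanced, any accuracy reported for it is easier to read next to a majority-class baseline. The following is a minimal sketch that rebuilds the labels from the counts printed above instead of loading the CSV; `X_placeholder` is a hypothetical stand-in, since `DummyClassifier` ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Rebuild the label vector from the value counts above: 5220 zeros, 1277 ones.
y = np.array([0] * 5220 + [1] * 1277)

# DummyClassifier never looks at the features, so one placeholder column is enough.
X_placeholder = np.zeros((len(y), 1))

# Always predict the most frequent class (here, 0 = "not good").
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y)

baseline_accuracy = baseline.score(X_placeholder, y)
print(round(baseline_accuracy, 4))
```

The baseline is 5220 / 6497, or roughly 0.803. The training accuracy printed for the MLP on `good` later in this notebook (0.8034477451131291) is the same value, which suggests that model may be predicting "not good" for every row.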

In [9]:
wine.quality.value_counts()

6    2836
5    2138
7    1079
4     216
8     193
3      30
9       5
Name: quality, dtype: int64

In [14]:
# One-hot encode the color column so the models receive numeric features.
wine = pd.get_dummies(wine, columns=['color'])

In [17]:
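# `good` appears to be derived from `quality` (the 1277 good wines match the
# 1079 + 193 + 5 rows with quality >= 7), so both targets are dropped from
# the features. Note that 'Unnamed: 0', which looks like the row index saved
# in the CSV, is still part of X here.
X = wine.drop(columns=['quality','good'])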
Y_qual = wine.quality
Y_good = wine.good

In [75]:
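mlp_g = MLPClassifier(hidden_layer_sizes=(1000,10), activation='tanh')
mlp_g.fit(X, Y_good)

# Accuracy on the training data itself, so this is an optimistic estimate.
print(mlp_g.score(X, Y_good))

# 5-fold cross-validated accuracy; each fold refits a fresh copy of mlp_g.
mlpg = cross_val_score(mlp_g, X, Y_good, cv=5)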

0.8034477451131291


In [97]:
print(mlpg.std())

0.030407658028022724
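The wine features sit on very different scales (density is close to 1, while total sulfur dioxide is in the tens), and multi-layer perceptrons are generally sensitive to that. One common remedy is to standardize the inputs inside a pipeline so the scaler is refit on each training fold. The sketch below is only an illustration under stated assumptions: `X_demo` and `y_demo` are synthetic stand-ins rather than the wine data, and the layer size and `max_iter` are arbitrary choices, not the settings used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the wine features (assumption, not real data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, weights=[0.8, 0.2], random_state=0
)
# Stretch the columns to very different ranges to mimic unscaled measurements.
X_demo = X_demo * np.logspace(0, 3, X_demo.shape[1])

# The scaler is fit on each training fold only, so no information leaks
# from the held-out fold into the preprocessing step.
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0),
)

scores = cross_val_score(scaled_mlp, X_demo, y_demo, cv=5)
print(scores.mean(), scores.std())
```

With the real data, the same pipeline could be passed to `cross_val_score` in place of `mlp_g`; whether it helps here would need to be checked.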


In [103]:
rfc = ensemble.RandomForestClassifier(n_estimators=20, criterion='entropy')
rfc.fit(X, Y_good)

rfc_csv = cross_val_score(rfc, X, Y_good, cv=5)

In [104]:
print(rfc_csv.mean())

0.8120656126014094
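To compare the two models on both mean accuracy and spread, it can help to run them over identical folds. For a classifier, `cv=5` uses stratified folds without shuffling, and the `head()` output shows only red wines, so the rows may be grouped by color; shuffling the folds guards against that. This is a sketch under assumptions: the data is synthetic, and the MLP size is a small hypothetical choice to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic, imbalanced stand-in for the wine data (assumption, not real data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, weights=[0.8, 0.2], random_state=0
)

# Shuffled, stratified folds shared by both models so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "random_forest": RandomForestClassifier(
        n_estimators=20, criterion="entropy", random_state=0
    ),
    "mlp": MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0),
}

# Collect (mean, std) of the fold accuracies for each model.
summary = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    summary[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: mean={summary[name][0]:.3f} std={summary[name][1]:.3f}")
```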


In [90]:
mlp_q = MLPClassifier(hidden_layer_sizes=(1000,))
mlp_q.fit(X, Y_qual)

print(mlp_q.score(X, Y_qual))

mlpq = cross_val_score(mlp_q, X, Y_qual, cv=5)

0.5300908111436048


In [91]:
print(mlpq.mean())

0.4334857357133443


In [98]:
# Despite the `rfr` name, this is a classifier over the quality scores.
rfr = ensemble.RandomForestClassifier(n_estimators=5)
rfr.fit(X, Y_qual)

rfrq = cross_val_score(rfr, X, Y_qual, cv=5)

In [99]:
print(rfrq.mean())

0.43839762164643786
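Both quality models above treat the scores as unordered classes, so predicting a 6 for a true 5 counts the same as predicting a 9. An alternative, not used in this notebook, is to treat quality as a number and report the mean absolute error. The sketch below assumes synthetic integer scores clipped to the 3 to 9 range seen above; `X_demo`, `y_demo`, and the forest settings are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic features and integer "quality" scores in 3..9 (assumption, not real data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
noise = 0.5 * rng.normal(size=400)
y_demo = np.clip(np.rint(6 + X_demo[:, 0] + noise), 3, 9).astype(int)

reg = RandomForestRegressor(n_estimators=20, random_state=0)

# scikit-learn reports errors as negative scores so that higher is always better.
neg_mae = cross_val_score(
    reg, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)
mae = -neg_mae.mean()
print(f"mean absolute error: {mae:.3f} quality points")
```

An error measured in quality points can be easier to interpret than exact-match accuracy when most wines score 5, 6, or 7.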


On a small data set of about 6,500 rows, the Random Forest classifier performs about the same as the multi-layer perceptron in 5-fold cross-validation (for `quality`, 0.438 versus 0.433 mean accuracy), but it runs much faster and with smaller variance. With perceptrons, however, we can slowly improve the mean score by varying both the number and the size of the hidden layers.
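Varying the number and size of the hidden layers can be automated with a grid search instead of editing the cell by hand. This is a minimal sketch on synthetic data; the candidate layouts in `param_grid`, the `max_iter` value, and the `X_demo`/`y_demo` names are assumptions chosen to keep the example fast, not the configurations tried above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the wine data (assumption, not real data).
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=0)

# Candidate architectures: one small layer, one wider layer, and two layers.
param_grid = {"hidden_layer_sizes": [(10,), (50,), (50, 10)]}

# Each candidate is scored with 5-fold cross-validated accuracy.
search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid,
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

`search.cv_results_` holds the mean and standard deviation of the fold scores for every layout, which makes it possible to see whether a larger network gives a real gain or one within fold-to-fold noise.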