# Prediction of Red Wine Quality (93.215%)

Wine classification is a difficult task since taste is the least understood of the human senses. A good wine quality prediction can be very useful in the certification phase, since currently the sensory analysis is performed by human tasters, being clearly a subjective approach. 

An automatic predictive system can be integrated into a decision support system, helping the speed and quality of the oenologist performance. Furthermore, a feature selection process can help to analyze the impact of the analytical tests. If it is concluded that several input variables are highly relevant to predict the wine quality, since in the production process some variables can be controlled, this information can be used to improve the wine quality.

(This introduction is extracted from "Modeling Wine Preferences from Physicochemical Properties using Fuzzy Techniques".)

Wine is not the only industry where this topic is intriguing. For example, vehicle ride and handling tests are also conducted by human evaluators, so it is hard to find a correlation between specification and performance, even though some sensors that help to evaluate performance objectively work well during the tests. In addition, there are many external variables. If we can build a predictive model for such tests, we can research and develop new products more efficiently. In other words, it would reduce cost and development time, and could even improve performance.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')
import os
print(os.listdir("./data"))


In [None]:
df = pd.read_csv('./data/wine_X_train.csv')
df.info()

## - Data Analysis
Define the grade as follows:

             1 (good) if quality is 7 or higher

             0 (not good) if quality is less than 7

### 1. Correlation

In [None]:
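# 1 = good (quality >= 7), 0 = not good (quality < 7)
# A single vectorized assignment avoids pandas' chained-assignment pitfalls.
df['grade'] = (df['quality'] >= 7).astype(int)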

plt.figure(figsize = (8,8))
labels = df.grade.value_counts().index
plt.pie(df.grade.value_counts(), autopct='%1.1f%%')
plt.legend(labels, loc="best")
plt.axis('equal')
plt.title('Quality Pie Chart')
plt.show()
print('The good quality wines count for ',round(df.grade.value_counts(normalize=True)[1]*100,1),'%.')

In [None]:
sns.pairplot(df, hue='grade')
plt.show()

In the two groups (grade=0 and grade=1), each attribute has almost the same histogram.

In [None]:
# Use the built-in bool: the np.bool alias was removed in recent NumPy versions.
mask = np.zeros_like(df.corr(), dtype=bool)
mask[np.triu_indices_from(mask)] = True

plt.subplots(figsize = (12,12))
sns.heatmap(df.corr(), 
            annot=True,
            mask = mask,
            cmap = 'RdBu_r',
            linewidths=0.1, 
            linecolor='white',
            vmax = .9,
            square=True)
plt.title("Correlations Among Features", y = 1.03,fontsize = 20)
plt.show()

Judging from the pairplot and the correlation heatmap, no single feature shows a distinctly strong correlation with quality.
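
To check this numerically rather than by eye, one option is to rank the features by the absolute value of their Pearson correlation with `quality`. The sketch below is only an illustration: the helper name `rank_by_correlation` and the small synthetic frame `demo` are assumptions made for this example and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def rank_by_correlation(frame, target):
    """Return each feature's correlation with `target`, strongest first."""
    # Keep numeric columns only so that corr() behaves the same across pandas versions.
    corr = frame.select_dtypes('number').corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic stand-in for the wine data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'alcohol': rng.normal(10, 1, 200),
    'chlorides': rng.normal(0.08, 0.02, 200),
})
# Make 'quality' depend on 'alcohol' only, plus noise.
demo['quality'] = (0.8 * (demo['alcohol'] - 10) + rng.normal(0, 0.5, 200)).round() + 5

ranking = rank_by_correlation(demo, 'quality')
print(ranking)
```

With the notebook's data, the call would presumably be `rank_by_correlation(df.drop(columns='grade'), 'quality')`, since `grade` is derived from `quality` and would otherwise dominate the ranking.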

### 2. Average Radar Chart

In [None]:
good = df[df.grade == 1]
notgood = df[df.grade == 0]

In [None]:
drop_items = ['quality','grade']
g1 = pd.DataFrame(good.drop(drop_items, axis=1).mean(), columns=['Good']).T
g2 = pd.DataFrame(notgood.drop(drop_items, axis=1).mean(), columns=['Not Good']).T
total = pd.DataFrame(df.drop(drop_items, axis=1).mean(), columns=['Total Average']).T
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
data = pd.concat([g1, g2, total])

In [None]:
# Express each group mean as a percentage of the total average
temp1 = data.values.reshape((3, 11))
standard = data.loc['Total Average'].values.reshape((1, 11))
temp = 100* temp1 / standard
data_percentage = pd.DataFrame(temp, columns = data.columns.values.tolist())

In [None]:
from math import pi
Attributes = list(data_percentage)
AttNo = len(Attributes)

# One angle per attribute; repeat the first point to close each polygon
angles = [n / float(AttNo) * 2 * pi for n in range(AttNo)]
angles += angles[:1]
angles2 = angles3 = angles  # all three series share the same angles

values = data_percentage.iloc[0].tolist()  # Good
values += values[:1]

values2 = data_percentage.iloc[1].tolist()  # Not Good
values2 += values2[:1]

values3 = data_percentage.iloc[2].tolist()  # Total Average
values3 += values3[:1]

plt.figure(figsize=(10,10))
ax = plt.subplot(111, polar=True)
plt.xticks(angles[:-1],Attributes)

# Good 
ax.plot(angles, values, color = 'r')
ax.fill(angles, values, 'red', alpha=0.1)

# Not Good
ax.plot(angles2, values2, color = 'b')
ax.fill(angles2, values2, 'blue', alpha=0.1)

# Total Average
ax.plot(angles3, values3, color = 'black')
ax.fill(angles3, values3, 'black', alpha=0.1)

plt.figtext(0.4,0.82,'Good Quality Average',color='red')
plt.figtext(0.28,0.48,'Not Good Quality Average',color='blue')
plt.figtext(0.36,0.23,'Total Average',color='black')
plt.show()

In [None]:
data_percentage[:2].T.plot(kind='bar',figsize=(15,5), color=['red','blue'])
plt.title('Average of the two groups')
plt.legend(('Good','Not Good'))
x = np.linspace(-10,100,10)
y = 100*np.ones(10)
plt.plot(x,y,'green')
plt.show()

In [None]:
plt.figure(figsize = (14,10))
plt.subplots_adjust(hspace = 0.3, wspace = 0.3)

# Recent seaborn versions require x and y to be passed as keywords.
plt.subplot(241)
sns.kdeplot(x=df['citric acid'], y=df['quality'])
plt.subplot(242)
sns.kdeplot(x=df['sulphates'], y=df['quality'])
plt.subplot(243)
sns.kdeplot(x=df['alcohol'], y=df['quality'])
plt.subplot(244)
sns.kdeplot(x=df['residual sugar'], y=df['quality'])

plt.subplot(245)
sns.kdeplot(x=df['free sulfur dioxide'], y=df['quality'])
plt.subplot(246)
sns.kdeplot(x=df['volatile acidity'], y=df['quality'])
plt.subplot(247)
sns.kdeplot(x=df['total sulfur dioxide'], y=df['quality'])
plt.subplot(248)
sns.kdeplot(x=df['chlorides'], y=df['quality'])
plt.show()

On average, wines that contain more citric acid, sulphates, or alcohol are rated better, while wines that contain less free sulfur dioxide, volatile acidity, or total sulfur dioxide are rated better.
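
The same comparison can be summarized as a single number per feature: the ratio of the good-group mean to the not-good-group mean, where values above 1 indicate that good wines contain more of that component on average. This is a minimal sketch; the helper `group_mean_ratio` and the toy frame `demo` are hypothetical and only illustrate the calculation.

```python
import pandas as pd

def group_mean_ratio(frame, group_col, drop_cols=()):
    """Ratio of the grade-1 mean to the grade-0 mean for each feature."""
    means = frame.drop(columns=list(drop_cols)).groupby(group_col).mean()
    return (means.loc[1] / means.loc[0]).sort_values(ascending=False)

# Toy data (made-up values) with the same column naming as the notebook.
demo = pd.DataFrame({
    'alcohol': [9.0, 9.5, 10.0, 12.0, 12.5],
    'volatile acidity': [0.8, 0.6, 0.7, 0.3, 0.4],
    'grade': [0, 0, 0, 1, 1],
})

ratio = group_mean_ratio(demo, 'grade')
print(ratio)
```

For the notebook's frame, the call would presumably be `group_mean_ratio(df, 'grade', drop_cols=['quality'])`.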

## - Data Prediction

Several models are used for the prediction.

    1. Decision Tree
    2. Random Forest
    3. KNeighbors
    4. GaussianNB
    5. SVC
    6. XGBoost

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import xgboost
from sklearn.metrics import accuracy_score

df_train_features = df.drop(['quality','grade'], axis =1)
n = 11

x_train, x_test, y_train, y_test = train_test_split(df_train_features, df['grade'], test_size=0.1, random_state=7)

x_train_mat = x_train.values.reshape((len(x_train), n))
x_test_mat = x_test.values.reshape((len(x_test), n))


In [None]:
##############################################################################
# Create Predictive Models
##############################################################################
print('Start Predicting...')

decision_tree = DecisionTreeClassifier()
decision_tree.fit(x_train_mat,y_train)
tree_pred = decision_tree.predict(x_test_mat)

rf = RandomForestClassifier()
rf.fit(x_train_mat,y_train)
rf_pred = rf.predict(x_test_mat)

KN = KNeighborsClassifier()
KN.fit(x_train_mat,y_train)
KN_pred = KN.predict(x_test_mat)

Gaussian = GaussianNB()
Gaussian.fit(x_train_mat,y_train)
Gaussian_pred = Gaussian.predict(x_test_mat)

svc = SVC()
svc.fit(x_train_mat,y_train)
svc_pred = svc.predict(x_test_mat)

xgb = xgboost.XGBClassifier()
xgb.fit(x_train_mat,y_train)
xgb_pred = xgb.predict(x_test_mat)

print('...Complete')

In [None]:
##############################################################################
# Obtain Accuracy Scores for the test set
##############################################################################
print('Decision Tree:', accuracy_score(y_test, tree_pred)*100,'%')
print('Random Forest:', accuracy_score(y_test, rf_pred)*100,'%')
print('KNeighbors:',accuracy_score(y_test, KN_pred)*100,'%')
print('GaussianNB:',accuracy_score(y_test, Gaussian_pred)*100,'%')
print('SVC:',accuracy_score(y_test, svc_pred)*100,'%')
print('XGB:',accuracy_score(y_test, xgb_pred)*100,'%')

The random forest is selected as a predictive model.
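
Because the two grades are imbalanced (see the pie chart above) and the scores come from a single 10% hold-out split, it may help to compare the random forest with a majority-class baseline under stratified k-fold cross-validation. The sketch below is one possible way to do that, not the procedure used in this notebook: the helper `cv_accuracy`, the fold count, and the synthetic data from `make_classification` are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def cv_accuracy(model, X, y, n_splits=5, seed=7):
    """Mean and standard deviation of stratified k-fold accuracy."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=folds, scoring='accuracy')
    return scores.mean(), scores.std()

# Synthetic, imbalanced stand-in for the wine features (about 85% class 0).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)

# Baseline: always predict the most frequent class.
baseline_mean, _ = cv_accuracy(DummyClassifier(strategy='most_frequent'), X_demo, y_demo)
rf_mean, rf_std = cv_accuracy(
    RandomForestClassifier(n_estimators=50, random_state=0), X_demo, y_demo
)

print(f'Majority-class baseline: {baseline_mean:.3f}')
print(f'Random forest: {rf_mean:.3f} +/- {rf_std:.3f}')
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by `df_train_features` and `df['grade']`. A model is only informative to the extent that it beats the baseline accuracy.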

In [None]:
##############################################################################
# Obtain Accuracy Scores
# Each classifier has a different random state.
##############################################################################
k = [10,20,30,40,50]
for i in k:
    rf_tune = RandomForestClassifier(n_estimators=50, random_state=i)
    rf_tune.fit(x_train_mat,y_train)
    y_pred = rf_tune.predict(x_test_mat)
    print(accuracy_score(y_test, y_pred)*100,'%')

In [None]:
##############################################################################
# Score on the full data set (features and labels of every row).
# Note: this includes the rows the model was trained on.
##############################################################################
x_all = df_train_features.values.reshape((len(df_train_features), n))
y_all = df['grade'].values

k = [10,20,30,40,50]
for i in k:
    rf_tune = RandomForestClassifier(n_estimators=50, random_state=i)
    rf_tune.fit(x_train_mat,y_train)
    yy_pred = rf_tune.predict(x_all)
    print(accuracy_score(y_all, yy_pred)*100,'%')

### * Visualization for the predicted values

In [None]:
plt.figure(figsize = (20,8))
domain = np.linspace(1,100,len(rf_pred))  # one x position per test sample
plt.plot(domain, rf_pred,'o')
plt.plot(domain, y_test,'o')
plt.legend(('Prediction','Actual value'))
plt.show()

This is my first notebook on sensory tests, so its purpose is to share useful information and to discuss the prediction of subjective test results, not to deliver authoritative knowledge. Thank you for reading this notebook.