# Classification: Vinho Verde Wine Quality

This project is based on the Wine Quality dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/wine+quality). It contains two datasets, one for white and one for red wine, with 12 attributes each.

The main task is to build a model that predicts the quality of a wine, which is a score between 3 and 9. I assume the original scale runs from 0 to 10, but our dataset contains no wines scored 0, 1, 2 or 10.

The main difference between the Titanic project and this one is that the former has only 2 possible outcomes to predict: survived or not. Since classifying wines into 7 possible outcomes is a much harder task, we will look for a number of classes that still carries useful information while allowing the model to reach good accuracy.
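One way to reduce the number of classes is to bin the quality scores. The sketch below uses `pd.cut` on toy scores; the bin edges and labels are purely illustrative assumptions, not the grouping this project settles on.

```python
import pandas as pd

# Toy quality scores covering the range present in the dataset (3 to 9)
quality = pd.Series([3, 4, 5, 6, 7, 8, 9], name='quality')

# Hypothetical 4-class grouping: 3-4, 5, 6 and 7-9
bins = [2, 4, 5, 6, 9]  # right-inclusive edges: (2, 4], (4, 5], (5, 6], (6, 9]
labels = ['low', 'medium-low', 'medium-high', 'high']
quality_binned = pd.cut(quality, bins=bins, labels=labels)

print(quality_binned.tolist())
```

Changing `bins` and `labels` is enough to try a different number of classes.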

Note: since there are two datasets, we will combine them into one and add an extra column indicating the wine type.

## Plan of Attack

<ol>
    <li>Data Cleaning: check for
        <ul>
            <li>Null values</li>
            <li>Missing values</li>
            <li>Duplicates</li>
        </ul>
    </li>
    <li>Exploratory Analysis: helps us to understand the dataset better</li>
    <li>Classification model
        <ul>
            <li>7 classes</li>
            <li>4 classes</li>
            <li>2 classes</li>
        </ul>
    </li>
    <li>Final words</li>
</ol>

## Importing Libraries

In [1]:
# data manipulation
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.3f' % x)
import numpy as np

# data visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# preprocessing
from sklearn.preprocessing import OneHotEncoder

# model selection and modelling
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

## Loading Datasets

First, we need to read the csv files. Although csv stands for comma-separated values, the values in these particular files are separated by ';', so we have to pass the 'sep' parameter to the read_csv() function.
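To see why the separator matters, here is a small sketch that parses the same text with and without `sep=';'`. The in-memory string is an abridged stand-in for the real files (only three of the columns, with simplified headers).

```python
import io
import pandas as pd

# Two rows in a semicolon-separated layout (abridged, illustrative sample)
raw = 'fixed acidity;volatile acidity;quality\n7.4;0.7;5\n7.8;0.88;5\n'

# With the default separator (','), every line ends up in a single column
wrong = pd.read_csv(io.StringIO(raw))
# With sep=';' the three columns are split correctly
right = pd.read_csv(io.StringIO(raw), sep=';')

print(wrong.shape)
print(right.shape)
```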

In [2]:
red_wine = pd.read_csv('datasets/winequality-red.csv', sep=';')
white_wine = pd.read_csv('datasets/winequality-white.csv', sep=';')

Since we want to unite both dataframes into one, we create a custom function that combines them and adds a new column, 'is white', recording which dataframe each wine comes from (red vs white).

In [3]:
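def unite_wines(red_wine, white_wine):
    """Stack both dataframes and flag the wine type (0 = red, 1 = white)."""
    # Note: the flag column is added to the input dataframes in place
    red_wine['is white'] = 0
    white_wine['is white'] = 1
    # Concatenate row-wise and rebuild a continuous index
    return pd.concat([red_wine, white_wine], sort=True).reset_index(drop=True)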

In [4]:
df_wine = unite_wines(red_wine, white_wine)
df_wine.head()

,alcohol,chlorides,citric acid,density,fixed acidity,free sulfur dioxide,is white,pH,quality,residual sugar,sulphates,total sulfur dioxide,volatile acidity
0,9.4,0.076,0.0,0.998,7.4,11.0,0,3.51,5,1.9,0.56,34.0,0.7
1,9.8,0.098,0.0,0.997,7.8,25.0,0,3.2,5,2.6,0.68,67.0,0.88
2,9.8,0.092,0.04,0.997,7.8,15.0,0,3.26,5,2.3,0.65,54.0,0.76
3,9.8,0.075,0.56,0.998,11.2,17.0,0,3.16,6,1.9,0.58,60.0,0.28
4,9.4,0.076,0.0,0.998,7.4,11.0,0,3.51,5,1.9,0.56,34.0,0.7


## Data Cleaning

In [5]:
df_wine['quality'].unique()

array([5, 6, 7, 4, 8, 3, 9], dtype=int64)

In [6]:
df_wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   alcohol               6497 non-null   float64
 1   chlorides             6497 non-null   float64
 2   citric acid           6497 non-null   float64
 3   density               6497 non-null   float64
 4   fixed acidity         6497 non-null   float64
 5   free sulfur dioxide   6497 non-null   float64
 6   is white              6497 non-null   int64  
 7   pH                    6497 non-null   float64
 8   quality               6497 non-null   int64  
 9   residual sugar        6497 non-null   float64
 10  sulphates             6497 non-null   float64
 11  total sulfur dioxide  6497 non-null   float64
 12  volatile acidity      6497 non-null   float64
dtypes: float64(11), int64(2)
memory usage: 660.0 KB


We can see that the dataframe has 6497 entries and that every column has 6497 non-null values, so there are no null or missing values.
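The same check can be made explicit by counting missing values per column. The sketch below uses a toy frame (a hypothetical stand-in for `df_wine`) with one deliberately missing value, so the counts are not all zero.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value, standing in for df_wine
toy = pd.DataFrame({'alcohol': [9.4, np.nan, 9.8], 'pH': [3.51, 3.20, 3.26]})

# Number of missing entries in each column
missing_per_column = toy.isnull().sum()
# True if there is at least one missing entry anywhere
any_missing = bool(toy.isnull().values.any())

print(missing_per_column)
print(any_missing)
```

On the wine dataframe, every count should be zero, matching the `info()` output.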

In [7]:
df_wine = df_wine.drop_duplicates()
print(len(df_wine))

5320


We started with 6497 entries; after dropping duplicates, 5320 remain, a decrease of about 18%. That's a lot of duplicates!
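As a quick sanity check, the number of duplicates can be counted before dropping them, and the relative reduction follows from the two row counts. The toy frame below is a hypothetical stand-in; only the figures 6497 and 5320 come from this notebook.

```python
import pandas as pd

# Toy frame in which the third row repeats the first one
toy = pd.DataFrame({'alcohol': [9.4, 9.8, 9.4], 'pH': [3.51, 3.20, 3.51]})
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row

# Relative reduction for the wine data: (before - after) / before
before, after = 6497, 5320
reduction = (before - after) / before

print(n_duplicates)
print(f'{reduction:.1%}')
```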

Dropping duplicates removed rows together with their index labels, so the index now has gaps. We reset it; otherwise looping through the dataframe by row number would become problematic.

In [8]:
df_wine = df_wine.reset_index(drop=True)
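The effect of the reset is easiest to see on a tiny example. This is a sketch on toy data, not on the wine dataframe:

```python
import pandas as pd

toy = pd.DataFrame({'alcohol': [9.4, 9.4, 9.8]})

# Dropping the duplicate row leaves a gap in the index labels
deduped = toy.drop_duplicates()
labels_before = deduped.index.tolist()

# reset_index(drop=True) renumbers the rows and discards the old labels
deduped = deduped.reset_index(drop=True)
labels_after = deduped.index.tolist()

print(labels_before)
print(labels_after)
```

With the gap in place, a lookup such as `deduped.loc[1]` would raise a `KeyError`, which is the kind of problem the reset avoids.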


## Exploratory Analysis

First, we can use the describe() method to get a summary of the data.

In [9]:
df_wine.describe()

,alcohol,chlorides,citric acid,density,fixed acidity,free sulfur dioxide,is white,pH,quality,residual sugar,sulphates,total sulfur dioxide,volatile acidity
count,5320.0,5320.0,5320.0,5320.0,5320.0,5320.0,5320.0,5320.0,5320.0,5320.0,5320.0,5320.0,5320.0
mean,10.549,0.057,0.318,0.995,7.215,30.037,0.745,3.225,5.796,5.048,0.533,114.109,0.344
std,1.186,0.037,0.147,0.003,1.32,17.805,0.436,0.16,0.88,4.5,0.15,56.774,0.168
min,8.0,0.009,0.0,0.987,3.8,1.0,0.0,2.72,3.0,0.6,0.22,6.0,0.08
25%,9.5,0.038,0.24,0.992,6.4,16.0,0.0,3.11,5.0,1.8,0.43,74.0,0.23
50%,10.4,0.047,0.31,0.995,7.0,28.0,1.0,3.21,6.0,2.7,0.51,116.0,0.3
75%,11.4,0.066,0.4,0.997,7.7,41.0,1.0,3.33,6.0,7.5,0.6,153.25,0.41
max,14.9,0.611,1.66,1.039,15.9,289.0,1.0,4.01,9.0,65.8,2.0,440.0,1.58


Let's see how the data changes with respect to the quality of the wine.

In [10]:
df_wine.groupby('quality').describe()

quality,alcohol count,alcohol mean,alcohol std,alcohol min,alcohol 25%,alcohol 50%,alcohol 75%,alcohol max,chlorides count,chlorides mean,...,total sulfur dioxide 75%,total sulfur dioxide max,volatile acidity count,volatile acidity mean,volatile acidity std,volatile acidity min,volatile acidity 25%,volatile acidity 50%,volatile acidity 75%,volatile acidity max
3,30.0,10.215,1.106,8.0,9.625,10.15,11.0,12.6,30.0,0.077,...,193.25,440.0,30.0,0.517,0.342,0.17,0.253,0.415,0.633,1.58
4,206.0,10.215,0.991,8.4,9.4,10.1,10.9,13.5,206.0,0.061,...,150.25,272.0,206.0,0.462,0.232,0.11,0.28,0.385,0.61,1.13
5,1752.0,9.872,0.828,8.0,9.3,9.6,10.3,14.9,1752.0,0.066,...,167.0,344.0,1752.0,0.394,0.183,0.1,0.26,0.34,0.5,1.33
6,2323.0,10.649,1.112,8.4,9.8,10.5,11.4,14.0,2323.0,0.054,...,154.0,294.0,2323.0,0.316,0.149,0.08,0.22,0.27,0.37,1.04
7,856.0,11.511,1.116,8.6,10.8,11.5,12.325,14.2,856.0,0.045,...,134.0,289.0,856.0,0.292,0.117,0.08,0.21,0.28,0.35,0.915
8,148.0,11.912,1.078,8.5,11.2,12.2,12.7,14.0,148.0,0.04,...,135.5,212.5,148.0,0.303,0.118,0.12,0.22,0.28,0.36,0.85
9,5.0,12.18,1.013,10.4,12.4,12.5,12.7,12.9,5.0,0.027,...,124.0,139.0,5.0,0.298,0.058,0.24,0.26,0.27,0.36,0.36


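We can see how imbalanced the dataset is. There are far more observations of average wine than of very poor or high-quality wine: scores 5 and 6 account for 4075 of the 5320 observations (about 77%), while score 3 has only 30 and score 9 only 5. This is one of the things we will have to address later on.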
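To put numbers on the imbalance, we can turn the per-class counts into shares. The counts below are copied from the `count` column of the grouped summary above; on the real dataframe, `df_wine['quality'].value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Observation counts per quality score, copied from the grouped summary
counts = pd.Series({3: 30, 4: 206, 5: 1752, 6: 2323, 7: 856, 8: 148, 9: 5})

# Share of each class in the whole dataset
shares = counts / counts.sum()

print(shares.round(3))
```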

Now let's create dummy variables for the dependent variable.

In [11]:
dummys = pd.get_dummies(df_wine['quality'])
dummys.head()

,3,4,5,6,7,8,9
0,0,0,1,0,0,0,0
1,0,0,1,0,0,0,0
2,0,0,1,0,0,0,0
3,0,0,0,1,0,0,0
4,0,0,1,0,0,0,0


In [16]:
new_df = pd.concat([df_wine, dummys], axis=1)
new_df.head()

,alcohol,chlorides,citric acid,density,fixed acidity,free sulfur dioxide,is white,pH,quality,residual sugar,sulphates,total sulfur dioxide,volatile acidity,3,4,5,6,7,8,9
0,9.4,0.076,0.0,0.998,7.4,11.0,0,3.51,5,1.9,0.56,34.0,0.7,0,0,1,0,0,0,0
1,9.8,0.098,0.0,0.997,7.8,25.0,0,3.2,5,2.6,0.68,67.0,0.88,0,0,1,0,0,0,0
2,9.8,0.092,0.04,0.997,7.8,15.0,0,3.26,5,2.3,0.65,54.0,0.76,0,0,1,0,0,0,0
3,9.8,0.075,0.56,0.998,11.2,17.0,0,3.16,6,1.9,0.58,60.0,0.28,0,0,0,1,0,0,0
4,9.4,0.075,0.0,0.998,7.4,13.0,0,3.51,5,1.8,0.56,40.0,0.66,0,0,1,0,0,0,0


In [17]:
new_df.drop(['quality'], axis=1, inplace=True)
new_df.head()

,alcohol,chlorides,citric acid,density,fixed acidity,free sulfur dioxide,is white,pH,residual sugar,sulphates,total sulfur dioxide,volatile acidity,3,4,5,6,7,8,9
0,9.4,0.076,0.0,0.998,7.4,11.0,0,3.51,1.9,0.56,34.0,0.7,0,0,1,0,0,0,0
1,9.8,0.098,0.0,0.997,7.8,25.0,0,3.2,2.6,0.68,67.0,0.88,0,0,1,0,0,0,0
2,9.8,0.092,0.04,0.997,7.8,15.0,0,3.26,2.3,0.65,54.0,0.76,0,0,1,0,0,0,0
3,9.8,0.075,0.56,0.998,11.2,17.0,0,3.16,1.9,0.58,60.0,0.28,0,0,0,1,0,0,0
4,9.4,0.075,0.0,0.998,7.4,13.0,0,3.51,1.8,0.56,40.0,0.66,0,0,1,0,0,0,0


In [18]:
from sklearn.preprocessing import MinMaxScaler

# Rescale every column to the [0, 1] range
x = new_df.values  # NumPy array of all columns
min_max_scaler = MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_normalized = pd.DataFrame(data=x_scaled, columns=new_df.columns)

In [19]:
df_normalized.head()

,alcohol,chlorides,citric acid,density,fixed acidity,free sulfur dioxide,is white,pH,residual sugar,sulphates,total sulfur dioxide,volatile acidity,3,4,5,6,7,8,9
0,0.203,0.111,0.0,0.206,0.298,0.035,0.0,0.612,0.02,0.191,0.065,0.413,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.261,0.148,0.0,0.187,0.331,0.083,0.0,0.372,0.031,0.258,0.141,0.533,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.261,0.138,0.024,0.191,0.331,0.049,0.0,0.419,0.026,0.242,0.111,0.453,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.261,0.11,0.337,0.21,0.612,0.056,0.0,0.341,0.02,0.202,0.124,0.133,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.203,0.11,0.0,0.206,0.298,0.042,0.0,0.612,0.018,0.191,0.078,0.387,0.0,0.0,1.0,0.0,0.0,0.0,0.0
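With its default settings, `MinMaxScaler` rescales each column as (x - min) / (max - min). As a sketch, we can reproduce the normalised alcohol values above by hand, taking the first five alcohol values and the column minimum and maximum from the `describe()` output:

```python
import numpy as np

# Alcohol values of the first five rows
alcohol = np.array([9.4, 9.8, 9.8, 9.8, 9.4])
# Column minimum and maximum of alcohol, taken from describe()
col_min, col_max = 8.0, 14.9

# Min-max scaling: maps col_min to 0 and col_max to 1
scaled = (alcohol - col_min) / (col_max - col_min)

print(scaled.round(3))
```

The dummy columns already contain only 0 and 1, so scaling leaves them unchanged.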


## Classification Model

In [12]:
# Features are all columns except the target; the target is the quality score
X = df_wine.drop(['quality'], axis=1)
y = df_wine['quality']

In [13]:
from sklearn.model_selection import train_test_split
# Hold out 25% of the rows for testing; no random_state is set, so the split differs between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
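Given how imbalanced the classes are, one option worth considering (it is not what the cell above does) is a stratified, seeded split, which keeps the class proportions the same in both parts and makes the result reproducible. A minimal sketch on toy data, with `X_toy` and `y_toy` as hypothetical stand-ins for `X` and `y`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target standing in for df_wine['quality']
y_toy = pd.Series([5] * 40 + [6] * 40 + [7] * 20)
X_toy = pd.DataFrame({'feature': range(len(y_toy))})

# stratify keeps the class proportions; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

test_counts = y_te.value_counts().sort_index().to_dict()
print(test_counts)
```

Note that stratification needs at least two observations in every class.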

In [22]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

In [23]:
from sklearn.metrics import accuracy_score, confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[  0   1   4   1   0   0   0]
 [  0   5  31  18   0   0   0]
 [  0   5 283 140   3   0   0]
 [  0   1 127 399  35   0   0]
 [  0   0  10 120  69   3   0]
 [  0   0   0  27   8   0   0]
 [  0   0   0   1   0   0   0]]


0.5855925639039504
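The accuracy is simply the sum of the diagonal of the confusion matrix divided by the total number of test observations. The sketch below recomputes it from the matrix printed above and adds the recall per class, assuming the rows and columns correspond to quality scores 3 to 9 in ascending order.

```python
import numpy as np

# Confusion matrix printed above (rows: true quality, columns: predicted quality)
cm = np.array([
    [0, 1,   4,   1,  0, 0, 0],
    [0, 5,  31,  18,  0, 0, 0],
    [0, 5, 283, 140,  3, 0, 0],
    [0, 1, 127, 399, 35, 0, 0],
    [0, 0,  10, 120, 69, 3, 0],
    [0, 0,   0,  27,  8, 0, 0],
    [0, 0,   0,   1,  0, 0, 0],
])

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()
# Recall per class: correct predictions divided by the number of true members
recall = np.diag(cm) / cm.sum(axis=1)

print(accuracy)
print(recall.round(2))
```

The recall values show that only the classes 4 to 7 are ever predicted correctly in this run; the rare classes 3, 8 and 9 are never recognised.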