## Red Wine Quality Prediction with Decision Tree Classifier

This notebook is a practice exercise for getting familiar with the code. I followed the walkthrough at 
https://medium.com/themlblog/wine-quality-prediction-using-machine-learning-59c88a826789 to work through the red wine data set.

In [1]:
# Import the libraries
# pandas will be used to work with file formats like csv, xls, etc.

import pandas as pd

# numpy provides fast array operations and numerical routines

import numpy as np

# sklearn (scikit-learn) provides the classifier and the helper utilities
# train_test_split is used to split our dataset into training and testing data

from sklearn.model_selection import train_test_split

# preprocessing is used to scale the data before fitting the predictor

from sklearn import preprocessing

# tree contains our decision tree classifier

from sklearn import tree

In [2]:
# Read in the csv file

wine_data=pd.read_csv("winequality-red.csv")
wine_data.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [3]:
# separate the features and labels into two different dataframes

y = wine_data.quality
X = wine_data.drop('quality', axis=1)

In [4]:
# split the dataset into test and train data
# test_size=0.3 makes the test data 30% of the original data.
# the remaining 70% is used for training

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3)
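
The split above is random, so each run produces different training and test sets. Below is a minimal sketch of a reproducible split. The `demo` frame is a made-up stand-in for `wine_data`, and the seed value 42 is an arbitrary assumption; neither comes from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for wine_data: 100 rows, two features, a quality label.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, 100),
    "pH": rng.normal(3.3, 0.15, 100),
    "quality": rng.integers(5, 8, 100),  # labels 5, 6, or 7
})

y_demo = demo["quality"]
X_demo = demo.drop("quality", axis=1)

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle
# so the same rows land in each set on every run.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Repeating the call with the same seed selects the same training rows.
X_tr2, _, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(X_tr.shape, X_te.shape)
print(X_tr.index.equals(X_tr2.index))
```

Fixing the seed makes a reported score repeatable, which helps when comparing changes to the model.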

In [5]:
# Print the first five rows of the training features

print(X_train.head())

      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
1543           11.1             0.440         0.42             2.2      0.064   
1191            6.5             0.885         0.00             2.3      0.166   
1598            6.0             0.310         0.47             3.6      0.067   
334             7.9             0.650         0.01             2.5      0.078   
1411            6.4             0.470         0.40             2.4      0.071   

      free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
1543                 14.0                  19.0  0.99758  3.25       0.57   
1191                  6.0                  12.0  0.99551  3.56       0.51   
1598                 18.0                  42.0  0.99549  3.39       0.66   
334                  17.0                  38.0  0.99630  3.34       0.74   
1411                  8.0                  19.0  0.99630  3.56       0.73   

      alcohol  
1543     10.4  
1191     10.8  
15

In [6]:
# After obtaining the data we will be using, the next step
# is standardization. It is part of pre-processing:
# preprocessing.scale() rescales each feature to zero mean and
# unit variance, so values are not confined to the range -1 to 1.

X_train_scaled = preprocessing.scale(X_train)
print(X_train_scaled)

[[ 1.56133054 -0.48211278  0.74992101 ... -0.35679967 -0.50928046
  -0.02592581]
 [-1.04437994  2.06316363 -1.42407281 ...  1.63147725 -0.85341382
   0.34305328]
 [-1.32760934 -1.22567668  1.0087298  ...  0.54113184  0.00691957
   0.52754282]
 ...
 [-0.98773406  0.60463445 -1.42407281 ... -0.6774895  -1.19754718
  -0.85612875]
 [ 0.88157998  1.2338039  -0.44059941 ... -1.06231729 -0.62399158
  -0.94837352]
 [-0.76115054  0.08985944 -1.42407281 ...  0.54113184  1.03931964
   0.89652191]]
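
To make clear what `preprocessing.scale` computes, the sketch below standardizes a small array by hand and compares the result with scikit-learn. The array `X_demo` and its values are illustrative assumptions, not rows taken from the notebook's training set.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical feature matrix: 4 samples, 2 features on different scales.
X_demo = np.array([
    [7.4, 34.0],
    [7.8, 67.0],
    [11.2, 60.0],
    [6.0, 42.0],
])

# Standardize by hand: subtract each column's mean, then divide by the
# column's population standard deviation (NumPy's default, ddof=0).
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The same operation through scikit-learn.
scaled = preprocessing.scale(X_demo)

print(np.allclose(manual, scaled))   # True if both approaches agree
print(scaled.mean(axis=0))           # column means, approximately 0
print(scaled.std(axis=0))            # column standard deviations, approximately 1
```

Because the result is centered and divided by the standard deviation, individual values can fall outside -1 to 1, as the printed output above shows.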


In [7]:
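# Now we train our algorithm so that it can predict the wine quality.
# Note that the model is fit on the unscaled X_train, not X_train_scaled;
# decision trees split on per-feature thresholds, so they do not need scaled inputs.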
clf=tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)

DecisionTreeClassifier()

In [9]:
# Check how accurately the model predicts the wine quality on the test set.
# score() returns the mean accuracy, i.e. the fraction of correct predictions.
confidence = clf.score(X_test, y_test)

print("\nThe confidence score: {:.2f}%" .format(confidence * 100))


The confidence score: 58.96%
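
The value returned by `score()` is the plain accuracy on the test set. The sketch below checks this on synthetic data, assuming a small data set generated with `make_classification`; the names `X_demo`, `y_demo`, and `clf_demo` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for the wine features and labels.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

clf_demo = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pred = clf_demo.predict(X_te)

by_score = clf_demo.score(X_te, y_te)    # what the notebook calls "confidence"
by_metric = accuracy_score(y_te, pred)   # the same quantity via sklearn.metrics
by_hand = np.mean(pred == y_te)          # fraction of matching labels

print(by_score, by_metric, by_hand)
```

All three numbers agree, so "accuracy" is a more precise name for this value than "confidence".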


In [10]:
# This score changes from run to run because train_test_split shuffles
# the data randomly (no random_state is set), and a different split
# produces a different tree. In practice the result tends to vary by
# a few percentage points around the first value.

# Now that the classifier is trained, we obtain the predicted labels
# for the test features using the predict() function.

y_pred = clf.predict(X_test)
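
The run-to-run variation described in the comments above can be measured rather than estimated. Below is a rough sketch that repeats the split and the fit with ten different seeds. It runs on synthetic data from `make_classification` (an assumption, since the CSV file is not available here), so the numbers it prints will not match the wine results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data with 11 features, loosely mirroring the
# shape of the wine features. The values themselves are made up.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=11, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)

scores = []
for seed in range(10):
    # A different seed gives a different train/test split and tree.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed
    )
    clf_demo = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    scores.append(clf_demo.score(X_te, y_te))

scores = np.array(scores)
print("min {:.3f}  max {:.3f}  mean {:.3f}".format(
    scores.min(), scores.max(), scores.mean()
))
```

Reporting the mean and the spread over several splits is a fairer summary than a single score.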

In [11]:
# The predictions are stored in y_pred, but there are far too many entries
# to compare them all with the expected labels we stored in y_test. So we will just
# take the first five entries of both, print them, and compare them.
# Convert the numpy array to a list

x=np.array(y_pred).tolist()

#printing first 5 predictions

print("\nThe prediction:\n")
for i in range(5):
    print(x[i])

#printing first five expectations

print("\nThe expectation:\n")
print (y_test.head())


The prediction:

5
5
7
5
5

The expectation:

35      6
1232    5
1029    7
1548    5
396     5
Name: quality, dtype: int64
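
Instead of comparing entries by eye, a confusion matrix summarizes how predictions and expectations line up. The sketch below uses only the five values printed above; on the full data one would pass `y_test` and `y_pred` instead. The variable names here are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# The five expected and predicted labels shown in the output above.
expected = [6, 5, 7, 5, 5]
predicted = [5, 5, 7, 5, 5]

# Rows are the true labels, columns the predicted labels,
# both in the order given by `labels`.
labels = [5, 6, 7]
cm = confusion_matrix(expected, predicted, labels=labels)
print(cm)

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()
print("accuracy on these five samples:", accuracy)
```

The single off-diagonal entry is the wine of quality 6 that was predicted as 5, and the accuracy on these five samples is 4 out of 5.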


In [13]:
# Notice that almost all of the values in the prediction match the expectations. The predictor was wrong
# once, predicting 5 instead of 6. This gives an accuracy of 80% for these 5 examples. Five examples are
# too few to judge the model, though: over the full test set the accuracy is 58.96%. Whether that is
# good depends on the baseline, such as the accuracy of always predicting the most common quality.
# This is not a model that I will be using in my final project.
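
One way to put an accuracy figure in context is to compare it with a trivial baseline that always predicts the most frequent label. The sketch below does this with scikit-learn's `DummyClassifier` on synthetic data. The data, the thresholds used to build the labels, and all variable names are assumptions for illustration, so the printed numbers say nothing about the wine data itself.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 300 samples, 3 features, imbalanced labels 5/6/7
# that depend on the first feature plus some noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
signal = X_demo[:, 0] + rng.normal(scale=0.5, size=300)
y_demo = 5 + (signal > 0.2).astype(int) + (signal > 1.2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Baseline: always predict the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
model_acc = model.score(X_te, y_te)

print("baseline accuracy: {:.3f}".format(baseline_acc))
print("tree accuracy:     {:.3f}".format(model_acc))
```

On the real data, the same comparison would use `X_train`, `X_test`, `y_train`, and `y_test`; a model is only useful to the extent that it beats this baseline.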