# Lesson 3 Assignment - Wine Classifier

## Author - Yulia Zubova

### Instructions
Your task for this assignment:  Design a simple, low-cost sensor that can distinguish between red wine and white wine.
Your sensor must correctly distinguish between red and white wine for at least 95% of the samples in a set of 6497 test samples of red and white wine.

Your technology is capable of sensing the following wine attributes:
- Fixed acidity
- Free sulphur dioxide
- Volatile acidity
- Total sulphur dioxide
- Citric acid
- Sulphates
- Residual sugar
- pH
- Chlorides
- Alcohol
- Density




## Tasks
1. Read <a href="https://library.startlearninglabs.uw.edu/DATASCI420/Datasets/WineQuality.pdf">WineQuality.pdf</a>.
2. Use the RedWhiteWine.csv or RedWhiteWine.arff that is provided.
Note: If needed, remove the quality attribute, which you will not need for this assignment.
3. Build an experiment using Naive Bayes Classifier.

Answer the following questions:
1. What is the percentage of correct classification results (using all attributes)?
2. What is the percentage of correct classification results (using a subset of the attributes)?
3. What is the AUC of your model?
4. What is the best AUC that you can achieve?
5. What is the minimum number of attributes? Why?


In [1]:
URL = "https://library.startlearninglabs.uw.edu/DATASCI420/Datasets/RedWhiteWine.csv"

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import pandas_profiling
from matplotlib import pyplot
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import preprocessing
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2



In [3]:
# Load the dataset with data about red and white wines
wine_df = pd.read_csv(URL, header=0) 

In [4]:
#information about dataset
wine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
fixed acidity           6497 non-null float64
volatile acidity        6497 non-null float64
citric acid             6497 non-null float64
residual sugar          6497 non-null float64
chlorides               6497 non-null float64
free sulfur dioxide     6497 non-null float64
total sulfur dioxide    6497 non-null float64
density                 6497 non-null float64
pH                      6497 non-null float64
sulphates               6497 non-null float64
alcohol                 6497 non-null float64
quality                 6497 non-null int64
Class                   6497 non-null int64
dtypes: float64(11), int64(2)
memory usage: 659.9 KB


According to this information, the dataset has no missing data.

In [5]:
#dataset statistics
wine_df.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 |
| mean | 7.215307 | 0.339666 | 0.318633 | 5.443235 | 0.056034 | 30.525319 | 115.744574 | 0.994697 | 3.218501 | 0.531268 | 10.491801 | 5.818378 | 0.246114 |
| std | 1.296434 | 0.164636 | 0.145318 | 4.757804 | 0.035034 | 17.7494 | 56.521855 | 0.002999 | 0.160787 | 0.148806 | 1.192712 | 0.873255 | 0.430779 |
| min | 3.8 | 0.08 | 0.0 | 0.6 | 0.009 | 1.0 | 6.0 | 0.98711 | 2.72 | 0.22 | 8.0 | 3.0 | 0.0 |
| 25% | 6.4 | 0.23 | 0.25 | 1.8 | 0.038 | 17.0 | 77.0 | 0.99234 | 3.11 | 0.43 | 9.5 | 5.0 | 0.0 |
| 50% | 7.0 | 0.29 | 0.31 | 3.0 | 0.047 | 29.0 | 118.0 | 0.99489 | 3.21 | 0.51 | 10.3 | 6.0 | 0.0 |
| 75% | 7.7 | 0.4 | 0.39 | 8.1 | 0.065 | 41.0 | 156.0 | 0.99699 | 3.32 | 0.6 | 11.3 | 6.0 | 0.0 |
| max | 15.9 | 1.58 | 1.66 | 65.8 | 0.611 | 289.0 | 440.0 | 1.03898 | 4.01 | 2.0 | 14.9 | 9.0 | 1.0 |


In [6]:
# prints the report to the screen
pandas_profiling.ProfileReport(wine_df)

Profile report output, condensed from the rendered report.

| Dataset info | Value |
|---|---|
| Number of variables | 13 |
| Number of observations | 6497 |
| Total Missing (%) | 0.0% |
| Total size in memory | 659.9 KiB |
| Average record size in memory | 104.0 B |

| Variable type | Count |
|---|---|
| Numeric | 12 |
| Categorical | 0 |
| Boolean | 1 |
| Date | 0 |
| Text (Unique) | 0 |
| Rejected | 0 |
| Unsupported | 0 |

Distribution of the boolean variable `Class` (2 distinct values, mean 0.24611, no missing values):

| Value | Count | Frequency (%) |
|---|---|---|
| 0 | 4898 | 75.4% |
| 1 | 1599 | 24.6% |

Per-variable statistics for the numeric variables (no variable has missing or infinite values):

| Variable | Distinct count | Zeros (%) | 5th percentile | 95th percentile | Skewness | Kurtosis |
|---|---|---|---|---|---|---|
| alcohol | 111 | 0.0% | 9.0 | 12.7 | 0.56572 | -0.53169 |
| chlorides | 214 | 0.0% | 0.028 | 0.102 | 5.3998 | 50.898 |
| citric acid | 89 | 2.3% | 0.05 | 0.56 | 0.47173 | 2.3972 |
| density | 998 | 0.0% | 0.9899 | 0.99939 | 0.5036 | 6.6061 |
| fixed acidity | 106 | 0.0% | 5.7 | 9.8 | 1.7233 | 5.0612 |
| free sulfur dioxide | 135 | 0.0% | 6 | 61 | 1.2201 | 7.9062 |
| pH | 108 | 0.0% | 2.97 | 3.5 | 0.38684 | 0.36766 |
| quality | 7 | 0.0% | 5 | 7 | 0.18962 | 0.23232 |
| residual sugar | 316 | 0.0% | 1.2 | 15.0 | 1.4354 | 4.3593 |
| sulphates | 111 | 0.0% | 0.35 | 0.79 | 1.7973 | 8.6537 |
| total sulfur dioxide | 276 | 0.0% | 19 | 206 | -0.0011775 | -0.37166 |
| volatile acidity | 187 | 0.0% | 0.16 | 0.67 | 1.4951 | 2.8254 |

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 1 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 1 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 1 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 1 |


All variables are numerical, and the wine dataset has no missing data.

In [7]:
# we don't need 'quality' column, let's drop it
wine_df = wine_df.drop(["quality"], axis = 1)

### All features ###

In [8]:
X = wine_df.iloc[:, 0:11]   # all 11 feature columns as a DataFrame
Y = wine_df.iloc[:, 11]     # target column (Class) as a Series

In [9]:
# Display options that make the data easier to view
pd.set_option('display.width', 100)
pd.set_option('display.precision', 2)

In [10]:
#boxplots for each variable
wine_df.plot(kind='box', subplots=True, layout=(5,3), sharex=False, sharey=False, figsize=(12,15))
pyplot.show()

In [11]:
kfold = KFold(n_splits=15)  # 15-fold cross-validation without shuffling;
                            # random_state only has an effect when shuffle=True
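To make the 15-fold setup concrete, here is a minimal sketch of how an unshuffled k-fold split carves 6497 rows into consecutive test folds. The helper names `kfold_sizes` and `kfold_test_ranges` are hypothetical and only mimic the consecutive splitting that `KFold` performs when it does not shuffle; they are not part of scikit-learn.

```python
def kfold_sizes(n_samples, n_splits):
    """Fold sizes for consecutive, unshuffled k-fold splitting.

    The first n_samples % n_splits folds receive one extra sample.
    """
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


def kfold_test_ranges(n_samples, n_splits):
    """(start, stop) row ranges of each test fold, in row order."""
    ranges, start = [], 0
    for size in kfold_sizes(n_samples, n_splits):
        ranges.append((start, start + size))
        start += size
    return ranges


sizes = kfold_sizes(6497, 15)
ranges = kfold_test_ranges(6497, 15)
print(sizes)        # sizes of the 15 test folds
print(ranges[:3])   # row ranges of the first three test folds
```

Because the folds are consecutive row blocks, a file whose rows are grouped by wine type (an assumption, not verified here) would produce test folds that contain a single class. In that case `KFold(n_splits=15, shuffle=True, random_state=7)` or `StratifiedKFold` may give a more representative estimate.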

In [12]:
#Getting cross validation score for Naive Bayes Classifier using all features
nbc_results = cross_val_score(GaussianNB(), X, Y, cv=kfold)
print("Accuracy: %.3f%% (std:%.3f)" % (nbc_results.mean()*100, nbc_results.std()))

Accuracy: 96.876% (std:0.011)


#### The same actions but without cross validation ####

In [13]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2) #split data to train and test

In [14]:
clf = GaussianNB() # with default parameters
#Train classification model
nbc_model = clf.fit(x_train, y_train)
#Apply model and get predictions
y_pred = nbc_model.predict(x_test)
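For readers who want to see what a Gaussian Naive Bayes classifier computes, the following is a simplified from-scratch sketch. It is not scikit-learn's implementation: the function names are hypothetical, the fixed `1e-9` variance floor is an assumption made for simplicity, and the toy rows (total sulfur dioxide, volatile acidity) are illustrative values rather than results from this analysis.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # Population variance plus a small constant to avoid division by zero.
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one sample."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            # Log of the normal density, summed over features
            # (the "naive" conditional independence assumption).
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data for illustration: [total sulfur dioxide, volatile acidity]
rows = [[34.0, 0.70], [67.0, 0.88], [54.0, 0.76], [60.0, 0.28],
        [170.0, 0.27], [132.0, 0.30], [97.0, 0.28], [186.0, 0.23]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = fit_gaussian_nb(rows, labels)
pred_a = predict_gaussian_nb(model, [40.0, 0.65])
pred_b = predict_gaussian_nb(model, [150.0, 0.25])
print(pred_a, pred_b)
```

Each class is summarised by one mean and one variance per feature, which is why the model trains quickly and why strongly correlated features can be counted twice.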

In [15]:
misclassified_points = (y_test != y_pred).sum()
print("Number of mislabeled points out of a total %d points : %d"\
      % (y_test.shape[0], misclassified_points))
print("Accuracy = %.2f"%(round((y_test.shape[0] - float(misclassified_points))/y_test.shape[0]*100,2)))

Number of mislabeled points out of a total 1300 points : 33
Accuracy = 97.46


In [16]:
#Confusion matrix

print(confusion_matrix(y_test, y_pred))

CM_log = confusion_matrix(y_test, y_pred)

[[964  25]
 [  8 303]]


In [17]:
#Getting performance metrics
report = classification_report(y_test, y_pred)
print(report)

             precision    recall  f1-score   support

          0       0.99      0.97      0.98       989
          1       0.92      0.97      0.95       311

avg / total       0.98      0.97      0.97      1300
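The figures in this report can be recovered by hand from the confusion matrix printed earlier. The sketch below assumes the layout used by `confusion_matrix` (rows are true classes, columns are predicted classes); `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(cm):
    """Accuracy plus per-class precision, recall and F1 from a 2x2 confusion matrix.

    cm[i][j] is the number of samples with true class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    accuracy = (cm[0][0] + cm[1][1]) / total
    per_class = {}
    for c in (0, 1):
        tp = cm[c][c]
        predicted = cm[0][c] + cm[1][c]   # column sum: everything predicted as c
        actual = sum(cm[c])               # row sum: everything truly of class c
        precision = tp / predicted
        recall = tp / actual
        f1 = 2 * precision * recall / (precision + recall)
        per_class[c] = (precision, recall, f1)
    return accuracy, per_class


accuracy, per_class = binary_metrics([[964, 25], [8, 303]])
print(f"accuracy = {accuracy * 100:.2f}%")
for c, (p, r, f1) in per_class.items():
    print(f"class {c}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The lower precision for class 1 comes from the 25 class-0 samples predicted as class 1, which weigh more heavily against the smaller class.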



In [25]:
# AUC score for this model (computed from hard class predictions,
# so it equals the mean of the two per-class recalls)
print(roc_auc_score(y_test, y_pred))

0.974499234343
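Because `y_pred` holds hard 0/1 labels rather than probabilities, this AUC reduces to the balanced accuracy, that is, the mean of the two per-class recalls. The sketch below checks this by rebuilding the labels from the confusion matrix `[[964, 25], [8, 303]]`; `pairwise_auc` is a hypothetical helper that uses the pairwise-ranking definition of ROC AUC.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the probability that a random positive is scored above
    a random negative, counting ties as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Rebuild true labels and hard predictions from the confusion matrix counts.
y_true = [0] * 964 + [0] * 25 + [1] * 8 + [1] * 303
y_hard = [0] * 964 + [1] * 25 + [0] * 8 + [1] * 303

auc_hard = pairwise_auc(y_true, y_hard)
balanced_accuracy = 0.5 * (964 / 989 + 303 / 311)
print(round(auc_hard, 6), round(balanced_accuracy, 6))
```

For a threshold-independent AUC, one option is to pass the predicted probability of class 1, for example `nbc_model.predict_proba(x_test)[:, 1]`, to `roc_auc_score` instead of `y_pred`. That value has not been computed here.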


The Naive Bayes classifier performs well when all features are used: accuracy, recall and AUC are about 0.97, and precision is 0.99 for class 0 and 0.92 for class 1.

### Perform some feature selection ###

In the correlation matrix (above), several variables are highly correlated with each other. I'm going to drop "free sulfur dioxide" and "density" from the dataset and check the performance.

In [26]:
#delete columns from X matrix (features) 
X2 = X.drop(["free sulfur dioxide", "density"], axis = 1)


In [27]:
#Getting cross validation score for Naive Bayes Classifier
nbc_results2 = cross_val_score(GaussianNB(), X2, Y, cv=kfold)
print("Accuracy: %.3f%% (std:%.3f)" % (nbc_results2.mean()*100, nbc_results2.std()))

Accuracy: 96.198% (std:0.013)


In [28]:
#Split X and Y matrices to train (80%) and test (20%) sets
x_train2, x_test2, y_train2, y_test2 = train_test_split(X2, Y, test_size = 0.2)

In [29]:
# Classifier- Naive Bayes
clf = GaussianNB() # with default parameters
#Train classification model
nbc_model2 = clf.fit(x_train2, y_train2)
#Apply model and get predictions
y_pred2 = nbc_model2.predict(x_test2)

In [30]:
misclassified_points = (y_test2 != y_pred2).sum()
print("Number of mislabeled points out of a total %d points : %d"\
      % (y_test2.shape[0], misclassified_points))
print("Accuracy = %.2f"%(round((y_test2.shape[0] - float(misclassified_points))/y_test2.shape[0]*100,2)))

Number of mislabeled points out of a total 1300 points : 41
Accuracy = 96.85


In [31]:
#Confusion matrix
print(confusion_matrix(y_test2, y_pred2))

CM_log = confusion_matrix(y_test2, y_pred2)

[[928  32]
 [  9 331]]


In [32]:
#Getting performance metrics
report = classification_report(y_test2, y_pred2)
print(report)

             precision    recall  f1-score   support

          0       0.99      0.97      0.98       960
          1       0.91      0.97      0.94       340

avg / total       0.97      0.97      0.97      1300



In [33]:
print(roc_auc_score(y_test2, y_pred2))

0.970098039216


The model still performs well after dropping the two variables, although slightly worse (by less than 1 percentage point) than the model that uses all features.

### K-best feature selection ###

In [60]:
# feature extraction K-best selection method
test = SelectKBest(score_func=chi2, k=5)
fit = test.fit(X, Y)
# summarize scores
np.set_printoptions(precision=3)
print(fit.scores_)  #get scores (importance) for each feature
features = fit.transform(X)


[  3.585e+02   2.211e+02   1.512e+01   3.287e+03   3.740e+01   1.491e+04
   8.795e+04   8.961e-03   5.652e+00   6.427e+01   9.574e-01]


According to the results, the features in order of importance (most important first) are:
- total sulfur dioxide
- free sulfur dioxide
- residual sugar
- fixed acidity
- volatile acidity
- sulphates
- chlorides
- citric acid
- pH
- alcohol
- density.
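The ranking above can be reproduced by pairing the printed chi-squared scores with the column names of `X` and sorting. This is a small sketch that hard-codes the scores as they were printed, so it runs without the dataset.

```python
feature_names = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]
# Chi-squared scores as printed by fit.scores_, in column order.
chi2_scores = [3.585e+02, 2.211e+02, 1.512e+01, 3.287e+03, 3.740e+01, 1.491e+04,
               8.795e+04, 8.961e-03, 5.652e+00, 6.427e+01, 9.574e-01]

# Sort features from the highest score to the lowest.
ranked = sorted(zip(feature_names, chi2_scores), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name:22s} {score:12.3f}")

top7 = [name for name, _ in ranked[:7]]
print(top7)
```

One caveat: the chi-squared statistic grows with the magnitude of a feature, so features measured on large scales, such as total sulfur dioxide, tend to rank high while density, which varies only in the third decimal place, ranks last. The ranking is best read as a starting point for experiments rather than a definitive measure of importance.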

After several experiments, I found that reaching the 95% target accuracy requires at least 7 variables ('residual sugar',  'free sulfur dioxide', 'total sulfur dioxide', 'fixed acidity', 'volatile acidity', 'sulphates', 'chlorides').

In [76]:
X.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides',
       'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol'],
      dtype='object')

In [61]:
X3 = X[['residual sugar',  'free sulfur dioxide', 'total sulfur dioxide', 'fixed acidity', 'volatile acidity', 'sulphates', 
        'chlorides']]

In [59]:
#Getting cross validation score for Naive Bayes Classifier
nbc_results3 = cross_val_score(GaussianNB(), X3, Y, cv=kfold)
print("Accuracy: %.3f%% (std:%.3f)" % (nbc_results3.mean()*100, nbc_results3.std()))

Accuracy: 95.690% (std:0.015)


In [64]:
#Split X and Y matrices to train (80%) and test (20%) sets
x_train3, x_test3, y_train3, y_test3 = train_test_split(X3, Y, test_size = 0.2)

In [65]:
# Classifier- Naive Bayes
clf = GaussianNB() # with default parameters
#Train classification model
nbc_model3 = clf.fit(x_train3, y_train3)
#Apply model and get predictions
y_pred3 = nbc_model3.predict(x_test3)

In [66]:
misclassified_points = (y_test3 != y_pred3).sum()
print("Number of mislabeled points out of a total %d points : %d"\
      % (y_test3.shape[0], misclassified_points))
print("Accuracy = %.2f"%(round((y_test3.shape[0] - float(misclassified_points))/y_test3.shape[0]*100,2)))

Number of mislabeled points out of a total 1300 points : 60
Accuracy = 95.38


In [67]:
#Confusion matrix
print(confusion_matrix(y_test3, y_pred3))

CM_log = confusion_matrix(y_test3, y_pred3)

[[935  43]
 [ 17 305]]


In [70]:
#Getting performance metrics
report = classification_report(y_test3, y_pred3)
print(report)

             precision    recall  f1-score   support

          0       0.98      0.96      0.97       978
          1       0.88      0.95      0.91       322

avg / total       0.96      0.95      0.95      1300



In [71]:
print(roc_auc_score(y_test3, y_pred3))

0.95161884439


## SUMMARY ##

1. Accuracy (percentage of correct classification results):
    - using all features Accuracy = 96.876%
    - after some "intuitive" feature selection  Accuracy = 96.198%
    - after feature selection based on k-best selection Accuracy = 95.690%


2. AUC is the area under the ROC curve. ROC AUC varies between 0 and 1, with an uninformative classifier yielding 0.5,
so the best possible value is ROC AUC = 1.
    - using all features ROC AUC = 0.974
    - after some "intuitive" feature selection  ROC AUC = 0.970
    - after feature selection based on k-best selection ROC AUC = 0.952.
    
3. The minimum number of features needed to get at least 95% correct predictions is 7 ('residual sugar',  'free sulfur dioxide', 'total sulfur dioxide', 'fixed acidity', 'volatile acidity', 'sulphates', 'chlorides').
This subset was obtained using the k-best selection method followed by further experiments.

In general, the Naive Bayes classifier performed well with the features chosen by the k-best method, and it showed the best results when all features were used.