# Module 6: Exercise B - Machine Learning with `Python`

There is still a lot to learn about machine learning, and we have barely scratched the surface. There are many ways to refine our model that we didn't touch on in this module (don't worry, these will be covered throughout your curriculum), such as data transformation, automated feature selection, and unsupervised learning.

## For these exercises, complete only **ONE** of the exercise notebooks: either Machine Learning with `Python` or Machine Learning with `R`.

We will ask you to predict wine quality using both a Decision Tree and Naïve Bayes. The exercises serve as extended practice: you are free to refine the models however you see fit, as long as you use both classifiers.

The questions will guide you a bit, but if you want to experiment or you find, through data exploration, a model that is better, feel free to do so. If you go this route, leave comments in the code justifying why you did what you did.

### Read in Packages

In [2]:
import pandas as pd
import numpy as np 
from sklearn import tree
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_wine

### Read in the Data

Today we will be using the Red Wine Quality data. The target variable is numeric, so we are going to discretize it a bit before we get to the activities.

In [3]:
with open('/dsa/data/all_datasets/wine-quality/winequality-red.csv') as file:
    df = pd.read_csv(file, delimiter=";")
    # If wine quality is less than 6, assign the value "bad".
    # If it is greater than 6, assign "good".
    # 6 is by far the most common value in this set, so
    # we assign it its own value. We call this "okay",
    # as it sits in the middle of the distribution.
    
    vals_to_replace = {3:'bad', 4:'bad', 5:'bad', 6:'okay', 7:'good', 8:'good',9:'good'}
    df['quality'] = df['quality'].map(vals_to_replace)
df

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,bad
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,bad
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,bad
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,okay
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,bad
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,bad
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,okay
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,okay
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,bad
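
As a small illustration of the discretization step, the sketch below applies the same mapping to a toy column and counts the resulting classes. The `toy` frame and its values are made up for illustration; only the mapping dictionary comes from the cell above.

```python
import pandas as pd

# Toy stand-in for the numeric quality column (illustrative values only).
toy = pd.DataFrame({"quality": [3, 5, 6, 7, 8, 6]})

# Same mapping as above: below 6 -> bad, exactly 6 -> okay, above 6 -> good.
vals_to_replace = {3: "bad", 4: "bad", 5: "bad", 6: "okay",
                   7: "good", 8: "good", 9: "good"}
toy["label"] = toy["quality"].map(vals_to_replace)

# Class counts show how balanced the discretized target is.
label_counts = toy["label"].value_counts()
print(label_counts)
```

Running `value_counts()` on the real `df['quality']` column is a quick way to check how dominant the "okay" class is before modeling.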


**Exercise 1**: Create a training data set and testing data set from the `df` data frame. Make sure that the rows are randomly selected. The training set should be constructed from 60% of the data; call it `train`. The testing set should be called `test` and should be constructed from the **other** 40% of the data. Be sure to set the `random_state` to `1`.

In [4]:
# Code for exercise 1 goes here
# *****************************
train = df.sample(n=959,random_state = 1) # 1599 * 0.60 ≈ 959 rows, i.e. 60% of the data
test = df.drop(train.index) # the remaining 640 rows (40%) form the test set
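
A possible alternative to hard-coding the row count is to pass a fraction to `sample`. The sketch below shows this on a made-up `toy` frame (not the wine data); with `frac=0.6` on the 1599-row wine frame, pandas should select 959 rows, the same count as above.

```python
import pandas as pd

# Toy frame with 10 rows (illustrative only).
toy = pd.DataFrame({"x": range(10), "quality": ["bad", "okay"] * 5})

# Draw 60% of the rows at random, then keep the rest for testing.
train_toy = toy.sample(frac=0.6, random_state=1)
test_toy = toy.drop(train_toy.index)

print(len(train_toy), len(test_toy))
```

Dropping the sampled index guarantees that no row appears in both sets.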

**Exercise 2**: Create `numpy` arrays for both the input variables and the target variables. The target should be the `quality` variable. Use all of the values for the input, other than the target variable. Do this for both the training and testing set. Call the inputs for the training set `train_X` and the target `train_y`, and the inputs for the testing set `test_X` and `test_y` for the target.

In [5]:
# Code for exercise 2 goes here
# *****************************
train_X = np.asarray(train.iloc[:,0:-1])
train_y = np.asarray(train['quality'])
test_X = np.asarray(test.iloc[:,0:-1])
test_y = np.asarray(test['quality'])
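
The positional slice `iloc[:,0:-1]` works because `quality` is the last column. A sketch of a name-based variant, which does not depend on column order, is shown below on a made-up `toy` frame.

```python
import pandas as pd

# Toy frame with two inputs and the target (illustrative values only).
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 11.2],
    "pH": [3.51, 3.20, 3.52],
    "quality": ["bad", "bad", "okay"],
})

# Inputs: every column except the target. Target: the quality column.
toy_X = toy.drop(columns="quality").to_numpy()
toy_y = toy["quality"].to_numpy()

print(toy_X.shape, toy_y.shape)
```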

**Exercise 3**: Create a Decision Tree model from the `tree` module. Make sure you name the classifier something (in the other notebooks, we called it `clf`). Then train the classifier using the `fit()` method, and pass the `train_X` and `train_y` as the parameters.

In [6]:
# Code for exercise 3 goes here
# *****************************
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(train_X, train_y )
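
The classifier above is created without a `random_state`, so repeated fits can pick different splits when several are equally good, and the reported accuracy and feature importances may shift between runs. The sketch below, on synthetic data that is not part of the exercise, shows that fixing `random_state` makes two fits agree.

```python
import numpy as np
from sklearn import tree

# Synthetic data: the label depends only on the sign of the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = np.where(X[:, 0] > 0, "good", "bad")

# Two independent fits with the same seed.
clf_a = tree.DecisionTreeClassifier(criterion="entropy", random_state=1).fit(X, y)
clf_b = tree.DecisionTreeClassifier(criterion="entropy", random_state=1).fit(X, y)

# With a fixed seed the learned importances should be identical.
same_importances = np.allclose(clf_a.feature_importances_,
                               clf_b.feature_importances_)
print(same_importances)
```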

**Exercise 4**: What is the misclassification error rate of the tree using the **testing** set?

In [7]:
# Code for exercise 4 goes here
# *****************************
# Reuse the classifier fitted in Exercise 3 instead of fitting it again.
y_pred = clf.predict(test_X)
print("Number of mislabeled points out of a total {} points : {}"
      .format(len(test),(test_y != y_pred).sum()))
print('The accuracy of the Decision Tree is ',"{:.3f}".format(metrics.accuracy_score(test_y,y_pred)))

Number of mislabeled points out of a total 640 points : 254
The accuracy of the Decision Tree is  0.603
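
Exercise 4 asks for the misclassification error rate, which is one minus the accuracy. For the run above that is 254 / 640 ≈ 0.397. A minimal sketch of the calculation on made-up labels (the arrays below are illustrative, not the wine predictions):

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted labels.
y_true = np.array(["bad", "okay", "good", "bad", "okay"])
y_hat = np.array(["bad", "good", "good", "okay", "okay"])

# Error rate as the complement of accuracy.
accuracy = metrics.accuracy_score(y_true, y_hat)
error_rate = 1 - accuracy

# Equivalent direct computation: the fraction of mismatched labels.
manual = (y_true != y_hat).mean()

print("{:.3f}".format(error_rate))
```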


**Exercise 5**: Find the feature importances by reading the `feature_importances_` attribute of the fitted classifier.

In [8]:
# Code for exercise 5 goes here
# *****************************
# Pair each input column name (every column except the target) with its importance.
z = zip(df.columns[:-1],clf.feature_importances_)
list(z)
# alcohol has the highest feature importance

[('fixed acidity', 0.054006838401111125),
 ('volatile acidity', 0.12066552662940283),
 ('citric acid', 0.0884792600834978),
 ('residual sugar', 0.06547084848991837),
 ('chlorides', 0.06310831620214194),
 ('free sulfur dioxide', 0.061582500592246514),
 ('total sulfur dioxide', 0.08642269091512164),
 ('density', 0.04565738989874861),
 ('pH', 0.0709835446873291),
 ('sulphates', 0.1259684635866339),
 ('alcohol', 0.21765462051384812)]
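
To read off the top features without scanning the list by eye, the importances can be ranked. The sketch below uses the values printed above, rounded to four decimals; in a live notebook one would presumably build the series from `clf.feature_importances_` instead.

```python
import pandas as pd

# Importances from the output above, rounded to four decimals.
importances = pd.Series({
    "fixed acidity": 0.0540,
    "volatile acidity": 0.1207,
    "citric acid": 0.0885,
    "residual sugar": 0.0655,
    "chlorides": 0.0631,
    "free sulfur dioxide": 0.0616,
    "total sulfur dioxide": 0.0864,
    "density": 0.0457,
    "pH": 0.0710,
    "sulphates": 0.1260,
    "alcohol": 0.2177,
})

# Sort from most to least important and keep the five largest.
top5 = importances.sort_values(ascending=False).head(5).index.tolist()
print(top5)
```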

**Exercise 6**: Create a Naïve Bayes model. Make sure you name the classifier something (in the other notebooks, we called it `nbc`). Then train the classifier using the `fit()` method, and pass the `train_X` and `train_y` as the parameters.

In [9]:
# Code for exercise 6 goes here
# *****************************
nbc = GaussianNB()
y_pred = nbc.fit(train_X, train_y).predict(test_X)

**Exercise 7**: What is the misclassification error rate of the Naïve Bayes classifier using the **testing** set?

In [49]:
# Code for exercise 7 goes here
# *****************************
# Report errors against the size of the test set, not the full data frame.
print("Number of mislabeled points out of a total {} points : {}"
      .format(len(test),(test_y != y_pred).sum()))
print('The accuracy of the NB is ',"{:.3f}".format(metrics.accuracy_score(test_y,y_pred)))

Number of mislabeled points out of a total 640 points : 244
The accuracy of the NB is  0.619


**Exercise 8**: Create a subset of the original data frame `df` to include the top 5 features (from the feature importances listed above) and the target variable `quality`. Call this new data frame something other than `df`.

Then create training and testing sets on this new data frame using the same method as in Exercise 1. Then create your new training and testing inputs and targets using the method in Exercise 2. Be sure to name these objects. 

In [31]:
# Code for exercise 8 goes here
# *****************************
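# Importances copied by hand from a separate fit of the tree, so the
# values differ slightly from the Exercise 5 output above.
df_new_features = pd.DataFrame([('fixed acidity', 0.06308681033505224),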
 ('volatile acidity', 0.11898425383586138),
 ('citric acid', 0.08015675949416178),
 ('residual sugar', 0.07047265400289598),
 ('chlorides', 0.06942920080669616),
 ('free sulfur dioxide', 0.06559597436967743),
 ('total sulfur dioxide', 0.08856891320532846),
 ('density', 0.04314620771263842),
 ('pH', 0.05465346671906904),
 ('sulphates', 0.13647820383864842),
 ('alcohol', 0.20942755567997062)])
df_new_features = df_new_features.sort_values(by=[1],ascending=False).iloc[0:5,:]
df_new_features

,0,1
10,alcohol,0.209428
9,sulphates,0.136478
1,volatile acidity,0.118984
6,total sulfur dioxide,0.088569
2,citric acid,0.080157


In [38]:
# Keep the top 5 features (columns 1, 2, 6, 9, 10) plus the target `quality` (column 11).
df_new = df.iloc[:,[1,2,6,9,10,11]]
df_new

,volatile acidity,citric acid,total sulfur dioxide,sulphates,alcohol,quality
0,0.700,0.00,34.0,0.56,9.4,bad
1,0.880,0.00,67.0,0.68,9.8,bad
2,0.760,0.04,54.0,0.65,9.8,bad
3,0.280,0.56,60.0,0.58,9.8,okay
4,0.700,0.00,34.0,0.56,9.4,bad
...,...,...,...,...,...,...
1594,0.600,0.08,44.0,0.58,10.5,bad
1595,0.550,0.10,51.0,0.76,11.2,okay
1596,0.510,0.13,40.0,0.75,11.0,okay
1597,0.645,0.12,44.0,0.71,10.2,bad
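
Selecting by column name rather than by position is one way to make the subset easier to read and robust to column reordering. A minimal sketch on a made-up `toy` frame with only a few columns (the feature list here is shortened for illustration):

```python
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "fixed acidity": [7.4, 7.8],
    "sulphates": [0.56, 0.68],
    "alcohol": [9.4, 9.8],
    "quality": ["bad", "bad"],
})

# Names of the features to keep, followed by the target column.
top_features = ["alcohol", "sulphates"]
toy_new = toy[top_features + ["quality"]]

print(toy_new.columns.tolist())
```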


**Exercise 9**: Now create a new Naïve Bayes classifier and train it using the new training data created in Exercise 8.

In [48]:
# Code for exercise 9 goes here
# *****************************
train_new = df_new.sample(n=959,random_state=1)
test_new = df_new.drop(train_new.index)
train_new_X = train_new.iloc[:,0:5]
train_new_y = train_new.iloc[:,-1]
test_new_X = test_new.iloc[:,0:5]
test_new_y = test_new.iloc[:,-1]

# Create a fresh Naive Bayes classifier for the reduced feature set.
nbc_new = GaussianNB()
y_pred_new = nbc_new.fit(train_new_X, train_new_y).predict(test_new_X)
# Report errors against the size of the test set, not the full data frame.
print("Number of mislabeled points out of a total {} points : {}"
      .format(len(test_new),(test_new_y != y_pred_new).sum()))
print('The accuracy of the NB is ',"{:.3f}".format(metrics.accuracy_score(test_new_y,y_pred_new)))

Number of mislabeled points out of a total 640 points : 230
The accuracy of the NB is  0.641


**Exercise 10**: Does using only these select features create a better model according to the testing data misclassification error rate?

In [51]:
# Code for exercise 10 goes here
# *****************************
# Yes, on this split: with the top 5 features the test accuracy rises
# for both Naive Bayes (0.619 -> 0.641) and the Decision Tree (0.603 -> 0.628).
# Refit the Decision Tree once on the reduced feature set for comparison.
y_pred_new = clf.fit(train_new_X, train_new_y).predict(test_new_X)
print("Number of mislabeled points out of a total {} points : {}"
      .format(len(test_new),(test_new_y != y_pred_new).sum()))
print('The accuracy of the Decision Tree is ',"{:.3f}".format(metrics.accuracy_score(test_new_y,y_pred_new)))

Number of mislabeled points out of a total 640 points : 238
The accuracy of the Decision Tree is  0.628
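
To answer Exercise 10 in terms of error rate, the mislabeled counts reported in Exercises 4, 7, 9, and 10 can be divided by the size of the test set, which holds 1599 − 959 = 640 rows in every case. The sketch below only redoes that arithmetic; the dictionary keys are labels chosen here for readability.

```python
# Mislabeled counts as reported in the outputs above.
n_test = 640
mislabeled = {
    "Decision Tree, all features": 254,
    "Naive Bayes, all features": 244,
    "Naive Bayes, top 5 features": 230,
    "Decision Tree, top 5 features": 238,
}

# Misclassification error rate = mislabeled points / test set size.
error_rates = {name: count / n_test for name, count in mislabeled.items()}

# The model with the lowest test error on this particular split.
best = min(error_rates, key=error_rates.get)

for name, rate in error_rates.items():
    print("{}: {:.3f}".format(name, rate))
print("Lowest error:", best)
```

These numbers come from a single random split, so small differences between models may not hold up under a different `random_state`.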


# Save your notebook, then `File > Close and Halt`