# Machine Learning Exercise with `R`

There is still a lot to learn about machine learning, and we have barely scratched the surface. There are many things we could do to refine our model that we didn't touch on in this module (don't worry, these will be covered throughout your curriculum), such as data transformation, automated feature selection, and unsupervised learning.

For these exercises, we ask you to complete only **ONE** of the exercise notebooks, either `Python` or `R`. You will predict wine quality using both a Decision Tree and Naïve Bayes. The exercises serve as extended practice: you are free to refine the models however you see fit, as long as you use both Decision Tree and Naïve Bayes.

The questions will guide you a bit, but if you want to experiment or you find, through data exploration, a model that is better, feel free to do so. If you go this route, leave comments in the code justifying why you did what you did.

### Read in Packages

In [2]:
library(tree)
library(ggplot2)
library(e1071)

### Read in the Data

Today we will be using the Red Wine Quality data. The target variable is numeric, so we are going to discretize it a bit before we get to the activities.

In [3]:
wine <- read.csv('/dsa/data/all_datasets/wine-quality/winequality-red.csv', sep = ";")
head(wine)
nrow(wine)

fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
7.4,0.7,0.0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.8,0.88,0.0,2.6,0.098,25,67,0.9968,3.2,0.68,9.8,5
7.8,0.76,0.04,2.3,0.092,15,54,0.997,3.26,0.65,9.8,5
11.2,0.28,0.56,1.9,0.075,17,60,0.998,3.16,0.58,9.8,6
7.4,0.7,0.0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.4,0.66,0.0,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5


In [4]:
# if wine quality is less than 6, assign the value "bad".
# if 6 or greater, assign "good". create a new target called
# taste
wine$taste <- ifelse(wine$quality < 6, 'bad', 'good') 

# 6 is the most popular value by a lot in this set, so 
# we are going to assign it a unique value. We will call 
# this "normal" as it is in the middle of the distribution.
wine$taste[wine$quality == 6] <- 'normal'

# make this target variable categorical
wine$taste <- as.factor(wine$taste)

# remove the old target, since it is no longer needed
wine <- wine[,-12]
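
The three-way labelling above can be checked on a small vector before it is applied to the full data. The sketch below uses made-up quality scores (the `quality` vector here is hypothetical, not read from the wine file) and counts how many values land in each class.

```r
# Toy quality scores (hypothetical values, not taken from the wine file).
quality <- c(3, 4, 5, 5, 6, 6, 6, 7, 8)

# Same rule as the cell above: below 6 is "bad", exactly 6 is "normal",
# above 6 is "good".
taste <- ifelse(quality < 6, "bad", "good")
taste[quality == 6] <- "normal"
taste <- as.factor(taste)

# Class counts; factor levels are sorted alphabetically by default.
counts <- table(taste)
print(counts)
```

Running `table(wine$taste)` on the real data frame gives the same kind of summary and shows how balanced the three classes are.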

**Exercise 1**: Create a training data set and testing data set from the `wine` data frame. Make sure that the rows are randomly selected. The training set should be constructed from 60% of the data; call it `train`. The testing set should be called `test` and should be constructed from the **other** 40% of the data. Be sure to call `set.seed(123)` first.

In [5]:
# Code for exercise 1 goes here
# *****************************

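set.seed(123)

# 60% of the 1,599 rows is 959.4, rounded up to 960 training rows
train_ind <- sample(seq_len(nrow(wine)), size = ceiling(0.6 * nrow(wine)))

# preview the sampled training rows
wine[train_ind,]

# the rows not drawn for training form the testing set
train <- wine[train_ind,]
test <- wine[-train_ind,]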


,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,taste
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
460,11.6,0.580,0.66,2.2,0.074,10,47,1.00080,3.25,0.57,9.0,bad
1260,6.8,0.640,0.00,2.7,0.123,15,33,0.99538,3.44,0.63,11.3,normal
654,9.4,0.330,0.59,2.8,0.079,9,30,0.99760,3.12,0.54,12.0,normal
1410,6.0,0.510,0.00,2.1,0.064,40,54,0.99500,3.54,0.93,10.7,normal
1501,7.5,0.725,0.04,1.5,0.076,8,15,0.99508,3.26,0.53,9.6,bad
73,7.7,0.690,0.22,1.9,0.084,18,94,0.99610,3.31,0.48,9.5,bad
842,6.6,0.660,0.00,3.0,0.115,21,31,0.99629,3.45,0.63,10.3,bad
1421,7.8,0.530,0.01,1.6,0.077,3,19,0.99500,3.16,0.46,9.8,bad
878,7.7,0.715,0.01,2.1,0.064,31,43,0.99371,3.41,0.57,11.8,normal
727,8.1,0.720,0.09,2.8,0.084,18,49,0.99940,3.43,0.72,11.1,normal
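
A minimal sketch of the same 60/40 split on the built-in `iris` data, with the training size derived from the row count rather than typed in. The names `train_demo` and `test_demo` are placeholders for this illustration only.

```r
set.seed(123)                      # make the random split reproducible

df <- iris                         # stand-in data set that ships with R
n_train <- round(0.6 * nrow(df))   # 60% of the rows (90 of 150 for iris)

# Draw row indices without replacement, then split on them.
train_idx <- sample(seq_len(nrow(df)), size = n_train)
train_demo <- df[train_idx, ]
test_demo <- df[-train_idx, ]

# The two sets are disjoint and together cover every row.
shared_rows <- intersect(rownames(train_demo), rownames(test_demo))
cat("train rows:", nrow(train_demo), " test rows:", nrow(test_demo), "\n")
```

Negative indexing (`df[-train_idx, ]`) is what guarantees the testing set is built from the other 40% of the rows.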


**Exercise 2**: Create a formula for the prediction task. First predict using all of the variables other than the target. In order to avoid typing out all of the variables, you can use the following notation:

```splus
target ~ .
```

The "." tells `R` to use all other variables in the dataset (that are not the target) as inputs.
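
To see what the "." expands to, `terms()` can resolve a formula against a data frame. The sketch below uses a small made-up data frame (`toy`, with hypothetical columns `x1` to `x3`) purely for illustration.

```r
# Hypothetical toy data frame with a target column and three predictors.
toy <- data.frame(
  target = factor(c("a", "b", "a", "b")),
  x1 = c(1.0, 2.0, 3.0, 4.0),
  x2 = c(0.5, 0.1, 0.9, 0.3),
  x3 = c(10, 20, 30, 40)
)

# terms() resolves the "." against the columns of the data frame.
expanded <- terms(target ~ ., data = toy)
predictors <- attr(expanded, "term.labels")
print(predictors)
```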

In [6]:
# Code for exercise 2 goes here
# *****************************


frmla <- taste ~ .



**Exercise 3**: Create a Decision Tree model using the `tree` function. Make sure that you pass the newly created formula as a parameter and specify the training data set. Be sure to name this object something (in the examples, we called it `tr`). Then run a summary on the object. 

In [7]:
# Code for exercise 3 goes here
# *****************************

tr1 <- tree(frmla, data = train)
summary(tr1)




Classification tree:
tree(formula = frmla, data = train)
Variables actually used in tree construction:
[1] "alcohol"              "volatile.acidity"     "sulphates"           
[4] "chlorides"            "total.sulfur.dioxide"
Number of terminal nodes:  10 
Residual mean deviance:  1.467 = 1394 / 950 
Misclassification error rate: 0.326 = 313 / 960 

Pay attention to the output of the summary.
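
The two ratios in the summary can be reproduced by hand. In the `tree` package, the residual mean deviance divides the total deviance by the number of observations minus the number of terminal nodes, and the misclassification error rate divides the number of wrongly classified training rows by the number of training rows. The values below are copied from the summary output above; the variable names are only for this sketch.

```r
# Values read off the summary of the training-set tree.
n_train <- 960          # training rows
n_leaves <- 10          # terminal nodes
total_deviance <- 1394  # deviance as printed (already rounded)
n_wrong <- 313          # misclassified training rows

# Residual mean deviance = deviance / (n - number of terminal nodes).
resid_df <- n_train - n_leaves
mean_deviance <- total_deviance / resid_df

# Misclassification error rate = wrong / n.
error_rate <- n_wrong / n_train

cat("residual df:", resid_df, "\n")
cat("residual mean deviance:", round(mean_deviance, 3), "\n")
cat("misclassification error rate:", round(error_rate, 3), "\n")
```

Note that this error rate is measured on the same rows the tree was fitted to, so it tends to be optimistic compared with the error on unseen data.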

**Exercise 4**: What is the misclassification error rate of the tree using the **testing** set?

In [8]:
# Code for exercise 4 goes here
# *****************************

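# NOTE: tr2 is fitted on the test set itself, so its summary reports the
# resubstitution error of tr2, not the held-out test error of tr1.
tr2 <- tree(frmla, data = test)
summary(tr2)

# 0.3349 = 214/639 (error of tr2 on the rows it was fitted to)

# Held-out test error of tr1 (trained on `train` in Exercise 3):
tr1_pred <- predict(tr1, newdata = test, type = "class")
tr1_misclass <- mean(tr1_pred != test$taste)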



Classification tree:
tree(formula = frmla, data = test)
Variables actually used in tree construction:
[1] "alcohol"              "sulphates"            "total.sulfur.dioxide"
[4] "fixed.acidity"        "volatile.acidity"     "chlorides"           
[7] "citric.acid"          "density"             
Number of terminal nodes:  13 
Residual mean deviance:  1.383 = 866 / 626 
Misclassification error rate: 0.3349 = 214 / 639 
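
A test-set error rate is normally obtained by fitting on the training rows and predicting on the held-out rows. The sketch below shows that pattern on the built-in `iris` data, assuming the `tree` package is installed; the object names (`fit`, `iris_train`, `iris_test`) are placeholders, and no particular error value is implied for the wine data.

```r
library(tree)

set.seed(123)
idx <- sample(seq_len(nrow(iris)), size = 90)   # 60% of the 150 rows
iris_train <- iris[idx, ]
iris_test <- iris[-idx, ]

# Fit on the training rows only.
fit <- tree(Species ~ ., data = iris_train)

# type = "class" returns predicted labels rather than class probabilities.
pred <- predict(fit, newdata = iris_test, type = "class")

# Share of held-out rows whose predicted label differs from the true one.
test_error <- mean(pred != iris_test$Species)

# Confusion matrix: rows are predictions, columns are true classes.
print(table(predicted = pred, actual = iris_test$Species))
cat("test misclassification error rate =", test_error, "\n")
```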

**Exercise 5**: Now create a Naïve Bayes classifier using the formula and training data. Be sure to name this model something (in the other notebooks, we called it `m`).

In [9]:
# Code for exercise 5 goes here
# *****************************

m1 <- naiveBayes(frmla, data = train)
m1




Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      bad      good    normal 
0.4739583 0.1302083 0.3958333 

Conditional probabilities:
        fixed.acidity
Y            [,1]     [,2]
  bad    8.187912 1.530198
  good   8.928000 2.106771
  normal 8.458947 1.834325

        volatile.acidity
Y             [,1]      [,2]
  bad    0.5982857 0.1870850
  good   0.4045200 0.1430999
  normal 0.4914079 0.1687192

        citric.acid
Y             [,1]      [,2]
  bad    0.2388352 0.1851983
  good   0.3744800 0.1905444
  normal 0.2882105 0.1950711

        residual.sugar
Y            [,1]     [,2]
  bad    2.511319 1.296993
  good   2.676400 1.299699
  normal 2.471053 1.327730

        chlorides
Y              [,1]       [,2]
  bad    0.09213846 0.05132500
  good   0.07500800 0.02245443
  normal 0.08388947 0.03630374

        free.sulfur.dioxide
Y            [,1]     [,2]
  bad    16.43297 10.68138
  good   

**Exercise 6**: What is the misclassification error rate of the Naïve Bayes classifier using the **testing** set?

In [10]:
# Code for exercise 6 goes here
# *****************************

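# NOTE: m2 is fitted on the test set itself, so the rate printed below is a
# resubstitution error for m2, not the held-out test error of the Exercise 5 model.
m2 <- naiveBayes(frmla, data = test)

predict_m2 <- predict(m2, test[ , names(test) != "taste"])

misclass <- mean(predict_m2 != test$taste)

cat("misclassification error rate =", misclass)

# Held-out test error of m1 (trained on `train` in Exercise 5):
predict_m1 <- predict(m1, test[ , names(test) != "taste"])
misclass_m1 <- mean(predict_m1 != test$taste)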

misclassification error rate = 0.3740219
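
For comparison, here is a minimal sketch of evaluating a Naïve Bayes model on rows it was not trained on, using the built-in `iris` data and assuming the `e1071` package is installed. It also shows `type = "raw"`, which returns posterior probabilities instead of labels. All object names are placeholders for this illustration.

```r
library(e1071)

set.seed(123)
idx <- sample(seq_len(nrow(iris)), size = 90)   # 60% of the 150 rows
iris_train <- iris[idx, ]
iris_test <- iris[-idx, ]

# Fit on the training rows only.
nb <- naiveBayes(Species ~ ., data = iris_train)

# Default prediction type is the most probable class per row.
nb_pred <- predict(nb, newdata = iris_test)

# type = "raw" gives one posterior probability per class and row.
nb_post <- predict(nb, newdata = iris_test, type = "raw")

nb_error <- mean(nb_pred != iris_test$Species)
cat("test misclassification error rate =", nb_error, "\n")
print(head(round(nb_post, 3)))
```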

Take a look at the summary of the tree created in Exercise 3. It shows us the features that it used for the classification task. 

**Exercise 7**: Create a new formula that predicts `taste` using only the features that the decision tree defined. Be sure to name this formula something different from the old formula.

In [11]:
# Code for exercise 7 goes here
# *****************************

frmla2 <- taste ~ alcohol + volatile.acidity + sulphates + chlorides + total.sulfur.dioxide
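
When the list of variables is long, the formula can also be assembled from a character vector with base R's `reformulate()` instead of being typed out. This is only an alternative sketch; the vector below copies the variable names from the Exercise 3 summary, and `frmla_sketch` is a placeholder name.

```r
# Variables reported under "Variables actually used in tree construction".
used_vars <- c("alcohol", "volatile.acidity", "sulphates",
               "chlorides", "total.sulfur.dioxide")

# Build response ~ term1 + term2 + ... from the character vector.
frmla_sketch <- reformulate(used_vars, response = "taste")
print(frmla_sketch)
```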



**Exercise 8**: Now create a Naïve Bayes classifier using this pruned formula and training data. Be sure to name this model something other than your original Naïve Bayes model.

In [12]:
# Code for exercise 8 goes here
# *****************************


m3 <- naiveBayes(frmla2, data = train)
m3




Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      bad      good    normal 
0.4739583 0.1302083 0.3958333 

Conditional probabilities:
        alcohol
Y             [,1]      [,2]
  bad     9.942088 0.7189176
  good   11.574133 0.9948421
  normal 10.641535 1.0472273

        volatile.acidity
Y             [,1]      [,2]
  bad    0.5982857 0.1870850
  good   0.4045200 0.1430999
  normal 0.4914079 0.1687192

        sulphates
Y             [,1]      [,2]
  bad    0.6209451 0.1746078
  good   0.7485600 0.1372818
  normal 0.6681842 0.1439680

        chlorides
Y              [,1]       [,2]
  bad    0.09213846 0.05132500
  good   0.07500800 0.02245443
  normal 0.08388947 0.03630374

        total.sulfur.dioxide
Y            [,1]     [,2]
  bad    53.22088 36.21132
  good   36.41600 32.42365
  normal 41.64474 25.59287
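
For numeric predictors, the tables printed by `naiveBayes` hold the per-class mean in column `[,1]` and the per-class standard deviation in column `[,2]`; these parameterize the Gaussian density used for each feature. The sketch below checks this on the built-in `iris` data, assuming `e1071` is installed. The names `nb_iris` and `by_hand` are placeholders.

```r
library(e1071)

nb_iris <- naiveBayes(Species ~ ., data = iris)

# Table stored by the model for one numeric predictor.
stored <- nb_iris$tables$Sepal.Length

# The same quantities computed directly from the data.
by_hand <- cbind(
  tapply(iris$Sepal.Length, iris$Species, mean),
  tapply(iris$Sepal.Length, iris$Species, sd)
)

print(stored)
print(by_hand)

# Largest absolute difference between the two tables.
max_diff <- max(abs(as.numeric(stored) - as.numeric(by_hand)))
```

Read this way, the `alcohol` table above says that wines labelled `good` in the training set have a higher mean alcohol content than those labelled `bad`.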


**Exercise 9**: Does using only these selected features create a better model according to the misclassification error rate on the testing data?

In [13]:
# Code for exercise 9 goes here
# *****************************

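# Predict on the held-out test rows so the predictions line up with test$taste.
# (Predicting on `train`, 960 rows, and comparing against `test$taste`, 639 rows,
# makes R recycle the shorter vector, warn that "longer object length is not a
# multiple of shorter object length", and report a meaningless rate of 0.6125.)
predict_m3 <- predict(m3, test[ , names(test) != "taste"])

misclass2 <- mean(predict_m3 != test$taste)
cat("misclassification error rate =", misclass2)

# The reduced formula gives a better model only if misclass2 is lower than the
# Naïve Bayes test error rate from Exercise 6.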
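
When predictions and true labels have different lengths, R recycles the shorter vector and the resulting rate is not meaningful. A small helper that refuses mismatched inputs makes this kind of slip fail loudly. The function `misclass_rate` below is a hypothetical helper written for this sketch, not part of any package, and the label vectors are made up.

```r
# Hypothetical helper: error rate with a guard against length mismatches.
misclass_rate <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual))
  mean(as.character(predicted) != as.character(actual))
}

# Made-up labels for illustration.
actual <- factor(c("bad", "good", "normal", "bad"))
pred_a <- factor(c("bad", "good", "bad", "bad"), levels = levels(actual))

# One of the four predictions is wrong.
rate <- misclass_rate(pred_a, actual)
cat("misclassification error rate =", rate, "\n")

# A length mismatch now raises an error instead of a recycling warning.
res <- tryCatch(
  misclass_rate(pred_a[1:3], actual),
  error = function(e) "length mismatch"
)
print(res)
```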

# Save your notebook, then `File > Close and Halt`