# Practicing Machine Learning with Naïve Bayes

By now we should be familiar with the major components involved in classification, but we still need to cover how to do Naïve Bayes in `R`. Much like the `Python` exercise, we are going to learn how to create training and testing sets, as well as how to use feature selection to reduce the number of features that go into our model.

We begin by loading the appropriate library. Again, there are several libraries that have a Naïve Bayes classifier built in, but today we are going to be using `e1071`.

In [1]:
library(e1071)

Then we will begin this notebook like many of the notebooks before it, by looking at the data.

In [2]:
head(iris)

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa


In the `Python` practice notebook, we discussed the importance of splitting our data into training and testing sets. We are going to do the same thing with `R` before we train our model. The method is very similar...

In [2]:
set.seed(1)
train_ind <- sample(seq_len(nrow(iris)), size = 100)
iris[train_ind,]

Row,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
40,5.1,3.4,1.5,0.2,setosa
56,5.7,2.8,4.5,1.3,versicolor
85,5.4,3.0,4.5,1.5,versicolor
134,6.3,2.8,5.1,1.5,virginica
30,4.7,3.2,1.6,0.2,setosa
131,7.4,2.8,6.1,1.9,virginica
137,6.3,3.4,5.6,2.4,virginica
95,5.6,2.7,4.2,1.3,versicolor
90,5.5,2.5,4.0,1.3,versicolor
9,4.4,2.9,1.4,0.2,setosa


This looks complicated, but it's not. We can take it piece by piece, starting with the innermost part, `seq_len(nrow(iris))`. All this does is create a sequence of numbers from 1 to the number of rows in the iris data frame, 150. Then we call the `sample()` function and specify that we want only 100 numbers returned, sampled randomly from the sequence we just created. We call this sequence of randomly sampled numbers `train_ind`, as we will use it to index rows of the iris data frame, which gives us a training set. The `set.seed(1)` call just makes the sample reproducible.
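
To see each piece in isolation, here is a minimal sketch using only base R. The names `idx_a` and `idx_b` are illustrative and not part of the notebook. It shows that `seq_len()` produces the integers 1 through n, that `sample()` draws without replacement by default, and that resetting the seed reproduces the same draw.

```r
# seq_len(n) returns the integers 1, 2, ..., n
seq_len(5)

# With the same seed, sample() returns the same indexes each time
set.seed(1)
idx_a <- sample(seq_len(nrow(iris)), size = 100)
set.seed(1)
idx_b <- sample(seq_len(nrow(iris)), size = 100)

identical(idx_a, idx_b)   # TRUE: the draw is reproducible
anyDuplicated(idx_a)      # 0: no index is drawn twice
length(idx_a)             # 100
```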

**Activity 1**: *Create a training data frame by using the `train_ind` sequence on the iris data frame. Call this frame `train`.*

In [7]:
# Code for Activity 1 goes here
# *****************************
train <- iris[train_ind,]




And now we need to make a testing frame from the indexes that are not in the `train_ind` sequence. In `R` we can do this with the notation `dataframe[-sequence,]`, where `dataframe` is the data frame object we are working with and `sequence` is the sequence of indexes. The "`-`" sign specifies that we don't want those indexes.
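
As a quick illustration of negative indexing, here is a sketch on a tiny made-up data frame. The names `df` and `keep` are hypothetical and not part of the lab data.

```r
# A small illustrative data frame (hypothetical, not part of the lab data)
df <- data.frame(id = 1:6, value = c(10, 20, 30, 40, 50, 60))
keep <- c(2, 4, 5)

df[keep, ]    # rows 2, 4 and 5
df[-keep, ]   # every row except 2, 4 and 5, i.e. rows 1, 3 and 6

# The two pieces never overlap and together cover the whole frame
nrow(df[keep, ]) + nrow(df[-keep, ]) == nrow(df)   # TRUE
```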

**Activity 2**: *Create a testing data frame from the rest of the data not included in `train`. Call this data frame `test`.*

In [8]:
# Code for Activity 2 goes here
# *****************************
test <- iris[-train_ind,]




Now that we have both the training and testing sets defined, we can specify our formula. We will begin by predicting `Species` from all of the features. Let's do that now.

In [9]:
# just in case defining these two data frames was giving you trouble...
train <- iris[train_ind,]
test <- iris[-train_ind,]

In [10]:
frmla <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width 
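
As a side note, `R` formulas also accept `.` as shorthand for "every other column in the data". The sketch below checks, using base R only, that for `iris` the shorthand expands to the same four predictors as the explicit formula. The names `frmla_dot`, `explicit_terms`, and `dot_terms` are illustrative.

```r
frmla     <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
frmla_dot <- Species ~ .

# terms() lists the predictors; the data argument is needed to expand "."
explicit_terms <- attr(terms(frmla), "term.labels")
dot_terms      <- attr(terms(frmla_dot, data = iris), "term.labels")

identical(explicit_terms, dot_terms)   # TRUE
```

Writing the formula out in full, as the notebook does, makes it easier to remove individual features later.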

And now we can train our model on our training set using our formula to specify our target and inputs. We will call this model `m`. 

In [11]:
m <- naiveBayes(frmla, data = train)
m



Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
    setosa versicolor  virginica 
      0.34       0.44       0.22 

Conditional probabilities:
            Sepal.Length
Y                [,1]      [,2]
  setosa     4.958824 0.3518857
  versicolor 5.904545 0.5567570
  virginica  6.245455 0.8442318

            Sepal.Width
Y                [,1]      [,2]
  setosa     3.341176 0.4899730
  versicolor 2.809091 0.2543134
  virginica  2.909091 0.3884702

            Petal.Length
Y                [,1]      [,2]
  setosa     1.458824 0.1371989
  versicolor 4.304545 0.4412938
  virginica  5.345455 0.5645594

            Petal.Width
Y                 [,1]       [,2]
  setosa     0.2294118 0.07717436
  versicolor 1.3181818 0.20385888
  virginica  1.9545455 0.23393861
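
For numeric predictors, `e1071` reports the per-class mean in column `[,1]` and the per-class standard deviation in column `[,2]`. As a rough cross-check using base R only, the sketch below recomputes those two statistics for `Petal.Length` on a training split built the same way. The name `petal_stats` is illustrative, and the exact values depend on the random number generator of your `R` version, so they may differ from the output above.

```r
set.seed(1)
train_ind <- sample(seq_len(nrow(iris)), size = 100)
train <- iris[train_ind, ]

# Per-class mean and standard deviation of one predictor
petal_stats <- cbind(
  mean = tapply(train$Petal.Length, train$Species, mean),
  sd   = tapply(train$Petal.Length, train$Species, sd)
)
petal_stats
```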


Now we can use the predict function on our testing data to assess the performance of the model. But this time we are going to create a table that shows the number of points properly classified and misclassified.

To do this, we can call the `predict()` function and specify the dataset we want it to predict. But be careful: the dataset that you test on must contain the same variables as the inputs of the formula. It is for that reason that `test[,-5]` below removes the 5th column, the `Species` column, so that only the inputs are passed to the model.

In [25]:
table(predict(m, test[,-5]), test[,5])

            
             setosa versicolor virginica
  setosa         33          0         0
  versicolor      0         27         2
  virginica       0          1        37

And there we have it: only 3 points were misclassified! Not bad.
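
The count of misclassified points can also be computed rather than read by eye: correct predictions sit on the diagonal of the table, and everything off the diagonal is a misclassification. Here is a minimal sketch that re-enters the counts from the table above by hand. The names `cm`, `n_wrong`, and `accuracy` are illustrative.

```r
classes <- c("setosa", "versicolor", "virginica")

# Rows are predicted classes, columns are actual classes
cm <- matrix(
  c(33,  0,  0,
     0, 27,  2,
     0,  1, 37),
  nrow = 3, byrow = TRUE,
  dimnames = list(predicted = classes, actual = classes)
)

n_wrong  <- sum(cm) - sum(diag(cm))   # total of the off-diagonal entries
accuracy <- sum(diag(cm)) / sum(cm)   # share of correct predictions

n_wrong
accuracy
```

The same two expressions work directly on an object returned by `table()`, so there is no need to retype the counts in practice.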

But remember the Decision Tree lab notebook in `R`, where the model reported that it only used petal length, petal width, and sepal length? We found the same thing in the Decision Trees practice notebook.

How would performance change if we took the remaining variable, `Sepal.Width`, out?

**Activity 3**: *Prune the formula and take out the `Sepal.Width` variable. Call this new formula `prune_frmla`.*

In [14]:
# Code for Activity 3 goes here
# *****************************
prune_frmla <- Species ~ Sepal.Length + Petal.Length + Petal.Width 
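
An alternative to retyping the formula is base R's `update()`, which edits an existing formula. This is a sketch, assuming `frmla` is defined as earlier in the notebook. The name `prune_alt` is illustrative.

```r
frmla <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

# ". ~ ." means "keep both sides as they are"; "- Sepal.Width" drops that term
prune_alt <- update(frmla, . ~ . - Sepal.Width)

# List the predictors that remain
attr(terms(prune_alt), "term.labels")
```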


**Activity 4**: *Now create a new model using `prune_frmla` with the training data set. Call this new model `prf`.*

In [15]:
# Code for Activity 4 goes here
# *****************************
prf <- naiveBayes(prune_frmla, data = train)
prf



Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
    setosa versicolor  virginica 
      0.34       0.44       0.22 

Conditional probabilities:
            Sepal.Length
Y                [,1]      [,2]
  setosa     4.958824 0.3518857
  versicolor 5.904545 0.5567570
  virginica  6.245455 0.8442318

            Petal.Length
Y                [,1]      [,2]
  setosa     1.458824 0.1371989
  versicolor 4.304545 0.4412938
  virginica  5.345455 0.5645594

            Petal.Width
Y                 [,1]       [,2]
  setosa     0.2294118 0.07717436
  versicolor 1.3181818 0.20385888
  virginica  1.9545455 0.23393861


Is `prf` better than, worse than, or about the same as `m` on our testing data set?

**Activity 5**: *Find the number of misclassified points using the `prf` model and the testing data set. Remember that when predicting on the testing data, you will have to remove both the target variable and the `Sepal.Width` variable.*

In [24]:
# Code for Activity 5 goes here
# *****************************
table(predict(prf, test[,-c(2,5)]), test[,5])



            
             setosa versicolor virginica
  setosa         33          0         0
  versicolor      0         27         1
  virginica       0          1        38
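
Dropping columns by position, as in `test[,-c(2,5)]`, relies on remembering that `Sepal.Width` is column 2 and `Species` is column 5. Here is a sketch of the same selection by name, which may be easier to read and check. The names `drop_cols`, `by_position`, and `by_name` are illustrative.

```r
set.seed(1)
train_ind <- sample(seq_len(nrow(iris)), size = 100)
test <- iris[-train_ind, ]

drop_cols <- c("Sepal.Width", "Species")

by_position <- test[, -c(2, 5)]
by_name     <- test[, setdiff(names(test), drop_cols)]

identical(by_position, by_name)   # TRUE
names(by_name)
```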