# Module 20 Final Assignment: Implementing Naïve Bayes and Gaussian Naïve Bayes Classifiers


## Learning Outcomes Addressed:
6. Implement spam detection using Python.
8. Implement Naïve Bayes theorem using Scikit-learn.
9. Implement Gaussian Naïve Bayes theorem using Scikit-learn.

## Assignment Overview

In this assignment, you will implement Naïve Bayes and Gaussian Naïve Bayes classifiers using the Python Scikit-learn *library*. In the first part of the assignment, you will work with a dummy dataset to implement a simple Naïve Bayes classifier that assigns each observation to one of two classes.

In the second part of the assignment, you will extend your Naïve Bayes classifier to make predictions on more than two labels using the `Wine` dataset from the Scikit-learn *library*.

Finally, in the third part of the assignment, you will be working with a dataset that contains both normal and spam SMS messages. For this last exercise, you will be using a Gaussian Naïve Bayes classifier to predict whether a message is normal or spam.

## Part 1: Naïve Bayes Classifier

The first exercise of this assignment will guide you through a simple implementation of a Naïve Bayes classifier using the Scikit-learn *library* and a dummy dataset with three columns: `weather`, `temperature`, and `play`. The first two are features (`weather` and `temperature`) and `play` is the label.

The goal of the classifier is to decide whether to play outside based on the weather and the temperature.

In the code cell below, the `weather` and `temperature` features and the `play` label are defined. Run the code cell below to define your data.

In [3]:
# Assigning features and label variables
weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny',
'Rainy','Sunny','Overcast','Overcast','Rainy']
temperature=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild']

play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']


The first thing you need to do is to convert the *string* labels into numbers. In other words, you want to encode each label in the `weather` *list* into a number. 

Because the `weather` *list* contains only 14 entries and only three different labels (`Sunny`, `Overcast`, and `Rainy`), you can manually encode these *strings*. For example, you can do this by assigning the value 0 to the `Rainy` *string*, the value 1 to the `Overcast` *string*, and the value 2 to the `Sunny` *string*.

Although this may be a viable option for this example, encoding the labels manually could easily result in errors in your code.

The Scikit-learn *library* provides the `LabelEncoder` *class*, which encodes labels as integers between 0 and one less than the number of distinct *classes*.

The pseudocode below demonstrates how to use this *class*:

```Python
# Import LabelEncoder
from sklearn import preprocessing
#creating labelEncoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
label_encoded=le.fit_transform(label_to_encode)
```
### Question 1

In the code cell below, complete the code to encode the `weather` feature.

In [4]:
# Import LabelEncoder
from sklearn import preprocessing
#creating labelEncoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
weather_encoded = le.fit_transform(weather)


In [5]:
print(weather_encoded)

[2 2 0 1 1 1 0 2 2 1 2 0 0 1]
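To check which integer stands for which string, one option is to inspect the fitted encoder's `classes_` attribute. The sketch below uses a shortened, hypothetical `weather_sample` list rather than the full data. `LabelEncoder` sorts the distinct values, which is why `Overcast` maps to 0, `Rainy` to 1, and `Sunny` to 2 in the output above.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical shortened sample, for illustration only
weather_sample = ['Sunny', 'Overcast', 'Rainy', 'Sunny']

encoder = LabelEncoder()
encoded = encoder.fit_transform(weather_sample)

# classes_ holds the distinct labels in sorted order;
# a label's position in that array is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts integer codes back into the original strings
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```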


Similarly, you can also encode the *`temperature`* and the *`play`* columns.

### Question 2
In the code cell below, complete the code to encode the remaining two columns.

In [6]:
# Converting string labels into numbers
temp_encoded=le.fit_transform(temperature)
play_encoded=le.fit_transform(play)

In [7]:
print(temp_encoded)

[1 1 1 2 0 0 0 2 0 2 2 2 1 2]


In [8]:
print(play_encoded)

[0 0 1 1 1 0 1 0 1 1 1 1 1 0]
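Note that each call to `fit_transform` refits the encoder, so after the cells above `le` only remembers the classes of the last column it saw (`play`). If you want to decode several columns later, one possible approach is to keep a separate encoder per column. The sketch below uses short, hypothetical sample lists and a hypothetical `encoders` dictionary.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical shortened samples, for illustration only
temperature_sample = ['Hot', 'Mild', 'Cool']
play_sample = ['No', 'Yes', 'Yes']

# Reusing one encoder: the second fit overwrites the first
shared = LabelEncoder()
shared.fit_transform(temperature_sample)
shared.fit_transform(play_sample)
shared_classes = [str(c) for c in shared.classes_]
print(shared_classes)  # only the play classes remain

# One encoder per column keeps every mapping available
columns = {'temperature': temperature_sample, 'play': play_sample}
encoders = {name: LabelEncoder().fit(values) for name, values in columns.items()}
temperature_classes = [str(c) for c in encoders['temperature'].classes_]
print(temperature_classes)
```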


### Question 3
In the code cell below, create a NumPy *array* from the `weather_encoded` and `temp_encoded` values.

In [9]:
import numpy as np
features=np.array([weather_encoded, temp_encoded])

In [10]:
print(features)

[[2 2 0 1 1 1 0 2 2 1 2 0 0 1]
 [1 1 1 2 0 0 0 2 0 2 2 2 1 2]]


Now you are ready to build a Naïve Bayes classifier with Scikit-learn by following these steps:

- Create the Naïve Bayes classifier.
- Fit the dataset on the classifier.
- Perform the prediction.

Run the code cell below:

In [11]:
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier
model = GaussianNB()

# Train the model using the training sets
model.fit(features.reshape(14, 2),play_encoded)



GaussianNB()
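One detail worth checking in the cell above: `features` has shape `(2, 14)`, and `reshape(14, 2)` refills the same values in row-major order instead of pairing each weather code with its temperature code. The sketch below, on a hypothetical four-observation sample, compares `reshape` with `np.column_stack` (equivalently, a transpose), assuming the intent is one row per observation.

```python
import numpy as np

# Hypothetical four-observation sample of encoded features
weather_codes = np.array([2, 2, 0, 1])
temp_codes = np.array([1, 1, 1, 2])

stacked = np.array([weather_codes, temp_codes])  # shape (2, 4): one row per feature

# reshape keeps the flattened order 2,2,0,1,1,1,1,2 and cuts it into pairs,
# so the first row becomes two weather codes rather than (weather, temperature)
reshaped = stacked.reshape(4, 2)
print(reshaped)

# column_stack (or stacked.T) yields one (weather, temperature) row per observation
paired = np.column_stack([weather_codes, temp_codes])
print(paired)
```

If the rows are meant to be observations, fitting on `paired` (or `features.T` in the notebook) keeps each weather value next to its own temperature value. Doing so may change the predictions printed below.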

Next, let's have a look at some predictions.

You will first try to predict whether you want to play outside if the weather is overcast and the temperature is mild.

In [12]:
#Predict Output
predicted= model.predict([[0,2]]) # 0:Overcast, 2:Mild
print("Predicted Value:", predicted)

Predicted Value: [1]


### Question 4:

What does the predicted value represent? This is an open-ended question that requires a written response.
    
    
You can double-click this cell to write your answer.

Question 4: The returned value 1 is the encoded `play` label `Yes`, so the model predicts that you would play outside when the weather is overcast and the temperature is mild.


Next, let's predict whether you want to play outside if it's rainy and hot.

In [13]:
#Predict Output
predicted= model.predict([[1,1]]) # 1:Rainy, 1:Hot
print("Predicted Value:", predicted)

Predicted Value: [1]
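To read a prediction as a label rather than an integer, you can pass it through the label encoder's `inverse_transform` method, and `predict_proba` shows how confident the model is. The sketch below is a minimal illustration on hypothetical, clearly separated toy data (not the weather data), with hypothetical variable names.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: two well-separated groups of points
toy_features = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
toy_labels = ['No', 'No', 'No', 'Yes', 'Yes', 'Yes']

label_encoder = LabelEncoder()
toy_targets = label_encoder.fit_transform(toy_labels)  # No -> 0, Yes -> 1

toy_model = GaussianNB().fit(toy_features, toy_targets)

new_point = np.array([[5.5, 5.5]])
predicted_code = toy_model.predict(new_point)

# Map the integer prediction back to its original string label
predicted_label = label_encoder.inverse_transform(predicted_code)

# One probability per class, in the order of toy_model.classes_
class_probabilities = toy_model.predict_proba(new_point)
print(predicted_label[0], class_probabilities.round(3))
```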


### Question 5

What does the predicted value represent? This is an open-ended question that requires a written response.
    
    
You can double-click this cell to write your answer.


Question 5: (Write your answer here.)

## Part 2: Naïve Bayes Classifier with Multiple Labels

In Part 1 of this assignment, you worked on a problem that classified your outcome either as 1 or 0.

In this part of the assignment, you will learn how to classify a problem with more than two possible outcomes.

For this exercise, you will use the [`Wine`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine) dataset from the Scikit-learn *library*.

This dataset comprises 13 features (`alcohol`, `malic_acid`, `ash`, `alcalinity_of_ash`, `magnesium`, `total_phenols`, `flavanoids`, `nonflavanoid_phenols`, `proanthocyanins`, `color_intensity`, `hue`, `od280/od315_of_diluted_wines`, and `proline`) and three types of wine: `class_0`, `class_1`, and `class_2`.

Let's start by importing the dataset from Scikit-learn. Run the code cell below:

In [14]:
#Import scikit-learn dataset library
from sklearn import datasets

#Load dataset
wine = datasets.load_wine()

The code cell below *prints* the names of the features in the dataset. Run the code cell below.

In [15]:
# print the names of the 13 features
print("Features: ", wine.feature_names)

Features:  ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']


### Question 6
Complete the code cell below to *print* the name of the label. Note that in this dataset the name of the label can be accessed by using the `target_names` *attribute*.

In [16]:
# print the label type of wine(Class_0, Class_1, Class_2)
print("Labels: ", wine.target_names)

Labels:  ['class_0' 'class_1' 'class_2']


### Splitting the Data Into a Training and a Testing Set

Because this dataset is widely used by the data science community, the features and labels are already defined.

The feature data can be accessed via the `wine.data` *attribute*. The labels can be accessed via the `wine.target` *attribute*.

Therefore, you can simply split the features and labels into training and testing sets by using the [*`train_test_split()`*](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) *function* from Scikit-learn.

### Question 7
Complete the code in the code cell below by setting the size of the testing set equal to 30% of the entire data. Set the `random_state` equal to 109.

In [17]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=.3 ,random_state=109) 
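A quick way to confirm the split is to look at the resulting shapes. The sketch below repeats the split with hypothetical variable names and also shows the optional `stratify` argument, which keeps the class proportions similar in both sets. Stratification is an addition for illustration, not something the assignment requires.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

wine_data = datasets.load_wine()

# Same settings as the assignment: 30% test set, random_state=109
X_tr, X_te, y_tr, y_te = train_test_split(
    wine_data.data, wine_data.target, test_size=0.3, random_state=109
)
print(X_tr.shape, X_te.shape)  # 178 samples split into 124 and 54 rows

# Optional variant: stratify on the labels to preserve class proportions
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    wine_data.data, wine_data.target, test_size=0.3, random_state=109,
    stratify=wine_data.target,
)
# Number of test samples per class in each variant
print(np.bincount(y_te, minlength=3), np.bincount(y_te_s, minlength=3))
```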

Now you are ready to build a Naïve Bayes classifier with Scikit-learn by following these steps:

- Create the Naïve Bayes classifier.
- Fit the dataset on the classifier.
- Perform the prediction.

### Question 8

In the code cell below, create a Gaussian Naïve Bayes classifier named `gnb`. Next, fit the classifier on the training set by using the `fit()` *function*. Finally, perform a prediction by using the `predict()` *function* on your `X_test` set.

In [18]:
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier
gnb = GaussianNB()

#Train the model using the training sets
gnb.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = gnb.predict(X_test)

Now you are ready to check the accuracy using actual and predicted values.

### Question 9

In the code cell below, fill in the ellipsis with *`y_test`* and *`y_pred`* to compute the accuracy of your model.

In [21]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test,y_pred))

Accuracy: 0.9074074074074074
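Accuracy is a single number, so it does not show which classes get confused with each other. One way to look closer is a confusion matrix. The sketch below rebuilds the same pipeline with hypothetical variable names and checks that the accuracy equals the sum of the diagonal divided by the total number of test samples.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

wine_data = datasets.load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(
    wine_data.data, wine_data.target, test_size=0.3, random_state=109
)
wine_pred = GaussianNB().fit(X_tr, y_tr).predict(X_te)

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_te, wine_pred, labels=[0, 1, 2])
print(cm)

# Correct predictions sit on the diagonal
accuracy_from_cm = cm.trace() / cm.sum()
print(accuracy_from_cm)
```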


## Part 3: Gaussian Naïve Bayes Classifier to Classify Spam SMS Messages

In the last exercise for this assignment, you will work on using a Gaussian Naïve Bayes classifier to classify spam SMS messages.

For this exercise, you will use the [`SMS Spam Collection Dataset`](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection), which contains 5,572 SMS messages.

Run the code cell below to read the data.

In [22]:
import pandas as pd



sms_spam = pd.read_csv('SMSSpamCollection', sep='\t',
header=None, names=['label', 'text'])


### Question 10

In the code cells below, use the correct `pandas` *attribute* and *function* to retrieve the shape of the `sms_spam` *dataframe* and to visualize its first five rows.

In [24]:
#print the shape
sms_spam.shape



(5572, 2)

In [25]:
#visualize the first five rows

sms_spam.head()

|   | label | text |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### Question 11

In the code cell below, use the pandas [*`value_counts()`*](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) *function* to see how many messages are normal and how many are spam.

Inside the *function*, set the `normalize` *argument* equal to `True`.

In [26]:
sms_spam['label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: label, dtype: float64

You can see that about 87% of the messages are ham (non-spam), and the remaining 13% are spam. 
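These proportions also give a useful baseline: a classifier that always predicts the majority class would be right as often as that class occurs, which is about 87% here. The sketch below illustrates the idea on a hypothetical four-message label column.

```python
import pandas as pd

# Hypothetical label column: three ham messages and one spam message
toy_labels = pd.Series(['ham', 'ham', 'ham', 'spam'], name='label')

# normalize=True returns proportions instead of raw counts
shares = toy_labels.value_counts(normalize=True)
print(shares)

# Accuracy of a trivial model that always predicts the most frequent class
majority_share = shares.max()
print(majority_share)
```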

### Question 12

In the code cell below, complete the dictionary inside the `replace` *function* to replace the `ham` values with 0 and the `spam` values with 1.

In [27]:
sms_spam['label'] = (sms_spam['label'].replace({'ham':0,'spam':1}))
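An alternative to `replace` is `Series.map`. The difference matters when a value is missing from the dictionary: `replace` leaves it unchanged, while `map` turns it into `NaN`, which makes unexpected labels easy to spot. The sketch below uses hypothetical toy data, including a made-up `eggs` label.

```python
import pandas as pd

label_codes = {'ham': 0, 'spam': 1}

# Every value is covered by the dictionary
toy_labels = pd.Series(['ham', 'spam', 'ham'])
encoded = toy_labels.map(label_codes)
print(encoded.tolist())

# 'eggs' is a made-up label that the dictionary does not cover
with_unknown = pd.Series(['ham', 'eggs']).map(label_codes)
print(with_unknown.isna().sum())  # number of unmapped values
```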

### Question 13

In the code below, split the `text` feature and the `label` label into training and testing sets. 

Set the `random_state` *argument* equal to 1.

In [28]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(sms_spam['text'], 
                                                    sms_spam['label'], 
                                                    random_state=1)

print('Total: {} rows'.format(sms_spam.shape[0]))
print('Train: {} rows'.format(X_train.shape[0]))
print(' Test: {} rows'.format(X_test.shape[0]))

Total: 5572 rows
Train: 4179 rows
 Test: 1393 rows


Because the `text` feature contains raw text rather than numbers, you need to use `CountVectorizer()` from Scikit-learn to convert each message into a numerical vector of word counts.

Note that the `CountVectorizer` is fitted only on the training data, but is used to transform both the training and test data.

Run the code cell below.

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()

train_data = count_vector.fit_transform(X_train)
test_data = count_vector.transform(X_test)
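To see what the vectorizer produces, it can help to run it on a couple of short sentences. The sketch below uses hypothetical toy messages: the vocabulary is learned from the training text only, so a word that appears only in the test text (`tomorrow`) is ignored when transforming.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, for illustration only
toy_train = ["free prize now", "call me now"]
toy_test = ["free call tomorrow"]

vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(toy_train)  # learns the vocabulary
test_counts = vectorizer.transform(toy_test)        # reuses that vocabulary

# Words ordered by their column index in the count matrix
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)

# The result is a sparse matrix; toarray() makes it readable for small examples
print(train_counts.toarray())
print(test_counts.toarray())
```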

### Question 14

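The code cell below creates the Naïve Bayes classifier used to train your model. Note that the cell instantiates `MultinomialNB`, the Naïve Bayes variant designed for count features such as those produced by `CountVectorizer`, rather than `GaussianNB`.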

Replace the ellipsis with the names of the `X` and `y` training data defined above.

**HINT:** The `X` will be the vectorized set, `train_data`.

In [30]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()

naive_bayes.fit(train_data, y_train)

MultinomialNB()
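One practical reason to use `MultinomialNB` here is that `CountVectorizer` returns a sparse matrix. `MultinomialNB` accepts sparse input directly, whereas `GaussianNB` expects a dense array. The sketch below demonstrates this on hypothetical toy messages. Converting with `toarray()` works for small data but can use a lot of memory on a large vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Hypothetical toy messages: 1 = spam, 0 = ham
toy_texts = ["win free prize now", "free cash prize",
             "see you at lunch", "call me after lunch"]
toy_targets = [1, 1, 0, 0]

vectorizer = CountVectorizer()
toy_counts = vectorizer.fit_transform(toy_texts)  # SciPy sparse matrix

# GaussianNB raises an error when given sparse input
gaussian_rejected_sparse = False
try:
    GaussianNB().fit(toy_counts, toy_targets)
except (TypeError, ValueError):
    gaussian_rejected_sparse = True
print(gaussian_rejected_sparse)

# It works after an explicit conversion to a dense array
gaussian_model = GaussianNB().fit(toy_counts.toarray(), toy_targets)

# MultinomialNB handles the sparse counts as they are
multinomial_model = MultinomialNB().fit(toy_counts, toy_targets)
new_counts = vectorizer.transform(["free prize", "lunch"])
toy_predictions = multinomial_model.predict(new_counts)
print(toy_predictions)
```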

Finally, make predictions with your model.

Run the code cell below.

In [31]:
predictions = naive_bayes.predict(test_data)

Now you are ready to check the accuracy using actual and predicted values.

### Question 15

In the code cell below, fill in the ellipsis with *`y_test`* and *`predictions`* to compute the accuracy of your model.

In [32]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn.metrics import accuracy_score

# Model Accuracy, how often is the classifier correct?
# Store the result under a new name so the imported function is not overwritten
accuracy = accuracy_score(y_test, predictions)

In [33]:
print(accuracy)

0.9885139985642498
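Because only about 13% of the messages are spam, accuracy alone can look high even when many spam messages slip through. Precision and recall on the spam class give a fuller picture. The sketch below uses small hypothetical label lists (not the assignment's results) to show how the three metrics can differ.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = spam, 0 = ham; one of the two spam messages is missed
toy_true = [0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 0, 1, 0]

toy_accuracy = accuracy_score(toy_true, toy_pred)    # share of all messages classified correctly
toy_precision = precision_score(toy_true, toy_pred)  # share of predicted spam that is really spam
toy_recall = recall_score(toy_true, toy_pred)        # share of real spam that was caught

print(toy_accuracy, toy_precision, toy_recall)
```

In the notebook, the same functions could be called with `y_test` and `predictions`.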
