# Naive Bayes


## Introduction

Naive Bayes is one of the simplest yet most effective classification algorithms. It is based on Bayes’ Theorem and assumes that the predictors (features) are independent of one another. A Naive Bayes model is easy to build and particularly useful for very large data sets. There are two parts to this algorithm:

- Naive
- Bayes

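The "Bayes" part refers to Bayes' theorem, which the classifier uses to compute the probability of each class given the observed features. The "Naive" part refers to the assumption that the presence of a feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered an apple if it is red, round and about 3 inches in diameter. Even if these features depend on each other or on the existence of other features, the classifier treats each of them as contributing independently to the probability that a particular fruit is an apple, an orange or a banana, and that is why it is known as “Naive”.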

### Conditional Probability and Bayes' Theorem

Here are a couple of short videos: one on conditional probability and another on Bayes' theorem. These will give you a quick primer on the subject.

In [1]:
## Run this cell (shift+enter) to see the video

from IPython.display import IFrame
IFrame("https://www.youtube.com/embed/ibINrxJLvlM", width="814", height="509")

In [2]:
## Run this cell (shift+enter) to see the video

from IPython.display import IFrame
IFrame("https://www.youtube.com/embed/XQoLVl31ZfQ", width="814", height="509")
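
As a quick complement to the videos, here is a minimal sketch of Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B), applied to a made-up spam example. The function name and all the probabilities below are hypothetical, chosen only for illustration; they are not taken from the dataset used later in this notebook.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A | B) using Bayes' theorem.

    P(B) is expanded with the law of total probability:
    P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    """
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence


# Hypothetical numbers: 20% of messages are spam, the word "free" appears
# in 60% of spam messages and in 5% of non-spam messages.
p_spam_given_free = bayes_posterior(prior_a=0.2,
                                    likelihood_b_given_a=0.6,
                                    likelihood_b_given_not_a=0.05)
print(round(p_spam_given_free, 3))  # 0.75
```

With these assumed numbers, seeing the word "free" raises the probability that a message is spam from 20% to 75%.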

## How Does the Naive Bayes Classifier Work?

Now that you understand conditional probability and Bayes' theorem, let's see how we can use them to build a classifier. The Naive Bayes classifier is one of the most useful and easy-to-use ML algorithms.

In [3]:
## Run this cell (shift+enter) to see the video

from IPython.display import IFrame
IFrame("https://www.youtube.com/embed/O2L2Uv9pdDA", width="814", height="509")
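
To make the idea concrete, here is a minimal from-scratch sketch of a Naive Bayes text classifier. It is not the scikit-learn implementation used later in this notebook. The function names and the four training messages are invented for illustration, and the add-one (Laplace) smoothing is an assumption of this sketch. Each class is scored as its log prior plus the sum of the log likelihoods of the words in the message, which is where the "naive" independence assumption comes in.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(messages, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized messages."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # word_counts[label][word] -> count
    vocabulary = set()
    for tokens, label in zip(messages, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        denom = sum(word_counts[c].values()) + alpha * len(vocabulary)
        # Add-alpha smoothing keeps a word unseen in one class from
        # forcing that class's probability to zero.
        log_likelihood[c] = {w: math.log((word_counts[c][w] + alpha) / denom)
                             for w in vocabulary}
    return log_prior, log_likelihood, vocabulary


def predict(tokens, log_prior, log_likelihood, vocabulary):
    """Pick the class with the highest log posterior (up to a shared constant)."""
    scores = {}
    for c in log_prior:
        # Words never seen during training are skipped.
        scores[c] = log_prior[c] + sum(log_likelihood[c][w]
                                       for w in tokens if w in vocabulary)
    return max(scores, key=scores.get)


# Tiny made-up training set (not from the SMS dataset)
train_messages = [["win", "free", "prize"], ["free", "cash", "now"],
                  ["see", "you", "at", "lunch"],
                  ["are", "you", "free", "for", "lunch"]]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(train_messages, train_labels)
print(predict(["free", "prize", "cash"], *model))  # spam
print(predict(["lunch", "at", "noon"], *model))    # ham
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow for long messages.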

## Project - Spam Message Classification Using Naive Bayes Classifier

Spam detection is one of the major applications of machine learning on the web today. Almost all major email service providers have built-in spam detection systems that automatically classify such mail as 'Junk Mail'.

In this project we will use the Naive Bayes algorithm to create a model that can classify SMS messages as spam or not spam, based on the training we give to the model. It is important to have some intuition about what a spammy text message might look like. Such messages usually contain words like 'free', 'win', 'winner', 'cash' and 'prize', as they are designed to catch your eye and tempt you to open them. Spam messages also tend to have words written in all capitals and to use a lot of exclamation marks. To the recipient, it is usually pretty straightforward to identify a spam text, and our objective here is to train a model to do that for us!

Identifying spam messages is a binary classification problem, as messages are classified as either 'Spam' or 'Not Spam' and nothing else. It is also a supervised learning problem, as we will feed a labelled dataset into the model, which it can learn from in order to make future predictions.

We will be using a dataset from the UCI Machine Learning Repository - https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

## Data Load

Import the dataset into a pandas DataFrame using the read_table function. Because this is a tab-separated dataset, we will use '\t' as the value of the 'sep' argument, which specifies this format.

Also, name the columns by passing the list `['label', 'sms_message']` to the 'names' argument of read_table().

In [1]:
import pandas as pd
# Dataset from - https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
df = pd.read_table('https://raw.githubusercontent.com/anikannal/ML_Projects/master/data/SMSSpamCollection',
                   sep='\t', 
                   header=None, 
                   names=['label', 'sms_message'])

In [2]:
df.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Data Preprocessing

Now that we have a basic understanding of what our dataset looks like, let's convert our labels to binary variables: 0 to represent 'ham' (i.e. not spam) and 1 to represent 'spam'.

You might be wondering why we need this step. Scikit-learn classifiers can be trained on string labels directly, since they encode the classes internally. However, the binary metrics we use later, such as precision and recall, treat the label 1 as the positive class by default, so with string labels we would have to specify the positive class explicitly through the `pos_label` argument.

Hence, to avoid unexpected 'gotchas' later, it is good practice to feed our categorical labels into the model as integers.

In [3]:
# Convert the values in the 'label' column to numerical values using the map method as follows:
# {'ham':0, 'spam':1} This maps the 'ham' value to 0 and the 'spam' value to 1.

df['label'] = df.label.map({'ham':0, 'spam':1})

In [5]:
# to get an idea of the size of the dataset we are dealing with, print out number of rows and columns using 'shape'.
print(df.shape)

(5572, 2)


### Train-Test Split

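Split the dataset into a training set and a testing set using the train_test_split function in sklearn. The call below returns four variables:

- `X_train`: the messages used for training
- `X_test`: the messages held out for testing
- `y_train`: the labels for the training messages
- `y_test`: the labels for the testing messages

Since we do not pass a `test_size`, scikit-learn holds out 25% of the rows for testing by default.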

In [9]:
#split into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

print('Number of rows in the total set: ',df.shape[0])
print('Number of rows in the training set: ',X_train.shape[0])
print('Number of rows in the test set: ',X_test.shape[0])

Number of rows in the total set:  5572
Number of rows in the training set:  4179
Number of rows in the test set:  1393
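
To see where the 4179 / 1393 split comes from, here is a rough pure-Python sketch of what a shuffled 75/25 split does conceptually. The function `simple_train_test_split` is a hypothetical helper written for this illustration. It does not reproduce the exact row order that `train_test_split` produces, and rounding the test size up is an assumption of the sketch.

```python
import math
import random


def simple_train_test_split(items, test_fraction=0.25, seed=1):
    """Shuffle the items and hold out a fraction of them for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = math.ceil(test_fraction * len(items))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]


rows = list(range(5572))  # stand-in for the 5,572 messages
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))  # 4179 1393
```

Fixing the seed plays the same role as `random_state=1` in the cell above: the split is random, but the same rows land in each set every time the code runs.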


### Bag of Words (BoW)

What we have here in our data set is a large collection of text data (5,572 rows). Most ML algorithms expect numerical input, whereas email and SMS messages are mostly text.

Here we'd like to introduce the Bag of Words (BoW) concept, a term used for problems that involve a 'bag of words', i.e. a collection of text data that needs to be worked with. **The basic idea of BoW is to take a piece of text and count the frequency of the words in that text**. It is important to note that BoW treats each word individually, and the order in which the words occur does not matter.

Using a process which we will go through now, we can convert a collection of documents to a matrix, with each document being a row, each word (token) being a column, and each (row, column) value being the frequency of occurrence of that word or token in that document.

The code for this segment is in two parts. First, we learn a vocabulary from the training data and transform the training data into a document-term matrix. Second, we transform the testing data into a document-term matrix using that same vocabulary, without fitting on it, so that the test set has no influence on the features the model is trained on.
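
Before using scikit-learn's `CountVectorizer`, it may help to see the fit/transform idea on a toy example. The sketch below is a simplified pure-Python stand-in, not the library's implementation. The helper names and the three sample sentences are made up. The tokenizer lowercases the text and keeps runs of two or more word characters, which follows the same pattern as `CountVectorizer`'s default settings.

```python
import re
from collections import Counter

# Runs of 2+ word characters, so single-letter words such as "a" are dropped
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Learn a word -> column index mapping from the training documents only."""
    words = sorted({token for doc in documents for token in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def transform(documents, vocabulary):
    """Turn each document into a row of word counts; unknown words are ignored."""
    matrix = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        # Dict order matches the column indices assigned in fit_vocabulary
        matrix.append([counts.get(word, 0) for word in vocabulary])
    return matrix


train_docs = ["Win a FREE prize now", "Are you free for lunch"]
test_docs = ["Free lunch prize, call now"]

vocab = fit_vocabulary(train_docs)
print(list(vocab))
# ['are', 'for', 'free', 'lunch', 'now', 'prize', 'win', 'you']
print(transform(train_docs, vocab))
# [[0, 0, 1, 0, 1, 1, 1, 0], [1, 1, 1, 1, 0, 0, 0, 1]]
print(transform(test_docs, vocab))
# [[0, 0, 1, 1, 1, 1, 0, 0]]
```

Note that "call" in the test sentence does not appear in the training vocabulary, so it is simply dropped. This is the practical consequence of transforming the test data without fitting on it.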

In [12]:
X_train.head()

710     4mths half price Orange line rental & latest c...
3740                           Did you stitch his trouser
2711    Hope you enjoyed your new content. text stop t...
3155    Not heard from U4 a while. Call 4 rude chat pr...
3748    Ü neva tell me how i noe... I'm not at home in...
Name: sms_message, dtype: object

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the CountVectorizer class
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)

In [18]:
type(training_data)

scipy.sparse.csr.csr_matrix

In [17]:
training_data

<4179x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 55209 stored elements in Compressed Sparse Row format>

## Train the Model

We will be using the multinomial Naive Bayes implementation. This classifier is suitable for classification with discrete features (in our case, word counts for text classification) and takes integer word counts as its input. Gaussian Naive Bayes, on the other hand, is better suited to continuous data, as it assumes that the features follow a Gaussian (normal) distribution within each class.

In [19]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()

naive_bayes.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Make predictions

Now that our algorithm has been trained using the training data set we can now make some predictions on the test data
stored in 'testing_data' using predict(). Save your predictions into the 'predictions' variable.

In [20]:
predictions = naive_bayes.predict(testing_data)

In [27]:
print(predictions)
print(predictions[100:120]) # print predictions for samples 100 through 119

[0 0 0 ... 0 1 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0]


## Evaluate the Model

Now that we have made predictions on our test set, our next goal is to evaluate how well our model is doing. There are various metrics for doing so, so first let's do a quick recap of them.

**Accuracy** measures how often the classifier makes the correct prediction. It’s the ratio of the number of correct predictions to the total number of predictions (the number of test data points).

**Precision** tells us what proportion of the messages we classified as spam actually were spam. It is the ratio of true positives (messages classified as spam that actually are spam) to all positives (all messages classified as spam, irrespective of whether that was the correct classification). In other words, precision is

[True Positives/(True Positives + False Positives)]

**Recall (sensitivity)** tells us what proportion of the messages that actually were spam we classified as spam. It is the ratio of true positives (messages classified as spam that actually are spam) to all the messages that actually were spam. In other words, recall is

[True Positives/(True Positives + False Negatives)]

Accuracy by itself is not a very good metric for classification problems with skewed class distributions, as in our case. For example, suppose we had 100 text messages of which only 2 were spam and the remaining 98 were not. We could classify 90 messages as not spam (including the 2 that were actually spam, which would therefore be false negatives) and 10 as spam (all 10 false positives) and still get a reasonably good accuracy score. For such cases, precision and recall come in very handy. These two metrics can be combined into the F1 score, which is the harmonic mean of the precision and recall scores. This score can range from 0 to 1, with 1 being the best possible F1 score.
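
The 100-message scenario above can be worked through by hand. The sketch below computes the four metrics directly from confusion-matrix counts. The helper `classification_metrics` is written for this illustration and is not part of scikit-learn. Returning 0 when a denominator is zero is a convention chosen here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 = 2PR / (P + R); defined as 0 here when precision and recall are both 0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# The scenario from the text: both spam messages are missed (2 false negatives),
# 10 non-spam messages are wrongly flagged (10 false positives), and the
# remaining 88 non-spam messages are classified correctly (88 true negatives).
print(classification_metrics(tp=0, fp=10, fn=2, tn=88))
# (0.88, 0.0, 0.0, 0.0)
```

An accuracy of 0.88 looks respectable, yet the classifier did not catch a single spam message, which is exactly what the zero precision, recall and F1 reveal.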

We will be using all 4 metrics to make sure our model does well. Each of them ranges from 0 to 1, and a score as close to 1 as possible is a good indicator of how well our model is doing.

In [29]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', accuracy_score(y_test, predictions))
print('Precision score: ', precision_score(y_test, predictions))
print('Recall score: ', recall_score(y_test, predictions))
print('F1 score: ', f1_score(y_test, predictions))

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562


## Advantages of Naive Bayes

- One of the major advantages that Naive Bayes has over other classification algorithms is its ability to handle an extremely large number of features. In our case, each word is treated as a feature and there are thousands of different words.
- It performs well even in the presence of irrelevant features and is relatively unaffected by them.
- The other major advantage is its relative simplicity. Naive Bayes works well right out of the box, and tuning its parameters is rarely necessary, except usually in cases where the distribution of the data is known.
- Because it makes strong simplifying assumptions, it rarely overfits the data.
- Another important advantage is that its training and prediction times are very fast for the amount of data it can handle.

Congratulations! You have successfully built a model that can efficiently predict whether an SMS message is spam or not!

This project uses a variety of online resources as references. The tutorial on Naive Bayes by Adarsh was used extensively - https://github.com/adarsh0806/naive_bayes_tutorial/blob/master/Naive_Bayes_tutorial.ipynb