# **Naive Bayes Classification**

#### **Learning Points**
- Define naive Bayes classification.
- Calculate conditional probabilities and likelihoods using naive Bayes.
- Use conditional probabilities to classify text data.
- Explain when to apply Laplace smoothing.
- Explain advantages, disadvantages, and assumptions of naive Bayes.
- Implement naive Bayes classification for text data using scikit-learn.

#### **Introduction to naive Bayes classification**
***Naive Bayes classification*** is a supervised learning method that uses how often each category (Ex: a word) occurs among the training instances of each class to estimate the likelihood that a new instance belongs to that class. Naive Bayes is often used for applications with large amounts of text data. Ex: Identifying the author of a new document based on prior documents with known authors.



#### **Advantages and disadvantages of naive Bayes classification**
Advantages

- Works well for many features
- Fast to calculate
- Handles rare events, categorical data, and missing data well

Disadvantages

- Needs a large amount of training data
- Continuous data must be preprocessed before it can be used
- Assumes features are independent and equally important

#### **Naive Bayes calculation example**

- Naive Bayes uses the messages with a known author to classify a message with an unknown author. First, the proportion of known messages from each author is found.
- For each author, the proportion of messages that contained, or did not contain, each word is found.
- The likelihood that the VIP sent the message "Tomorrow. Bacon!!" is estimated by multiplying, for each word, the probability that a VIP message contains the word (if the word is in the message) or does not contain the word (if it is not).
- These probabilities are estimated from the messages that are known to come from the VIP and are multiplied together.
- The likelihood for the staff is estimated in the same way.
- The likelihoods for the VIP and the staff are then compared. The likelihood for the staff is larger, so "Tomorrow. Bacon!!" is classified as coming from the staff.
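
The arithmetic in the steps above can be sketched in a few lines of Python. The message counts below are hypothetical, invented only to illustrate the calculation; they are not the counts from the original example. The sketch assumes a three-word vocabulary and multiplies each author's share of the known messages by the per-word proportions.

```python
# Hypothetical counts, invented for illustration; they are not from the text.
# "messages" is the number of known messages from the author; each word entry
# is the number of those messages that contain the word.
counts = {
    "VIP":   {"messages": 6, "tomorrow": 1, "bacon": 1, "love": 4},
    "staff": {"messages": 4, "tomorrow": 2, "bacon": 2, "love": 0},
}
vocabulary = ["tomorrow", "bacon", "love"]

# Words present in the message "Tomorrow. Bacon!!"
message_words = {"tomorrow", "bacon"}

total_messages = sum(c["messages"] for c in counts.values())

likelihoods = {}
for author, c in counts.items():
    # Start with the proportion of known messages from this author.
    likelihood = c["messages"] / total_messages
    for word in vocabulary:
        p_present = c[word] / c["messages"]
        # Use the "present" proportion if the word is in the message,
        # otherwise the "absent" proportion.
        likelihood *= p_present if word in message_words else 1 - p_present
    likelihoods[author] = likelihood

predicted_author = max(likelihoods, key=likelihoods.get)
print(likelihoods)
print(predicted_author)
```

With these made-up counts the staff likelihood is the larger of the two, which matches the outcome described in the example.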

#### **Naive Bayes classification process**
The likelihood of a class is a score that is proportional to the probability that an instance comes from that class. Naive Bayes calculates the likelihood for each class using the frequency that categories in the instance appeared or failed to appear among the labeled data.

Step 1: Identify which categories are and are not present in the instance.

Step 2: Calculate the likelihood for each class by multiplying together the proportions of instances in the class that match the instance on each category (presence or absence). Then multiply by the proportion of the training set that belongs to the class.

Step 3: Classify with the class that has the highest likelihood.
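
Written as a formula, the three steps amount to the sketch below. The symbols are introduced here for illustration and do not come from the text: $c$ is a class, $x_j$ records whether category $j$ is present or absent in the instance, $P(c)$ is the proportion of the training set from class $c$, and $P(x_j \mid c)$ is the proportion of class-$c$ training instances that match the instance on category $j$.

$$
L(c) = P(c) \prod_{j} P(x_j \mid c), \qquad \hat{c} = \arg\max_{c} L(c)
$$

The product form relies on the assumption that the features are independent within each class.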

#### **Rationale: Laplace smoothing**
If a category never occurs among the training instances of a class, then naive Bayes will assign a likelihood of zero for that class to any instance that contains the category. Ex: If the staff never tweets the word "love", then the likelihood for the staff on a tweet containing "love" will be zero. A likelihood of zero means the instance can never be classified as that class, regardless of the rest of the message. Laplace smoothing adds one fictional occurrence to each count used when calculating the proportions, which prevents a lack of occurrences from forcing the likelihood of a class to zero.

- The VIP's account posts "I love tomorrows!" Without Laplace smoothing, the likelihood that the post came from the staff is 0.
- Each possible author's likelihood is calculated as before, except every word gets an additional instance added, so the total number of tweets is increased by the number of additional instances.
- Words that are present have one case added to the count of present cases. Words that are absent have the case added to the count of absent cases.
- Since the likelihood for the VIP posting "I love tomorrows!" is higher, this post is classified as coming from the VIP.
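
The effect of smoothing can be sketched with a small calculation. The counts below are hypothetical and invented for illustration. The sketch also assumes the common add-one form, in which each word receives one fictional message containing it and one fictional message without it for every author; the exact bookkeeping in the example above may differ. The function name `class_likelihoods` and the parameter `alpha` are illustrative choices.

```python
# Hypothetical counts, invented for illustration; they are not from the text.
counts = {
    "VIP":   {"messages": 6, "tomorrow": 1, "bacon": 1, "love": 4},
    "staff": {"messages": 4, "tomorrow": 2, "bacon": 2, "love": 0},
}
vocabulary = ["tomorrow", "bacon", "love"]

# Assumption: "tomorrows" in the post is treated as the word "tomorrow".
post_words = {"love", "tomorrow"}


def class_likelihoods(alpha):
    """Return each author's likelihood; alpha=0 disables smoothing."""
    total = sum(c["messages"] for c in counts.values())
    result = {}
    for author, c in counts.items():
        likelihood = c["messages"] / total  # proportion of messages from author
        for word in vocabulary:
            # Add alpha fictional messages with the word and alpha without it.
            p_present = (c[word] + alpha) / (c["messages"] + 2 * alpha)
            likelihood *= p_present if word in post_words else 1 - p_present
        result[author] = likelihood
    return result


unsmoothed = class_likelihoods(alpha=0)
smoothed = class_likelihoods(alpha=1)
print(unsmoothed)  # the staff likelihood is exactly zero
print(smoothed)    # every author now has a nonzero likelihood
print(max(smoothed, key=smoothed.get))
```

Under these assumed counts the staff likelihood moves from zero to a small positive value, and the VIP still has the higher likelihood.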

#### **Naive Bayes classification in Python**
NBmodel = MultinomialNB() initializes a multinomial naive Bayes classifier, which works with word counts rather than the presence or absence of words used in the examples above. The text must be converted to counts by a CountVectorizer, Ex: CountVectorizer(ngram_range = (1,2)), before being used to fit the model. ngram_range sets how many consecutive words are counted together: (1,2) counts single words and pairs of words, while (1,1), used in the code below, counts single words only. The parameters for both can be found in the sklearn documentation.

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

After initialization, the model must be fitted to the data with NBmodel.fit(X, y), where X is the matrix of counts produced by the fitted vectorizer and y is the vector of corresponding classes.
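
As a minimal sketch of these two calls, the snippet below vectorizes a tiny corpus with ngram_range=(1, 2) and fits a model to it. The four messages and their labels are hypothetical, invented for illustration; only CountVectorizer, MultinomialNB, fit, transform, and predict are taken from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, invented for illustration.
texts = ["free cash today", "free prize today", "see you today", "see you soon"]
labels = ["spam", "spam", "ham", "ham"]

# Count single words and pairs of adjacent words.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

# The learned vocabulary holds both single words and word pairs.
features = sorted(vectorizer.vocabulary_)
print(features)

# Fit the classifier to the count matrix and the class labels.
NBmodel = MultinomialNB()
NBmodel.fit(X, labels)

# New text must pass through the same fitted vectorizer before prediction.
prediction = NBmodel.predict(vectorizer.transform(["free cash"]))
print(prediction)
```

The printed vocabulary contains entries such as "free" and "free cash", showing that both single words and word pairs become columns of X.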

The Python code below builds a naive Bayes model that predicts whether a text message is spam or not spam (ham).

In [6]:
# Import packages and functions
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

In [7]:
# Read in the data and view the first five instances.
# File does not include column headers so they are provided via names.
messages = pd.read_table("/Users/dylanlam/Documents/GitHub/data_science_practice_and_skills/datasets/SMSSpamCollection", names = ["Class", "Message"])
messages.head()

|   | Class | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [8]:
# Split into testing and training sets
X_train, X_test, Y_train, Y_test = train_test_split(
    messages['Message'], messages['Class'], random_state=20220530
)

In [10]:
# Learn the vocabulary of single words that appear in the training messages
vectorizer = CountVectorizer(ngram_range=(1, 1))
vectorizer.fit(X_train)
# Display the vocabulary: each word maps to its column index in the count matrix
vectorizer.vocabulary_

{'ok': 4721,
 'she': 5844,
 'll': 3993,
 'be': 1190,
 'guess': 3120,
 'where': 7226,
 'are': 976,
 'you': 7440,
 'what': 7215,
 'do': 2252,
 'how': 3366,
 'can': 1536,
 'stand': 6195,
 'to': 6681,
 'away': 1088,
 'from': 2876,
 'me': 4234,
 'doesn': 2264,
 'your': 7444,
 'heart': 3232,
 'ache': 741,
 'without': 7290,
 'don': 2283,
 'wonder': 7317,
 'of': 4699,
 'crave': 1944,
 'hope': 3338,
 'alright': 862,
 'babe': 1104,
 'worry': 7340,
 'that': 6560,
 'might': 4299,
 'have': 3209,
 'felt': 2678,
 'bit': 1280,
 'desparate': 2144,
 'when': 7222,
 'learned': 3893,
 'the': 6564,
 'job': 3669,
 'was': 7132,
 'fake': 2626,
 'am': 870,
 'here': 3263,
 'waiting': 7099,
 'come': 1801,
 'back': 1111,
 'my': 4487,
 'love': 4057,
 'hi': 3273,
 '07734396839': 24,
 'ibh': 3418,
 'customer': 2004,
 'loyalty': 4074,
 'offer': 4704,
 'new': 4569,
 'nokia6600': 4611,
 'mobile': 4367,
 'only': 4750,
 '10': 239,
 'at': 1041,
 'txtauction': 6844,
 'txt': 6840,
 'word': 7328,
 'start': 6205,
 'no': 4603,


In [11]:
# Count the words in the training set and store in a matrix
X_train_vectorized = vectorizer.transform(X_train)
X_train_vectorized

<4179x7474 sparse matrix of type '<class 'numpy.int64'>'
	with 55755 stored elements in Compressed Sparse Row format>

In [12]:
# Initialize the model and fit with the training data
NBmodel = MultinomialNB()
NBmodel.fit(X_train_vectorized, Y_train)

In [13]:
# Make predictions on the training and testing sets.
# The test messages must be transformed with the vectorizer fitted on the training set.
trainPredictions = NBmodel.predict(X_train_vectorized)
testPredictions = NBmodel.predict(vectorizer.transform(X_test))

In [14]:
# How does the model work on the training set?
confusion_matrix(Y_train, trainPredictions)

array([[3610,   10],
       [  18,  541]])

In [15]:
# Display the confusion matrix as proportions of each true class (each row sums to 1)
confusion_matrix(Y_train, trainPredictions, normalize='true')

array([[0.99723757, 0.00276243],
       [0.03220036, 0.96779964]])
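
The normalized matrix above can be reproduced by hand from the raw training counts, which may help clarify what normalize='true' does. This is a small sketch using the counts reported in the earlier output; the variable names are illustrative. It assumes the rows are the true classes in sorted label order (ham, then spam) and the columns are the predicted classes.

```python
import numpy as np

# Raw training-set counts from the confusion matrix shown earlier.
# Rows: true class (ham, spam). Columns: predicted class (ham, spam).
counts = np.array([[3610, 10],
                   [18, 541]])

# normalize='true' divides each row by the number of instances in that true class.
row_totals = counts.sum(axis=1, keepdims=True)
proportions = counts / row_totals
print(proportions)

# Overall accuracy is the share of instances on the diagonal.
accuracy = np.trace(counts) / counts.sum()
print(accuracy)
```

The diagonal of the normalized matrix gives the proportion of ham and of spam messages that were classified correctly, and the overall training accuracy works out to about 99.3%.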

In [16]:
# How does the model work on the test set?
confusion_matrix(Y_test, testPredictions, normalize='true')

array([[0.99585062, 0.00414938],
       [0.07446809, 0.92553191]])

In [17]:
# Predict some phrases. Add your own.
NBmodel.predict(
    vectorizer.transform(
        ["Big sale today! Free cash.",
        "I'll be there in 5"]))

array(['spam', 'ham'], dtype='<U4')