# Self-study try-it 11.2: ‘Spam’ vs ‘not spam’ in Python


## Overview

In this assignment, you'll use the naive Bayes technique to classify a set of text messages.

Naive Bayes is a classification method based on Bayes’ theorem. It is called 'naive' because it assumes that all predictors are independent of one another, given the class.

Bayes’ theorem is written as:

$$P(A | B) = \frac{P(B|A)P(A)}{P(B)},$$

where:

 - $P(A| B)$ is a conditional probability: the likelihood of event $A$ occurring given that $B$ is true
 - $P(B|A)$ is also a conditional probability: the likelihood of event $B$ occurring given that $A$ is true
 - $P(A)$ and $P(B)$ are the probabilities of observing $A$ and $B$
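As a quick numerical illustration of the formula, the sketch below applies Bayes’ theorem to a made-up spam scenario. The probabilities and the helper name `bayes_posterior` are assumptions chosen for illustration; they are not taken from the data set used in this assignment.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Return P(A | B) using Bayes' theorem."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers, chosen only to illustrate the formula.
p_spam = 0.20               # P(A): prior probability that a message is spam
p_free_given_spam = 0.60    # P(B | A): the word "free" appears in a spam message
p_free_given_ham = 0.05     # the word "free" appears in a ham message

# P(B), obtained with the law of total probability.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = bayes_posterior(p_free_given_spam, p_spam, p_free)
print(round(p_spam_given_free, 3))
```

With these assumed values, seeing the word "free" raises the probability of spam from the prior of 0.20 to a posterior of 0.75.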



## Outline

- [Part 1: Importing the data set and exploratory data analysis](#part1)
- [Part 2: Shuffling and splitting the text messages](#part2)
- [Part 3: Building a naive Bayes classifier from scratch](#part3)
- [Part 4: Explaining the code](#part4)
- [Part 5: Training the `train` classifier](#part5)
- [Part 6: Exploring the performance of the `train` classifier](#part6)
- [Part 7: Training the `train2` classifier](#part7)


The pseudo-algorithm for naive Bayes can be summarised as follows:

1. Load the training and test data.

2. Shuffle and split the messages.

3. Build a naive Bayes classifier from scratch.

4. Train the classifier and explore the performance.

## Building a naive Bayes spam filter

To build a naive Bayes spam filter for this assignment, you will use the `SMSSpamCollection` data set and work through Parts 1 to 7 below.


[Back to top](#Index:)

<a id='part1'></a>

### Part 1: Importing the data set and exploratory data analysis

To begin, use the `pandas` library to import the data set:

- Begin by importing the `pandas` library.

- Then, use the `.read_csv()` function to read the data set. Pass the name of the file as a string.

- Because the fields in each row are separated by tabs (`\t`), specify the delimiter in the `.read_csv()` function using the `sep` argument. The default value is `,`.

- Provide the list of column names, `"label"` and `"sms"`, using the `names` argument.
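To see what the delimiter and column-name arguments do without needing the data file, the sketch below reads two made-up rows from an in-memory string. The variable names `raw` and `toy` and the example messages are assumptions for illustration only.

```python
import io
import pandas as pd

# Two made-up rows in the same tab-separated layout: label, then message text.
raw = "ham\tSee you at lunch\nspam\tWIN a FREE prize now\n"

# sep="\t" splits each row on the tab; names supplies the missing header.
toy = pd.read_csv(io.StringIO(raw), sep="\t", names=["label", "sms"])
print(toy)
```

Because `names` is supplied, `pandas` treats the first line as data rather than as a header row.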

In [None]:
import pandas as pd
import numpy as np

messages = pd.read_csv('data/SMSSpamCollection.csv', sep = '\t', names = ["label", "sms"])

Before applying any algorithm, it's good practice to perform some basic exploratory data analysis.

- Start by viewing the first ten rows of the data frame `messages` using the `.head()` function. By default, `.head()` shows the first five rows, but you can display more by passing an integer as the desired number of rows.

- Complete the code cell below by passing 10 to `.head()` to view the first ten rows.

In [None]:
messages.head(10)



Next, use the attributes `.shape` and `.columns` and the method `.describe()` to retrieve more information about the data frame.

Here's a brief description of what each of these does:

- `shape`: returns a tuple representing the dimensionality of the data frame

- `columns`: returns the column labels of the data frame

- `describe()`: returns summary statistics of the columns in the data frame. For numerical columns, these include the count, mean and standard deviation; for text columns such as `label` and `sms`, they are the count, the number of unique values, the most frequent value and its frequency.
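The following sketch shows all three on a small made-up data frame. The frame `toy` and its contents are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "sms": ["See you at lunch", "WIN a FREE prize now", "See you at lunch"],
})

print(toy.shape)            # (number of rows, number of columns)
print(list(toy.columns))    # column labels

# Both columns hold text, so describe() reports count, unique, top and freq.
summary = toy.describe()
print(summary)
```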

### Question 1:
Display the shape, columns and description of the data set, and assign them to `ans1a`, `ans1b` and `ans1c`, respectively.

In [None]:

ans1a = messages.shape
ans1b = messages.columns
ans1c = messages.describe()

print("The shape of the dataset is ", ans1a)
print("The columns of the dataset are ", ans1b)
print("The description of the dataset is ", ans1c)

[Back to top](#Index:)

<a id='part2'></a>

### Part 2: Shuffling and splitting the text messages

In this section, you will shuffle the messages and split them into a training set (2,500 messages), a validation set (1,000 messages) and a test set (all remaining messages).

To begin, use the `pandas` method `.sample()` to shuffle the messages.

### Question 2: Complete the code cell below

- Complete the code cell below by applying the `.sample()` function to `messages`, with `frac = 1` and `random_state = 42`. The `frac` argument specifies the proportion of the data frame to return, while `random_state` sets a seed to ensure the results are reproducible.

- Use the `.reset_index()` function to reset the index of `messages` so that it aligns with the shuffled order. Be sure to include the appropriate argument.

- Assign your answer to `ans2`.

You can find the documentation about `.reset_index()` [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html).
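The sketch below illustrates the effect of shuffling and then resetting the index on a small made-up data frame. The frame `toy` and the seed value are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"label": ["ham", "spam", "ham", "ham"],
                    "sms": ["a", "b", "c", "d"]})

# frac=1 returns every row, in a random order fixed by random_state.
shuffled = toy.sample(frac=1, random_state=0)
print(list(shuffled.index))    # the original row labels, now out of order

# drop=True discards the old index instead of keeping it as a new column.
shuffled = shuffled.reset_index(drop=True)
print(list(shuffled.index))    # a fresh index starting at 0
```

Without `drop=True`, the old index would be kept as an extra column named `index`.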

In [None]:
ans2 = messages.sample(frac = 1, random_state = 42).reset_index(drop = True)

print(ans2.head())

In [None]:
messages = ans2

In the code cell below, the messages and their corresponding labels are defined.




In [None]:
msgs = list(messages.sms)
lbls = list(messages.label)

### Question 3: Complete the code cell to split the messages and labels into a training set, a validation set and a test set

- Split the messages into three sets as instructed: a training set (2,500 messages), a validation set (1,000 messages) and a test set (the remaining messages).

- Assign these to `trainingMsgs`, `valMsgs` and `testingMsgs`, respectively.

- Split the corresponding labels in the same way and assign them to `trainingLbls`, `valLbls` and `testingLbls`.

In [None]:
trainingMsgs = msgs[:2500]
valMsgs = msgs[2500:3500]
testingMsgs = msgs[3500:]
trainingLbls = lbls[:2500]
valLbls = lbls[2500:3500]
testingLbls = lbls[3500:]

print("The number of training messages", len(trainingMsgs))
print("The number of validation messages",len(valMsgs))
print("The number of testing messages", len(testingMsgs))

The labels are split using the same slice boundaries as the messages, so each message stays paired with its label.
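As a small check of this idea, the sketch below splits two made-up lists with the same boundaries and confirms that the pairs stay aligned. The helper `split_three` and the toy data are hypothetical and are not part of the assignment code.

```python
def split_three(items, n_train, n_val):
    """Split a list into training, validation and test parts by position."""
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

toy_msgs = ["m0", "m1", "m2", "m3", "m4", "m5", "m6"]
toy_lbls = ["ham", "spam", "ham", "ham", "spam", "ham", "ham"]

# Use identical boundaries for messages and labels.
tr_m, va_m, te_m = split_three(toy_msgs, n_train=4, n_val=2)
tr_l, va_l, te_l = split_three(toy_lbls, n_train=4, n_val=2)

print(len(tr_m), len(va_m), len(te_m))
print(list(zip(va_m, va_l)))
```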

[Back to top](#Index:)

<a id='part3'></a>

### Part 3: Building a naive Bayes classifier from scratch

While Python’s `scikit-learn` library provides naive Bayes classifiers (see [here](https://scikit-learn.org/stable/modules/naive_bayes.html) for more information), they expect numerical feature arrays as input, so raw text messages cannot be passed to them directly.

Although it is possible to transform the messages into numerical features using a binary encoding (one 0/1 column per word), in this activity, you will build a naive Bayes classifier from scratch.
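For reference, a binary encoding of text could look like the minimal sketch below. The toy messages and the names `vocab` and `encoded` are assumptions for illustration, and the sketch matches whole words, which differs from the substring matching used by the classifier in this activity.

```python
toy_msgs = ["win a free prize", "see you at lunch", "free lunch"]

# Vocabulary: every distinct word, sorted so the column order is fixed.
vocab = sorted(set(" ".join(toy_msgs).split()))

# One row per message, one column per word: 1 if the word occurs, else 0.
encoded = [[1 if w in m.split() else 0 for w in vocab] for m in toy_msgs]

print(vocab)
for row in encoded:
    print(row)
```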

In [None]:



class NaiveBayesForSpam:
    def train(self, hamMessages, spamMessages):
        # Lower-case the training messages so the vocabulary matches the
        # lower-cased messages that `predict` checks against.
        hamMessages = [m.lower() for m in hamMessages]
        spamMessages = [m.lower() for m in spamMessages]
        # Vocabulary: every distinct whitespace-separated token.
        self.words = set(' '.join(hamMessages + spamMessages).split())
        # Priors: index 0 is 'ham', index 1 is 'spam'.
        self.priors = np.zeros(2)
        self.priors[0] = float(len(hamMessages)) / (len(hamMessages) + len(spamMessages))
        self.priors[1] = 1.0 - self.priors[0]
        self.likelihoods = []
        for w in self.words:
            # Smoothed share of messages containing w (substring match), capped at 0.95.
            prob1 = (1.0 + len([m for m in hamMessages if w in m])) / len(hamMessages)
            prob2 = (1.0 + len([m for m in spamMessages if w in m])) / len(spamMessages)
            self.likelihoods.append([min(prob1, 0.95), min(prob2, 0.95)])
        # Shape (2, number of words): row 0 is 'ham', row 1 is 'spam'.
        self.likelihoods = np.array(self.likelihoods).T

    def predict(self, message):
        posteriors = np.copy(self.priors)
        for i, w in enumerate(self.words):
            if w in message.lower():  # convert to lower-case
                posteriors *= self.likelihoods[:, i]
            else:
                posteriors *= np.ones(2) - self.likelihoods[:, i]
            # Normalise so the two posteriors sum to 1, which keeps the 0.5
            # threshold below meaningful and avoids numerical underflow.
            posteriors = posteriors / posteriors.sum()
        if posteriors[0] > 0.5:
            return ['ham', posteriors[0]]
        return ['spam', posteriors[1]]

    def score(self, messages, labels):
        confusion = np.zeros((2, 2))  # rows: predicted class, columns: true class
        for m, l in zip(messages, labels):
            predicted = self.predict(m)[0]  # predict once per message
            if predicted == 'ham' and l == 'ham':
                confusion[0, 0] += 1
            elif predicted == 'ham' and l == 'spam':
                confusion[0, 1] += 1
            elif predicted == 'spam' and l == 'ham':
                confusion[1, 0] += 1
            elif predicted == 'spam' and l == 'spam':
                confusion[1, 1] += 1
        return (confusion[0, 0] + confusion[1, 1]) / float(confusion.sum()), confusion

[Back to top](#Index:)

<a id='part4'></a>

### Part 4: Explaining the code

Before exploring the code in Part 3 in further detail, it’s helpful to build some intuition about what a spam message might look like. Spam texts often include attention-grabbing words designed to tempt you to open them. You’ll also notice that they tend to use all capital letters and lots of exclamation marks.

In the training phase, you use the `train` method to calculate and store the prior probabilities and likelihoods based on your training data. In naive Bayes, that’s all the training involves.
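To make the two quantities concrete, the sketch below computes priors and smoothed likelihoods for a handful of made-up messages, using the same add-one numerator and 0.95 cap as `train`. The toy messages and the helper name `likelihood` are assumptions for illustration.

```python
toy_ham = ["see you at lunch", "lunch at noon"]
toy_spam = ["win a free prize", "free prize inside", "claim your free gift"]

n_ham, n_spam = len(toy_ham), len(toy_spam)

# Priors: the share of each class in the training data.
prior_ham = n_ham / (n_ham + n_spam)
prior_spam = 1.0 - prior_ham

def likelihood(word, messages):
    """Smoothed share of messages containing `word`, capped at 0.95."""
    count = len([m for m in messages if word in m])
    return min((1.0 + count) / len(messages), 0.95)

print(prior_ham, prior_spam)
print(likelihood("free", toy_ham), likelihood("free", toy_spam))
```

In this toy data, "free" appears in all three spam messages, so the smoothed value (1 + 3) / 3 exceeds 1 and the cap of 0.95 applies.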

Next, you use the `predict` method, which applies Bayes’ theorem to every word in the vocabulary and uses the resulting posterior probabilities to classify each message as 'spam' or 'ham'.
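Written out, and using notation introduced here for illustration ($c$ for the class, $V$ for the vocabulary stored in `self.words`, and $x_w = 1$ if word $w$ occurs in the message and $0$ otherwise), the quantity that `predict` accumulates for each class is, up to normalisation:

$$P(c \mid \text{message}) \propto P(c) \prod_{w \in V} P(w \mid c)^{x_w} \, \bigl(1 - P(w \mid c)\bigr)^{1 - x_w}$$

Each word in the vocabulary contributes one factor: its likelihood if it is present in the message, and one minus its likelihood if it is absent. Treating these factors as a simple product is where the 'naive' independence assumption enters.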

Finally, you call the `score` method to evaluate your classifier. It runs `predict` on multiple messages, compares the predictions to the ground-truth labels and returns the accuracy together with a confusion matrix.
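The sketch below builds a confusion matrix with the same layout as `score` (rows for the predicted class, columns for the true class) from made-up predictions. The lists `predicted` and `actual` and the mapping `index` are assumptions for illustration.

```python
import numpy as np

# Hypothetical predictions and true labels for six messages.
predicted = ["ham", "ham", "spam", "spam", "ham", "spam"]
actual = ["ham", "spam", "spam", "ham", "ham", "spam"]

index = {"ham": 0, "spam": 1}
confusion = np.zeros((2, 2))
for p, a in zip(predicted, actual):
    confusion[index[p], index[a]] += 1    # row = predicted, column = actual

# Accuracy: correct predictions (the diagonal) divided by all predictions.
accuracy = (confusion[0, 0] + confusion[1, 1]) / confusion.sum()
print(confusion)
print(accuracy)
```

The off-diagonal entries are the two kinds of mistake: `confusion[0, 1]` counts spam predicted as ham, and `confusion[1, 0]` counts ham predicted as spam.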

[Back to top](#Index:)

<a id='part5'></a>

### Part 5: Training the `train` classifier

Looking at the definition of the method `train`, you can see that it requires the `ham` and `spam` messages to be passed separately.

### Question 4: How do you construct the list `hammsgs`?

How do you construct the list `hammsgs` so that it includes every message in `trainingMsgs` whose corresponding label in `trainingLbls` contains the substring 'ham'?

You can do this by using `zip()` to pair each message with its label, then iterating through the pairs and selecting the messages where the label includes 'ham'.

Hint: Use list comprehension to solve this task.
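As a reminder of how `zip()` and a list comprehension combine, here is a minimal sketch on made-up data. The lists `toy_msgs`, `toy_lbls` and `toy_ham` are assumptions for illustration.

```python
toy_msgs = ["see you at lunch", "win a free prize", "call me later"]
toy_lbls = ["ham", "spam", "ham"]

# zip() pairs each message with its label; the condition keeps the matches.
toy_ham = [m for (m, l) in zip(toy_msgs, toy_lbls) if "ham" in l]
print(toy_ham)
```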

In [None]:
hammsgs = [m for (m, l) in zip(trainingMsgs, trainingLbls) if 'ham' in l]

print(hammsgs[:5])

### Question 5: How do you construct the list `spammsgs`?

How do you construct the list `spammsgs` so that it includes every message in `trainingMsgs` whose corresponding label in `trainingLbls` contains the substring 'spam'?

You can do this by using `zip()` to pair each message with its label, then iterating through the pairs and selecting the messages where the label includes 'spam'.

Hint: Use list comprehension to solve this task.

In [None]:

spammsgs = [m for (m, l) in zip(trainingMsgs, trainingLbls) if 'spam' in l]

print(spammsgs[:5])


Run the cell below to see the number of `ham` and `spam` messages.

In [None]:
print(len(hammsgs))
print(len(spammsgs))

The sum should be 2,500 messages.

### Question 6: Create the classifier for your analysis using the class `NaiveBayesForSpam()`.

- Complete the code cell below to create the classifier `clf`.

- Train `clf` on `hammsgs` and `spammsgs` using the method `.train()`.

Hint: For this last part, look at the definition of the function `.train()`.

In [None]:


clf = NaiveBayesForSpam()
clf.train(hammsgs, spammsgs)



[Back to top](#Index:)

<a id='part6'></a>

### Part 6: Exploring the performance of the `train` classifier

You can explore the performance of the classifier on the *validation set* by using the function `.score()`.

### Question 7: Complete the code cell below to compute the score and the confusion matrix.

**Note: The results in the following sections may vary. This is expected and happens due to the random shuffling. Each shuffle produces slightly different outcomes. To ensure your results are reproducible, define `random_state` in the `sample` method when shuffling the data in [Part 2: Shuffling and splitting the text messages](#part2).**

Note: This cell takes a couple of minutes to execute.

In [None]:
score, confusion = clf.score (valMsgs, valLbls)

print("The overall performance is:", score)
print("The confusion matrix is:\n", confusion)

The data is not equally divided into the two classes. As a baseline, let’s check the success rate if you always predicted 'ham'.

Run the code cell below to print this baseline score.

In [None]:
print('new_score', len([1 for l in valLbls if 'ham' in l]) / float (len ( valLbls)))


You can also calculate the in-sample error (the error on the data used for training) by computing the score and the confusion matrix on the *training set*.

### Question 8: Calculate the score and confusion matrix

Calculate the score and the confusion matrix for `trainingMsgs` and `trainingLbls`, and assign them to `score_train` and `confusion_train`, respectively.

Note: This cell may take a while to run.

In [None]:
score_train, confusion_train = clf.score (trainingMsgs, trainingLbls)


print("The overall performance is:", score_train)
print("The confusion matrix is:\n", confusion_train)


[Back to top](#Index:)

<a id='part7'></a>

### Part 7: Training the `train2` classifier

In this section, you will define a second training method, `train2`, and compare the performance of the resulting classifier with that of the classifier trained with `train`.

The `train2` method builds a vocabulary from all words, then filters it to keep only those whose likelihood of appearing in 'spam' is more than 20 times their likelihood of appearing in 'ham'. This creates a focused set of strong spam indicators for training and prediction.
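The filtering rule can be sketched on its own as follows. The likelihood values in `toy_likelihoods` are made up for illustration and are not computed from the data set.

```python
# Hypothetical smoothed likelihoods: word -> (P(word | ham), P(word | spam)).
toy_likelihoods = {
    "free": (0.01, 0.40),
    "prize": (0.02, 0.30),
    "lunch": (0.20, 0.01),
    "claim": (0.01, 0.25),
}

# Keep a word only if it is more than 20 times as likely in spam as in ham.
spam_keywords = [w for w, (p_ham, p_spam) in toy_likelihoods.items()
                 if p_ham * 20 < p_spam]
print(spam_keywords)
```

Here "prize" is dropped even though it is far more common in spam, because 0.30 is only 15 times 0.02.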

The `train2` classifier is defined in the code cell below.

In [None]:
class NaiveBayesForSpam:
    def train2(self, hamMessages, spamMessages):
        # Lower-case the training messages so the vocabulary matches the
        # lower-cased messages that `predict` checks against.
        hamMessages = [m.lower() for m in hamMessages]
        spamMessages = [m.lower() for m in spamMessages]
        self.words = set(' '.join(hamMessages + spamMessages).split())
        # Priors: index 0 is 'ham', index 1 is 'spam'.
        self.priors = np.zeros(2)
        self.priors[0] = float(len(hamMessages)) / (len(hamMessages) + len(spamMessages))
        self.priors[1] = 1.0 - self.priors[0]
        self.likelihoods = []
        spamkeywords = []
        for w in self.words:
            prob1 = (1.0 + len([m for m in hamMessages if w in m])) / len(hamMessages)
            prob2 = (1.0 + len([m for m in spamMessages if w in m])) / len(spamMessages)
            # Keep only words more than 20 times as likely in spam as in ham.
            if prob1 * 20 < prob2:
                self.likelihoods.append([min(prob1, 0.95), min(prob2, 0.95)])
                spamkeywords.append(w)
        self.words = spamkeywords
        self.likelihoods = np.array(self.likelihoods).T

    def predict(self, message):
        posteriors = np.copy(self.priors)
        for i, w in enumerate(self.words):
            if w in message.lower():  # convert to lower-case
                posteriors *= self.likelihoods[:, i]
            else:
                posteriors *= np.ones(2) - self.likelihoods[:, i]
            # Normalise so the two posteriors sum to 1, which keeps the 0.5
            # threshold below meaningful and avoids numerical underflow.
            posteriors = posteriors / posteriors.sum()
        if posteriors[0] > 0.5:
            return ['ham', posteriors[0]]
        return ['spam', posteriors[1]]

    def score(self, messages, labels):
        confusion = np.zeros((2, 2))  # rows: predicted class, columns: true class
        for m, l in zip(messages, labels):
            predicted = self.predict(m)[0]  # predict once per message
            if predicted == 'ham' and l == 'ham':
                confusion[0, 0] += 1
            elif predicted == 'ham' and l == 'spam':
                confusion[0, 1] += 1
            elif predicted == 'spam' and l == 'ham':
                confusion[1, 0] += 1
            elif predicted == 'spam' and l == 'spam':
                confusion[1, 1] += 1
        return (confusion[0, 0] + confusion[1, 1]) / float(confusion.sum()), confusion

### Question 9: Train the `train2` classifier

- Create a new classifier using the class `NaiveBayesForSpam()` and assign it to `clf2`.

- Train `clf2` on `hammsgs` and `spammsgs` using the method `.train2()`.


In [None]:
clf2 = NaiveBayesForSpam()
clf2.train2(hammsgs, spammsgs)



### Question 10: Recompute the score and the confusion matrix on the *validation set* using the updated classifier.

Note: This cell might take a while to run.

In [None]:

score_2, confusion_2 = clf2.score(valMsgs, valLbls)

print("The overall performance is:", score_2)
print("The confusion matrix is:\n", confusion_2)