### `A Spam Filter`

The computer uses the specifications of the Naive Bayes algorithm to learn how we classify messages (what counts as spam and non-spam for us), and then it uses that human knowledge to estimate probabilities for new messages. Following the specifications of the algorithm, the computer tries to answer two conditional probability questions:

![image.png](attachment:image.png)

In plain English, these two questions are:

* What's the probability that this new message is spam, given its content (its words, punctuation, letter case, etc.)?
* What's the probability that this new message is non-spam, given its content?

Once it has an answer to these two questions, the computer classifies the message as spam or non-spam based on the probability values. If the probability for spam is greater, then the message is classified as spam. Otherwise, it goes into the non-spam category.

### `Using Bayes' Theorem`

On the previous screen, we saw an overview of how the computer may classify new messages using the Naive Bayes algorithm:

1. The computer learns how humans classify messages.
2. Then it uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
3. Finally, the computer classifies a new message based on the probability values it calculated in step 2 — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may want a human to classify the message — we'll come back to this issue in the guided project).
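The decision rule in step 3 can be sketched as a small function. This is only an illustration: the function name `classify` and the `'needs human review'` label for ties are assumptions made for this sketch, not part of the algorithm's specification.

```python
def classify(p_spam, p_non_spam):
    """Return a label by comparing the two values calculated in step 2."""
    if p_spam > p_non_spam:
        return 'spam'
    if p_spam < p_non_spam:
        return 'non-spam'
    # Equal values: hypothetical label for messages a human should look at.
    return 'needs human review'

print(classify(0.6, 0.4))  # spam
print(classify(0.5, 0.5))  # needs human review
```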

We saw on the previous screen that when a new message comes in, the algorithm requires the computer to calculate the following probabilities:

![image.png](attachment:image.png)

Let's take the first equation and expand it using Bayes' theorem:

![image-2.png](attachment:image-2.png)

Now let's do the same for the second equation:

![image-3.png](attachment:image-3.png)
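Written out in text form, the two expansions are the standard statement of Bayes' theorem applied to our two questions, using the same notation as the rest of this lesson:

$$P(\text{Spam} \mid \text{New message}) = \frac{P(\text{Spam}) \cdot P(\text{New message} \mid \text{Spam})}{P(\text{New message})}$$

$$P(\text{Spam}^C \mid \text{New message}) = \frac{P(\text{Spam}^C) \cdot P(\text{New message} \mid \text{Spam}^C)}{P(\text{New message})}$$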

If the computer knows these values, then it can calculate the probabilities it needs to classify a new message:

![image-4.png](attachment:image-4.png)

Since P(Spam|New message) > P(Spam<sup>C</sup>|New message), the computer will classify the new message as spam.
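Here is a minimal sketch of that calculation in Python. The input values are illustrative assumptions, chosen so that the results match the 0.6 and 0.4 quoted in this lesson; they are not read from the figure. The sketch also assumes P(New message) can be obtained with the law of total probability.

```python
# Illustrative inputs (assumed for this sketch, not taken from the figure).
p_spam = 0.5
p_non_spam = 0.5
p_message_given_spam = 0.6
p_message_given_non_spam = 0.4

# Shared denominator P(New message), via the law of total probability.
p_message = (p_spam * p_message_given_spam +
             p_non_spam * p_message_given_non_spam)

# Bayes' theorem for each category.
p_spam_given_message = p_spam * p_message_given_spam / p_message
p_non_spam_given_message = p_non_spam * p_message_given_non_spam / p_message

print(round(p_spam_given_message, 4))
print(round(p_non_spam_given_message, 4))
```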


### `Ignoring the Division`

On the previous screen, we saw that the computer can use these two equations to calculate the probabilities it needs to classify new messages:

![image.png](attachment:image.png)

Although we've taken a great first step so far, the actual equations of the Naive Bayes algorithm are a bit different — we'll gradually develop the equations throughout this lesson. Let's start by pointing out that both equations above have the same denominator: P(New message).

When a new message comes in, P(New message) has the same value for both equations. Since we only need to compare the results of the two equations to classify a new message, we can ignore the division:

![image-2.png](attachment:image-2.png)

This means our two equations reduce to:

![image-3.png](attachment:image-3.png)

Ignoring the division doesn't affect the algorithm's ability to classify new messages. For instance, let's repeat the classification we did on the previous screen using the new equations above. Recall that we assumed we already know these values:

![image-4.png](attachment:image-4.png)

Previously, the algorithm classified the new message as spam. Using the new equations, we see the conclusion is identical — the new message is spam because P(Spam|New message) > P(Spam<sup>C</sup>|New message):

![image-5.png](attachment:image-5.png)

The classification works fine, but ignoring the division changes the probability values, and some probability rules also begin to break. For instance, let's take this conditional probability rule that we've learned about in a previous lesson:

![image-6.png](attachment:image-6.png)

On the previous screen, we saw P(Spam|New message) = 0.6 and P(Spam<sup>C</sup>|New message) = 0.4, and the rule holds with these values:

![image-7.png](attachment:image-7.png)

With the values we got from the new equations, however, the rule breaks:

![image-8.png](attachment:image-8.png)

Even though probability rules break, the Naive Bayes algorithm still requires us to ignore the division by P(New message). This might not make a lot of sense, but there's actually a very good reason we do that.

The main goal of the algorithm is to classify new messages, not to calculate probabilities; calculating probabilities is just a means to an end. Ignoring the division by P(New message) means fewer calculations, which can make a big difference when we use the algorithm to classify 500,000 new messages.

It's true that the values we get are no longer accurate probabilities. However, this doesn't matter for the goal of the algorithm, which is to classify new messages correctly, not to estimate probabilities accurately.

The classification itself remains completely unaffected because we ignore the division in both equations (not just in one). The values change, but they change in direct proportion to one another, so the result of the comparison doesn't change.

For instance, 8/4 > 4/4. If we ignore the division, both values are scaled by the same factor, so the result of the comparison stays the same: 8 > 4.
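We can check both claims with a short sketch: the decision is the same with or without the division, but only the divided values sum to 1. The input values are illustrative assumptions, not read from the figures.

```python
# Illustrative inputs (assumed for this sketch, not read from the figures).
p_spam, p_non_spam = 0.5, 0.5
p_message_given_spam, p_message_given_non_spam = 0.6, 0.4

numerator_spam = p_spam * p_message_given_spam
numerator_non_spam = p_non_spam * p_message_given_non_spam
p_message = numerator_spam + numerator_non_spam  # shared denominator

with_division = (numerator_spam / p_message, numerator_non_spam / p_message)
without_division = (numerator_spam, numerator_non_spam)

# The comparison gives the same answer either way.
same_decision = ((with_division[0] > with_division[1]) ==
                 (without_division[0] > without_division[1]))
print(same_decision)  # True

# Only the divided values behave like probabilities that sum to 1.
print(round(sum(with_division), 10), round(sum(without_division), 10))
```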

The symbol for directly proportional is ∝, and it's more accurate to replace the equality sign with ∝ in our two equations:

![image-9.png](attachment:image-9.png)



### `A One-Word Message`

We'll now look at how the algorithm can use messages that are already classified by humans to calculate the values it needs for:

* P(Spam) and P(Spam<sup>C</sup>)
* P(New message|Spam) and P(New message|Spam<sup>C</sup>)

We'll start with some examples that may look a bit too simplistic and unrealistic, but they will make it easier to understand the mathematics behind the algorithm.

Let's say we have three messages that are already classified:

![image.png](attachment:image.png)

Now let's say the one-word message "secret" comes in and we want to use the Naive Bayes algorithm to classify it — to tell whether it's spam or non-spam.


As we learned, we first need to answer these two probability questions (note that we changed New Message to "secret" inside the notation below) and then compare the values (recall that the ∝ symbol replaces the equal sign):
 
![image-2.png](attachment:image-2.png)

Let's begin with the first equation, for which we need to find the values of P(Spam) and P("secret"|Spam). To find P(Spam), we use the messages that are already classified and divide the number of spam messages by the total number of messages:

![image-3.png](attachment:image-3.png)
 
To calculate P("secret"|Spam), we only look at the spam messages and divide the number of times the word "secret" occurs in all the spam messages by the total number of words in the spam messages:

![image-4.png](attachment:image-4.png)

Notice that "secret" occurs four times in the spam messages:

![image-6.png](attachment:image-6.png)

We have two spam messages and there's a total of seven words in all of them, so P("secret"|Spam) is:

![image-7.png](attachment:image-7.png)

Now that we know the values for P(Spam) and P("secret"|Spam), we have all we need to calculate P(Spam|"secret"):

![image-8.png](attachment:image-8.png)
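The same calculation can be sketched in Python using only the counts stated above (two spam messages out of three, and four occurrences of "secret" among seven spam words). The variable names are chosen for this sketch, and `Fraction` is used so the result stays exact.

```python
from fractions import Fraction

# Counts taken from the text above.
n_messages = 3
n_spam_messages = 2
n_secret_in_spam = 4
n_words_in_spam = 7

p_spam = Fraction(n_spam_messages, n_messages)
p_secret_given_spam = Fraction(n_secret_in_spam, n_words_in_spam)

# Proportional to P(Spam|"secret"), since we ignore the division.
p_spam_given_secret = p_spam * p_secret_given_spam
print(p_spam_given_secret)  # 8/21
```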

For the exercise below, we'll take the same steps as above to calculate P(Spam<sup>C</sup>|"secret"). Then, we can compare the values of P(Spam<sup>C</sup>|"secret") and P(Spam|"secret") to classify the message "secret" as spam or non-spam.


### `Instructions`

Using the table below (these are the same messages as above), classify the message "secret" as spam or non-spam.

![image-9.png](attachment:image-9.png)

1. Calculate P(Spam<sup>C</sup>) and assign the answer to p_non_spam.
2. Calculate P("secret"|Spam<sup>C</sup>) and assign the answer to p_secret_given_non_spam.
3. Calculate P(Spam<sup>C</sup>|"secret") and assign the answer to p_non_spam_given_secret.
4. Compare P(Spam<sup>C</sup>|"secret") with P(Spam|"secret") and classify the message "secret": if the message is spam, then assign the string 'spam' to the variable classification; otherwise, assign the string 'non-spam'.

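In [1]:
```python
p_non_spam = 1/3
p_secret_given_non_spam = 1/4
# Proportional to P(Spam^C|"secret"), since we ignore the division
p_non_spam_given_secret = p_non_spam * p_secret_given_non_spam
# P(Spam|"secret"), calculated above
p_spam_given_secret = 8/21

# Classify the message by comparing the two values
if p_spam_given_secret > p_non_spam_given_secret:
    classification = 'spam'
else:
    classification = 'non-spam'

print(classification)
```

spam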


### `Multiple Words`

Let's say we want to classify the message "secret place secret secret" based on four messages that are already classified (the four messages below are different from what we saw on the previous screen):

![image.png](attachment:image.png)


To calculate the probabilities we need, we'll treat each word in our new message separately. This means that the word "secret" at the beginning is different and separate from the word "secret" at the end. There are four words in the message "secret place secret secret", and we're going to abbreviate them "w1", "w2", "w3" and "w4" (the "w" comes from "word").

![image-2.png](attachment:image-2.png)

Since we treat each word separately, these are the two equations we can use to calculate the probabilities:

![image-3.png](attachment:image-3.png)
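In text form, and consistent with the per-word products used later in this lesson, the two equations can be written as follows. Multiplying the per-word probabilities like this relies on treating the words as conditionally independent given the category, which is the "naive" assumption in Naive Bayes:

$$P(\text{Spam} \mid w_1, w_2, w_3, w_4) \propto P(\text{Spam}) \cdot P(w_1 \mid \text{Spam}) \cdot P(w_2 \mid \text{Spam}) \cdot P(w_3 \mid \text{Spam}) \cdot P(w_4 \mid \text{Spam})$$

$$P(\text{Spam}^C \mid w_1, w_2, w_3, w_4) \propto P(\text{Spam}^C) \cdot P(w_1 \mid \text{Spam}^C) \cdot P(w_2 \mid \text{Spam}^C) \cdot P(w_3 \mid \text{Spam}^C) \cdot P(w_4 \mid \text{Spam}^C)$$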

Let's begin with calculating P(Spam|w1, w2, w3, w4). To calculate the probabilities we need, we'll look at the four messages that are already classified. We have four messages and two of them are spam, so:

![image-4.png](attachment:image-4.png)

The first word, w1, is "secret", and we see that "secret" occurs four times in all spam messages. There's a total of seven words in all the spam messages, so:

![image-5.png](attachment:image-5.png)

Applying a similar reasoning, we have:

![image-6.png](attachment:image-6.png)

We now have all the probabilities we need to calculate P(Spam|w1, w2, w3, w4):

![image-7.png](attachment:image-7.png)
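As a rough sketch, the product above can be computed with a small helper. The function name `class_score` and the dictionary of counts are assumptions made for this example. The count of 4 for "secret" and the total of 7 spam words come from the text; the count of 1 for "place" is inferred from the 64/4802 value used in the exercise below.

```python
from fractions import Fraction
from math import prod

def class_score(prior, word_counts, n_words_in_class, message):
    """Multiply the prior by the probability of each word in the message."""
    likelihoods = [Fraction(word_counts.get(word, 0), n_words_in_class)
                   for word in message.split()]
    return prior * prod(likelihoods)

# Word counts in the spam messages ("place" count inferred, see above).
spam_counts = {"secret": 4, "place": 1}

p_spam_given_w1_w2_w3_w4 = class_score(
    Fraction(2, 4), spam_counts, 7, "secret place secret secret")
print(p_spam_given_w1_w2_w3_w4)  # 32/2401, which equals 64/4802
```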

Let's now take similar steps to calculate P(Spam<sup>C</sup>|w1, w2, w3, w4), and then classify the message "secret place secret secret" as spam or non-spam.

### `Instructions`

Using the table below (the same as above), classify the message "secret place secret secret" as spam or non-spam.

![image.png](attachment:image.png)

1. Calculate P(Spam<sup>C</sup>|w1, w2, w3, w4). Assign the answer to p_non_spam_given_w1_w2_w3_w4. Check the hint if you get stuck.
2. Compare P(Spam<sup>C</sup>|w1, w2, w3, w4) with P(Spam|w1, w2, w3, w4) and classify the message "secret place secret secret" — if the message is spam, then assign the string 'spam' to the variable classification. Otherwise, assign the string 'non-spam'.

In [2]:
```python
p_non_spam = 2/4
p_w1_given_non_spam = 2/9  # "secret"
p_w2_given_non_spam = 1/9  # "place"
p_w3_given_non_spam = 2/9  # "secret"
p_w4_given_non_spam = 2/9  # "secret"

p_non_spam_given_w1_w2_w3_w4 = (p_non_spam *
                                p_w1_given_non_spam * p_w2_given_non_spam *
                                p_w3_given_non_spam * p_w4_given_non_spam)

# P(Spam|w1, w2, w3, w4), calculated above
p_spam_given_w1_w2_w3_w4 = 64/4802

# Classify the message by comparing the two values
if p_spam_given_w1_w2_w3_w4 > p_non_spam_given_w1_w2_w3_w4:
    classification = 'spam'
else:
    classification = 'non-spam'

print(classification)
```

spam


### `Edge Cases`

On the previous screen, we looked at a few messages that were already classified:

![image.png](attachment:image.png)

Above, we have four messages and nine unique words: "secret", "party", "at", "my", "place", "money", "you", "know", "the". We call the set of unique words a vocabulary.

Now, what if we receive a new message that contains words that are not part of the vocabulary? How do we calculate probabilities for these kinds of words?

For instance, say we received the message "secret code to unlock the money".

![image-2.png](attachment:image-2.png)

Notice that for this new message:

* The words "code", "to", and "unlock" are not part of the vocabulary.
* The word "secret" is part of both spam and non-spam messages.
* The word "money" is only part of the spam messages and is missing from the non-spam messages.
* The word "the" is missing from the spam messages and is only part of the non-spam messages.

Whenever we have to deal with words that are not part of the vocabulary, one solution is to ignore them when we're calculating probabilities. If we wanted to calculate P(Spam|"secret code to unlock the money"), we could skip calculating P("code"|Spam), P("to"|Spam), and P("unlock"|Spam) because "code", "to", and "unlock" are not part of the vocabulary:

* P(Spam|"secret code to unlock the money") ∝ P(Spam) ⋅ P("secret"|Spam) ⋅ P("the"|Spam) ⋅ P("money"|Spam)

We can also apply the same reasoning for calculating P(Spam<sup>C</sup>|"secret code to unlock the money"):

* P(Spam<sup>C</sup>|"secret code to unlock the money") ∝ P(Spam<sup>C</sup>) ⋅ P("secret"|Spam<sup>C</sup>) ⋅ P("the"|Spam<sup>C</sup>) ⋅ P("money"|Spam<sup>C</sup>)
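One way to sketch this filtering step in Python is shown below. The vocabulary is the set of nine unique words listed above; splitting the message on whitespace is a simplifying assumption for this example.

```python
# The nine unique words from the four classified messages.
vocabulary = {"secret", "party", "at", "my", "place",
              "money", "you", "know", "the"}

message = "secret code to unlock the money"

# Keep only the words we can calculate probabilities for.
known_words = [word for word in message.split() if word in vocabulary]
ignored_words = [word for word in message.split() if word not in vocabulary]

print(known_words)    # ['secret', 'the', 'money']
print(ignored_words)  # ['code', 'to', 'unlock']
```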


Let's now calculate P(Spam|"secret code to unlock the money") and P(Spam<sup>C</sup>|"secret code to unlock the money"), and see what we get.

### `Instructions`

P(Spam|"secret code to unlock the money") is already calculated for you. Use the table below (the same as above) to calculate P(Spam<sup>C</sup>|"secret code to unlock the money").

![image-2.png](attachment:image-2.png)

1. Calculate P(Spam<sup>C</sup>|"secret code to unlock the money"). Assign your answer to p_non_spam_given_message.
2. Print p_spam_given_message and p_non_spam_given_message. Why do you think we got these values? We'll discuss this more on the next screen.

In [3]:
```python
p_spam = 2/4
p_secret_given_spam = 4/7
p_the_given_spam = 0/7  # "the" never occurs in the spam messages
p_money_given_spam = 2/7
p_spam_given_message = (p_spam * p_secret_given_spam *
                        p_the_given_spam * p_money_given_spam)

p_non_spam = 2/4
p_secret_given_non_spam = 2/9
p_the_given_non_spam = 1/9
p_money_given_non_spam = 0/9  # "money" never occurs in the non-spam messages
p_non_spam_given_message = (p_non_spam * p_secret_given_non_spam *
                            p_the_given_non_spam * p_money_given_non_spam)

print(p_spam_given_message)
print(p_non_spam_given_message)
```

0.0
0.0


### `Additive Smoothing`

On the previous screen, we saw that both P(Spam|"secret code to unlock the money") and P(Spam<sup>C</sup>|"secret code to unlock the money") were equal to 0. This happens whenever the message contains words that occur in only one category: "money" occurs only in spam messages, while "the" occurs only in non-spam messages.

![image.png](attachment:image.png)

When we calculate P(Spam|"secret code to unlock the money"), we can see that P("the"|Spam) is equal to 0 because "the" is not part of the spam messages. Unfortunately, that single value of 0 has the drawback of turning the result of the entire equation to 0:

* P(Spam|"secret code to unlock the money") ∝ P(Spam) ⋅ P("secret"|Spam) ⋅ P("the"|Spam) ⋅ P("money"|Spam)

![image-2.png](attachment:image-2.png)

To fix this problem, we need to find a way to avoid these cases where we get probabilities of 0. Let's start by laying out the equation we're using to calculate P("the"|Spam):

![image-3.png](attachment:image-3.png)

We're going to add some notation and rewrite the equation above as:

![image-4.png](attachment:image-4.png)

To fix the problem, we're going to use a technique called additive smoothing, where we add a smoothing parameter α. In the equation below, we'll use α = 1 (N<sub>Vocabulary</sub> represents the number of unique words in all the messages, both spam and non-spam).

![image-5.png](attachment:image-5.png)

The additive smoothing technique solves the issue and gets us a non-zero result, but it introduces another problem. We're now calculating probabilities differently depending on the word — take P("the"|Spam) and P("secret"|Spam) for instance:

![image-6.png](attachment:image-6.png)

Words like "the" are thus given special treatment and their probabilities are increased artificially to avoid zero values, while words like "secret" are treated normally. To keep the probability values proportional across all words, we're going to use additive smoothing for every word:

![image-7.png](attachment:image-7.png)

In more general terms, this is the equation that we'll need to use for every word:

![image-8.png](attachment:image-8.png)
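A minimal sketch of this smoothed estimate is shown below. It assumes the general form (count + α) / (words in category + α · N<sub>Vocabulary</sub>), which matches the values used in the exercise that follows, such as (4 + 1) / (7 + 9). The function and parameter names are chosen for this example.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Estimate P(word|class) with additive smoothing."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Spam messages contain 7 words in total; the vocabulary has 9 unique words.
p_the_given_spam = smoothed_probability(0, 7, 9)     # "the" occurs 0 times
p_secret_given_spam = smoothed_probability(4, 7, 9)  # "secret" occurs 4 times

print(p_the_given_spam)     # 0.0625
print(p_secret_given_spam)  # 0.3125
```

With α = 0 the function falls back to the unsmoothed estimate, which returns 0 for a word that never occurs in the category.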

### `Instructions`


P(Spam|"secret code to unlock the money") is already calculated for you. Use the table below (the same as above) to calculate P(Spam<sup>C</sup>|"secret code to unlock the money").

![image-9.png](attachment:image-9.png)

1. Using the additive smoothing technique, calculate P(Spam<sup>C</sup>|"secret code to unlock the money"). Assign your answer to p_non_spam_given_message.
2. Compare p_spam_given_message and p_non_spam_given_message to classify the message as spam or non-spam. If you think it's spam, then assign the string 'spam' to classification. Otherwise, assign 'non-spam'.

In [4]:
```python
# Additive smoothing with alpha = 1 and a vocabulary of 9 unique words
p_spam = 2/4
p_secret_given_spam = (4 + 1) / (7 + 9)
p_the_given_spam = (0 + 1) / (7 + 9)
p_money_given_spam = (2 + 1) / (7 + 9)
p_spam_given_message = (p_spam * p_secret_given_spam *
                        p_the_given_spam * p_money_given_spam)

p_non_spam = 2/4
p_secret_given_non_spam = (2 + 1) / (9 + 9)
p_the_given_non_spam = (1 + 1) / (9 + 9)
p_money_given_non_spam = (0 + 1) / (9 + 9)
p_non_spam_given_message = (p_non_spam * p_secret_given_non_spam *
                            p_the_given_non_spam * p_money_given_non_spam)

# Classify the message by comparing the two values
if p_spam_given_message > p_non_spam_given_message:
    classification = 'spam'
else:
    classification = 'non-spam'

print(classification)
```

spam
