### TODO
- [ ] Create table of contents mimicking [this](https://sebastianraschka.com/Articles/2014_ipython_internal_links.html)
- [ ] Export utils functions, like play_audio and tada, to outer file

# Table of Contents:

<!-- [Building the Spam Filter](Building the Spam Filter) <br> -->
<!-- [Tokenization](Tokenization and Context) -->

## How to Build a Spam Filter using Bayes Theorem
---
We want to build a spam filter that classifies e-mails into two groups: **fraud** and **no fraud** sales.
Our e-mails can have features like "has_promotional_codes" or "has_giftcards" that we can use to better judge whether an e-mail contains a fraudulent sale.
But how do we feed this information to a spam classifier?

We can convert this information to proportions (i.e., probabilities) and use Bayes theorem to train a classifier.

A **Naive Bayes Classifier** is a supervised, *probabilistic* learning model. It works well for problems where:
* The inputs are independent of one another
* The probability of every attribute is always greater than zero

### Conditional Probabilities

\begin{equation}
P(A|B) = \frac{P(A \cap B)}{P(B)}
\end{equation}

The **intersection function $ \cap $** (informally called *cap* due to its shape) can also be thought of as the Boolean operator **AND** applied on sets. In Python it looks like:
```python
a = [1, 2, 3]; b = [1, 4, 5]

set(a) & set(b)  # => {1}
set(a).intersection(set(b)) == set(a) & set(b)  # => True
```

The **union function $\cup$** (informally called *cup* due to its shape) can be thought of as the **OR** operator:
```python
set(a) | set(b)  # => {1, 2, 3, 4, 5}
set(a).union(set(b)) == set(a) | set(b)  # => True
```
We can see the idea of conditional probability in Python like this:
```python
A, B = set(a), set(b)
# Sample space: every distinct outcome. len(A) + len(B) would count the shared element twice.
total_cases = len(A | B)  # => 5

p_A_cap_B = len(A & B) / total_cases  # => 0.2, probability of intersection happening
p_B = len(B) / total_cases  # => 0.6, probability of B happening

# So, according to the conditional probability formula above
p_A_given_B = p_A_cap_B / p_B  # => 0.333

# Because P(B) <= 1, the conditional probability is never smaller than the joint one:
p_A_given_B >= p_A_cap_B  # => True
```

### Inverse Conditional Probability (*aka* Bayes Theorem)

But what if we want to know the opposite case, that is, the probability of B given A?
In that case, we can reason backwards, recalling that $ A \cap B = B \cap A $:

From the conditional probability formula above, we know that
$$ P(A \cap B) = P(A|B)P(B) $$

So, starting again from the conditional probability, this time for $ B $ given $ A $, and substituting the expression for $ P(A \cap B) $, we have
$$ P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A|B)P(B)}{P(A)} $$

which is known as **Bayes Theorem**. We state it again for simplicity:

$$ P(B|A) = \frac{P(A|B)P(B)}{P(A)} $$
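
As a quick sanity check, the sketch below verifies the theorem numerically on the two toy sets used earlier. It assumes, purely for illustration, that every element of the union is an equally likely outcome; `Fraction` is used only to avoid floating-point rounding.

```python
from fractions import Fraction

# Same toy sets as above; each element of the union is assumed equally likely
A, B = {1, 2, 3}, {1, 4, 5}
n = len(A | B)  # size of the sample space

p_A = Fraction(len(A), n)
p_B = Fraction(len(B), n)
p_A_given_B = Fraction(len(A & B), len(B))  # P(A|B), counted directly
p_B_given_A = Fraction(len(A & B), len(A))  # P(B|A), counted directly

# Bayes Theorem: P(B|A) = P(A|B) P(B) / P(A)
bayes_rhs = p_A_given_B * p_B / p_A
print(p_B_given_A == bayes_rhs)  # => True
```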

### Naive Bayesian Classifier

#### The Chain Rule

Using the joint probability, the previous result transforms into the chain rule.<br>
A joint probability is the probability that all the events happen at the same time. The generic form is:
$$
P(A_1, A_2, \dots, A_n) = P(A_1)P(A_2|A_1)P(A_3 | A_1, A_2) \cdots P(A_n | A_1, A_2, \dots, A_{n-1})
$$
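
The chain rule can be checked by counting. The sketch below uses a small, made-up list of observations of three binary events (the data and the helper `prob` are hypothetical) and compares the joint probability with the product of conditionals.

```python
from fractions import Fraction

# Hypothetical observations of three binary events (A1, A2, A3)
observations = [
    (1, 1, 1), (1, 1, 0), (1, 0, 1), (1, 1, 1),
    (0, 1, 0), (0, 0, 0), (1, 1, 1), (0, 1, 1),
]

def prob(condition, given=lambda o: True):
    """Empirical P(condition | given), computed by counting observations."""
    pool = [o for o in observations if given(o)]
    return Fraction(sum(1 for o in pool if condition(o)), len(pool))

# Left-hand side: the joint probability P(A1, A2, A3)
joint = prob(lambda o: o == (1, 1, 1))

# Right-hand side: P(A1) * P(A2 | A1) * P(A3 | A1, A2)
chain = (
    prob(lambda o: o[0] == 1)
    * prob(lambda o: o[1] == 1, given=lambda o: o[0] == 1)
    * prob(lambda o: o[2] == 1, given=lambda o: o[0] == 1 and o[1] == 1)
)
print(joint, chain)  # => 3/8 3/8
```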


##### Naiveté in Bayesian Reasoning

Sometimes, when we strictly apply Bayes Theorem, we run into trouble because we need probabilities that we cannot compute. For example, $ P(Promos | Giftcard, Fraud) $ can be very hard, or even impossible, to get.
But we still need to get going. We solve this problem by making a rather big assumption (the *Naiveté assumption*) that lets us proceed when some information is too difficult or impossible to get. The **Naiveté** assumption consists of **considering only the individual effect that each event (Giftcard, Promos, ...) has on the event we want to predict**, in our case, fraud sales e-mails.

$$ 
\text{what we want to know} = P(Fraud | Giftcard, Promos) = \frac{P(Giftcard, Promos, Fraud)}{P(Giftcard, Promos)}
$$

$ P(Giftcard, Promos) $ is easy to obtain, and thanks to the Naiveté assumption (no interdependence between things like *having promo codes* and *having giftcards*, which is admittedly unrealistic) we can simplify the numerator to an expression that we can actually work with, given our data. So our numerator becomes:

$$ 
P(Giftcard, Promos, Fraud) = P(Fraud)P(Giftcard|Fraud)P(Promos|Fraud)
$$

and our final model is:

$$
P(Fraud | Giftcard, Promos) = \frac{P(Fraud)P(Giftcard | Fraud)P(Promos | Fraud)}{P(Giftcard, Promos)}
$$
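
To make the model concrete, here is a minimal sketch that evaluates it on made-up counts. The numbers and the `counts` layout are hypothetical, and the denominator $ P(Giftcard, Promos) $ is obtained here by summing the numerators of both classes, which is one common way to normalize and not necessarily how it would be read off a real data set.

```python
# Hypothetical training counts (illustrative only, not taken from a real corpus)
counts = {
    "fraud":    {"total": 20, "giftcard": 15, "promos": 12},
    "no_fraud": {"total": 80, "giftcard": 8,  "promos": 24},
}
n_emails = sum(c["total"] for c in counts.values())

def numerator(label):
    """P(label) * P(Giftcard | label) * P(Promos | label)."""
    c = counts[label]
    prior = c["total"] / n_emails
    return prior * (c["giftcard"] / c["total"]) * (c["promos"] / c["total"])

# Evidence P(Giftcard, Promos), assumed here to be the sum of both numerators
evidence = sum(numerator(label) for label in counts)

posterior_fraud = numerator("fraud") / evidence
print(round(posterior_fraud, 3))  # => 0.789
```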



But what if something like $ P(Promos | Fraud) = 0 $? That can happen with new information, or when we have not seen enough data. In that case, since probabilities here are just ratios of counts, we use what's called a **pseudocount**, which is just the real count of a class plus 1. That way, we ensure that every computed probability is always greater than zero.
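
A minimal sketch of the idea, written as add-one (Laplace) smoothing. The function name and the counts are hypothetical; adding `n_values` to the denominator is an assumption made here so that the smoothed estimates still sum to 1.

```python
def smoothed_probability(feature_count, class_count, n_values=2):
    """Add-one estimate of P(feature | class).

    n_values is the number of values the feature can take (2 for a
    has / has-not feature); adding it to the denominator keeps the
    estimates for all values summing to 1.
    """
    return (feature_count + 1) / (class_count + n_values)

# Hypothetical counts: none of the 20 fraud e-mails seen so far had promo codes
raw = 0 / 20                            # plain ratio of counts, exactly zero
smoothed = smoothed_probability(0, 20)  # 1 / 22, small but greater than zero
print(raw, smoothed)
```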

# Building the Spam Filter



Here we have the coding design for this exercise:

<img src="images/coding_design.jpeg" style="width: 190px;">

Each `Email` object takes an `.eml` text file, which it then tokenizes into something that the `SpamTrainer` can use. <br>
When testing, we will focus on the tradeoff between false positives and false negatives, since in this scenario, a **false positive** (filtering out an e-mail as SPAM when it is not) **could be very harmful for a business**. Thus, we will try to **minimize the false positive rate**.
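
The false positive rate can be computed from the true and predicted labels. The helper and the labels below are hypothetical and only illustrate the metric, with "spam" treated as the positive class.

```python
def false_positive_rate(actual, predicted, positive="spam"):
    """FP / (FP + TN): share of legitimate e-mails wrongly flagged as spam."""
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    tn = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
    return fp / (fp + tn) if (fp + tn) else 0.0

# Hypothetical labels, for illustration only
actual    = ["ham", "ham", "spam", "ham", "spam", "ham"]
predicted = ["ham", "spam", "spam", "ham", "ham", "ham"]
print(false_positive_rate(actual, predicted))  # => 0.25
```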

#### Data source

* **CSDMC2010 SPAM corpus** <br>
This data set has 4,327 total messages, of which 2,949 are ham and 1,378 are spam.

## Intro to Test-Driven Development

Let's first define the class that will appropriately parse the e-mails in the `.eml` text files:

In [49]:
import unittest
import io
import re
import email  # used in getting results
from bs4 import BeautifulSoup  # used in getting expected results

Test # 1: `EmailObject`

In [57]:
class TestPlaintextEmailObject(unittest.TestCase):
    """ Assumes there is a class EmailObject imported """
    
    CRLF = "\r\n\r\n"  # two CRLF pairs (a blank line). Separates the headers from the body
    
    def setUp(self):
        """
        Loads the plain-text fixture both as a string and as an EmailObject
        """
        self.plain_file = './tests/fixtures/plain.eml'
        with open(self.plain_file, 'rb') as plaintext:
            self.text = plaintext.read().decode('utf-8')
            plaintext.seek(0)
            self.plain_email = EmailObject(plaintext)  # expects binary files, not strings
        
    def test_parse_plain_body(self):
        body_expected = self.CRLF.join(self.text.split(self.CRLF)[1:])  # split on CRLF, take all except first one
        body_actual = self.plain_email.body()
        # compare method result with expected result
        self.assertEqual(body_actual, body_expected)
        
    def test_parses_the_subject(self):
        subject_expected = re.search(pattern="Subject: (.*)", string=self.text).group(1)  # Take group 1
        subject_actual = self.plain_email.subject()
        # compare
        self.assertEqual(subject_actual, subject_expected)

In [58]:
x = TestPlaintextEmailObject()
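
Note that instantiating a `TestCase` does not execute its tests. One way to run a test class from inside a notebook is sketched below, using a hypothetical `TestExample` stand-in so the snippet is self-contained; the same two lines at the end should work for any `TestCase` subclass.

```python
import unittest

class TestExample(unittest.TestCase):
    """ Hypothetical stand-in for a real test class """

    def test_truth(self):
        self.assertEqual(1 + 1, 2)

# Load every test_* method of the class and run them with a text report
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestExample)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```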

Test # 2: `HTMLEmail`

In [None]:
class TestHTMLEmail(unittest.TestCase):
    """ Assumes existence of an EmailObject """
    
    def setUp(self):
        # Prepare attributes to do the expected-vs-actual comparisons
        with open('./tests/fixtures/html.eml', 'rb') as html_file:  # binary mode, so bytes can be decoded
            self.html = html_file.read().decode('utf-8')
            html_file.seek(0)  # rewind the file (not the string) before parsing it again
            
            # save object output to be tested below
            self.html_email = EmailObject(html_file)  # accepts binary text files
            
    def test_parses_stores_inner_text_html(self):
        body = "\n\n".join(self.html.split("\n\n")[1:])  # take all except first one
        body_expected = BeautifulSoup(body).text
        body_actual = self.html_email.body()  # tests object
        self.assertEqual(body_actual, body_expected)
        
    def test_stores_subject(self):
        subject_expected = re.search(pattern="Subject: (.*)", string=self.html).group(1)
        subject_actual = self.html_email.subject()  # tests object
        self.assertEqual(subject_actual, subject_expected)

Try methods here before putting them in the class

In [60]:
class EmailObject:
    """ Parses incoming email messages """
    
    CRLF = "\r\n\r\n"  # two CRLF pairs (a blank line). Separates the headers from the body

    def __init__(self, infile, category=None):
        """ Initializes an Email from an open binary file and an optional label """
        self.infile = infile
        self.category = category
        self.mail = email.message_from_binary_file(self.infile)
    
    def subject(self) -> str:
        """ Extracts the subject of the email """
        return self.mail.get("Subject")
        
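    def body(self) -> str:
        """ Extracts the body of the email as text """
        payload = self.mail.get_payload(decode=True)  # bytes (None for multipart messages)
        content_type = self.mail.get_content_type()
        if content_type == 'text/html':
            return BeautifulSoup(payload).text
        elif content_type == 'text/plain':
            # decode=True yields bytes; convert to str using the declared charset
            charset = self.mail.get_content_charset() or 'utf-8'
            return payload.decode(charset, errors='replace')
        else:
            return ''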
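
To try the parsing calls without a fixture file, the standard library alone is enough. The sketch below feeds a made-up raw message through `io.BytesIO` and shows that `get_payload(decode=True)` returns `bytes`, which matters when a test compares the body against a `str`.

```python
import email
import io

# Hypothetical raw message; real fixtures would be read from .eml files
raw = (
    b"Subject: Hello\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Free gift cards inside!\r\n"
)

message = email.message_from_binary_file(io.BytesIO(raw))
payload = message.get_payload(decode=True)  # bytes, because decode=True

print(message.get("Subject"))       # => Hello
print(message.get_content_type())   # => text/plain
print(type(payload).__name__)       # => bytes

# Convert to str with the charset declared in the headers
body_text = payload.decode(message.get_content_charset() or "utf-8")
print(body_text.strip())
```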

### Tokenization and Context

<img src="images/Tokenization.png" style="width: 400px;">
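
One possible way to tokenize an e-mail body is sketched below. Both helpers are hypothetical: `tokenize` lowercases the text and keeps word-like chunks, and `ngrams` groups consecutive tokens, which is assumed here to be one simple way of keeping some context around each word.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens in order."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n=2):
    """Group consecutive tokens so that some local context is kept."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Free GIFT cards, claim your gift now!")
print(tokens)
print(ngrams(tokens))
```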

In [74]:
def test_parse_plain_body(self=None):
        """
        Doc
        """
        

## [References](./references_ch4.md)