# [Naive Bayes from scratch](https://philippmuens.com/naive-bayes-from-scratch/)

# Bayes Theorem Example
#### Problem
- What's the probability that it'll be sunny throughout the day given that we saw clouds? We know that:
    + The chance to see clouds is: 60%
    + Only 7 out of 30 days can be considered sunny
    + 50% of the sunny days started out with clouds

#### Solution
- P(Cloud) = 0.6
- P(Sun) = 7/30 ≈ 0.23
- P(Cloud | Sun) = 0.5
- Probability that it'll be sunny throughout the day given that we saw clouds:
    + P(Sun | Cloud) = $\frac{P(Cloud\ |\ Sun)\ *\ P(Sun)}{P(Cloud)} = \frac{0.5*0.23}{0.6} = 19.17 \%$
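
The calculation above can be checked with a few lines of Python. This is only a minimal sketch; the function name `bayes` is an illustrative choice, not something the example depends on.

```python
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


p_cloud = 0.6
p_sun = 0.23            # 7 / 30, rounded as in the solution above
p_cloud_given_sun = 0.5

p_sun_given_cloud = bayes(p_cloud_given_sun, p_sun, p_cloud)
print('P(Sun | Cloud) = {:.2f} %'.format(p_sun_given_cloud * 100))  # 19.17 %
```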

# Bayes Theorem and spammy words
## 1. Is a message spam, given that it contains a certain word?
- Assume that we're working with the word "prize" and want to figure out what the probability is that a given message which contains the word "prize" is spam
- Formulating this problem via Bayes Theorem: $P(Spam\ |\ Prize)\ =\ \frac{P(Prize\ |\ Spam)\ *\ P(Spam)}{P(Prize)}$
    - $P(Spam\ |\ Prize)$: probability of a message being spam given that we see the word "Prize" in it
    - $P(Prize\ |\ Spam)$: the probability of finding the word "Prize" in our already seen spam messages
    - $P(Spam)$: the probability of the message being spam
    - $P(Prize)$: probability of finding the word "Prize" in any of our already seen messages

- Expanded Bayes Theorem: $P(Spam\ |\ Prize)\ =\ \frac{P(Prize\ |\ Spam)\ *\ P(Spam)}{P(Prize\ ∣\ Spam)\ *\ P(Spam)\ +\ P(Prize\ ∣\ Ham)\ *\ P(Ham)}$
- Initially we don't know the probability of the message being spam or ham, so we assume both are equally likely: $P(Spam)\ =\ P(Ham)\ =\ 0.5$

$$\begin{equation}
    \begin{split}
        P(Spam\ |\ Prize)\ &=\ \frac{P(Prize\ |\ Spam)\ *\ 0.5}{P(Prize\ ∣\ Spam)\ *\ 0.5\ +\ P(Prize\ ∣\ Ham)\ *\ 0.5} \\
            & = \frac{P(Prize\ |\ Spam)}{P(Prize\ ∣\ Spam)\ +\ P(Prize\ ∣\ Ham)}
    \end{split}
\end{equation}$$
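
With equal priors, the simplified formula only needs the two word likelihoods. A minimal sketch, where the function name is hypothetical and the likelihood values are made-up numbers for illustration, not measurements from any dataset:

```python
def p_spam_given_word(p_word_given_spam: float, p_word_given_ham: float) -> float:
    """P(Spam | word), assuming P(Spam) = P(Ham) = 0.5."""
    return p_word_given_spam / (p_word_given_spam + p_word_given_ham)


# Hypothetical likelihoods: "prize" shows up in 30 % of spam and 1 % of ham.
print(round(p_spam_given_word(0.3, 0.01), 4))  # 0.9677
```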

## 2. Scaling it up to whole sentences
- A naive approach: use the whole message as the evidence in the conditional probability
    + A sentence is a succession of single words
    + Apply the single-word model to every word we come across in our message
    + Assume the words are conditionally independent given the class (this is the "naive" part), so the probability of the whole message is the product of the individual word probabilities

- Example:
    $$\begin{split}
        P(Message\ |\ Spam)\ &=\ P(word_1, word_2, ..., word_n\ |\ Spam) \\
                            &=\ P(word_1\ |\ Spam)\ *\ P(word_2\ |\ Spam)\ *\ ...\ *\ P(word_n\ |\ Spam)
    \end{split}$$
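
As a small illustration of the product above, assuming a hypothetical table of per-word likelihoods (the words and numbers are invented for the example):

```python
from math import prod
from typing import Dict, List


def message_likelihood(words: List[str], p_word_given_class: Dict[str, float]) -> float:
    """P(Message | Class) as the product of the per-word likelihoods."""
    return prod(p_word_given_class[word] for word in words)


# Hypothetical values of P(word | Spam).
p_word_given_spam = {'win': 0.2, 'a': 0.5, 'prize': 0.1}
print(message_likelihood(['win', 'a', 'prize'], p_word_given_spam))
```

The result is close to 0.01. Even for three words the product is already small, which leads to the first difficulty discussed next.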

## 3. Difficulties
- Difficulty 1: Vanishing probability in computation
    + We multiply a lot of small probabilities
    + Floating-point numbers only have a limited precision  
        $\to$ "arithmetic underflow": the product eventually turns into 0
    - Trick: add logarithms instead of multiplying, then convert back with exp
        + $log(a\ *\ b) = log(a)\ +\ log(b)$
        + $exp(log(a\ *\ b)) = a\ *\ b$

- Difficulty 2: Vanishing probability for unseen words
    + If we've never seen the word W in a spam message, then P(W ∣ Spam) = 0, which zeroes the whole product
    - Trick: Introduce a smoothing factor k
        + $P(W\ |\ S)\ =\ \frac{k\ +\ \text{spam messages containing W}}{(2\ *\ k)\ +\ \text{total spam messages}}$
        + Example: 100 spam examples, but the word W was found 0 times
            + k = 0: no smoothing
                + $P(W\ |\ S)\ =\ \frac{0}{100}\ = 0$
            + k = 1
                + $P(W\ |\ S)\ =\ \frac{1\ +\ 0}{(2*1)\ +\ 100}\ =\ \frac{1}{102}\ \approx 0.0098$
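
Both tricks can be tried out in isolation. The sketch below is illustrative only: the helper name `smoothed_probability` and the list of 100 identical probabilities are assumptions made for the demonstration.

```python
from math import log

# Trick 1: multiplying many small probabilities underflows to 0.0,
# while the sum of their logarithms stays representable.
probabilities = [1e-5] * 100

product = 1.0
for p in probabilities:
    product *= p

log_sum = sum(log(p) for p in probabilities)

print(product)   # 0.0, because 1e-500 is below the smallest positive float
print(log_sum)   # roughly -1151.29


# Trick 2: smoothing with factor k.
def smoothed_probability(num_containing: int, num_total: int, k: float = 1) -> float:
    """P(W | S) = (k + messages containing W) / (2 * k + total messages)."""
    return (k + num_containing) / (2 * k + num_total)


print(smoothed_probability(0, 100, k=0))   # 0.0
print(smoothed_probability(0, 100, k=1))   # 1 / 102, roughly 0.0098
```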

## 4. Implementation
- We need to "train" our Naive Bayes classifier with training data

### 1. Dataset

#### Download

In [1]:
%%bash
if [[ ! -d "data/enron1" ]]; then
    wget -nc \
        -P data \
        http://nlp.cs.aueb.gr/software_and_datasets/Enron-Spam/preprocessed/enron1.tar.gz 2> /dev/null
    tar -xzf data/enron1.tar.gz -C data 2> /dev/null
fi

In [2]:
%%bash
ls data/enron1

ham
spam
Summary.txt


In [3]:
%%bash
ls data/enron1/ham | head -5

0001.1999-12-10.farmer.ham.txt
0002.1999-12-13.farmer.ham.txt
0003.1999-12-14.farmer.ham.txt
0004.1999-12-14.farmer.ham.txt
0005.1999-12-14.farmer.ham.txt


In [4]:
%%bash
ls data/enron1/spam | head -5

0006.2003-12-18.GP.spam.txt
0008.2003-12-18.GP.spam.txt
0017.2003-12-18.GP.spam.txt
0018.2003-12-18.GP.spam.txt
0026.2003-12-18.GP.spam.txt


#### Preprocessing data

In [5]:
from glob import glob
from typing import List


spam_message_paths: List[str] = glob('data/enron1/spam/*.txt')
ham_message_paths: List[str] = glob('data/enron1/ham/*.txt')
message_paths: List[str] = spam_message_paths + ham_message_paths

class Message():
    def __init__(self, text: str, is_spam: bool) -> None:
        self.text: str = text
        self.is_spam: bool = is_spam

messages: List[Message] = []
for path in message_paths:
    with open(path, errors='ignore') as file:
        is_spam: bool = 'spam' in path
        text: str = ''
        for line in file.readlines():
            text += line.replace('Subject:', '').strip()
            text += ' '
        messages.append(
            Message(text, is_spam))

#### Stats

In [6]:
len(messages)

5172

In [7]:
spam_messages = sum(1 for message in messages if message.is_spam)
print("Number of spam messages: {}".format(spam_messages))

print('''---- Example spam ----
{}'''.format(messages[0].text))

Number of spam messages: 1500
---- Example spam ----
we sell regalis for an affordable price hi , regalis , also known as superviagra or cialis - half a pill lasts all weekend - has less sideeffects - has higher success rate now you can buy regalis , for over 70 % cheaper than the equivilent brand for sale in us we ship world wide , and no prescription is required ! ! even if you ' re not impotent , regalis will increase size , pleasure and power ! try it today you wont regret ! get it here : http : / / koolrx . com / sup / best regards , jeremy stones no thanks : http : / / koolrx . com / rm . html 


In [8]:
ham_messages = sum(1 for message in messages if not message.is_spam)
print("Number of ham messages: {}".format(ham_messages))

print('''---- Example ham ----
{}'''.format(messages[1505].text))

Number of ham messages: 3672
---- Example ham ----
re : killing ena to ena deals in sitara jay , if a deal is killed it poses a problem for us in unify if there are any paths associated with the deal ; therefore , we request the deals be zeroed out . call me if this is a problem . also , we would appreciate further details on why these deals are being killed . in addition , i have copied rita and mark from volume management for their input . regards , tammy x 35375 - - - - - original message - - - - - from : pena , matt sent : thursday , december 13 , 2001 3 : 39 pm to : krishnaswamy , jayant ; pinion , richard ; jaquet , tammy cc : severson , russ ; truong , dat ; aybar , luis ; ma , felicia subject : re : killing ena to ena deals in sitara thanks jay ! tammy / richard : you may want to let the schedulers know , although they may already . - - - - - original message - - - - - from : krishnaswamy , jayant sent : thursday , december 13 , 2001 3 : 38 pm to : pinion , richard ; jaquet , t

#### train_test_split

In [9]:
from typing import List, Tuple
from random import shuffle

def train_test_split(messages: List[Message], pct: float = 0.8) -> Tuple[List[Message], List[Message]]:
    # Shuffle a copy so the caller's list keeps its original order
    shuffled = messages[:]
    shuffle(shuffled)
    num_train = int(round(len(shuffled) * pct))
    return (shuffled[:num_train], shuffled[num_train:])

In [10]:
train: List[Message]
test: List[Message]

train, test = train_test_split(messages, 0.85)
print(len(train))
print(len(test))

4396
776
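
Because `shuffle` is random, the content of the two parts changes from run to run. A possible variant, sketched here under the assumption that reproducibility matters, takes a hypothetical `seed` parameter and leaves the input list untouched. It works on any list, so the example uses plain integers:

```python
from random import Random
from typing import List, Tuple, TypeVar

T = TypeVar('T')


def seeded_split(items: List[T], pct: float = 0.8, seed: int = 42) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of `items` with a fixed seed and split it at `pct`."""
    shuffled = list(items)          # copy, so the caller's list keeps its order
    Random(seed).shuffle(shuffled)  # private generator, global random state untouched
    num_train = int(round(len(shuffled) * pct))
    return shuffled[:num_train], shuffled[num_train:]


train_part, test_part = seeded_split(list(range(10)), pct=0.8)
print(len(train_part), len(test_part))  # 8 2
```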


### 2. Model
#### Tokenize
- Input: a raw message
- Output: a list of words

In [11]:
import re
from typing import Set

def tokenize(text: str) -> Set[str]:
    # Lowercase every run of letters, digits or apostrophes; the set drops duplicates
    return {word.lower() for word in re.findall(r'[A-Za-z0-9\']+', text)}

In [12]:
assert tokenize(
    'Is this a text? If so, Tokenize this text!...') == \
    {'is', 'this', 'a', 'text', 'if', 'so', 'tokenize'}

tokenize(
    'Is this a text? If so, Tokenize this text!...')

{'a', 'if', 'is', 'so', 'text', 'this', 'tokenize'}

#### Naive Bayes model

In [13]:
from collections import defaultdict
from typing import Dict, List, Set
from math import log, exp

class NaiveBayes:
    def __init__(self, k=1) -> None:
        self._k: int = k
        self._num_spam_messages: int = 0
        self._num_ham_messages: int = 0
        self._num_word_in_spam: Dict[str, int] = defaultdict(int)
        self._num_word_in_ham: Dict[str, int] = defaultdict(int)
        self._spam_words: Set[str] = set()
        self._ham_words: Set[str] = set()
        self._words: Set[str] = set()


    def train(self, messages: List[Message]) -> None:
        for msg in messages:
            tokens: Set[str] = tokenize(msg.text)
            self._words.update(tokens)
            if msg.is_spam:
                self._num_spam_messages += 1
                self._spam_words.update(tokens)
                for token in tokens:
                    self._num_word_in_spam[token] += 1
            else:
                self._num_ham_messages += 1
                self._ham_words.update(tokens)
                for token in tokens:
                    self._num_word_in_ham[token] += 1


    def _p_word_spam(self, word: str) -> float:
        return (self._k + self._num_word_in_spam[word]) / ((2 * self._k) + self._num_spam_messages)


    def _p_word_ham(self, word: str) -> float:
        return (self._k + self._num_word_in_ham[word]) / ((2 * self._k) + self._num_ham_messages)


    def predict(self, text: str) -> float:
        text_words: Set[str] = tokenize(text)
        log_p_spam: float = 0.0
        log_p_ham: float = 0.0

        for word in self._words:
            p_spam: float = self._p_word_spam(word)
            p_ham: float = self._p_word_ham(word)
            if word in text_words:
                log_p_spam += log(p_spam)
                log_p_ham += log(p_ham)
            else:
                log_p_spam += log(1 - p_spam)
                log_p_ham += log(1 - p_ham)

        p_if_spam: float = exp(log_p_spam)
        p_if_ham: float = exp(log_p_ham)

        try:
            return p_if_spam / (p_if_spam + p_if_ham)
        except ZeroDivisionError:
            # Both likelihoods underflowed to 0.0; fall back to "spam"
            return 1.0
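
One caveat with converting back via `exp`: when the vocabulary is large, both log sums can be so negative that `exp` returns 0.0 for each of them, and the division fails. A possible alternative, sketched here as an assumption rather than as part of the model above, stays in log space and only exponentiates the difference of the two log-likelihoods. The function name `posterior_from_logs` is hypothetical.

```python
from math import exp


def posterior_from_logs(log_p_spam: float, log_p_ham: float) -> float:
    """Return exp(a) / (exp(a) + exp(b)) without computing exp(a) or exp(b)."""
    diff = log_p_ham - log_p_spam
    if diff >= 0:
        # Ham is at least as likely; exp(-diff) lies in [0, 1] and cannot overflow
        e = exp(-diff)
        return e / (1.0 + e)
    # Spam is more likely; exp(diff) lies in [0, 1) and cannot overflow
    return 1.0 / (1.0 + exp(diff))


# exp(-2000.0) and exp(-2010.0) are both 0.0, yet the ratio is well defined.
print(posterior_from_logs(-2000.0, -2010.0))  # close to 1.0
```

Swapping this in for the final division in `predict` would change the scores for messages that currently hit the fallback, so the numbers reported in the test section would no longer apply.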

### 3. Train and Test

#### Train

In [14]:
nb: NaiveBayes = NaiveBayes()
nb.train(train)

#### Test

In [15]:
true_positives: int = 0
true_negatives: int = 0

false_negatives: int = 0
false_positives: int = 0

for cnt, message in enumerate(test, 1):
    if cnt%100 == 0 or cnt == len(test): print("Progress: {}/{}".format(cnt, len(test)))
    
    scr: float = nb.predict(message.text)
    test_result: bool = scr > 0.5

    if test_result and message.is_spam:
        true_positives += 1
    elif not test_result and not message.is_spam:
        true_negatives += 1
    elif test_result and not message.is_spam:
        # Predicted spam, but the message is ham
        false_positives += 1
    else:
        # Predicted ham, but the message is spam
        false_negatives += 1

Progress: 100/776
Progress: 200/776
Progress: 300/776
Progress: 400/776
Progress: 500/776
Progress: 600/776
Progress: 700/776
Progress: 776/776


In [16]:
print('Accuracy: {:.2f} %'.format((true_positives + true_negatives)*100.0 / len(test)))

Accuracy: 82.35 %


In [17]:
print('''Confusion matrix:
    {}\t{}
    {}\t{}'''.format(true_positives, false_positives, false_negatives, true_negatives))

Confusion matrix:
    117	44
    93	522
