# Naive Bayes Theorem

## Background
Bayes' theorem calculates the probability of an event from the probabilities of related events. Naive Bayes builds on it by assuming that the features are independent of each other, so it is important to keep that assumption in mind.

Bayes' theorem converts the result of a test into the actual probability of the event.

TODO: add more.

## Data 
The data we will be using is from the UCI ML repository. You can access it [here](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).

In [2]:
import pandas as pd

In [4]:
# using '!' allows us to run bash commands
print('List all the files in the current directory:\n')
!ls
print('List of all the files inside the data directory:')
!ls data

List all the files in the current directory:

data                         naive-bayes-classifier.ipynb
List of all the files inside the data directory:
SMSSpamCollection


In [5]:
# Read in the data using read_table pandas function
df = pd.read_table('data/SMSSpamCollection',
                    sep='\t',
                    header=None,
                    names=['label', 'sms_message'])
df.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


# Data wrangling
Let's convert our label column to numerical values, where:
- ham: 0
- spam: 1

In [7]:
df['label'] = df.label.map({
    'ham': 0,
    'spam': 1
})

print(df.shape)
df.head()

(5572, 2)


| | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


## Bag of Words
Most ML algorithms (including those in scikit-learn) rely on numerical data, so we need to convert our text data to numerical data.

Bag of Words (BoW) is a way of representing a collection of text data: take a piece of text and count how often each word occurs in it. BoW treats each word individually, and the order in which the words occur does not matter.
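
As a small illustration of the order-independence noted above, the sketch below counts the words in two made-up sentences that contain the same words in a different order. The sentences and the helper name `bag_of_words` are hypothetical and not part of the original notebook.

```python
from collections import Counter

def bag_of_words(text):
    # Hypothetical helper: lowercase, split on whitespace, count each word
    return Counter(text.lower().split())

first = bag_of_words('call me tomorrow')
second = bag_of_words('tomorrow call me')

# Same words, different order: the two bags compare equal
print(first == second)
```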

We will convert the messages into a matrix in which each document (SMS) is a row and each word (token) is a column. Each cell then holds the number of times that word occurs in that SMS.

To do this, we will use scikit-learn's `CountVectorizer`, which:
- tokenizes the string (separates each SMS into individual words) and gives an integer ID to each token
- counts the occurrences of each of these tokens

We will first create our own BoW, then use scikit-learn's version.



## Bag Of Words: our implementation

In [9]:
# We begin by setting all words to lower case
sms = ['Hi! how are you.',
        'Win money, win money!',
        'call me bro',
        'Can you call me tomorrow?']

lower_case_sms = []

for s in sms:
    lower_case_sms.append(s.lower())
print(lower_case_sms)

['hi! how are you.', 'win money, win money!', 'call me bro', 'can you call me tomorrow?']


In [10]:
# remove all punctuation
import string
punctuation_removed = []

for s in lower_case_sms:
    # translate() method returns a string where each character is mapped to its corresponding character in the translation table
    # maketrans method returns a translation table with a 1-to-1 mapping of a Unicode ordinal to its translation/replacement
    punctuation_removed.append(s.translate(str.maketrans('', '', string.punctuation)))
print(punctuation_removed)

['hi how are you', 'win money win money', 'call me bro', 'can you call me tomorrow']


In [11]:
# Tokenize: split each sentence into individual words, using a space as the delimiter
split_words = []
for sentence in punctuation_removed:
    split_words.append(sentence.split(' '))
print(split_words)

[['hi', 'how', 'are', 'you'], ['win', 'money', 'win', 'money'], ['call', 'me', 'bro'], ['can', 'you', 'call', 'me', 'tomorrow']]


In [12]:
# Count the occurrence/frequency of each word

# The Counter class from the collections module will be used
# Counter counts the occurrences of each item in the list and returns a dict-like object
import pprint
from collections import Counter

frequency_list = []

for words in split_words:
    frequency_counts = Counter(words)
    frequency_list.append(frequency_counts)

pprint.pprint(frequency_list)

[Counter({'hi': 1, 'how': 1, 'are': 1, 'you': 1}),
 Counter({'win': 2, 'money': 2}),
 Counter({'call': 1, 'me': 1, 'bro': 1}),
 Counter({'can': 1, 'you': 1, 'call': 1, 'me': 1, 'tomorrow': 1})]
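
To finish the manual version, the per-message counts can be laid out as the document-term matrix described earlier. The following is a minimal sketch, assuming the same four example messages; the names `vocabulary` and `matrix` are illustrative and not part of the original notebook.

```python
from collections import Counter

# Per-message word counts, as produced by the tokenizing steps above
frequency_list = [
    Counter(['hi', 'how', 'are', 'you']),
    Counter(['win', 'money', 'win', 'money']),
    Counter(['call', 'me', 'bro']),
    Counter(['can', 'you', 'call', 'me', 'tomorrow']),
]

# Vocabulary: every distinct word, sorted so the column order is stable
vocabulary = sorted(set(word for counts in frequency_list for word in counts))

# One row per message, one column per vocabulary word.
# A Counter returns 0 for words a message does not contain.
matrix = [[counts[word] for word in vocabulary] for counts in frequency_list]

print(vocabulary)
for row in matrix:
    print(row)
```

Sorting the vocabulary is a design choice that keeps the column order reproducible between runs.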


## Scikit-learn
Now, let's use scikit-learn's implementation instead.

In [13]:
sample_sms = ['Hiya, how are you?',
                'Win money, win from home',
                'call me now man',
                'hi, call hello you tomorrow?']

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()
print(count_vector)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)


In [16]:
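count_vector.fit(sample_sms)
# Note: in scikit-learn 1.0+ this method is named get_feature_names_out();
# get_feature_names() was removed in 1.2
count_vector.get_feature_names()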

['are',
 'call',
 'from',
 'hello',
 'hi',
 'hiya',
 'home',
 'how',
 'man',
 'me',
 'money',
 'now',
 'tomorrow',
 'win',
 'you']

In [19]:
# Create a matrix with the rows being each of the 4 documents, and the columns being each word
sms_matrix = count_vector.transform(sample_sms).toarray()
sms_matrix

array([[1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 2, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]])

The data is now in a format we can work with: a matrix where each row is an SMS and each column holds the number of times that word occurs in that SMS (for example, 'win' appears twice in the second message).

Let's tidy up the matrix and turn it into a dataframe with the right column names.

In [21]:
# On scikit-learn 1.2+, use count_vector.get_feature_names_out() here
sms_df = pd.DataFrame(sms_matrix, columns = count_vector.get_feature_names())
sms_df

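| | are | call | from | hello | hi | hiya | home | how | man | me | money | now | tomorrow | win | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |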


# Train and Testing sets

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], df['label'], random_state=1)

In [23]:
print(f'Number of rows in the df: {df.shape[0]}')
print(f'Number of rows in the training set: {X_train.shape[0]}')
print(f'Number of rows in the test set: {X_test.shape[0]}')

Number of rows in the df: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393


Apply BoW to our dataset.

What we will need to do:
- Training: Learn a vocabulary dictionary (similar to the df/matrix we created earlier) for the training dataset, then transform the data into a document-term matrix
- Testing: transform the data into a document-term matrix using the learned vocabulary from the training set
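
The reason for learning the vocabulary only from the training set can be seen in a small pure-Python sketch. It is a simplified stand-in for `CountVectorizer`, not its actual implementation, and the helper names and example messages are made up: the vocabulary is learned from the training messages, and any test word outside that vocabulary is dropped.

```python
def learn_vocabulary(documents):
    # Map each distinct training word to a column index
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def to_document_term_matrix(documents, vocabulary):
    # Count only the words that appear in the learned vocabulary
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for word in doc.lower().split():
            if word in vocabulary:
                row[vocabulary[word]] += 1
        matrix.append(row)
    return matrix

train_messages = ['win money now', 'call me now']
test_messages = ['win a prize now']  # 'a' and 'prize' never appear in training

vocabulary = learn_vocabulary(train_messages)
train_matrix = to_document_term_matrix(train_messages, vocabulary)
test_matrix = to_document_term_matrix(test_messages, vocabulary)

print(vocabulary)
print(test_matrix)
```

Both matrices end up with the same columns, which is what a classifier trained on the first needs in order to score the second.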

In [24]:
count_vector = CountVectorizer()
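# Learn the vocabulary from the training data and transform it
training_data = count_vector.fit_transform(X_train)
# Transform the test data using the vocabulary learned from the training data (no refitting)
testing_data = count_vector.transform(X_test)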

<hr>

# Example
## Bayes Theorem Implementation from scratch
We will now build the algorithm we need to predict whether a message is spam or not.

But what is Bayes Theorem?
It calculates the probability of an event occurring, based on other probabilities that are related to the event in question. It combines a `prior`, a probability that we already know or that is given to us, with new evidence to produce a `posterior`, the probability that we are looking to compute.


>$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$

Let's define the terms above:
- P(A|B): the `posterior probability` that event A happens, given that event B has happened.
- P(A): the `prior probability` of event A happening on its own.
- P(B): the probability of event B happening on its own (the evidence).
- P(B|A): the `likelihood` of event B happening, given event A.

### Example: What is the probability of an individual having diabetes, given that they tested positive?
Let's assume the following:
- P(Diabetic): the probability of a person having diabetes. This has been given to us as 0.01 or 1%, meaning that 1% of the population has diabetes.
- P(Positive): the probability of getting a positive result on the test.
- P(Negative): the probability of getting a negative result on the test.
- P(Positive|Diabetic): the probability of getting a positive test result given that the individual has diabetes. This is 0.9 or 90%. It is also called **Sensitivity** or **TPR: True Positive Rate**.
- P(Negative|~Diabetic): the probability of getting a negative test result given that the individual does not have diabetes. This is 0.9 or 90%. It is also called **Specificity** or **TNR: True Negative Rate**.

We want to find P(Diabetic|Positive).

Hence, using Bayes theorem we get:

$$ P(Diabetic|Positive) = \frac{P(Positive|Diabetic)P(Diabetic)}{P(Positive)} $$


We can calculate P(Positive) by using both sensitivity and specificity:

P(Positive) = [P(Diabetic) * Sensitivity] + [P(~Diabetic) * (1 - Specificity)]


In [27]:
# the information below was given to us

# P(Diabetic)
p_diabetic = 0.01 # given to us above
# P(~Diabetic) = 1 - P(Diabetic)
p_not_diabetic = 1 - p_diabetic
# Sensitivity AKA P(Positive|Diabetic)
p_positive_given_diabetic = 0.9 # given to us above
# Specificity or P(Negative|~Diabetic)
p_negative_given_not_diabetic = 0.9 # given to us above

In [30]:
# Let's work out P(Positive)
# This is the probability of testing positive: combine true positives (TP) and false positives (FP)

# TP: P(Diabetic) * P(Positive|Diabetic)
true_positive = p_diabetic * p_positive_given_diabetic
# FP: P(~Diabetic) * P(Positive|~Diabetic), where P(Positive|~Diabetic) = 1 - Specificity
p_positive_given_not_diabetic = 1 - p_negative_given_not_diabetic
false_positive = p_not_diabetic * p_positive_given_not_diabetic

# P(Positive) = TP + FP
p_positive = true_positive + false_positive
print(f'The probability of getting a positive test result is: P(Positive) = {round(p_positive,3)}')

The probability of getting a positive test result is: P(Positive) = 0.108


We can now calculate our posteriors for a positive test result *using Bayes Theorem*.

The probability that someone is diabetic, given that they tested positive:

$$ P(Diabetic|Positive) = \frac{P(Positive|Diabetic)P(Diabetic)}{P(Positive)} $$

And the probability that someone is not diabetic, given that they tested positive:

$$ P(\sim Diabetic|Positive) = \frac{P(Positive|\sim Diabetic)P(\sim Diabetic)}{P(Positive)} $$

These two posteriors add up to 1, because given a positive result the individual either has diabetes or does not.

Our goal was to determine **the probability of an individual having diabetes, given that they tested positive**.

In [34]:
# using bayes rule
p_diabetic_given_positive = ( p_positive_given_diabetic * p_diabetic ) / p_positive
print(f'The probability that an individual is diabetic given that they tested positive is: P(Diabetic|Positive) = {round(p_diabetic_given_positive,3)}')

The probability that an individual is diabetic given that they tested positive is: P(Diabetic|Positive) = 0.083


Let's now work out the probability of an individual not having diabetes, given that the individual has a positive test result:
P(~Diabetic|Positive).

This can be calculated in two ways:
- Bayes method
- 1 - P(Diabetic|Positive)

In [41]:
# Using bayes rule
p_not_diabetic_given_positive = ( p_positive_given_not_diabetic * p_not_diabetic ) / p_positive
print(f'The probability that an individual is not diabetic given that they tested positive is: P(~Diabetic|Positive) = {round(p_not_diabetic_given_positive,3)}')
print(f'This is the same as working it out by doing: P(~Diabetic|Positive) = 1 - P(Diabetic|Positive) = {round(1 - p_diabetic_given_positive,3)}')

The probability that an individual is not diabetic given that they tested positive is: P(~Diabetic|Positive) = 0.917
This is the same as working it out by doing: P(~Diabetic|Positive) = 1 - P(Diabetic|Positive) = 0.917


The above shows that even with a positive test result, there is only about an 8.3% chance that you actually have diabetes, and about a 91.7% chance that you do not. But don't forget: this is under the assumption that 1% of the population as a whole has diabetes.
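
To see how strongly this conclusion depends on the prior, the sketch below recomputes P(Diabetic|Positive) for a few prevalence values while keeping sensitivity and specificity at 0.9. The function name `posterior_given_positive` and the prevalence values other than 0.01 are assumptions chosen for illustration.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    # P(Positive) = P(D) * sensitivity + P(~D) * (1 - specificity)
    p_positive = prior * sensitivity + (1 - prior) * (1 - specificity)
    # Bayes theorem: P(D|Positive) = P(Positive|D) * P(D) / P(Positive)
    return sensitivity * prior / p_positive

for prior in [0.01, 0.05, 0.2, 0.5]:
    posterior = posterior_given_positive(prior, sensitivity=0.9, specificity=0.9)
    print(f'prior = {prior:.2f} -> P(Diabetic|Positive) = {posterior:.3f}')
```

With the test characteristics held fixed, the posterior rises quickly as the condition becomes more common in the population.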

<hr>

### What does Naive Bayes mean?

The term 'Naive' in naive bayes comes from the assumption that the features that are used in the algorithm to make predictions are independent of each other, which is not always the case.

## Naive Bayes Implementation Example
Let's implement Naive Bayes from scratch.

Let's say we have two political party candidates:
- Trump of the Republican party
- Biden of the Democratic party

Let's say we are looking at the probabilities of each candidate saying one of the following words:
- freedom
- immigration
- environment

Probabilities that the candidate says one of the following words:
- Trump:
    - P(Freedom|Trump): P(F|T) = 0.1
    - P(Immigration|Trump): P(I|T) = 0.1
    - P(Environment|Trump): P(E|T) = 0.8
- Biden:
    - P(Freedom|Biden): P(F|B) = 0.7
    - P(Immigration|Biden): P(I|B) = 0.2
    - P(Environment|Biden): P(E|B) = 0.1

Let's also assume that the probability that Trump or Biden is giving a speech is: 
- P(T) = 0.5
- P(B) = 0.5

Now, given the above, `what is the probability that each candidate gave a speech in which the words 'Freedom' and 'Immigration' were said?`

To answer this, we will use the Naive Bayes Theorem:

>$$ P(y|x_{1}, ... , x_{n}) = \frac{ P(y)P(x_{1},...,x_{n}|y) }{ P(x_{1},...,x_{n}) } $$

- y: the class variable; in our case this is the name of our candidate
- $x_{1},...,x_{n}$: the features; in our case the individual words

The Naive Bayes Theorem makes the assumption that the features/words are independent of each other, given the class.
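
Written out, the independence assumption lets the joint likelihood factor into a product of per-feature likelihoods. This is the standard naive Bayes factorization, stated here with the same symbols as the formula above:

$$ P(x_{1},...,x_{n}|y) = \prod_{i=1}^{n} P(x_{i}|y) $$

$$ P(y|x_{1},...,x_{n}) = \frac{ P(y)\prod_{i=1}^{n} P(x_{i}|y) }{ P(x_{1},...,x_{n}) } $$

For the two-word case in this example, that gives P(F,I|T) = P(F|T)P(I|T).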

The goal of this question is to work out the probability that each candidate gave the speech, given that the words Freedom and Immigration were said. To do this, we need to calculate the posterior probabilities:
- P(T|F,I)
    - probability that the speech was given by Trump, given that the words Freedom and Immigration are mentioned
- P(B|F,I)
    - probability that the speech was given by Biden, given that the words Freedom and Immigration are mentioned

We can use the Naive Bayes Theorem to calculate them:

1. $$ P(T|F,I) = \frac{ P(T)P(F,I|T) }{ P(F,I) } $$

2. $$ P(B|F,I) = \frac{ P(B)P(F,I|B) }{ P(F,I) } $$

In [49]:
# Trump
# P(T) = 0.5 
p_t = 0.5
# P(Freedom|Trump): P(F|T) = 0.1
p_f_given_t = 0.1
# P(Immigration|Trump): P(I|T) = 0.1
p_i_given_t = 0.1


# Numerator of Bayes rule for Trump: P(T) * P(F|T) * P(I|T)
# Note: despite its name, this variable already includes the prior P(T)
p_fi_given_t =  p_t * p_f_given_t * p_i_given_t
print(p_fi_given_t)


0.005000000000000001


In [51]:
# Biden
# P(B) = 0.5 
p_b = 0.5
# P(Freedom|Biden): P(F|B) = 0.7
p_f_given_b = 0.7
# P(Immigration|Biden): P(I|B) = 0.2
p_i_given_b = 0.2


# Numerator of Bayes rule for Biden: P(B) * P(F|B) * P(I|B)
# Note: despite its name, this variable already includes the prior P(B)
p_fi_given_b =  p_b * p_f_given_b * p_i_given_b
print(p_fi_given_b)

In [47]:
# Calculate the probability of both freedom and immigration being said: P(F,I)
# P(F,I) = P(T)P(F|T)P(I|T) + P(B)P(F|B)P(I|B)
p_fi = p_fi_given_t + p_fi_given_b
print(p_fi)

0.075


In [48]:
# P(T|F,I) = P(T)P(F,I|T) / P(F,I)
# p_fi_given_t already includes the prior P(T), so it is not multiplied in again
p_t_given_f_i = p_fi_given_t / p_fi
print(f'P(T|F,I) = {round(p_t_given_f_i, 3)}')
# P(B|F,I) = P(B)P(F,I|B) / P(F,I)
p_b_given_f_i = p_fi_given_b / p_fi
print(f'P(B|F,I) = {round(p_b_given_f_i, 3)}')

P(T|F,I) = 0.067
P(B|F,I) = 0.933
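
The same calculation can be wrapped in a small function that handles any number of candidates and words. This is a minimal sketch using the probabilities given above; the function name `naive_bayes_posteriors` and the dictionary layout are assumptions made for illustration.

```python
def naive_bayes_posteriors(priors, likelihoods, words):
    # Unnormalized score per class: P(class) * product of P(word|class)
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[label][word]
        scores[label] = score
    # Normalize by the evidence so that the posteriors sum to 1
    evidence = sum(scores.values())
    return {label: score / evidence for label, score in scores.items()}

priors = {'Trump': 0.5, 'Biden': 0.5}
likelihoods = {
    'Trump': {'freedom': 0.1, 'immigration': 0.1, 'environment': 0.8},
    'Biden': {'freedom': 0.7, 'immigration': 0.2, 'environment': 0.1},
}

posteriors = naive_bayes_posteriors(priors, likelihoods, ['freedom', 'immigration'])
for label, probability in posteriors.items():
    print(f'P({label}|F,I) = {probability:.3f}')
```

Under these numbers, a speech mentioning both words is far more likely to be Biden's (about 0.933) than Trump's (about 0.067).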