# Naive Bayes Spam Filter, based off some work from AIDevNepal's AI Saturday Workshop #8

In [1]:
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd

%matplotlib inline

In [49]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Naive Bayes Classifier
It is one of the simplest machine learning models for text classification. It uses the distribution of token (word) counts to classify documents, and it is based on the famous **Bayes' theorem**:

$$P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A)}$$

## Text Classification
Say we have a document $D$ that belongs to a class $C$ (spam or ham, for example). Using Bayes' theorem we can write:  

$$P(C \mid D) = \frac{P(D \mid C) P(C)}{P(D)}$$
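To make the formula concrete, here is a minimal sketch with made-up numbers. The priors and likelihoods below are assumptions chosen for illustration, not estimates from any dataset. It computes $P(D)$ by summing over both classes and then applies Bayes' theorem.

```python
# Hypothetical numbers chosen only for illustration.
prior = {"spam": 0.4, "ham": 0.6}            # P(C)
likelihood = {"spam": 0.05, "ham": 0.01}     # P(D | C) for one fixed document D

# Law of total probability: P(D) = sum over classes of P(D | C) P(C)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Bayes' theorem: P(C | D) = P(D | C) P(C) / P(D)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}

for c, p in posterior.items():
    print(f"P({c} | D) = {p:.3f}")
```

Even though ham has the larger prior here, the document is five times more likely under spam, so the posterior favors spam.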

We know a document is made up of tokens (single words or short runs of consecutive words, commonly called **n-grams**):  
$$D = [d_1, d_2, d_3, \dots].$$
So,
$$
P(C \mid D)
= \frac{P(d_1 \mid C) P(d_2 \mid C) P(d_3 \mid C) \cdots P(C)}{P(D)}
$$

Note that we have factored $P(D \mid C)$ into the individual probabilities of the tokens (n-grams) that make up the document $D$. This is why the Naive Bayes classifier is **naive**: it assumes the tokens are independent of one another, given the class.  
  
Think of it as two independent events $A$ and $B$. What's the probability of both events occurring together?  
$$P(A \text{ and } B) = P(A) P(B)$$

Now we can estimate the probabilities $P(d_i \mid C)$ as:  
`(count of d_i in documents of class C) / (total number of tokens in class C)`. This amounts to modeling the token counts of each class with a multinomial distribution.
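As a rough sketch of that counting estimate, the snippet below tallies unigram tokens for a single class and divides by the total number of tokens in that class. The two-document corpus and the variable names are made up for illustration, and tokens are split on whitespace only.

```python
from collections import Counter

# Tiny made-up corpus for one class (an assumption for illustration).
spam_docs = ["you have won", "you have a bonus"]

# Count every unigram token that appears in documents of this class.
token_counts = Counter(tok for doc in spam_docs for tok in doc.split())
total_tokens = sum(token_counts.values())

# P(d_i | C) = count of d_i in class C / total tokens in class C
p_token_given_spam = {tok: n / total_tokens for tok, n in token_counts.items()}

print(p_token_given_spam["you"])  # 2 of the 7 spam tokens are "you"
```

Because every count is divided by the same total, the estimated probabilities for the class sum to 1.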

### Putting Things Into Perspective
And that is how we find the probability of document $D$ belonging to class $C$, assuming the individual features (n-grams) are independent.

Now, say we have classes
$C_1, C_2, C_3, \dots$
and we want to classify a test document $D$. All we have to do is find the probability of this document $D$
belonging to each of the classes, and choose the class for which $P(C \mid D)$ is highest.

#### Training Steps
It's nothing but counting the "stuff" that matters.
- tokenize the documents of each class (words, or pairs of words, etc.)
- each token can be a unigram or a [bigram](https://en.wikipedia.org/wiki/Bigram)
- extract features for each token -> counts
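The tokenize-and-count steps can be sketched in plain Python. The helper `ngrams` below is hypothetical (it is not part of any library) and splits on whitespace only, so its output will not match a real vectorizer exactly.

```python
from collections import Counter


def ngrams(text, n):
    """Return the list of word n-grams in `text` (whitespace tokenization)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


doc = "please click the link"

# Features are the counts of every unigram and bigram in the document.
features = Counter(ngrams(doc, 1) + ngrams(doc, 2))

print(ngrams(doc, 1))
print(ngrams(doc, 2))
print(features)
```

A four-word document yields four unigrams and three bigrams, so the feature space grows quickly as longer n-grams are added.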

#### Let's classify
- extract features (count) for the document to be classified
- calculate $P(C_1 \mid D)$
- calculate $P(C_2 \mid D)$
- calculate $P(C_3 \mid D)$
- choose the class $C_i$ that has the highest probability

**Side note**:  
Since $P(D)$ is the same for every class, we can ignore the denominator and just compare the numerators.  

So, all we are doing is choosing the class $C_i$ according to $\hat{C} = \arg\max_{C_i} P(C_i \mid D)$.
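Here is a minimal from-scratch sketch of that decision rule, under a few assumptions that go beyond the text above: it works with unigrams only, it sums log-probabilities instead of multiplying raw probabilities (to avoid numerical underflow), and it adds a smoothing constant `alpha` to every count so that a token never seen in a class does not force the whole product to zero. The function names `train` and `classify` and the four training sentences are made up for illustration.

```python
import math
from collections import Counter


def train(docs_by_class, alpha=1.0):
    """Estimate log P(C) and smoothed log P(token | C) from token counts."""
    vocab = {tok for docs in docs_by_class.values() for d in docs for tok in d.split()}
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    log_prior, log_lik = {}, {}
    for c, docs in docs_by_class.items():
        counts = Counter(tok for d in docs for tok in d.split())
        total = sum(counts.values())
        log_prior[c] = math.log(len(docs) / n_docs)
        # Additive smoothing (an assumption of this sketch): add alpha to each count.
        denom = total + alpha * len(vocab)
        log_lik[c] = {t: math.log((counts[t] + alpha) / denom) for t in vocab}
    return log_prior, log_lik


def classify(doc, log_prior, log_lik):
    """Return the class maximizing log P(C) + sum of log P(token | C)."""
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are skipped.
        known = [t for t in doc.split() if t in log_lik[c]]
        scores[c] = log_prior[c] + sum(log_lik[c][t] for t in known)
    return max(scores, key=scores.get)


training = {
    "spam": ["you have won a lottery", "please click the link"],
    "ham": ["i have a meeting tomorrow", "call me later"],
}
log_prior, log_lik = train(training)
print(classify("you have won", log_prior, log_lik))
print(classify("call me tomorrow", log_prior, log_lik))
```

The denominator $P(D)$ never appears in the code, since it is the same for every class and does not change which class scores highest.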

In [3]:
# noob documents for training :P
spam = [
    "you have won a lottery",
    "congratulations! you have a bonus",
    "this is bomb",
    "to use the credit, please click the link",
    "thank you for subscription. please click the link",
    "bomb"
]
Y_spam = [1] * len(spam)

non_spam = [
    "i am awesome",
    "i have a meeting tomorrow",
    "you are smart",
    "get me out of here",
    "call me later"
]
Y_non_spam = [0] * len(non_spam)

In [5]:
print(non_spam)

['i am awesome', 'i have a meeting tomorrow', 'you are smart', 'get me out of here', 'call me later']


In [6]:
print(Y_non_spam)

[0, 0, 0, 0, 0]


In [7]:
spam+non_spam

['you have won a lottery',
 'congratulations! you have a bonus',
 'this is bomb',
 'to use the credit, please click the link',
 'thank you for subscription. please click the link',
 'bomb',
 'i am awesome',
 'i have a meeting tomorrow',
 'you are smart',
 'get me out of here',
 'call me later']

In [4]:
# feature extraction
count_vectorizer = CountVectorizer(ngram_range=(1, 2)).fit(spam + non_spam)
X_train_vectorized = count_vectorizer.transform(spam + non_spam)

In [8]:
count_vectorizer.get_feature_names_out()

array(['am', 'am awesome', 'are', 'are smart', 'awesome', 'bomb', 'bonus',
       'call', 'call me', 'click', 'click the', 'congratulations',
       'congratulations you', 'credit', 'credit please', 'for',
       'for subscription', 'get', 'get me', 'have', 'have bonus',
       'have meeting', 'have won', 'here', 'is', 'is bomb', 'later',
       'link', 'lottery', 'me', 'me later', 'me out', 'meeting',
       'meeting tomorrow', 'of', 'of here', 'out', 'out of', 'please',
       'please click', 'smart', 'subscription', 'subscription please',
       'thank', 'thank you', 'the', 'the credit', 'the link', 'this',
       'this is', 'to', 'to use', 'tomorrow', 'use', 'use the', 'won',
       'won lottery', 'you', 'you are', 'you for', 'you have'],
      dtype=object)

In [8]:
print(X_train_vectorized.toarray())

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1]
 [0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
  0 0 1 1 0 0 0 0 0 2 1 1 0 0 1 1 0 1 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
  0 0 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0

In [5]:
# Naive Bayes Model
model = MultinomialNB(alpha=0.1)
model.fit(X_train_vectorized, Y_spam + Y_non_spam)

MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

In [6]:
documents = [
    "call you",
    "you have won"
]
predictions = model.predict(count_vectorizer.transform(documents))
print(predictions)

[0 1]


In [9]:
# convert to pandas dataframe for seamless training
spam_df = pd.DataFrame(spam, columns=['text'])
spam_df['target'] = 1
non_spam_df = pd.DataFrame(non_spam, columns=['text'])
non_spam_df['target'] = 0

# final data
data = pd.concat([spam_df, non_spam_df], ignore_index=True)

# feature extraction
count_vectorizer = CountVectorizer(ngram_range=(1, 2)).fit(data['text'])
X_train_vectorized = count_vectorizer.transform(data['text'])

# train on the labels stored in the dataframe
model = MultinomialNB(alpha=0.1)
model.fit(X_train_vectorized, data['target'])
documents = [
    "call you",
    "you have won"
]
predictions = model.predict(count_vectorizer.transform(documents))
print(predictions)

[0 1]


### Let's do real training

In [22]:
# load training set
data = pd.read_csv('data/spam.csv')
data.head()
# the csv has two columns, text and target, where target is spam or ham. Convert the target to 1 or 0,
# your code here

Unnamed: 0,text,target
0,"Go until jurong point, crazy.. Available only ...",ham
1,Ok lar... Joking wif u oni...,ham
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,U dun say so early hor... U c already then say...,ham
4,"Nah I don't think he goes to usf, he lives aro...",ham


In [None]:
data['target'] = np.where(data['target'] == 'spam', 1, 0)
print(len(data))
data.head(10)

In [None]:
#use scikit-learn's train_test_split to split the data into training and test sets
X_train, X_test, Y_train, Y_test = # your code here. Use test_size=0.2 and random_state=42

In [None]:
# fit and transform a CountVectorizer to the training data as we did in the example above.
count_vectorizer = #your code here.
X_train_vectorized = #your code here.
X_train_vectorized.toarray().shape # should be (4457, 42915)

In [None]:
# create a Multinomial Naive Bayes model as we did above. Use alpha=0.1, but also try other values
model = #your code here
#fit the model
#your code here.

In the space below, [read about different ways to score classifiers](https://scikit-learn.org/stable/api/sklearn.metrics.html). Then, score your model using at least 4 different metrics, one of which should be a confusion matrix. The imports at the top of the notebook already include several scoring functions, but there are others you can use. Read about each metric, and interpret your score in the context of the model.
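As a reminder of what the most common metrics measure, here is a small sketch that computes them by hand. The labels below are made up for illustration and have nothing to do with the dataset in this notebook. For your model, use the scikit-learn functions instead.

```python
# Made-up labels for illustration only (1 = spam, 0 = ham).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed through

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam?
recall = tp / (tp + fn)      # of the actual spam, how much was caught?
f1 = 2 * precision * recall / (precision + recall)

# Same layout scikit-learn uses: rows are true labels, columns are predictions.
confusion = [[tn, fp], [fn, tp]]
print(confusion)
print(accuracy, precision, recall, f1)
```

For a spam filter, a false positive (a legitimate message sent to the spam folder) is often more costly than a false negative, which is one reason to look at precision and recall separately rather than at accuracy alone.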

In [None]:
# Calculate predictions as we did above. Then, compare the predictions to the actual values in Y_test using at least 4 different metrics, one of which should be a confusion matrix.
predictions = #your code here

Now, have some fun and imagine your own legit and fraudulent emails, and see how the classifier you built does.

In [45]:
test_docs = [
    "you have won a lottery",
    "click the link",
    "Hi Rashid, how about lunch at the dining hall tomorrow to talk this over?",
]
predictions = model.predict(count_vectorizer.transform(test_docs))
predictions

array([1, 1, 0])

Now, write some concluding thoughts to your future self. What have you learned in doing this activity? Why does Naive Bayes with a multinomial distribution work well in this setting? Why aren't we using a more complicated algorithm? What are the pros and cons of the approach we took?