# Processing Text Part I

Text is one of the most widespread forms of sequence data. It can be understood as either a sequence of characters or a sequence of words, but it is most common to work at the level of words. Applications that work with text sequences include document classification, sentiment analysis, author identification, and even question answering (QA) in a constrained context.

As expected, a model is not going to work if the input is raw text, so we have to convert text into something that computers can handle, that is, numbers. There are several options for this, such as **Bag of Words** and **Term Frequency-Inverse Document Frequency (TF-IDF)**. 

In this notebook we will talk about these two approaches and see how they can be used for detecting spam emails. Before using them, however, we need to process the text data. As usual, we will begin by importing some modules and libraries.

In [2]:
import pandas as pd
import numpy as np
import re
import nltk

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
from sklearn.model_selection import train_test_split

Now we need to load the data we will work with. This data can be downloaded from `https://www.kaggle.com/uciml/sms-spam-collection-dataset`.

In [3]:
data = pd.read_csv('spam.csv', encoding='latin-1')
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


## Cleaning the data

Before going any further, it is clear that our data needs some cleaning. For instance, the **unnamed columns** can be removed. Speaking of columns, some renaming would be desirable for the sake of clarity. Also, we would like to use a binary variable for categorizing the emails: 0 for **not spam** and 1 for **spam**.

In [4]:
data_clean = data.copy()  # work on a copy so the raw DataFrame stays untouched
data_clean['spam'] = data_clean['v1'].map({'ham': 0, 'spam': 1})
data_clean = data_clean.drop(columns=['v1', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])
data_clean = data_clean.rename(columns={'v2': 'email'})
data_clean.head()

Unnamed: 0,email,spam
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0


It looks nicer, doesn't it? But this is just the beginning. At this point we need to process the emails and turn them into something that our model will "digest" much more easily. To do this we need some **Natural Language Processing** (NLP). According to Wikipedia, "NLP is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."

## Text Processing

Text preprocessing is crucial before building a proper NLP model. Here are the important steps we are going to carry out:

1. Converting words to lower case.
2. Removing special characters.
3. Removing stopwords.
4. Stemming and lemmatization.

More on steps three and four later. For now let us proceed with step number one.

### Lower case and special characters

In [5]:
data_clean['email'] = data_clean['email'].apply(lambda x : x.lower())
data_clean

Unnamed: 0,email,spam
0,"go until jurong point, crazy.. available only ...",0
1,ok lar... joking wif u oni...,0
2,free entry in 2 a wkly comp to win fa cup fina...,1
3,u dun say so early hor... u c already then say...,0
4,"nah i don't think he goes to usf, he lives aro...",0
...,...,...
5567,this is the 2nd time we have tried 2 contact u...,1
5568,will ì_ b going to esplanade fr home?,0
5569,"pity, * was in mood for that. so...any other s...",0
5570,the guy did some bitching but i acted like i'd...,0


Let us do step number two:

In [6]:
data_clean['email'] = data_clean['email'].apply(lambda x : re.sub('[^a-z0-9 ]+', ' ', x))
data_clean

Unnamed: 0,email,spam
0,go until jurong point crazy available only i...,0
1,ok lar joking wif u oni,0
2,free entry in 2 a wkly comp to win fa cup fina...,1
3,u dun say so early hor u c already then say,0
4,nah i don t think he goes to usf he lives aro...,0
...,...,...
5567,this is the 2nd time we have tried 2 contact u...,1
5568,will b going to esplanade fr home,0
5569,pity was in mood for that so any other sug...,0
5570,the guy did some bitching but i acted like i d...,0


Notice that we have assumed that it is "safe" to turn the characters of the emails into lower case letters and that special characters do not possess relevant information. This may be okay for this type of application, but for, say, sentiment analysis, we might need to reconsider this, since special characters like exclamation points are used to convey certain emotions.
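To make this trade-off concrete, here is a minimal sketch of the cleaning step with an optional whitelist of extra characters. The `clean_text` helper and its `keep` parameter are hypothetical names introduced for illustration; they are not part of the pipeline used in this notebook.

```python
import re

def clean_text(text, keep=""):
    """Lower-case `text` and replace runs of disallowed characters with a space.

    `keep` is a hypothetical parameter listing extra characters to preserve.
    """
    pattern = "[^a-z0-9 " + re.escape(keep) + "]+"
    return re.sub(pattern, " ", text.lower())

sample = "WINNER!! Claim your £900 prize now..."

plain = clean_text(sample)                 # same rule as the cells above
with_marks = clean_text(sample, keep="!")  # exclamation marks survive

print(plain.split())
print(with_marks.split())
```

With `keep="!"` the exclamation marks are preserved, which might matter for sentiment analysis. The replacement can leave consecutive spaces behind; a plain `str.split()` ignores them.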

### Stop words

At this point you might be wondering, "What are stop words?" These are words that are encountered very frequently in a given language but carry little useful information, so it is good practice to remove them. Before doing this, let us take a look at the stop words of the English language:

In [13]:
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Now onto removing stop words.

In [18]:
def remove_stop_words(message):
    
    words = word_tokenize(message)
    words = [word for word in words if word not in stop_words]
    
    return words

In [19]:
data_clean['email'] = data_clean['email'].apply(remove_stop_words)
data_clean

Unnamed: 0,email,spam
0,"[go, jurong, point, crazy, available, bugis, n...",0
1,"[ok, lar, joking, wif, u, oni]",0
2,"[free, entry, 2, wkly, comp, win, fa, cup, fin...",1
3,"[u, dun, say, early, hor, u, c, already, say]",0
4,"[nah, think, goes, usf, lives, around, though]",0
...,...,...
5567,"[2nd, time, tried, 2, contact, u, u, 750, poun...",1
5568,"[b, going, esplanade, fr, home]",0
5569,"[pity, mood, suggestions]",0
5570,"[guy, bitching, acted, like, interested, buyin...",0


Notice that apart from removing stop words we did something else. That "something else" is called **tokenization**, which is defined as splitting a text into small units known as **tokens**. We might think that this is as simple as splitting a text at every space between words, but the process is more involved than that. The function `word_tokenize` is clever enough to do things such as this (compare its output with that of a plain `str.split()`):

In [14]:
word_tokenize("There's something I'd like to know, dude.")

['There', "'s", 'something', 'I', "'d", 'like', 'to', 'know', ',', 'dude', '.']

In [17]:
"There's something I'd like to know, dude.".split()

["There's", 'something', "I'd", 'like', 'to', 'know,', 'dude.']
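As a side note on the stop-word filter defined earlier: `stop_words` is a Python list, so every `word not in stop_words` test scans the whole list. A possible refinement, sketched below, is to store the stop words in a set, where membership tests take constant time on average. The tiny `STOP_WORDS` subset and the `remove_stop_words_from_tokens` helper are illustrative assumptions; in the notebook one could build the set with `set(stopwords.words('english'))` instead.

```python
# Tiny hypothetical subset of the English stop words, for illustration only.
STOP_WORDS = frozenset({"i", "me", "the", "to", "a", "is", "in", "until", "only"})

def remove_stop_words_from_tokens(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving their order."""
    return [token for token in tokens if token not in stop_words]

tokens = "go until jurong point crazy available only".split()
filtered = remove_stop_words_from_tokens(tokens)
print(filtered)
```

The result of the filter is the same as with a list; only the lookup cost changes.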

### Stemming and lemmatization

It is natural that in any language we will use variations of the same word, e.g., "run", "ran", and "running". These variations are called **inflections**. Moreover, there are families of derivationally related words with similar meanings, such as "democracy", "democratic", and "democratization". The goal of both stemming and lemmatization is to reduce inflections or derivationally related forms of a word to a common base form. For instance:

*Lemmatization:* am, are, is $\Rightarrow$ be.

*Stemming:* car, cars $\Rightarrow$ car.

Stemming is considered a crude heuristic process that chops off parts of a word by taking into account common prefixes and suffixes. Lemmatization, on the other hand, takes into consideration the grammar of the word and attempts to find the root word (its lemma).

In [20]:
#nltk.download('wordnet')

Porter = PorterStemmer()
Lemma = WordNetLemmatizer()

print(Porter.stem("car"))
print(Porter.stem("cars"))

print(Lemma.lemmatize("am", wordnet.VERB))
print(Lemma.lemmatize("are", wordnet.VERB))
print(Lemma.lemmatize("is", wordnet.VERB))

car
car
be
be
be
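To get a feeling for why stemming is called crude, the following toy suffix stripper removes a few common English suffixes. It is a deliberately simplified illustration, not the Porter algorithm used in this notebook; the suffix list and the minimum stem length are arbitrary assumptions.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word

examples = ["cars", "jumped", "running", "ran", "is"]
stems = {word: toy_stem(word) for word in examples}
print(stems)
```

The sketch maps "cars" to "car" but turns "running" into "runn" and leaves the irregular form "ran" untouched. Relating "ran" to "run" requires the vocabulary and grammatical knowledge that lemmatization relies on.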


For this application, we will stick to *stemming*.

In [21]:
data_clean['email'] = data_clean['email'].apply(lambda x : [Porter.stem(word) for word in x])
data_clean

Unnamed: 0,email,spam
0,"[go, jurong, point, crazi, avail, bugi, n, gre...",0
1,"[ok, lar, joke, wif, u, oni]",0
2,"[free, entri, 2, wkli, comp, win, fa, cup, fin...",1
3,"[u, dun, say, earli, hor, u, c, alreadi, say]",0
4,"[nah, think, goe, usf, live, around, though]",0
...,...,...
5567,"[2nd, time, tri, 2, contact, u, u, 750, pound,...",1
5568,"[b, go, esplanad, fr, home]",0
5569,"[piti, mood, suggest]",0
5570,"[guy, bitch, act, like, interest, buy, someth,...",0


## Training and testing sets

When we are developing a model we do not use all of our data for training. Instead, we divide the data we possess into two sets: the training set and the testing set. A general rule of thumb is to use 80% of the data for training and 20% for testing our model. There are variations of this depending on the circumstances, but, in general, this is a good starting point. By the way, the examples in our training data should be picked randomly to avoid bias; it is not good practice to pick them in a deterministic fashion, for instance by taking the first 80% of the rows.

In [22]:
X = data_clean
y = X['spam']
X = X.drop(columns='spam')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
train_set = X_train.copy()  # copy so that adding a column does not touch a slice of X
train_set['spam'] = y_train
test_set = X_test.copy()
test_set['spam'] = y_test

print(train_set.shape)
print(test_set.shape)

(4457, 2)
(1115, 2)
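Because spam is the minority class in this dataset, a purely random split can leave the training and testing sets with slightly different class proportions. scikit-learn's `train_test_split` accepts a `stratify` argument for this purpose. The pure-Python sketch below only illustrates the idea on toy labels; `stratified_split` is a hypothetical helper and the 20/80 toy labels are made up.

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so that each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random pick within each class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [1] * 20 + [0] * 80                      # toy labels: 20 spam, 80 non-spam
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

In this toy case the testing set receives 4 of the 20 spam labels and 16 of the 80 non-spam labels, so both sets keep the original 20% spam share.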


## Naive Bayes Classifier

Let us talk about emails now. Let $D=\{d_1,d_2,\dots,d_k\}$ be a set of documents, and let $W=\{w_1,w_2,\dots,w_m\}$ be the set of distinct words contained in $D$. Let an email $d\in D$ be a set of words that belong to $W$: $d=\{w_1,w_2,\dots,w_n\}$. If we want to know the probability that the email $d$ is spam, we can use Bayes' Theorem:

$$
\begin{align}
P(spam|d)&=\frac{P(d|spam)P(spam)}{P(d)}\\
\\
&=\frac{P(w_1\cap w_2\cap\cdots\cap w_n|spam)P(spam)}{P(w_1\cap w_2\cap\cdots\cap w_n)}\\
\\
&=\frac{P(w_1\cap w_2\cap\cdots\cap w_n|spam)P(spam)}{P(w_1\cap w_2\cap\cdots\cap w_n|spam)P(spam)+P(w_1\cap w_2\cap\cdots\cap w_n|not~spam)P(not~spam)}.
\end{align}
$$

At this point it is a good idea to focus our attention on the numerator of the last expression. Notice that we have $P(w_1\cap w_2\cap\cdots\cap w_n|spam)P(spam)$, which is equal to the joint probability $P(w_1\cap w_2\cap\cdots\cap w_n\cap spam)$. By the multiplication rule, this expression can be rewritten as follows:

$$
\begin{align}
P(w_1\cap w_2\cap\cdots\cap w_n\cap spam) = P(spam)P(w_1|spam)P(w_2|w_1\cap spam)\cdots P(w_n|\cap_{i=1}^{n-1}w_i\cap spam).
\end{align}
$$

Here comes the "naive assumption": we assume that all features of the model, in this case the words of the email, are **mutually conditionally independent** given the spam category:

$$
\begin{align}
P(w_i|w_{i+1}\cap\cdots\cap w_n\cap spam) = P(w_i|spam).
\end{align}
$$

What this expression is telling us is that the probability of having word $w_i$ in a spam message is not affected by the presence of the set of words $\{w_{i+1},\dots,w_n\}$ in said message; all we need to consider is that the email is spam. Consider the sentence "we need your info" and assume that we know we are dealing with an email that is spam. Then, if the naive assumption is true, we would have:

$$
\begin{align}
P(\text{need}|\text{we}\cap\text{your}\cap\text{info}\cap spam) = P(\text{need}|spam).
\end{align}
$$

However, this is not usually true. What we have, in general, is this:

$$
\begin{align}
P(\text{need}|\text{we}\cap\text{your}\cap\text{info}\cap spam) \neq P(\text{need}|spam).
\end{align}
$$

For this reason we say that this assumption is naive. Nevertheless, in practice, this classifier works very well in many situations.

Let us go back to the numerator. Taking into account our naive premise, the joint probability distribution can be expressed as

$$
\begin{align}
P(w_1\cap w_2\cap\cdots\cap w_n\cap spam) = P(spam)P(w_1|spam)P(w_2|spam)\cdots P(w_n|spam).
\end{align}
$$

Therefore, the probability that a given message $d=\{w_1,w_2,\dots,w_n\}$ is spam can be computed with this expression:

$$
\begin{align}
P(spam|w_1\cap w_2\cap\cdots\cap w_n) = \frac{P(w_1|spam)P(w_2|spam)\cdots P(w_n|spam)P(spam)}{P(w_1\cap w_2\cap\cdots\cap w_n)}.
\end{align}
$$

You might be asking: how can we classify an email as spam with all this? There are two options: the **Probabilistic Model** and the **Maximum A Posteriori Model (MAP)**.

#### Probabilistic Model 

Given a threshold $p$, we classify an email as spam if this condition holds:

$$
\begin{align}
P(spam|w_1\cap w_2\cap\cdots\cap w_n) > p.
\end{align}
$$

#### Maximum A Posteriori Model (MAP)

An email is categorized as spam if 

$$
\begin{align}
P(spam|w_1\cap w_2\cap\cdots\cap w_n) > P(not~spam|w_1\cap w_2\cap\cdots\cap w_n),
\end{align}
$$

which is equivalent to

$$
\begin{align}
P(w_1|spam)P(w_2|spam)\cdots P(w_n|spam)P(spam) > P(w_1|not~spam)P(w_2|not~spam)\cdots P(w_n|not~spam)P(not~spam).
\end{align}
$$

Notice that it is not necessary to calculate $P(w_1\cap w_2\cap\cdots\cap w_n)$, since both posteriors share this denominator and it cancels in the comparison. For classifying emails we will employ this method.
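The two decision rules can disagree on the same email. The sketch below works through the message "we need your info" with invented numbers: the priors, the word probabilities, and the threshold of 0.9 are all hypothetical and are not estimated from the dataset.

```python
from math import prod

# Hypothetical priors and conditional word probabilities (illustration only).
p_spam, p_not_spam = 0.2, 0.8
p_word_given_spam = {"we": 0.02, "need": 0.03, "your": 0.05, "info": 0.04}
p_word_given_not_spam = {"we": 0.03, "need": 0.02, "your": 0.02, "info": 0.005}

email = ["we", "need", "your", "info"]

# Numerators of Bayes' theorem under the naive assumption.
score_spam = p_spam * prod(p_word_given_spam[w] for w in email)
score_not_spam = p_not_spam * prod(p_word_given_not_spam[w] for w in email)

# MAP rule: compare the numerators; the shared denominator is never needed.
is_spam_map = score_spam > score_not_spam

# Probabilistic rule: normalize to get the posterior and compare with a threshold.
posterior_spam = score_spam / (score_spam + score_not_spam)
is_spam_threshold = posterior_spam > 0.9

print(is_spam_map, round(posterior_spam, 4), is_spam_threshold)
```

With these numbers the posterior is about 0.83, so the MAP rule labels the email as spam while the probabilistic rule with a threshold of 0.9 does not. A high threshold trades recall for fewer false positives.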


## Training the Model

Let $W_{\text{t}}$ be the set that contains all the words that belong to the training set. As expected, $W_{\text{t}}=W_{\text{t-~s}}~\cup W_{\text{t-s}}$, where $W_{\text{t-~s}}~$ and $W_{\text{t-s}}~$ are the subsets that contain the words of non-spam and spam emails, respectively. Note that these two subsets generally overlap, since the same word can appear in both kinds of email. In the training phase we need to compute the following probabilities for the training set:

$$
\begin{align}
P(w_i|spam), & ~\forall w_i\in W_{\text{t-s}}\\
\\
P(w_i|not~spam), & ~\forall w_i\in W_{\text{t-~s}}.
\end{align}
$$

### Bag of Words

Notice that the previous probabilities can be computed as follows:

$$
\begin{align}
P(w_i|spam)=\frac{\text{number of occurrences of $w_i$ in spam emails}}{\text{total number of words of spam emails}}.
\end{align}
$$

Similarly, 

$$
\begin{align}
P(w_i|not~spam)=\frac{\text{number of occurrences of $w_i$ in non-spam emails}}{\text{total number of words of non-spam emails}}.
\end{align}
$$

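Also, we need to calculate the priors $P(spam)$ and $P(not~spam)$, which are estimated as the fraction of training emails in each class. Writing $N_{spam}$ and $N_{not~spam}$ for the number of spam and non-spam emails in the training set, we have

$$
\begin{align}
P(spam)&=\frac{N_{spam}}{N_{spam}+N_{not~spam}}\\
\\
P(not~spam)&=\frac{N_{not~spam}}{N_{spam}+N_{not~spam}}.
\end{align}
$$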

This way of computing the probabilities is based on the **Bag of Words** model, in which we are interested in the frequency of each word in a corpus, without taking into consideration either grammar or word order.

In [23]:
p_spam = train_set[train_set['spam'] == 1].shape[0] / train_set.shape[0]
p_spam

0.13394660085259144

In [24]:
p_not_spam = train_set[train_set['spam'] == 0].shape[0] / train_set.shape[0]
p_not_spam

0.8660533991474085

In [25]:
def bag_of_words(corpus):
    
    """
    This function receives a corpus that contains a set of messages and returns 
    a dictionary that maps each unique word to its number of occurrences 
    in the corpus. 
    """
    bag_of_words = {}
    
    for email in corpus:
        for word in email:
            if word not in bag_of_words:
                bag_of_words[word] = 1
            else:
                bag_of_words[word] += 1
    
    return bag_of_words    

In [26]:
def probability_words(df):
    
    baggie_of_words = bag_of_words(df['email'])
    
    number_of_words = df['email'].apply(len).sum()
    
    probability_words = {}
    
    for item in baggie_of_words.items():
        probability_words[item[0]] = item[1] / number_of_words
        
    return probability_words   
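The two helpers above can be condensed with `collections.Counter` from the standard library. The following sketch is intended to be equivalent in behavior for a corpus given as an iterable of token lists; the name `word_probabilities` and the toy corpus are assumptions made for illustration.

```python
from collections import Counter

def word_probabilities(corpus):
    """Map each word to its relative frequency over all tokens in the corpus."""
    counts = Counter(word for email in corpus for word in email)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

toy_corpus = [["free", "entri", "win"], ["win", "cash", "now"]]
toy_probs = word_probabilities(toy_corpus)
print(toy_probs)
```

In the toy corpus, "win" occurs twice among six tokens, so its estimated probability is 2/6, and the probabilities of all words add up to one.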

In [27]:
probability_spam_words = probability_words(train_set[train_set['spam'] == 1])
probability_spam_words

{'bank': 0.00018293240647580719,
 'granit': 0.00018293240647580719,
 'issu': 0.00018293240647580719,
 'strong': 0.0002743986097137108,
 'buy': 0.00036586481295161437,
 'explos': 0.00018293240647580719,
 'pick': 0.00036586481295161437,
 'member': 0.0008231958291411324,
 '300': 0.00036586481295161437,
 'nasdaq': 0.00018293240647580719,
 'symbol': 0.00018293240647580719,
 'cdgt': 0.00018293240647580719,
 '5': 0.0011890606420927466,
 '00': 0.0007317296259032287,
 'per': 0.003658648129516144,
 'privat': 0.000914662032379036,
 '2003': 0.0007317296259032287,
 'account': 0.0010061282356169396,
 'statement': 0.000914662032379036,
 '07973788240': 9.146620323790359e-05,
 'show': 0.0023781212841854932,
 '800': 0.0014634592518064575,
 'un': 0.0007317296259032287,
 'redeem': 0.0007317296259032287,
 'point': 0.000914662032379036,
 'call': 0.027073996158419465,
 '08715203649': 9.146620323790359e-05,
 'identifi': 0.000914662032379036,
 'code': 0.002012256471233879,
 '40533': 0.00018293240647580719,
 'e

In [28]:
probability_non_spam_words = probability_words(train_set[train_set['spam'] == 0])
probability_non_spam_words

{'boat': 9.544109693634079e-05,
 'still': 0.003944898673368753,
 'mom': 0.0005408328826392645,
 'check': 0.001336175357108771,
 'yo': 0.0010180383673209683,
 'half': 0.0005726465816180447,
 'nake': 9.544109693634079e-05,
 'r': 0.003404065790729488,
 'give': 0.0025769096172812015,
 'second': 0.0005090191836604842,
 'chanc': 0.00022269589285146184,
 'rahul': 3.181369897878026e-05,
 'dengra': 3.181369897878026e-05,
 'play': 0.0007953424744695065,
 'smash': 6.362739795756052e-05,
 'bro': 6.362739795756052e-05,
 'lt': 0.00849425762733433,
 'gt': 0.00849425762733433,
 'religi': 3.181369897878026e-05,
 'g': 0.0005090191836604842,
 'say': 0.003849457576432412,
 'never': 0.0011134794642573091,
 'answer': 0.0005090191836604842,
 'text': 0.00238602742340852,
 'confirm': 0.0002863232908090224,
 'deni': 0.00012725479591512104,
 'okey': 9.544109693634079e-05,
 'dokey': 6.362739795756052e-05,
 'bit': 0.00120892056119365,
 'sort': 0.0004453917857029237,
 'stuff': 0.0011134794642573091,
 'come': 0.0076

### Term Frequency-Inverse Document Frequency (TF-IDF)

Bag of Words is not the only model at our disposal. Another popular option is the **Term Frequency-Inverse Document Frequency (TF-IDF)** model, which is based on information theory. In this approach, the probability of term $t\in W$ is estimated by multiplying the **Term Frequency** and the **Inverse Document Frequency** of $t$ and normalizing over the vocabulary:

$$P(t)=\frac{TF(t)IDF(t)}{\sum_{\hat{t}\in W}TF(\hat{t})IDF(\hat{t})},$$

where 

$$TF(t)=\frac{\text{number of occurrences of $t$ in $D$}}{\text{total number of words in $D$}},$$

and

$$IDF(t)=\log\left(\frac{|D|}{|\{d\in D:t\in d\}|}\right).$$

Notice that $|\{d\in D:t\in d\}|$ is the number of documents in $D$ that contain the term $t$.

Say we have three documents and suppose that the term "donut" appears in all of them. Then, the IDF of "donut" would be equal to

$$IDF(donut)=\log\left(\frac{3}{3}\right)=0.$$

That is, if a term is very common across documents, it is considered that it will not carry a lot of information. In other words, we tend to filter common terms. By the way, to avoid division by zero, it is common practice to do this:

$$IDF(t)=\log\left(\frac{|D|}{|\{d\in D:t\in d\}|+1}\right).$$

Given the latter, one way in which we can estimate $P(w_i|spam)$ is the following:

$$P(w_i|spam)=\frac{TF(w_i|spam)IDF(w_i)}{\sum_{\hat{w}_i\in W_t}TF(\hat{w}_i|spam)IDF(\hat{w}_i)}.$$

Similarly, we have that

$$P(w_i|not~spam)=\frac{TF(w_i|not~spam)IDF(w_i)}{\sum_{\hat{w}_i\in W_t}TF(\hat{w}_i|not~spam)IDF(\hat{w}_i)}.$$

In case you want to know more, this is a good starting point: https://en.wikipedia.org/wiki/Tf–idf.
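The "donut" example can be checked numerically. The sketch below computes the IDF with and without the added one in the denominator; the `idf` helper, its `add_one` flag, and the three toy documents are assumptions made for illustration.

```python
import math

def idf(term, documents, add_one=False):
    """Inverse document frequency of `term`; `add_one` adds 1 to the denominator."""
    containing = sum(term in document for document in documents)
    return math.log(len(documents) / (containing + (1 if add_one else 0)))

# Three toy documents; "donut" appears in all of them.
documents = [["donut", "coffee", "shop"], ["donut", "glaze"], ["donut", "coffee"]]

idf_donut = idf("donut", documents)                       # log(3/3)
idf_glaze = idf("glaze", documents)                       # log(3/1)
idf_donut_add_one = idf("donut", documents, add_one=True)  # log(3/4)

print(idf_donut, idf_glaze, idf_donut_add_one)
```

Two details are worth noticing. Without the added one, a term that appears in no document raises a `ZeroDivisionError`. With the added one, a term that appears in every document gets a slightly negative IDF, which is why some variants of the formula apply further adjustments.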

In [29]:
def tf_idf(df):
    
    tf = probability_words(df)    
    
    number_of_emails = df.shape[0]
    tf_idf = {}
    
    for word in tf:
        tf_idf[word] = tf[word] * np.log(number_of_emails / (df['email'].apply(lambda x: word in x).sum()))
        
    return tf_idf        

In [30]:
def probability_words_tf_idf(df):
    
    tf_idf_words = tf_idf(df)
    
    total_tf_idf = sum(tf_idf_words.values())
    
    probability_words_tf_idf = {}
    
    for item in tf_idf_words.items():
        probability_words_tf_idf[item[0]] = item[1] / total_tf_idf
        
    return probability_words_tf_idf

In [31]:
probability_spam_words_tf_idf = probability_words_tf_idf(train_set[train_set['spam'] == 1])
probability_spam_words_tf_idf

{'bank': 0.0002735847594113102,
 'granit': 0.0002735847594113102,
 'issu': 0.0002735847594113102,
 'strong': 0.0003811789765242756,
 'buy': 0.0004806167339707286,
 'explos': 0.0002735847594113102,
 'pick': 0.0004806167339707286,
 'member': 0.0009061986758780009,
 '300': 0.0004806167339707286,
 'nasdaq': 0.0002735847594113102,
 'symbol': 0.0002735847594113102,
 'cdgt': 0.0002735847594113102,
 '5': 0.0012191824634012309,
 '00': 0.0008281278982376738,
 'per': 0.0026196406694686176,
 'privat': 0.0009815968447050474,
 '2003': 0.0008281278982376738,
 'account': 0.0010545906124861594,
 'statement': 0.0009815968447050474,
 '07973788240': 0.00015343057591862803,
 'show': 0.001955817177624579,
 '800': 0.0013900446570677806,
 'un': 0.0008281278982376738,
 'redeem': 0.0008281278982376738,
 'point': 0.0009815968447050474,
 'call': 0.005797564971291922,
 '08715203649': 0.00015343057591862803,
 'identifi': 0.0009815968447050474,
 'code': 0.0017431409082869144,
 '40533': 0.0002735847594113102,
 'expir

In [32]:
probability_non_spam_words_tf_idf = probability_words_tf_idf(train_set[train_set['spam'] == 0])
probability_non_spam_words_tf_idf

{'boat': 0.0001340787571686418,
 'still': 0.002673817129382681,
 'mom': 0.0005821417879882206,
 'check': 0.0011852153497664995,
 'yo': 0.0009840129493963474,
 'half': 0.0006031513820190732,
 'nake': 0.0001340787571686418,
 'r': 0.002495752635206427,
 'give': 0.001979299495530657,
 'second': 0.000554343944182714,
 'chanc': 0.0002758274542360523,
 'rahul': 5.155066931771417e-05,
 'dengra': 5.155066931771417e-05,
 'play': 0.0007864456057031392,
 'smash': 9.444782127034245e-05,
 'bro': 9.444782127034245e-05,
 'lt': 0.004908691025774448,
 'gt': 0.004925192800279663,
 'religi': 5.155066931771417e-05,
 'g': 0.000554343944182714,
 'say': 0.0026603547777699706,
 'never': 0.0010338455854026877,
 'answer': 0.0005478981534006781,
 'text': 0.0018576594123121863,
 'confirm': 0.00034051651915242316,
 'deni': 0.00017877167622485572,
 'okey': 0.0001340787571686418,
 'dokey': 9.444782127034245e-05,
 'bit': 0.0010960778285755124,
 'sort': 0.0004910802869165033,
 'stuff': 0.0010338455854026877,
 'come': 0

## Evaluating the model

So we have implemented the Naive Bayes Classifier with two different approaches for computing the conditional probabilities, but which approach is better? We can find out by using some **evaluation metrics**. You already know them and you know the drill: once again, we will be using the `performance_metrics` function for evaluating these two models, bag of words and TF-IDF.

In [33]:
def classify_email(email, method):
    
    likelihood_spam = 1
    likelihood_non_spam = 1
    
    if method == 'bow':
        probability_spam = probability_spam_words
        probability_non_spam = probability_non_spam_words
    elif method == 'tfidf':
        probability_spam = probability_spam_words_tf_idf
        probability_non_spam = probability_non_spam_words_tf_idf
        
    for word in email:
        if word in probability_spam:
            likelihood_spam *= probability_spam[word]
        else:
            likelihood_spam = 0
        if word in probability_non_spam:
            likelihood_non_spam *= probability_non_spam[word]
        else:
            likelihood_non_spam = 0
    
    likelihood_spam *= p_spam
    likelihood_non_spam *= p_not_spam
    
    if likelihood_spam > likelihood_non_spam:
        return 1
    else:
        return 0

In [34]:
test_set_bow = test_set.copy()
test_set_bow['prediction'] = test_set['email'].apply(lambda x: classify_email(x, 'bow'))
test_set_bow

Unnamed: 0,email,spam,prediction
3245,"[funni, fact, nobodi, teach, volcano, 2, erupt...",0,0
944,"[sent, score, sopha, secondari, applic, school...",0,0
1044,"[know, someon, know, fanci, call, 09058097218,...",1,0
2484,"[promis, get, soon, text, morn, let, know, mad...",0,0
812,"[congratul, ur, award, either, 500, cd, gift, ...",1,1
...,...,...,...
4264,"[lt, decim, gt, common, car, better, buy, chin...",0,0
2439,"[rightio, 11, 48, well, arent, bright, earli, ...",0,0
5556,"[ye, u, text, pshew, miss, much]",0,0
4205,"[get, door]",0,0


In [35]:
test_set_tf_idf = test_set.copy()
test_set_tf_idf['prediction'] = test_set['email'].apply(lambda x: classify_email(x, 'tfidf'))
test_set_tf_idf

Unnamed: 0,email,spam,prediction
3245,"[funni, fact, nobodi, teach, volcano, 2, erupt...",0,0
944,"[sent, score, sopha, secondari, applic, school...",0,0
1044,"[know, someon, know, fanci, call, 09058097218,...",1,0
2484,"[promis, get, soon, text, morn, let, know, mad...",0,0
812,"[congratul, ur, award, either, 500, cd, gift, ...",1,1
...,...,...,...
4264,"[lt, decim, gt, common, car, better, buy, chin...",0,0
2439,"[rightio, 11, 48, well, arent, bright, earli, ...",0,0
5556,"[ye, u, text, pshew, miss, much]",0,0
4205,"[get, door]",0,0


In [36]:
def performance_metrics(results):
    
    positives = results[['spam', 'prediction']][results['spam'] == 1]
    negatives = results[['spam', 'prediction']][results['spam'] == 0]
    
    true_negatives = negatives[negatives['spam'] == negatives['prediction']].shape[0]
    false_positives = negatives[negatives['spam'] != negatives['prediction']].shape[0]
    true_positives = positives[positives['spam'] == positives['prediction']].shape[0]
    false_negatives = positives[positives['spam'] != positives['prediction']].shape[0]
    
    confusion_matrix = {'actual positives' : [true_positives, false_negatives], 
                        'actual negatives' : [false_positives, true_negatives]}
    
    confusion_matrix_df = pd.DataFrame.from_dict(confusion_matrix, orient='index', 
                                                 columns=['predicted positives', 'predicted negatives'])
    
    accuracy = (true_positives + true_negatives) / (true_positives + false_positives +  true_negatives + false_negatives)
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1_score = 2 * (precision * recall) / (precision + recall)
    
    metrics = {'Accuracy' : accuracy, 'Precision' : precision, 'Recall' : recall, 'F1 Score' : f1_score}
    
    metrics_df = pd.DataFrame.from_dict(metrics, orient='index', columns=['Metrics'])
    
    return confusion_matrix_df, metrics_df   

In [37]:
confusion_matrix, metrics = performance_metrics(test_set_bow)
confusion_matrix

| | predicted positives | predicted negatives |
|---|---|---|
| actual positives | 64 | 86 |
| actual negatives | 10 | 955 |


In [38]:
metrics

| | Metrics |
|---|---|
| Accuracy | 0.913901 |
| Precision | 0.864865 |
| Recall | 0.426667 |
| F1 Score | 0.571429 |


In [39]:
confusion_matrix, metrics = performance_metrics(test_set_tf_idf)
confusion_matrix

| | predicted positives | predicted negatives |
|---|---|---|
| actual positives | 64 | 86 |
| actual negatives | 5 | 960 |


In [40]:
metrics

| | Metrics |
|---|---|
| Accuracy | 0.918386 |
| Precision | 0.927536 |
| Recall | 0.426667 |
| F1 Score | 0.584475 |
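As a sanity check, the reported metrics can be recomputed from the confusion-matrix counts alone. The helper below is a sketch and its name, `metrics_from_counts`, is hypothetical. Unlike `performance_metrics`, it returns 0 instead of raising an error when a denominator is zero, which is a design assumption rather than a fixed convention.

```python
def metrics_from_counts(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall, "F1 Score": f1}

# Counts taken from the two confusion matrices reported above.
bow = metrics_from_counts(tp=64, fn=86, fp=10, tn=955)
tfidf = metrics_from_counts(tp=64, fn=86, fp=5, tn=960)

for name, values in (("bag of words", bow), ("TF-IDF", tfidf)):
    print(name, {metric: round(value, 6) for metric, value in values.items()})
```

The only count that differs between the two models is the number of false positives (10 versus 5), which is why precision changes while recall, 64/150, stays the same.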


As we can see, TF-IDF does improve the precision of the classifier, which is good in this case; however, recall remains the same. This suggests that if we want to improve the sensitivity of our model we should try other options, such as getting more examples of spam emails or including n-grams.
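One plausible contributor to the identical recall is the way `classify_email` handles unseen words: if a word of the email is missing from a class vocabulary, the likelihood of that class is set to zero. A spam email containing a single word never seen in the spam training emails therefore cannot be predicted as spam under either method. A common remedy is additive (Laplace) smoothing, usually combined with sums of logarithms to avoid numerical underflow. The sketch below illustrates the idea on toy data; the function names, the smoothing constant `alpha`, and the toy emails are hypothetical, and the effect on this dataset has not been measured here.

```python
import math
from collections import Counter

def count_words(emails):
    """Count word occurrences over a list of tokenized emails."""
    return Counter(word for email in emails for word in email)

def log_score(email, counts, vocab_size, prior, alpha=1.0):
    """log P(class) + sum of log P(word | class), with additive smoothing."""
    total = sum(counts.values())
    score = math.log(prior)
    for word in email:
        # Unseen words get a small, non-zero probability instead of zero.
        score += math.log((counts[word] + alpha) / (total + alpha * vocab_size))
    return score

def classify(email, spam_counts, ham_counts, p_spam):
    """Return 1 (spam) if the smoothed log-score for spam is larger, else 0."""
    vocab_size = len(set(spam_counts) | set(ham_counts))
    spam_score = log_score(email, spam_counts, vocab_size, p_spam)
    ham_score = log_score(email, ham_counts, vocab_size, 1 - p_spam)
    return 1 if spam_score > ham_score else 0

# Toy training data (made up for illustration).
spam_counts = count_words([["free", "win", "cash"], ["win", "prize", "call"]])
ham_counts = count_words([["call", "me", "later"], ["see", "you", "later"]])

# "tomorrow" never appears in training; the zero rule would give both classes 0.
prediction_unseen = classify(["win", "cash", "tomorrow"], spam_counts, ham_counts, 0.5)
prediction_ham = classify(["see", "you", "later"], spam_counts, ham_counts, 0.5)
print(prediction_unseen, prediction_ham)
```

In the toy example, the email with the unseen word "tomorrow" is still classified as spam because the remaining words favor the spam class, whereas with zeroed likelihoods both scores would be zero and the email would fall back to non-spam.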

## N-grams

Speaking of **n-grams**, these are groups of n, or fewer, consecutive words that you can extract from a sentence. The same concept may also be applied to characters instead of words. Consider the sentence "The cat sat on the mat." It may be decomposed into the following set of 2-grams:

`"The", "The cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the", "the mat", "mat".`

Extracting n-grams is a form of feature engineering that is really useful when using lightweight, shallow text-processing models such as logistic regression and random forests. Nevertheless, one-dimensional CNNs and **Recurrent Neural Networks** (RNNs) are capable of learning representations for groups of words and characters without being explicitly told about the existence of such groups.
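The decomposition of "The cat sat on the mat" can be reproduced with a few lines of Python. This is a minimal sketch; the `bag_of_ngrams` helper is a hypothetical name, and whitespace splitting stands in for a proper tokenizer.

```python
def bag_of_ngrams(tokens, n):
    """Return the set of all groups of up to `n` consecutive tokens."""
    return {
        " ".join(tokens[start:start + size])
        for size in range(1, n + 1)
        for start in range(len(tokens) - size + 1)
    }

tokens = "The cat sat on the mat".split()
bag = bag_of_ngrams(tokens, 2)
print(sorted(bag))
```

For this sentence the bag contains 11 elements: six single words ("The" and "the" are kept apart because no lower-casing is applied) and five pairs of consecutive words. Because the result is a set, the order of the n-grams is not preserved.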
