## SMS Spam filter with Multinomial Naive Bayes Algorithm


This project was done as part of the Conditional Probability class on DataQuest.

* **Concepts learned:** Naive Bayes Algorithm, Pandas-Numpy usage, Impact of different data cleaning schemes on the model's outcome
* **Main challenges:** Underflow of the probability values for long texts, Finding vectorized solutions, Understanding the difference between the results for different cleaning methods.

The goal of this project is to determine whether an SMS message is spam or not with an accuracy greater than 80%. To achieve this, a multinomial naive Bayes model is employed.

The data used in this project comes from the [UCI machine learning repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). It contains real, non-encoded English messages, each labeled as either legitimate (ham) or spam.


### Data Exploration
Let us take a look at the content of the dataset:

In [1]:
import pandas as pd, numpy as np, re as regex
from IPython.display import display 

data_DF = pd.read_csv('SMSSpamCollection',sep='\t',names=['Label', 'SMS'])
data_DF.head(5)

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
labels = data_DF["Label"]
print(labels.unique())

['ham' 'spam']


In [3]:
data_DF.isnull().sum()

Label    0
SMS      0
dtype: int64

In [4]:
labels.describe()

count     5572
unique       2
top        ham
freq      4825
Name: Label, dtype: object

* The data is composed of 5572 entries labeled as either "ham" (legitimate) or "spam".
* There are no null values in the dataset.
* The SMS texts contain grammatical errors, numbers and punctuation.
* Finally, 87% of the messages in the set are ham (4825 out of 5572).
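
As a quick reference point for the accuracy target, the class counts reported above imply a majority-class baseline. The sketch below only redoes that arithmetic; the variable names are illustrative and not part of the original notebook.

```python
# Counts taken from the labels.describe() output above
n_total = 5572
n_ham = 4825
n_spam = n_total - n_ham

ham_share = n_ham / n_total
spam_share = n_spam / n_total

# A classifier that always answers "ham" would be right this often
print(f"ham share:  {ham_share:.1%}")
print(f"spam share: {spam_share:.1%}")
```

Since always predicting ham would already be right for roughly 87% of the messages, the 80% target is probably best read as a minimum, and the kind of errors the model makes matters as much as the headline accuracy.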

### Data Pre-Processing

Before we start developing the model, we will clean up the dataset and tokenize the words. We can do different levels of data cleaning which would eventually impact the results. 

In [5]:
# Remove punctuation, lower-case all the content and tokenize the words
def normalized(txt):
    # r'\W' matches any non-word character; the raw string avoids an invalid-escape warning
    clean_txt = regex.sub(r'\W', ' ', txt).lower().split()
    return clean_txt

# Remove punctuation and tokenize the words (original casing is kept)
def keep_capitals(txt):
    clean_txt = regex.sub(r'\W', ' ', txt).split()
    return clean_txt

# Remove punctuation and stop words, lower-case all the content and tokenize the words
def no_stop_words(txt):

    clean_txt = regex.sub(r'\W', ' ', txt).lower().split()

    # The stop-word file is read as whitespace-separated words
    with open("stop_words_english.txt") as file:
        stopW_set = set(file.read().split())

    # Keep only the tokens that are not stop words
    clean_txt = [each_word for each_word in clean_txt if each_word not in stopW_set]

    return clean_txt

# Remove punctuation, lower-case all the content, tokenize and stem the words
def stemmed(txt):

    from nltk.stem.porter import PorterStemmer
    porter = PorterStemmer()

    clean_txt = regex.sub(r'\W', ' ', txt).lower().split()

    # Reduce every token to its Porter stem
    clean_txt = [porter.stem(each_word) for each_word in clean_txt]

    return clean_txt
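
To make the difference between the first two schemes concrete, here is a small self-contained sketch that applies the same regular expression to one sample message. The function names mirror the cell above, but `sample` is an illustrative string that is not taken from the dataset, and the stop-word and stemming variants are left out because they depend on an external file and on NLTK.

```python
import re

def normalized(txt):
    # Replace every non-word character with a space, lower-case, split on whitespace
    return re.sub(r'\W', ' ', txt).lower().split()

def keep_capitals(txt):
    # Same as above, but the original casing is preserved
    return re.sub(r'\W', ' ', txt).split()

# Hypothetical message, for illustration only
sample = "WINNER!! You've won a FREE prize"

print(normalized(sample))
print(keep_capitals(sample))
```

Note that the apostrophe is treated as punctuation, so a contraction such as "You've" is split into two tokens. The same effect is visible in the cleaned dataset, where "don't" becomes `don, t`.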


Beyond the scope of the original project, I experimented with these data cleaning methods; the results are reported in the last section. Since the normalized version gives the best accuracy, it is the one showcased here:

In [6]:
functions =[normalized,keep_capitals,no_stop_words,stemmed] 

data_DF["clean_SMS"]=data_DF["SMS"].apply(functions[0])  #you can change the function to experiment
data_DF.head(10)

Unnamed: 0,Label,SMS,clean_SMS
0,ham,"Go until jurong point, crazy.. Available only ...","[go, until, jurong, point, crazy, available, o..."
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f..."
3,ham,U dun say so early hor... U c already then say...,"[u, dun, say, so, early, hor, u, c, already, t..."
4,ham,"Nah I don't think he goes to usf, he lives aro...","[nah, i, don, t, think, he, goes, to, usf, he,..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...,"[freemsg, hey, there, darling, it, s, been, 3,..."
6,ham,Even my brother is not like to speak with me. ...,"[even, my, brother, is, not, like, to, speak, ..."
7,ham,As per your request 'Melle Melle (Oru Minnamin...,"[as, per, your, request, melle, melle, oru, mi..."
8,spam,WINNER!! As a valued network customer you have...,"[winner, as, a, valued, network, customer, you..."
9,spam,Had your mobile 11 months or more? U R entitle...,"[had, your, mobile, 11, months, or, more, u, r..."


### Create Training and Test Data
Now that we have the clean data set, we can randomize and divide it into the training and test datasets with a 4:1 ratio.


In [7]:
seed =1

rnd_data_DF = data_DF.sample(frac=1, replace=False, random_state=seed)

divider = round(len(rnd_data_DF)*0.8)

train_DF= rnd_data_DF[0:divider].copy().reset_index(drop=True)
test_DF= rnd_data_DF[divider:len(rnd_data_DF)].copy().reset_index(drop=True)

train_DF.head(5)

Unnamed: 0,Label,SMS,clean_SMS
0,ham,"Yep, by the pretty sculpture","[yep, by, the, pretty, sculpture]"
1,ham,"Yes, princess. Are you going to make me moan?","[yes, princess, are, you, going, to, make, me,..."
2,ham,Welp apparently he retired,"[welp, apparently, he, retired]"
3,ham,Havent.,[havent]
4,ham,I forgot 2 ask ü all smth.. There's a card on ...,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


In [8]:
train_label= train_DF["Label"]
train_label.value_counts(dropna=True)

ham     3858
spam     600
Name: Label, dtype: int64

In [9]:
test_label = test_DF["Label"]
test_label.value_counts(dropna=True)

ham     967
spam    147
Name: Label, dtype: int64

Both sets have a distribution similar to the original dataset and are ready-to-go.
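
The claim can be checked with simple arithmetic on the counts printed above. This is only a sketch that re-enters those numbers by hand; the `splits` dictionary is not part of the original notebook.

```python
# Class counts reported by describe() and value_counts() in the cells above
splits = {
    "full":  {"ham": 4825, "spam": 5572 - 4825},
    "train": {"ham": 3858, "spam": 600},
    "test":  {"ham": 967,  "spam": 147},
}

shares = {}
for name, counts in splits.items():
    total = counts["ham"] + counts["spam"]
    shares[name] = counts["ham"] / total
    print(f"{name:5s} n={total:4d} ham share={shares[name]:.3f}")
```

The ham share stays within about half a percentage point across the three sets, so a plain shuffle appears to be sufficient here without explicit stratification.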

### Naive Bayes Implementation

#### The Algorithm
As part of the Naive Bayes algorithm, we will calculate the probability of a new SMS message being spam given the words inside it, P(Spam|w1,w2,w3,...), and compare it to the probability of the same message being ham, P(Ham|w1,w2,w3,...). If the probability of the message being spam is higher, the SMS is categorized as spam, and vice versa.

The probabilities are formally calculated as follows:

\begin{equation}
  P(Spam|w_1,...,w_n) = \frac{ P(Spam \cap (w_1,...,w_n)) }{ P((w_1,...,w_n)) }
\end{equation}

\begin{equation}
  P(Ham|w_1,...,w_n) = \frac{ P(Ham \cap (w_1,...,w_n)) }{ P((w_1,...,w_n)) }
\end{equation}

but we will make two simplifications: 

**1)** The first one is not to calculate the denominator of the equations above. We only compare P(Spam|w1,w2,..,wn) to P(Ham|w1,w2,..,wn), and both have the same denominator, so it has no effect on which of the two is larger. Hence, the equations simplify to:

\begin{equation}
  P(Spam|w_1,...,w_n) \propto P(Spam \cap (w_1,...,w_n))
\end{equation}

\begin{equation}
  P(Ham|w_1,...,w_n) \propto P(Ham \cap (w_1,...,w_n))
\end{equation}

    which eventually can be calculated as:

\begin{equation}
  P(Spam \cap (w_1,...,w_n)) = P(w_1| w_2 \cap ... \cap w_n \cap Spam) \cdot
                               P(w_2| w_3 \cap ... \cap w_n \cap Spam) \cdot ... \cdot P(w_n|Spam) \cdot P(Spam)
\end{equation}

\begin{equation}
  P(Ham \cap (w_1,...,w_n)) = P(w_1| w_2 \cap ... \cap w_n \cap Ham) \cdot
                              P(w_2| w_3 \cap ... \cap w_n \cap Ham) \cdot ... \cdot P(w_n|Ham) \cdot P(Ham)
\end{equation}
<br/> <br/>   

**2)** The second simplification is to assume (naively) that the words occurring in a message are conditionally independent given that the message is spam (or ham). That is to say, the equations above can be written as:

\begin{equation}
  P(Spam \cap (w_1,...,w_n)) =  P(Spam) * P(w_1|Spam) * ... * P(w_n|Spam) 
\end{equation}

\begin{equation}
  P(Ham \cap (w_1,...,w_n)) =  P(Ham) * P(w_1|Ham) * ... * P(w_n|Ham) 
\end{equation}

    or simply:
    
\begin{equation}
  P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}

\begin{equation}
  P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

    where P(wi|Spam) and P(wi|Ham) can be calculated as:

\begin{equation}
  P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
  P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}


    Here:

  - N<sub>wi |Spam</sub> (or N<sub>wi |Ham</sub>) is the number of times the word w<sub>i</sub> appears in spam (or ham) messages
  - $\alpha$ is the Laplace smoothing parameter, set to 1 here
  - N<sub>Spam</sub> (or N<sub>Ham</sub>) is the total number of words in spam (or ham) messages
  - N<sub>Vocabulary</sub> is the number of unique words in the training data set
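
To see the smoothed estimate at work, here is a minimal sketch on a made-up four-message corpus. The corpus, the helper name `p_word_given_class` and the resulting numbers are purely illustrative assumptions; only the formula is the one given above.

```python
from collections import Counter

# Hypothetical, already tokenized training messages
train = [
    ("spam", ["win", "free", "prize"]),
    ("spam", ["free", "entry"]),
    ("ham",  ["see", "you", "later"]),
    ("ham",  ["free", "tonight"]),
]

alpha = 1  # Laplace smoothing parameter
vocab = {w for _, words in train for w in words}

# Per-class word counts
counts = {"spam": Counter(), "ham": Counter()}
for label, words in train:
    counts[label].update(words)

def p_word_given_class(word, label):
    n_word = counts[label][word]            # N_{w_i|class}, 0 for unseen words
    n_class = sum(counts[label].values())   # N_class: total words in the class
    return (n_word + alpha) / (n_class + alpha * len(vocab))

print(p_word_given_class("free", "spam"))
print(p_word_given_class("free", "ham"))
print(p_word_given_class("prize", "ham"))   # never seen in ham, still non-zero
```

Thanks to the smoothing term, a word that never occurs in one class still gets a small non-zero probability, so a single unseen word cannot force the whole product to zero.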




#### The Implementation

Let us first create the vocabulary of the unique words in the training dataset:

In [10]:
train_SMS = train_DF["clean_SMS"]

tokens =[]
for each_row in train_SMS:
    for each_word in each_row:
        tokens.append(each_word)

vocab_set = set(tokens)
train_vocab = list(vocab_set)
    
train_vocab.sort()
print(train_vocab[:10])
print(train_vocab[-10:])

['0', '00', '000', '000pes', '008704050406', '0089', '01223585334', '02', '0207', '02072069400']
['zindgi', 'zoe', 'zogtorius', 'zouk', 'zyada', 'é', 'ú1', 'ü', '〨ud', '鈥']


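Now, we can calculate the low-hanging fruit for the P(wi|Spam) and P(wi|Ham) calculations: the class priors, the vocabulary size and the smoothing parameter. Note that `n_ham` and `n_spam` in this cell count messages, not words.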

In [11]:
ratios = train_label.value_counts(normalize=True)      
p_ham = ratios["ham"]
p_spam = ratios["spam"]

counts = train_label.value_counts()
n_ham = counts["ham"]
n_spam = counts["spam"]

n_vocab = len(train_vocab)
alpha = 1

print(("p_ham %.3f,p_spam %.3f\n"+"n_ham %.2e,n_spam %.2e\n"+"n_vocabulary %i") 
      %(p_ham,p_spam,n_ham,n_spam,n_vocab))

p_ham 0.865,p_spam 0.135
n_ham 3.86e+03,n_spam 6.00e+02
n_vocabulary 7783


The next step is to calculate the number of times each word appears in spam (or ham) messages **:** N<sub>wi |Spam</sub> (or N<sub>wi |Ham</sub>) 

In [12]:
train_word_freq={}
for unique_word in train_vocab:
    train_word_freq[unique_word] = np.zeros(len(train_SMS))

for index,each_row in enumerate(train_SMS):
    for each_word in each_row:
        train_word_freq[each_word][index] += 1

twf = pd.DataFrame(train_word_freq)

train_freq_DF = pd.concat([train_label,twf],axis=1)  
train_freq_DF.set_index('Label',inplace=True)
train_freq_DF.head(5)


Label,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
ham,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ham,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ham,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ham,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ham,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0


In [13]:
tf_DF = train_freq_DF.groupby("Label").sum()
display(tf_DF)

Label,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
ham,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,1.0,1.0,0.0,1.0,4.0,0.0,128.0,1.0,1.0
spam,3.0,9.0,25.0,0.0,1.0,1.0,2.0,7.0,3.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
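
The message-by-word table built above has one row per message and one column per vocabulary entry, and it consists mostly of zeros. As a possible alternative, the per-class totals can be accumulated directly with `collections.Counter`. This is a sketch on a tiny hypothetical sample, not the approach used in this notebook; `labels` and `messages` stand in for `train_label` and `train_SMS`.

```python
from collections import Counter, defaultdict

# Hypothetical stand-ins for train_label and train_SMS
labels = ["ham", "spam", "ham"]
messages = [["ok", "see", "u"], ["free", "entry", "free"], ["u", "there"]]

# One Counter per class, filled message by message
word_counts = defaultdict(Counter)
for label, words in zip(labels, messages):
    word_counts[label].update(words)

# Total number of words per class
n_words = {label: sum(c.values()) for label, c in word_counts.items()}

print(word_counts["spam"]["free"])  # occurrences of "free" in spam messages
print(n_words)
```

This keeps memory proportional to the number of distinct (class, word) pairs rather than to messages times vocabulary size, which may matter for larger datasets.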


The last parameters we need are the total numbers of words in spam (or ham) messages **:** N<sub>Spam</sub> (or N<sub>Ham</sub>) 

In [14]:
n_ham_words = tf_DF.loc["ham"].sum()
n_spam_words = tf_DF.loc["spam"].sum()
print("Number of words in ham messages: %i , spam messages: %i" %(n_ham_words,n_spam_words))

Number of words in ham messages: 57237 , spam messages: 15190


Now that we have all we need, we can calculate P(wi|Spam) and P(wi|Ham):

In [15]:
dict_div_by ={'ham': n_ham_words+(alpha*n_vocab),'spam' : n_spam_words+(alpha*n_vocab)}
div_by = tf_DF.index.to_series().map(dict_div_by)

p_word_DF= (tf_DF+alpha).div(div_by, axis = 0)
display(p_word_DF)

Label,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
ham,1.5e-05,1.5e-05,1.5e-05,3.1e-05,1.5e-05,1.5e-05,1.5e-05,1.5e-05,1.5e-05,1.5e-05,...,4.6e-05,3.1e-05,3.1e-05,1.5e-05,3.1e-05,7.7e-05,1.5e-05,0.001984,3.1e-05,3.1e-05
spam,0.000174,0.000435,0.001132,4.4e-05,8.7e-05,8.7e-05,0.000131,0.000348,0.000174,8.7e-05,...,4.4e-05,4.4e-05,4.4e-05,8.7e-05,4.4e-05,4.4e-05,8.7e-05,4.4e-05,4.4e-05,4.4e-05
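
As a sanity check, a few entries of this table can be reproduced by hand from the counts printed earlier (`ü` appears 128 times in ham messages, `000` appears 25 times in spam messages). The snippet below is only a sketch that re-enters those printed numbers; the variable names are illustrative.

```python
alpha = 1
n_vocab = 7783
n_ham_words, n_spam_words = 57237, 15190

# (count + alpha) / (words in class + alpha * vocabulary size)
p_u_umlaut_ham = (128 + alpha) / (n_ham_words + alpha * n_vocab)
p_000_spam = (25 + alpha) / (n_spam_words + alpha * n_vocab)

# A word that never occurs in ham messages still gets a non-zero value
p_unseen_ham = (0 + alpha) / (n_ham_words + alpha * n_vocab)

print(f"{p_u_umlaut_ham:.6f}")
print(f"{p_000_spam:.6f}")
print(f"{p_unseen_ham:.1e}")
```

These values agree with the `ü` entry in the ham row, the `000` entry in the spam row, and the smallest value shown in the ham row.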


Moving forward, we will: 
* Calculate P(Spam|w1,...,wn) from P(wi|Spam) **and** P(Ham|w1,...,wn) from P(wi|Ham)
* Assign the message as: 
    - spam      if P(Spam|w1,...,wn) > P(Ham|w1,...,wn)
    - ham       if P(Spam|w1,...,wn) < P(Ham|w1,...,wn)
    - ambiguous (spelled `ambigious` in the code and its output) if P(Spam|w1,...,wn) == P(Ham|w1,...,wn)

In [16]:
def spam_or_ham (txt,debug=False):
    
    if debug: 
        print('\n#--------------------------------------------')
        print('SMS:\n',txt) 
    
    txt_in_vocab = []
    for word in txt:
        if word in train_vocab:
             txt_in_vocab.append(word)   

    if debug: print('Words found in vocab:\n',txt_in_vocab)

    p_ham_given_words  = p_ham
    p_spam_given_words = p_spam   
    
    if (txt_in_vocab):
        p_ham_given_words  *= p_word_DF.loc["ham",txt_in_vocab].product(skipna=False)
        p_spam_given_words *= p_word_DF.loc["spam",txt_in_vocab].product(skipna=False) 
   
    if(p_ham_given_words>p_spam_given_words):                      
        category = "ham"
    elif (p_ham_given_words<p_spam_given_words):
        category = "spam"
    else:
        category = "ambigious"
    
       
    if debug:
        print()
        print('p_ham_prior: %.3e' %p_ham)
        print('p_ham_given_words ~ %.3e' %p_ham_given_words)
        print('Word','\t','P(wi|ham)')
        print(p_word_DF.loc["ham",txt_in_vocab],'\n')
        print('p_spam_prior: %.3e' %p_spam)
        print('p_spam_given_words ~ %.3e' %p_spam_given_words)
        print('Word','\t','P(wi|spam)')
        print(p_word_DF.loc["spam",txt_in_vocab])
        
    return category


Let us test the function above on the first entry of the test dataset:

In [17]:
spam_or_ham(test_DF.loc[0,'clean_SMS'],debug=True) 


#--------------------------------------------
SMS:
 ['later', 'i', 'guess', 'i', 'needa', 'do', 'mcat', 'study', 'too']
Words found in vocab:
 ['later', 'i', 'guess', 'i', 'do', 'study', 'too']

p_ham_prior: 8.654e-01
p_ham_given_words ~ 4.253e-19
Word 	 P(wi|ham)
later    0.001523
i        0.036942
guess    0.000323
i        0.036942
do       0.004860
study    0.000108
too      0.001400
Name: ham, dtype: float64 

p_spam_prior: 1.346e-01
p_spam_given_words ~ 3.483e-26
Word 	 P(wi|spam)
later    0.000044
i        0.002220
guess    0.000305
i        0.002220
do       0.001045
study    0.000044
too      0.000087
Name: spam, dtype: float64


'ham'

The SMS seems to have been parsed and filtered correctly. The parameters for the calculations also look right.
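
The two scores above are proportional to the posteriors rather than equal to them, because the shared denominator was dropped. If an actual probability is wanted, the scores can be renormalized so that they sum to one. A short sketch using the two values printed above (the variable names are illustrative):

```python
# Unnormalized scores printed by the debug run above
score_ham = 4.253e-19
score_spam = 3.483e-26

# Dividing by the sum restores the shared denominator P(w_1, ..., w_n)
posterior_ham = score_ham / (score_ham + score_spam)
posterior_spam = score_spam / (score_ham + score_spam)

print(f"P(Ham|words)  ~ {posterior_ham:.8f}")
print(f"P(Spam|words) ~ {posterior_spam:.2e}")
```

The ranking is the same before and after the normalization, which is why the classifier can skip it.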

Now, it is time to apply the function to the whole test dataset and get the accuracy of the model's predictions:

In [18]:
test_DF['Prediction']=test_DF['clean_SMS'].apply(spam_or_ham, debug = False)

In [19]:
test_DF.head(5)

Unnamed: 0,Label,SMS,clean_SMS,Prediction
0,ham,Later i guess. I needa do mcat study too.,"[later, i, guess, i, needa, do, mcat, study, too]",ham
1,ham,But i haf enuff space got like 4 mb...,"[but, i, haf, enuff, space, got, like, 4, mb]",ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,"[had, your, mobile, 10, mths, update, to, late...",spam
3,ham,All sounds good. Fingers . Makes it difficult ...,"[all, sounds, good, fingers, makes, it, diffic...",ham
4,ham,"All done, all handed in. Don't know if mega sh...","[all, done, all, handed, in, don, t, know, if,...",ham


In [20]:
check= (test_DF['Label']==test_DF['Prediction']).value_counts()
correct = check[True]
error = check[False]
accuracy = correct/(correct+error)

print("Total number of entries:",correct+error)
print("Number of errors:",error)
print("Accuracy: %.4f" %accuracy)


Total number of entries: 1114
Number of errors: 14
Accuracy: 0.9874


Let us check for which cases the model fails to predict correctly.

In [21]:
test_DF[test_DF['Prediction']!=test_DF['Label']]

Unnamed: 0,Label,SMS,clean_SMS,Prediction
114,spam,Not heard from U4 a while. Call me now am here...,"[not, heard, from, u4, a, while, call, me, now...",ham
135,spam,More people are dogging in your area now. Call...,"[more, people, are, dogging, in, your, area, n...",ham
152,ham,Unlimited texts. Limited minutes.,"[unlimited, texts, limited, minutes]",spam
159,ham,26th OF JULY,"[26th, of, july]",spam
284,ham,Nokia phone is lovly..,"[nokia, phone, is, lovly]",spam
293,ham,A Boy loved a gal. He propsd bt she didnt mind...,"[a, boy, loved, a, gal, he, propsd, bt, she, d...",ambigious
302,ham,No calls..messages..missed calls,"[no, calls, messages, missed, calls]",spam
319,ham,We have sent JD for Customer Service cum Accou...,"[we, have, sent, jd, for, customer, service, c...",spam
504,spam,Oh my god! I've found your number again! I'm s...,"[oh, my, god, i, ve, found, your, number, agai...",ham
546,spam,"Hi babe its Chloe, how r u? I was smashed on s...","[hi, babe, its, chloe, how, r, u, i, was, smas...",ham


At first sight, the message categorized as ambiguous (entry 293) looks troublesome. Let us dive deeper and see what went wrong:

In [22]:
test_DF.loc[test_DF['Prediction']!=test_DF['Label'],'clean_SMS'].apply(spam_or_ham,debug=True)


#--------------------------------------------
SMS:
 ['not', 'heard', 'from', 'u4', 'a', 'while', 'call', 'me', 'now', 'am', 'here', 'all', 'night', 'with', 'just', 'my', 'knickers', 'on', 'make', 'me', 'beg', 'for', 'it', 'like', 'u', 'did', 'last', 'time', '01223585236', 'xx', 'luv', 'nikiyu4', 'net']
Words found in vocab:
 ['not', 'heard', 'from', 'u4', 'a', 'while', 'call', 'me', 'now', 'am', 'here', 'all', 'night', 'with', 'just', 'my', 'on', 'make', 'me', 'beg', 'for', 'it', 'like', 'u', 'did', 'last', 'time', 'xx', 'luv', 'net']

p_ham_prior: 8.654e-01
p_ham_given_words ~ 1.182e-84
Word 	 P(wi|ham)
not      0.005291
heard    0.000123
from     0.001784
u4       0.000015
a        0.013396
while    0.000308
call     0.003045
me       0.009843
now      0.003768
am       0.002599
here     0.001476
all      0.003107
night    0.001292
with     0.003230
just     0.003676
my       0.009197
on       0.004783
make     0.001184
me       0.009843
beg      0.000031
for      0.006337
it       

114          ham
135          ham
152         spam
159         spam
284         spam
293    ambigious
302         spam
319         spam
504          ham
546          ham
741          ham
876          ham
885          ham
953          ham
Name: clean_SMS, dtype: object

#### Log Transformation

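Entry 293 is a long text. Since the individual word probabilities are very small, the longer the text is, the higher the risk that their product underflows; if both scores underflow to zero, they compare as equal and the message is labeled ambiguous. It wasn't in the original scope of the project, but we can solve it by comparing log-probabilities instead. The log of a product is the sum of the logs, and because the logarithm is monotonic, the comparison between P(Spam|w1,...,wn) and P(Ham|w1,...,wn) is unchanged: 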

In [23]:
def spam_or_ham_log (txt,debug=False):

    import math
    
    if debug: 
        print('\n#--------------------------------------------')
        print('SMS:\n',txt)  
    
    txt_in_vocab = []
    for word in txt:
        if word in train_vocab:
             txt_in_vocab.append(word)   

    if debug: print('Words found in vocab:\n',txt_in_vocab)

    p_ham_given_words  = math.log(p_ham)
    p_spam_given_words = math.log(p_spam)   
   
    if (txt_in_vocab):               
        p_ham_given_words  += p_word_DF.loc["ham",txt_in_vocab].apply(np.log).sum()
        p_spam_given_words += p_word_DF.loc["spam",txt_in_vocab].apply(np.log).sum()

    if(p_ham_given_words>p_spam_given_words):                      
        category = "ham"
    elif (p_ham_given_words<p_spam_given_words):
        category = "spam"
    else:
        category = "ambigious"
    
       
    if debug:
        print()
        print('log(p_ham_prior): %.3e' %math.log(p_ham))
        print('log(p_ham_given_words) ~ %.3e' %p_ham_given_words)
        print('Word','\t','log(P(wi|ham))')
        print(p_word_DF.loc["ham",txt_in_vocab].apply(np.log),'\n')
        print('log(p_spam_prior): %.3e' %math.log(p_spam))
        print('log(p_spam_given_words) ~ %.3e' %p_spam_given_words)
        print('Word','\t','log(P(wi|spam))')
        print(p_word_DF.loc["spam",txt_in_vocab].apply(np.log))
        
    return category
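
The underflow itself is easy to reproduce outside the notebook. The sketch below uses made-up per-word probabilities (`p_word_ham`, `p_word_spam`) and a made-up message length; only the priors are taken from the training set above. It contrasts the plain product with the sum of logs.

```python
import math

p_ham, p_spam = 0.865, 0.135            # priors from the training set
p_word_ham, p_word_spam = 1e-4, 1e-5    # hypothetical per-word probabilities
n_words = 100                           # hypothetical message length

# Plain product, as in spam_or_ham
prod_ham, prod_spam = p_ham, p_spam
for _ in range(n_words):
    prod_ham *= p_word_ham
    prod_spam *= p_word_spam

# Sum of logs, as in spam_or_ham_log
log_ham = math.log(p_ham) + n_words * math.log(p_word_ham)
log_spam = math.log(p_spam) + n_words * math.log(p_word_spam)

print(prod_ham, prod_spam)   # both products collapse to 0.0, so the comparison ties
print(log_ham, log_spam)     # the log scores stay finite and comparable
```

With these assumed values the plain products are indistinguishable, while the log scores still rank ham above spam.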


Let us check again whether the result looks reasonable:

In [24]:
spam_or_ham_log(test_DF.loc[0,'clean_SMS'],debug=True) 


#--------------------------------------------
SMS:
 ['later', 'i', 'guess', 'i', 'needa', 'do', 'mcat', 'study', 'too']
Words found in vocab:
 ['later', 'i', 'guess', 'i', 'do', 'study', 'too']

log(p_ham_prior): -1.446e-01
log(p_ham_given_words) ~ -4.230e+01
Word 	 log(P(wi|ham))
later   -6.487330
i       -3.298393
guess   -8.037928
i       -3.298393
do      -5.326708
study   -9.136540
too     -6.571591
Name: ham, dtype: float64 

log(p_spam_prior): -2.006e+00
log(p_spam_given_words) ~ -5.862e+01
Word 	 log(P(wi|spam))
later   -10.042075
i        -6.110249
guess    -8.096165
i        -6.110249
do       -6.864021
study   -10.042075
too      -9.348928
Name: spam, dtype: float64


'ham'

Let us also check whether this fixes the "ambigious" prediction:

In [25]:
test_DF['Prediction']=test_DF['clean_SMS'].apply(spam_or_ham_log, debug = False)

check= (test_DF['Label']==test_DF['Prediction']).value_counts()
correct = check[True]
error = check[False]
accuracy = correct/(correct+error)

print("Total number of entries:",correct+error)
print("Number of errors:",error)
print("Accuracy: %.4f" %accuracy)


Total number of entries: 1114
Number of errors: 13
Accuracy: 0.9883


In [26]:
test_DF[test_DF['Prediction']!=test_DF['Label']]

Unnamed: 0,Label,SMS,clean_SMS,Prediction
114,spam,Not heard from U4 a while. Call me now am here...,"[not, heard, from, u4, a, while, call, me, now...",ham
135,spam,More people are dogging in your area now. Call...,"[more, people, are, dogging, in, your, area, n...",ham
152,ham,Unlimited texts. Limited minutes.,"[unlimited, texts, limited, minutes]",spam
159,ham,26th OF JULY,"[26th, of, july]",spam
284,ham,Nokia phone is lovly..,"[nokia, phone, is, lovly]",spam
302,ham,No calls..messages..missed calls,"[no, calls, messages, missed, calls]",spam
319,ham,We have sent JD for Customer Service cum Accou...,"[we, have, sent, jd, for, customer, service, c...",spam
504,spam,Oh my god! I've found your number again! I'm s...,"[oh, my, god, i, ve, found, your, number, agai...",ham
546,spam,"Hi babe its Chloe, how r u? I was smashed on s...","[hi, babe, its, chloe, how, r, u, i, was, smas...",ham
741,spam,"0A$NETWORKS allow companies to bill for SMS, s...","[0a, networks, allow, companies, to, bill, for...",ham


Looks good! The ambiguous entry is gone from the error list, and the number of errors dropped from 14 to 13.
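
One caveat about the accuracy cells: `check[False]` raises a `KeyError` if the model makes no mistakes at all, since `value_counts()` then has no `False` entry. A version that does not depend on both outcomes being present could look like the sketch below. The helper `accuracy` and the toy lists are illustrative assumptions, written in plain Python so that the snippet runs on its own.

```python
def accuracy(labels, predictions):
    # Fraction of positions where the prediction agrees with the label
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(l == p for l, p in zip(labels, predictions))
    return correct / len(labels)

# Hypothetical labels and predictions
labels      = ["ham", "spam", "ham", "ham"]
predictions = ["ham", "ham",  "ham", "ham"]
print(accuracy(labels, predictions))

# Consistency check with the numbers printed above: 13 errors out of 1114
print(f"{(1114 - 13) / 1114:.4f}")
```

With pandas, `(test_DF['Label'] == test_DF['Prediction']).mean()` gives the same figure in one line, because the mean of a boolean Series is the fraction of `True` values.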

#### Impact of Different Cleaning Methods

The following are the results I found while experimenting with different data cleaning methods:

In [27]:
FOMs = pd.DataFrame({
                "N_errors":[13,17,19,15],
                "Accuracy(%)":[98.83,98.47,98.29,98.65],
                "Vocab_size":[7783,9656,7639,6593]
             },
             index=["normalized","keep_capitals", "no_stop_words","stemmed"]
            )
FOMs.sort_values(by='Accuracy(%)',ascending=False)

Unnamed: 0,N_errors,Accuracy(%),Vocab_size
normalized,13,98.83,7783
stemmed,15,98.65,6593
keep_capitals,17,98.47,9656
no_stop_words,19,98.29,7639


Stemming the words and removing the stop words seem to have taken away meaningful information from the dataset and thus reduced the accuracy of the model. Both reduced the size of the vocabulary (stemming by about 15%, stop-word removal by about 2%) and potentially the runtime of the model. Since Naive Bayes is a relatively fast algorithm and my dataset is rather small, this figure of merit is of less importance here. For a larger dataset, stemming might be needed.

I expected that keeping the capitalization would improve the accuracy, since more information would be available, but I was wrong. The spam messages contain a large number of capitalized words, which seems to have pushed the model to treat any message with capitalization as spam. On top of that, this method increased the vocabulary size (from 7783 to 9656) and the run time, which made this option the least desirable.