# Naive Bayes Classifier 

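The Naive Bayes classifier is one of the most frequently used classification methods. It is based on Bayes' theorem and "naively" assumes that the features (here, the words in a message) are conditionally independent given the class.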

Suppose a certain word W, say "reward", appears in an email. The probability that the email is spam (S) given that it contains W equals the probability that W appears in a spam email, times the prior probability of spam, divided by the probability that W appears in any email:

$P(S|W) = \frac{P(W|S)*P(S)}{P(W)}$

Here P(W) is the overall probability that W appears in an email: the probability of W given spam times P(spam), plus the probability of W given not spam (N) times P(not spam):

$P(W) = P(W|S)*P(S)+P(W|N)*P(N)$
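As a quick numeric check of the two formulas above, the sketch below plugs in made-up values. The four input probabilities are illustrative assumptions, not estimates from the SMS data used later.

```python
# Illustrative inputs only; none of these numbers come from the dataset.
p_spam = 0.2            # P(S): prior probability that a message is spam
p_ham = 1 - p_spam      # P(N): prior probability that a message is not spam
p_w_given_spam = 0.30   # P(W|S): chance that "reward" appears in a spam message
p_w_given_ham = 0.01    # P(W|N): chance that "reward" appears in a non-spam message

# Law of total probability: P(W) = P(W|S)*P(S) + P(W|N)*P(N)
p_w = p_w_given_spam * p_spam + p_w_given_ham * p_ham

# Bayes' theorem: P(S|W) = P(W|S)*P(S) / P(W)
p_spam_given_w = p_w_given_spam * p_spam / p_w

print(round(p_w, 4), round(p_spam_given_w, 4))
```

With these inputs the posterior is about 0.88: although only 20% of messages are assumed to be spam, seeing a word that is much more common in spam shifts the estimate strongly.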

http://www.gatsby.ucl.ac.uk/~porbanz/teaching/UN3106S18/slides_25Jan.pdf

https://towardsdatascience.com/how-to-build-and-apply-naive-bayes-classification-for-spam-filtering-2b8d3308501

https://towardsdatascience.com/create-a-sms-spam-classifier-in-python-b4b015f7404b

In [4]:
import pandas as pd 
import os

In [5]:
os.getcwd()


'/Users/wenxuanzhang/LocalDoc/GitHub/Learning Algorithm'

In [13]:
import glob

txtfiles = []
for file in glob.glob("*.csv"):
    txtfiles.append(file)

In [14]:
txtfiles

['SMSSpamCollection.csv']

In [20]:
sms_data = pd.read_csv('SMSSpamCollection.csv',names=["Label","SMS"])

In [21]:
sms_data.head(15)

     Label                                                SMS
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...
5     spam  FreeMsg Hey there darling it's been 3 week's n...
6      ham  Even my brother is not like to speak with me. ...
7      ham  As per your request 'Melle Melle (Oru Minnamin...
8     spam  WINNER!! As a valued network customer you have...
9     spam  Had your mobile 11 months or more? U R entitle...
10     ham  I'm gonna be home soon and i don't want to tal...
11    spam  SIX chances to win CASH! From 100 to 20,000 po...
12    spam  URGENT! You have won a 1 week FREE membership ...
13     ham  I've been searching for the right words to tha...
14     ham                I HAVE A DAT

In [22]:
sms_data.groupby('Label').count()


        SMS
Label
ham    4825
spam    747


In [23]:
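# Replace runs of non-word characters (\W+) with a space, collapse whitespace (\s+),
# then lowercase each message and split it into a list of words.
# Note: sms_data_clean is an alias of sms_data, not a copy, so sms_data changes too.
sms_data_clean = sms_data
sms_data_clean['SMS'] = (sms_data_clean['SMS']
                         .str.replace(r'\W+', ' ', regex=True)
                         .str.replace(r'\s+', ' ', regex=True)
                         .str.strip())
sms_data_clean['SMS'] = sms_data_clean['SMS'].str.lower()
sms_data_clean['SMS'] = sms_data_clean['SMS'].str.split()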
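To see what this cleaning does to a single message, here is a minimal sketch of the same steps using the standard `re` module. The helper name `tokenize` and the sample sentence are assumptions made for illustration.

```python
import re

def tokenize(text):
    """Replace runs of non-word characters with a space, lowercase, split on whitespace."""
    return re.sub(r"\W+", " ", text).strip().lower().split()

print(tokenize("WINNER!! You've won a FREE prize..."))
# ['winner', 'you', 've', 'won', 'a', 'free', 'prize']
```

Note that the apostrophe counts as a non-word character, so contractions are split into two tokens ("you've" becomes "you" and "ve").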

In [25]:
sms_data.head(15)

     Label                                                SMS
0      ham  [go, until, jurong, point, crazy, available, o...
1      ham                     [ok, lar, joking, wif, u, oni]
2     spam  [free, entry, in, 2, a, wkly, comp, to, win, f...
3      ham  [u, dun, say, so, early, hor, u, c, already, t...
4      ham  [nah, i, don, t, think, he, goes, to, usf, he,...
5     spam  [freemsg, hey, there, darling, it, s, been, 3,...
6      ham  [even, my, brother, is, not, like, to, speak, ...
7      ham  [as, per, your, request, melle, melle, oru, mi...
8     spam  [winner, as, a, valued, network, customer, you...
9     spam  [had, your, mobile, 11, months, or, more, u, r...
10     ham  [i, m, gonna, be, home, soon, and, i, don, t, ...
11    spam  [six, chances, to, win, cash, from, 100, to, 2...
12    spam  [urgent, you, have, won, a, 1, week, free, mem...
13     ham  [i, ve, been, searching, for, the, right, word...
14     ham         [i, have, a, date, 

In [27]:
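# Split into a training set (80% of the rows) and a test set
train_data = sms_data_clean.sample(frac=0.8, random_state=1).reset_index(drop=True)
# Caution: train_data.index was reset above, so this drops the first len(train_data)
# rows of sms_data_clean rather than the sampled rows; the two sets can overlap.
test_data = sms_data_clean.drop(train_data.index).reset_index(drop=True)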
sms_data_clean['Label'].value_counts() / sms_data.shape[0] * 100


ham     86.593683
spam    13.406317
Name: Label, dtype: float64

In [28]:
train_data['Label'].value_counts() / train_data.shape[0] * 100


ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [29]:
test_data['Label'].value_counts() / test_data.shape[0] * 100


ham     86.983842
spam    13.016158
Name: Label, dtype: float64

In [30]:
test_data.shape[0]

1114
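Because `train_data.index` is reset before it is passed to `drop`, the held-out rows in the split cell are not guaranteed to be disjoint from the sampled rows, which can make the accuracy measured later look better than it is. A sketch of a split that keeps the original index until after the drop, shown on a made-up toy frame (`toy`, `train_part`, and `test_part` are hypothetical names):

```python
import pandas as pd

# Toy stand-in for sms_data_clean; the labels and messages are made up.
toy = pd.DataFrame({
    "Label": ["ham", "spam"] * 5,
    "SMS": [["msg", str(i)] for i in range(10)],
})

# Sample first and keep the original index so the sampled rows can be identified.
train_part = toy.sample(frac=0.8, random_state=1)
# Drop exactly the sampled rows; what remains is the held-out set.
test_part = toy.drop(train_part.index)

# Reset the indices only after the split is done.
train_part = train_part.reset_index(drop=True)
test_part = test_part.reset_index(drop=True)

print(len(train_part), len(test_part))
```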

In [31]:
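# Vocabulary: every distinct word that appears in the training messages
vocabulary = list(set(train_data['SMS'].sum()))
# One row per message, one column per vocabulary word, holding that word's count
word_counts_per_sms = pd.DataFrame([
    [row['SMS'].count(word) for word in vocabulary]
    for _, row in train_data.iterrows()], columns=vocabulary)
# Append the count columns; reset_index() adds an 'index' column that iloc[:, 1:] drops again
train_data = pd.concat([train_data.reset_index(), word_counts_per_sms], axis=1).iloc[:,1:]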
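The cell above builds a dense table with one column per vocabulary word, but the classifier only needs the total count of each word per class. A lighter alternative, sketched here on made-up toy data with `collections.Counter` (the names `toy_train`, `class_word_counts`, and `toy_vocabulary` are assumptions for illustration):

```python
from collections import Counter
import pandas as pd

# Toy stand-in for the tokenized training data; the rows are made up.
toy_train = pd.DataFrame({
    "Label": ["spam", "ham", "spam"],
    "SMS": [["free", "prize", "free"], ["see", "you", "soon"], ["win", "prize"]],
})

# One Counter per class, summing word occurrences over all messages of that class.
class_word_counts = {
    label: Counter(word for tokens in group for word in tokens)
    for label, group in toy_train.groupby("Label")["SMS"]
}

# The vocabulary is the set of distinct words seen in any training message.
toy_vocabulary = set().union(*class_word_counts.values())

print(class_word_counts["spam"]["free"], len(toy_vocabulary))
```

A `Counter` returns 0 for words it has never seen, so a lookup such as `class_word_counts["ham"]["free"]` needs no special handling.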

In [85]:
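# Class priors: share of spam and ham messages in the training set
Pspam = train_data['Label'].value_counts()['spam']*1.0/ train_data.shape[0]
Pham = train_data['Label'].value_counts()['ham']*1.0 / train_data.shape[0]
# Total number of words (counting repeats) across all spam / ham messages
Nspam = train_data.loc[train_data['Label'] == 'spam', 
                       'SMS'].apply(len).sum()
Nham = train_data.loc[train_data['Label'] == 'ham',
                      'SMS'].apply(len).sum()
# Vocabulary size: the columns are Label, SMS, and one column per word
Nvoc = len(train_data.columns) - 2
# Laplace (add-one) smoothing parameter
alpha = 1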

print(Pspam)
print(Pham)
print(Nspam)
print(Nham)


0.134589502019
0.865410497981
15190
57105


In [84]:
train_data['Label'].value_counts()['spam']*1.0/train_data.shape[0]

0.13458950201884254

In [86]:
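def p_w_spam(word):
    # P(word | spam) with Laplace smoothing. Words outside the training
    # vocabulary return 1, so they leave the product in classify() unchanged.
    if word in train_data.columns:
        return (train_data.loc[train_data['Label'] == 'spam', word].sum() + alpha)*1.0 / (Nspam + alpha*Nvoc)
    else:
        return 1

def p_w_ham(word):
    # P(word | ham) with Laplace smoothing; same handling of unknown words.
    if word in train_data.columns:
        return (train_data.loc[train_data['Label'] == 'ham', word].sum() + alpha)*1.0 / (Nham + alpha*Nvoc)
    else:
        return 1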
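The two functions above apply additive (Laplace) smoothing: (count of the word in the class + alpha) / (total words in the class + alpha * vocabulary size). A minimal standalone sketch with made-up counts (the helper name `smoothed_likelihood` and all numbers are assumptions for illustration):

```python
def smoothed_likelihood(word_count, class_total, vocab_size, alpha=1):
    """Additive smoothing: (count + alpha) / (total + alpha * vocab_size)."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)

# Made-up numbers: a class with 8 word occurrences in total and a 6-word vocabulary.
seen = smoothed_likelihood(word_count=3, class_total=8, vocab_size=6)    # (3 + 1) / (8 + 6)
unseen = smoothed_likelihood(word_count=0, class_total=8, vocab_size=6)  # (0 + 1) / (8 + 6)

print(seen, unseen)
```

A vocabulary word that never occurs in a class still gets a small non-zero probability, so a single such word cannot force the whole product to zero.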

In [87]:
def classify(message):
    # Start from the class priors, then multiply in P(word | class) for each word
    p_spam_given_message = Pspam
    p_ham_given_message = Pham
    for word in message:
        p_spam_given_message *= p_w_spam(word)
        p_ham_given_message *= p_w_ham(word)
    # Pick the class with the larger (unnormalized) posterior
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'
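Multiplying many small probabilities can underflow to 0.0 for long messages. A common remedy is to add logarithms instead. The sketch below is one possible variant, not the notebook's method: `classify_log`, its dictionary-based inputs, and the toy numbers are all assumptions, and both likelihood tables are assumed to cover the same vocabulary.

```python
import math

def classify_log(message, priors, likelihoods):
    """Return the class with the largest log-posterior score.

    priors: dict mapping class -> P(class)
    likelihoods: dict mapping class -> {word: P(word | class)}
    Words missing from a class's table are skipped, mirroring `return 1` above.
    """
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for word in message:
            if word in likelihoods[label]:
                score += math.log(likelihoods[label][word])
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up probabilities for illustration.
toy_priors = {"spam": 0.2, "ham": 0.8}
toy_likelihoods = {
    "spam": {"free": 0.30, "hello": 0.05},
    "ham": {"free": 0.02, "hello": 0.20},
}

print(classify_log(["free", "free", "unknown"], toy_priors, toy_likelihoods))
print(classify_log(["hello"], toy_priors, toy_likelihoods))
```

Unlike `classify`, this sketch does not return a separate answer for ties; `max` simply returns the first class with the highest score.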

In [94]:
# Classify every message in the test set
test_rest = [classify(test_sm) for test_sm in test_data['SMS']]




In [95]:
test_data['predict'] = test_rest

In [104]:
sum(test_data['Label']==test_data['predict'])*1.0/test_data.shape[0]

0.99102333931777375
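Since roughly 87% of the messages are ham, a classifier that always answers "ham" would already reach about 87% accuracy, so a per-class breakdown is a useful complement. A minimal sketch with `pd.crosstab` on made-up labels (the `actual` and `predicted` series stand in for `test_data['Label']` and `test_data['predict']`):

```python
import pandas as pd

# Made-up labels and predictions for illustration.
actual = pd.Series(["ham", "ham", "ham", "spam", "spam", "ham"], name="actual")
predicted = pd.Series(["ham", "ham", "spam", "spam", "ham", "ham"], name="predicted")

# Rows are true labels, columns are predicted labels.
confusion = pd.crosstab(actual, predicted)

true_pos = confusion.loc["spam", "spam"]
precision = true_pos / confusion["spam"].sum()    # share of flagged messages that are spam
recall = true_pos / confusion.loc["spam"].sum()   # share of actual spam that was flagged

print(confusion)
print(precision, recall)
```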

In [100]:
sum([True,True])

2