# Spam Classifier using Multinomial Naive Bayes from scratch

In this notebook I develop a simple email/SMS spam classifier using the Multinomial Naive Bayes algorithm, written from scratch. To do that, we first need to understand the basic algorithm behind Naive Bayes.
Here I am using the SMS dataset from UCI, which contains almost 5,500 examples.

In [34]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [35]:
df = pd.read_csv("spam.csv",encoding='latin-1')

In [36]:
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


I will map the label column 'v1' to {0,1}: 'ham' becomes 0 and 'spam' becomes 1.

In [37]:
# Named label_map rather than dict, so the built-in dict type is not shadowed.
label_map = {'ham':0,'spam':1}
df['v1'] = df['v1'].map(label_map)
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,0,"Go until jurong point, crazy.. Available only ...",,,
1,0,Ok lar... Joking wif u oni...,,,
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,0,U dun say so early hor... U c already then say...,,,
4,0,"Nah I don't think he goes to usf, he lives aro...",,,


Deleting unnecessary columns from the dataframe.

In [38]:
del df['Unnamed: 2']
del df['Unnamed: 3']
del df['Unnamed: 4']

First we need a way to represent the text data in numerical form. To do this I will use CountVectorizer, which builds a dictionary of all the words present in the corpus (i.e. the whole collection of messages). We can then transform the text data into a matrix whose (i,j)th element is the number of times the jth word appears in the ith document or example.
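
To make the count-matrix idea concrete, here is a minimal sketch in plain Python. The three messages are made up for illustration, and the tokenizer is a simple whitespace split, which is an assumption: CountVectorizer itself also lowercases, drops stop words, and ignores one-character tokens.

```python
from collections import Counter

# Toy corpus (hypothetical messages, not taken from the dataset).
docs = ["free entry free prize", "ok see you", "free prize now"]

# Vocabulary: every distinct word, sorted so the column order is deterministic.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: j for j, word in enumerate(vocab)}

# counts[i][j] = number of times word j appears in document i.
counts = [[0] * len(vocab) for _ in docs]
for i, doc in enumerate(docs):
    for word, n in Counter(doc.split()).items():
        counts[i][index[word]] = n

print(vocab)
for row in counts:
    print(row)
```

Each row is one message and each column is one vocabulary word, so the first row has a 2 in the column for "free".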

In [39]:
from sklearn.feature_extraction.text import CountVectorizer
c_vec = CountVectorizer(lowercase=1,min_df=.00001,stop_words='english')
c_vec.fit(df['v2'].values)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=1, max_df=1.0, max_features=None, min_df=1e-05,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

Splitting the dataframe into train and test dataframes.

In [40]:
train_df = df[0:5000]
test_df = df[5000:]
test_df.index=(range(test_df.shape[0]))
Y_train = train_df['v1'].values


The first method we have to write for Naive Bayes calculates the prior probability of the spam and ham classes. We just count how many spam and ham messages there are in the training data and divide each count by the total number of examples.

In [41]:
def prob_y(Y_train,num_class=2):
    p_y = np.zeros([num_class,])
    n_y = np.zeros([num_class,])
    d_y = Y_train.shape[0]
    for i in range(Y_train.shape[0]):
        n_y[Y_train[i]] = n_y[Y_train[i]]+1
    p_y = n_y/d_y
    return p_y

In [42]:
p_y = prob_y(Y_train)
p_y

array([ 0.8654,  0.1346])
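
As a cross-check on the loop above, the same priors can be computed in vectorized form. This is a small sketch on a hypothetical label vector (the `labels` array is made up, not the notebook's `Y_train`), using `np.bincount` to count each class.

```python
import numpy as np

# Hypothetical label vector: 0 = ham, 1 = spam.
labels = np.array([0, 0, 1, 0, 1, 0, 0, 0])

# np.bincount counts occurrences of each non-negative integer label;
# minlength guarantees one entry per class even if a class is absent.
class_counts = np.bincount(labels, minlength=2)
priors = class_counts / labels.shape[0]

print(priors)
```

With six ham and two spam labels, the priors come out as 0.75 and 0.25, and they always sum to 1.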


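The next method is prob_xy, which computes P(X|Y): the probability of seeing the word X in the class Y. This function produces a
num_class * num_of_words matrix. The first row of the matrix contains P(X|Y=0) and the second row P(X|Y=1). The counts are add-one (Laplace) smoothed, so a word that never occurs in a class still gets a small nonzero probability.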

In [43]:
def prob_xy(c_vec,train_df,Y_train,num_class=2):
    d_y = np.zeros([num_class,])+len(c_vec.vocabulary_)
    p_xy = np.zeros([num_class,len(c_vec.vocabulary_)])
    for i in np.unique(Y_train):
        temp_df = train_df[train_df['v1']==i]
        temp_x = c_vec.transform(temp_df['v2'].values)
        n_xy = np.sum(temp_x,axis=0)+1
        d_y[i] = d_y[i]+np.sum(temp_x)
        p_xy[i] = n_xy/d_y[i] 
    return p_xy

In [44]:
p_xy = prob_xy(c_vec,train_df,Y_train,2)
p_xy

array([[  2.57944697e-05,   2.57944697e-05,   5.15889393e-05, ...,
          2.57944697e-05,   2.57944697e-04,   5.15889393e-05],
       [  5.77064316e-04,   1.52135138e-03,   5.24603924e-05, ...,
          1.04920785e-04,   5.24603924e-05,   5.24603924e-05]])
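
The add-one smoothing inside prob_xy is easier to see on a tiny example. The sketch below uses a hypothetical 2-class, 3-word count matrix (the numbers are invented) and applies the same formula: (count + 1) / (total words in class + vocabulary size).

```python
import numpy as np

# Hypothetical word counts per class: rows are classes, columns are words.
word_counts = np.array([[3, 0, 1],   # class 0 (ham)
                        [0, 4, 0]])  # class 1 (spam)
vocab_size = word_counts.shape[1]

# Add-one smoothing: (count + 1) / (total words in class + vocabulary size).
class_totals = word_counts.sum(axis=1, keepdims=True)
smoothed = (word_counts + 1) / (class_totals + vocab_size)

print(smoothed)
```

Every entry is strictly positive even where the raw count was zero, and each row still sums to 1, so each row remains a valid probability distribution over the vocabulary.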

Now we come to the final stage of this algorithm, where we have to find P(Y|X), i.e. the probability that a document X belongs to class Y. From Bayes' theorem, P(Y|X) = P(X|Y) * P(Y)/P(X). Since P(X) is the same for every class, it can be dropped when comparing classes, so we only need P(X|Y) * P(Y).
Finally, the class label Y for a document X is the class with the larger posterior, i.e. the argmax over P(Y=0|X) and P(Y=1|X).
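
Written out for a single document under the naive (word-independence) assumption, the decision rule looks as follows. The symbols are introduced here for illustration and are not from the original text: V is the vocabulary size, w_j is the jth word, and c_j is the number of times w_j appears in the document.

```latex
\hat{y} \;=\; \arg\max_{y \in \{0,1\}} \; P(Y=y) \prod_{j=1}^{V} P(w_j \mid Y=y)^{c_j}
```

Words that do not appear in the document have c_j = 0 and contribute a factor of 1, so only the words actually present affect the score.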

In [45]:
def classify(c_vec,test_df,p_xy,p_y,num_class=2):
    pred = []
    for doc in test_df['v2'].values:
        temp_doc = (c_vec.transform([doc])).todense()
        temp_prob = np.zeros([num_class,])
        for i in range(num_class):
            temp_prob[i] = np.prod(np.power(p_xy[i],temp_doc))*p_y[i]
        pred.append(np.argmax(temp_prob))
    return pred

In [46]:
pred = classify(c_vec,test_df,p_xy,p_y,num_class=2)
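
One caveat with multiplying many small probabilities is floating-point underflow on long documents. A common remedy, sketched below, is to compare sums of log-probabilities instead. `classify_log` and the toy arrays are hypothetical and not part of the notebook; the function assumes a dense count matrix and strictly positive (smoothed) probabilities.

```python
import numpy as np

def classify_log(counts, p_xy, p_y):
    """Return the argmax class for each row of a dense count matrix.

    counts: (n_docs, n_words) word counts
    p_xy:   (n_classes, n_words) smoothed P(word | class), all entries > 0
    p_y:    (n_classes,) class priors
    """
    # log P(y) + sum_j count_j * log P(word_j | y), for all docs and classes at once.
    log_scores = counts @ np.log(p_xy).T + np.log(p_y)
    return np.argmax(log_scores, axis=1)

# Hypothetical two-class, three-word model.
p_xy = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6]])
p_y = np.array([0.8, 0.2])
counts = np.array([[2, 1, 0],      # mostly word 0
                   [0, 1, 3],      # mostly word 2
                   [0, 0, 2000]])  # long document: the raw product underflows

print(classify_log(counts, p_xy, p_y))
```

For the third toy document the raw product is exactly 0.0 for both classes, so argmax would fall back to class 0, while the log-space score still picks class 1. The short SMS messages used here are unlikely to hit this limit, but longer emails could.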

Now that classification is done, we compute the accuracy on both the training and test data.

In [47]:
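def accuracy(pred,Y):
    # Convert the list of predictions to an array so the comparison is explicitly elementwise.
    return np.sum(np.asarray(pred)==Y)/Y.shape[0]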

In [48]:
Y_test = test_df['v1'].values
test_accuracy = accuracy(pred,Y_test)
print('Test Data Accuracy = '+str(test_accuracy)) 

Test Data Accuracy = 0.984265734266


In [49]:
pred_train = classify(c_vec,train_df,p_xy,p_y,num_class=2)
print('Train Data Accuracy = '+str(accuracy(pred_train,Y_train)))

Train Data Accuracy = 0.995
