<img src="bayes.jpeg">

## Naive Bayes Classification Algorithm
- Naive Bayes is a probabilistic classification algorithm that is commonly used for text classification.
To learn more: [Naive Bayes](https://www.youtube.com/watch?v=sjUDlJfdnKM).
- We will implement Naive Bayes from scratch to build a spam classifier, training it on a dataset from Kaggle: [Kaggle Spam Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset).
- Naive Bayes can outperform many classification algorithms on small datasets.
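
For reference, the decision rule behind the classifier can be sketched as follows. Here $A$ means "the message is spam" and $B$ is the observed message, matching the $P(A)$ and $P(B|A)$ notation used in the code comments of this notebook. The symbols $w_1, \dots, w_n$ (the words of the message) are introduced only for illustration. This is the usual textbook form under the "naive" assumption that words are independent given the class.

$$
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}, \qquad P(B \mid A) \approx \prod_{i=1}^{n} P(w_i \mid A)
$$

A message is labelled spam when

$$
P(A)\prod_{i=1}^{n} P(w_i \mid A) \;>\; P(\lnot A)\prod_{i=1}^{n} P(w_i \mid \lnot A)
$$

The denominator $P(B)$ is the same on both sides, so it drops out of the comparison.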

#### Importing useful modules

In [255]:
import numpy as np
import pandas as pd

#### Visualization of Dataset

In [273]:
data = pd.read_csv("spam.csv")
data.columns = ['clas','mail']

In [274]:
data.head()

| | clas | mail |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


- `clas` contains the class label, either "ham" or "spam".
- `mail` contains the actual text of the mail.

In [285]:
#total number of emails
total_mails = len(data)

In [286]:
total_mails

5568

In [287]:
#distribution of emails
data.clas.value_counts()

ham     4823
spam     745
Name: clas, dtype: int64

In [288]:
spams_prob = data.clas.value_counts(normalize = True)['spam']
non_spams_prob = data.clas.value_counts(normalize = True)['ham']

In [289]:
# prior probability of spam, P(A)
spams_prob

0.13380028735632185

In [290]:
# prior probability of non-spam, P(~A)
non_spams_prob

0.8661997126436781
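
As a quick sanity check, the two priors can be recomputed by hand from the class counts printed above. This is only a minimal sketch of what `value_counts(normalize = True)` works out here; the variable names are made up for the example.

```python
# Class counts as printed by value_counts() above.
ham_count = 4823
spam_count = 745
total = ham_count + spam_count

spam_prior = spam_count / total   # P(A)
ham_prior = ham_count / total     # P(~A)

print(total, spam_prior, ham_prior)
```

The two priors sum to 1 because every mail is labelled either "ham" or "spam".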

### Dataset splitting into Test & Train

In [291]:
#dataset splitting functionality of scikit-learn
from sklearn.model_selection import train_test_split

# x and y are pandas Series
x = data['mail']
y = data['clas']

x_train,x_test,y_train,y_test = train_test_split(x, y, test_size = 0.25)

### Naive Bayes

In [292]:
# Training the model: compute the probabilities needed later to classify test data, i.e. P(B|A) and P(B|~A)
# To be run only once, on the training data

# wordBag holds the frequency of each word in spam ('negative') and non-spam ('positive') mails
wordBag = { 'positive' : {} , 'negative' : {} }

# probBag holds the per-word probabilities P(word|spam) and P(word|not_spam)
probBag = { 'positive' : {} , 'negative' : {} }

def train_model(neg_total,pos_total):
   
    #iterating through training data
    for (email,label) in zip(x_train,y_train):
        # iterating over each word of the email and counting its occurrences per class
        for word in email.split():
            if label == 'spam':
                if word in wordBag['negative']:
                    wordBag['negative'][word] += 1
                else :
                    wordBag['negative'][word] = 1     
                neg_total +=1
            else :
                if word in wordBag['positive']:
                    wordBag['positive'][word] += 1
                else :
                    wordBag['positive'][word] = 1
                pos_total +=1
                
    return (pos_total,neg_total)
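
The counting step above can also be sketched with `collections.Counter` from the standard library, which avoids the explicit "is the key already there" branches. This is an alternative sketch, not the notebook's implementation: `count_words` and the toy messages are hypothetical names and data invented for the example, and the tokenization is the same plain `split()` on whitespace.

```python
from collections import Counter

def count_words(messages, labels):
    """Count word occurrences per class; returns (spam_counts, ham_counts)."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, label in zip(messages, labels):
        # Pick the counter for this message's class, then add its words.
        target = spam_counts if label == 'spam' else ham_counts
        target.update(text.split())
    return spam_counts, ham_counts

# Toy data, invented for illustration only.
toy_messages = ["win cash now", "see you at lunch", "win a free prize now"]
toy_labels = ["spam", "ham", "spam"]

spam_counts, ham_counts = count_words(toy_messages, toy_labels)
spam_total = sum(spam_counts.values())  # total number of spam tokens
ham_total = sum(ham_counts.values())    # total number of ham tokens
print(spam_counts["win"], spam_total, ham_total)
```

A `Counter` returns 0 for a word it has never seen, so lookups on unknown words do not raise a `KeyError`.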
    

In [293]:
def populate_probabilities(): 
    # relative frequency of each word within its class; uses the global totals returned by train_model
    for word, count in wordBag['negative'].items():
        probBag['negative'][word] = count/float(neg_total)
    for word, count in wordBag['positive'].items():
        probBag['positive'][word] = count/float(pos_total)

In [294]:
#calling train function
pos_total,neg_total = train_model(0,0)
#calculating probabilities
populate_probabilities()
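
The probabilities above are plain relative frequencies, so a word that never occurred in one class has no entry for that class at all. A common remedy is Laplace (add-one) smoothing. The sketch below is a swapped-in technique, not what this notebook does: `smoothed_probabilities`, `alpha`, and the toy counts are hypothetical names and values chosen for illustration.

```python
from collections import Counter

def smoothed_probabilities(counts, vocabulary, alpha=1.0):
    """Estimate P(word|class) with add-alpha smoothing over a shared vocabulary."""
    total = sum(counts.values())
    denom = total + alpha * len(vocabulary)
    # Every vocabulary word gets a non-zero probability, even with a zero count.
    return {word: (counts.get(word, 0) + alpha) / denom for word in vocabulary}

# Toy counts, invented for illustration only.
spam_counts = Counter({"win": 2, "cash": 1})
ham_counts = Counter({"lunch": 1, "cash": 1})
vocabulary = set(spam_counts) | set(ham_counts)

p_word_given_spam = smoothed_probabilities(spam_counts, vocabulary)
p_word_given_ham = smoothed_probabilities(ham_counts, vocabulary)
print(p_word_given_spam["win"], p_word_given_ham["win"])
```

With smoothing, both classes assign a probability to every word in the vocabulary, so each class multiplies in the same number of factors when a message is scored.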

In [317]:
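def classify(feature, spam_prob, non_spam_prob):
    # spam_prob and non_spam_prob start as the priors P(A) and P(~A);
    # each known word multiplies in its P(word|spam) or P(word|not_spam).
    for word in feature.split():
        # Words missing from a bag are skipped for that class only (no smoothing),
        # so a word seen in just one class shrinks only that class's score.
        if word in probBag['negative']:   # 'negative' holds the spam statistics
            spam_prob = spam_prob * probBag['negative'][word]
        if word in probBag['positive']:   # 'positive' holds the non-spam statistics
            non_spam_prob = non_spam_prob * probBag['positive'][word]

    # True means the mail is classified as spam
    return spam_prob > non_spam_prob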
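
Multiplying many small probabilities can underflow to `0.0` for long messages, at which point the comparison stops being meaningful. A common workaround is to add logarithms instead of multiplying. The following is a minimal sketch under stated assumptions, not the notebook's method: `classify_log` is a hypothetical helper, the probability tables are toy values invented for the example, the priors are rounded from the ones computed earlier, and only words known to both classes are scored so that both sides receive the same number of factors.

```python
import math

def classify_log(text, p_spam, p_ham, spam_probs, ham_probs):
    """Return True if the message scores higher as spam than as ham."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in text.split():
        # Assumption: skip words that are not known to both classes.
        if word in spam_probs and word in ham_probs:
            log_spam += math.log(spam_probs[word])
            log_ham += math.log(ham_probs[word])
    return log_spam > log_ham

# Toy per-word probabilities, invented for illustration only.
toy_spam_probs = {"win": 0.5, "cash": 0.3, "lunch": 0.2}
toy_ham_probs = {"win": 0.05, "cash": 0.15, "lunch": 0.8}

print(classify_log("win win cash", 0.134, 0.866, toy_spam_probs, toy_ham_probs))
print(classify_log("lunch", 0.134, 0.866, toy_spam_probs, toy_ham_probs))

# Why logs help: a product of 400 factors of 0.01 underflows to zero,
# while the equivalent sum of logs stays a finite number.
underflowed = 0.01 ** 400
log_sum = 400 * math.log(0.01)
print(underflowed, log_sum)
```

Because the logarithm is monotonic, comparing the summed logs gives the same decision as comparing the products whenever the products do not underflow.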