## Naive Bayes Classifier Homework
Student name: Dmitry Timerbaev

### Data preprocessing
Before building the Naive Bayes classifier, we need to preprocess the data:

1) Import the libraries, load the dataset, and separate the text data from the labels

2) Remove punctuation (unnecessary for classification) and convert the text to lowercase

3) Build a new list with the processed text data

In [1]:
# import all necessary libraries
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

In [2]:
# load data
data = pd.read_csv('spam.csv', encoding='latin-1')
data.head()

|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


In [3]:
# separate the text from the labels; encode the labels as 1 for spam, 0 for not spam
x = data['v2']
y = data['v1']
y = [1 if i == 'spam' else 0 for i in y]

In [4]:
# define a function that removes the listed punctuation marks from a string
def punctuation(string):

    # punctuation marks to be removed
    punctuations = '''!;:,.?()'[]<>'''

    # keep only the characters that are not punctuation marks
    cleaned = "".join(ch for ch in string if ch not in punctuations)

    # return the cleaned string in lowercase letters
    return cleaned.lower()

In [5]:
# build a list of token lists: split each message on whitespace and clean every token
words_list = []
for message in x:
    tokens = []
    for t in message.split():
        tokens.append(punctuation(t))
    words_list.append(tokens)
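
To make the effect of the preprocessing concrete, here is a small self-contained sketch applied to one message from the dataset preview. The helper name `clean_token` is hypothetical and only mirrors the behavior described above (same punctuation set, then lowercase).

```python
# Characters removed by the preprocessing step (same set as in the notebook).
PUNCTUATIONS = "!;:,.?()'[]<>"

def clean_token(token):
    """Drop the listed punctuation marks and lowercase the token."""
    return "".join(ch for ch in token if ch not in PUNCTUATIONS).lower()

message = "Ok lar... Joking wif u oni..."
tokens = [clean_token(t) for t in message.split()]
print(tokens)  # ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```

Note that a token made up only of punctuation, such as `...`, becomes an empty string rather than being dropped, and characters outside the listed set (digits, hyphens, double quotes) are kept.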

### Hand-made NB Classifier

Building a simple Naive Bayes classifier requires:

1) Making separate lists for the text data (spam_list for spam texts, ham_list for non-spam texts)

2) Making dictionaries that count words within the texts (spam_dict for spam texts, ham_dict for non-spam texts)

3) Calculating the class totals and each class's share of all messages

4) Calculating the probabilities for each text using Bayes' rule
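
Written out, the decision rule that steps 3 and 4 aim for can be sketched as follows. The symbols are introduced here for illustration: $P(c)$ is the share of training messages in class $c$ (percent_spam or percent_ham), $\operatorname{count}(w_i, c)$ is the entry for word $w_i$ in spam_dict or ham_dict, $N_c$ is the total word count of the class (total_spam or total_ham), and $|V_c|$ is the number of distinct words seen in that class.

```latex
\hat{c} \;=\; \arg\max_{c \in \{\text{spam},\, \text{ham}\}}
\; P(c) \prod_{i=1}^{n} \frac{\operatorname{count}(w_i, c) + 1}{N_c + |V_c|}
```

The $+1$ in the numerator is add-one (Laplace) smoothing, which keeps a word unseen in one class from forcing the whole product to zero. Textbook formulations usually put the size of the shared vocabulary in the denominator; the per-class $|V_c|$ shown here follows the smoothing terms used in this notebook.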

Accuracy is used as the performance metric. The classifier is tested with 5-fold cross-validation, and the accuracy score of each fold is stored in the accuracy1 list.

In [6]:
# perform 5-fold stratified CV on words_list
# dtype=object keeps each message as a list of tokens (the messages differ in length)
words_list = np.array(words_list, dtype=object)
y = np.array(y)
accuracy1 = []
kfold = StratifiedKFold(n_splits=5)
for train_index, test_index in kfold.split(words_list,y):
    x_train, x_test = words_list[train_index], words_list[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # collect the tokenized spam and non-spam training messages
    spam_list = []
    ham_list = []
    for i in range(len(y_train)):
        if y_train[i] == 1:
            spam_list.append(x_train[i])
        elif y_train[i] == 0:
            ham_list.append(x_train[i])
    # count how often each word occurs in spam and non-spam training messages
    spam_dict = {}
    ham_dict = {}
    
    for i in range(len(y_train)):
        if y_train[i] == 1:
            for element in x_train[i]:
                if element in spam_dict:
                    spam_dict[element] += 1
                else:
                    spam_dict[element] = 1
        elif y_train[i] == 0:
            for element in x_train[i]:
                if element in ham_dict:
                    ham_dict[element] += 1
                else:
                    ham_dict[element] = 1
    # calculate the class shares (priors) and the total word count of each class
    total_sms = len(spam_list) + len(ham_list)
    percent_spam = len(spam_list) / total_sms 
    percent_ham = len(ham_list) / total_sms
    total_spam = sum(spam_dict.values())
    total_ham = sum(ham_dict.values())
    # make predictions
    y_pred = []
    for i in range(len(x_test)):
        spam_probability = 1
        ham_probability = 1
        # look up each word's count in the spam training messages (0 if unseen)
        for word in x_test[i]:
            word_in_spam = spam_dict.get(word, 0)
        # add-one smoothing; this update runs once per message, after the loop above,
        # so only the last word's count enters spam_probability
        spam_probability *= (word_in_spam + 1) / (total_spam + len(spam_dict))
        # same lookup and single update for the non-spam class
        for word in x_test[i]:
            word_in_ham = ham_dict.get(word, 0)
        ham_probability *= (word_in_ham + 1) / (total_ham + len(ham_dict))
    
        final_spam = spam_probability * percent_spam
        final_ham = ham_probability * percent_ham
    
        if final_spam > final_ham:
            y_pred.append(1)
        else:
            y_pred.append(0)
    # calculate accuracy and append to accuracy1 list
    score = accuracy_score(y_test,y_pred)
    accuracy1.append(score)

In [7]:
# check accuracy metrics for the k-folds
print(accuracy1)

[0.9130044843049328, 0.9255605381165919, 0.9192100538599641, 0.9048473967684022, 0.9066427289048474]
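
The prediction loop above applies its smoothed update once per message. For comparison, here is a minimal sketch of the more common formulation, which adds one smoothed term per word and works in log space to avoid numerical underflow on long messages. It is not the code that produced the scores above: the helper names `train_counts` and `predict`, the toy data, and the use of a shared vocabulary in the denominator are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_counts(messages, labels):
    """Count word occurrences per class (1 = spam, 0 = ham) and messages per class."""
    counts = {0: Counter(), 1: Counter()}
    n_messages = Counter(labels)
    for words, label in zip(messages, labels):
        counts[label].update(words)
    return counts, n_messages

def predict(words, counts, n_messages):
    """Return 1 (spam) or 0 (ham) using add-one smoothing in log space."""
    total_messages = sum(n_messages.values())
    # assumption: smooth over the shared vocabulary of both classes
    vocabulary = set(counts[0]) | set(counts[1])
    scores = {}
    for label in (0, 1):
        total_words = sum(counts[label].values())
        # start from the log prior of the class
        score = math.log(n_messages[label] / total_messages)
        # add one smoothed log-likelihood term for every word in the message
        for word in words:
            score += math.log((counts[label][word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return 1 if scores[1] > scores[0] else 0

# toy data, invented for illustration only
train_x = [["win", "free", "prize"], ["free", "entry", "win"],
           ["see", "you", "later"], ["ok", "see", "you"]]
train_y = [1, 1, 0, 0]
counts, n_messages = train_counts(train_x, train_y)
print(predict(["free", "prize", "now"], counts, n_messages))  # 1
print(predict(["see", "you", "now"], counts, n_messages))     # 0
```

Summing logarithms gives the same ranking as multiplying probabilities, since the logarithm is monotonic, but it stays well within floating-point range even for messages with many words.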


### Sklearn NB Classifier

Now we use sklearn's multinomial NB classifier. Before fitting the classifier, we need to apply a count vectorizer to the text data.

The accuracy score of each fold is recorded in the accuracy2 list.
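
As a rough illustration of what a count vectorizer produces, the sketch below builds a bag-of-words count matrix in plain Python. It is a simplification, not sklearn's implementation: the helpers `fit_vocabulary` and `transform` are hypothetical, and they split on whitespace only, whereas CountVectorizer's default tokenizer also drops punctuation and single-character tokens.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Map each distinct lowercase whitespace token to a column index."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(documents, vocabulary):
    """Turn each document into a row of token counts; unseen tokens are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

# toy documents, invented for illustration only
train_docs = ["free entry free prize", "see you later"]
test_docs = ["free prize tomorrow"]

vocab = fit_vocabulary(train_docs)
print(vocab)  # {'entry': 0, 'free': 1, 'later': 2, 'prize': 3, 'see': 4, 'you': 5}
print(transform(train_docs, vocab))  # [[1, 2, 0, 1, 0, 0], [0, 0, 1, 0, 1, 1]]
print(transform(test_docs, vocab))   # [[0, 1, 0, 1, 0, 0]]
```

The vocabulary is learned from the training documents only, so a word that first appears in the test data ("tomorrow" here) has no column and is ignored. Fitting on the training fold and only transforming the test fold follows the same idea.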

In [8]:
# create the count vectorizer and prepare the raw text and labels as arrays
counter = CountVectorizer()
x = data['v2'].tolist()
x = np.array(x)
y = np.array(y)
# perform 5-fold CV
accuracy2 = []
kfold = StratifiedKFold(n_splits=5)
for train_index, test_index in kfold.split(x,y):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # fit the vocabulary on the training fold only, then transform both folds
    counter.fit(x_train)
    train_counts = counter.transform(x_train)
    test_counts = counter.transform(x_test)
    # fit a fresh classifier on this fold and predict
    classifier = MultinomialNB()
    classifier.fit(train_counts, y_train)
    y_pred = classifier.predict(test_counts)
    score = accuracy_score(y_test, y_pred)
    accuracy2.append(score)

In [9]:
# check accuracy metrics for the k-folds
print(accuracy2)

[0.9865470852017937, 0.9865470852017937, 0.9847396768402155, 0.9829443447037702, 0.9856373429084381]


### Models evaluation
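Both the hand-made and sklearn classifiers achieved high test accuracy. The sklearn classifier was clearly more accurate, averaging about 0.985 across the five folds against about 0.914 for the hand-made one. Part of the gap likely comes from the hand-made prediction step, which applies its smoothed probability update once per message rather than once per word.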

In [10]:
# create table to compare average accuracy on hand-made and sklearn classifiers
info = {'Model': ['Hand-Made','Sklearn'],
       'Average Score': [np.mean(accuracy1), np.mean(accuracy2)]}
df = pd.DataFrame(info,columns=['Model','Average Score'])

df

|   | Model | Average Score |
|---|---|---|
| 0 | Hand-Made | 0.913853 |
| 1 | Sklearn | 0.985283 |
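
Averages hide how much the scores vary from fold to fold. The sketch below, which uses only the standard library and the fold scores printed earlier, reports the mean together with the sample standard deviation for each model. The `summary` dictionary is a hypothetical name introduced for this example.

```python
from statistics import mean, stdev

# fold accuracies copied from the outputs above
accuracy1 = [0.9130044843049328, 0.9255605381165919, 0.9192100538599641,
             0.9048473967684022, 0.9066427289048474]
accuracy2 = [0.9865470852017937, 0.9865470852017937, 0.9847396768402155,
             0.9829443447037702, 0.9856373429084381]

# mean and sample standard deviation of the five fold scores per model
summary = {
    name: (mean(scores), stdev(scores))
    for name, scores in [("Hand-Made", accuracy1), ("Sklearn", accuracy2)]
}

for name, (avg, spread) in summary.items():
    print(f"{name}: mean={avg:.6f}, std={spread:.6f}")
```

A small standard deviation relative to the difference between the two means suggests the ranking of the models does not depend on a particular fold.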
