# Spam or Ham Project using Naive Bayes

Here we use the Naive Bayes algorithm to build a model that classifies the SMS messages in a dataset as spam or not spam (ham).

In [2]:
# importing essential libraries
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
# loading the dataset
df = pd.read_table('SMSSpamCollection', sep='\t',
                   header=None, names=['label', 'sms_message'])
# printing shape of dataframe
print(df.shape)
df.head()

(5572, 2)


| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [22]:
# convert the labels to numbers: ham -> 0, spam -> 1
df['label'] = df.label.map({'ham':0,'spam':1})
print(df.shape)
df.head()

(5572, 2)


| | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


## Implementing Bag of Words from scratch

In [23]:
# convert into lower case
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

lower_case_documents = []

for i in documents:
    lower_case_documents.append(i.lower())
    
print(lower_case_documents)

['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello, call hello you tomorrow?']


In [24]:
# removing all punctuation
import string

sans_punctuation_documents = []

for i in lower_case_documents:
    # str.maketrans with a third argument maps those characters to None
    sans_punctuation_documents.append(
        i.translate(str.maketrans("", "", string.punctuation)))
    
print(sans_punctuation_documents)

['hello how are you', 'win money win from home', 'call me now', 'hello call hello you tomorrow']


In [25]:
# tokenisation
# means splitting up a sentence into individual words
preprocessed_documents = []

for i in sans_punctuation_documents:
    # split() with no argument splits on any run of whitespace
    preprocessed_documents.append(i.split())
    
print(preprocessed_documents)

[['hello', 'how', 'are', 'you'], ['win', 'money', 'win', 'from', 'home'], ['call', 'me', 'now'], ['hello', 'call', 'hello', 'you', 'tomorrow']]


In [26]:
# counting the occurrence of each word in each 
# document of the document set

frequency_list = []
import pprint
from collections import Counter

for i in preprocessed_documents:
    frequency_counts = Counter(i)
    frequency_list.append(frequency_counts)
    
pprint.pprint(frequency_list)

[Counter({'hello': 1, 'how': 1, 'are': 1, 'you': 1}),
 Counter({'win': 2, 'money': 1, 'from': 1, 'home': 1}),
 Counter({'call': 1, 'me': 1, 'now': 1}),
 Counter({'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1})]
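
The four steps above can be folded into a single helper, and the per-document counters can be laid out as a matrix with one column per word. The following is only a sketch of one way to do that, not part of the original notebook: the names `bag_of_words`, `vocabulary`, and `matrix` are introduced here for illustration, and the alphabetical column order is an assumption chosen to make the result easy to read.

```python
import string
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, matrix) for a list of raw text documents."""
    # Steps 1-3: lower-case, strip punctuation, split on whitespace.
    strip_punctuation = str.maketrans("", "", string.punctuation)
    tokenized = [doc.lower().translate(strip_punctuation).split()
                 for doc in documents]

    # Step 4: count the words in each document.
    counts = [Counter(tokens) for tokens in tokenized]

    # Collect every distinct word and sort it to fix the column order.
    vocabulary = sorted(set().union(*counts))

    # One row per document, one column per vocabulary word.
    # A Counter returns 0 for words that do not occur in a document.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

vocabulary, matrix = bag_of_words(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has the same length as the vocabulary, so documents of different lengths become vectors of equal size, which is what a classifier needs as input.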


## Implementing Bag of Words in scikit-learn



In [27]:
documents = ['Hello, how are you!',
                'Win money, win from home.',
                'Call me now.',
                'Hello, Call hello you tomorrow?']

In [28]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()

In [29]:
print(count_vector)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)


In [30]:
count_vector.fit(documents)
count_vector.get_feature_names()

['are',
 'call',
 'from',
 'hello',
 'home',
 'how',
 'me',
 'money',
 'now',
 'tomorrow',
 'win',
 'you']

The get_feature_names() method returns the feature names for this dataset, that is, the set of words that make up the vocabulary of 'documents'. (In scikit-learn 1.0 and later the equivalent method is get_feature_names_out(); get_feature_names() was removed in version 1.2.)

Now, create a matrix with one row for each of the 4 documents and one column for each word. The value at (row, column) is the number of times the word in that column occurs in the document in that row.

In [31]:
doc_array = count_vector.transform(documents).toarray()
doc_array

array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1]], dtype=int64)

In [32]:
frequency_matrix = pd.DataFrame(doc_array,
                                columns=count_vector.get_feature_names())
frequency_matrix

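| | are | call | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |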


## Training and Testing sets

In [33]:
from sklearn.model_selection import train_test_split

# splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df['sms_message'], df['label'], random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393


## Applying Bag of Words processing to our dataset

In [34]:
count_vector = CountVectorizer()

training_data = count_vector.fit_transform(X_train)

testing_data = count_vector.transform(X_test)
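
The cell above calls `fit_transform` on the training messages but only `transform` on the test messages. The reason is that the vocabulary (the set of columns) should be learned from the training data alone, so that both matrices share the same columns and nothing about the test set leaks into the model. The pure-Python sketch below imitates that split under simplifying assumptions: `tokenize`, `fit_vocabulary`, and `transform` are hypothetical helpers written for this illustration, the toy messages are made up, and the tokenizer is simpler than the one `CountVectorizer` uses by default.

```python
import string
from collections import Counter

_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def tokenize(text):
    # Simplified tokenizer: lower-case, drop punctuation, split on whitespace.
    return text.lower().translate(_STRIP_PUNCTUATION).split()


def fit_vocabulary(train_docs):
    # "fit": learn the word -> column mapping from the training documents only.
    words = sorted({w for doc in train_docs for w in tokenize(doc)})
    return {word: column for column, word in enumerate(words)}


def transform(docs, vocabulary):
    # "transform": count only the words that already have a column.
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word, n in Counter(tokenize(doc)).items():
            if word in vocabulary:  # words never seen in training are dropped
                row[vocabulary[word]] = n
        rows.append(row)
    return rows


train_docs = ['Win money now', 'Call me now']  # hypothetical toy data
test_docs = ['Win a free prize now']

vocabulary = fit_vocabulary(train_docs)
train_rows = transform(train_docs, vocabulary)
test_rows = transform(test_docs, vocabulary)

print(vocabulary)
print(test_rows)
```

The words "a", "free", and "prize" do not appear in the training messages, so they have no column and are ignored in the test row, while "win" and "now" are counted in the columns learned during fitting.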

## Naive Bayes implementation using scikit-learn

In [36]:
from sklearn.naive_bayes import MultinomialNB

naive_bayes = MultinomialNB()

# fitting the model
naive_bayes.fit(training_data,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [37]:
# predicting labels using test set
predictions = naive_bayes.predict(testing_data)
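
To make the fit and predict steps less of a black box, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier. It is an illustration of the calculation under stated assumptions, not scikit-learn's implementation: the functions `train_multinomial_nb` and `predict` and the toy messages are hypothetical, and the smoothing value `alpha=1.0` simply mirrors the default shown in the output above. For each class it stores the log prior and, for each word, a smoothed log likelihood (count of the word in the class plus `alpha`, divided by the total word count of the class plus `alpha` times the vocabulary size).

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(token_docs, labels, alpha=1.0):
    """Fit log priors and smoothed log likelihoods from tokenized documents."""
    vocabulary = sorted({w for doc in token_docs for w in doc})
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    for doc, label in zip(token_docs, labels):
        word_counts[label].update(doc)

    log_prior, log_likelihood = {}, {}
    for c in class_doc_counts:
        # Prior: fraction of training documents that belong to class c.
        log_prior[c] = math.log(class_doc_counts[c] / len(labels))
        # Likelihood: additive smoothing keeps unseen words away from log(0).
        denominator = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denominator)
            for w in vocabulary
        }
    return log_prior, log_likelihood


def predict(tokens, log_prior, log_likelihood):
    """Return the class with the highest log-posterior score."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in log_likelihood[c])
    return max(scores, key=scores.get)


# Hypothetical toy training set: 1 = spam, 0 = ham.
train_tokens = [['win', 'money', 'now'],
                ['free', 'money', 'win'],
                ['call', 'me', 'now'],
                ['hello', 'how', 'are', 'you']]
train_labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_multinomial_nb(train_tokens, train_labels)
spam_guess = predict(['win', 'free', 'money'], log_prior, log_likelihood)
ham_guess = predict(['call', 'you', 'now'], log_prior, log_likelihood)
print(spam_guess, ham_guess)
```

Working in log space turns the product of many small word probabilities into a sum, which avoids numerical underflow on long messages.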

For classification problems with skewed class distributions, as in our case, accuracy by itself is not a very good metric. For example, suppose we had 100 text messages of which only 2 were spam and the other 98 were not.<br> We could classify 90 messages as not spam (including the 2 spam messages, which would be false negatives) and 10 as spam (all 10 of them false positives) and still get an accuracy of 0.88, even though not a single spam message was caught.<br> For such cases, precision and recall come in very handy: precision is the fraction of messages flagged as spam that really are spam, and recall is the fraction of actual spam messages that were flagged.<br>These two metrics can be combined into the F1 score, which is the harmonic mean of precision and recall.<br>This score ranges from 0 to 1, with 1 being the best possible F1 score.
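
The 100-message example can be checked by hand. The sketch below computes the four metrics directly from confusion-matrix counts; `classification_metrics` is a hypothetical helper written for this illustration (it is not a scikit-learn function), and returning 0.0 when a denominator is zero is an assumption made to keep the example simple.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Guard against division by zero when nothing is predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# 100 messages, 2 of them spam.
# 10 are predicted as spam, all wrongly (false positives);
# 90 are predicted as not spam, including the 2 real spam (false negatives).
accuracy, precision, recall, f1 = classification_metrics(tp=0, fp=10, fn=2, tn=88)
print(accuracy, precision, recall, f1)  # 0.88 0.0 0.0 0.0
```

An accuracy of 0.88 looks respectable, yet precision, recall, and F1 are all 0 because no spam message was identified correctly.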

In [38]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562


One of the major advantages that Naive Bayes has over other classification algorithms is its <i>ability to handle an extremely large number of features</i>.<br>In our case, <i>each word is treated as a feature</i> and there are thousands of different words.<br> It also <i>performs well even in the presence of irrelevant features</i> and is relatively unaffected by them.<br> The other major advantage is its relative simplicity.<br> Naive Bayes works well right out of the box and <i>tuning its parameters is rarely necessary</i>, except usually in cases where the distribution of the data is known.<br> It <i>rarely overfits</i> the data.<br> Another important advantage is that its model training and prediction times are <i>very fast for the amount of data it can handle.</i>