# Building a Spam Filter Using the Naive Bayes Algorithm

This project aims to build a spam filter for SMS messages. To classify messages, the computer:

* Learns how humans classify messages.
* Uses that human knowledge to estimate two probabilities for each new message: the probability that it is spam and the probability that it is non-spam.
* Classifies a new message based on these probability values. If the probability for spam is greater, it classifies the message as spam; otherwise, it classifies it as non-spam. If the probabilities are equal, a human may need to classify the message.

The first task will be to "teach" the computer how to classify messages. This will be done using the multinomial Naive Bayes algorithm along with a [dataset](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) containing 5,572 SMS messages that have already been classified by humans. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo. The data collection process is described in more detail on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the authors' papers.
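
To make the idea concrete before handing the work to a library, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. The four training messages, the function names, and the `"needs human review"` tie label are all made up for illustration; they are not part of the real dataset or of any library.

```python
import math
from collections import Counter

# Hypothetical toy training set of (label, message) pairs, not taken from the real dataset.
TRAIN = [
    ("spam", "win money now"),
    ("spam", "free prize win"),
    ("ham", "are you free now"),
    ("ham", "see you at lunch"),
]


def train_multinomial_nb(examples, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods for each class."""
    class_counts = Counter(label for label, _ in examples)
    word_counts = {label: Counter() for label in class_counts}
    for label, text in examples:
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}

    log_prior = {c: math.log(n / len(examples)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c, counts in word_counts.items():
        # Smoothing adds alpha to every word count so unseen words never get probability 0.
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood


def classify(message, log_prior, log_likelihood):
    """Return the class with the higher log posterior, or flag a tie for a human."""
    scores = {}
    for c in log_prior:
        # Words that never appeared in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in message.split() if w in log_likelihood[c]
        )
    best, runner_up = sorted(scores, key=scores.get, reverse=True)[:2]
    if math.isclose(scores[best], scores[runner_up]):
        return "needs human review"
    return best


log_prior, log_likelihood = train_multinomial_nb(TRAIN)
print(classify("win free prize", log_prior, log_likelihood))    # spam
print(classify("see you at lunch", log_prior, log_likelihood))  # ham
print(classify("hello", log_prior, log_likelihood))             # needs human review
```

Working with log probabilities avoids multiplying many small numbers together, which would otherwise underflow on longer messages.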

Instead of following the guided project, I will use scikit-learn's built-in Naive Bayes class, following an example from [Stack Abuse](https://stackabuse.com/the-naive-bayes-algorithm-in-python-with-scikit-learn/).

In [16]:
import numpy as np
import pandas as pd
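import sklearn
import sklearn.feature_extraction.text  # CountVectorizer, TfidfTransformer
import sklearn.metrics  # confusion_matrix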
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import nltk

In [2]:
# Read in the data
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

First, clean the data into a format that the built-in functions can read.

In [3]:
# Convert classifier labels from strings to binary

df['Label'] = df.Label.map({'ham': 0, 'spam': 1})

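# Replace punctuation (any non-word character) with spaces and convert to lower case
# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default

df['SMS'] = df.SMS.str.replace(r'\W', ' ', regex=True).str.lower()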

# Tokenize the messages into single words (NLTK's tokenizer data must be downloaded beforehand)
df['SMS'] = df.SMS.apply(nltk.word_tokenize)
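
The cleaning step amounts to a regular-expression substitution followed by tokenization. The sketch below previews it on a single message using only the standard library. The sample message is made up, `clean_and_tokenize` is a hypothetical helper, and `str.split` is a stand-in for `nltk.word_tokenize`, so the real tokens may differ slightly.

```python
import re


def clean_and_tokenize(message):
    """Replace non-word characters with spaces, lowercase, and split on whitespace."""
    # \W matches anything that is not a letter, digit, or underscore.
    no_punct = re.sub(r"\W", " ", message)
    return no_punct.lower().split()


sample = "WINNER!! Call 09061701461 now, it's FREE."  # hypothetical message
tokens = clean_and_tokenize(sample)
print(tokens)
# ['winner', 'call', '09061701461', 'now', 'it', 's', 'free']
```

Note that digits survive the substitution and that the apostrophe in "it's" splits the word in two, which is a side effect of treating every non-word character as a separator.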

In [5]:
# Normalise the text so that variations of a word (for example, different tenses) map to the same token
# This is known as word stemming

stemmer = nltk.stem.PorterStemmer()
df['SMS'] = df['SMS'].apply(lambda x: [stemmer.stem(y) for y in x])

# Convert each word list back into a space-separated string
df['SMS'] = df['SMS'].apply(lambda x: ' '.join(x))
# Finally, transform the data into word occurrence counts, the features we feed into the model
count_vect = sklearn.feature_extraction.text.CountVectorizer()
counts = count_vect.fit_transform(df['SMS'])

We could leave the features as simple word counts per message, but it is usually better to weight them using term frequency-inverse document frequency, better known as `tf-idf`, which down-weights words that appear in many messages.
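
As a worked illustration of the weighting, the sketch below computes tf-idf by hand for a tiny count matrix. It uses the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which to my understanding matches the default settings of scikit-learn's `TfidfTransformer`. The three-message count matrix and the `tfidf` helper are made up for this example.

```python
import math


def tfidf(count_rows):
    """Weight raw counts by smoothed inverse document frequency, then L2-normalize each row."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many messages does each term appear?
    doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

    result = []
    for row in count_rows:
        weighted = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


# Hypothetical counts; columns are the terms "call", "free", "lunch".
toy_counts = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
]
weights = tfidf(toy_counts)
for row in weights:
    print([round(v, 3) for v in row])
```

In the first message, "call" and "free" both occur once, but "free" ends up with the larger weight because it appears in only one of the three messages while "call" appears in all of them.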

In [7]:
transformer = sklearn.feature_extraction.text.TfidfTransformer().fit(counts)
counts = transformer.transform(counts)

Now that we have conducted feature extraction, we will build the model.

We start by splitting the data into training and test sets:

In [19]:
X_train, X_test, y_train, y_test = train_test_split(counts, df['Label'], 
                                                    test_size=0.1, random_state=1
                                                   )
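
To see what `test_size=0.1` means in terms of row counts, here is a small sketch of a seeded shuffle-and-cut split. `split_indices` is a hypothetical helper, and its shuffle will not select the same rows as scikit-learn does for `random_state=1`. It only illustrates the sizes, assuming the test portion is rounded up, which is consistent with the 558-row test set shown at the end of this notebook.

```python
import math
import random


def split_indices(n_samples, test_size, seed):
    """Shuffle row indices and cut off a test portion, rounding the test size up."""
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(5572, test_size=0.1, seed=1)
print(len(train_idx), len(test_idx))  # 5014 558
```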

Now we just need to initialize the Naive Bayes classifier and fit it to the training data.

In [20]:
model = MultinomialNB().fit(X_train, y_train)

Now test its performance using the test set:

In [21]:
predicted = model.predict(X_test)

print(np.mean(predicted == y_test))

0.9695340501792115


Our simple Naive Bayes classifier has 97% accuracy on this specific test set! Accuracy alone is not enough, however, because the labels in our dataset are imbalanced (86.6% legitimate versus 13.4% spam). The classifier could be favoring the legitimate class while largely ignoring the spam class. To check, let's have a look at the confusion matrix:

In [22]:
print(sklearn.metrics.confusion_matrix(y_test, predicted))

[[488   1]
 [ 16  53]]


The errors are not well balanced between the two classes: only 1 legitimate message was predicted as spam, but 16 of the 69 spam messages were predicted as legitimate.
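
The confusion matrix above can be turned into per-class metrics that make this imbalance explicit. The sketch below hard-codes the four counts from the printed matrix (rows are true classes, columns are predicted classes, with ham first) and compares the model against a baseline that always predicts "legitimate". The variable names are my own.

```python
# Counts copied from the confusion matrix printed above.
confusion = [[488, 1],
             [16, 53]]
(tn, fp), (fn, tp) = confusion  # spam is treated as the positive class
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
baseline_accuracy = (tn + fp) / total  # always predicting "legitimate"
precision = tp / (tp + fp)             # share of predicted spam that really is spam
recall = tp / (tp + fn)                # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.3f}")           # 0.970
print(f"baseline:  {baseline_accuracy:.3f}")  # 0.876
print(f"precision: {precision:.3f}")          # 0.981
print(f"recall:    {recall:.3f}")             # 0.768
print(f"f1:        {f1:.3f}")                 # 0.862
```

The model clearly beats the always-legitimate baseline, and almost everything it flags is spam, but it still lets roughly a quarter of the spam through.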

In [23]:
predicted

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [25]:
y_test

1078    0
4028    0
958     0
4642    0
4674    0
       ..
3529    1
5488    0
5134    0
5       1
1289    0
Name: Label, Length: 558, dtype: int64