# Building a Spam Classifier

The goal of this project is to build a spam classifier for e-mail filtering.

For this task, we use a training data set composed of 2646 e-mails, where 1214 were previously labeled as "spam" and 1432 as "notspam".

A "bag of words" model is used to encode each e-mail message as a set of textual features, and then the Naive Bayes learning algorithm is used to learn the predictive model. Two variations of the Naive Bayes algorithm are tried: the so-called Multinomial Naive Bayes and Bernoulli Naive Bayes.

The predictive models are evaluated on a test set containing 2554 e-mails, of which 1185 are "spam" and 1369 are "notspam".

## 1 Preprocessing

When we work with textual data, a lot of preprocessing may be required, depending on the application. In this case, since we're going to use the "bag of words" model, we only need to get rid of some unnecessary symbols and words, do some stemming to reduce the feature domain, and then encode the results as TF-IDF (term frequency-inverse document frequency) vectors.
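To make the TF-IDF encoding concrete, here is a minimal pure-Python sketch on a hypothetical three-message corpus. It assumes whitespace tokenization and the smoothed IDF variant, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization, which is my understanding of the default scikit-learn settings. The corpus and the function name `tfidf` are made up for illustration and are not part of the project code.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed variant, see the text above).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

# Hypothetical toy corpus, not taken from the e-mail data set.
docs = ["free money now", "meeting now", "free free prize"]
vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 2) for v in row])
```

In the first message, "money" gets a larger weight than "free" or "now" because it occurs in fewer documents, which is exactly the effect the IDF factor is meant to have.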

In [1]:
import os
import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.lancaster import LancasterStemmer

from sklearn.feature_extraction.text import TfidfVectorizer

import pandas as pd

In [2]:
data_dir = 'data'
target_names = ['notspam', 'spam']

In [3]:
# These words won't add anything to the model, so they may be removed.
stopwords_en = stopwords.words('english')
stopwords_en

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 ...]

In [4]:
# Get all punctuation characters except $, % and &, because perhaps these
# ones (especially $) may provide good hints that a message is a spam.
punct = ''.join([x for x in string.punctuation if x not in '$%&'])
punct

'!"#\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [5]:
# Other punctuation characters are going to be erased.
punct_trans = str.maketrans(dict.fromkeys(punct))
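
As a quick check of what this translation table does, the following self-contained sketch rebuilds it and applies it to a made-up message. The sample string is hypothetical.

```python
import string

# Same construction as above: every punctuation character except $, % and &.
punct = ''.join(ch for ch in string.punctuation if ch not in '$%&')

# A key mapped to None in a translation table is deleted by str.translate.
punct_trans = str.maketrans(dict.fromkeys(punct))

sample = "WIN $1,000 (100% free!!!) -- R&D approved."
cleaned = sample.translate(punct_trans)
print(cleaned)
```

The characters `$`, `%` and `&` survive, while commas, parentheses, exclamation marks, hyphens and the final period are removed. Whitespace is left untouched, so deleted punctuation between two spaces leaves a double space behind; the tokenizer takes care of that later.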

In [6]:
# There are a bunch of stemming algorithms out there.
# I'll stick to Lancaster's one.
stemmer = LancasterStemmer()

In [7]:
# This dictionary will organize all cleaned data to make things easier later.
df_dict = {
    'train': {'target': [], 'text': []},
    'test': {'target': [], 'text': []},
}

In [8]:
def clean(text):
    """Apply some basic steps for text cleaning, by tokenizing words,
    removing punctuation, removing stopwords, and doing word stemming.
    """
    # Punctuation removal.
    text = text.translate(punct_trans)

    # Word tokenization + stopwords removal. 
    words = word_tokenize(text, language='english')
    words = [x for x in words if x not in stopwords_en]

    # Word stemming.
    stems = [stemmer.stem(x) for x in words]
    return ' '.join(stems)

In [9]:
# The training and test data are spread over multiple files and directories.
# File contents are encoded with latin-1 encoding.
for dataset in os.listdir(data_dir):
    dataset_dir = os.path.join(data_dir, dataset)

    for target_name in os.listdir(dataset_dir):
        target_dir = os.path.join(dataset_dir, target_name)

        for filename in os.listdir(target_dir):
            with open(os.path.join(target_dir, filename), encoding='latin-1') as file:                
                text = clean(file.read())
                target = target_names.index(target_name)

                df_dict[dataset]['text'].append(text)
                df_dict[dataset]['target'].append(target)

In [10]:
# Now it's easy to create DataFrames that put data together in a good shape.
df_train = pd.DataFrame(df_dict['train'])
df_test = pd.DataFrame(df_dict['test'])

df_train.head()

   target                                               text
0       1  from yourhealthhottmailcom fri sep 20 114132 2...
1       1  from ihjkhangel92470yahoocom mon jun 24 174856...
2       1  from nonenonecom mon sep 2 162801 2002 returnp...
3       1  from forkadminxentcom wed jul 3 120744 2002 re...
4       1  from tim3h85kg9bigfootcom wed dec 4 115840 200...


In [11]:
train_size = df_train.shape[0]
test_size = df_test.shape[0]

train_size, test_size

(2646, 2554)

In [12]:
# Finally, the TF-IDF feature encoding.
# We codify only the 1000 most frequent terms (at most) as features.
tfidf_vect = TfidfVectorizer(encoding='latin-1', max_features=1000)

In [13]:
# Learn the features and transform the training data.
X_train = tfidf_vect.fit_transform(df_train['text'])
X_train.shape

(2646, 1000)

In [14]:
# Get the known outputs.
# Series.data was removed in pandas 1.0; .values returns the underlying NumPy array.
y_train = df_train['target'].values
y_train.shape

(2646,)

## 3 Model training

After preprocessing the data, the Naive Bayes algorithm can finally be used to learn the predictive models.

What distinguishes Multinomial Naive Bayes from Bernoulli Naive Bayes is basically the way each algorithm takes the feature words into account in its computations: the former weighs each feature by its value (here, the TF-IDF weight, which grows with the term's frequency in the message), while the latter only checks whether a feature is present in the input text or not.
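
The difference can be illustrated with a small pure-Python sketch of the two likelihood terms for a single class. All probabilities below are hypothetical numbers chosen for illustration, the function names are my own, and this is a simplified view of the two event models rather than the scikit-learn implementation.

```python
import math

# Hypothetical per-class parameters for a 3-word vocabulary (illustrative only).
vocab = ["free", "meeting", "money"]
# Multinomial: probability of drawing each word at one word position (sums to 1).
p_word_given_spam = [0.6, 0.1, 0.3]
# Bernoulli: probability that each word appears at least once in a message.
p_present_given_spam = [0.8, 0.2, 0.5]

def multinomial_log_likelihood(counts, word_probs):
    """Every occurrence counts: sum_i count_i * log p_i (constant term dropped)."""
    return sum(c * math.log(p) for c, p in zip(counts, word_probs))

def bernoulli_log_likelihood(counts, presence_probs):
    """Only presence/absence matters; absent words contribute log(1 - p_i)."""
    return sum(
        math.log(p) if c > 0 else math.log(1.0 - p)
        for c, p in zip(counts, presence_probs)
    )

once = [1, 0, 0]    # "free" appears once
thrice = [3, 0, 0]  # "free" appears three times

print(multinomial_log_likelihood(once, p_word_given_spam))
print(multinomial_log_likelihood(thrice, p_word_given_spam))
print(bernoulli_log_likelihood(once, p_present_given_spam))
print(bernoulli_log_likelihood(thrice, p_present_given_spam))
```

Repeating "free" three times triples the multinomial term, whereas the Bernoulli term is identical for both messages. Note also that the Bernoulli model explicitly penalizes the absence of words, which the multinomial model does not.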

Thanks to Scikit-learn (and the fact that we don't need to worry about data sampling, cross-validation and parameter tuning in this project), training the classifiers (predictive models) is extremely easy.

In [15]:
import numpy as np

from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB

In [16]:
mnb_model = MultinomialNB()
mnb_model.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [17]:
bnb_model = BernoulliNB()
bnb_model.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

## 4 Model evaluation

Now both models can be evaluated on the test set, which contains only e-mail messages that are completely unknown to the models.

There are plenty of ways to evaluate the results of a binary classifier. Here we use the confusion matrix, the precision and recall scores, the F1-score and the area under the ROC curve (AUC). Both models are evaluated on the training set and on the test set (even though only the latter really matters).

In [18]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

In [19]:
# Transform the test data to the SAME space of features learned in the training step.
# This is crucial!
X_test = tfidf_vect.transform(df_test['text'])
X_test.shape

(2554, 1000)
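
To see why reusing the fitted vectorizer matters, here is a minimal pure-Python sketch of the fit/transform split using plain term counts. The helper names `fit_vocabulary` and `transform` and the toy messages are hypothetical; the point is only that the column layout is fixed at training time and that terms unseen during training are dropped.

```python
def fit_vocabulary(train_docs):
    """Map each training term to a fixed column index."""
    terms = sorted({t for doc in train_docs for t in doc.split()})
    return {term: i for i, term in enumerate(terms)}

def transform(docs, vocabulary):
    """Count terms per document using ONLY the columns of the fitted vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for term in doc.split():
            if term in vocabulary:  # terms unseen in training are dropped
                row[vocabulary[term]] += 1
        rows.append(row)
    return rows

# Hypothetical toy data.
train_docs = ["free money", "meeting today"]
test_docs = ["free lottery money money"]

vocabulary = fit_vocabulary(train_docs)
X_test_toy = transform(test_docs, vocabulary)
print(vocabulary)
print(X_test_toy)
```

Had the vocabulary been refitted on the test messages, the columns would refer to different words and the trained model's parameters would no longer line up with the features.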

In [20]:
# Get the expected results.
# Series.data was removed in pandas 1.0; .values returns the underlying NumPy array.
y_test = df_test['target'].values
y_test.shape

(2554,)

In [21]:
# Just some labels for the confusion matrices.
index = [np.array(['actual', 'actual']), np.array(['0', '1'])]
columns = [np.array(['predicted', 'predicted']), np.array(['0', '1'])]

### 4.1 MultinomialNB

In [22]:
mnb_pred_train = mnb_model.predict(X_train)
mnb_pred_test = mnb_model.predict(X_test)

In [23]:
# Evaluation in the training data set.
print(pd.DataFrame(confusion_matrix(y_train, mnb_pred_train), index=index, columns=columns))
print()
print(classification_report(y_train, mnb_pred_train))
print('AUC = %.3f' % roc_auc_score(y_train, mnb_pred_train))

         predicted      
                 0     1
actual 0      1343    89
       1        45  1169

             precision    recall  f1-score   support

          0       0.97      0.94      0.95      1432
          1       0.93      0.96      0.95      1214

avg / total       0.95      0.95      0.95      2646

AUC = 0.950


In [24]:
# Evaluation in the test data set.
print(pd.DataFrame(confusion_matrix(y_test, mnb_pred_test), index=index, columns=columns))
print()
print(classification_report(y_test, mnb_pred_test))
print('AUC = %.3f' % roc_auc_score(y_test, mnb_pred_test))

         predicted      
                 0     1
actual 0      1263   106
       1        65  1120

             precision    recall  f1-score   support

          0       0.95      0.92      0.94      1369
          1       0.91      0.95      0.93      1185

avg / total       0.93      0.93      0.93      2554

AUC = 0.934
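
The spam-class figures in this report can be reproduced by hand from the confusion matrix, as the short sketch below shows. It also makes explicit that, because `roc_auc_score` is given hard 0/1 predictions here rather than probabilities, the reported AUC is the mean of the two per-class recalls. The variable names are my own.

```python
# Test-set confusion matrix of the multinomial model, copied from the output above.
tn, fp = 1263, 106   # actual notspam: predicted notspam / predicted spam
fn, tp = 65, 1120    # actual spam:    predicted notspam / predicted spam

precision = tp / (tp + fp)
recall = tp / (tp + fn)          # true positive rate
specificity = tn / (tn + fp)     # true negative rate
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions the ROC curve has a single interior point, so the
# area under it reduces to the mean of the two per-class recalls.
auc_from_labels = (recall + specificity) / 2

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"AUC={auc_from_labels:.3f}")
```

Passing the spam-class probabilities from `predict_proba` to `roc_auc_score` instead would evaluate the ranking quality over all thresholds and would generally give a different number.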


### 4.2 BernoulliNB

In [25]:
bnb_pred_train = bnb_model.predict(X_train)
bnb_pred_test = bnb_model.predict(X_test)

In [26]:
# Evaluation in the training data set.
print(pd.DataFrame(confusion_matrix(y_train, bnb_pred_train), index=index, columns=columns))
print()
print(classification_report(y_train, bnb_pred_train))
print('AUC = %.3f' % roc_auc_score(y_train, bnb_pred_train))

         predicted      
                 0     1
actual 0      1310   122
       1        57  1157

             precision    recall  f1-score   support

          0       0.96      0.91      0.94      1432
          1       0.90      0.95      0.93      1214

avg / total       0.93      0.93      0.93      2646

AUC = 0.934


In [27]:
# Evaluation in the test data set.
print(pd.DataFrame(confusion_matrix(y_test, bnb_pred_test), index=index, columns=columns))
print()
print(classification_report(y_test, bnb_pred_test))
print('AUC = %.3f' % roc_auc_score(y_test, bnb_pred_test))

         predicted      
                 0     1
actual 0      1234   135
       1        69  1116

             precision    recall  f1-score   support

          0       0.95      0.90      0.92      1369
          1       0.89      0.94      0.92      1185

avg / total       0.92      0.92      0.92      2554

AUC = 0.922


## 5 Conclusions

A classifier that performs well only on the training data set would be useless. Fortunately, that is not the case here. Even though we used a simple approach that represents texts as a mere set of independent words (something that is definitely not true in the real world!), both models achieved great results. On the test set, the averaged precision, recall and F1-score and the AUC value were all above 0.90, and the lowest per-class score was the 0.89 spam precision of the Bernoulli model. This shows how effective the Naive Bayes algorithm (and its variations) can be for building text-based predictive models.

Moreover, the Multinomial Naive Bayes model is slightly better than the Bernoulli one (test AUC of 0.934 versus 0.922), and most of the mistakes made by both models were false positives (i.e., messages that were not spam but were classified as spam). Had we set specific values for the class priors (assuming, for example, that spam is a priori less likely than non-spam), then PERHAPS these results could be improved even further (they would if the assumption were correct).
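
The effect of the class prior can be sketched with Bayes' rule for two classes. The likelihood values and the 20% prior below are hypothetical numbers chosen for illustration; only the empirical prior is derived from the training-set counts given at the top of this document.

```python
def posterior_spam(likelihood_spam, likelihood_notspam, prior_spam):
    """Bayes' rule for two classes: P(spam | message)."""
    joint_spam = likelihood_spam * prior_spam
    joint_notspam = likelihood_notspam * (1.0 - prior_spam)
    return joint_spam / (joint_spam + joint_notspam)

# Hypothetical likelihoods for a borderline message (illustrative numbers only).
lik_spam, lik_notspam = 0.012, 0.010

# Prior estimated from the training set: 1214 spam out of 2646 messages.
empirical_prior = 1214 / 2646
# Hypothetical assumption that only 20% of incoming mail is spam.
assumed_prior = 0.20

for prior in (empirical_prior, assumed_prior):
    p = posterior_spam(lik_spam, lik_notspam, prior)
    label = "spam" if p > 0.5 else "notspam"
    print(f"prior={prior:.2f} -> P(spam|message)={p:.2f} -> {label}")
```

With the empirical prior this borderline message is labeled spam, while under the lower assumed prior it is labeled notspam, which is how a smaller spam prior would trade false positives for false negatives. In scikit-learn, both `MultinomialNB` and `BernoulliNB` expose this through the `class_prior` parameter shown in the model printouts above.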

To finish, here are some additional usage examples.

In [28]:
messages = [
    "Hey, dude, what's up? It would be nice to hear from you!",
    "Buy our new exclusive product for only $200.",
    "Congratulations, you just won a free ticket to Hell! Call 666 to claim it.",
    "Fishes really stink, scientists say.",
]

features = tfidf_vect.transform([clean(x) for x in messages])

In [29]:
mnb_model.predict(features)


array([0, 1, 1, 0])

In [30]:
bnb_model.predict(features)

array([1, 1, 1, 1])

The Multinomial Naive Bayes model classified all of them correctly, but the Bernoulli Naive Bayes model mistakenly took the 1st and 4th messages as spam. It looks like just checking the presence/absence of words is not as good an approach as computing the word frequencies (at least for this particular example). That said, the Bernoulli Naive Bayes model did perform very well on the real e-mails.