### Spam Email Classification

In this notebook, I build a simple feed-forward neural network with Keras to classify spam emails from this [dataset](https://spamassassin.apache.org/old/publiccorpus/). The NLP preprocessing is inspired by this [notebook](https://github.com/ageron/handson-ml2/blob/master/03_classification.ipynb), which I tweaked to process and transform the data. I also use Scikit-learn pipelines so that data cleaning, data transformation, and modelling are applied consistently to the training and test sets.

Cross-validation score on training data: **98.35%**

Final accuracy score on unseen test data: **99.20%**

In [1]:
import email
import email.parser  # imported explicitly because parse_email uses email.parser.BytesParser
import email.policy
import re
import urlextract
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from html import unescape
from nltk import PorterStemmer
from nltk.corpus import stopwords
from sklearn.base import BaseEstimator, TransformerMixin
from os import path, listdir
from collections import Counter
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from keras.wrappers.scikit_learn import KerasClassifier
from keras.layers import Dense
from keras import Sequential

Using TensorFlow backend.


#### Data Extraction

Reading the raw email files and categorizing them as spam or ham (non-spam) emails

In [2]:
PATH = "./dataset"

def categorize_files(root_path):
    """
    Categorizes files in the directories under root_path into spam and ham files
    """
    
    spam_files, ham_files = [], []
    for dir_ in listdir(root_path):
        dir_path = path.join(root_path, dir_)
        # Directories with 'spam' in their name hold spam; all others hold ham
        target = spam_files if 'spam' in dir_ else ham_files
        for file in listdir(dir_path):
            target.append(path.join(dir_path, file))
                
    return spam_files, ham_files

spam_files, ham_files = categorize_files(PATH)

In [3]:
def parse_email(file):
    """
    Converts the email text into objects using Python's email module
    """
    
    with open(file, 'rb') as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)
    
spam_emails = [parse_email(file) for file in spam_files]
ham_emails = [parse_email(file) for file in ham_files]
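
To see what the parser returns without touching the dataset, here is a minimal sketch that parses a small message from memory. The raw message is invented for illustration and is not part of the corpus; `parsebytes` is the in-memory counterpart of the `parse` call used above.

```python
from email import policy
from email.parser import BytesParser

# A hypothetical raw message, made up for this example (not from the corpus)
raw = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Hello\r\n"
    b"Content-Type: text/plain; charset=us-ascii\r\n"
    b"\r\n"
    b"Just checking in.\r\n"
)

msg = BytesParser(policy=policy.default).parsebytes(raw)

subject = msg["Subject"]               # headers are accessed like dictionary keys
content_type = msg.get_content_type()  # MIME type of the message
body = msg.get_content().strip()       # decoded text of a non-multipart message

print(subject, content_type, body, sep=" | ")
```

The same three accessors (header lookup, `get_content_type`, `get_content`) are the ones the rest of the notebook relies on.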

**Peeking at the extracted data**

In [4]:
print(spam_emails[1000].get_content().strip())

FUTURE TECH INTERNATIONAL

SPECIAL OFFER that "BLOWS AWAY TRADITIONAL MARKETING" - Advertising Age.

The most powerful fully exportable CD Fax Database on the market
1,500,000 Business Fax Numbers

OVER 1.5 MILLION FAX NUMBERS FULLY EXPORTABLE!
Usually Sells for $295. For a limited time only we are offering them for:

ONLY $49.95 USD
(Never before have this many fax numbers been sold for so cheap)

TARGETED EMAIL LIST
100 MILLION EMAIL ADDRESSES
Usually sells for $195.00. For a limited time only we are offering them for:

ONLY $79.95 USD

SPECIAL PACKAGE DEAL FOR BOTH DIRECTORIES:

ONLY $99.95 USD

MORE THAN 34 CATEGORIES SUCH AS:
-Multi level marketers
-Opportunity Seekers
-Telephone Area Code
-Country, City, State, etc...
-Travel & Vacations
-Opt-in
-People intersted in investments
-People or businesses who spent more than $1000 on the web in the last 2 months
-AND MANY MORE

*Everything o n this disk is in TEXT file format and fully Exportable.
*The CD is as easy to use as browsing 

In [5]:
print(ham_emails[1].get_content().strip())

Hiya, I always seem to get errors when I do an "apt update", is this a 
problem on the repository itself, or on my end, or possibly a timeout in 
the connection due to my connection being a crappy modem?

[root@spawn root]# apt-get update
Hit http://apt.nixia.no redhat/7.2/i386/base/pkglist.gnomehide
Hit http://apt.freshrpms.net redhat/7.2/i386/base/pkglist.os
Ign http://apt.freshrpms.net redhat/7.2/i386 release.os
Err http://apt.freshrpms.net redhat/7.2/i386/base/pkglist.updates
  Bad header line
Hit http://apt.freshrpms.net redhat/7.2/i386 release.updates
Err http://apt.freshrpms.net redhat/7.2/i386/base/pkglist.freshrpms
  400 Bad Request
Err http://apt.freshrpms.net redhat/7.2/i386 release.freshrpms
  Bad header line
Hit http://apt.freshrpms.net redhat/7.2/i386/base/srclist.freshrpms
Ign http://apt.nixia.no redhat/7.2/i386 release.gnomehide
Ign http://apt.nixia.no redhat/7.2/i386/base/mirrors
Hit http://apt.freshrpms.net redhat/7.2/i386 release.freshrpms
Ign http://apt.freshrpms.ne

#### Exploratory Data Analysis

Checking the distribution of spam and ham emails

In [6]:
print(f"Number of Spam emails: {len(spam_emails)}")
print(f"Number of Ham emails: {len(ham_emails)}")

Number of Spam emails: 1898
Number of Ham emails: 2501


Finding the different types of email structures in the dataset

In [7]:
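def get_email_structure(email):
    if isinstance(email, str):
        return email
    payload = email.get_payload()
    # NOTE: this tests `email` rather than `payload`, so the multipart branch
    # never runs and every message is reported by its top-level content type
    if isinstance(email, list):
        return f"""multipart({', '.join(get_email_structure(sub_email) 
                                        for sub_email in payload)})"""
    else:
        return email.get_content_type()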
    
def structure_counter(emails):
    counter = Counter()
    for email in emails:
        structure = get_email_structure(email)
        counter[structure] += 1
        
    return counter

In [8]:
print(structure_counter(spam_emails).most_common())

[('text/plain', 817), ('text/html', 772), ('multipart/alternative', 169), ('multipart/mixed', 99), ('multipart/related', 40), ('text/plain charset=us-ascii', 1)]


In [9]:
print(structure_counter(ham_emails).most_common())

[('text/plain', 2409), ('multipart/signed', 68), ('multipart/mixed', 10), ('multipart/alternative', 9), ('multipart/related', 3), ('multipart/report', 2)]
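
The counts above report multipart messages only by their top-level content type (for example `multipart/alternative`). A variant that inspects the payload instead can also describe the nested parts. The sketch below is one possible way to do that, not the function used for the counts above; `describe_structure` is a hypothetical name and the message is built by hand.

```python
from email.message import EmailMessage

def describe_structure(message):
    """Hypothetical helper: describe a message, recursing into multipart payloads."""
    payload = message.get_payload()
    if isinstance(payload, list):  # multipart: the payload is a list of sub-messages
        inner = ", ".join(describe_structure(part) for part in payload)
        return f"multipart({inner})"
    return message.get_content_type()

# Hand-built message with a plain-text body and an HTML alternative
msg = EmailMessage()
msg.set_content("Hello in plain text")
msg.add_alternative("<p>Hello in HTML</p>", subtype="html")

structure = describe_structure(msg)
print(structure)  # multipart(text/plain, text/html)
```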


Checking the different headers present in an email

In [10]:
for header, value in ham_emails[1].items():
    print(f"{header}: {value}\n")

Return-Path: <rpm-zzzlist-admin@freshrpms.net>

Delivered-To: yyyy@localhost.spamassassin.taint.org

Received: from localhost (jalapeno [127.0.0.1])	by jmason.org (Postfix) with ESMTP id 9D98A16EFC	for <jm@localhost>; Mon,  9 Sep 2002 18:00:20 +0100 (IST)

Received: from jalapeno [127.0.0.1]	by localhost with IMAP (fetchmail-5.9.0)	for jm@localhost (single-drop); Mon, 09 Sep 2002 18:00:20 +0100 (IST)

Received: from auth02.nl.egwn.net (auth02.nl.egwn.net [193.172.5.4]) by    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g11MGS812405 for    <jm-rpm@jmason.org>; Fri, 1 Feb 2002 22:16:28 GMT

Received: from auth02.nl.egwn.net (localhost [127.0.0.1]) by    auth02.nl.egwn.net (8.11.6/8.11.6/EGWN) with ESMTP id g11MF0308879;    Fri, 1 Feb 2002 23:15:00 +0100

Received: from drone4.qsi.net.nz (drone4-svc-skyt.qsi.net.nz    [202.89.128.4]) by auth02.nl.egwn.net (8.11.6/8.11.6/EGWN) with SMTP id    g11MEh308869 for <rpm-list@freshrpms.net>; Fri, 1 Feb 2002 23:14:43 +0100

Received: (qmail 9

Store the different types of headers present in both spam and non-spam emails as a set for later use

In [11]:
HEADERS = set()

for spam_email in spam_emails:
    for header in spam_email.keys():
        HEADERS.add(header)
        
for ham_email in ham_emails:
    for header in ham_email.keys():
        HEADERS.add(header)

Splitting the data into train and test set using **StratifiedShuffleSplit**

In [12]:
X = np.r_[spam_emails, ham_emails]
y = np.r_[np.ones(len(spam_emails)), np.zeros(len(ham_emails))]

stratified_splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in stratified_splitter.split(X, y):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
    
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (3519,)
X_test shape: (880,)
y_train shape: (3519,)
y_test shape: (880,)
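
To illustrate what stratification means here, the following pure-Python sketch samples the same share from every class so that both subsets keep the original spam/ham ratio. It is a simplified illustration on toy labels, not how scikit-learn implements `StratifiedShuffleSplit`; `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter

def stratified_split(labels, test_size=0.2, seed=42):
    """Hypothetical helper: return (train_idx, test_idx), sampling per class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same share taken from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: 40 spam (1) and 60 ham (0)
labels = [1] * 40 + [0] * 60
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both subsets keep the 40:60 ratio of the toy data, which is the property the real split preserves for the spam and ham emails.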


**Data Cleaning**

Removing the HTML tags from the emails and converting them to plain text

In [13]:
def html_to_plain_text(email):
    # Raw strings keep regex escapes such as \s from being read as string escapes
    text = re.sub(r'<head.*?>.*?</head>', '', email, flags=re.M | re.S | re.I) # drop the <head> section
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I) # mark the hyperlinks
    text = re.sub(r'<.*?>', '', text) # remove the remaining tags
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S) # collapse blank lines
    
    return unescape(text) # convert HTML entities back to characters

def email_to_text(email):
    for part in email.walk(): # to walk through the parts of an email
        c_type = part.get_content_type()
        
        if c_type not in ['text/plain', 'text/html']: # skip if the part is not text
            continue
        try:
            content = part.get_content() # get the content of the part
        except Exception: # in case of encoding issues
            content = str(part.get_payload())
        if c_type == 'text/plain':
            return content # return the content in case of plain text
        else:
            return html_to_plain_text(content) # otherwise return after removing HTML tags
    
    return "no content" # if the email doesn't contain any text content
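
As a quick check of the HTML clean-up, the sketch below applies the same four substitutions to a small HTML snippet. The snippet is invented for illustration, and the function is repeated here only so that the example runs on its own.

```python
import re
from html import unescape

def html_to_plain_text(html_text):
    text = re.sub(r'<head.*?>.*?</head>', '', html_text, flags=re.M | re.S | re.I)
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)

# Hypothetical HTML email body
sample = (
    "<html><head><title>Offer</title></head>\n"
    "<body><p>Save &amp; win!</p>\n"
    "<a href=\"http://example.com\">Click here</a></body></html>"
)

result = html_to_plain_text(sample)
print(repr(result))
```

The `<head>` section disappears, the anchor tag becomes the token `HYPERLINK`, and the entity `&amp;` is decoded to `&`.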

Regex patterns and other constants to process the data

In [14]:
HEADERS_PATTERN = re.compile(f"({'|'.join(HEADERS)})" + r"\s*:\s*.*")
PUNCTUATIONS_PATTERN = re.compile(r"[^a-zA-Z0-9\s]")
NUMBERS_PATTERN = re.compile(r"\d+(?:\.\d+)?(?:[eE]-?\d+)?")
STOPWORDS = set(stopwords.words('english'))
URL_PATTERN = re.compile((r"(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}"
                          r"|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}"
                          r"|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}"
                          r"|www\.[a-zA-Z0-9]+\.[^\s]{2,}"
                          r"|[a-zA-Z0-9]+\.[a-zA-Z]{2,})"))
UNWANTED_WORDS = re.compile(r"(html|doctype|head|xxx|mime|transitionalen|http)")
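
The order in which these patterns are applied matters. The sketch below runs a made-up sentence through lowercasing, punctuation removal, and number replacement, in the same order the transformer defined later in this notebook uses. Because punctuation is removed first, a price such as `$49.95` collapses to `4995` and becomes a single `NUMBER` token.

```python
import re

PUNCTUATIONS_PATTERN = re.compile(r"[^a-zA-Z0-9\s]")
NUMBERS_PATTERN = re.compile(r"\d+(?:\.\d+)?(?:[eE]-?\d+)?")

# Hypothetical sentence, loosely modelled on the spam sample shown earlier
text = "ONLY $49.95 USD for 1,500,000 numbers!"

text = text.lower()
text = PUNCTUATIONS_PATTERN.sub("", text)     # "$49.95" -> "4995", "1,500,000" -> "1500000"
text = NUMBERS_PATTERN.sub(" NUMBER ", text)  # each digit run becomes one token
tokens = text.split()

print(tokens)  # ['only', 'NUMBER', 'usd', 'for', 'NUMBER', 'numbers']
```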

Creating a custom data transformer to clean and process the contents of the emails

In [15]:
class EmailContentTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, strip_headers=True, to_lower=True, remove_punctuation=True,
                 replace_URLs=True, replace_num=True, stem_words=True,
                 remove_stopwords=True):
        self.strip_headers = strip_headers
        self.to_lower = to_lower
        self.remove_punctuation = remove_punctuation
        self.replace_URLs = replace_URLs
        self.replace_num = replace_num
        self.stem_words = stem_words
        self.remove_stopwords = remove_stopwords
        if self.stem_words:
            self.stemmer = PorterStemmer()
    
    def get_stemmed_words(self, email_content):
        """
        Returns the content with words stemmed
        """
        stemmed_words = ""
        for word in email_content.split():
            if len(word) <= 20:
                if self.remove_stopwords:
                    if word not in STOPWORDS:
                        stemmed_words += f"{self.stemmer.stem(word)} "
                else:
                    stemmed_words += f"{self.stemmer.stem(word)} "
        
        return stemmed_words
    
    def fit(self, X=None, y=None):
        return self
    
    def transform(self, X, y=None):
        """
        Returns an array of strings with transformed emails
        """
        X_transformed = []
        for email in X: # loops through every email in the dataset
            email_content = email_to_text(email) # convert the email object to plain text
            
            if self.replace_URLs: # replaces the URLs with word URL
                email_content = URL_PATTERN.sub(" URL ", email_content)
            if self.strip_headers: # removes the headers from the text
                email_content = HEADERS_PATTERN.sub("", email_content)
            if self.to_lower: # converts the content to lower case
                email_content = email_content.lower()
            if self.remove_punctuation: # removes the punctuations
                email_content = PUNCTUATIONS_PATTERN.sub("", email_content)
            if self.replace_num: # replaces the numbers with word NUMBER
                email_content = NUMBERS_PATTERN.sub(" NUMBER ", email_content)
            email_content = UNWANTED_WORDS.sub(" ", email_content) # removes the unwanted words
            if self.stem_words: # stems the words in the email content
                email_content = self.get_stemmed_words(email_content)
                
            X_transformed.append(email_content)
            
        return np.array(X_transformed)

Creating a custom pipeline for data preparation and transformation

In [16]:
data_prep_pipeline = Pipeline([
    ('email_content_transformer', EmailContentTransformer()), # cleans the data
    ('tfidf_transformer', TfidfVectorizer()) # transforms the cleaned content into tf-idf vectors
])
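
For intuition about what the tf-idf step produces, here is a pure-Python sketch of the weighting. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation). Splitting on whitespace is a simplification of the real tokenizer, and the three documents are invented.

```python
import math
from collections import Counter

docs = ["free offer free", "meeting offer", "meeting notes"]  # toy corpus
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(word for tokens in tokenized for word in set(tokens))
# Smoothed inverse document frequency
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

def tfidf_vector(tokens):
    """Return the L2-normalised tf-idf vector of one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

vectors = [tfidf_vector(tokens) for tokens in tokenized]
print(vocabulary)
print([round(value, 3) for value in vectors[0]])
```

Each email becomes one vector with as many entries as there are vocabulary terms, which is why the network's input size depends on the fitted vocabulary.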

#### Modelling

Function to build the feed-forward neural network model

In [17]:
def build_model():
    """
    Builds and returns the model
    """
    # Number of input features = size of the fitted tf-idf vocabulary; it is read
    # dynamically because it varies during cross validation and final model training
    input_shape = len(data_prep_pipeline['tfidf_transformer'].vocabulary_)
    model = Sequential()
    # Adding layers to the model
    model.add(Dense(units=30, activation='relu',
                    input_shape=(input_shape,)))
    model.add(Dense(units=20, activation='relu'))
    model.add(Dense(units=1, activation='sigmoid'))
    
    # Specifying the loss function and optimizer
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    
    return model
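
Because the first layer is connected to every tf-idf feature, almost all trainable parameters sit in that layer. The sketch below counts the weights and biases for the 30-20-1 architecture above. The vocabulary size of 10,000 is a hypothetical value chosen for the arithmetic, not the size measured on this dataset.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

def total_params(n_features, layer_units=(30, 20, 1)):
    """Sum the parameters of consecutive Dense layers."""
    total, n_inputs = 0, n_features
    for units in layer_units:
        total += dense_params(n_inputs, units)
        n_inputs = units  # the output of one layer feeds the next
    return total

# 10_000 is a hypothetical vocabulary size, not the one from this dataset
n_params = total_params(10_000)
print(n_params)  # 300030 + 620 + 21 = 300671
```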

Cross-validating the model with 5 folds to estimate how it performs on unseen data

In [18]:
# Using StratifiedKFold so that every fold keeps the spam/ham class proportions
stratified_folds = StratifiedKFold(n_splits=5, shuffle=True,
                                   random_state=42)
scores = []
for fold, (train_idx, val_idx) in enumerate(stratified_folds.split(X_train, y_train)):
    X_train_, y_train_ = X_train[train_idx], y_train[train_idx]
    X_test_, y_test_ = X_train[val_idx], y_train[val_idx]
    
    # Creating a model pipeline to train and predict data along with data preparation
    model_pipeline = Pipeline([
        ('data_prep_pipeline', data_prep_pipeline),
        ('model', KerasClassifier(build_model, epochs=3, batch_size=16))
    ])
    print(f"Fold {fold + 1}", end="\n\n")
    model_pipeline.fit(X_train_, y_train_)
    predictions = model_pipeline.predict(X_test_)
    score = accuracy_score(y_test_, predictions)
    scores.append(score)
    print(f"\nScore for Fold {fold + 1}: {score:.4f}",
          end="\n\n")

print(f"CV score: {np.mean(scores):.4f}")

Fold 1

Epoch 1/3
Epoch 2/3
Epoch 3/3

Score for Fold 1: 0.9872

Fold 2

Epoch 1/3
Epoch 2/3
Epoch 3/3

Score for Fold 2: 0.9844

Fold 3

Epoch 1/3
Epoch 2/3
Epoch 3/3

Score for Fold 3: 0.9886

Fold 4

Epoch 1/3
Epoch 2/3
Epoch 3/3

Score for Fold 4: 0.9872

Fold 5

Epoch 1/3
Epoch 2/3
Epoch 3/3

Score for Fold 5: 0.9701

CV score: 0.9835
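
The CV score is the plain average of the five fold scores. The short sketch below recomputes it from the rounded values printed above and also reports the spread across folds. The standard deviation is an extra statistic that the notebook itself does not print.

```python
from statistics import mean, stdev

# Fold scores copied from the output above (rounded to four decimals)
fold_scores = [0.9872, 0.9844, 0.9886, 0.9872, 0.9701]

cv_mean = mean(fold_scores)
cv_std = stdev(fold_scores)  # sample standard deviation across folds

print(f"CV score: {cv_mean:.4f} (std: {cv_std:.4f})")
```

Fold 5 sits noticeably below the others, which is what drives most of the spread.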


Fitting a new model to the entire training data and estimating the model's score on test data

In [19]:
model_pipeline = Pipeline([
    ('data_prep_pipeline', data_prep_pipeline),
    ('model', KerasClassifier(build_model, epochs=5, batch_size=16))
])

# Fitting a model to the data
model_pipeline.fit(X_train, y_train)
# Making predictions with the trained model
predictions = model_pipeline.predict(X_test)

print((f"\nAccuracy score on test data: "
       f"{accuracy_score(y_test, predictions):.4f}"))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

Accuracy score on test data: 0.9920
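
To put the test accuracy in absolute terms, the sketch below finds the number of correct predictions out of the 880 test emails that rounds to the reported 0.9920. It only uses the rounded figure and the test-set size shown earlier, so it is an inference from those two numbers rather than a value read from the model.

```python
n_test = 880         # size of the test set reported above
reported = "0.9920"  # accuracy as printed, rounded to four decimals

# Every count of correct predictions whose accuracy rounds to the reported value
candidates = [k for k in range(n_test + 1) if f"{k / n_test:.4f}" == reported]
n_correct = candidates[0]
n_errors = n_test - n_correct

print(candidates, n_errors)  # [873] 7
```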
