# Building a Spam classifier

To build this classifier we will use the m Apache SpamAssassin’s public datasets: https://spamassassin.apache.org/old/publiccorpus/


## 1. Acessing the data

The data is available in tar files via the web. Each tar file contain many text files with each text file representing an email. We will download the tar files using the urllib and save them into the datasets folder.

In [3]:
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
HAM_URL = DOWNLOAD_ROOT + "20030228_easy_ham.tar.bz2"
SPAM_URL = DOWNLOAD_ROOT + "20030228_spam.tar.bz2"
SPAM_PATH = os.path.join("datasets", "spam")

def fetch_spam_data(ham_url=HAM_URL, spam_url=SPAM_URL, spam_path=SPAM_PATH):
    if not os.path.isdir(SPAM_PATH):
        os.makedirs(SPAM_PATH)
    for filename, url in (("ham.tar.bz2", ham_url), ("spam.tar.bz2", spam_url)):
        path = os.path.join(spam_path, filename)
        if not os.path.isfile(path):
            urllib.request.urlretrieve(url, path)
        tar_bz2_file = tarfile.open(path)
        tar_bz2_file.extractall(path=spam_path)
        tar_bz2_file.close()


In [4]:
fetch_spam_data()

In [5]:
HAM_DIR = os.path.join(SPAM_PATH, "easy_ham")
SPAM_DIR = os.path.join(SPAM_PATH, "spam")
ham_filenames = [name for name in sorted(os.listdir(HAM_DIR)) if len(name) > 20]
spam_filenames = [name for name in sorted(os.listdir(SPAM_DIR)) if len(name) > 20]

In [6]:
len(ham_filenames)

2500

In [7]:
len(spam_filenames)

500

## 2. Parsing the data

We can use python's email modules to parse these emails.

In [8]:
import email
import email.parser
import email.policy

def load_email(is_spam, filename, spam_path=SPAM_PATH):
    directory = "spam" if is_spam else "easy_ham"
    with open(os.path.join(spam_path, directory, filename), "rb") as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)

In [9]:
ham_emails = [load_email(is_spam=False, filename=name) for name in ham_filenames]
spam_emails = [load_email(is_spam=True, filename=name) for name in spam_filenames]

In [10]:
print(ham_emails[1].get_content().strip())

Martin A posted:
Tassos Papadopoulos, the Greek sculptor behind the plan, judged that the
 limestone of Mount Kerdylio, 70 miles east of Salonika and not far from the
 Mount Athos monastic community, was ideal for the patriotic sculpture. 
 
 As well as Alexander's granite features, 240 ft high and 170 ft wide, a
 museum, a restored amphitheatre and car park for admiring crowds are
planned
---------------------
So is this mountain limestone or granite?
If it's limestone, it'll weather pretty fast.

------------------------ Yahoo! Groups Sponsor ---------------------~-->
4 DVDs Free +s&p Join Now
http://us.click.yahoo.com/pt6YBB/NXiEAA/mG3HAA/7gSolB/TM
---------------------------------------------------------------------~->

To unsubscribe from this group, send an email to:
forteana-unsubscribe@egroups.com

 

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/


In [11]:
print(spam_emails[1].get_content().strip())


1) Fight The Risk of Cancer!
http://www.adclick.ws/p.cfm?o=315&s=pk007

2) Slim Down - Guaranteed to lose 10-12 lbs in 30 days
http://www.adclick.ws/p.cfm?o=249&s=pk007

3) Get the Child Support You Deserve - Free Legal Advice
http://www.adclick.ws/p.cfm?o=245&s=pk002

4) Join the Web's Fastest Growing Singles Community
http://www.adclick.ws/p.cfm?o=259&s=pk007

5) Start Your Private Photo Album Online!
http://www.adclick.ws/p.cfm?o=283&s=pk007

Have a Wonderful Day,
Offer Manager
PrizeMama













If you wish to leave this list please use the link below.
http://www.qves.com/trim/?ilug@linux.ie%7C17%7C114258


-- 
Irish Linux Users' Group: ilug@linux.ie
http://www.linux.ie/mailman/listinfo/ilug for (un)subscription information.
List maintainer: listmaster@linux.ie


In [12]:
print(spam_emails[7].get_content().strip())

<html>
<head>
<title>ReliaQuote - Save Up To 70% On Life Insurance</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body leftmargin="0" topmargin="0" link="#FFCC99" vlink="#FFCC99" alink="#FFCC00">
<table align="center" width="468" border="0" cellspacing="0" cellpadding="0" height="500" bgcolor="993366">
  <tr>
    <td align="left" valign="top" height="43" width="56%">
      <table width="100%" border="0" cellspacing="0" cellpadding="3">
        <tr>
          <td><a href="http://theadmanager.com/server/c.asp?ad_key=YUESBHWAKMLK&ext=1" target="_blank"><img src="http://www.reliaquote.com/banner/bannerads/images/logo6.gif" width="120" height="32" border="0"></a></td>
        </tr>
      </table>
    </td>
    <td align="left" valign="top" height="43" width="44%">&nbsp;</td>
  </tr>
  <tr>
    <td align="left" valign="top" width="56%" height="377">
      <table width="100%" border="0" cellspacing="0" cellpadding="0">
        <tr align="right" valig

Some emails are multi part and contain images, attachments (with their own attachments), html and other structures. We can use the email library to extract these structures.

In [13]:
def get_email_structure(email):
    if isinstance(email, str):
        return email
    payload = email.get_payload()
    if isinstance(payload, list):
        return "multipart({})".format(", ".join([
            get_email_structure(sub_email)
            for sub_email in payload
        ]))
    else:
        return email.get_content_type()

In [14]:
from collections import Counter

def structures_counter(emails):
    structures = Counter()
    for email in emails:
        structure = get_email_structure(email)
        structures[structure] += 1
    return structures


In [15]:
structures_counter(ham_emails).most_common()

[('text/plain', 2408),
 ('multipart(text/plain, application/pgp-signature)', 66),
 ('multipart(text/plain, text/html)', 8),
 ('multipart(text/plain, text/plain)', 4),
 ('multipart(text/plain)', 3),
 ('multipart(text/plain, application/octet-stream)', 2),
 ('multipart(text/plain, text/enriched)', 1),
 ('multipart(text/plain, application/ms-tnef, text/plain)', 1),
 ('multipart(multipart(text/plain, text/plain, text/plain), application/pgp-signature)',
  1),
 ('multipart(text/plain, video/mng)', 1),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(text/plain, application/x-pkcs7-signature)', 1),
 ('multipart(text/plain, multipart(text/plain, text/plain), text/rfc822-headers)',
  1),
 ('multipart(text/plain, multipart(text/plain, text/plain), multipart(multipart(text/plain, application/x-pkcs7-signature)))',
  1),
 ('multipart(text/plain, application/x-java-applet)', 1)]

In [16]:
structures_counter(spam_emails).most_common()

[('text/plain', 218),
 ('text/html', 183),
 ('multipart(text/plain, text/html)', 45),
 ('multipart(text/html)', 20),
 ('multipart(text/plain)', 19),
 ('multipart(multipart(text/html))', 5),
 ('multipart(text/plain, image/jpeg)', 3),
 ('multipart(text/html, application/octet-stream)', 2),
 ('multipart(text/plain, application/octet-stream)', 1),
 ('multipart(text/html, text/plain)', 1),
 ('multipart(multipart(text/html), application/octet-stream, image/jpeg)', 1),
 ('multipart(multipart(text/plain, text/html), image/gif)', 1),
 ('multipart/alternative', 1)]

We will want to parse the html into plain text. Lets do this using regex. We will remove the head of html entirely, replace all \<a\> tags with the word HYPERLINK and then remove all remaining html tags but retain their text content. 

In [5]:
import re
from html import unescape

def html_to_plain_text(html):
    text = re.sub("<head.*?>.*?</head>", "", html, flags=re.M | re.S | re.I)
    text = re.sub("<a\s.*?>", " HYPERLINK ", text, flags=re.M | re.S | re.I)
    text = re.sub("<.*?>", "", text, flags=re.M | re.S)
    text = re.sub(r"(\s*\n)+", "\n", text, flags=re.M | re.S)
    return unescape(text)

Okay now we can write a function that converts an email to plain text whatever the format is.

In [18]:
def email_to_text(email):
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        if not ctype in ('text/plain', 'text/html'):
            continue
        try:
            content = part.get_content()
        except:
            content = str(part.get_payload())
        if ctype == 'text/plain':
            return content
        else:
            html = content
        if html:
            return html_to_plain_text(html)

Now lets use the functionality we have built and some extra functionality provided by nltk and urlextract to build a tranformer class for our pipeline

In [19]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
from nltk import PorterStemmer
from urlextract import URLExtract
urlextractor = URLExtract()
stemmer = PorterStemmer()


class EmailToWordCounterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, strip_headers=True, lower_case=True, remove_punctuation=True,
                 replace_urls=True, replace_numbers=True, stemming=True):
        
        self.strip_headers = strip_headers
        self.lower_case = lower_case
        self.remove_punctuation = remove_punctuation
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        self.stemming = stemming

    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X_transformed = []
        for email in X:
            text = email_to_text(email) or ""
            
            if self.lower_case:
                text = text.lower()
            
            if self.replace_urls:
                urls = list(set(urlextractor.find_urls(text)))
                for url in urls:
                    text.replace(url, " URL ")
            
            if self.replace_numbers:
                text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text)
            
            if self.remove_punctuation:
                text = re.sub(r'\W+', ' ', text, flags=re.M)

            
            word_counts = Counter(text.split())
            if self.stemming:
                stemmed_word_counts = Counter()
                for word, count in word_counts.items():
                    stemmed_word = stemmer.stem(word)
                    stemmed_word_counts[stemmed_word] += count
                
                word_counts = stemmed_word_counts

            X_transformed.append(word_counts)
        return np.array(X_transformed)


In [20]:
#testing the transformer
some_word_counts = EmailToWordCounterTransformer().fit_transform(ham_emails[0:3])
some_word_counts

array([Counter({'number': 39, 'the': 15, 'pick': 9, 'lbrace': 6, 'rbrace': 6, 'i': 5, 'of': 5, 'list': 5, 'from': 4, 'com': 4, 'is': 4, 'sequenc': 4, 'hit': 4, 'thi': 3, 'inbox': 3, 'subject': 3, 'ftp': 3, 'mercuri': 3, 'command': 3, 'delta': 3, 'that': 3, 'version': 3, 'exmh': 3, 'worker': 3, 'date': 2, 'deepeddi': 2, 't': 2, 'error': 2, 'exec': 2, 's': 2, 'come': 2, 'nmh': 2, 'use': 2, 'on': 2, 'and': 2, 'mh_profil': 2, 'one': 2, 'redhat': 2, 'wed': 1, 'aug': 1, 'chri': 1, 'garrigu': 1, 'cwg': 1, 'numberfanumberd': 1, 'messag': 1, 'id': 1, 'tmda': 1, 'vircio': 1, 'can': 1, 'reproduc': 1, 'for': 1, 'me': 1, 'it': 1, 'veri': 1, 'repeat': 1, 'like': 1, 'everi': 1, 'time': 1, 'without': 1, 'fail': 1, 'debug': 1, 'log': 1, 'happen': 1, 'pick_it': 1, 'ftoc_pickmsg': 1, 'mark': 1, 'tkerror': 1, 'syntax': 1, 'in': 1, 'express': 1, 'int': 1, 'note': 1, 'if': 1, 'run': 1, 'by': 1, 'hand': 1, 'where': 1, 'obvious': 1, 'm': 1, 'compil': 1, 'fuchsia': 1, 'cs': 1, 'mu': 1, 'oz': 1, 'au': 1, 'at': 

Now we count words we need to turn them into vectors. For this we build another transformer whose fit method will build the vocab (an ordered list of the most common words) and whose transform method will use the vocabularly to convert word counts to vectors.

In [21]:
from scipy.sparse import csr_matrix

class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, vocabulary_size=100):
        self.vocabulary_size = vocabulary_size

    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 10)
        
        most_common = total_count.most_common(self.vocabulary_size)
        self.vocabulary = {word:index + 1 for index, (word, count) in enumerate(most_common)}
        return self
    
    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary.get(word, 0)) #words not in vocab go in column zero
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))


In [22]:
#testing transformer
vocab_transformer = WordCounterToVectorTransformer(vocabulary_size=10)
some_vectors = vocab_transformer.fit_transform(some_word_counts)
some_vectors.toarray()

array([[184,  15,  39,   2,   1,   5,   4,   9,   4,   2,   1],
       [ 99,   5,   4,   3,   3,   3,   3,   0,   2,   3,   3],
       [212,  16,   5,  10,  11,   5,   3,   0,   2,   3,   4]])

Lets prepare our training and test sets

In [23]:
from sklearn.model_selection import train_test_split

In [24]:
X = np.array(ham_emails + spam_emails, dtype=object)
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Lets build our preprocessing pipeline for emails.

In [25]:
from sklearn.pipeline import Pipeline

preprocess_pipeline = Pipeline([
    ("email_to_word_count", EmailToWordCounterTransformer()),
    ("word_count_to_vector", WordCounterToVectorTransformer(1000))
])

In [26]:
len(X_train)

2400

In [27]:
X_train_transformed = preprocess_pipeline.fit_transform(X_train)
X_train_transformed[0].toarray()

array([[3, 0, 0, ..., 0, 0, 0]])

In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [29]:
X_train_transformed.shape


(2400, 1001)

In [30]:
y_train.shape, X_train.shape

((2400,), (2400,))

In [31]:
log_clf = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42)
score = cross_val_score(log_clf, X_train_transformed, y_train, cv=3, verbose=3)
score.mean()


[CV] END ................................ score: (test=0.984) total time=   0.0s
[CV] END ................................ score: (test=0.985) total time=   0.0s
[CV] END ................................ score: (test=0.994) total time=   0.1s


0.9874999999999999

We will come back to train the model on more data and test out other models. For now lets fit the model look at some test statistics and export the preprocess pipeline and the model so we can deploy them

In [32]:
from sklearn.metrics import precision_score, recall_score

X_test_transformed = preprocess_pipeline.transform(X_test)

log_clf.fit(X_train_transformed, y_train)

y_pred = log_clf.predict(X_test_transformed)

print('Precision: ', precision_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))

Precision:  0.9789473684210527
Recall:  0.9789473684210527


In [33]:
import joblib as jb

jb.dump(preprocess_pipeline, r"C:\Users\caine\OneDrive\Documents\spam_detection_app\backend\app\models\preprocess_pipeline.pkl")
jb.dump(log_clf, r"C:\Users\caine\OneDrive\Documents\spam_detection_app\backend\app\models\logistic_classifier.pkl")

['C:\\Users\\caine\\OneDrive\\Documents\\spam_detection_app\\backend\\app\\models\\logistic_classifier.pkl']

To load a pickled custom transformer, the transfomer must be imported from the file it is defined in, pickled and then when loaded also imported into the file it is loaded from - if that makes sense. Given this we will import the transformers into this notebook (rather than define them in the notebook), fit them and then pickle them from here.

In [6]:
from modelling.custom_transformers import WordCounterToVectorTransformer, EmailToWordCounterTransformer

In [35]:
preprocess_pipeline = Pipeline([
    ("email_to_word_count", EmailToWordCounterTransformer()),
    ("word_count_to_vector", WordCounterToVectorTransformer(1000))
])

In [36]:
preprocess_pipeline.fit_transform(X_train)

<2400x1001 sparse matrix of type '<class 'numpy.int32'>'
	with 282901 stored elements in Compressed Sparse Row format>

In [37]:
jb.dump(preprocess_pipeline, r"C:\Users\caine\OneDrive\Documents\spam_detection_app\backend\app\models\preprocess_pipeline.pkl")

['C:\\Users\\caine\\OneDrive\\Documents\\spam_detection_app\\backend\\app\\models\\preprocess_pipeline.pkl']

In [7]:
EmailToWordCounterTransformer().transform(['<a>caine</a>'])

array([Counter({'a': 2, 'cain': 1})], dtype=object)