# Spam Classifier

In this notebook we are going to build an spam classifier using spam and ham email examples from [Apache SpamAssassin's public datasets](http://spamassassin.apache.org/old/publiccorpus/)

### Import libraries

In [107]:
import numpy as np
import pandas as pd
import os, email, email.policy
import re

from bs4 import BeautifulSoup
import nltk
import urlextract

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

## Loading data

In [2]:
# Load emails using Python email module
def load_emails(is_spam):
    directory = "spam" if is_spam else "easy_ham"
    path = os.path.join('data', directory)
    emails_list = []
    for filename in os.listdir(path):
        with open(os.path.join(path, filename), "rb") as f:
            new_email = email.parser.BytesParser(policy=email.policy.default).parse(f)
            emails_list.append(new_email)
    return emails_list

In [3]:
spam_emails = load_emails(is_spam=True)
ham_emails = load_emails(is_spam=False)

In [4]:
len(spam_emails), len(ham_emails)

(501, 2501)

In [5]:
print(spam_emails[1].get_content())

You have been removed from our list.
You will NOT be able to recieve todays picks in the email
You will NOT be notified of any new sports pick websites.

IF YOU HAVE QUESTIONS ABOUT WHY YOUR ACCOUNT IS EXPIRED,
YOUR ACCOUNT WAS CLOSED FOR ONE OF THE FOLLOWING REASONS. 

1. YOU FAILED TO LOG INTO YOUR ACCCOUNT FOR OVER A MONTH. 
2. YOUR ACCOUNT WAS FOUND ON A SPAM LIST AND REJECTED. 
3. THE GIFT ACCOUNT SOMEONE SIGNED YOU UP FOR EXPIRED.

If you wish to rejoin please go to the following url:
http://www.freewebs.com/registar/

YOU DO NOT NEED TO DO ANYTHING TO BE REMOVED FROM THIS eMAIL LIST.
THIS IS A ONE TIME MAILING TO NOTIFY YOU THAT, YOU ARE REMOVED.
However, you may reply with the word "remove" in the subject line





## Data exploration

Let's see how emails are structured

In [6]:
def get_email_structure(email):
    if isinstance(email, str):
        return email
    payload = email.get_payload()
    if isinstance(payload, list):
        return "multipart({})".format(", ".join([
            get_email_structure(sub_email)
            for sub_email in payload]))
    else:
        return email.get_content_type()

In [7]:
from collections import Counter

def structures_counter(emails):
    structures = Counter()
    for email in emails:
        structure = get_email_structure(email)
        structures[structure] += 1
    return structures

In [8]:
structures_counter(ham_emails).most_common()

[('text/plain', 2409),
 ('multipart(text/plain, application/pgp-signature)', 66),
 ('multipart(text/plain, text/html)', 8),
 ('multipart(text/plain, text/plain)', 4),
 ('multipart(text/plain)', 3),
 ('multipart(text/plain, application/octet-stream)', 2),
 ('multipart(text/plain, application/x-pkcs7-signature)', 1),
 ('multipart(text/plain, application/x-java-applet)', 1),
 ('multipart(multipart(text/plain, text/plain, text/plain), application/pgp-signature)',
  1),
 ('multipart(text/plain, application/ms-tnef, text/plain)', 1),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(text/plain, text/enriched)', 1),
 ('multipart(text/plain, video/mng)', 1),
 ('multipart(text/plain, multipart(text/plain, text/plain), multipart(multipart(text/plain, application/x-pkcs7-signature)))',
  1),
 ('multipart(text/plain, multipart(text/plain, text/plain), text/rfc822-headers)',
  1)]

In [9]:
structures_counter(spam_emails).most_common()

[('text/plain', 219),
 ('text/html', 183),
 ('multipart(text/plain, text/html)', 45),
 ('multipart(text/html)', 20),
 ('multipart(text/plain)', 19),
 ('multipart(multipart(text/html))', 5),
 ('multipart(text/plain, image/jpeg)', 3),
 ('multipart(text/html, application/octet-stream)', 2),
 ('multipart(multipart(text/plain, text/html), image/gif)', 1),
 ('multipart(text/plain, application/octet-stream)', 1),
 ('multipart(multipart(text/html), application/octet-stream, image/jpeg)', 1),
 ('multipart/alternative', 1),
 ('multipart(text/html, text/plain)', 1)]

It seems that ham emails are mostly plain text while spam emails has a lot of HTML.

Email headers:

In [10]:
for header, value in spam_emails[0].items():
    print(header,':',value)

Return-Path : <ilug-admin@linux.ie>
Delivered-To : zzzz@localhost.spamassassin.taint.org
Received : from localhost (localhost [127.0.0.1])	by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id AA01B43F99	for <zzzz@localhost>; Fri, 23 Aug 2002 06:34:02 -0400 (EDT)
Received : from phobos [127.0.0.1]	by localhost with IMAP (fetchmail-5.9.0)	for zzzz@localhost (single-drop); Fri, 23 Aug 2002 11:34:02 +0100 (IST)
Received : from lugh.tuatha.org (root@lugh.tuatha.org [194.125.145.45]) by    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g7NAUdZ20252 for    <zzzz-ilug@jmason.org>; Fri, 23 Aug 2002 11:30:39 +0100
Received : from lugh (root@localhost [127.0.0.1]) by lugh.tuatha.org    (8.9.3/8.9.3) with ESMTP id LAA21467; Fri, 23 Aug 2002 11:29:43 +0100
Received : from relay.dub-t3-1.nwcgroup.com    (postfix@relay.dub-t3-1.nwcgroup.com [195.129.80.16]) by lugh.tuatha.org    (8.9.3/8.9.3) with ESMTP id LAA21432 for <ilug@linux.ie>; Fri,    23 Aug 2002 11:29:35 +0100
Received : from em

In [12]:
spam_emails[0]['Subject']

"[ILUG] Join the Web's Fastest Growing Singles Community 11.67"

## Data Preprocessing

First of all, let's split it into training and test set.

In [13]:
X = np.array(ham_emails + spam_emails)
y = np.array([0]* len(ham_emails) + [1]*len(spam_emails)) 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:
X_train.shape, X_test.shape

((2401,), (601,))

We will start defining a function for parsing html to plain text (using Beautiful Soup module)

In [15]:
def html_to_plain_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text()

In [16]:
html_spam_emails = [email for email in X_train[y_train==1]
                    if get_email_structure(email) == "text/html"]
sample = html_spam_emails[0]

In [17]:
html = sample.get_content()
type(html_to_plain_text(html))

str

Now let's build a function that converts an email into plain text using the previous function

In [99]:
def email_to_text(email):
    html = None
    for part in email.walk(): # remember that an email can be multipart
        part_type = part.get_content_type()
        if not part_type in ('text/plain', 'text/html'):
            continue
        try:
            content = part.get_content()
        except: # in case of encoding issues
            content = str(part.get_payload())
            
        if part_type == "text/plain": # if already plain text
            return content
        else: 
            html = content
    if html:
        return html_to_plain_text(html)

In [32]:
print(email_to_text(sample)[:500])






Check the REPORTS you would like to receive















Check the Available REPORTS you would like to receive:

 Keep on TOP of the latest NEWS, Get
                Great Special DEALS now...
				 It is complimentary, it costs nothing, YOU can QUIT anytime !





Financial - Stocks - Loans - Mortgage







				Financial news & Stock market
			






                Government & Politics / discussions
			






				Credit Cards &  Mortgage Refinancing / Loans
			



Health - Fitness - Ho


We are gonna use "NLTK" for stemming the email's words and "urlextract" module in order to find urls in the text. Let's put all this together into a sklearn transformer that we use to convert emails to word counters.

In [100]:
from sklearn.base import BaseEstimator, TransformerMixin

url_extractor = urlextract.URLExtract()
stemmer = nltk.PorterStemmer()

class EmailtoWordCounterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, lower_case=True, remove_punctuation=True, replace_urls=True, 
                 replace_numbers=True, stemming=True):
        self.lower_case = lower_case
        self.remove_punctuation = remove_punctuation
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        self.stemming = stemming
    
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X_transformed = []
        for idx, email in enumerate(X):
            text = email_to_text(email) or ""
            if text == None:
                print(idx)
            if self.lower_case:
                text = text.lower()
            if self.replace_urls:
                urls = list(set(url_extractor.find_urls(text)))
                urls.sort(key=lambda url: len(url), reverse=True)
                for url in urls:
                    text.replace(url, " URL ")
            if self.replace_numbers:
                text = re.sub(r'\d+(?:\.\d*(?:[eE]\d+))?', 'NUMBER', text)
            if self.remove_punctuation:
                text = re.sub(r'\W+', ' ', text, flags=re.M)
            word_counts = Counter(text.split())
            if self.stemming:
                stemmed_word_counts = Counter()
                for word, count in word_counts.items():
                    stemmed_word = stemmer.stem(word)
                    stemmed_word_counts[stemmed_word] += count
                word_counts = stemmed_word_counts
            X_transformed.append(word_counts)
        return np.array(X_transformed)

In [49]:
EmailtoWordCounterTransformer().fit_transform(X_train[:2])

array([Counter({'file': 7, 'number': 6, 'your': 5, 'and': 4, 'or': 4, 'on': 4, 'http': 3, 'to': 3, 'the': 3, 'person': 3, 'desktop': 3, 'a': 2, 'os': 2, 'x': 2, 'link': 2, 'folder': 2, 'laura': 2, 'it': 2, 'with': 2, 'all': 2, 'ani': 2, 'www': 2, 'com': 2, 'url': 1, 'boingbo': 1, 'net': 1, 'date': 1, 'not': 1, 'suppli': 1, 'sixdegress': 1, 'is': 1, 'app': 1, 'that': 1, 'data': 1, 'mine': 1, 'own': 1, 'hard': 1, 'drive': 1, 'tri': 1, 'build': 1, 'between': 1, 'peopl': 1, 'carpent': 1, 'at': 1, 'con': 1, 'wa': 1, 'talk': 1, 'up': 1, 'yesterday': 1, 'look': 1, 'way': 1, 'cool': 1, 'i': 1, 've': 1, 'just': 1, 'download': 1, 'demo': 1, 'play': 1, 'locat': 1, 'similar': 1, 'name': 1, 'revis': 1, 'anywher': 1, 'system': 1, 'show': 1, 'email': 1, 'thread': 1, 'relat': 1, 'view': 1, 'ha': 1, 'sent': 1, 'you': 1, 'regardless': 1, 'of': 1, 'where': 1, 'those': 1, 'are': 1, 'store': 1, 'comput': 1, 'creat': 1, 'dynam': 1, 'self': 1, 'updat': 1, 'project': 1, 'find': 1, 'misfil': 1, 'attach': 1, 'q

Now we have the word counts, we need to convert them to vectors. For this, we will build another transformer whose fit() method will build the vocabulary (an ordered list of the most common words) and whose transform() method will use the vocabulary to convert word counts to vectors

In [51]:
from scipy.sparse import csr_matrix

class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 10)
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.most_common = most_common
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self
    def transform(self, X, y=None):
        rows=[]
        cols=[]
        data=[]
        
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))
        

In [52]:
sample = EmailtoWordCounterTransformer().fit_transform(X_train[:2])
vocab_transformer = WordCounterToVectorTransformer(vocabulary_size=10)
X_few_vectors = vocab_transformer.fit_transform(sample)
X_few_vectors

<2x11 sparse matrix of type '<class 'numpy.int64'>'
	with 21 stored elements in Compressed Sparse Row format>

In [53]:
X_few_vectors.toarray()

array([[114,   4,   3,   2,   3,   6,   7,   4,   5,   2,   1],
       [239,  11,  11,  11,   7,   2,   0,   3,   1,   4,   5]])

In [54]:
vocab_transformer.vocabulary_

{'and': 1,
 'the': 2,
 'a': 3,
 'to': 4,
 'number': 5,
 'file': 6,
 'or': 7,
 'your': 8,
 'all': 9,
 'in': 10}

## Model building

Transformation pipeline:

In [101]:
preprocess_pipeline = Pipeline([
    ('email_to_wordcount', EmailtoWordCounterTransformer()),
    ('wordcount_to_vector', WordCounterToVectorTransformer())
])

In [103]:
%%time
X_train_prepared = preprocess_pipeline.fit_transform(X_train)

CPU times: user 10.7 s, sys: 1.32 ms, total: 10.7 s
Wall time: 10.7 s


In [111]:
%%time
log_reg = LogisticRegression()
scores = cross_val_score(log_reg, X_train_prepared, y_train, cv=3, verbose=2)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s


[CV]  ................................................................
[CV] ................................................. , total=   0.2s
[CV]  ................................................................
[CV] ................................................. , total=   0.1s
[CV]  ................................................................
[CV] ................................................. , total=   0.1s
CPU times: user 415 ms, sys: 0 ns, total: 415 ms
Wall time: 414 ms


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.4s finished


In [113]:
scores.mean()

0.9812562421972535

### Test set

In [116]:
from sklearn.metrics import precision_score, recall_score

X_test_transformed = preprocess_pipeline.transform(X_test)

log_clf = LogisticRegression()
log_clf.fit(X_train_prepared, y_train)

y_pred = log_clf.predict(X_test_transformed)

print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred)))

Precision: 98.02%
Recall: 95.19%
