# Spam / Ham Classification

This notebook is my attempt at creating a spam classifier using basic machine learning models. I'm going to train my classification algorithms on a small dataset of 2000 emails, 1000 ham (real) and 1000 spam, downloaded from Apache SpamAssassin. Each email is an individual text file, and the full download consists of approximately 6000 files spread across 8 folders.

In [1]:
import os

In [2]:
root = r'C:\Users\Will\Desktop\Datasets\Apache Spam Ham\Spam Ham Dataset'
folders = ['easy_ham', 'spam3'] # Drawing from these two folders

In [3]:
def read_email(email):
    # Print an email line by line; the with-block closes the file automatically
    with open(email, 'r') as file:
        content = file.readlines()
    for x in content:
        print(x)

In [4]:
sample_path = os.path.join(root, folders[1])
sample_email = os.path.join(sample_path, os.listdir(sample_path)[0])

In [5]:
read_email(sample_email)

From ilug-admin@linux.ie  Tue Aug  6 11:51:02 2002

Return-Path: <ilug-admin@linux.ie>

Delivered-To: yyyy@localhost.netnoteinc.com

Received: from localhost (localhost [127.0.0.1])

	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9E1F5441DD

	for <jm@localhost>; Tue,  6 Aug 2002 06:48:09 -0400 (EDT)

Received: from phobos [127.0.0.1]

	by localhost with IMAP (fetchmail-5.9.0)

	for jm@localhost (single-drop); Tue, 06 Aug 2002 11:48:09 +0100 (IST)

Received: from lugh.tuatha.org (root@lugh.tuatha.org [194.125.145.45]) by

    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g72LqWv13294 for

    <jm-ilug@jmason.org>; Fri, 2 Aug 2002 22:52:32 +0100

Received: from lugh (root@localhost [127.0.0.1]) by lugh.tuatha.org

    (8.9.3/8.9.3) with ESMTP id WAA31224; Fri, 2 Aug 2002 22:50:17 +0100

Received: from bettyjagessar.com (w142.z064000057.nyc-ny.dsl.cnc.net

    [64.0.57.142]) by lugh.tuatha.org (8.9.3/8.9.3) with ESMTP id WAA31201 for

    <ilug@linux.ie>; Fri, 2 Aug 2002 22:50

## Create Training and Testing sets

First, I'm going to label the 2000 files I'll use for my dataset and move them into separate directories.

In [6]:
# rename files to be prefixed with 'SPAM' or 'HAM' before merging into respective training and test dirs

os.chdir(root)
num_files = 1000 # 1000 spam, 1000 ham

for folder in folders:
    path = (os.path.join(root, folder))
    os.chdir(path)
    os.makedirs('Labeled') # Create new folder 'Labeled' to hold new labeled files
    
    # take the first num_files (1000) entries from each dir
    for file in os.listdir()[0:num_files]:
        if file == 'Labeled':
            pass
        elif folder.startswith('easy'):
            os.rename(os.path.join(path, file), os.path.join(os.path.join(path, 'Labeled'), 'HAM_' + file))
        else:
            os.rename(os.path.join(path, file), os.path.join(os.path.join(path, 'Labeled'), 'SPAM_' + file))
            
        # Ham emails start with HAM_ and Spam emails start with SPAM_
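Looking back, a gentler way to do the labeling step is to copy rather than move, and to snapshot the file list before creating the output folder so the new folder can never end up in the slice. The sketch below is only an illustration under those assumptions: `label_files` is a hypothetical helper, and the demo runs on throwaway files in a temporary directory rather than on the real corpus.

```python
import shutil
import tempfile
from pathlib import Path


def label_files(src_dir, dst_dir, prefix, limit):
    """Copy up to `limit` files from src_dir into dst_dir, prefixing each name."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    # Snapshot the file list BEFORE creating anything, and skip sub-directories.
    files = sorted(p for p in src_dir.iterdir() if p.is_file())[:limit]
    dst_dir.mkdir(parents=True, exist_ok=True)
    for p in files:
        shutil.copy2(p, dst_dir / f"{prefix}_{p.name}")
    return len(files)


# Tiny demonstration on throwaway files (hypothetical names, not the real corpus).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "easy_ham"
    src.mkdir()
    for i in range(5):
        (src / f"{i:04d}.msg").write_text("hello")
    copied = label_files(src, src / "Labeled", "HAM", limit=3)
    labeled_names = sorted(p.name for p in (src / "Labeled").iterdir())
```

Because the originals stay where they are, the step can be re-run safely if something goes wrong halfway through.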

Next, I'll create training and testing folders to move the emails into.

In [7]:
os.mkdir(os.path.join(root, 'Train')) # Create Train/Test directories
os.mkdir(os.path.join(root, 'Test'))

file_paths = [] # list of complete file paths for emails in their respective 'Labeled' directories
file_names = [] # list of only the file names
for folder in folders:
    path = (os.path.join(root, folder))
    os.chdir(os.path.join(path, 'Labeled'))
    for file in os.listdir():
        file_paths.append(os.path.abspath(file)) # populate both lists
        file_names.append(file)

2000 files in total.

In [8]:
file_count = len(file_names)
file_count

2000

I'll define the size of the training set to be 80% of the entire dataset, so 1600 emails.

In [9]:
train_size = int(file_count*0.8) # 80%
print('Size of training set: ' + str(train_size))
print('Size of testing set: ' + str(file_count - train_size))

train_path = os.path.join(root, 'Train') # Create paths to the training and testing directories
test_path = os.path.join(root, 'Test')

Size of training set: 1600
Size of testing set: 400


Now, I'll randomly split the dataset into the training and testing folders based on the defined train size. Since I'm just shuffling the files with numpy, without regard to their labels, the ham/spam ratio within each set won't be exactly 50/50, but the slight difference will have a negligible impact.

In [10]:
import numpy as np
import numpy.random as rnd

In [11]:
rnd.seed(42) # seed the generator to duplicate results
file_index = np.arange(len(file_names)) # 0, 1, 2, ..., 1999
rnd.shuffle(file_index) # randomly shuffle all the files in the entire dataset

for file in file_index[:train_size]: # first 1600 shuffled indices -> Train
    try:
        os.rename(file_paths[file], os.path.join(train_path, file_names[file]))
    except OSError:
        print('Copy error') # **
        
for file in file_index[train_size:]: # remaining 400 shuffled indices -> Test
    try:
        os.rename(file_paths[file], os.path.join(test_path, file_names[file]))
    except OSError:
        print('Test copy error') # **
        
# ** Previous implementations of this method resulted in some files not being moved, for a reason I never tracked down. When I
#    changed the number of files drawn from each original spam/ham folder, the problem stopped occurring. I left the error
#    handling in just in case. For the purpose of this notebook, I decided not to investigate the problem any further.


In [12]:
for path in [train_path, test_path]:
    print(path)
    spam = 0
    ham = 0
    for x in os.listdir(path):
        if x.startswith('HAM'):
            ham += 1
        else:
            spam += 1
    print(('Ham: {}, Spam: {}').format(ham, spam))

C:\Users\Will\Desktop\Datasets\Apache Spam Ham\Spam Ham Dataset\Train
Ham: 818, Spam: 782
C:\Users\Will\Desktop\Datasets\Apache Spam Ham\Spam Ham Dataset\Test
Ham: 182, Spam: 218
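The uneven counts above come from shuffling all 2000 files together. If an exact 50/50 ratio in both sets mattered, one option would be to shuffle and cut each class separately. This is a minimal sketch of that idea using only the standard library; `stratified_split` and the generated file names are hypothetical, and it is not what the notebook actually ran.

```python
import random


def stratified_split(names, train_fraction=0.8, seed=42):
    """Split file names into train/test while keeping the ham/spam ratio fixed."""
    rng = random.Random(seed)
    train, test = [], []
    for prefix in ("HAM_", "SPAM_"):
        group = sorted(n for n in names if n.startswith(prefix))
        rng.shuffle(group)  # shuffle within the class only
        cut = int(len(group) * train_fraction)
        train += group[:cut]
        test += group[cut:]
    return train, test


# Hypothetical file names standing in for the 2000 labeled emails.
names = [f"HAM_{i:04d}" for i in range(1000)] + [f"SPAM_{i:04d}" for i in range(1000)]
train, test = stratified_split(names)
train_ham = sum(n.startswith("HAM_") for n in train)
test_ham = sum(n.startswith("HAM_") for n in test)
```

With 1000 files per class and an 80% cut, each class contributes 800 files to training and 200 to testing.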


## Extract features and labels from the Training set

The algorithms I'm using to generate the features from the training set's emails are mostly not my own; they're ones I found in a similar Spam classification exercise from KDnuggets.com. (http://www.kdnuggets.com/2017/03/email-spam-filtering-an-implementation-with-python-and-scikit-learn.html) I've tweaked them to fit the context I'm working in.

The function below parses every file in the training directory and extracts each word. It then builds a dictionary of all the words found, counting the number of times each comes up across the entire training set. I'm going to use the 3000 most common words as features for my algorithm, with some caveats. I ignore words that contain numbers or symbols of any kind, which show up in a lot of the emails (e.g. ASCII art and HTML formatting). I also ignore stop words (i.e. common words like 'it', 'the', 'is') by filtering the dictionary against the stopwords dataset found in the NLTK (Natural Language Toolkit) module. Finally, I ignore all words that are a single character long (e.g. stray letters). The function returns a list of the 3000 most common words and the number of times they occurred.

In [13]:
from collections import Counter # counts frequency of words
from nltk.corpus import stopwords # stopwords dataset

In [14]:
def make_dictionary(train_dir):
    emails = [os.path.join(train_dir, f) for f in os.listdir(train_dir)]
    all_words = []
    for mail in emails:
        with open(mail) as m: # the with-block closes the file automatically
            for line in m:
                all_words += line.split()
                
    dictionary = Counter(all_words)
    stop_words = set(stopwords.words('english')) # grab the English list of stop words from the NLTK module
    
    for item in list(dictionary): # iterate over a copy of the keys, since entries are deleted in the loop
        if item in stop_words: # remove common stopwords (case-sensitive match, so 'The' survives while 'the' is removed)
            del dictionary[item]
        elif not item.isalpha(): # remove words that aren't entirely alphabetical characters
            del dictionary[item]
        elif len(item) == 1: # remove words with a single character (stray letters)
            del dictionary[item]
    
    return dictionary.most_common(3000) # list of (word, count) pairs

Now let's pass the function the training directory and see which words came up most often.

In [15]:
word_dictionary = make_dictionary(train_path)
word_dictionary

[('id', 6504),
 ('ESMTP', 4757),
 ('Sep', 3765),
 ('Aug', 2582),
 ('localhost', 2127),
 ('Jul', 2062),
 ('The', 1865),
 ('From', 1543),
 ('Oct', 1472),
 ('May', 1355),
 ('Jun', 1150),
 ('email', 1027),
 ('This', 1009),
 ('SMTP', 991),
 ('bulk', 975),
 ('IMAP', 967),
 ('one', 924),
 ('get', 881),
 ('You', 837),
 ('like', 794),
 ('people', 793),
 ('If', 792),
 ('would', 791),
 ('We', 767),
 ('Microsoft', 715),
 ('To', 712),
 ('Mon', 711),
 ('list', 707),
 ('Rohit', 630),
 ('Khare', 616),
 ('Friends', 603),
 ('use', 597),
 ('Your', 586),
 ('make', 586),
 ('send', 539),
 ('jalapeno', 538),
 ('It', 535),
 ('want', 534),
 ('time', 514),
 ('receive', 513),
 ('Internet', 511),
 ('mailing', 505),
 ('new', 497),
 ('also', 494),
 ('us', 494),
 ('even', 484),
 ('information', 472),
 ('In', 471),
 ('may', 465),
 ('first', 458),
 ('Normal', 445),
 ('please', 434),
 ('New', 430),
 ('For', 430),
 ('know', 421),
 ('money', 415),
 ('could', 414),
 ('business', 405),
 ('free', 387),
 ('much', 382),
 ('On

A lot of the most common words are those that show up in email formatting: 'id', 'ESMTP', 'SMTP', 'email'. Some other common words are the names of months: 'Jun', 'Jul', 'Aug', 'Sep'. These occurrences are likely the months when the individual emails were sent. Capitalized stop words such as 'The', 'This' and 'You' also made it through, because the stop word match is case-sensitive and the NLTK list is lowercase. Besides those, there are a few words that seem to be indicative of spam: 'bulk', 'receive', 'money', 'send', etc.
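One way to keep capitalized stop words like 'The' and 'This' out of the list would be to lowercase each token before filtering. The sketch below shows the idea on a few made-up lines; `count_words` is a hypothetical helper and `STOP_WORDS` is a tiny stand-in for the NLTK list, so treat it as an illustration rather than the function used in this notebook. Lowercasing would also merge pairs like 'New' and 'new' into one feature, which changes the feature set.

```python
from collections import Counter

# Tiny stand-in for NLTK's English stop word list (assumption: illustration only).
STOP_WORDS = {"the", "this", "you", "if", "we", "to", "it", "is"}


def count_words(lines, stop_words=STOP_WORDS):
    """Count alphabetic, multi-character, non-stop words, ignoring case."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            word = token.lower()  # normalize case before any filtering
            if word.isalpha() and len(word) > 1 and word not in stop_words:
                counts[word] += 1
    return counts


# Made-up sample lines, not taken from the corpus.
sample = ["The money is free", "If you send money", "This list"]
counts = count_words(sample)
```

On the sample lines only 'money', 'free', 'send' and 'list' survive, with 'money' counted twice.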

Now, we need to translate this list into a feature matrix. The function below creates an $M \times N$ matrix where $M$ is the number of emails being passed to it (training or testing folder) and $N$ is the number of entries in our word dictionary (3000 features). It then parses each email for the words in the dictionary, adding to the matching feature column of that email's row every time one is found.

In [24]:
def extract_features(mail_dir, dictionary):
    count = 0
    files = [os.path.join(mail_dir, file) for file in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files), len(dictionary)))
    docID = 0 # matrix row
    for file in files:
        with open(file) as f: # the with-block closes the file automatically
            for line in f:
                words = line.split()
                for word in words:
                    for i, d in enumerate(dictionary):
                        if d[0] == word:
                            wordID = i # matrix column
                            # Note: this runs once per occurrence of word, so a word repeated k times
                            # on one line adds k * k to its column rather than k
                            features_matrix[docID, wordID] += words.count(word)
        docID += 1 # move to next row
        count += 1
        if count % 100 == 0:
            print('{} files processed'.format(count))
    return features_matrix
                        

In [25]:
X = extract_features(train_path, word_dictionary) # matrix of training features

100 files processed
200 files processed
300 files processed
400 files processed
500 files processed
600 files processed
700 files processed
800 files processed
900 files processed
1000 files processed
1100 files processed
1200 files processed
1300 files processed
1400 files processed
1500 files processed
1600 files processed
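The function above scans the whole dictionary for every word in every line, which is why it takes a while, and because it adds `words.count(word)` once per occurrence, a word repeated on a line is counted more than once per appearance. A possible alternative is to build a word-to-column lookup once and add 1 per occurrence. This is a rough sketch with plain lists and made-up documents; `count_features` is a hypothetical name, and switching to it would change the feature values and therefore possibly the scores reported below.

```python
def count_features(documents, vocabulary):
    """Return one row of word counts per document, one column per vocabulary word."""
    index = {word: col for col, word in enumerate(vocabulary)}  # O(1) column lookup
    matrix = [[0] * len(vocabulary) for _ in documents]
    for row, lines in enumerate(documents):
        for line in lines:
            for word in line.split():
                col = index.get(word)
                if col is not None:
                    matrix[row][col] += 1  # exactly one increment per occurrence
    return matrix


# Made-up vocabulary and documents (each document is a list of lines).
vocabulary = ["money", "IMAP", "free"]
documents = [
    ["free money money", "by localhost with IMAP"],
    ["nothing relevant here"],
]
matrix = count_features(documents, vocabulary)
```

Here 'money' appears twice in the first document and is counted as 2. The nested list can be passed to `np.array` if a numpy matrix is needed.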


This short function scans a folder and returns a target array: 1 if a file name starts with 'HAM', else 0.

In [26]:
def extract_targets(mail_dir):
    targets = np.array([1 if file.startswith('HAM') else 0 for file in os.listdir(mail_dir)])
    return targets

In [27]:
targets = extract_targets(train_path)

In [28]:
targets

array([1, 1, 1, ..., 0, 0, 0])

Since the files are listed alphabetically in each directory, all the ham emails appear before the spam. This would be a problem if I were using an index feature of some sort, because the classifier would associate ham emails with lower index numbers and spam with higher ones. Since I'm not, the targets appearing like this is fine.
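One place where sorted targets could still bite is cross-validation: if the rows were cut into consecutive blocks, some folds would contain only one class. To my understanding, scikit-learn's `cross_val_score` uses stratified folds for classifiers when given an integer `cv`, which avoids this. The sketch below just illustrates the difference with plain Python; `contiguous_folds` and `stratified_folds` are hypothetical helpers, not scikit-learn's implementation.

```python
def contiguous_folds(n_samples, k):
    """Split indices 0..n-1 into k consecutive blocks (no shuffling)."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def stratified_folds(labels, k):
    """Deal each class's indices round-robin across k folds."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds


labels = [1] * 818 + [0] * 782  # same ordering and class counts as the training targets
plain = contiguous_folds(len(labels), 10)
strat = stratified_folds(labels, 10)
plain_ham_share = [sum(labels[i] for i in f) / len(f) for f in plain]
strat_ham_share = [sum(labels[i] for i in f) / len(f) for f in strat]
```

With consecutive blocks the first fold is all ham and the last is all spam, while the stratified folds each hold roughly 51% ham, close to the overall ratio.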

## Model Selection, Training, and Evaluation

With the emails all processed and sitting pretty in a feature matrix, it's time to see if we can build a great classifier using machine learning algorithms. I'm going to experiment using a Support Vector Machine and a Decision Tree, two powerful algorithms that can cope with a high-dimensional feature space. The Decision Tree will also let me see the relative importance of the word features in making a classification.

In [29]:
import sklearn

In [30]:
from sklearn.svm import SVC # Support Vector Classifier

In [31]:
svm_clf = SVC()
svm_clf.fit(X, targets)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [32]:
svm_clf.score(X, targets)

0.99750000000000005

Right off the bat, we can see that the SVC fit the training set nearly perfectly. It seems as though classifying these emails may be pretty straightforward... To check for overfitting, let's do some cross-validation.

In [58]:
from sklearn.model_selection import cross_val_score

In [64]:
svc_scores = cross_val_score(svm_clf, X, targets, cv=10)

In [62]:
def display_scores(scores):
    print('Scores:', scores)
    print('Mean:', scores.mean())
    print('Standard Deviation:', scores.std())

In [65]:
display_scores(svc_scores)

Scores: [ 0.9689441   0.98757764  0.98125     0.9875      1.          1.          0.99375
  0.9875      0.99371069  1.        ]
Mean: 0.990023243095
Standard Deviation: 0.00930649027043


The SVC still holds up really well, with the lowest accuracy on any of the folds being ~97%. With a standard deviation of less than 1%, the scores are consistent across folds, which suggests the near-perfect training score isn't just overfitting. Let's see how the Decision Tree fares.

In [67]:
from sklearn.tree import DecisionTreeClassifier

In [68]:
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X, targets)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [70]:
tree_clf.score(X, targets)

1.0

A perfect fit. Not surprising given the SVC; let's check for overfitting with cross-validation:

In [71]:
tree_scores = cross_val_score(tree_clf, X, targets, cv=10)

In [72]:
display_scores(tree_scores)

Scores: [ 0.98757764  0.88819876  1.          1.          1.          1.          1.
  1.          1.          1.        ]
Mean: 0.987577639752
Standard Deviation: 0.0333326903478


Interesting... 8 of the 10 folds scored perfectly and one came in at ~99%, but one fold dipped to ~89%, roughly 11 percentage points lower. Perhaps the emails held out in that fold had feature vectors unlike anything the tree saw in the other nine.

## Exploratory Feature Analysis

To see which word features most significantly contributed to the Decision Tree's classifications, we can check the feature importances field in the classifier.

In [73]:
feature_importances = tree_clf.feature_importances_

We pair the importances with the word dictionary (they share the same index) and sort the pairs in descending order to see which ones had the greatest contributions:

In [74]:
sorted(zip(feature_importances, word_dictionary), reverse=True)

[(0.68473698548256134, ('IMAP', 967)),
 (0.30778663520333421, ('Jul', 2062)),
 (0.002498212216882349, ('unknown', 275)),
 (0.0024921190163532041, ('From', 1543)),
 (0.0024860480808689676, ('MOST', 15)),
 (0.0, ('zzzzteana', 14)),
 (0.0, ('young', 20)),
 (0.0, ('yet', 61)),
 (0.0, ('yesterday', 15)),
 (0.0, ('yes', 15)),
 (0.0, ('years', 294)),
 (0.0, ('year', 144)),
 (0.0, ('wrote', 33)),
 (0.0, ('wrong', 38)),
 (0.0, ('written', 38)),
 (0.0, ('writing', 50)),
 (0.0, ('write', 88)),
 (0.0, ('would', 791)),
 (0.0, ('worth', 60)),
 (0.0, ('worst', 14)),
 (0.0, ('worry', 27)),
 (0.0, ('worldwide', 21)),
 (0.0, ('world', 165)),
 (0.0, ('works', 95)),
 (0.0, ('working', 176)),
 (0.0, ('workers', 17)),
 (0.0, ('worked', 50)),
 (0.0, ('work', 337)),
 (0.0, ('words', 62)),
 (0.0, ('word', 113)),
 (0.0, ('wondering', 14)),
 (0.0, ('wonderful', 21)),
 (0.0, ('wonder', 20)),
 (0.0, ('women', 77)),
 (0.0, ('woman', 28)),
 (0.0, ('without', 243)),
 (0.0, ('within', 263)),
 (0.0, ('wish', 212)),
 (0

Hmm... Only 5 of the 3000 features actually played into the Decision Tree's classifications, and only two of them had any substantial impact: 'IMAP' and 'Jul', with importances of roughly 68% and 31% respectively. The presence of these two words seemed to give the model almost all the information it needed to make a correct prediction.

Let's see the frequency in which these words appeared in emails overall, and how they're used in them. The function below parses a directory of emails and returns a list of those which contain the keyword we're looking for. It also prints the number and percentage of the total files in which it occurs.

In [43]:
def find_keyword_files(keyword, mail_dir):
    files = [os.path.join(mail_dir, file) for file in os.listdir(mail_dir)]
    size = len(files)
    files_with_keyword = []
    for file in files:
        with open(file, 'r') as f: # the with-block closes the file automatically
            for line in f:
                if keyword in line.split(' '):
                    files_with_keyword.append(file) # record each file once, then stop scanning it
                    break
    
    # multiply by 100 so the printed figure really is a percentage
    print("'{}' found in {} of {} files ({}%)".format(keyword, len(files_with_keyword), size, 100 * len(files_with_keyword) / size))

    return files_with_keyword

In [44]:
imap_files = find_keyword_files("IMAP", train_path)

'IMAP' found in 967 of 1600 files (60.4375%)


'IMAP' is present in over half of the training set - but is it in mostly ham or spam? We can see this by just taking a look at the list itself.

In [46]:
imap_files

['C:\\Users\\Will\\Desktop\\Datasets\\Apache Spam Ham\\Spam Ham Dataset\\Train\\HAM_0001.ea7e79d3153e7469e7a9c3e0af6a357e',
 'C:\\Users\\Will\\Desktop\\Datasets\\Apache Spam Ham\\Spam Ham Dataset\\Train\\HAM_0003.acfc5ad94bbd27118a0d8685d18c89dd',
 'C:\\Users\\Will\\Desktop\\Datasets\\Apache Spam Ham\\Spam Ham Dataset\\Train\\HAM_0004.e8d5727378ddde5c3be181df593f1712',
 'C:\\Users\\Will\\Desktop\\Datasets\\Apache Spam Ham\\Spam Ham Dataset\\Train\\HAM_0005.8c3b9e9c0f3f183ddaf7592a11b99957',
 'C:\\Users\\Will\\Desktop\\Datasets\\Apache Spam Ham\\Spam Ham Dataset\\Train\\HAM_0006.ee8b0dba12856155222be180ba122058',
 'C:\\Users\\Will\\Desktop\\Datasets\\Apache Spam Ham\\Spam Ham Dataset\\Train\\HAM_0007.c75188382f64b090022fa3b095b020b0',
 'C:\\Users\\Will\\Desktop\\Datasets\\Apache Spam Ham\\Spam Ham Dataset\\Train\\HAM_0008.20bc0b4ba2d99aae1c7098069f611a9b',
 'C:\\Users\\Will\\Desktop\\Datasets\\Apache Spam Ham\\Spam Ham Dataset\\Train\\HAM_0010.4996141de3f21e858c22f88231a9f463',
 'C:\\Us
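Rather than eyeballing a long list of paths, the ham/spam split could be tallied from the file name prefixes. A small sketch of that, where `label_breakdown` is a hypothetical helper and the example paths are made up:

```python
import os
from collections import Counter


def label_breakdown(paths):
    """Count how many paths point at HAM_-prefixed files versus everything else."""
    counts = Counter()
    for path in paths:
        name = os.path.basename(path)
        counts["ham" if name.startswith("HAM") else "spam"] += 1
    return counts


# Made-up paths for illustration; in the notebook this would be imap_files.
example_paths = [
    os.path.join("Train", "HAM_0001.aaa"),
    os.path.join("Train", "HAM_0003.bbb"),
    os.path.join("Train", "HAM_0004.ccc"),
    os.path.join("Train", "SPAM_0002.ddd"),
]
breakdown = label_breakdown(example_paths)
```

Calling `label_breakdown(imap_files)` would give the exact counts behind the rough estimate.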

It can be a bit hard to see just from scrolling through it, but IMAP appears mostly (approx 3/4) in ham emails. Let's read one and see if we can find where it's used.

In [45]:
read_email(imap_files[0])

From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 2002

Return-Path: <exmh-workers-admin@example.com>

Delivered-To: zzzz@localhost.netnoteinc.com

Received: from localhost (localhost [127.0.0.1])

	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id D03E543C36

	for <zzzz@localhost>; Thu, 22 Aug 2002 07:36:16 -0400 (EDT)

Received: from phobos [127.0.0.1]

	by localhost with IMAP (fetchmail-5.9.0)

	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 12:36:16 +0100 (IST)

Received: from listman.example.com (listman.example.com [66.187.233.211]) by

    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g7MBYrZ04811 for

    <zzzz-exmh@example.com>; Thu, 22 Aug 2002 12:34:53 +0100

Received: from listman.example.com (localhost.localdomain [127.0.0.1]) by

    listman.redhat.com (Postfix) with ESMTP id 8386540858; Thu, 22 Aug 2002

    07:35:02 -0400 (EDT)

Delivered-To: exmh-workers@listman.example.com

Received: from int-mx1.corp.example.com (int-mx1.corp.example.com

    [172.1

In this email, it appears right near the top. It's in the 'Received:' field of the email: "by localhost with IMAP (fetchmail-5.9.0)". Doing a little bit of research online, I found the following info on IMAP:

"IMAP (Internet Message Access Protocol) is a standard email protocol that stores email messages on a mail server, but allows the end user to view and manipulate the messages as though they were stored locally on the end user's computing device(s). This allows users to organize messages into folders, have multiple client applications know which messages have been read, flag messages for urgency or follow-up and save draft messages on the server." 

\-taken from http://searchexchange.techtarget.com/definition/IMAP

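I suppose that emails carrying this IMAP line are more likely to be ham than spam! Strictly speaking, though, IMAP is a protocol for retrieving mail from a server rather than sending it, and here the line comes from fetchmail pulling the message into a local mailbox, so it probably says more about how the messages were collected than about their senders. I could do the same analysis for the other significant word, 'Jul', but after looking into it I realized that it only represents the date on which the email was sent. That isn't so much indicative of an email being spam or ham, just of when most of the emails in this dataset were gathered.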
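Since both top features live in the headers, one thing I could try is building the features from the message body only. Python's standard `email` package can separate the two. The sketch below is a rough illustration: `body_text` is a hypothetical helper, the sample message is made up (only the header lines mirror the ones printed above), and it doesn't attempt to decode encoded payloads or strip HTML.

```python
from email import message_from_string


def body_text(raw_message):
    """Return the text payload(s) of a raw email, leaving the headers out."""
    msg = message_from_string(raw_message)
    parts = []
    for part in msg.walk():  # yields the message itself and any sub-parts
        if part.get_content_maintype() == "text":
            payload = part.get_payload()
            if isinstance(payload, str):
                parts.append(payload)
    return "\n".join(parts)


# Made-up message: the subject and body are invented for this example.
raw = (
    "From ilug-admin@linux.ie  Tue Aug  6 11:51:02 2002\n"
    "Return-Path: <ilug-admin@linux.ie>\n"
    "Received: from phobos [127.0.0.1]\n"
    "\tby localhost with IMAP (fetchmail-5.9.0)\n"
    "Subject: hello\n"
    "\n"
    "Make money fast\n"
)
body = body_text(raw)
```

With the headers gone, 'IMAP' and the date fields never reach the word counts, so the classifiers would have to rely on what the email actually says.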

## Evaluate the models on the Test Set

Moving on to the final step: testing our near-perfect classifiers on the Test Set.

In [75]:
test_X = extract_features(test_path, word_dictionary)
test_y = extract_targets(test_path)

100 files processed
200 files processed
300 files processed
400 files processed


In [76]:
pred_y_svm = svm_clf.predict(test_X)
pred_y_tree = tree_clf.predict(test_X)

Let's see what the predictions for each classifier look like. Since all ham files come before spam alphabetically, a perfect classifier should output two discrete blocks of 1s and 0s.

In [77]:
pred_y_svm

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0,

Pretty good! Looks like the SVM incorrectly classified two spam emails as ham. Let's see its overall accuracy:

In [78]:
(test_y == pred_y_svm).mean()

0.995

Near perfect... Moving on to the Decision Tree:

In [79]:
pred_y_tree

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0,

It seems like the Decision Tree made two incorrect classifications on spam emails as well (perhaps due to them containing 'IMAP'). Its accuracy should be the same as the SVM:

In [80]:
(test_y == pred_y_tree).mean()

0.995
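For completeness, precision and recall can be worked out by hand from the numbers already shown. The sketch below assumes what the SVM output above suggests: 182 ham and 218 spam in the test set, with the only two errors being spam predicted as ham. `precision_recall` is a hypothetical helper, and spam is treated as the positive class in the first call.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Counts implied by the test set sizes and the SVM's two mistakes (assumption:
# both errors are spam emails classified as ham, and no ham was misclassified).
n_ham, n_spam = 182, 218
spam_missed = 2

tp = n_spam - spam_missed  # spam correctly flagged
fp = 0                     # ham wrongly flagged as spam
fn = spam_missed           # spam that slipped through as ham
tn = n_ham                 # ham correctly passed

accuracy = (tp + tn) / (n_ham + n_spam)
spam_precision, spam_recall = precision_recall(tp, fp, fn)
# Swapping roles gives the ham-as-positive view.
ham_precision, ham_recall = precision_recall(tn, fn, fp)
```

Under these assumptions the accuracy is 398/400 = 0.995, spam recall is 216/218 (about 0.991) with perfect spam precision, and ham precision is 182/184 (about 0.989) with perfect ham recall.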

In conclusion, it seems as though this dataset can quite easily be handled by out-of-the-box models for ham/spam classification. Since the accuracies were so high for each model, I didn't do any tweaking of hyperparameters to try and improve them. I also skipped checking precision and recall, since at ~99% accuracy on a roughly balanced test set neither can be far off. While an interesting exercise, it says little about the viability of these models in a real online environment. The algorithms I used to extract the features weren't even good enough to judge the emails solely on their contents: the two most important features in the Decision Tree were contained in the email headers. Future, improved versions of this exercise would involve a much larger dataset of more diverse emails, and an improved algorithm for extracting the word features.

Thanks for reading!