# Spam Classification using SVMs
<br>
In this tutorial, we will finish the SVM assignment by making our own spam email classifier. The concepts explored in this assignment also serve as a very simple introduction to some **natural language processing** techniques. Okay, let's get started by reading a provided sample email!

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.io
from sklearn import svm
import re
from nltk.stem.porter import PorterStemmer

with open("emailSample1.txt", "r") as readfile:
    email_contents_s1 = readfile.read()

print(email_contents_s1)

> Anyone knows how much it costs to host a web portal ?
>
Well, it depends on how many visitors you're expecting.
This can be anywhere from less than 10 bucks a month to a couple of $100. 
You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 
if youre running something big..

To unsubscribe yourself from this mailing list, send an email to:
groupname-unsubscribe@egroups.com




To help build our classifier, we need a vocabulary list containing the most frequent words across a whole corpus (collection) of emails. Luckily, the assignment provides that list along with a numerical index, which we can convert into a dictionary.

In [3]:
with open("vocab.txt", "r") as readfile:
    vocab_dict = {}
    for line in readfile:
        line = line.split()
        vocab_dict[line[1]] = line[0]

def printDict(dictionary, entries):
    keys = list(dictionary.keys())
    print("First ", entries, "key-values in dictionary:\n")
    for i in range(entries):
        print(keys[i], ": ", dictionary[keys[i]])
        
printDict(vocab_dict,30)

First  30 key-values in dictionary:

aa :  1
ab :  2
abil :  3
abl :  4
about :  5
abov :  6
absolut :  7
abus :  8
ac :  9
accept :  10
access :  11
accord :  12
account :  13
achiev :  14
acquir :  15
across :  16
act :  17
action :  18
activ :  19
actual :  20
ad :  21
adam :  22
add :  23
addit :  24
address :  25
administr :  26
adult :  27
advanc :  28
advantag :  29
advertis :  30


You may notice that some of the words in `vocab_dict` are not even words! This is because some of the "words" have been reduced to their *stems*. For example, "include", "includes", "included" and "including" all stem to "includ", which makes it easy to treat those four "include"-like words as a single feature.
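To make the idea of stemming concrete, here is a deliberately crude sketch. It is not the Porter algorithm that `PorterStemmer` implements; `toy_stem` and its suffix list are made up for illustration and only strip a handful of endings.

```python
# Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "e", "s")  # hypothetical list, tried in this order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["include", "includes", "included", "including"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['includ', 'includ', 'includ', 'includ']
```

A real stemmer applies many more rules, which is why the stems in `vocab_dict` (for example "abil" or "advertis") can look odd.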

Next we are going to preprocess the email to normalize its contents a little bit. Then we iterate through all the words in the email, stem each word, check whether the stem is in `vocab_dict` and, if so, append its index to a list. Essentially, we are converting the text into a vector of numbers.

In [4]:
def preprocessEmail(email_contents):
    # make lowercase
    email_contents = email_contents.lower()
    
    # strip HTML
    email_contents = re.sub('<[^<>]+>', ' ', email_contents)
    
    # handle numbers
    email_contents = re.sub("[0-9]+", 'number', email_contents)
    
    # handle URLs
    email_contents = re.sub(r"(http|https)://[^\s]*", "httpaddr", email_contents)
    
    # handle email addresses
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)
    
    # handle $ sign
    email_contents = re.sub("[$]+", "dollar", email_contents)
    
    return email_contents

def makeWordIndicies(email_contents):
    '''
    Tokenizes by whitespace and punctuation.
    Stems word and checks if stem is in vocab_dict.
    If so, add index to word_indices vector
    '''
    stemmer = PorterStemmer()
    global vocab_dict
    word_indices = []
    
    # tokenize and get rid of punctuation
    tokens = re.split(r"""[\s(\\)@$/#.\-:&*+\=[\]?!(){},'">_<;%]""", email_contents)
    for token in tokens:
        # remove non-alphanumeric characters
        token = re.sub("[^a-zA-Z0-9]", '', token)
        stem = stemmer.stem(token)
        if len(stem) < 1:
            continue
        if stem in vocab_dict:
            word_indices.append(vocab_dict[stem])
            
    return word_indices

def processEmail(file_contents):
    email_contents = preprocessEmail(file_contents)
    return makeWordIndicies(email_contents)

word_indices = processEmail(email_contents_s1)
print(word_indices)

['86', '916', '794', '1077', '883', '370', '1699', '790', '1822', '1831', '883', '431', '1171', '794', '1002', '1893', '1364', '592', '1676', '238', '162', '89', '688', '945', '1663', '1120', '1062', '1699', '375', '1162', '479', '1893', '1510', '799', '1182', '1237', '810', '1895', '1440', '1547', '181', '1699', '1758', '1896', '688', '1676', '992', '961', '1477', '71', '530', '1699', '531']
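To see what the normalization step does to a single line of text, here is a small self-contained sketch. It repeats the same substitutions as `preprocessEmail` under a hypothetical name, `normalize`, and the sample sentence is made up.

```python
import re


def normalize(text):
    """Apply the same substitutions as preprocessEmail, in the same order."""
    text = text.lower()
    text = re.sub(r"<[^<>]+>", " ", text)                      # strip HTML tags
    text = re.sub(r"[0-9]+", "number", text)                   # digits -> "number"
    text = re.sub(r"(http|https)://[^\s]*", "httpaddr", text)  # URLs
    text = re.sub(r"[^\s]+@[^\s]+", "emailaddr", text)         # email addresses
    text = re.sub(r"[$]+", "dollar", text)                     # dollar signs
    return text


sample = "Visit http://example.com or mail bob@example.com for $100 off"
print(normalize(sample))
# visit httpaddr or mail emailaddr for dollarnumber off
```

Because digits are replaced before the dollar sign, "$100" ends up as the single token "dollarnumber" once both substitutions have run.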


Next we're going to encode `word_indices` as a binary feature vector using `vocab_dict` (loosely, a one-hot-style encoding in which several positions can be 1 at once). There are 1899 keys in `vocab_dict`, so we create a vector of length 1899 with 1s at the indices specified by `word_indices` and 0s everywhere else.

In [5]:
# converting to feature vector

dict_n = len(vocab_dict.keys())
print("number of keys in vocab_dict: ", dict_n)

def emailFeatures(word_indices):
    global dict_n
    global vocab_dict
    
    featureVec = np.zeros(dict_n)
    for i in word_indices:
        i = int(i)
        featureVec[i] = 1
    return featureVec

featureVec = emailFeatures(word_indices)
print("feature vector: ", featureVec)
print("feature vector length: ", len(featureVec))
print("number of non-zero entries: ", sum(featureVec))

number of keys in vocab_dict:  1899
feature vector:  [ 0.  0.  0. ...,  1.  0.  0.]
feature vector length:  1899
number of non-zero entries:  45.0
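One detail worth double-checking here is indexing: `vocab.txt` numbers its words from 1, while Python positions start at 0. The sketch below is a plain-Python variant under the assumption that word k belongs at position k - 1, which is what a column layout inherited from Octave would imply; this post does not verify that assumption. `multi_hot` is a hypothetical helper, not part of the assignment.

```python
def multi_hot(word_indices, vocab_size):
    """Binary vector with a 1 for every vocabulary word that occurs.

    Assumes 1-based vocabulary indices (possibly given as strings),
    so word k is stored at position k - 1.
    """
    vec = [0] * vocab_size
    for idx in word_indices:
        k = int(idx)
        vec[k - 1] = 1  # shift from 1-based word index to 0-based position
    return vec


vec = multi_hot(["1", "3", "3", "5"], 5)
print(vec)       # [1, 0, 1, 0, 1]
print(sum(vec))  # 3
```

Repeated words set the same position again, which is why the number of non-zero entries can be smaller than the number of indices in `word_indices`.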


To train our classifier we are going to use the provided .mat data. All the emails in the dataset have already been converted from text to the feature vector defined above, so all we have to do is use Scikit-learn to build our linear SVM classifier!


In [7]:
spamTrain = scipy.io.loadmat("spamTrain.mat")
spamTest = scipy.io.loadmat("spamTest.mat")

X_train = spamTrain["X"]
y_train = spamTrain["y"].ravel()

X_test = spamTest["Xtest"]
y_test = spamTest["ytest"].ravel()

# train linear SVM classifier

C = 0.1
clf = svm.SVC(C=C, kernel="linear")
clf.fit(X_train, y_train)

def svmPredict(model, X):
    # returns predictions from the model passed in (not the global clf)
    predictions = model.predict(X)
    return predictions.ravel()

def getAccuracy(predictions, y):
    logicalVec = predictions == y
    logicalVec = logicalVec.astype(int)
    return sum(logicalVec) / len(logicalVec)

predictions_train = svmPredict(clf, X_train)
print(getAccuracy(predictions_train, y_train))

predictions_test = svmPredict(clf, X_test)
print(getAccuracy(predictions_test, y_test))

0.99825
0.989


Eh, the accuracy is close enough to (in fact higher than) the figures indicated in the assignment. This may be because Scikit-learn's implementation of building and training an SVM classifier differs from Octave's. Next we will look at the top 15 predictor words for spam in an email.
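Why do the largest coefficients point at spam words? With a linear kernel, the model scores an email as a weighted sum of its features plus a bias and predicts spam when the score is positive, so a large positive weight pushes any email containing that word towards the spam side. Here is a minimal sketch of that rule; the weights, bias and feature vectors are invented for illustration and are not taken from the trained `clf`.

```python
def linear_score(weights, bias, x):
    """Score of a linear classifier: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(weights, bias, x):
    """Label 1 (spam) when the score is positive, else 0."""
    return 1 if linear_score(weights, bias, x) > 0 else 0


# Hypothetical weights for a four-word vocabulary.
weights = [1.5, 0.75, -1.0, -0.5]
bias = -0.25

spammy = [1, 1, 0, 0]  # contains the two positively weighted words
hammy = [0, 0, 1, 1]   # contains the two negatively weighted words

print(predict(weights, bias, spammy))  # 1
print(predict(weights, bias, hammy))   # 0
```

In Scikit-learn, the weights of a linear-kernel `SVC` are exposed as `clf.coef_`, which is what the next cell sorts.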

In [8]:
highestInd = np.argsort(-clf.coef_)[0]
# argsort only sorts in ascending order, so negate clf.coef_ to get descending order

# switch key, values for vocab_dict (key-values will be int: str)
vocab_dict_switch = {y:x for x,y in vocab_dict.items()}

print('Top 15 predictor words for spam:')
for i in range(15):
    index = highestInd[i]
    word = vocab_dict_switch[str(index)]
    print("- " + word)

Top 15 predictor words for spam:
- otherwis
- clearli
- remot
- gt
- visa
- base
- doesn
- wife
- previous
- player
- mortgag
- natur
- ll
- futur
- hot


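Again, the words differ from the ones indicated in the assignment. Differences between Scikit-learn's and Octave's SVM implementations may play a part, but an off-by-one in the lookup is a more likely cause: `vocab.txt` numbers its words from 1, while the column indices returned by `np.argsort` start at 0. Column `index` therefore corresponds to word `index + 1`, and `vocab_dict_switch[str(index)]` returns the word just before the intended one. Looking up `str(index + 1)` instead should bring the list closer to the assignment's. If anyone did get the same words, I'd be happy to know how you got them!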
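Here is a small sketch of mapping 0-based weight columns back to a 1-based vocabulary. The toy vocabulary and weights are made up, `top_words` is a hypothetical helper, and the sketch assumes that column j holds the weight of word j + 1.

```python
# Toy vocabulary in the same shape as vocab_dict: stem -> 1-based index (as a string).
toy_vocab = {"click": "1", "dollar": "2", "hello": "3", "meet": "4"}

# Reverse map with integer keys: 1-based index -> stem.
index_to_word = {int(i): w for w, i in toy_vocab.items()}

# Hypothetical weights; position j belongs to the word with index j + 1.
toy_weights = [0.9, 0.7, -0.2, -0.8]


def top_words(weights, index_to_word, n):
    """Return the n words with the largest weights, largest first."""
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return [index_to_word[j + 1] for j in order[:n]]  # +1: column -> word index


print(top_words(toy_weights, index_to_word, 2))  # ['click', 'dollar']
```

Dropping the `+ 1` would return the word just before each intended one, and would fail outright for column 0, which has no word numbered 0.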

Okay, let's test our spam detector out with the sample emails provided in the assignment.

In [27]:
def detectIfSpam(email_content, clf):
    wordIndices = processEmail(email_content)
    featureVec = emailFeatures(wordIndices).reshape(1,-1) # single email, reshape to 2D array
    prediction = svmPredict(clf, featureVec)    
    return prediction

def exampleEmailTest(clf):
    emails = ["emailSample1.txt", "spamSample1.txt", "emailSample2.txt", "spamSample2.txt"]

    for email in emails:
        with open(email, "r") as readfile:
            content = readfile.read()
        result = detectIfSpam(content, clf)
        if result == 1:
            print(email, " is predicted to be spam.")
        else:
            print(email, " is predicted to not be spam.")
    return

exampleEmailTest(clf)

emailSample1.txt  is predicted to not be spam.
spamSample1.txt  is predicted to not be spam.
emailSample2.txt  is predicted to not be spam.
spamSample2.txt  is predicted to be spam.


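All the classifications look correct except for spamSample1. One possible cause is the indexing in `emailFeatures`: the vocabulary is numbered from 1, so if the provided training matrices store word k in column k - 1 (as their Octave origin suggests), `featureVec[i] = 1` marks the neighbouring word instead, and `featureVec[i - 1] = 1` would be the matching assignment. Below is an example of a spam email from my own inbox. Let's put it to the test.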

In [34]:
txt = """
Our VIP department is trying to contact eddiewang 
You could win millions

7 dollars BONUS TO TRY

, Claim your BONUS now!


I want to claim my 70 FREE SPINS !
"""

print(detectIfSpam(txt, clf))

[1]


Nice, it predicts correctly.

That is it for the SVM assignment. Congrats on making it through - just two more assignments left! In the next post we will move away from supervised learning techniques and begin to look at unsupervised learning using k-means clustering. See you there! :)