Many email services today provide spam filters that are able to classify emails
into spam and non-spam email with high accuracy. In this part of the exercise,
we will use SVMs to build our own spam filter.<br>
We will train a classifier to decide whether a given email, x, is
spam (y = 1) or non-spam (y = 0). The dataset included for this exercise is based on a subset of
the SpamAssassin Public Corpus.

In [16]:
import pandas as pd
import numpy as np
import scipy.io
from pprint import pformat
import re
import string
from nltk.stem.porter import PorterStemmer
from sklearn import svm

Before starting on a machine learning task, it is usually insightful to take a look at examples from the dataset. The following is a sample email:

In [6]:
with open('ex6/emailSample1.txt', 'r') as text:
    email = text.read()

In [8]:
email

"> Anyone knows how much it costs to host a web portal ?\n>\nWell, it depends on how many visitors you're expecting.\nThis can be anywhere from less than 10 bucks a month to a couple of $100. \nYou should checkout http://www.rackspace.com/ or perhaps Amazon EC2 \nif youre running something big..\n\nTo unsubscribe yourself from this mailing list, send an email to:\ngroupname-unsubscribe@egroups.com\n\n"

Our vocabulary list was selected by choosing all words that occur at least 100 times in the spam corpus, resulting in a list of 1899 words. In practice, a vocabulary list with about 10,000 to 50,000 words is often used. Only the most frequently occurring words are kept, and these form the set of words considered (the vocabulary list).
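The frequency-based selection described above can be sketched as follows. This is a minimal illustration, not the procedure actually used to produce `vocab.txt`: the helper name `build_vocabulary`, the `min_count` parameter, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter


def build_vocabulary(tokenized_emails, min_count):
    """Return the sorted words that occur at least min_count times in total."""
    counts = Counter(word for email in tokenized_emails for word in email)
    return sorted(word for word, n in counts.items() if n >= min_count)


# Toy corpus of already-preprocessed (stemmed) emails, invented for illustration
toy_corpus = [
    ['click', 'our', 'price'],
    ['click', 'visit', 'our'],
    ['meet', 'click', 'tomorrow'],
]

# Keep only words seen at least twice across the whole corpus
toy_vocab = build_vocabulary(toy_corpus, min_count=2)
print(toy_vocab)
```

With a threshold of 100 on the real corpus, the same idea would leave only the most frequent words, which keeps the feature vector short.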

In [77]:
voc_list = []
voc_file = 'ex6/vocab.txt'
# Each line holds tab-separated fields; the word is the second one
with open(voc_file) as f:
    for line in f:
        word = line.strip().split('\t')[1]
        voc_list.append(word)

<h3>Email Preprocessing and Feature Extraction</h3>

In [12]:
def preprocess(semail):
    # Lower-case the whole email
    semail = semail.lower()
    # Strip HTML tags: any expression that starts with < and ends with >
    # is replaced with a space
    semail = re.sub(r'<[^<>]+>', ' ', semail)

    # Handle numbers: replace each run of digits 0-9 with 'number'
    semail = re.sub(r'[0-9]+', 'number', semail)
    # Handle URLs: strings starting with http:// or https://
    semail = re.sub(r'(http|https)://[^\s]+', 'httpaddr', semail)
    # Handle $ signs
    semail = re.sub(r'\$+', 'dollar', semail)
    # Handle email addresses
    semail = re.sub(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z0-9]', 'emailaddr', semail)
    # Remove punctuation (a regex is used so this runs on Python 2 and 3)
    semail = re.sub('[%s]' % re.escape(string.punctuation), '', semail)
    # Replace every remaining non-alphanumeric character with a space
    semail = re.sub(r'[^a-zA-Z0-9]', ' ', semail)
    # Stem each word; split() also drops empty strings
    stemmer = PorterStemmer()
    pemail = [str(stemmer.stem(word)) for word in semail.split()]
    # Map the email onto the bag of words in voc_list: features is a vector
    # with one element per vocabulary word (1899 here), where features[i] = 1
    # if the i-th word of voc_list occurs in the email, else features[i] = 0
    vocab = list(voc_list)  # also works if voc_list has become a NumPy array
    features = np.zeros(len(vocab), dtype=int)
    for word in pemail:
        if word in vocab:
            features[vocab.index(word)] = 1
    return features


In [14]:
email_features=preprocess(email)
email_features[:100]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0])
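To make the mapping from words to this 0/1 vector concrete, here is a small self-contained sketch on a toy vocabulary. The names `to_feature_vector`, `toy_vocab`, and `toy_words` are invented for illustration, and the dictionary lookup is just one reasonable way to do the mapping, not necessarily how the function above does it.

```python
import numpy as np


def to_feature_vector(words, vocab):
    """Binary bag of words: entry i is 1 if vocab[i] occurs in words."""
    index_of = {word: i for i, word in enumerate(vocab)}
    features = np.zeros(len(vocab), dtype=int)
    for word in words:
        i = index_of.get(word)
        if i is not None:  # words outside the vocabulary are ignored
            features[i] = 1
    return features


# Toy vocabulary and stemmed email, invented for illustration
toy_vocab = ['anyon', 'click', 'know', 'price']
toy_words = ['anyon', 'know', 'how', 'much', 'know']

toy_features = to_feature_vector(toy_words, toy_vocab)
print(toy_features)                  # one entry per vocabulary word
print(np.flatnonzero(toy_features))  # positions of the words that were found
```

Repeated words such as `'know'` still produce a single 1, since the features record presence rather than counts.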

<h3>Training SVM for Spam Classification</h3>

spamTrain.mat contains 4000 training examples of spam
and non-spam email, while spamTest.mat contains 1000 test examples.

In [24]:
data = scipy.io.loadmat('ex6/spamTrain.mat')
x = np.array(data['X'])
print(x.shape)
y = np.array(data['y'])
print(y.shape)
data_test = scipy.io.loadmat('ex6/spamTest.mat')
# print(pformat(data_test))
x_test = np.array(data_test['Xtest'])
print(x_test.shape)
y_test = np.array(data_test['ytest'])
print(y_test.shape)

(4000L, 1899L)
(4000L, 1L)
(1000L, 1899L)
(1000L, 1L)


After training, the classifier reaches a training accuracy
of about 99.8% and a test accuracy of about 98.9%.

In [25]:
clf = svm.SVC(C=0.1, kernel='linear')
# The labels are stored as a column vector; flatten them to 1-D
clf.fit(x, y.ravel())
print(clf.score(x, y.ravel()))
print(clf.score(x_test, y_test.ravel()))

0.99825
0.989
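The scores above are plain accuracies: the fraction of emails whose predicted label matches the true label. A minimal sketch of that computation on invented toy data follows (`toy_y` and `toy_pred` are assumptions for illustration). It also shows why a column-vector label array, like the `(4000, 1)` one loaded here, is best flattened before comparing.

```python
import numpy as np

toy_y = np.array([[1], [0], [1], [0]])  # column vector, shaped like data['y']
toy_pred = np.array([1, 0, 0, 0])       # 1-D, shaped like the output of predict

# Flatten the labels first: comparing shape (4, 1) with shape (4,)
# would broadcast to a (4, 4) matrix instead of an element-wise match
accuracy = np.mean(toy_pred == toy_y.ravel())
print(accuracy)
```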


<h3>Top Predictors for Spam</h3>

In [75]:
coef = clf.coef_.ravel()

# Sort numerically by weight (ascending); the last entries carry the largest
# positive weights. voc_list is left as a list so preprocess() keeps working.
top = np.argsort(coef)[-15:]
print([voc_list[i] for i in top])

['ga', 'lo', 'nbsp', 'most', 'pleas', 'price', 'will', 'dollar', 'basenumb', 'visit', 'guarante', 'remov', 'click', 'our', 'snumber']
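The reason the largest weights identify spam predictors is that a linear SVM scores an email as w·x + b and predicts the positive class when the score is above zero. With 0/1 features, each word present simply adds its weight to the score. The sketch below illustrates this with invented numbers: `toy_vocab`, `toy_weights`, and `toy_bias` are assumptions, not values from the trained model.

```python
import numpy as np

# Toy vocabulary with invented weights and intercept, for illustration only
toy_vocab = ['meet', 'click', 'our', 'tomorrow', 'price']
toy_weights = np.array([-0.30, 0.45, 0.50, -0.10, 0.20])
toy_bias = -0.4

# Words ordered from the largest weight to the smallest
order = np.argsort(toy_weights)[::-1]
top_words = [toy_vocab[i] for i in order[:3]]
print(top_words)

# An email containing only 'click' and 'price'
toy_x = np.array([0, 1, 0, 0, 1])
score = toy_weights.dot(toy_x) + toy_bias  # 0.45 + 0.20 - 0.4
print(score > 0)  # a positive score means the email is classified as spam
```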


<h2>Try new emails</h2>

In [91]:
## Good email, predicted correctly
with open('ex6/emailSample2.txt', 'r') as text:
    email = text.read()
email_features = preprocess(email)
email_features = email_features.reshape(1, -1)
print(clf.predict(email_features).tolist())

[0]


In [94]:
## Spam, predicted correctly
with open('ex6/spamSample2.txt', 'r') as text:
    email = text.read()
email_features = preprocess(email)
email_features = email_features.reshape(1, -1)
print(clf.predict(email_features).tolist())

[1]


In [95]:
## Spam, predicted correctly
with open('ex6/spamSample1.txt', 'r') as text:
    email = text.read()
email_features = preprocess(email)
email_features = email_features.reshape(1, -1)
print(clf.predict(email_features).tolist())

[1]


In [97]:
## Good email, predicted correctly
with open('ex6/emailSample1.txt', 'r') as text:
    email = text.read()
email_features = preprocess(email)
email_features = email_features.reshape(1, -1)
print(clf.predict(email_features).tolist())

[0]
