# Email Spam Filter #

This is a project to classify emails as spam or not. This is basically a Spam-filter. The dataset used for this project can be found here: http://spamassassin.apache.org/old/publiccorpus/. SVM Classifier will be used for this project. The following major packages will be used for this project:

- tarfile - For extracting tar files
- os - For accessing folders and files on local computer
- nltk - For natural language proecessing
- re - For regular expressions
- BeautifulSoup - For handling HTML tags
- Numpy - For working with arrays
- scikit-learn - For machine learning algorithms

Packages will be imported as required.

### Extracting files
Please note that all dataset files are assumed to be downloaded for this project. The dataset files are downloaded from  http://spamassassin.apache.org/old/publiccorpus/ and saved in the same folder as the folder of this Jupyter Notebook.

In [1]:
import tarfile
import os

In [2]:
import nltk
from bs4 import BeautifulSoup as bs
import re
from nltk.corpus import stopwords

If not already downloaded, punkt and stopwords need to be downloaded.

In [29]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\amish\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\amish\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

The filenames of the tar files are manually stored.

In [3]:
tar_filenames = ['20021010_easy_ham.tar.bz2', '20021010_hard_ham.tar.bz2', '20021010_spam.tar.bz2', '20030228_easy_ham.tar.bz2', '20030228_easy_ham_2.tar.bz2', '20030228_hard_ham.tar.bz2', '20030228_spam.tar.bz2','20030228_spam_2.tar.bz2','20050311_spam_2.tar.bz2']

Tar files are extracrted and stored in the same folder.

In [4]:
for filename in tar_filenames:
    tar_file = tarfile.open(filename,'r:bz2')
    tar_file.extractall(path=(".\\" + filename[0:-8]))

Getting the path for the emails. The path has a pattern based on the file name of the tar files. This can be tested by deleting the first unnencessary file in the third tar file.

In [5]:
filename = tar_filenames[2]
path = '.\\' + filename[0:-8]+'\\' + filename[9:-8]+'\\'
emails = os.listdir(path)
os.remove(path+emails[0])

Generating Stopwords.

In [6]:
stop_words=set(stopwords.words("english"))

### Opening emails and processing them.

The data is already divided into spam and not spam. So for processing the datasets, before they are combined for training, the data will be stored in different variables in every stage.
Filenames are divided bbetween spam files and not spam files.

In [7]:
spam_filenames = list(filter(lambda x: (re.search('spam', x)), tar_filenames))
valid_filenames = list(filter(lambda x: (x not in spam_filenames), tar_filenames))

In [8]:
import chardet
import operator

The function 'folder_open' accepts an email and returns tokensized words. There are some files which do not use standard encoding. This function skips those files whic cannot be read.

The function 'email_tokens' accepts the contents of the email and turns them into word tokens.

In [9]:
def folder_open(filename):
    path = '.\\' + filename[0:-8]+'\\' + filename[9:-8]+'\\'
    emails = os.listdir(path)
    tokens_emails = []
    for email in emails:
        f = open(path+email)
        try:
            emailcontents = f.read()
            etokens = email_tokens(emailcontents)
            tokens_emails.append(etokens)
        except UnicodeDecodeError:
            result = chardet.detect(open(path+email,'rb').read())
            print(path+email)
            print(result['encoding'])      
            print('\n')
        except AttributeError as a:
            print(path+email)
            print(a)
        f.close()
    return tokens_emails

In [10]:
def email_tokens(emailcontents):
    #Remove Headers
    emailcontents = emailcontents[re.search('\n\n',emailcontents).span(0)[1]:]
    
    #Stripping HTML tags
    emailcontents = bs(emailcontents).get_text()
    
    #Handling HTTP links
    emailcontents = re.sub(r'(http|https)://[^\s]*','httpaddr',emailcontents)
    emailcontents = re.sub(r'www.[^\s]*','httpaddr',emailcontents)
    
    #Handling Email addresses
    emailcontents = re.sub(r'[^\s]+@[^\s]+','emailpaddr',emailcontents)
    
    #Convert everything to lowercase
    emailcontents = emailcontents.lower()
    
    #Handling Numbers
    emailcontents = re.sub(r'[0-9]+',' number ',emailcontents)
    
    #Handling $ signs
    emailcontents = emailcontents.replace('$',' dollar ')
    
    #Create tokens
    etokens = nltk.word_tokenize(emailcontents)
    
    #Removing non-alphanumeric characters
    etokens = [re.sub('\W|_','',etoken) for etoken in etokens]
    
    #Removing empty tokens
    etokens = list(filter(None, etokens))
    
    #Removing Stop Words
    etokens = list(filter(lambda x: (x not in stop_words), etokens))
    
    #Stemming
    porter = nltk.PorterStemmer()
    etokens = [porter.stem(t) for t in etokens]
    
    #Removing any words less than 3 letters
    etokens = list(filter(lambda x: (len(x)>=3), etokens))
    
    return etokens

Word tokens are created separately for spam files and not spam files.

In [11]:
spamtokens = []
validtokens = []

In [12]:
for spam_filename in spam_filenames:
    spamtokens.append(folder_open(spam_filename))

.\20021010_spam\spam\0123.68e87f8b736959b1ab5c4b5f2ce7484a
Windows-1254


.\20021010_spam\spam\0273.51c482172b47ce926021aa7cc2552549
SHIFT_JIS


.\20021010_spam\spam\0330.a4df526233e524104c3b3554dd8ab5a8
SHIFT_JIS


.\20021010_spam\spam\0334.3e4946e69031f3860ac6de3d3f27aadd
SHIFT_JIS


.\20021010_spam\spam\0335.9822e1787fca0741a8501bdef7e8bc79
SHIFT_JIS


.\20030228_spam\spam\00116.29e39a0064e2714681726ac28ff3fdef
Windows-1254


.\20030228_spam\spam\00263.13fc73e09ae15e0023bdb13d0a010f2d
SHIFT_JIS


.\20030228_spam\spam\00320.20dcbb5b047b8e2f212ee78267ee27ad
SHIFT_JIS


.\20030228_spam\spam\00323.9e36bf05304c99f2133a4c03c49533a9
SHIFT_JIS


.\20030228_spam\spam\00324.6f320a8c6b5f8e4bc47d475b3d4e86ef
SHIFT_JIS


.\20030228_spam\spam\00500.85b72f09f6778a085dc8b6821965a76f
GB2312


.\20030228_spam\spam\cmds
'NoneType' object has no attribute 'span'
.\20030228_spam_2\spam_2\01065.9ecef01b01ca912fa35453196b4dae4c
Windows-1254


.\20030228_spam_2\spam_2\01227.04a4f94c7a73b29cb56bf38c7d526116

In [13]:
for valid_filename in valid_filenames:
    validtokens.append(folder_open(valid_filename))




" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
  ' that document to Beautiful Soup.' % decoded_markup


.\20030228_easy_ham\easy_ham\cmds
'NoneType' object has no attribute 'span'
.\20030228_easy_ham_2\easy_ham_2\cmds
'NoneType' object has no attribute 'span'
.\20030228_hard_ham\hard_ham\cmds
'NoneType' object has no attribute 'span'


In [14]:
len(validtokens)

5

### Working on Tokens

Currently the emails are arranged by the tar files they are extracted from. In the next few steps the emails (which are only token words now) will be arranged in a single file.

A copy of all tokens will also be gathered together to create a vocabulary list. 

In [15]:
allspamtokens = [spamtokens2 for spamtokens1 in spamtokens for spamtokens2 in spamtokens1]
len(allspamtokens)

3776

In [16]:
onlyspamtokens = [allspamtokens2 for allspamtokens1 in allspamtokens for allspamtokens2 in allspamtokens1]
len(onlyspamtokens)

1448875

In [17]:
allvalidtokens = [validtokens2 for validtokens1 in validtokens for validtokens2 in validtokens1]
len(allvalidtokens)

6951

In [18]:
onlyvalidtokens =  [validtokens2 for validtokens1 in allvalidtokens for validtokens2 in validtokens1]
len(onlyvalidtokens)

1230575

In [19]:
onlytokens = onlyspamtokens + onlyvalidtokens

In [20]:
vocabWords = nltk.Text(onlytokens).vocab()

In [23]:
sorted_onlytokens = sorted(vocabWords, key = vocabWords.get, reverse=True)

In [24]:
vocabWords[sorted_onlytokens[12959]]

10

Only those words are considered which have at least 10 occurrences in the emails. Based on that criteria we are considering the 12960 most used words.

In [25]:
vocab_list = sorted_onlytokens[0:12960]

In [26]:
vocab_list[12959]

'epicentr'

Word Indices are vectors of the emails based on the position of tokens in the vocabulary list. So each email will have the same length of word indices as they all represent whether a word is in the vocabulary list. For words not in vocabulary list, -1 value is assigned.  So an extra column is provided in word indices to accomodate them. So words not in the list will assigned to -1 location in the array which is the last column.

In [27]:
wordindicesspam = allspamtokens
wordindicesvalid = allvalidtokens

In [28]:
wordindicesspam = [[vocab_list.index(x) if x in vocab_list else -1 for x in indspamtokens] for indspamtokens in allspamtokens]

In [29]:
len(wordindicesspam)

3776

In [30]:
wordindicesvalid = [[vocab_list.index(x) if x in vocab_list else -1 for x in indvalidtokens] for indvalidtokens in allvalidtokens]

In [31]:
len(wordindicesvalid)

6951

Word indices will be one-hot-encoded and stored in numpy arrays. The last column is eliminated.

In [32]:
import numpy as np

In [33]:
Xspam = np.zeros((len(wordindicesspam), len(vocab_list)+1))

In [34]:
Xvalid = np.zeros((len(wordindicesvalid), len(vocab_list)+1))

In [35]:
for i in range(len(wordindicesspam)):
    Xspam[i][wordindicesspam[i][:]]=1

In [36]:
for i in range(len(wordindicesvalid)):
    Xvalid[i][wordindicesvalid[i][:]]=1

In [37]:
Xspam = Xspam[:,0:len(vocab_list)]
Xvalid = Xvalid[:,0:len(vocab_list)]

In [38]:
Xspam.shape

(3776, 12960)

### Performing Machine Learning

Combining the spam and valid emails together into the X variable. A column is provided at the end to store values of whether if the emails are spam or not. If the email is spam then the value of 1 is stored and if its not 0 is stored.

In [39]:
X = np.zeros((len(wordindicesspam) + len(wordindicesvalid), len(vocab_list)+1))

In [40]:
X[0:len(wordindicesspam),0:len(vocab_list)] = Xspam
X[len(wordindicesspam):(len(wordindicesspam) + len(wordindicesvalid)),0:len(vocab_list)]
X[0:len(wordindicesspam), len(vocab_list)] = 1

In [41]:
X.shape

(10727, 12961)

In [42]:
np.random.shuffle(X)

In [43]:
Xdata = X[:,0: len(vocab_list)]
ydata = X[:, len(vocab_list)]

In [58]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

In [45]:
from sklearn.model_selection import cross_val_score
from sklearn import svm

Before doing the actual analysis, a test analysis will help to let us know what to expect.

In [55]:
model = svm.SVC(gamma=0.001, C=1 )

In [57]:
cross_val_score(model, Xdata, ydata, cv=5)

array([0.98556125, 0.98927739, 0.98787879, 0.98741259, 0.98787879])

More than 98% accuracy is good, let us see if we can improve the accuracy.

Splitting the datasets into test and train sets. Also using kfolds cross validation technique to compare results. Also comparing against different values of hyper-paramenter C. The different values of C used is 0.0001, 0.001, 0.1, 1, 10 and 100. We will be using Support Vector Machine Classifierm for analysis.

In [60]:
X_train, X_test, y_train, y_test = train_test_split( Xdata, ydata, test_size=0.20)

In [63]:
kf = KFold(n_splits=4)
kf.get_n_splits(X_train)

4

In [68]:
for c in [0.0001, 0.001, 0.1, 1, 10, 100]:
    print('C is '+str(c))
    for train_index, cv_index in kf.split(X_train,y_train):        
        cvmodel = svm.SVC(gamma=0.001, C=c)
        cvmodel.fit(X_train[train_index], y_train[train_index])
        cv_score = cvmodel.score(X_train[cv_index],y_train[cv_index])
        test_score = cvmodel.score(X_test, y_test)
        print(cv_score)
        print(test_score)
        print('\n')
    print('\n')

C is 0.0001
0.6500465983224604
0.6477166821994408


0.6298368298368299
0.6477166821994408


0.6564102564102564
0.6477166821994408


0.6559440559440559
0.6477166821994408




C is 0.001
0.6500465983224604
0.6477166821994408


0.6298368298368299
0.6477166821994408


0.6564102564102564
0.6477166821994408


0.6559440559440559
0.6477166821994408




C is 0.1
0.9482758620689655
0.9538676607642125


0.9496503496503497
0.9538676607642125


0.9543123543123543
0.9534016775396086


0.9501165501165502
0.9538676607642125




C is 1
0.9874184529356943
0.9878844361602982


0.985081585081585
0.9874184529356943


0.9864801864801864
0.9883504193849021


0.986013986013986
0.9878844361602982




C is 10
0.9958061509785647
0.9962721342031687


0.9958041958041958
0.9953401677539608


0.9958041958041958
0.9953401677539608


0.9944055944055944
0.9958061509785647




C is 100
0.9976700838769804
0.9981360671015843


0.9972027972027973
0.9986020503261882


0.9981351981351981
0.9990680335507922


0.99720279720279

Looking at the results, we can deduce that higher C gives better score. So let us increase C also check against sigmoid kernel.

In [71]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [73]:
model = svm.SVC()
pipe = Pipeline(steps=[('svc',model)])
grid = dict(svc__C=[100, 500], svc__kernel=['rbf','sigmoid'])
estimator = GridSearchCV(pipe, grid, n_jobs=-1)
estimator.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'svc__C': [100, 500], 'svc__kernel': ['rbf', 'sigmoid']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [74]:
estimator.best_score_

0.9962708309054888

In [75]:
estimator.best_params_

{'svc__C': 500, 'svc__kernel': 'rbf'}

In [76]:
estimator.score(X_train,y_train)

0.9976692693159306

So RBF kernel gives the best results and also higher C is still giving better results but the score improvement is marginal. So let us one last time for higher values of C.

In [78]:
model = svm.SVC()
pipe = Pipeline(steps=[('svc',model)])
grid = dict(svc__C=[500, 1000, 2000])
estimator = GridSearchCV(pipe, grid, n_jobs=-1)
estimator.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'svc__C': [500, 1000, 2000]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring=None, verbose=0)

In [79]:
estimator.best_score_

0.9975527327817271

In [80]:
estimator.best_params_

{'svc__C': 2000}

In [81]:
estimator.score(X_train,y_train)

0.9983684885211513

So finally we get pretty good score for both cross validated train set and the test set. So we can use this as an email spam classifier.