<b>
PDS Project 2: Spam Detection <br>
Connor Moore | Gaury Nagaraju <br>
Date: 09 May, 2016
</b>

<ol start=1> <li><b> Problem & Domain: </b> </li> </ol>
- Phishing is one of the leading sources of viruses.
- Spam filtering is one of the primary ways to fight spam and viruses sent via email.

<ol start=2> <li><b> Dataset: </b> </li> </ol>
- Sources Explored for Datasets: South African Hindawai Journal Research, Phishtank, Enron datasets
- Source Selected: CSMining Group http://csmining.org/index.php/spam-email-datasets-.html
- Dataset:
Training Emails: Labeled emails with 4327 messages in .eml format (2949 non-spam, 1378 spam)
Testing Email: Unlabeled Emails - 4292 messages.

<ol start=3>
    <li> <b> Data Sanitization </b> </li> <br>
    <ul>
        <li> Detect Character Set of Email and extract payload. </li>
        <li> Loop through all emails: get isSpam Label, payload and store in a Dictionary. Ignore email if character set is not recognizable. </li>
        <li> Pickle Dictionary </li>
    </ul>
</ol>
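
The sanitization steps above can be sketched with the standard library alone. This is a minimal illustration written for Python 3, not the routine used in this notebook: the helper name `first_text_payload`, the sample message and the label value are hypothetical, and falling back to ASCII when no charset is declared is an assumption.

```python
import email
import pickle

def first_text_payload(raw_message):
    """Return (charset, text) for the first text part of an email.

    Hypothetical helper: walks the MIME tree, takes the first text/* part
    and decodes it with the declared charset (ASCII if none is declared).
    """
    msg = email.message_from_string(raw_message)
    for part in msg.walk():
        if part.get_content_maintype() == 'text':
            charset = part.get_content_charset() or 'ascii'
            raw_bytes = part.get_payload(decode=True)
            return charset, raw_bytes.decode(charset, errors='replace')
    return None, ''

# Made-up message used only for illustration
sample = (
    "From: a@example.com\n"
    "Subject: hello\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Win a prize now\n"
)

charset, text = first_text_payload(sample)

# Store label, charset and payload in a dictionary keyed by file name, then pickle it
emails = {'sample.eml': {'charset': charset, 'payload': text, 'isSpam': 1}}
blob = pickle.dumps(emails)
```

An unknown charset name would raise `LookupError` in `decode`, which is one way an email could be detected as unusable and skipped.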

In [None]:
# Import statements
import email.parser 
import os, sys, stat
import chardet
import pickle
from bs4 import BeautifulSoup
import numpy

In [None]:
def ExtractSubPayload (filename):
    ''' Extract the payload from the .eml file and detect its character set.

    Returns a dictionary with the keys "charset" and "payload".
    '''
    if not os.path.exists(filename): # input path does not exist
        print("ERROR: input file does not exist: " + filename)
        sys.exit(1)

    # the with-block closes the file once the message is parsed
    with open(filename) as f:
        msg = email.message_from_file(f)

    # Subject, to and from fields (read here but not used by the models below)
    sub = str(msg.get('subject'))
    to = str(msg.get('to'))
    fr = str(msg.get('from'))

    # get body of message
    payload = msg.get_payload()

    # multipart messages return a list of parts: only use the first part
    if isinstance(payload, list):
        payload = payload[0].get_payload()

    # Beautiful Soup: strip html tags and keep the text
    soup = BeautifulSoup(payload, 'html.parser')
    payload = soup.get_text()

    # chardet expects a byte string
    if not isinstance(payload, str):
        payload = payload.encode('ascii', 'replace')

    # Charset of payload
    charset = chardet.detect(payload)['encoding']

    return {"charset": charset, "payload": payload}

In [None]:
trainDict = {}

with open("SPAMTrain.label") as f:
    for line in f:
#         print(line[2:])
        isSpam = int(line[0])
        filename = line[2:]
        trainDict[filename.strip()] = isSpam
        
# print trainDict
x = 0
deleteKeys = []
for mail in trainDict:
    try:
        res = ExtractSubPayload("TRAINING/"+mail)
    except:
        deleteKeys += [mail]
        continue
#     print res['filename']
#     print mail
    res["isSpam"] = trainDict[mail]
    trainDict[mail] = res
    if(x %1000 == 0):
        print x
    x+=1

# delete keys, i.e. emails whose charset were not recognizable
for d in deleteKeys:
    trainDict.pop(d)

In [None]:
# pickle files should be opened in binary mode
with open("emailData.pickle", "wb") as f:
    pickle.dump(trainDict, f)

In [None]:
with open("emailData.pickle", "rb") as f:
    trainDict = pickle.load(f)

In [None]:
len(trainDict)

In [None]:
trainDict

In [None]:
# Optional: View formatted payload of each email
for mail in trainDict:
    print trainDict[mail]['payload']

<ol start=4>
<li> <b> Feature Extraction </b> </li>
</ol><br>
Significance of Metrics: <br>
<li> Recall: fraction of all actual spam emails in the test data that the model identifies as spam. <i> Note: Recall = TP / (TP + FN) = 1 - False Negative Rate </i> </li>
<li> Precision: fraction of the emails identified as spam that actually are spam </li>
<li> Accuracy: fraction of all test emails (spam and non-spam) that are classified correctly </li>
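
To make the three metrics concrete, here is a small sketch that computes them from a list of expected and predicted labels. The function name `spam_metrics`, the toy labels and the choice of `1` as the spam label are assumptions made for illustration only.

```python
def spam_metrics(expected, predicted, spam_label=1):
    """Return (precision, recall, accuracy) for the spam class."""
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if e == spam_label and p == spam_label)
    fp = sum(1 for e, p in pairs if e != spam_label and p == spam_label)
    fn = sum(1 for e, p in pairs if e == spam_label and p != spam_label)
    tn = sum(1 for e, p in pairs if e != spam_label and p != spam_label)

    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / float(len(pairs)) if pairs else 0.0
    return precision, recall, accuracy

# Toy data: 4 spam emails followed by 6 non-spam emails
expected = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, accuracy = spam_metrics(expected, predicted)
```

In this toy run two of the four spam emails are caught (recall 0.5), two of the three flagged emails are really spam (precision 2/3), and seven of ten emails are classified correctly (accuracy 0.7).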

Feature Extractor 1: <br>
CountVectorizer: counts word occurrence in each email <br>
TreebankWordTokenizer: tokenizer from nltk (a natural language processing package for Python) that splits a string into words following the Penn Treebank conventions, which works well on natural language text
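
As a rough picture of what counting word occurrences produces, the sketch below builds a vocabulary and one row of counts per email in plain Python. It is not scikit-learn's implementation: the regular expression is a simple stand-in for the Treebank tokenizer, no stop words are removed, and the two sample emails are made up.

```python
import re
from collections import Counter

def count_words(documents):
    """Build a sorted vocabulary and one row of word counts per document."""
    tokenized = [re.findall(r"[a-z0-9']+", doc.lower()) for doc in documents]
    vocabulary = sorted(set(tok for toks in tokenized for tok in toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

docs = ["Free money, free prizes", "Meeting about money tomorrow"]
vocab, matrix = count_words(docs)
```

Each column of `matrix` corresponds to one word in `vocab`, and each row is the feature vector of one email.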

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer

In [None]:
vec = CountVectorizer(tokenizer=TreebankWordTokenizer().tokenize, stop_words='english')

In [None]:
l = list()
labels = list()

for k in trainDict:
    # get body
    body = trainDict[k]['payload']
    charset = trainDict[k]['charset']
    isSpam = trainDict[k]['isSpam']
    if charset == None:
        charset = 'ascii'
    bodyStr = body.decode(charset, errors='replace').encode('utf-8', 'replace')
    
    # append to l and label
    l.append(bodyStr)
    labels.append(isSpam)   

In [None]:
X = vec.fit_transform(l) 

In [None]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=1000, random_state=42)

from sklearn.naive_bayes import MultinomialNB

# fit a Naive Bayes model to the data
model = MultinomialNB()
model.fit(X_train, y_train)

# make predictions
expected = y_test
predicted = model.predict(X_test)

In [None]:
# Import  metrics
from sklearn import metrics

# summarize the fit of the model

print(metrics.accuracy_score(expected, predicted))
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Feature Extractor 2: with Content Filter <br>
CountVectorizer: counts word occurrence in each email <br>
Spacy Tokenizer: Does a better job of tokenizing words taking into consideration digits, punctuation marks, urls and various other factors <br>
SpamAssassin's list of words to be excluded: http://wiki.apache.org/spamassassin/BayesStopList <br>
Words shorter than 3 characters are ignored
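
The filtering rules above can be sketched on plain strings as follows. This is a simplified illustration, not the spaCy-based filter used in the notebook: `STOP_WORDS` is a short excerpt of the SpamAssassin list, the helper names are hypothetical, whitespace splitting replaces the spaCy tokenizer, and URLs are not kept.

```python
# Short excerpt of the SpamAssassin stop list, for illustration only
STOP_WORDS = {'able', 'all', 'and', 'can', 'email', 'http', 'mail', 'the', 'you', 'your'}

def keep_token(token):
    """Keep alphabetic tokens of 3+ characters that are not in the stop list."""
    token = token.lower()
    if len(token) < 3:
        return False
    if token in STOP_WORDS:
        return False
    return token.isalpha()

def filter_text(text):
    """Return the kept tokens, lower-cased and joined by single spaces."""
    return ' '.join(t.lower() for t in text.split() if keep_token(t))

filtered = filter_text("You can WIN the lottery in 24 hours")
```

The filtered string is what would then be handed to the vectorizer in place of the raw email body.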

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()

# spacy english module, used by the tokenizing loops below
import spacy
en_nlp = spacy.load('en')

In [None]:
# Source: http://wiki.apache.org/spamassassin/BayesStopList
# SpamAssassin: word length < 3 excluded and words in given list excluded
exclusive_list = ['able', 'all', 'already', 'and', 'any', 'are', 'because','both', 'can', 'come', 'each', 'email', 'even', 'few', 'first', 'for', 'from', 'give', 'has', 'have', 'http', 'information', 'into', "it's", 'just', 'know','like', 'long', 'look', 'made', 'mail', 'mailing', 'mailto', 'make', 'many','more', 'most', 'much', 'need', 'not', 'now', 'number', 'off', 'one', 'only', 'out', 'own', 'people', 'place', 'right', 'same', 'see', 'such', 'that', 'the', 'this', 'through', 'time', 'using', 'web', 'where', 'why', 'with', 'without', 'work', 'world', 'year', 'years', 'you', 'your', "you're"]

In [None]:
def ignore_word(word):
    # word is a spaCy token
    # ignore digits, punctuation marks, spaces, stop words, new line chars
    if word.is_digit or word.is_punct or word.is_space or word.is_stop or str(word)=='\n' or word.like_num:
        return True
    # compare the token's lower-cased text, not the token object, against the list
    elif word.lower_ in exclusive_list:
        return True
    elif len(str(word)) < 3:
        return True
    else:
        return False

In [None]:
l = list()
labels = list()

for k in trainDict:
    # get body
    body = trainDict[k]['payload']
    charset = trainDict[k]['charset']
    isSpam = trainDict[k]['isSpam']
    if charset == None:
        charset = 'ascii'
    u = unicode(body, charset)
    try:
        # Tokenize using Spacy
        words = en_nlp(u)
        # Get relevant tokens
        relevant_words = ''
        for word in words:
            if ignore_word(word):
                continue
            # consider words that contain alphabets or look like urls
            elif word.is_alpha or word.like_url:
                relevant_words+= str(word) + ' '
        # append to list and labels
        l.append(relevant_words)
        labels.append(isSpam)
    except:
        continue


In [None]:
X = vec.fit_transform(l)

In [None]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=1000, random_state=42)

from sklearn.naive_bayes import MultinomialNB

# fit a Naive Bayes model to the data
model = MultinomialNB()
model.fit(X_train, y_train)

# make predictions
expected = y_test
predicted = model.predict(X_test)

In [None]:
# Import  metrics
from sklearn import metrics

# summarize the fit of the model

print(metrics.accuracy_score(expected, predicted))
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Feature Extractor 3: with ngrams range 1-3 <br>
CountVectorizer ngrams: counts occurrences of single words as well as groups of 2 and 3 consecutive words in each email. <br>
Spacy Tokenizer: Does a better job of tokenizing words taking into consideration digits, punctuation marks, urls and various other factors <br>
SpamAssassin's list of words to be excluded: http://wiki.apache.org/spamassassin/BayesStopList <br>
Words shorter than 3 characters are ignored
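
To show what an n-gram range of 1 to 3 means, the sketch below lists every group of 1, 2 and 3 consecutive words in a token list and counts them. The helper name `word_ngrams` and the sample sentence are hypothetical; the vectorizer does this internally.

```python
from collections import Counter

def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all runs of min_n..max_n consecutive tokens, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

tokens = "click here click now".split()
grams = word_ngrams(tokens)
counts = Counter(grams)
```

Four tokens yield 4 unigrams, 3 bigrams and 2 trigrams, so the feature space grows quickly as the range widens.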

In [None]:
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,3), min_df=1, stop_words='english')
X = vectorizer.fit_transform(l)

In [None]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=1000, random_state=42)

from sklearn.naive_bayes import MultinomialNB

# fit a Naive Bayes model to the data
model = MultinomialNB()
model.fit(X_train, y_train)

# make predictions
expected = y_test
predicted = model.predict(X_test)

In [None]:
# Import  metrics
from sklearn import metrics

# summarize the fit of the model

print(metrics.accuracy_score(expected, predicted))
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Feature Extractor 4: with content filter and TfidfVectorizer <br>
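TfidfVectorizer: calculates term frequency, i.e. word count, and weights it by inverse document frequency, so terms that appear in many documents count for less (they are ignored entirely only if max_df is set). By default each document's vector is also normalized, which removes the effect of document length. <br>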
Spacy Tokenizer: Does a better job of tokenizing words taking into consideration digits, punctuation marks, urls and various other factors <br>
SpamAssassin's list of words to be excluded: http://wiki.apache.org/spamassassin/BayesStopList <br>
Words shorter than 3 characters are ignored
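
The tf-idf weighting can be worked through by hand on a tiny count matrix. The sketch below uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is the behavior documented for scikit-learn's default settings; the function name and the counts are made up for illustration.

```python
import math

def tfidf_rows(count_rows):
    """Weight raw counts by smoothed idf, then L2-normalize each row."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # document frequency: number of documents that contain each term
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1.0 + n_docs) / (1.0 + d)) + 1.0 for d in df]
    result = []
    for row in count_rows:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return idf, result

# Columns: 'free', 'money', 'meeting'; rows: two toy emails
counts = [[2, 1, 0], [0, 1, 1]]
idf, weights = tfidf_rows(counts)
```

Here 'money' appears in both emails and gets the lowest idf (exactly 1), while 'free' and 'meeting' each appear in one email and are weighted higher.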

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(l)

In [None]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=1000, random_state=42)

from sklearn.naive_bayes import MultinomialNB

# fit a Naive Bayes model to the data
model = MultinomialNB()
model.fit(X_train, y_train)

# make predictions
expected = y_test
predicted = model.predict(X_test)

In [None]:
# Import  metrics
from sklearn import metrics

# summarize the fit of the model

print(metrics.accuracy_score(expected, predicted))
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Feature Extractor 5: with content filter <br>
CountVectorizer: calculates word occurrences in email <br>
Spacy Tokenizer Lemma: Get the lemma/ root for each word in the email and group different versions of the words into the lemma count

In [None]:
l = list()
labels = list()

for k in trainDict:
    # get body
    body = trainDict[k]['payload']
    charset = trainDict[k]['charset']
    isSpam = trainDict[k]['isSpam']
    if charset == None:
        charset = 'ascii'
    u = unicode(body, charset)
    try:
        # Tokenize using Spacy
        words = en_nlp(u)
        # Get relevant tokens
        relevant_words = ''
        for word in words:
            # ignore digits, punctuation marks, spaces, stop words, new line chars
            if word.is_digit or word.is_punct or word.is_space or word.is_stop or str(word)=='\n' or word.like_num:
                continue
            # consider words that contain alphabets or look like urls
            elif word.is_alpha or word.like_url:
                # use lemma of word
                relevant_words+= str(word.lemma_) + ' '
        # append to list and labels
        l.append(relevant_words)
        labels.append(isSpam)
    except:
        continue


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(l)

In [None]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=1000, random_state=42)

from sklearn.naive_bayes import BernoulliNB

# fit a Naive Bayes model to the data
model = BernoulliNB()
model.fit(X_train, y_train)

# make predictions
expected = y_test
predicted = model.predict(X_test)

In [None]:
# Import  metrics
from sklearn import metrics

# summarize the fit of the model

print(metrics.accuracy_score(expected, predicted))
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

<ol start=5>
<li> <b> Feature Selection </b> </li>
</ol>
Select most relevant features given all the emails rather than considering counts of every word. <br>
Tools Used: <br>
ExtraTreesClassifier: fits an ensemble of randomized trees on the features found using CountVectorizer and assigns an importance score to each feature. <br>
SelectFromModel: Uses the importances from ExtraTreesClassifier to keep only the most important features (columns) of our data
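
The selection step can be pictured as keeping only the columns whose importance reaches a threshold. The sketch below uses the mean importance as the threshold, which we understand to be SelectFromModel's default for tree-based estimators; the importances and count rows are invented, and the helper name is hypothetical.

```python
def select_by_importance(rows, importances, threshold=None):
    """Keep the columns whose importance is at least the threshold.

    If no threshold is given, the mean importance is used (assumed default).
    """
    if threshold is None:
        threshold = sum(importances) / float(len(importances))
    keep = [j for j, imp in enumerate(importances) if imp >= threshold]
    reduced = [[row[j] for j in keep] for row in rows]
    return keep, reduced

# Made-up importances for 4 features and two rows of counts
importances = [0.05, 0.40, 0.10, 0.45]
rows = [[1, 0, 2, 3], [0, 4, 0, 1]]

kept_columns, reduced_rows = select_by_importance(rows, importances)
```

With a mean importance of 0.25, only the second and fourth columns survive, so the matrix shrinks from 4 features to 2.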

In [None]:
l = list()
labels = list()

for k in trainDict:
    # get body
    body = trainDict[k]['payload']
    charset = trainDict[k]['charset']
    isSpam = trainDict[k]['isSpam']
    if charset == None:
        charset = 'ascii'
    u = unicode(body, charset)
    try:
        # Tokenize using Spacy
        words = en_nlp(u)
        # Get relevant tokens
        relevant_words = ''
        for word in words:
            if word.is_digit or word.is_punct or word.is_space or word.is_stop or str(word)=='\n' or word.like_num:
                continue
            # consider words that contain alphabets or look like urls
            elif word.is_alpha or word.like_url:
                relevant_words+= str(word) + ' '
        # append to list and labels
        l.append(relevant_words)
        labels.append(isSpam)
    except:
        continue

In [None]:
# Feature Extraction
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,3), min_df = 1, stop_words = 'english')
X = vectorizer.fit_transform(l)

In [None]:
# Feature Selection
from sklearn.ensemble import ExtraTreesClassifier
clf = ExtraTreesClassifier()
clf.fit(X, labels)

In [None]:
from sklearn.feature_selection import SelectFromModel
model = SelectFromModel(clf, prefit=True)

In [None]:
X_new = model.transform(X)

In [None]:
# Optional: Compare shape of X and X_new to see reduction in total number of features
print X.shape
print X_new.shape

In [None]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_new, labels, test_size=1000, random_state=42)

from sklearn.naive_bayes import MultinomialNB

# fit a Naive Bayes model to the data
model = MultinomialNB()
model.fit(X_train, y_train)

# make predictions
expected = y_test
predicted = model.predict(X_test)

In [None]:
# Import  metrics
from sklearn import metrics

# summarize the fit of the model

print(metrics.accuracy_score(expected, predicted))
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

<ol start=6>
<li> <b> Spam Assassin </b> </li>
</ol>
We wrote a script that will give us the spam score for every email by making a request to the SpamAssassin server. Score > 5 is considered as spam. <br>
The result output is a tuple with Spam Assassin's classification of the email and our label
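
The thresholding and the resulting tuples can be sketched without contacting the server. The scores and labels below are made up, the helper names are hypothetical, and the sketch assumes that a truthy label means spam.

```python
SPAM_THRESHOLD = 5.0

def classify(score, threshold=SPAM_THRESHOLD):
    """An email counts as spam only when its score is strictly above the threshold."""
    return score > threshold

def agreement_rate(results):
    """Fraction of (verdict, label) tuples where the verdict matches the label."""
    if not results:
        return 0.0
    matches = sum(1 for verdict, label in results if bool(verdict) == bool(label))
    return matches / float(len(results))

# Made-up (score, label) pairs standing in for server responses
scores_and_labels = [(7.2, 1), (1.3, 0), (5.0, 1), (0.4, 0)]
res = [(classify(score), label) for score, label in scores_and_labels]
rate = agreement_rate(res)
```

A score of exactly 5 is not flagged, so the third email counts as a miss in this toy example.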

In [None]:
import sys
import os
import pickle
# our custom fork of the spamcheck
cwd = os.getcwd()
sys.path.append(os.path.join(cwd, 'SpamAss/spamcheck-python'))
import spamcheck

In [None]:
with open("../cleanEmailData.pickle", "rb") as f:
    emailData = pickle.load(f)

In [None]:
# For Testing Purposes: Try one clean email (without any new line or \r characters)
x = emailData[0][0].replace('\n', ' ').replace('\r', ' ')
spamcheck.check(x)

In [None]:
x = 0

res = []
for i in emailData:
    filteredPayload = i[0].replace('\n', ' ').replace('\r', ' ')
    isSpam = False
    try:
        if spamcheck.check(filteredPayload)['score'] > 5:
            isSpam = True
    except:
        continue
    res.append((isSpam, i[1]))
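    # Optional while testing: stop after the first 100 emails
    # (the original check compared the tuple i to 100, which is never true)
    # if x == 100:
    #     break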
    x+=1
    if (x % 50 == 0):
        print x

# The result is a list of tuples (SpamAssassin's classification of email as spam or not, our label of spam or not)
res

In [None]:
with open("spamAssassinData.pickle", "wb") as f:
    pickle.dump(res, f)

In [None]:
with open("spamAssassinData.pickle", "rb") as f:
    res = pickle.load(f)

<ol start=7>
<li> <b> Understanding Model Results </b> </li>
</ol>

When comparing models, we look at recall and precision for the spam class: recall shows how much of the spam we catch, and precision shows how often an email flagged as spam really is spam (low precision means legitimate email is being flagged, i.e. false positives). <br><br>
The top 3 models with respect to these factors are: <br>
1. CountVectorizer with content filters (Spacy Tokenizer, spam assassin exclusive word list) and ngrams. The model was 96.5% accurate. Recall and precision for spam were 0.95 and 0.94 respectively. <br>
2. CountVectorizer with ngrams and feature selector (ExtraTreesClassifier and SelectFromModel, which select the most important features). The model was 95.7% accurate. Recall and precision for spam were 0.96 and 0.9 respectively.  <br>
3. CountVectorizer without content filters or feature selectors.  The model was 95.8% accurate. Recall and precision for spam were 0.9 and 0.96 respectively. <br>
<br>

These metrics are comparable to those of popular models such as the Enhanced Naive Bayes classifier by Paul Graham and Spam Assassin. Paul Graham reports a catch rate of about 99.5% while our recall is around 95%.<br>
However, our model has a higher false positive rate of around 6% in comparison to Paul Graham's 0.05%. Some of the reasons include: <br>
1. Message Headers: The emails in our dataset did not consistently include headers, and the headers that were present came in various formats. So, we discarded all message headers, even though they have a large impact on classification, as Paul Graham also notes.
2. Data Sanitization: Each email was in a different format: plain text, email reply threads, html, newsletters. Since words associated with the format were included in the word counts, a technique that standardizes the format would likely change the results.
<br>
Finally, a lot of email classifiers get to 95% accuracy and about 5% false positive rate. However, identifying features and content filters that decrease that 5% is the current challenge and a very dynamic problem since spammers constantly find ways to beat the algorithm. <br>

Contrary to our hypothesis, the models that used the tfidf vectorizer and lemmas performed worse because information is lost. In the tfidf vectorizer, down-weighting words that appear in many emails and normalizing by the length of the email is not useful: words are spammy regardless of their count or the length of the email they appear in. Similarly, using the lemma of each word loses information about misspelt words or variations of words that occur in spam vs ham emails. <br>

Some features that we plan to consider in the future are length of email, subject of message and header information.

<ol start=8>
<li> <b> Conclusion and Future Actionable Steps </b> </li>
</ol>

Moving ahead, we can 
1. Use better data sanitization techniques and maybe cluster emails into groups such as reply threads, newsletters, html structure, text.
2. Find a way to extract message headers for all emails
3. Clever feature engineering: Identify significant features apart from words in the email such as length of email, time it was sent, number of links in the email.
4. Look into gmail filters such as degree of links between sender and receiver
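
As a starting point for the feature engineering step, non-word features such as email length and number of links could be computed as in the sketch below. The function name, the feature names and the URL pattern are all hypothetical choices, not something we have evaluated.

```python
import re

# Simple pattern for http/https links; an assumption, not a full URL parser
URL_PATTERN = re.compile(r'https?://\S+')

def extra_features(body):
    """Return a few features of an email body that do not depend on specific words."""
    return {
        'length': len(body),
        'num_words': len(body.split()),
        'num_links': len(URL_PATTERN.findall(body)),
    }

features = extra_features("Visit http://a.example and https://b.example today")
```

Features like these could be appended as extra columns next to the word counts before training.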

<ol start=9>
<li> <b> Appendix </b> </li>
</ol>

Interesting observations we made while analyzing words present in the emails
<br>Appendix 1: Word Count Dictionary

In [None]:
# spacy english module
import spacy
en_nlp = spacy.load('en')

In [None]:
# create word count dictionary
wordDict = {}

In [None]:
def addToDict(word, isSpam):
    # ignore some words
    if word.is_digit or word.is_punct or word.is_space or word.is_stop:
        return
    # if every char in word is alphabet
    elif word.is_alpha:
        word = word.orth_[:].encode('ascii')
        # else add to dict
        if word.lower() in wordDict:
            # index 0: spam
            if isSpam == 0:
                wordDict[word.lower()][0] += 1
            # index 1: ham
            else:
                wordDict[word.lower()][1] += 1
        else:
            # index 0: spam
            if isSpam == 0:
                wordDict[word.lower()] = [1,0]
                # index 1: ham
            else:
                wordDict[word.lower()] = [0,1]

In [None]:
for k in trainDict.keys():
    # get relevant information
    body = trainDict[k]['payload']
    charset = trainDict[k]['charset']
    isSpam = trainDict[k]['isSpam']
    if charset==None:
        charset='ascii'
    u = unicode(body, charset)
    
    # Tokenize using Spacy
    try:
        words = en_nlp(u)
        for word in words:
            addToDict(word, isSpam)
    except:
        continue

# Testing for individual email (smaller run time)
# k = trainDict.keys()[281]
# body = trainDict[k]['payload']
# charset = trainDict[k]['charset']
# isSpam = trainDict[k]['isSpam']

In [None]:
with open("wordDictData.pickle", "wb") as f:
    pickle.dump(wordDict, f)

In [None]:
with open("wordDictData.pickle", "rb") as f:
    wordDict = pickle.load(f)

In [None]:
# get descending order of word counts. Index 0: spam, Index 1: ham
sorted(wordDict.items(), key= lambda x:x[1], reverse=True)

Appendix 2: TextBlob <br>
We use TextBlob to classify the data since it has methods that allow us to explore why each feature is given a particular weight. For example, we can see the top five most informative features and what about those features can be enhanced for better results. However, it takes a very long time to train on the entire dataset

In [None]:
from textblob.classifiers import NaiveBayesClassifier

In [None]:
# spacy english module
import spacy
en_nlp = spacy.load('en')

In [None]:
# Source: http://wiki.apache.org/spamassassin/BayesStopList
# SpamAssassin: word length < 3 excluded and words in given list excluded
exclusive_list = ['able', 'all', 'already', 'and', 'any', 'are', 'because','both', 'can', 'come', 'each', 'email', 'even', 'few', 'first', 'for', 'from', 'give', 'has', 'have', 'http', 'information', 'into', "it's", 'just', 'know','like', 'long', 'look', 'made', 'mail', 'mailing', 'mailto', 'make', 'many','more', 'most', 'much', 'need', 'not', 'now', 'number', 'off', 'one', 'only', 'out', 'own', 'people', 'place', 'right', 'same', 'see', 'such', 'that', 'the', 'this', 'through', 'time', 'using', 'web', 'where', 'why', 'with', 'without', 'work', 'world', 'year', 'years', 'you', 'your', "you're"]

In [None]:
def ignore_word(word):
    # word is a spaCy token
    # ignore digits, punctuation marks, spaces, stop words, new line chars
    if word.is_digit or word.is_punct or word.is_space or word.is_stop or str(word)=='\n' or word.like_num:
        return True
    # compare the token's lower-cased text, not the token object, against the list
    elif word.lower_ in exclusive_list:
        return True
    elif len(str(word)) < 3:
        return True
    else:
        return False

In [None]:
l = list() # list of tuples (email body, isSpam label)

for k in trainDict:
    # get body
    body = trainDict[k]['payload']
    charset = trainDict[k]['charset']
    isSpam = trainDict[k]['isSpam']
    if charset == None:
        charset = 'ascii'
    u = unicode(body, charset)
    try:
        # Tokenize using Spacy
        words = en_nlp(u)
        # Get relevant tokens
        relevant_words = ''
        for word in words:
            if ignore_word(word):
                continue
            # consider words that contain alphabets or look like urls
            elif word.is_alpha or word.like_url:
                relevant_words+= str(word) + ' '
        # append to list and labels
        l.append((relevant_words, isSpam))
    except:
        continue

In [None]:
from sklearn.cross_validation import train_test_split
X_train, X_test = train_test_split(l, test_size=1000, random_state=42)

In [None]:
# Warning: may take more than an hour to train
cl = NaiveBayesClassifier(X_train)

In [None]:
cl.accuracy(X_test)

In [None]:
# Show top 5 informative features
cl.show_informative_features(5)

Appendix 3: Clean Email Data <br>
Payload for each email is decoded based on its charset and re-encoded as UTF-8. <br>
This cleansed version is pickled for SpamAssassin API Requests

In [None]:
l = list() # list of tuples (email body, isSpam label)
for k in trainDict:
    # get body
    body = trainDict[k]['payload']
    charset = trainDict[k]['charset']
    isSpam = trainDict[k]['isSpam']
    if charset == None:
        charset = 'ascii'
    bodyStr = body.decode(charset, errors='replace').encode('utf-8', 'replace')

    # append to list and labels
    l.append((bodyStr, isSpam))

In [None]:
# Pickle the cleaned list of (body, label) tuples, which is what the SpamAssassin section loads
with open("cleanEmailData.pickle", "wb") as f:
    pickle.dump(l, f)