 ## Data 620, Week 5, Part 2 Assignment
 #### Team 4: John Grando, Nick Capofari, Ken Markus, Armenoush Aslanian-Persico, Andrew Goldberg
 
 #### Project Details: 
Use a dataset to predict a class of new documents (either withheld from the training dataset or from another source such as your own spam folder). 

For this project, we used a pre-processed Enron e-mail corpus (available here: http://www2.aueb.gr/users/ion/data/enron-spam/) to classify the documents as either spam or ham. 

 ### Import and normalize data

In [29]:
import io
import os
import nltk
from IPython.display import display

def read_folder(folder):
    """Return the text of every file in `folder` as a list of unicode strings."""
    texts = []
    for filename in os.listdir(folder):
        # newline='' keeps the original \r\n line endings; undecodable bytes are dropped
        with io.open(os.path.join(folder, filename), encoding='utf-8',
                     errors='ignore', newline='') as f:
            texts.append(f.read())
    return texts

spamfolder = '/Users/andrew/Documents/School/Web Analytics/HW4/enron1/spam'
spamdata = read_folder(spamfolder)

hamfolder = '/Users/andrew/Documents/School/Web Analytics/HW4/enron1/ham'
hamdata = read_folder(hamfolder)

In [5]:
#Sample e-mail data
hamdata[1]

u'Subject: vastar resources , inc .\r\ngary , production from the high island larger block a - 1 # 2 commenced on\r\nsaturday at 2 : 00 p . m . at about 6 , 500 gross . carlos expects between 9 , 500 and\r\n10 , 000 gross for tomorrow . vastar owns 68 % of the gross production .\r\ngeorge x 3 - 6992\r\n- - - - - - - - - - - - - - - - - - - - - - forwarded by george weissman / hou / ect on 12 / 13 / 99 10 : 16\r\nam - - - - - - - - - - - - - - - - - - - - - - - - - - -\r\ndaren j farmer\r\n12 / 10 / 99 10 : 38 am\r\nto : carlos j rodriguez / hou / ect @ ect\r\ncc : george weissman / hou / ect @ ect , melissa graves / hou / ect @ ect\r\nsubject : vastar resources , inc .\r\ncarlos ,\r\nplease call linda and get everything set up .\r\ni \' m going to estimate 4 , 500 coming up tomorrow , with a 2 , 000 increase each\r\nfollowing day based on my conversations with bill fischer at bmar .\r\nd .\r\n- - - - - - - - - - - - - - - - - - - - - - forwarded by daren j farmer / hou / ect on 12 / 10

 ### Format and label e-mails spam/ham

In [15]:
import random

# each e-mail becomes (list of whitespace-separated tokens, label)
labeled_emails = ([(ham_mail.split(), 'ham') for ham_mail in hamdata] +
                  [(spam_mail.split(), 'spam') for spam_mail in spamdata])
random.shuffle(labeled_emails)

# build the vocabulary from the first 500 shuffled e-mails only (the slice later used for training)
all_emails = [email for email, classification in labeled_emails][:500]
flattened_emails = [word for email in all_emails for word in email]

tokenized_emails = []
for word in flattened_emails:
    tokenized_emails.extend(nltk.word_tokenize(word))

 ### Define feature extractor

In [17]:
from nltk.corpus import stopwords

#start from the 500 most common tokens, then drop non-alphabetic tokens and stop words
word_freq = nltk.FreqDist(tokenized_emails)
english_stopwords = set(stopwords.words('english'))  # load the stop-word list once
top_words = [w for (w,c) in word_freq.most_common(500)
             if w.isalpha() and w not in english_stopwords]
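
Because the filtering happens after the cut at 500, the final vocabulary is smaller than 500 words. The sketch below illustrates the same pattern on a toy token list. It is only an illustration: `collections.Counter` stands in for `nltk.FreqDist`, and the token list and three-word stop-word set are made up.

```python
from collections import Counter

# Hypothetical stop-word set; the notebook uses NLTK's English list instead.
toy_stopwords = {'the', 'to', 'and'}

# Made-up token stream: gas x4, the x3, '-' x3, deal x2, to x1, '.' x1
toy_tokens = ['the', 'gas', '-', 'deal', 'to', 'the', '-',
              'gas', '.', 'gas', 'deal', 'the', '-', 'gas']

toy_freq = Counter(toy_tokens)

# Take the 4 most common tokens first, then drop punctuation and stop words.
toy_top = [w for (w, c) in toy_freq.most_common(4)
           if w.isalpha() and w not in toy_stopwords]

print('{} of 4 candidates kept: {}'.format(len(toy_top), toy_top))
```

Here two of the four most common tokens survive the filter, which is the same shrinkage that applies to `top_words`.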

In [78]:
#most common words
display(word_freq.most_common(20))

[(u'-', 7721),
 (u'.', 5159),
 (u',', 3787),
 (u'/', 3321),
 (u':', 2745),
 (u'the', 2583),
 (u'to', 2025),
 (u'and', 1366),
 (u'ect', 1120),
 (u'of', 1051),
 (u'a', 1041),
 (u'for', 1037),
 (u'?', 931),
 (u'@', 922),
 (u'in', 816),
 (u'on', 791),
 (u'you', 780),
 (u'this', 735),
 (u'is', 700),
 (u'i', 641)]


In [12]:
#build feature extractor; uses both most common words and extremes in email length

def document_features(document):
    """Map a tokenized e-mail (list of tokens) to a dict of boolean features."""
    document_words = set(document)
    features = {}
    # flag unusually short (< 20 tokens) or unusually long (> 1500 tokens) e-mails
    features['len_check(short_mail)'] = len(document) < 20
    features['len_check(long_mail)'] = len(document) > 1500
    for word in top_words:
        features['contains({})'.format(word)] = (word in document_words)
    return features
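
To make the shape of the output concrete, here is a minimal, self-contained sketch of the same extractor applied to a toy e-mail. The three-word `top_words` list and the sample message are made up for illustration; in the notebook, `top_words` comes from the frequency filtering above.

```python
# Hypothetical vocabulary; the notebook derives top_words from word_freq instead.
top_words = ['gas', 'deal', 'software']

def document_features(document):
    """Boolean features: two length flags plus one 'contains' flag per vocabulary word."""
    document_words = set(document)
    features = {
        'len_check(short_mail)': len(document) < 20,
        'len_check(long_mail)': len(document) > 1500,
    }
    for word in top_words:
        features['contains({})'.format(word)] = word in document_words
    return features

# Made-up e-mail, already split into six tokens.
toy_email = 'subject : gas deal for tomorrow'.split()
toy_features = document_features(toy_email)

for name in sorted(toy_features):
    print('{}: {}'.format(name, toy_features[name]))
```

Every e-mail maps to a dictionary with the same keys, which is the format `nltk.NaiveBayesClassifier.train` expects for each training example.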

 ### Train Classifier

In [120]:
featuresets = [(document_features(d), c) for (d,c) in labeled_emails]
train_set, dev_test_set, test_set = featuresets[:500], featuresets[500:1000], featuresets[1000:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
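
Note that `random.shuffle` is called without a seed in the labeling cell, so the split, and every number reported below, changes from run to run. A minimal sketch of a reproducible version of the same 500/500/rest split follows. The helper name `three_way_split`, the seed value, and the placeholder data are assumptions for illustration, not part of the original notebook.

```python
import random

def three_way_split(items, n_train=500, n_dev=500, seed=620):
    """Shuffle a copy of `items` with a fixed seed, then slice it into three parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # private RNG; global random state is untouched
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

# Placeholder items standing in for the (features, label) pairs.
placeholder = list(range(1200))
train, dev, test = three_way_split(placeholder)

print('{} train, {} dev-test, {} test'.format(len(train), len(dev), len(test)))
```

Calling the helper twice with the same seed returns the same three slices, so the reported accuracy could be reproduced exactly.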

 ### Predictions

In [121]:
import pandas as pd
# dev_test_set already holds (features, label) pairs, so the features are not recomputed
preds = pd.DataFrame({'features':[features for (features,classification) in dev_test_set],
                      'observed':[classification for (features,classification) in dev_test_set],
                      'predicted': [classifier.classify(features) for (features,classification) in dev_test_set]})

In [122]:
pd.crosstab(preds.observed,preds.predicted)

| observed | predicted ham | predicted spam |
|---|---|---|
| ham | 319 | 46 |
| spam | 0 | 135 |


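Treating spam as the positive class, sensitivity on the dev-test set is perfect (all 135 spam e-mails are caught), but this comes at the expense of specificity: 46 of the 365 ham e-mails are predicted as spam.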

In [123]:
#Confusion matrix, sensitivity and specificity (spam is the positive class)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(preds.observed,preds.predicted)
print(cm)
# rows are observed, columns are predicted, both in sorted label order: ham, spam
sensitivity1 = (float(cm[1,1])/(cm[1,1]+cm[1,0]))
print('Sensitivity : ', sensitivity1 )

specificity1 = (float(cm[0,0])/(cm[0,0]+cm[0,1]))
print('Specificity : ', specificity1)

[[319  46]
 [  0 135]]
('Sensitivity : ', 1.0)
('Specificity : ', 0.873972602739726)
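
For reference, the same quantities can be derived directly from the four cell counts without scikit-learn. The sketch below is a plain-Python illustration using the dev-test counts above, with spam as the positive class. The helper name `binary_metrics` is hypothetical, and precision is an extra metric the notebook does not report.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard metrics from a 2x2 confusion matrix (positive class = spam)."""
    total = float(tp + fn + fp + tn)
    return {
        'accuracy': (tp + tn) / total,
        'sensitivity': tp / float(tp + fn),   # share of spam that is caught
        'specificity': tn / float(tn + fp),   # share of ham that is left alone
        'precision': tp / float(tp + fp),     # share of spam predictions that are right
    }

# Dev-test counts: 135 spam caught, 0 spam missed, 46 ham flagged, 319 ham kept.
dev_metrics = binary_metrics(tp=135, fn=0, fp=46, tn=319)

for name in sorted(dev_metrics):
    print('{}: {:.3f}'.format(name, dev_metrics[name]))
```

Precision works out to 135/181, roughly 0.75, which quantifies the cost of the false positives: about one in four e-mails flagged as spam is actually ham.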


 ### Dev-Test Performance

In [83]:
print 'Accuracy: %4.2f' %nltk.classify.accuracy(classifier, dev_test_set)
classifier.show_most_informative_features(15)

Accuracy: 0.91
Most Informative Features
           contains(ect) = True              ham : spam   =     29.4 : 1.0
          contains(stop) = True             spam : ham    =     26.9 : 1.0
      contains(attached) = True              ham : spam   =     21.9 : 1.0
     contains(microsoft) = True             spam : ham    =     20.7 : 1.0
           contains(gas) = True              ham : spam   =     14.3 : 1.0
          contains(corp) = True              ham : spam   =     13.8 : 1.0
        contains(prices) = True             spam : ham    =     13.4 : 1.0
             contains(j) = True              ham : spam   =     11.4 : 1.0
          contains(deal) = True              ham : spam   =     11.0 : 1.0
         contains(daily) = True              ham : spam   =     10.9 : 1.0
      contains(products) = True             spam : ham    =     10.6 : 1.0
      contains(software) = True             spam : ham    =      9.7 : 1.0
         contains(stock) = True             spam : ham    =

 ### Errors

In [52]:
errors = []
for (doc, tag) in labeled_emails[500:1000]:
    features = document_features(doc)
    guess = classifier.classify(features)
    prob_dist = classifier.prob_classify(features)  # probability distribution over labels
    if guess != tag:
        errors.append( (tag, guess, len(doc), prob_dist.prob("spam")) )

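Based on the current analysis, it is unclear what is causing the misclassifications. The errors listed below are all ham flagged as spam, and they range from very short to fairly long e-mails (11 to 578 tokens), so length alone does not explain them.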

In [72]:
col_names = ['tag', 'guess', 'length', 'spam probability']
pd.DataFrame(errors, columns = col_names)

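| | tag | guess | length | spam probability |
|---|---|---|---|---|
| 0 | ham | spam | 130 | 0.626272 |
| 1 | ham | spam | 12 | 0.999444 |
| 2 | ham | spam | 578 | 0.679953 |
| 3 | ham | spam | 20 | 0.838354 |
| 4 | ham | spam | 31 | 0.99364 |
| 5 | ham | spam | 409 | 0.999982 |
| 6 | ham | spam | 30 | 0.991248 |
| 7 | ham | spam | 11 | 0.967119 |
| 8 | ham | spam | 98 | 0.997005 |
| 9 | ham | spam | 16 | 0.55512 |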
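
One possible way to start probing these errors is to bucket them by properties the model already uses. The sketch below works only with the ten rows displayed above, not the full error list. The cutoff of 20 tokens mirrors the short-mail threshold in `document_features`, while the 0.9 probability cutoff is an arbitrary choice for illustration.

```python
# (tag, guess, length, spam probability) rows copied from the table above.
errors_shown = [
    ('ham', 'spam', 130, 0.626272),
    ('ham', 'spam', 12, 0.999444),
    ('ham', 'spam', 578, 0.679953),
    ('ham', 'spam', 20, 0.838354),
    ('ham', 'spam', 31, 0.99364),
    ('ham', 'spam', 409, 0.999982),
    ('ham', 'spam', 30, 0.991248),
    ('ham', 'spam', 11, 0.967119),
    ('ham', 'spam', 98, 0.997005),
    ('ham', 'spam', 16, 0.55512),
]

# E-mails that trip the extractor's short-mail flag (fewer than 20 tokens).
short_errors = [row for row in errors_shown if row[2] < 20]

# Errors where the model was very sure of the wrong answer.
confident_errors = [row for row in errors_shown if row[3] > 0.9]

print('{} of {} shown errors are short e-mails'.format(
    len(short_errors), len(errors_shown)))
print('{} of {} shown errors have spam probability above 0.9'.format(
    len(confident_errors), len(errors_shown)))
```

Applying the same breakdown to the full `errors` list would show whether the short-mail feature or overconfident word features account for a meaningful share of the false positives.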


 ### Model Performance on the Test Set

In [76]:
print 'Accuracy: %4.2f' %nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(15)

Accuracy: 0.90
Most Informative Features
           contains(ect) = True              ham : spam   =     29.4 : 1.0
          contains(stop) = True             spam : ham    =     26.9 : 1.0
      contains(attached) = True              ham : spam   =     21.9 : 1.0
     contains(microsoft) = True             spam : ham    =     20.7 : 1.0
           contains(gas) = True              ham : spam   =     14.3 : 1.0
          contains(corp) = True              ham : spam   =     13.8 : 1.0
        contains(prices) = True             spam : ham    =     13.4 : 1.0
             contains(j) = True              ham : spam   =     11.4 : 1.0
          contains(deal) = True              ham : spam   =     11.0 : 1.0
         contains(daily) = True              ham : spam   =     10.9 : 1.0
      contains(products) = True             spam : ham    =     10.6 : 1.0
      contains(software) = True             spam : ham    =      9.7 : 1.0
         contains(stock) = True             spam : ham    =

In [124]:
# test_set already holds (features, label) pairs, so the features are not recomputed
perf = pd.DataFrame({'features':[features for (features,classification) in test_set],
                      'observed':[classification for (features,classification) in test_set],
                      'predicted': [classifier.classify(features) for (features,classification) in test_set]})
pd.crosstab(perf.observed,perf.predicted)

| observed | predicted ham | predicted spam |
|---|---|---|
| ham | 2555 | 403 |
| spam | 4 | 1210 |


On the held-out test set, sensitivity remains very high (1,210 of 1,214 spam e-mails are caught), but 403 of the 2,958 ham e-mails are misclassified as spam.

In [125]:
#Sensitivity and specificity from the confusion matrix (spam is the positive class)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(perf.observed,perf.predicted)
sensitivity1 = (float(cm[1,1])/(cm[1,1]+cm[1,0]))
print('Sensitivity : ', sensitivity1 )

specificity1 = (float(cm[0,0])/(cm[0,0]+cm[0,1]))
print('Specificity : ', specificity1)

('Sensitivity : ', 0.9967051070840197)
('Specificity : ', 0.8637592968221771)
