 ## Data 620, Week 5, Part 2 Assignment
 #### Team 4: John Grando, Nick Capofari, Ken Markus, Armenoush Aslanian-Persico, Andrew Goldberg
 
 #### Project Details: 
Use a labeled dataset to predict the class of new documents (either withheld from the training dataset or drawn from another source, such as your own spam folder). 

For this project, we used a pre-processed Enron e-mail corpus (available here: http://www2.aueb.gr/users/ion/data/enron-spam/) to train a classifier that labels each document as either spam or ham. 

 ### Import and normalize data

In [3]:
import io
import os
import nltk

def read_folder(folder):
    """Return the contents of every file in `folder` as a list of unicode strings."""
    texts = []
    for filename in os.listdir(folder):
        path = os.path.join(folder, filename)
        # io.open decodes as UTF-8 on Python 2 and 3; undecodable bytes are dropped,
        # and newline='' keeps the original \r\n line endings
        with io.open(path, encoding='utf-8', errors='ignore', newline='') as f:
            texts.append(f.read())
    return texts

spamfolder = '/Users/andrew/Documents/School/Web Analytics/HW4/enron1/spam'
spamdata = read_folder(spamfolder)

hamfolder = '/Users/andrew/Documents/School/Web Analytics/HW4/enron1/ham'
hamdata = read_folder(hamfolder)

In [4]:
type(spamdata[1])

unicode

In [5]:
hamdata[1]

u'Subject: vastar resources , inc .\r\ngary , production from the high island larger block a - 1 # 2 commenced on\r\nsaturday at 2 : 00 p . m . at about 6 , 500 gross . carlos expects between 9 , 500 and\r\n10 , 000 gross for tomorrow . vastar owns 68 % of the gross production .\r\ngeorge x 3 - 6992\r\n- - - - - - - - - - - - - - - - - - - - - - forwarded by george weissman / hou / ect on 12 / 13 / 99 10 : 16\r\nam - - - - - - - - - - - - - - - - - - - - - - - - - - -\r\ndaren j farmer\r\n12 / 10 / 99 10 : 38 am\r\nto : carlos j rodriguez / hou / ect @ ect\r\ncc : george weissman / hou / ect @ ect , melissa graves / hou / ect @ ect\r\nsubject : vastar resources , inc .\r\ncarlos ,\r\nplease call linda and get everything set up .\r\ni \' m going to estimate 4 , 500 coming up tomorrow , with a 2 , 000 increase each\r\nfollowing day based on my conversations with bill fischer at bmar .\r\nd .\r\n- - - - - - - - - - - - - - - - - - - - - - forwarded by daren j farmer / hou / ect on 12 / 10

 ### Format and label e-mails spam/ham

In [7]:
import random

# Pair each e-mail's whitespace-split words with its label, then shuffle
labeled_emails = ([(ham_mail.split(), 'ham') for ham_mail in hamdata] +
                  [(spam_mail.split(), 'spam') for spam_mail in spamdata])
random.shuffle(labeled_emails)

all_emails = [email for email, classification in labeled_emails]
flattened_emails = [word for email in all_emails for word in email]

# Split each whitespace-delimited word further into NLTK tokens
tokenized_emails = []
for word in flattened_emails:
    tokenized_emails.extend(nltk.word_tokenize(word))

In [8]:
tokenized_emails[:10]

[u'Subject',
 u':',
 u'please',
 u'process',
 u'immediately',
 u'to',
 u'avoid',
 u'loss',
 u'of',
 u'information']
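
The tokens above keep their original case (`Subject`) and include punctuation (`:`). The following is a minimal sketch of one possible normalization step, assuming that lowercasing and dropping non-alphabetic tokens is acceptable for this task. The helper name `normalize_tokens` is hypothetical and is not part of the notebook.

```python
def normalize_tokens(tokens):
    """Lowercase tokens and keep only the purely alphabetic ones."""
    return [t.lower() for t in tokens if t.isalpha()]

# Sample taken from the first tokens shown above
sample = [u'Subject', u':', u'please', u'process', u'immediately']
normalized = normalize_tokens(sample)
print(normalized)
```

Note that this filter would also discard numeric tokens such as `500`, which may or may not be desirable for spam detection.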

 ### Define feature extractor

In [73]:
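word_freq = nltk.FreqDist(tokenized_emails)
# In NLTK 3, iterating over a FreqDist does not follow frequency order,
# so request the 10,000 most frequent words explicitly
top_words = [word for word, count in word_freq.most_common(10000)]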

In [63]:
len(word_freq)

50540

In [52]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in top_words:
        features['contains({})'.format(word)] = (word in document_words)
    return features
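
To make the shape of the feature dict concrete, here is a small sketch that runs the same extractor on a toy input. The three-word `top_words` list is a hypothetical stand-in for the notebook's 10,000-word vocabulary.

```python
# Hypothetical three-word vocabulary standing in for the notebook's top_words
top_words = ['meeting', 'free', 'viagra']

def document_features(document):
    document_words = set(document)
    features = {}
    for word in top_words:
        features['contains({})'.format(word)] = (word in document_words)
    return features

# The extractor expects a list of tokens
example = document_features(['free', 'viagra', 'now'])
print(example)

# Passing an already-extracted feature dict back in compares the vocabulary
# against keys such as 'contains(free)', so every feature comes out False
rerun = document_features(example)
print(rerun)
```

The second call shows why features should be extracted exactly once per e-mail: the function accepts any iterable, so feeding it a feature dict raises no error but silently produces an all-False result.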

 ### Train Classifier

In [74]:
featuresets = [(document_features(d), c) for (d,c) in labeled_emails]
train_set, test_set = featuresets[200:], featuresets[:200]
classifier = nltk.NaiveBayesClassifier.train(train_set)
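
The slice above holds out the first 200 shuffled examples for testing and trains on the rest. Because the shuffle is unseeded, each run yields a different partition. The sketch below shows one way to make the split repeatable; the helper name `split_holdout` and the seed value are assumptions, and the toy data only mimics the `(features, label)` shape of `featuresets`.

```python
import random

def split_holdout(items, n_test, seed=620):
    """Shuffle a copy of `items` with a fixed seed and return (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

# Toy (features, label) pairs in the same shape as featuresets
toy = [({'contains(free)': i % 2 == 0}, 'spam' if i % 2 == 0 else 'ham')
       for i in range(10)]

train, test = split_holdout(toy, n_test=3)
print('train=%d test=%d' % (len(train), len(test)))
```

Using a private `random.Random` instance keeps the seed from affecting other code that relies on the global random state.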

 ### Predictions

In [75]:
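import pandas as pd
# test_set already holds (feature dict, label) pairs, so classify each feature dict
# directly; running document_features on it again would set every feature to False
preds = pd.DataFrame({'spam or ham': [features for (features, label) in test_set],
                      'observed': [label for (features, label) in test_set],
                      'predicted': [classifier.classify(features) for (features, label) in test_set]})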

In [76]:
preds[:1000]

| | observed | predicted | spam or ham |
|---|---|---|---|
| 0 | ham | ham | {u'contains(corporate)': False, u'contains(bed... |
| 1 | ham | ham | {u'contains(corporate)': False, u'contains(bed... |
| 2 | spam | ham | {u'contains(corporate)': False, u'contains(bed... |
| 3 | ham | ham | {u'contains(corporate)': False, u'contains(bed... |
| 4 | ham | ham | {u'contains(corporate)': False, u'contains(bed... |
| 5 | spam | ham | {u'contains(corporate)': False, u'contains(bed... |
| 6 | spam | ham | {u'contains(corporate)': False, u'contains(bed... |
| 7 | ham | ham | {u'contains(corporate)': False, u'contains(bed... |
| 8 | ham | ham | {u'contains(corporate)': False, u'contains(bed... |
| 9 | ham | ham | {u'contains(corporate)': False, u'contains(bed... |


In [77]:
pd.crosstab(preds.observed,preds.predicted)

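| observed | predicted: ham |
|---|---|
| ham | 140 |
| spam | 60 |

Every one of the 200 test e-mails is predicted as ham in this recorded output, so all 60 spam messages are missed. The predictions were generated by passing the feature dicts in `test_set` through `document_features` a second time, which sets every `contains(...)` feature to False.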
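
The counts in the cross-tabulation can be converted into accuracy and per-class recall. The sketch below copies the counts by hand from the table and assumes the missing `spam` column is zero, since no e-mail was predicted as spam.

```python
# Keys are (observed, predicted). The two 'spam'-predicted entries are assumed
# to be zero because the cross-tabulation has no 'spam' column.
confusion = {
    ('ham', 'ham'): 140, ('ham', 'spam'): 0,
    ('spam', 'ham'): 60, ('spam', 'spam'): 0,
}

total = sum(confusion.values())
correct = confusion[('ham', 'ham')] + confusion[('spam', 'spam')]
accuracy = correct / float(total)

observed_spam = confusion[('spam', 'ham')] + confusion[('spam', 'spam')]
spam_recall = confusion[('spam', 'spam')] / float(observed_spam)

print('accuracy: %.2f' % accuracy)        # accuracy: 0.70
print('spam recall: %.2f' % spam_recall)  # spam recall: 0.00
```

A spam recall of zero is a useful sanity check here: accuracy alone can look acceptable on an imbalanced test set even when one class is never predicted.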


 ### Model Performance

In [78]:
print('Accuracy: %4.2f' % nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)

Accuracy: 0.86
Most Informative Features
           contains(nom) = True              ham : spam   =    147.8 : 1.0
          contains(spam) = True             spam : ham    =    105.4 : 1.0
        contains(sexual) = True             spam : ham    =     61.3 : 1.0
       contains(foresee) = True             spam : ham    =     53.1 : 1.0
       contains(advises) = True             spam : ham    =     51.5 : 1.0
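
Each ratio in this list compares how likely a feature value is under one label versus the other. The sketch below reproduces that calculation, assuming NLTK's default expected-likelihood smoothing (0.5 added to each possible feature value). The counts are hypothetical and are not taken from the Enron data, and `ele_prob` is an illustrative helper rather than an NLTK function.

```python
def ele_prob(count, total, bins=2):
    """Expected-likelihood estimate: add 0.5 to each of `bins` outcomes."""
    return (count + 0.5) / (total + 0.5 * bins)

# Hypothetical training counts: e-mails per label, and how many contain the word
ham_total, spam_total = 3500, 1500
ham_with_word, spam_with_word = 400, 1

p_ham = ele_prob(ham_with_word, ham_total)     # P(contains(word)=True | ham)
p_spam = ele_prob(spam_with_word, spam_total)  # P(contains(word)=True | spam)

ratio = p_ham / p_spam
print('ham : spam = %.1f : 1.0' % ratio)
```

The smoothing term matters most for words that almost never occur under one label: without it, a single zero count would make the ratio undefined.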
