# Mini Project: Build a system that guesses the author of an email

A couple of years ago, J.K. Rowling (of Harry Potter fame) tried something interesting. She wrote a book, “The Cuckoo’s Calling,” under the name Robert Galbraith. The book received some good reviews, but no one paid much attention to it--until an anonymous tipster on Twitter said it was J.K. Rowling. The London Sunday Times enlisted two experts to compare the linguistic patterns of “Cuckoo” to Rowling’s “The Casual Vacancy,” as well as to books by several other authors. After the results of their analysis pointed strongly toward Rowling as the author, the Times directly asked the publisher if they were the same person, and the publisher confirmed. The book exploded in popularity overnight.

We’ll do something very similar in this project. We have a set of emails, half of which were written by one person and the other half by another person at the same company. Our objective is to classify each email as written by one person or the other based only on its text. The first mini-project in this series used Naive Bayes; later projects, such as this one, move on to other algorithms.

We will start by giving you a list of strings. Each string is the text of an email that has undergone some basic preprocessing; we will then provide the code to split the dataset into training and testing sets. (In the next lessons you’ll learn how to do this preprocessing and splitting yourself, but for now we’ll give the code to you.)
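To make the splitting step concrete, here is a minimal pure-Python sketch of a shuffled train/test split. It is not the code inside `preprocess()`; the helper name `split_train_test`, the 10% test fraction, the seed, and the toy emails are all assumptions made for illustration.

```python
import random


def split_train_test(texts, labels, test_fraction=0.1, seed=42):
    """Shuffle paired (text, label) items and split them into train/test lists.

    Hypothetical helper for illustration; not the course's preprocess().
    """
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [texts[i] for i in train_idx],
        [texts[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


# Toy data: 20 made-up "emails" with alternating labels 0 and 1
toy_texts = ["email %d" % i for i in range(20)]
toy_labels = [i % 2 for i in range(20)]
x_train, x_test, y_train, y_test = split_train_test(toy_texts, toy_labels)
print(len(x_train), len(x_test))
```

The return order mirrors the four values that `preprocess()` hands back in the first cell: training features, test features, training labels, test labels.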

Unlike the Naive Bayes mini-project, this one uses a support vector machine (SVM) as the classifier.

In [1]:
# %load svm_author_id.py
#!/usr/bin/python

""" 
    This is the code to accompany the Lesson 2 (SVM) mini-project.

    Use an SVM to identify emails from the Enron corpus by their authors:
    Sara has label 0
    Chris has label 1
"""
    
import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess


### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

no. of Chris training emails: 7936
no. of Sara training emails: 7884


In [2]:
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Train a linear-kernel SVM, timing the fit and predict steps separately
clf = SVC(kernel="linear")
t0 = time()
clf.fit(features_train, labels_train)
print("fitting time:", round(time() - t0, 3), "s")
t1 = time()
pred = clf.predict(features_test)
print("predicting time:", round(time() - t1, 3), "s")
# accuracy_score expects (y_true, y_pred)
print('accuracy: ', accuracy_score(labels_test, pred))

fitting time: 288.447 s
predicting time: 29.906 s
accuracy:  0.9840728100113766
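
The accuracy reported above is the fraction of test emails whose predicted label equals the true label. A minimal pure-Python sketch of that calculation follows; the function name `accuracy` and the toy labels are illustrative and are not part of the notebook.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if len(true_labels) == 0:
        raise ValueError("cannot compute accuracy of an empty sequence")
    matches = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return matches / len(true_labels)


# Made-up labels: 0 = Sara, 1 = Chris
toy_true = [0, 1, 1, 0, 1]
toy_pred = [0, 1, 0, 0, 1]
print(accuracy(toy_true, toy_pred))  # 4 of 5 match -> 0.8
```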


Now reduce the training set to 1% of its original size and check the results.

In [4]:
# Keep only the first 1% of the training data to speed up training
features_train = features_train[:len(features_train) // 100]
labels_train = labels_train[:len(labels_train) // 100]
clf = SVC(kernel="linear")
t0 = time()
clf.fit(features_train, labels_train)
print("fitting time:", round(time() - t0, 3), "s")
t1 = time()
pred = clf.predict(features_test)
print("predicting time:", round(time() - t1, 3), "s")
# accuracy_score expects (y_true, y_pred)
print('accuracy: ', accuracy_score(labels_test, pred))

fitting time: 0.173 s
predicting time: 1.765 s
accuracy:  0.8845278725824801
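
The trade-off between the full run and the 1% run can be worked out from the numbers printed above. The sketch below only redoes that arithmetic. It assumes the two counts printed by `preprocess()` together make up the whole training set, and the variable names are illustrative.

```python
# Training-set counts printed by preprocess() above
n_chris = 7936
n_sara = 7884
n_train = n_chris + n_sara   # 15820 training emails in total
n_subset = n_train // 100    # size of the 1% slice -> 158

# Fitting times (seconds) and accuracies reported for the two linear-kernel runs
full_fit_s, subset_fit_s = 288.447, 0.173
full_acc, subset_acc = 0.9840728100113766, 0.8845278725824801

speedup = full_fit_s / subset_fit_s
accuracy_drop = full_acc - subset_acc
print(n_train, n_subset)
print("fit speedup: about %.0fx" % speedup)
print("accuracy drop: %.3f" % accuracy_drop)
```

Training on 158 emails instead of 15820 makes fitting more than a thousand times faster, at a cost of roughly ten percentage points of accuracy.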


Still using the 1% training set, switch the kernel to RBF and check the results.

In [5]:
# Same 1% training set as before, now with an RBF kernel and default C
clf = SVC(kernel="rbf")
t0 = time()
clf.fit(features_train, labels_train)
print("fitting time:", round(time() - t0, 3), "s")
t1 = time()
pred = clf.predict(features_test)
print("predicting time:", round(time() - t1, 3), "s")
# accuracy_score expects (y_true, y_pred)
print('accuracy: ', accuracy_score(labels_test, pred))



fitting time: 0.361 s
predicting time: 2.135 s
accuracy:  0.6160409556313993
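
The RBF (Gaussian) kernel scores the similarity of two feature vectors as exp(-gamma * ||x - y||^2), so nearby points score close to 1 and distant points close to 0. Here is a minimal pure-Python sketch of that formula. The function name `rbf_kernel_value`, the `gamma` value, and the vectors are illustrative assumptions, not values taken from this notebook.

```python
import math


def rbf_kernel_value(x, y, gamma):
    """K(x, y) = exp(-gamma * squared Euclidean distance between x and y)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


origin = [0.0, 0.0]
near = [0.1, 0.0]
far = [3.0, 4.0]
print(rbf_kernel_value(origin, origin, gamma=0.5))  # identical points -> 1.0
print(rbf_kernel_value(origin, near, gamma=0.5))    # close to 1
print(rbf_kernel_value(origin, far, gamma=0.5))     # close to 0
```

A more flexible kernel does not help automatically: on the 1% subset with default settings, accuracy fell from about 0.885 with the linear kernel to about 0.616 with RBF.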


Now let's try C = 10, 100, 1000, and 10000 with the RBF kernel, still on the 1% training set.

In [6]:
# Sweep C on the 1% training set with the RBF kernel
for c_value in [10, 100, 1000, 10000]:
    clf = SVC(kernel="rbf", C=c_value)
    clf.fit(features_train, labels_train)
    pred = clf.predict(features_test)
    print('C[{}]: {}'.format(c_value, accuracy_score(labels_test, pred)))



C[10]: 0.6160409556313993
C[100]: 0.6160409556313993
C[1000]: 0.8213879408418657
C[10000]: 0.8924914675767918
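
C controls the trade-off between a smooth decision boundary and classifying every training point correctly; larger values penalize training errors more heavily. The sweep can be summarized by recording each accuracy and picking the best one. The sketch below hardcodes the accuracies printed above rather than retraining, and its variable names are illustrative.

```python
# Test accuracies reported for each C value (RBF kernel, 1% training set)
accuracy_by_c = {
    10: 0.6160409556313993,
    100: 0.6160409556313993,
    1000: 0.8213879408418657,
    10000: 0.8924914675767918,
}

# Pick the C value with the highest recorded accuracy
best_c = max(accuracy_by_c, key=accuracy_by_c.get)
print("best C:", best_c)  # -> best C: 10000

for c_value in sorted(accuracy_by_c):
    print("C=%-6d accuracy=%.3f" % (c_value, accuracy_by_c[c_value]))
```

Outside an exercise like this one, C would normally be chosen on a separate validation set rather than on the test set.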


Now use the full training set with the RBF kernel and C=10000, and check the results.

In [7]:
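# Reload the full dataset, since features_train was cut to 1% earlier
features_train, features_test, labels_train, labels_test = preprocess()
clf = SVC(kernel="rbf", C=10000)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
# accuracy_score expects (y_true, y_pred)
print('accuracy: ', accuracy_score(labels_test, pred))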

no. of Chris training emails: 7936
no. of Sara training emails: 7884




accuracy:  0.9908987485779295


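Now predict the class of test elements 10, 26, and 50. Note that the classifier in this cell is retrained on the 1% training subset.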

In [9]:
# Cut the training data back to 1% and retrain with the RBF kernel
features_train = features_train[:len(features_train) // 100]
labels_train = labels_train[:len(labels_train) // 100]
clf = SVC(kernel="rbf", C=10000)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
# Predicted labels for test elements 10, 26, and 50
print(pred[10])
print(pred[26])
print(pred[50])



1
0
1
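
Using the label convention from the docstring in the first cell (Sara is 0, Chris is 1), the three predictions above can be turned into names. This is a small illustrative sketch; the names `LABEL_NAMES` and `predictions` are not part of the notebook.

```python
# Label convention from the first cell's docstring
LABEL_NAMES = {0: "Sara", 1: "Chris"}

# Predicted labels printed above, keyed by test-set index
predictions = {10: 1, 26: 0, 50: 1}

for index in sorted(predictions):
    print("test email %d -> %s" % (index, LABEL_NAMES[predictions[index]]))
```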


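How many test emails were predicted to be from Chris (label 1)? The count below uses pred from the previous cell, that is, the classifier trained on the 1% subset.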

In [10]:
from collections import Counter
# Count how often each label appears among the predictions
counts = Counter(pred)
counts[1]

1018
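
For reference, here is how `Counter` tallies labels on a small made-up list of predictions. Because the labels are 0 and 1, summing the predictions gives the same count for label 1. The toy list is an assumption for illustration only.

```python
from collections import Counter

# Made-up predictions: 0 = Sara, 1 = Chris
toy_predictions = [1, 0, 1, 1, 0, 1]

counts = Counter(toy_predictions)
print(counts[1])  # emails predicted as Chris -> 4
print(counts[0])  # emails predicted as Sara -> 2

# With 0/1 labels, the sum equals the number of 1s
print(sum(toy_predictions) == counts[1])  # -> True
```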