# Person of Interest Identifier 2.0 Part I
### ~Abhishek Singh

This project is based on the [Enron scandal](https://en.wikipedia.org/wiki/Enron_scandal) of 2001. The data used here is the email corpus of Enron employees, which was made public and can be found [here](https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tar.gz).
Here we identify which of the many Enron employees can be considered a 'person of interest' (POI), i.e. someone who may have had a hand in the scandal, based on their email conversations. The data comes with a hand-picked list of people labelled as POIs, who were convicted in reality. We use a supervised learning approach with text mining techniques to build our POI identifier.
This is a follow-up analysis (version 2) to my earlier project, where I used Enron's financial data to identify persons of interest.

In [1]:
#Importing dependencies
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle as pkl
import re
import json
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

### Preparing the data

In [2]:
#Here's the text file with list of POI names
poi_data = open('poi_names.txt','r')
poi = poi_data.read()
poi_data.close()
poi

'(y) Lay, Kenneth\n(y) Skilling, Jeffrey\n(n) Howard, Kevin\n(n) Krautz, Michael\n(n) Yeager, Scott\n(n) Hirko, Joseph\n(n) Shelby, Rex\n(n) Bermingham, David\n(n) Darby, Giles\n(n) Mulgrew, Gary\n(n) Bayley, Daniel\n(n) Brown, James\n(n) Furst, Robert\n(n) Fuhs, William\n(n) Causey, Richard\n(n) Calger, Christopher\n(n) DeSpain, Timothy\n(n) Hannon, Kevin\n(n) Koenig, Mark\n(y) Forney, John\n(n) Rice, Kenneth\n(n) Rieker, Paula\n(n) Fastow, Lea\n(n) Fastow, Andrew\n(y) Delainey, David\n(n) Glisan, Ben\n(n) Richter, Jeffrey\n(n) Lawyer, Larry\n(n) Belden, Timothy\n(n) Kopper, Michael\n(n) Duncan, David\n(n) Bowen, Raymond\n(n) Colwell, Wesley\n(n) Boyle, Dan\n(n) Loehr, Christopher\n'

In [3]:
#Here we save it in a python list format to be used later
poi = poi.split('\n')[:-1] #drop the empty entry after the final newline
poi = list(map(lambda x: x.split(')')[1].strip().upper().replace(',',''),poi)) #Extract names from text file
poi[:5]

['LAY KENNETH',
 'SKILLING JEFFREY',
 'HOWARD KEVIN',
 'KRAUTZ MICHAEL',
 'YEAGER SCOTT']

In [4]:
#Here's the text file with POI email IDs
email_poi_data = open('poi_email_ids.txt','r')
poi_emails = email_poi_data.read()
email_poi_data.close()
poi_emails[:1000]

'"kenneth_lay@enron.net",    \n            "kenneth_lay@enron.com",\n            "klay.enron@enron.com",\n            "kenneth.lay@enron.com", \n            "klay@enron.com",\n            "layk@enron.com",\n            "chairman.ken@enron.com",\n            "jeffreyskilling@yahoo.com",\n            "jeff_skilling@enron.com",\n            "jskilling@enron.com",\n            "effrey.skilling@enron.com",\n            "skilling@enron.com",\n            "jeffrey.k.skilling@enron.com",\n            "jeff.skilling@enron.com",\n            "kevin_a_howard.enronxgate.enron@enron.net",\n            "kevin.howard@enron.com",\n            "kevin.howard@enron.net",\n            "kevin.howard@gcm.com",\n            "michael.krautz@enron.com"\n            "scott.yeager@enron.com",\n            "syeager@fyi-net.com",\n            "scott_yeager@enron.net",\n            "syeager@flash.net",\n            "joe\'.\'hirko@enron.com", \n            "joe.hirko@enron.com", \n            "rex.shelby@enron.com",

In [5]:
#We save it in a python list format
poi_emails = poi_emails.split('\n') #one entry per line
#next we strip whitespace, quotes and commas to extract the email ids
poi_emails = list(map(lambda x: x.strip().replace('"','').replace(',',''),poi_emails)) 
poi_emails[:5]

['kenneth_lay@enron.net',
 'kenneth_lay@enron.com',
 'klay.enron@enron.com',
 'kenneth.lay@enron.com',
 'klay@enron.com']
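
The file is not perfectly regular: the excerpt above contains a line with no trailing comma, and splitting on newlines can leave empty entries. As a hedged alternative, a regular expression can pull out whatever sits between double quotes. The sample string and the helper name `extract_quoted_ids` are illustrative and not part of the original notebook.

```python
import re

def extract_quoted_ids(text):
    """Return every double-quoted token in text, in order of appearance."""
    return re.findall(r'"([^"]+)"', text)

# Shortened sample in the same layout as poi_email_ids.txt
sample = (
    '"kenneth_lay@enron.net",    \n'
    '            "kenneth_lay@enron.com",\n'
    '            "michael.krautz@enron.com"\n'
    '            "scott.yeager@enron.com",\n'
)
ids = extract_quoted_ids(sample)
print(ids)
```

Because only quoted tokens are returned, blank lines and stray commas never reach the resulting list.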

In [6]:
#Enron email data
emails_df = pd.read_csv('emails.csv')
emails_df.head()

Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...


In [7]:
#Let's have a first look at our data
print(emails_df.shape)
emails_df.head()

(517401, 2)


Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...


### EDA and Cleaning

As you may have noticed from the first look, our data consists of raw email conversations. We cannot use these as they are to build our classifier, so next we extract features from these raw email files.

In [8]:
print(emails_df.iloc[0,1]) #raw email
emails_df.iloc[0,1].split('\n') #list with features

Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast

 


['Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>',
 'Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)',
 'From: phillip.allen@enron.com',
 'To: tim.belden@enron.com',
 'Subject: ',
 'Mime-Version: 1.0',
 'Content-Type: text/plain; charset=us-ascii',
 'Content-Transfer-Encoding: 7bit',
 'X-From: Phillip K Allen',
 'X-To: Tim Belden <Tim Belden/Enron@EnronXGate>',
 'X-cc: ',
 'X-bcc: ',
 "X-Folder: \\Phillip_Allen_Jan2002_1\\Allen, Phillip K.\\'Sent Mail",
 'X-Origin: Allen-P',
 'X-FileName: pallen (Non-Privileged).pst',
 '',
 'Here is our forecast',
 '',
 ' ']

In [9]:
#Extracting date,sender,receiver and email text from raw emails
months,years,senders,receivers,emails = [],[],[],[],[]
def extract_features(row):
    message = row['message'].split('\n') #converts message into a list
    try:
        email = ''.join(message[15:]) #for every message, the actual email starts from here
        if "Forwarded" not in email: #do not include forwarded emails to avoid duplication
            date = re.search('Date: .*, [0-9]+ (.*) [0-9].*',message[1]).group(1).split() #extract date using regex
            row['Month'] = date[0]
            row['Year'] = int(date[1])
            row['Sender'] = message[2].split()[1]
            row['Receiver'] = message[3].split()[1]
            row['Email'] = email
            return row
    except Exception:
        pass #unparseable messages are skipped; like forwarded ones they return None

In [10]:
emails_df['Month']=''
emails_df['Year']=0
emails_df['Sender']=''
emails_df['Receiver']=''
emails_df['Email']=''

#Apply our function to the entire data
emails_df = emails_df.apply(extract_features,axis=1)
emails_df.dropna(inplace=True)
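
The function above relies on fixed line positions (date on line 1, sender on line 2, body from line 15 onward), which matches the sample shown but returns nothing for a message laid out differently. A minimal alternative sketch, assuming the messages are well-formed RFC 822 style text, uses Python's standard `email` package instead. The helper name `parse_raw_email` is hypothetical, and this is not the method used in the rest of this notebook.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

def parse_raw_email(raw):
    """Return (sent_datetime, sender, first_receiver, body) for one raw message."""
    msg = message_from_string(raw)
    sent = parsedate_to_datetime(msg['Date'])
    sender = msg['From']
    # 'To' may be missing or hold several comma-separated addresses
    to_field = msg['To'] or ''
    receivers = [r.strip() for r in to_field.split(',') if r.strip()]
    first_receiver = receivers[0] if receivers else None
    body = msg.get_payload()
    return sent, sender, first_receiver, body

# Abridged copy of the first message shown earlier
sample = (
    "Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n"
    "From: phillip.allen@enron.com\n"
    "To: tim.belden@enron.com\n"
    "Subject: \n"
    "\n"
    "Here is our forecast\n"
)
sent, sender, receiver, body = parse_raw_email(sample)
print(sent.year, sent.strftime('%b'), sender, receiver, body.strip())
```

Splitting the `To` header on commas also avoids receivers that end with a stray comma, which can happen when only the second whitespace-separated token of the line is kept.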

In [11]:
#We also create additional features
#first based on the folder which email belongs to
emails_df['Folder'] = emails_df['file'].apply(lambda x: x.split('/')[1])
#next length of the email in characters
emails_df['Length'] = emails_df['Email'].apply(lambda x: len(str(x)))

In [12]:
#Creating the target variable
emails_df['POI']=0
emails_df.loc[emails_df['Sender'].isin(poi_emails),'POI'] = 1

Our aim is to flag POI conversations as a way of detecting fraudulent activity. To do so, we classify emails according to whether they were sent by a POI, which gives us a classifier that can flag potentially fraudulent activity from email text. Thus, our label is 1 for every email sent by a POI and 0 otherwise.
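
As a small illustration of the labelling rule, the sketch below applies the same `isin` test to a three-row toy frame. The frame `toy` and the list `toy_poi_emails` are made up for this example; only the addresses themselves appear in the data above.

```python
import pandas as pd

toy = pd.DataFrame({
    'Sender': [
        'kenneth.lay@enron.com',
        'phillip.allen@enron.com',
        'jeff.skilling@enron.com',
    ]
})
toy_poi_emails = ['kenneth.lay@enron.com', 'jeff.skilling@enron.com']

# 1 when the sender address is a known POI address, 0 otherwise
toy['POI'] = toy['Sender'].isin(toy_poi_emails).astype(int)
print(toy)
```

Note that the match is exact, so a POI writing from an address missing from the list would be labelled 0.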

In [13]:
#Let's look at our data now
emails_df.head()

Unnamed: 0,file,message,Month,Year,Sender,Receiver,Email,Folder,Length,POI
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...,May,2001.0,phillip.allen@enron.com,tim.belden@enron.com,Here is our forecast,_sent_mail,21,0
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...,May,2001.0,phillip.allen@enron.com,john.lavorato@enron.com,Traveling to have a business meeting takes the...,_sent_mail,781,0
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...,Oct,2000.0,phillip.allen@enron.com,leah.arsdall@enron.com,test successful. way to go!!!,_sent_mail,30,0
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...,Oct,2000.0,phillip.allen@enron.com,randall.gay@enron.com,"Randy, Can you send me a schedule of the salar...",_sent_mail,181,0
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...,Aug,2000.0,phillip.allen@enron.com,greg.piper@enron.com,Let's shoot for Tuesday at 11:45.,_sent_mail,35,0


Now that we have our data, let's investigate its integrity

In [14]:
#Lets look at different years in our data
list(emails_df['Year'].unique())

[2001.0,
 2000.0,
 1999.0,
 1979.0,
 2002.0,
 1.0,
 2004.0,
 2.0,
 2020.0,
 1998.0,
 2012.0,
 2007.0,
 2005.0,
 1997.0,
 1986.0,
 2024.0,
 2044.0,
 2043.0]

In [15]:
#We notice some irregularities with 1 and 2
emails_df[emails_df['Year'].isin([1.0,2.0])].apply(lambda x: x['message'].split('\n')[1],axis=1).head()

8110     Date: Wed, 21 Dec 0001 22:31:43 -0800 (PST)
8130     Date: Sat, 24 Dec 0001 22:26:44 -0800 (PST)
32825     Date: Sat, 6 Aug 0001 00:06:06 -0800 (PST)
32843    Date: Sat, 13 Aug 0001 00:11:21 -0800 (PST)
32908    Date: Sun, 28 Aug 0001 00:08:57 -0800 (PST)
dtype: object

After looking at the corresponding emails, it looks like the year is missing the '2' in the thousands place. Let's fix this.

In [16]:
#Year was parsed with int(), so '0001' and '0002' are stored as the numbers 1 and 2
emails_df.loc[emails_df['Year']==1,'Year'] = 2001.0
emails_df.loc[emails_df['Year']==2,'Year'] = 2002.0
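
The same missing digit is visible in the raw `Date` headers, so the repair can also be expressed on the header text itself. The sketch below is one possible way to do that with a regular expression; the helper name `repair_year` is hypothetical, and it assumes the only damaged years have the form `000N`.

```python
import re

def repair_year(date_header):
    """Rewrite a year written as 000N in a Date header to 200N; leave everything else alone."""
    return re.sub(r'\b000(\d)\b', r'200\1', date_header)

broken = 'Date: Wed, 21 Dec 0001 22:31:43 -0800 (PST)'
print(repair_year(broken))
```

The word boundaries keep the pattern from touching the timezone offset or the time of day.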

In [17]:
#We also notice some emails dated far in the future, let's investigate these
print(emails_df[emails_df['Year'].isin([2024.0,2044.0,2043.0])].iloc[:,3:-1])
#Looks like these have been sent by 2 individuals, let's check if these are POI
print('cramer@cadvision.com' in poi_emails,'pse6yl706@aloha.net' in poi_emails)
emails_df = emails_df.loc[~emails_df['Year'].isin([2024.0,2044.0,2043.0]),:] #Since they aren't we disregard these

          Year                Sender                 Receiver  \
508375  2024.0   pse6yl706@aloha.net                   (None)   
517039  2044.0  cramer@cadvision.com  john.zufferli@enron.com   
517040  2044.0  cramer@cadvision.com  john.zufferli@enron.com   
517042  2044.0  cramer@cadvision.com  john.zufferli@enron.com   
517045  2043.0  cramer@cadvision.com  linsider.jed@enron.com,   

                                                    Email         Folder  \
508375  <html><head><title>Untitled Document</title><m...  deleted_items   
517039  Howdy, bom went out 35 at 35.5 Feb traded 32.7...          inbox   
517040  BOM  5th to 31st traded 34, 33.5 , 33.5 and  3...          inbox   
517042  feb dec trades 37.5 feb dec LL went out 20 at ...          inbox   
517045  X-cc: X-bcc: X-Folder: \ExMerge - Zufferli, Jo...          inbox   

        Length  
508375    8454  
517039     215  
517040     148  
517042     210  
517045     147  
False False


In [18]:
#Let's look at the number of different individuals in our data
print(len(set(emails_df['Sender'].unique()))) #more than 19k
print(len(set(emails_df['Sender'].unique()).intersection(set(poi_emails)))) 
print(len(set(emails_df['Receiver'].unique()).intersection(set(poi_emails))))
print(len(set(emails_df['Receiver'].unique()).intersection(set(emails_df['Sender'].unique()))))

19929
32
45
7802


In [19]:
#Count the emails that still carry a quoted '-----Original Message-----' block
sum(emails_df['Email'].str.contains('-----Original Message-----'))

#Here we keep only the text before the quoted block, which doesn't contribute to the message
emails_df.loc[emails_df['Email'].str.contains('-----Original Message-----'),'Email'] = \
emails_df.loc[emails_df['Email'].str.contains('-----Original Message-----'),'Email'].\
apply(lambda x: x.split('-----Original Message-----')[0])
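
To make the effect of this step concrete, here is a minimal sketch on a single string. The helper name `strip_quoted_reply` and the sample text are illustrative only; `str.partition` is used here as a compact equivalent of taking the first piece of the split.

```python
MARKER = '-----Original Message-----'

def strip_quoted_reply(text):
    """Keep only the part of text that comes before the first quoted-message marker."""
    return text.partition(MARKER)[0]

reply = 'Sounds good, see you then. -----Original Message-----From: someone Sent: Monday'
print(repr(strip_quoted_reply(reply)))
```

Text without the marker is returned unchanged, so the helper could be applied to every row without first selecting the rows that contain it.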

In [20]:
#To further restrict the scope we only consider employee emails
#extracting only employee email IDs, i.e. unique sender IDs whose domain is 'enron.com'
employee_emails = [x for x in emails_df['Sender'].unique() if ('@' in x) and (x.split('@')[1]=='enron.com')]
#only keep emails where both the sender and the receiver are in this list
emails_df = emails_df.loc[(emails_df['Sender'].isin(employee_emails)) & (emails_df['Receiver'].isin(employee_emails)),:]
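
The domain test is an exact string comparison, so near misses are excluded. The sketch below isolates that test in a hypothetical helper, `is_enron_address`, and runs it on a few addresses that appear earlier in this notebook.

```python
def is_enron_address(address):
    """True when the text after the first '@' is exactly 'enron.com' (case-sensitive)."""
    return '@' in address and address.split('@')[1] == 'enron.com'

candidates = [
    'phillip.allen@enron.com',
    'kenneth_lay@enron.net',
    'cramer@cadvision.com',
    'linsider.jed@enron.com,',  # receiver value with a trailing comma
    '(None)',
]
for address in candidates:
    print(address, is_enron_address(address))
```

The trailing-comma case shows that receivers parsed with a stray comma fail the test even though the underlying address belongs to Enron.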

In [21]:
#We further only consider emails from the year 1999 onwards
emails_df = emails_df.loc[~emails_df['Year'].isin([1979.,1998.]),:]
emails_df.dropna(inplace=True)

In [22]:
#Let's check if there are any duplicates present
sum(emails_df['Email'].duplicated())

60297

We can simply drop these duplicates. However, which copy of an email to keep is decided by which of the duplicates was sent first. To account for this temporal aspect, we sort our data by date.

In [23]:
#Extracting the complete date
emails_df['Date'] = emails_df['message']\
                    .apply(lambda x: re.search('Date: .*, ([0-9]+ .* [0-9]+:[0-9]+:[0-9]+) .*',x.split('\n')[1]).group(1))
emails_df['Date'] = emails_df['Date'].apply(lambda x: x.replace('0001','2001').replace('0002','2002'))

#Datetime object
emails_df['Date'] = pd.to_datetime(emails_df['Date'],format="%d %b %Y %H:%M:%S")
emails_df.sort_values(by='Date',inplace=True) #sort based on date

In [24]:
#Now we can remove duplicates and keep the one which we encounter first
emails_df.drop_duplicates(subset ='Email',keep = 'first',inplace=True) 
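
The sort-then-deduplicate idea can be checked on a toy frame. The data below is invented for illustration; the two pandas calls are the same ones used in the cells above.

```python
import pandas as pd

toy = pd.DataFrame({
    'Email': ['see you at noon', 'budget attached', 'see you at noon'],
    'Date': pd.to_datetime([
        '2001-05-02 09:00:00',
        '2001-05-01 08:00:00',
        '2001-05-01 17:30:00',
    ]),
    'Folder': ['inbox', 'sent', 'sent'],
})

# Sort chronologically, then keep the earliest copy of each email text
deduped = toy.sort_values(by='Date').drop_duplicates(subset='Email', keep='first')
print(deduped)
```

Without the sort, `keep='first'` would retain the copy that happens to come first in file order (here the later `inbox` copy) rather than the earliest one.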

In [25]:
#Let's look at how the lengths of the emails are distributed
print(emails_df['Length'].describe())

count     86320.000000
mean       1193.312442
std        4199.770581
min           2.000000
25%         252.000000
50%         680.000000
75%        1300.000000
max      386187.000000
Name: Length, dtype: float64


In [26]:
#we notice a large gap between the 75th percentile and the max, hinting at the presence of outliers
#in this case they are very long emails which may be marketing/spam emails, hence we keep only emails below the 99th percentile of length
emails_df = emails_df.loc[(emails_df['Length']<np.percentile(emails_df['Length'],99))&(emails_df['Length']>0),:]
emails_df.reset_index(drop=True,inplace=True)
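
To show how a 99th percentile cut-off behaves, the sketch below applies the same rule to a made-up array of 100 lengths with a single extreme value. The numbers are illustrative and unrelated to the Enron data.

```python
import numpy as np

# 99 ordinary lengths plus one extreme value
lengths = np.array(list(range(1, 100)) + [10000])

cutoff = np.percentile(lengths, 99)
kept = lengths[(lengths < cutoff) & (lengths > 0)]
print(cutoff, len(kept), kept.max())
```

Since `np.percentile` interpolates between neighbouring values by default, the cut-off falls between the largest ordinary length and the extreme one, and only the extreme value is removed.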

### Text pre-processing

In [27]:
#We start by expanding contractions
#here we use a list of common contractions found online
with open("contractions.txt") as f: #close the file once the dictionary is loaded
    contractions = json.load(f)
contractions_re = re.compile('(%s)' % '|'.join(contractions.keys()))

#find and expand contractions using pre-defined dictionary
def expand_contractions(s, contractions_dict=contractions):
     def replace(match):
         return contractions_dict[match.group(0)]
     return contractions_re.sub(replace, s)

In [28]:
#Let's look at how this affects the text
print('Before:',emails_df['Email'].iloc[8])
print('After:',expand_contractions(emails_df['Email'].iloc[8]))
emails_df['Email'] = emails_df['Email'].apply(lambda x: expand_contractions(x))

Before: Nobody is leaving - it's a new position.
After: Nobody is leaving - it is a new position.


In [29]:
#Next we perform standard steps like removing stop words and non-alphabetic tokens
#and stemming to restrict our vocabulary
porter = PorterStemmer()
stop_words = set(stopwords.words('english'))
def clean_text(txt):
    return " ".join([porter.stem(i) for i in word_tokenize(txt.lower()) if i.isalpha() and i not in stop_words])
emails_df['Email'] = emails_df['Email'].apply(clean_text)
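
The filtering part of `clean_text` can be illustrated without NLTK. In the sketch below a simple regular expression stands in for `word_tokenize`, `TOY_STOP_WORDS` is a tiny hypothetical stop list, and no stemming is applied, so the output only approximates what the cell above produces.

```python
import re

TOY_STOP_WORDS = {"us", "for", "at", "the", "a", "is"}  # hypothetical, far smaller than NLTK's list

def filter_tokens(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, split into word and punctuation tokens, keep alphabetic non-stop-words."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return " ".join(t for t in tokens if t.isalpha() and t not in stop_words)

print(filter_tokens("Let us shoot for Tuesday at 11:45."))
```

The `isalpha` test is what removes numbers and punctuation: the time `11:45` is split into `11`, `:` and `45`, none of which are alphabetic.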

In [30]:
emails_df.head()

Unnamed: 0,file,message,Month,Year,Sender,Receiver,Email,Folder,Length,POI,Date
0,taylor-m/all_documents/87.,Message-ID: <24886497.1075859876880.JavaMail.e...,Jan,1999.0,mark.taylor@enron.com,scott.sefton@enron.com,happi welcom back sorri hear travel problem kn...,all_documents,755,0,1999-01-05 04:04:00
1,taylor-m/all_documents/88.,Message-ID: <11362494.1075859876902.JavaMail.e...,Jan,1999.0,mark.taylor@enron.com,jenny.helton@enron.com,thursday time look better pmto dale ect bob ec...,all_documents,521,0,1999-01-05 06:54:00
2,taylor-m/all_documents/96.,Message-ID: <8591399.1075859877135.JavaMail.ev...,Jan,1999.0,mark.taylor@enron.com,carol.clair@enron.com,carol tana mention meet credit guy discuss asp...,all_documents,872,0,1999-01-08 03:38:00
3,taylor-m/sent/87.,Message-ID: <8455803.1075860048091.JavaMail.ev...,Jan,1999.0,mark.taylor@enron.com,peter.keohane@enron.com,understand tana disagre advic receiv outsid co...,sent,672,0,1999-01-08 08:15:00
4,taylor-m/sent/88.,Message-ID: <13667607.1075860048113.JavaMail.e...,Jan,1999.0,mark.taylor@enron.com,richard.sanders@enron.com,us chang swap agreement form provid arbitr ord...,sent,960,0,1999-01-08 08:54:00


In [31]:
#Finally we save our data as a checkpoint
# emails_df.to_csv('cleaned_emails.csv')

Now that we have our data ready, we can proceed with the next steps of building our POI classifier, continued in Part II.