## Enron emails to data.world


Convert the Enron email data into something easier to use in data.world.

Data cleansing is based *roughly* on: https://www.kaggle.com/zichen/d/wcukierski/enron-email-dataset/explore-enron/code


### Labels

The records with `labeled` set were labeled by [CMU students](https://www.cs.cmu.edu/~./enron/).

There are up to 12 categories per email:
* Cat_[1-12]_level_1 = top-level category
* Cat_[1-12]_level_2 = second-level category
* Cat_[1-12]_weight = frequency with which this category was assigned to this message

Here are the categories:


* 1 Coarse genre
 * 1.1 Company Business, Strategy, etc. (elaborate in Section 3 [Topics])
 * 1.2 Purely Personal
 * 1.3 Personal but in professional context (e.g., it was good working with you)
 * 1.4 Logistic Arrangements (meeting scheduling, technical support, etc)
 * 1.5 Employment arrangements (job seeking, hiring, recommendations, etc)
 * 1.6 Document editing/checking (collaboration)
 * 1.7 Empty message (due to missing attachment)
 * 1.8 Empty message
* 2 Included/forwarded information 
 * 2.1 Includes new text in addition to forwarded material
 * 2.2 Forwarded email(s) including replies
 * 2.3 Business letter(s) / document(s)
 * 2.4 News article(s)
 * 2.5 Government / academic report(s)
 * 2.6 Government action(s) (such as results of a hearing, etc)
 * 2.7 Press release(s)
 * 2.8 Legal documents (complaints, lawsuits, advice)
 * 2.9 Pointers to url(s)
 * 2.10 Newsletters
 * 2.11 Jokes, humor (related to business)
 * 2.12 Jokes, humor (unrelated to business)
 * 2.13 Attachment(s) (assumed missing)
* 3 Primary topics (if coarse genre 1.1 is selected) 
 * 3.1 regulations and regulators (includes price caps)
 * 3.2 internal projects -- progress and strategy
 * 3.3 company image -- current
 * 3.4 company image -- changing / influencing
 * 3.5 political influence / contributions / contacts
 * 3.6 california energy crisis / california politics
 * 3.7 internal company policy
 * 3.8 internal company operations
 * 3.9 alliances / partnerships
 * 3.10 legal advice
 * 3.11 talking points
 * 3.12 meeting minutes
 * 3.13 trip reports
* 4 Emotional tone (if not neutral) 
 * 4.1 jubilation
 * 4.2 hope / anticipation
 * 4.3 humor
 * 4.4 camaraderie
 * 4.5 admiration
 * 4.6 gratitude
 * 4.7 friendship / affection
 * 4.8 sympathy / support
 * 4.9 sarcasm
 * 4.10 secrecy / confidentiality
 * 4.11 worry / anxiety
 * 4.12 concern
 * 4.13 competitiveness / aggressiveness
 * 4.14 triumph / gloating
 * 4.15 pride
 * 4.16 anger / agitation
 * 4.17 sadness / despair
 * 4.18 shame
 * 4.19 dislike / scorn
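
A minimal sketch of turning a `Cat_N_level_1` / `Cat_N_level_2` pair back into a readable label. The lookup tables and the `describe_category` helper are hypothetical and copy only a handful of entries from the list above; the sketch also assumes the codes are stored as the numeric strings read from the label files.

```python
# Hypothetical lookup tables; only a few entries are copied from the list above.
LEVEL_1_NAMES = {
    "1": "Coarse genre",
    "2": "Included/forwarded information",
    "3": "Primary topics",
    "4": "Emotional tone",
}
LEVEL_2_NAMES = {
    ("1", "1"): "Company Business, Strategy, etc.",
    ("1", "2"): "Purely Personal",
    ("3", "6"): "california energy crisis / california politics",
    ("4", "10"): "secrecy / confidentiality",
}


def describe_category(level_1, level_2):
    """Return a label of the form '<level_1>.<level_2> <group>: <category>'."""
    top = LEVEL_1_NAMES.get(str(level_1), "unknown")
    sub = LEVEL_2_NAMES.get((str(level_1), str(level_2)), "unknown")
    return "{}.{} {}: {}".format(level_1, level_2, top, sub)


print(describe_category("4", "10"))
```

Codes missing from the tables come back as `unknown`, so a partial table does not raise.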




In [181]:
import os, sys, email
import numpy as np 
import pandas as pd
from boto.s3.key import Key
import boto
import zipfile

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

emails.csv



In [182]:
# Read the data into a DataFrame
emails_df = pd.read_csv('../input/emails.csv')
print(emails_df.shape)
emails_df.head()

(517401, 2)


Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...


In [183]:
# A single message looks like this
print(emails_df['message'][0])

Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast

 


In [184]:
## Helper functions
def get_text_from_email(msg):
    '''To get the content from email objects'''
    parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            parts.append( part.get_payload() )
    return ''.join(parts)

def split_email_addresses(line):
    '''To separate multiple email addresses'''
    if line:
        addrs = line.split(',')
        addrs = frozenset(map(lambda x: x.strip(), addrs))
    else:
        addrs = None
    return addrs
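
To see what the two helpers above do, here is a small standalone sketch on a made-up message (the header values are illustrative, not taken from the corpus). It also shows `email.utils.getaddresses` from the standard library as a possible alternative for headers that carry display names; the notebook itself does not use it, and the `To` values shown in this notebook are plain addresses where the comma split is enough.

```python
import email
from email.utils import getaddresses

# Hypothetical message used only for illustration.
raw = (
    "From: phillip.allen@enron.com\n"
    "To: matt.smith@enron.com, jay.reitmeyer@enron.com\n"
    "Subject: example\n"
    "\n"
    "Here is our forecast\n"
)
msg = email.message_from_string(raw)

# Same approach as split_email_addresses: split on commas, strip whitespace.
naive = frozenset(addr.strip() for addr in msg["To"].split(","))

# A comma inside a quoted display name breaks the naive split ...
tricky = '"Smith, Matt" <matt.smith@enron.com>, jay.reitmeyer@enron.com'
naive_tricky = [part.strip() for part in tricky.split(",")]

# ... while getaddresses returns (display name, address) pairs.
parsed_tricky = frozenset(addr for _name, addr in getaddresses([tricky]))

# Same approach as get_text_from_email: join every text/plain part.
body = "".join(
    part.get_payload()
    for part in msg.walk()
    if part.get_content_type() == "text/plain"
)

print(sorted(naive))
print(naive_tricky)
print(sorted(parsed_tricky))
```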


In [185]:
# Parse the emails into a list of email objects
messages = list(map(email.message_from_string, emails_df['message']))
emails_df.drop('message', axis=1, inplace=True)
# Get fields from parsed email objects
keys = messages[0].keys()
for key in keys:
    emails_df[key] = [doc[key] for doc in messages]
# Parse content from emails
emails_df['content'] = list(map(get_text_from_email, messages))
# Split multiple email addresses
emails_df['From'] = emails_df['From'].map(split_email_addresses)
emails_df['To'] = emails_df['To'].map(split_email_addresses)

# Extract the root of 'file' as 'user'
emails_df['user'] = emails_df['file'].map(lambda x:x.split('/')[0])
del messages

emails_df.head()

Unnamed: 0,file,Message-ID,Date,From,To,Subject,Mime-Version,Content-Type,Content-Transfer-Encoding,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,X-FileName,content,user
0,allen-p/_sent_mail/1.,<18782981.1075855378110.JavaMail.evans@thyme>,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",(phillip.allen@enron.com),(tim.belden@enron.com),,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Tim Belden <Tim Belden/Enron@EnronXGate>,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,pallen (Non-Privileged).pst,Here is our forecast\n\n,allen-p
1,allen-p/_sent_mail/10.,<15464986.1075855378456.JavaMail.evans@thyme>,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",(phillip.allen@enron.com),(john.lavorato@enron.com),Re:,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,John J Lavorato <John J Lavorato/ENRON@enronXg...,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,pallen (Non-Privileged).pst,Traveling to have a business meeting takes the...,allen-p
2,allen-p/_sent_mail/100.,<24216240.1075855687451.JavaMail.evans@thyme>,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",(phillip.allen@enron.com),(leah.arsdall@enron.com),Re: test,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Leah Van Arsdall,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,test successful. way to go!!!,allen-p
3,allen-p/_sent_mail/1000.,<13505866.1075863688222.JavaMail.evans@thyme>,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)",(phillip.allen@enron.com),(randall.gay@enron.com),,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Randall L Gay,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,"Randy,\n\n Can you send me a schedule of the s...",allen-p
4,allen-p/_sent_mail/1001.,<30922949.1075863688243.JavaMail.evans@thyme>,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",(phillip.allen@enron.com),(greg.piper@enron.com),Re: Hello,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Greg Piper,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,Let's shoot for Tuesday at 11:45.,allen-p


In [186]:
print('shape of the dataframe:', emails_df.shape)
# Find the number of unique values in each column
for col in emails_df.columns:
    print(col, emails_df[col].nunique())

shape of the dataframe: (517401, 18)
file 517401
Message-ID 517401
Date 224128
From 20328
To 54748
Subject 159290
Mime-Version 1
Content-Type 2
Content-Transfer-Encoding 3
X-From 27980
X-To 73552
X-cc 33701
X-bcc 132
X-Folder 5335
X-Origin 259
X-FileName 429
content 249025
user 150


In [187]:
# Set index and drop columns with too few values
emails_df = emails_df.set_index('Message-ID')\
    .drop(['file', 'Mime-Version', 'Content-Type', 'Content-Transfer-Encoding'], axis=1)
# Parse datetime
emails_df['Date'] = pd.to_datetime(emails_df['Date'], infer_datetime_format=True)
emails_df.dtypes

Date          datetime64[ns]
From                  object
To                    object
Subject               object
X-From                object
X-To                  object
X-cc                  object
X-bcc                 object
X-Folder              object
X-Origin              object
X-FileName            object
content               object
user                  object
dtype: object
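
The first message's header reads `16:39:00 -0700`, while the final `head()` output of this notebook shows `2001-05-14 23:39:00`, which is consistent with the UTC offset being applied during parsing. The sketch below reproduces that conversion for a single header using only the standard library; it is an illustration of the arithmetic, not the pandas code path used above.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

# Date header of the first message, as printed earlier in the notebook.
raw_date = "Mon, 14 May 2001 16:39:00 -0700 (PDT)"

local = parsedate_to_datetime(raw_date)      # timezone-aware, offset -07:00
as_utc = local.astimezone(timezone.utc)      # same instant expressed in UTC
naive_utc = as_utc.replace(tzinfo=None)      # drop tzinfo for a naive timestamp

print(naive_utc)
```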

In [200]:
def save_to_s3(file_name):
    s3 = boto.connect_s3()
    b = s3.get_bucket('brianray')
    k = Key(b)
    k.key = file_name
    k.set_contents_from_filename(file_name)
    k.set_acl('public-read')
    return k.generate_url(expires_in=0, query_auth=False)

def zipit(file_name):
    zip_file_name = "{}.zip".format(file_name)
    zf = zipfile.ZipFile(zip_file_name, 'w', zipfile.ZIP_DEFLATED)
    try:
        zf.write(file_name)
    finally:
        zf.close()
    return zip_file_name
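
A quick way to check the zip helper without touching S3 is to round-trip a small file in a temporary directory. This is a sketch: the file name and contents are hypothetical, and `zipit` is restated here with a context manager so the block runs on its own.

```python
import os
import tempfile
import zipfile


def zipit(file_name):
    """Same behaviour as the helper above, written with a context manager."""
    zip_file_name = "{}.zip".format(file_name)
    with zipfile.ZipFile(zip_file_name, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(file_name)
    return zip_file_name


# Hypothetical file name and contents, written to a temporary directory.
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "example.csv")
with open(csv_path, "w") as handle:
    handle.write("Message-ID,labeled\n<1@example>,False\n")

zip_path = zipit(csv_path)
with zipfile.ZipFile(zip_path) as zf:
    names = zf.namelist()
    bad_member = zf.testzip()  # None means every member passed its CRC check

print(zip_path, names, bad_member)
```

`zf.write(file_name)` stores the path as given (minus any leading slash), so passing `arcname=os.path.basename(file_name)` is one option if only the bare file name should appear inside the archive.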

In [189]:
import glob
list_found = {}
cats = []
for path in glob.glob("enron_with_categories/*/*.txt"):
    batch, filename = path.split("/")[1:]
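    # Read the message text; the context manager closes the file handle
    with open(path, "r") as f:
        contents = f.read()

    try:
        email_parsed = email.message_from_string(contents)
        # The sibling .cats file holds one comma-separated triple per token
        with open(path.replace(".txt", ".cats")) as f:
            list_found[email_parsed['Message-ID']] = [x.split(',')
                                                      for x in f.read().split()]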
    except Exception as e:
        print("error: {}".format(e))
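
The loop above treats each `.cats` file as whitespace-separated tokens of the form `level_1,level_2,weight`, which is what the later unpacking into `lev1, lev2, weight` relies on. A minimal sketch of that parsing step follows; the `parse_cats` helper and the sample text are hypothetical, and values are kept as strings as in the notebook.

```python
def parse_cats(text):
    """Parse the text of a .cats file into (level_1, level_2, weight) tuples."""
    triples = []
    for token in text.split():      # one comma-separated triple per token
        fields = token.split(",")
        if len(fields) != 3:
            continue                # skip malformed tokens instead of failing later
        level_1, level_2, weight = fields
        triples.append((level_1, level_2, weight))
    return triples


sample = "1,1,2\n3,6,1\n4,10,1\n"  # hypothetical file contents
print(parse_cats(sample))
```

Skipping tokens that do not have exactly three fields avoids an unpacking error further down, at the cost of silently dropping them; logging the skipped token would be a reasonable variation.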
        


In [190]:
# Create empty columns for category slots 1 through 12
for x in range(1, 13):
    emails_df['Cat_{}_level_1'.format(x)] = None
    emails_df['Cat_{}_level_2'.format(x)] = None
    emails_df['Cat_{}_weight'.format(x)] = None

In [191]:
emails_df.columns

Index(['Date', 'From', 'To', 'Subject', 'X-From', 'X-To', 'X-cc', 'X-bcc',
       'X-Folder', 'X-Origin', 'X-FileName', 'content', 'user',
       'Cat_1_level_1', 'Cat_1_level_2', 'Cat_1_weight', 'Cat_2_level_1',
       'Cat_2_level_2', 'Cat_2_weight', 'Cat_3_level_1', 'Cat_3_level_2',
       'Cat_3_weight', 'Cat_4_level_1', 'Cat_4_level_2', 'Cat_4_weight',
       'Cat_5_level_1', 'Cat_5_level_2', 'Cat_5_weight', 'Cat_6_level_1',
       'Cat_6_level_2', 'Cat_6_weight', 'Cat_7_level_1', 'Cat_7_level_2',
       'Cat_7_weight', 'Cat_8_level_1', 'Cat_8_level_2', 'Cat_8_weight',
       'Cat_9_level_1', 'Cat_9_level_2', 'Cat_9_weight', 'Cat_10_level_1',
       'Cat_10_level_2', 'Cat_10_weight', 'Cat_11_level_1', 'Cat_11_level_2',
       'Cat_11_weight', 'Cat_12_level_1', 'Cat_12_level_2', 'Cat_12_weight'],
      dtype='object')

In [192]:
emails_df['labeled'] = False       
for item, val in list_found.items():
    emails_df.loc[item, 'labeled'] = True
    # Category slots are numbered from 1 to match the Cat_N_* column names
    for i, (lev1, lev2, weight) in enumerate(val, start=1):
        emails_df.loc[item, 'Cat_{}_level_1'.format(i)] = lev1
        emails_df.loc[item, 'Cat_{}_level_2'.format(i)] = lev2
        emails_df.loc[item, 'Cat_{}_weight'.format(i)] = weight

In [193]:
emails_df.columns

Index(['Date', 'From', 'To', 'Subject', 'X-From', 'X-To', 'X-cc', 'X-bcc',
       'X-Folder', 'X-Origin', 'X-FileName', 'content', 'user',
       'Cat_1_level_1', 'Cat_1_level_2', 'Cat_1_weight', 'Cat_2_level_1',
       'Cat_2_level_2', 'Cat_2_weight', 'Cat_3_level_1', 'Cat_3_level_2',
       'Cat_3_weight', 'Cat_4_level_1', 'Cat_4_level_2', 'Cat_4_weight',
       'Cat_5_level_1', 'Cat_5_level_2', 'Cat_5_weight', 'Cat_6_level_1',
       'Cat_6_level_2', 'Cat_6_weight', 'Cat_7_level_1', 'Cat_7_level_2',
       'Cat_7_weight', 'Cat_8_level_1', 'Cat_8_level_2', 'Cat_8_weight',
       'Cat_9_level_1', 'Cat_9_level_2', 'Cat_9_weight', 'Cat_10_level_1',
       'Cat_10_level_2', 'Cat_10_weight', 'Cat_11_level_1', 'Cat_11_level_2',
       'Cat_11_weight', 'Cat_12_level_1', 'Cat_12_level_2', 'Cat_12_weight',
       'labeled'],
      dtype='object')

In [194]:
emails_df[emails_df['labeled'] == True]

Message-ID,Date,From,To,Subject,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,...,Cat_10_level_1,Cat_10_level_2,Cat_10_weight,Cat_11_level_1,Cat_11_level_2,Cat_11_weight,Cat_12_level_1,Cat_12_level_2,Cat_12_weight,labeled
<9831685.1075855725804.JavaMail.evans@thyme>,2001-03-15 14:45:00,(phillip.allen@enron.com),(todd.burke@enron.com),Re: Confidential Employee Information/Lenhart,Phillip K Allen,Todd Burke,,,\Phillip_Allen_June2001\Notes Folders\'sent mail,Allen-P,...,,,,,,,,,,True
<21041312.1075855725847.JavaMail.evans@thyme>,2001-03-15 14:11:00,(phillip.allen@enron.com),(kim.bolton@enron.com),RE: PERSONAL AND CONFIDENTIAL COMPENSATION INF...,Phillip K Allen,Kim Bolton,,,\Phillip_Allen_June2001\Notes Folders\'sent mail,Allen-P,...,,,,,,,,,,True
<5907100.1075858639941.JavaMail.evans@thyme>,2001-06-20 17:04:51,(k..allen@enron.com),"(matt.smith@enron.com, jay.reitmeyer@enron.com...",FW: Western Wholesale Activities - Gas & Power...,"Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENT...","Lenhart, Matthew </O=ENRON/OU=NA/CN=RECIPIENTS...",,,"\PALLEN (Non-Privileged)\Allen, Phillip K.\Sen...",Allen-P,...,,,,,,,,,,True
<26625142.1075858639964.JavaMail.evans@thyme>,2001-06-20 17:09:00,(k..allen@enron.com),"(jay.reitmeyer@enron.com, matt.smith@enron.com...",FW: Western Wholesale Activities - Gas & Power...,"Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENT...","Lenhart, Matthew </O=ENRON/OU=NA/CN=RECIPIENTS...",,,"\PALLEN (Non-Privileged)\Allen, Phillip K.\Sen...",Allen-P,...,,,,,,,,,,True
<19730598.1075858642129.JavaMail.evans@thyme>,2001-08-09 12:30:58,(k..allen@enron.com),"(matt.smith@enron.com, m..tholt@enron.com)",FW: Western Wholesale Activities - Gas & Power...,"Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENT...","Smith, Matt </O=ENRON/OU=NA/CN=RECIPIENTS/CN=M...",,,"\PALLEN (Non-Privileged)\Allen, Phillip K.\Sen...",Allen-P,...,,,,,,,,,,True
<21261996.1075858638025.JavaMail.evans@thyme>,2001-05-07 19:28:00,(phillip.allen@enron.com),"(jay.reitmeyer@enron.com, matt.smith@enron.com...",Re: Western Wholesale Activities - Gas & Power...,Phillip K Allen,"Matthew Lenhart <Matthew Lenhart/HOU/ECT@ECT>,...",,,"\PALLEN (Non-Privileged)\Allen, Phillip K.\Sen...",Allen-P,...,,,,,,,,,,True
<20399547.1075857614321.JavaMail.evans@thyme>,2000-12-19 16:55:00,(john.arnold@enron.com),(jeanie.slone@enron.com),Re: confidential employee information-dutch qu...,John Arnold,Jeanie Slone,,,\John_Arnold_Jun2001\Notes Folders\All documents,Arnold-J,...,,,,,,,,,,True
<860767.1075849626951.JavaMail.evans@thyme>,2000-12-14 10:51:00,(matt.harris@enron.com),(sarah-joy.hunter@enron.com),Re: HP -- confidential internal document,Matt Harris,Sarah-Joy Hunter,"Dale Clark, Jennifer Medcalf, Patrick Tucker, ...",,\John_Arnold_Nov2001\Notes Folders\All documents,ARNOLD-J,...,,,,,,,,,,True
<17578964.1075849627055.JavaMail.evans@thyme>,2000-12-14 17:48:00,(matt.harris@enron.com),(patrick.tucker@enron.com),Re: HP -- confidential internal document,Matt Harris,Patrick Tucker,"Dale Clark, Jennifer Medcalf",,\John_Arnold_Nov2001\Notes Folders\All documents,ARNOLD-J,...,,,,,,,,,,True
<24049587.1075849626031.JavaMail.evans@thyme>,2000-12-12 16:42:00,(sarah-joy.hunter@enron.com),(matt.harris@enron.com),HP -- confidential internal document,Sarah-Joy Hunter,Matt Harris,"Patrick Tucker, Peter Goebel, Dale Clark, Jenn...",,\John_Arnold_Nov2001\Notes Folders\All documents,ARNOLD-J,...,,,,,,,,,,True


In [195]:
len(emails_df.columns)

50

In [196]:
emails_df.reset_index(level=0, inplace=True)

In [197]:
emails_df.head()

Unnamed: 0,Message-ID,Date,From,To,Subject,X-From,X-To,X-cc,X-bcc,X-Folder,...,Cat_10_level_1,Cat_10_level_2,Cat_10_weight,Cat_11_level_1,Cat_11_level_2,Cat_11_weight,Cat_12_level_1,Cat_12_level_2,Cat_12_weight,labeled
0,<18782981.1075855378110.JavaMail.evans@thyme>,2001-05-14 23:39:00,(phillip.allen@enron.com),(tim.belden@enron.com),,Phillip K Allen,Tim Belden <Tim Belden/Enron@EnronXGate>,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",...,,,,,,,,,,False
1,<15464986.1075855378456.JavaMail.evans@thyme>,2001-05-04 20:51:00,(phillip.allen@enron.com),(john.lavorato@enron.com),Re:,Phillip K Allen,John J Lavorato <John J Lavorato/ENRON@enronXg...,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",...,,,,,,,,,,False
2,<24216240.1075855687451.JavaMail.evans@thyme>,2000-10-18 10:00:00,(phillip.allen@enron.com),(leah.arsdall@enron.com),Re: test,Phillip K Allen,Leah Van Arsdall,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,...,,,,,,,,,,False
3,<13505866.1075863688222.JavaMail.evans@thyme>,2000-10-23 13:13:00,(phillip.allen@enron.com),(randall.gay@enron.com),,Phillip K Allen,Randall L Gay,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,...,,,,,,,,,,False
4,<30922949.1075863688243.JavaMail.evans@thyme>,2000-08-31 12:07:00,(phillip.allen@enron.com),(greg.piper@enron.com),Re: Hello,Phillip K Allen,Greg Piper,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,...,,,,,,,,,,False


In [202]:
filename = "enron_05_17_2015_with_labels.csv"
emails_df.to_csv(filename)
save_to_s3(zipit(filename))

'https://brianray.s3.amazonaws.com:443/enron_05_17_2015_with_labels.csv.zip'

In [204]:


chunks = emails_df.groupby(np.arange(len(emails_df)) // 100000)
for i, chunk in chunks:
    name = "enron_05_17_2015_with_labels_100K_chunk_{}_of_{}.csv".format(i+1, len(chunks))
    chunk.to_csv(name)
    print(save_to_s3(zipit(name)))

https://brianray.s3.amazonaws.com:443/enron_05_17_2015_with_labels_100K_chunk_1_of_6.csv.zip
https://brianray.s3.amazonaws.com:443/enron_05_17_2015_with_labels_100K_chunk_2_of_6.csv.zip
https://brianray.s3.amazonaws.com:443/enron_05_17_2015_with_labels_100K_chunk_3_of_6.csv.zip
https://brianray.s3.amazonaws.com:443/enron_05_17_2015_with_labels_100K_chunk_4_of_6.csv.zip
https://brianray.s3.amazonaws.com:443/enron_05_17_2015_with_labels_100K_chunk_5_of_6.csv.zip
https://brianray.s3.amazonaws.com:443/enron_05_17_2015_with_labels_100K_chunk_6_of_6.csv.zip
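
The six chunk files follow from integer division of the row position by 100000. The sketch below reproduces that arithmetic with plain Python, assuming the row count is still the 517401 reported by `emails_df.shape` earlier in the notebook.

```python
from collections import Counter

n_rows = 517401      # row count reported earlier; assumed unchanged at this point
chunk_size = 100000

# Row i lands in group i // chunk_size, mirroring np.arange(len(df)) // 100000.
sizes = Counter(i // chunk_size for i in range(n_rows))
n_chunks = len(sizes)

names = [
    "enron_05_17_2015_with_labels_100K_chunk_{}_of_{}.csv".format(k + 1, n_chunks)
    for k in sorted(sizes)
]

print(n_chunks, sizes[n_chunks - 1])
print(names[0])
```

Under that assumption the first five chunks hold 100000 rows each and the last one holds the remaining 17401.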
