## Enron emails to data.world


Convert the Enron email data into a form that is easier to use in data.world.

Data cleansing is based *roughly* on: https://www.kaggle.com/zichen/d/wcukierski/enron-email-dataset/explore-enron/code


### Labels

The records with `labeled` set were labeled by [CMU students](https://www.cs.cmu.edu/~./enron/).

There are up to 12 categories per email:  
* Cat_[1-12]_level_1 = top-level category
* Cat_[1-12]_level_2 = second-level category
* Cat_[1-12]_level_weight = frequency with which this category was assigned to this message

Here are the categories:


* 1 Coarse genre
 * 1.1 Company Business, Strategy, etc. (elaborate in Section 3 [Topics])
 * 1.2 Purely Personal
 * 1.3 Personal but in professional context (e.g., it was good working with you)
 * 1.4 Logistic Arrangements (meeting scheduling, technical support, etc)
 * 1.5 Employment arrangements (job seeking, hiring, recommendations, etc)
 * 1.6 Document editing/checking (collaboration)
 * 1.7 Empty message (due to missing attachment)
 * 1.8 Empty message
* 2 Included/forwarded information 
 * 2.1 Includes new text in addition to forwarded material
 * 2.2 Forwarded email(s) including replies
 * 2.3 Business letter(s) / document(s)
 * 2.4 News article(s)
 * 2.5 Government / academic report(s)
 * 2.6 Government action(s) (such as results of a hearing, etc)
 * 2.7 Press release(s)
 * 2.8 Legal documents (complaints, lawsuits, advice)
 * 2.9 Pointers to url(s)
 * 2.10 Newsletters
 * 2.11 Jokes, humor (related to business)
 * 2.12 Jokes, humor (unrelated to business)
 * 2.13 Attachment(s) (assumed missing)
* 3 Primary topics (if coarse genre 1.1 is selected) 
 * 3.1 regulations and regulators (includes price caps)
 * 3.2 internal projects -- progress and strategy
 * 3.3 company image -- current
 * 3.4 company image -- changing / influencing
 * 3.5 political influence / contributions / contacts
 * 3.6 california energy crisis / california politics
 * 3.7 internal company policy
 * 3.8 internal company operations
 * 3.9 alliances / partnerships
 * 3.10 legal advice
 * 3.11 talking points
 * 3.12 meeting minutes
 * 3.13 trip reports
* 4 Emotional tone (if not neutral) 
 * 4.1 jubilation
 * 4.2 hope / anticipation
 * 4.3 humor
 * 4.4 camaraderie
 * 4.5 admiration
 * 4.6 gratitude
 * 4.7 friendship / affection
 * 4.8 sympathy / support
 * 4.9 sarcasm
 * 4.10 secrecy / confidentiality
 * 4.11 worry / anxiety
 * 4.12 concern
 * 4.13 competitiveness / aggressiveness
 * 4.14 triumph / gloating
 * 4.15 pride
 * 4.16 anger / agitation
 * 4.17 sadness / despair
 * 4.18 shame
 * 4.19 dislike / scorn
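
The column layout described above can be read back one record at a time. The following is a minimal sketch, assuming each record is available as a dict keyed by column name (for example a row from `csv.DictReader`) and that the columns follow the `Cat_[1-12]_level_1`, `Cat_[1-12]_level_2`, `Cat_[1-12]_level_weight` pattern literally. The helper name `labels_for_record` and the sample values are hypothetical.

```python
def labels_for_record(row, max_cats=12):
    """Collect (level_1, level_2, weight) for every category slot filled in one record."""
    labels = []
    for i in range(1, max_cats + 1):
        top = row.get(f"Cat_{i}_level_1")
        if top in (None, ""):
            continue  # this slot was not used for this email
        labels.append((
            top,
            row.get(f"Cat_{i}_level_2"),
            row.get(f"Cat_{i}_level_weight"),
        ))
    return labels

# Made-up record: genre 1.1 assigned twice, topic 3.6 assigned once
record = {
    "Cat_1_level_1": "1", "Cat_1_level_2": "1", "Cat_1_level_weight": "2",
    "Cat_2_level_1": "3", "Cat_2_level_2": "6", "Cat_2_level_weight": "1",
    "Cat_3_level_1": "",
}
print(labels_for_record(record))
```

Each tuple pairs a top-level category with its second-level category, so `("3", "6", "1")` would correspond to entry 3.6 in the list above.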




In [97]:
import os, sys, email
import numpy as np 
import pandas as pd
import zipfile

from subprocess import check_output

In [98]:
# Read the data into a DataFrame
emails_df = pd.read_csv('./emails.csv')
print(emails_df.shape)
emails_df.head()

(517401, 2)


,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...


In [99]:
## Helper functions
def get_text_from_email(msg, max_word_len=30):
    '''Return the plain-text content of an email object, skipping tokens of max_word_len or more characters'''
    parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            payload = part.get_payload()
            # Collapse all whitespace and drop overly long tokens
            payload = ' '.join(w for w in payload.split() if len(w) < max_word_len)
            parts.append(payload)
    return ''.join(parts)

def split_email_addresses(line):
    '''Split a comma-separated header value into a frozenset of addresses (None if empty)'''
    if line:
        addrs = line.split(',')
        addrs = frozenset(a.strip() for a in addrs)
    else:
        addrs = None
    return addrs
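
`split_email_addresses` splits on commas, which suits the bare addresses in the `From` and `To` headers shown here but would break on a display name that itself contains a comma. A minimal sketch of an alternative, using the standard library's `email.utils.getaddresses` rather than the approach in this notebook; the function name `split_addresses_rfc` and the sample header are hypothetical.

```python
from email.utils import getaddresses

def split_addresses_rfc(line):
    """Return a frozenset of bare addresses from a header value, or None if it is empty."""
    if not line:
        return None
    # getaddresses understands quoted display names, so commas inside quotes are safe
    pairs = getaddresses([line])
    return frozenset(addr for _name, addr in pairs if addr)

header = '"Allen, Phillip K." <phillip.allen@enron.com>, tim.belden@enron.com'
print(split_addresses_rfc(header))
```

A plain `split(',')` on the same header would produce three fragments, two of them not valid addresses.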


In [100]:
# Parse the emails into a list of email objects
messages = list(map(email.message_from_string, emails_df['message']))

# Get fields from parsed email objects
keys = messages[0].keys()
for key in keys:
    emails_df[key] = [doc[key] for doc in messages]
# Parse content from emails
emails_df['content'] = list(map(get_text_from_email, messages))
# Split multiple email addresses
emails_df['From'] = emails_df['From'].map(split_email_addresses)
emails_df['To'] = emails_df['To'].map(split_email_addresses)

# Extract the root of 'file' as 'user'
emails_df['user'] = emails_df['file'].map(lambda x:x.split('/')[0])

# cleanup
del messages
emails_df.drop('message', axis=1, inplace=True)

emails_df.head()

,file,Message-ID,Date,From,To,Subject,Mime-Version,Content-Type,Content-Transfer-Encoding,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,X-FileName,content,user
0,allen-p/_sent_mail/1.,<18782981.1075855378110.JavaMail.evans@thyme>,"Mon, 14 May 2001 16:39:00 -0700 (PDT)",(phillip.allen@enron.com),(tim.belden@enron.com),,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Tim Belden <Tim Belden/Enron@EnronXGate>,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,pallen (Non-Privileged).pst,Here is our forecast,allen-p
1,allen-p/_sent_mail/10.,<15464986.1075855378456.JavaMail.evans@thyme>,"Fri, 4 May 2001 13:51:00 -0700 (PDT)",(phillip.allen@enron.com),(john.lavorato@enron.com),Re:,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,John J Lavorato <John J Lavorato/ENRON@enronXg...,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,pallen (Non-Privileged).pst,Traveling to have a business meeting takes the...,allen-p
2,allen-p/_sent_mail/100.,<24216240.1075855687451.JavaMail.evans@thyme>,"Wed, 18 Oct 2000 03:00:00 -0700 (PDT)",(phillip.allen@enron.com),(leah.arsdall@enron.com),Re: test,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Leah Van Arsdall,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,test successful. way to go!!!,allen-p
3,allen-p/_sent_mail/1000.,<13505866.1075863688222.JavaMail.evans@thyme>,"Mon, 23 Oct 2000 06:13:00 -0700 (PDT)",(phillip.allen@enron.com),(randall.gay@enron.com),,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Randall L Gay,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,"Randy, Can you send me a schedule of the salar...",allen-p
4,allen-p/_sent_mail/1001.,<30922949.1075863688243.JavaMail.evans@thyme>,"Thu, 31 Aug 2000 05:07:00 -0700 (PDT)",(phillip.allen@enron.com),(greg.piper@enron.com),Re: Hello,1.0,text/plain; charset=us-ascii,7bit,Phillip K Allen,Greg Piper,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,Let's shoot for Tuesday at 11:45.,allen-p


In [101]:
# Set index and drop columns with too few values
emails_df = emails_df.set_index('Message-ID')\
    .drop(['Mime-Version', 'Content-Type', 'Content-Transfer-Encoding'], axis=1)
# Parse datetime
emails_df['Date'] = pd.to_datetime(emails_df['Date'], infer_datetime_format=True)
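
The `Date` headers shown earlier are RFC 2822 style strings with a numeric offset and a trailing zone comment such as `(PDT)`. As a hedged alternative to letting pandas infer the format, a single header can be parsed with the standard library and normalized to UTC so that values with different offsets compare directly. This is a sketch, not what the notebook does; `parse_email_date` is a hypothetical helper name.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

def parse_email_date(value):
    """Parse an RFC 2822 style Date header into a timezone-aware UTC datetime."""
    return parsedate_to_datetime(value).astimezone(timezone.utc)

raw = "Mon, 14 May 2001 16:39:00 -0700 (PDT)"
print(parse_email_date(raw).isoformat())
```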


In [102]:
# Keep only emails whose content has at least 50 words
emails_df = emails_df[emails_df['content'].str.split().str.len() >= 50]
emails_df.dtypes
emails_df.columns

Index(['file', 'Date', 'From', 'To', 'Subject', 'X-From', 'X-To', 'X-cc',
       'X-bcc', 'X-Folder', 'X-Origin', 'X-FileName', 'content', 'user'],
      dtype='object')

In [112]:
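# Fill missing and empty values with the string "None"
emails_df.fillna("None", inplace=True)
emails_df.replace("", "None", inplace=True)
# Replace newlines inside string values; without regex=True only cells equal to "\n" would match
emails_df.replace(r"\n", " ", regex=True, inplace=True)
emails_df.isnull().sum(axis=0)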


Message-ID    0
file          0
Date          0
From          0
To            0
Subject       0
X-From        0
X-To          0
X-cc          0
X-bcc         0
X-Folder      0
X-Origin      0
X-FileName    0
content       0
user          0
dtype: int64

In [106]:
emails_df.head()


Message-ID,file,Date,From,To,Subject,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,X-FileName,content,user
<15464986.1075855378456.JavaMail.evans@thyme>,allen-p/_sent_mail/10.,2001-05-04 13:51:00-07:00,(phillip.allen@enron.com),(john.lavorato@enron.com),Re:,Phillip K Allen,John J Lavorato <John J Lavorato/ENRON@enronXg...,,,"\Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Se...",Allen-P,pallen (Non-Privileged).pst,Traveling to have a business meeting takes the...,allen-p
<30795301.1075855687494.JavaMail.evans@thyme>,allen-p/_sent_mail/102.,2000-10-16 06:44:00-07:00,(phillip.allen@enron.com),(zimam@enron.com),FW: fixed forward or other Collar floor gas pr...,Phillip K Allen,zimam@enron.com,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,---------------------- Forwarded by Phillip K ...,allen-p
<25459584.1075855687536.JavaMail.evans@thyme>,allen-p/_sent_mail/104.,2000-10-13 06:45:00-07:00,(phillip.allen@enron.com),(stagecoachmama@hotmail.com),,Phillip K Allen,stagecoachmama@hotmail.com,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,"Lucy, Here are the rentrolls: Open them and sa...",allen-p
<13116875.1075855687561.JavaMail.evans@thyme>,allen-p/_sent_mail/105.,2000-10-09 07:16:00-07:00,(phillip.allen@enron.com),(keith.holst@enron.com),Consolidated positions: Issues & To Do list,Phillip K Allen,Keith Holst,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,---------------------- Forwarded by Phillip K ...,allen-p
<2707340.1075855687584.JavaMail.evans@thyme>,allen-p/_sent_mail/106.,2000-10-09 07:00:00-07:00,(phillip.allen@enron.com),(keith.holst@enron.com),Consolidated positions: Issues & To Do list,Phillip K Allen,Keith Holst,,,\Phillip_Allen_Dec2000\Notes Folders\'sent mail,Allen-P,pallen.nsf,---------------------- Forwarded by Phillip K ...,allen-p


In [107]:
len(emails_df.columns)

14

In [113]:
emails_df.reset_index(level=0, inplace=True)


In [86]:
emails_df.shape

(374949, 15)

In [114]:
filename = "enron_05_17_2015_filtered.csv"
emails_df.to_csv(filename, index=False, sep='#', encoding='utf-8')
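
The file is written with `#` as the field separator, so any reader has to be told about it. With the default minimal quoting, fields that themselves contain `#` are wrapped in quotes, and a standard CSV parser undoes that on the way back in. A minimal sketch of the round trip with Python's `csv` module on an in-memory sample; the sample rows are made up for illustration.

```python
import csv
import io

rows = [
    ["Message-ID", "Subject", "content"],
    ["<1@example>", "Re: test", "call me at ext #42"],  # made-up sample row
]

# Write with '#' as the delimiter; fields containing '#' get quoted automatically
buf = io.StringIO()
csv.writer(buf, delimiter="#", quoting=csv.QUOTE_MINIMAL).writerows(rows)

# Read back with the same delimiter to recover the original fields
buf.seek(0)
restored = list(csv.reader(buf, delimiter="#"))
print(restored == rows)
```

With pandas, passing the same separator to the reader, as in `pd.read_csv(filename, sep='#')`, should play the same role.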