In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import email

# Enron Emails Data Cleaning
## Load the Data

The Enron Email Dataset was acquired on March 9, 2020 from [Kaggle](https://www.kaggle.com/wcukierski/enron-email-dataset).

In [2]:
file_path = "../data/emails.csv"
emails = pd.read_csv(file_path)

## Preview the Data

Let's take a look at the `emails` DataFrame.

In [3]:
emails.shape

(517401, 2)

In [4]:
emails.head()

Unnamed: 0,file,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...


Let's look at a sample from the `message` column.

In [5]:
print(emails['message'][1])

Message-ID: <15464986.1075855378456.JavaMail.evans@thyme>
Date: Fri, 4 May 2001 13:51:00 -0700 (PDT)
From: phillip.allen@enron.com
To: john.lavorato@enron.com
Subject: Re:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: John J Lavorato <John J Lavorato/ENRON@enronXgate@ENRON>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Traveling to have a business meeting takes the fun out of the trip.  Especially if you have to prepare a presentation.  I would suggest holding the business plan meetings here then take a trip without any formal business meetings.  I would even try and get some honest opinions on whether a trip is even desired or necessary.

As far as the business meetings, I think it would be more productive to try and stimulate discussions across the different groups about what is working and what is not.  Too often the

## Rename the `message` column

Each entry in the `message` column holds a full raw email (headers plus body) rather than just the message text, so let's rename the column to `data`.

In [6]:
emails.rename(columns={'message':'data'}, inplace=True)

## Parsing the Data

The `data` column (formerly `message`) appears to contain all the desired information, so it must be parsed to extract the useful fields.

In [7]:
# Convert content in emails['data'] to email objects
email_objs = list(map(email.message_from_string, emails['data']))
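To make the parsing step concrete, here is a minimal sketch of what `email.message_from_string` returns for a single raw string. The sample message is made up for illustration (it only mimics the header/body layout shown above); it is not taken from the dataset.

```python
import email

# A shortened, made-up raw message in the same header/body layout as the dataset.
raw = (
    "Message-ID: <0001.example@thyme>\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch\n"
    "\n"
    "Let's shoot for Tuesday at 11:45."
)

msg = email.message_from_string(raw)

sender = msg["From"]        # header lookup by name (case-insensitive)
subject = msg["Subject"]
header_names = msg.keys()   # header names in their original order
body = msg.get_payload()    # the text after the first blank line
```

The headers are available through dictionary-style lookup, and everything after the first blank line becomes the payload.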

Now that we have email objects, we can use `get_payload()` to extract the body of each email.

In [8]:
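def create_messages_column(email_objs, df):
    """Add a `message` column to df holding the payload (body) of each email object."""
    # Use `msg` as the loop variable so the imported `email` module is not shadowed.
    df['message'] = [msg.get_payload() for msg in email_objs]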

In [9]:
create_messages_column(email_objs, emails)
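One caveat with `get_payload()`: for a multipart message it returns a list of sub-messages rather than a string. Whether any messages in this dataset are multipart has not been checked here, so the following is only a defensive sketch. The helper name `body_text` is hypothetical and not part of the notebook.

```python
import email

def body_text(msg):
    """Return the plain-text body of a parsed message as a single string."""
    if msg.is_multipart():
        # walk() yields the message itself and every nested part;
        # keep only the leaf parts whose content type is text/plain.
        parts = [
            part.get_payload()
            for part in msg.walk()
            if not part.is_multipart() and part.get_content_type() == "text/plain"
        ]
        return "\n".join(parts)
    # Non-multipart messages already carry their body as a string.
    return msg.get_payload()

# Made-up examples: one simple message and one multipart message.
simple = email.message_from_string("Subject: Hi\n\ntest successful.")
multi = email.message_from_string(
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "first part\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "second part\n"
    "--XYZ--\n"
)

simple_body = body_text(simple)
multi_body = body_text(multi)
```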

In [10]:
emails.head()

Unnamed: 0,file,data,message
0,allen-p/_sent_mail/1.,Message-ID: <18782981.1075855378110.JavaMail.e...,Here is our forecast\n\n
1,allen-p/_sent_mail/10.,Message-ID: <15464986.1075855378456.JavaMail.e...,Traveling to have a business meeting takes the...
2,allen-p/_sent_mail/100.,Message-ID: <24216240.1075855687451.JavaMail.e...,test successful. way to go!!!
3,allen-p/_sent_mail/1000.,Message-ID: <13505866.1075863688222.JavaMail.e...,"Randy,\n\n Can you send me a schedule of the s..."
4,allen-p/_sent_mail/1001.,Message-ID: <30922949.1075863688243.JavaMail.e...,Let's shoot for Tuesday at 11:45.
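The same email objects also expose the headers seen in the sample above (`Date`, `From`, `To`, `Subject`). As a possible next step, the sketch below pulls a few of them into their own columns. The function name `extract_headers` and the choice of headers are assumptions made for illustration, and the sample strings are invented.

```python
import email
import pandas as pd

def extract_headers(raw_messages, header_names=("Date", "From", "To", "Subject")):
    """Return a DataFrame with one column per requested header."""
    rows = []
    for raw in raw_messages:
        msg = email.message_from_string(raw)
        # Message.get returns None when a header is absent.
        rows.append({name: msg.get(name) for name in header_names})
    return pd.DataFrame(rows, columns=list(header_names))

# Made-up raw messages; the second one lacks the Date and To headers.
sample = [
    "Date: Fri, 4 May 2001 13:51:00 -0700 (PDT)\n"
    "From: a@example.com\nTo: b@example.com\nSubject: Re:\n\nBody one",
    "From: c@example.com\nSubject: Hi\n\nBody two",
]
headers = extract_headers(sample)
```

On the notebook's data this could be called as `extract_headers(emails['data'])`, with the result joined back via `pd.concat([emails, headers], axis=1)`, assuming `emails` still has its default integer index.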
