# Data Manipulation

## Background

The first data set we are going to analyze is the Enron Corpus.
Because the corpus is so large, we do not want to load all of the emails into Python at once.
Therefore, I have split the original csv, which contains all 500k emails, into smaller files of about 10k emails each.
I made sure that emails from a specific user's inbox are not split between multiple files. This should help later when we need to split up the Corpus into training and testing sets.<br/>
<br/>
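The splitting script itself is not shown here, so the following is only a minimal sketch of how such a split could be done. It assumes the user is the first path component of the `file` column (as in `allen-p/_sent_mail/1.`); the name `assign_chunks` and the greedy strategy are illustrative assumptions, not the code that produced the chunks used below.

```python
import pandas as pd


def assign_chunks(df, target_size=10_000):
    """Return a chunk number per row, keeping each user's emails together."""
    # Assumption: the user is the first path component of the filename,
    # e.g. "allen-p/_sent_mail/1." -> "allen-p".
    users = df["file"].str.split("/").str[0]
    # Number of emails per user, in order of first appearance.
    sizes = users.groupby(users, sort=False).size()

    chunk_of_user = {}
    chunk, count = 1, 0
    for user, size in sizes.items():
        # Start a new chunk when adding this user would exceed the target.
        # A single user larger than target_size still stays in one chunk.
        if count > 0 and count + size > target_size:
            chunk, count = chunk + 1, 0
        chunk_of_user[user] = chunk
        count += size
    return users.map(chunk_of_user)
```

Grouping the dataframe by the returned series and writing each group with `DataFrame.to_csv` would then give one file per chunk.
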
We will start with loading the first chunk of the emails using pandas.<br/>
Every csv file has a header row, which tells pandas to record each email as two fields: a filename and the email's raw message (headers plus body).<br/>


In [13]:
import pandas as pd
import csv
import email.utils
import email
import email.header
import datetime
import os

In [33]:
file = pd.read_csv("enron/initial/enron emails_chunk1.csv")
print(file.shape)

(10430, 2)


After having read the csv in, we can see the "shape" of the file.<br/>
Here, we have 10430 emails, each with 2 fields: a filename and a message.

In [34]:
print(file['file'][0])

allen-p/_sent_mail/1.


In [35]:
print(file['message'][5000])

Message-ID: <32259460.1075852688586.JavaMail.evans@thyme>
Date: Fri, 5 Oct 2001 01:31:17 -0700 (PDT)
From: jennifer.fraser@enron.com
To: john.arnold@enron.com
Subject: RE: right about now dont u think u otta sell some calls against yr
 36.88s
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Fraser, Jennifer </O=ENRON/OU=NA/CN=RECIPIENTS/CN=JFRASER>
X-To: Arnold, John </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Jarnold>
X-cc: 
X-bcc: 
X-Folder: \JARNOLD (Non-Privileged)\Arnold, John\Deleted Items
X-Origin: Arnold-J
X-FileName: JARNOLD (Non-Privileged).pst

becuase we are overvalued .... jan01 37.50 2.90 bid

 -----Original Message-----
From: 	Arnold, John  
Sent:	Thursday, October 04, 2001 10:25 PM
To:	Fraser, Jennifer
Subject:	RE: right about now dont u think u otta sell some calls against yr 36.88s

because we're $10 off the lows or because you think we're overvalued?

 -----Original Message-----
From: 	Fraser, Jenni

## Breaking Down our Features

Not all of the information in the emails will be helpful. This experiment focuses mainly on the content of the body, and we hypothesize that the senders, receivers, dates, etc. will not be useful in classifying an email's topic. Our first step is to remove this information from the body while retaining it in our dataframe in case it <i>does</i> end up being important for classification later on in the experiment.<br/>

We will need to construct a new header for each .csv containing our complete set of features.<br/>
When we are done, our features will be:
<ul>
    <li>filename</li>
    <li>id</li>
    <li>date</li>
    <li>from</li>
    <li>to</li>
    <li>subject</li>
    <li>cc</li>
    <li>mime-version</li>
    <li>content-type</li>
    <li>content-encoding</li>
    <li>bcc</li>
    <li>message</li>
</ul>
Although each email also contains fields such as "X-From" and "X-To", these are redundant with the standard headers, so it should be safe to disregard them without losing any data.<br/><br/>
Let's make a function that will construct a header and add it to a file.

In [36]:
def create_header(filepath):
    # The header containing our complete set of features
    header = "\"filename\",\"id\",\"date\",\"from\",\"to\",\"subject\",\"cc\",\"mime-version\",\"content-type\",\"content-encoding\",\"bcc\",\"message\"\n"

    # Create a new file in the "enron/unlabeled" folder.
    # The filename should correspond to the given filepath.
    f = open(os.path.join("enron/unlabeled", os.path.basename(filepath)), "w+")
    # Write the header as the first line of the file.
    f.write(header)
    f.close()
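
The header above is quoted by hand. As a hedged alternative, Python's standard `csv` module can produce the same line and will also double any quotes embedded in a field. The names `FEATURES` and `csv_line` below are illustrative assumptions, not part of the notebook.

```python
import csv
import io

# Same feature names, in the same order, as the hand-written header.
FEATURES = ["filename", "id", "date", "from", "to", "subject", "cc",
            "mime-version", "content-type", "content-encoding", "bcc",
            "message"]


def csv_line(values):
    """Format one csv record with every field quoted and escaped."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(values)
    return buffer.getvalue()
```

Calling `csv_line(FEATURES)` yields the header line, and the same helper could format each email's fields so that commas and quotes inside a value do not break the row.
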

Now we will write another function that takes the lines of each email's header block and writes them into these feature columns, separating the features with commas inside the .csv.<br/>
Our function will perform this action on all emails in a single chunk.<br/>

In [37]:
def separate_features(filepath):
    
    # Read the file and collect all unprocessed emails
    file = pd.read_csv(filepath)
    
    # Open the file in the "enron/unlabeled" folder (created by create_header) in append mode.
    # The filename should correspond to the given filepath.
    f = open(os.path.join("enron/unlabeled", os.path.basename(filepath)), "a+")
    
    # For each email in the chunk, separate the features and write to a new file.
    for x in range(0, file.shape[0]):
    
        lines = file['message'][x].splitlines()
    
        # First comes the filename
        f.write("\"" + file['file'][x] + "\",")

        # Every email has a message id
        f.write("\"" + lines[0] + "\",")

        # Every email has a date
        f.write("\"" + lines[1] + "\",")

        # Every email has a from line
        f.write("\"" + lines[2] + "\",")

        # Every email has a to line
        f.write("\"" + lines[3] + "\",")

        # Every email has a subject line, even though some are left blank
        f.write("\"" + lines[4] + "\",")

        # Not every email has a cc line.
        cc_line = 0
        if lines[5].startswith('Cc'):
            f.write("\"" + lines[5] + "\",")
            cc_line = 1
        else:
            f.write("\" \",")

        # Every email has a mime version
        f.write("\"" + lines[5 + cc_line] + "\",")

        # Every email has a content-type
        f.write("\"" + lines[6 + cc_line] + "\",")

        # Every email has a content-encoding
        f.write("\"" + lines[7 + cc_line] + "\",")

        # Not every email has a bcc line
        bcc_line = 0
        if lines[8 + cc_line].startswith('Bcc'):
            f.write("\"" + lines[8 + cc_line] + "\",")
            bcc_line = 1
        else:
            f.write("\" \",")

        # Whatever remains is part of the message
        f.write("\"")
        for i in range(9 + cc_line + bcc_line, len(lines)):
            # We don't need to hold on to fields that start with X- because these are duplicated from the file header.
            # These lines are only found in the current email, so it will not alter the content of email chains.
            if lines[i].startswith("X-"):
                continue
            f.write(lines[i].replace("\"", "\"\"") + "\n")
        f.write("\"\n")
        f.flush()

    f.close()
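
The function above indexes header lines by position, which assumes every header occupies exactly one line. The sample email printed earlier has a Subject that continues onto a second line, so a positional lookup would be off by one for that message. As a hedged sketch of an alternative (not the approach used in this notebook), the standard `email` module can look headers up by name. The names `HEADER_NAMES` and `parse_email` are illustrative assumptions.

```python
import email
import re

# Headers to keep, by their names in the raw message.
HEADER_NAMES = ["Message-ID", "Date", "From", "To", "Subject", "Cc",
                "Mime-Version", "Content-Type", "Content-Transfer-Encoding",
                "Bcc"]


def parse_email(raw_message):
    """Split a raw email into a dict of selected headers plus the body."""
    msg = email.message_from_string(raw_message)
    fields = {}
    for name in HEADER_NAMES:
        # Missing headers (e.g. Cc, Bcc) become empty strings.
        value = msg.get(name, "")
        # Join folded (multi-line) header values back into a single line.
        fields[name.lower()] = re.sub(r"\r?\n[ \t]+", " ", value)
    # The X-* lines are headers too, so they are not part of the payload.
    fields["message"] = msg.get_payload()
    return fields
```

Unlike the positional version, this stores only the header value (for example `a@enron.com` rather than `From: a@enron.com`), which is a design choice that may or may not suit later steps.
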

Let's test these methods on our first chunk and make sure the formatting is what we expect.

In [38]:
create_header("enron/initial/enron emails_chunk1.csv")
separate_features("enron/initial/enron emails_chunk1.csv")

The file looks well formatted on visual inspection. Let's see if we can read it in using pandas without any errors.

In [41]:
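# error_bad_lines=False skips malformed rows; it was removed in pandas 2.0, where on_bad_lines="warn" plays the same role.
file = pd.read_csv("enron/unlabeled/enron emails_chunk1.csv", error_bad_lines=False)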
print(file.shape)

(10427, 12)


b'Skipping line 8868: expected 12 fields, saw 13\nSkipping line 9197: expected 12 fields, saw 13\nSkipping line 9380: expected 12 fields, saw 13\n'


It looks like we've lost three emails due to parsing errors (10430 rows went in, 10427 came out). I need to look into this.
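
One way to start looking into the skipped rows is to list the records whose field count is wrong. The sketch below uses only the standard `csv` module; the function name `find_bad_rows` is an illustrative assumption. A possible cause worth checking, though not confirmed here, is a double quote inside a header field such as the subject, since only the message body has its quotes doubled when the file is written.

```python
import csv


def find_bad_rows(lines, expected_fields=12):
    """Return (line_number, field_count) for each malformed csv record.

    `lines` is any iterable of text lines, such as an open file handle.
    """
    reader = csv.reader(lines)
    bad = []
    for row in reader:
        if len(row) != expected_fields:
            # line_num is the source line on which this record ended.
            bad.append((reader.line_num, len(row)))
    return bad
```

A file handle opened with `open(path, newline="")` can be passed in directly, and the reported line numbers can then be compared with the ones in the pandas warnings.
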

In [52]:
print(file['message'][347])

Message-ID: <5705357.1075855920203.JavaMail.evans@thyme>
Date: Wed, 9 Aug 2000 02:28:00 -0700 (PDT)
From: sally.beck@enron.com
To: mary.gray@enron.com
Subject: Re: excitement
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Sally Beck
X-To: Mary Griff Gray
X-cc: 
X-bcc: 
X-Folder: \Sally_Beck_Dec2000\Notes Folders\'sent mail
X-Origin: Beck-S
X-FileName: sbeck.nsf

It was great to see a friendly face!  Don't ever hesitate to wave or stop in 
and say hello.  Can you believe that I had been on the 30th floor since the 
summer of 1994?!  I know that sets some kind of record at Enron!!  It is kind 
of fun to be up here.  The appreciation for the jobs that we do in operations 
has grown over the last several years, thanks to the hard work of you and so 
many others.  

My family is doing great.  The children start school tomorrow.  Meagan will 
be a junior in high school -- we took a brief trip with her this summer f

In [392]:
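# As above, error_bad_lines=False skips malformed rows (spelled on_bad_lines="warn" from pandas 2.0 on).
file = pd.read_csv("enron/initial/enron emails_chunk42.csv", error_bad_lines=False)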
print(file.shape)

(8785, 2)


b'Skipping line 993: expected 2 fields, saw 3\nSkipping line 999: expected 2 fields, saw 3\nSkipping line 1002: expected 2 fields, saw 3\nSkipping line 1004: expected 2 fields, saw 3\nSkipping line 1006: expected 2 fields, saw 3\nSkipping line 1032: expected 2 fields, saw 3\nSkipping line 1037: expected 2 fields, saw 3\nSkipping line 1038: expected 2 fields, saw 3\nSkipping line 1043: expected 2 fields, saw 3\nSkipping line 1057: expected 2 fields, saw 3\nSkipping line 1071: expected 2 fields, saw 3\nSkipping line 1073: expected 2 fields, saw 3\nSkipping line 1093: expected 2 fields, saw 3\nSkipping line 1107: expected 2 fields, saw 3\nSkipping line 1110: expected 2 fields, saw 3\nSkipping line 1112: expected 2 fields, saw 3\nSkipping line 1118: expected 2 fields, saw 3\nSkipping line 1138: expected 2 fields, saw 3\nSkipping line 1155: expected 2 fields, saw 3\nSkipping line 1176: expected 2 fields, saw 3\nSkipping line 1190: expected 2 fields, saw 3\nSkipping line 1193: expected 2 fie