# Project: Enron dataset preparation for text analysis
## Citation

1. Students in the ANLP course annotated a subset of about 1,700 email messages with category labels.  
   http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz
2. MySQL dump (219 MB compressed) of the Enron email collection, built by Andrew Fiore and Jeff Heer.  
   http://bailando.sims.berkeley.edu/enron/enron.sql.gz
     

## Dependencies  
* Python 3.6.7+
* pandas

## Business Problems
* Generate a labeled Enron dataset for topic classification
* Create a list of contact email addresses

## Descriptions

### enron_with_categories.tar.gz

Format of each line in a .cats file:  
n1,n2,n3  

n1 = top-level category  
n2 = second-level category  
n3 = frequency with which this category was assigned to this message  
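
As a minimal sketch of this format, the snippet below parses a few `.cats`-style lines with the standard library and builds `major_minor` labels like the ones produced later in this notebook. The sample content and the helper name `parse_cats` are made up for illustration.

```python
import csv
import io

def parse_cats(text):
    """Parse .cats content into (major, minor, freq) integer tuples."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # skip blank lines
            continue
        major, minor, freq = (int(field) for field in record)
        rows.append((major, minor, freq))
    return rows

# Hypothetical file content: three category assignments for one message.
sample = "1,1,2\n2,13,1\n3,9,1\n"
rows = parse_cats(sample)
labels = [f"{major}_{minor}" for major, minor, _ in rows]
print(labels)  # ['1_1', '2_13', '3_9']
```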

Here are the categories:  

1 Coarse genre  
 
1.1 Company Business, Strategy, etc. (elaborate in Section 3 [Topics])  
1.2 Purely Personal  
1.3 Personal but in professional context (e.g., it was good working with you)  
1.4 Logistic Arrangements (meeting scheduling, technical support, etc)  
1.5 Employment arrangements (job seeking, hiring, recommendations, etc)  
1.6 Document editing/checking (collaboration)  
1.7 Empty message (due to missing attachment)  
1.8 Empty message  


2 Included/forwarded information  

2.1 Includes new text in addition to forwarded material  
2.2 Forwarded email(s) including replies  
2.3 Business letter(s) / document(s)  
2.4 News article(s)  
2.5 Government / academic report(s)  
2.6 Government action(s) (such as results of a hearing, etc)  
2.7 Press release(s)  
2.8 Legal documents (complaints, lawsuits, advice)  
2.9 Pointers to url(s)  
2.10 Newsletters  
2.11 Jokes, humor (related to business)  
2.12 Jokes, humor (unrelated to business)  
2.13 Attachment(s) (assumed missing)  


3 Primary topics (if coarse genre 1.1 is selected)  

3.1 regulations and regulators (includes price caps)  
3.2 internal projects -- progress and strategy  
3.3 company image -- current  
3.4 company image -- changing / influencing  
3.5 political influence / contributions / contacts  
3.6 california energy crisis / california politics  
3.7 internal company policy  
3.8 internal company operations  
3.9 alliances / partnerships  
3.10 legal advice  
3.11 talking points  
3.12 meeting minutes  
3.13 trip reports  


4 Emotional tone (if not neutral)   

4.1 jubilation  
4.2 hope / anticipation  
4.3 humor  
4.4 camaraderie  
4.5 admiration  
4.6 gratitude  
4.7 friendship / affection  
4.8 sympathy / support  
4.9 sarcasm  
4.10 secrecy / confidentiality  
4.11 worry / anxiety  
4.12 concern  
4.13 competitiveness / aggressiveness  
4.14 triumph / gloating  
4.15 pride  
4.16 anger / agitation  
4.17 sadness / despair  
4.18 shame  
4.19 dislike / scorn  
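
To make the labels easier to read, a small lookup can translate a `major_minor` label back to its top-level category. This is only a sketch: the dictionary repeats the four top-level names listed above, and `describe_label` is a hypothetical helper, not part of the notebook.

```python
# Top-level category names, copied from the list above.
TOP_LEVEL = {
    "1": "Coarse genre",
    "2": "Included/forwarded information",
    "3": "Primary topics",
    "4": "Emotional tone",
}

def describe_label(label):
    """Return (top-level name, second-level number) for a label like '3_9'."""
    major, minor = label.split("_")
    return TOP_LEVEL[major], int(minor)

print(describe_label("3_9"))  # ('Primary topics', 9)
```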

### Contact Info: enron.sql.gz > table people 
CREATE TABLE people (  
  personid INTEGER PRIMARY KEY AUTOINCREMENT,  
  email varchar default NULL,  
  name varchar default NULL,  
  title varchar default NULL,  
  enron int default NULL,  
  msgsent int default NULL,  
  msgrec int default NULL,  
  CONSTRAINT email_unique UNIQUE (email)  
);
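
Part 2 of this notebook reads a `people.csv` export of this table, but the export step itself is not shown. One possible route, assuming the table has been loaded into SQLite (the `AUTOINCREMENT` keyword in the schema above is SQLite syntax), is sketched below with an in-memory database and one sample row. The helper name `export_people` is hypothetical.

```python
import csv
import io
import sqlite3

SCHEMA = """
CREATE TABLE people (
  personid INTEGER PRIMARY KEY AUTOINCREMENT,
  email varchar default NULL,
  name varchar default NULL,
  title varchar default NULL,
  enron int default NULL,
  msgsent int default NULL,
  msgrec int default NULL,
  CONSTRAINT email_unique UNIQUE (email)
);
"""

def export_people(conn, out):
    """Write every row of the people table to `out` as CSV with a header row."""
    cursor = conn.execute("SELECT * FROM people ORDER BY personid")
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)

# In-memory stand-in for the real database, holding a single sample row.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO people (personid, email, name, enron) VALUES (?, ?, ?, ?)",
    (2, "mktstathourahead@caiso.com", "Market Status: Hour-Ahead/Real-Time", 0),
)
buffer = io.StringIO()
export_people(conn, buffer)
print(buffer.getvalue())
```

For the real data, the same function could write to an open file instead of the `StringIO` buffer.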



## Output files  
1. Email body and classification (`./out/emails.csv`)  
    Variables: 
    * abspath of text 
    * text - string - email content  
    * classlabel - array - ["1_1", "2_1"]
2. Contact information (`./out/contacts.csv`)


# Part 1: Get email metadata (first name, last name, email address, subject, body)

In [1]:
import pandas as pd

In [2]:
import os
from pathlib import Path

# Root folder holding the extracted .cats/.txt files and people.csv
path = Path("./data")

In [3]:
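def readCatsFiles(filename):
    """Return (top-level categories, 'major_minor' labels) for one .cats file."""
    df = pd.read_csv(filename, header=None)
    df.columns = ['major', 'minor', 'freq']
    df['major'] = df['major'].astype(int).astype(str)
    df['minor'] = df['minor'].astype(int).astype(str)
    df['label'] = df['major'] + "_" + df['minor']
    # set() removes duplicate top-level categories; their order is not guaranteed
    return list(set(df['major'])), df['label'].tolist()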

readCatsFiles("/home/wk/myProjects/EnronDataset/data/4/54600.cats")

(['1', '3', '2'], ['1_1', '1_3', '1_4', '2_1', '2_2', '2_13', '3_9'])

In [4]:
def absoluteFilePaths(directory):
    """Yield the absolute path of every .cats file under directory."""
    for dirpath, _, filenames in os.walk(directory):
        for f in filenames:
            if f.endswith(".cats"):
                yield os.path.abspath(os.path.join(dirpath, f))

mygenerator = absoluteFilePaths(path)
majors = []
minors = []
txts = []
filenames = []

for fn in mygenerator:
    major, minor = readCatsFiles(fn)
    majors.append(major)
    minors.append(minor)
    # splitext strips only the extension, so dots elsewhere in the path are safe
    stem = os.path.splitext(fn)[0]
    filenames.append(os.path.basename(stem))
    with open(stem + '.txt', 'r') as file:
        txts.append(file.read())

df_email = pd.DataFrame({"MessageID": filenames, "Major": majors, "Minor": minors, "TXT": txts})
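
The same pairing of each `.cats` file with its `.txt` message can also be written with `pathlib`. This is an alternative sketch, not the notebook's code: `paired_files` is a hypothetical helper and the files are created in a throwaway directory.

```python
import tempfile
from pathlib import Path

def paired_files(root):
    """Yield (cats_path, txt_path) for every .cats file that has a matching .txt."""
    for cats in sorted(Path(root).rglob("*.cats")):
        txt = cats.with_suffix(".txt")
        if txt.is_file():
            yield cats, txt

# Throwaway directory with placeholder content, just to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "4"
    folder.mkdir()
    (folder / "54600.cats").write_text("1,1,2\n")
    (folder / "54600.txt").write_text("placeholder message text\n")
    (folder / "99999.cats").write_text("1,2,1\n")  # no matching .txt, so skipped
    pairs = [(c.name, t.name) for c, t in paired_files(tmp)]

print(pairs)  # [('54600.cats', '54600.txt')]
```

Skipping unmatched `.cats` files avoids a `FileNotFoundError` partway through a long run.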

In [5]:
from datetime import datetime
def convert_Data_Format(date_time_str1):
    """Convert a date such as '13 Jul 2001' to '07/13/2001'."""
    date_time_obj1 = datetime.strptime(date_time_str1, '%d %b %Y')
    return date_time_obj1.strftime('%m/%d/%Y')
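
A quick worked example of this conversion, as a standalone sketch (the name `to_us_date` is made up here). Note that `%b` matches abbreviated month names in the current locale, so this assumes an English or C locale.

```python
from datetime import datetime

def to_us_date(day_month_year):
    """Convert 'DD Mon YYYY' (as in an email Date header) to 'MM/DD/YYYY'."""
    parsed = datetime.strptime(day_month_year, "%d %b %Y")
    return parsed.strftime("%m/%d/%Y")

print(to_us_date("13 Jul 2001"))  # 07/13/2001
print(to_us_date("1 Sep 2001"))   # 09/01/2001
```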

In [6]:
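import re
def get_firstname_lastname(txt):
    """Split 'first.last@org.tld' into (firstname, lastname, organization)."""
    firstname = ""
    lastname = ""
    # Tokens that precede a "." or "@", e.g. ['steven', 'kean', 'enron']
    nm = re.findall(r"(.+?)(?:\.|@)", txt.strip())
    if not nm:  # nothing that looks like an address
        return firstname, lastname, ""
    organization = nm[-1]
    if len(nm) >= 3:
        nm = nm[:-1]
        firstname = nm[0]
        lastname = " ".join(nm[1:])
    return firstname, lastname, organization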
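
For comparison, here is a sketch of the same split using `str.partition` instead of a regular expression. `split_address` is a hypothetical alternative, not the notebook's function. It gives the same results for the two address shapes shown in this notebook, but differs for addresses with subdomains (for example `a.b@mail.enron.com`), where the regex version folds the subdomain into the last name.

```python
def split_address(address):
    """Split 'first.last@org.tld' into (first, last, organization)."""
    local, _, domain = address.strip().partition("@")
    first, _, last = local.partition(".")
    if not last:  # no dot in the local part: not a first.last address
        first = ""
    parts = domain.split(".")
    # Use the label just before the top-level domain as the organization.
    organization = parts[-2] if len(parts) >= 2 else domain
    return first, last.replace(".", " "), organization

print(split_address("steven.kean@enron.com"))       # ('steven', 'kean', 'enron')
print(split_address("mktstathourahead@caiso.com"))  # ('', '', 'caiso')
```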

In [7]:
import re
def retrieve_Meta_Data(txt):
    """Extract header fields and body from one message whose newlines were replaced by ' <LF> '."""
    l_meta = re.findall(r"^Message-ID: <(\d+)\.(\d+)\.\w+\.\w+@\w+> .*?Date:\s*\w+?, (\d+ \w+ \d+).*?From:\s*?(.*?)<.*?>.Subject:?(.*?)<LF>.*X-FileName:.*? <LF>(.*)", txt)
    if not l_meta:
        raise ValueError("message does not match the expected header layout")

    master_rootid = l_meta[0][0]
    docket = l_meta[0][1]
    emailAddr = l_meta[0][3].strip()
    nm = get_firstname_lastname(emailAddr)
    fn = nm[0]
    ln = nm[1]
    dt = convert_Data_Format(l_meta[0][2])
    subject = l_meta[0][4].strip()
    body = l_meta[0][5].replace("<LF>", "\n")
    return master_rootid, docket, subject, dt, fn, ln, emailAddr, body
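
The header fields can also be read with Python's standard `email` package rather than one large regular expression. The sketch below is an alternative approach, not the method used in this notebook; `parse_headers` is a hypothetical helper and the sample message is made up, using the same header names as the corpus files.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

def parse_headers(raw):
    """Extract a few header fields and the body from a raw RFC 822 message."""
    msg = message_from_string(raw)
    _, address = parseaddr(msg["From"])
    return {
        "message_id": msg["Message-ID"].strip("<>"),
        "date": parsedate_to_datetime(msg["Date"]).strftime("%m/%d/%Y"),
        "from": address,
        "subject": msg["Subject"],
        "body": msg.get_payload(),
    }

# Made-up message for illustration only.
raw = (
    "Message-ID: <111.222.JavaMail.example@host>\n"
    "Date: Fri, 13 Jul 2001 10:30:00 -0700\n"
    "From: steven.kean@enron.com\n"
    "To: someone@example.com\n"
    "Subject: Re: EBS Article for eBiz\n"
    "\n"
    "See attached\n"
)
meta = parse_headers(raw)
print(meta["date"], meta["from"], meta["subject"])
```

A header parser does not depend on the order of the headers, which the regular expression does.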

In [8]:
df_email.tail(2)

Unnamed: 0,MessageID,Major,Minor,TXT
1700,174075,"[4, 1, 2]","[1_3, 2_2, 4_10]",Message-ID: <19438086.1075846167206.JavaMail.e...
1701,54536,"[4, 1, 2]","[1_2, 1_3, 2_1, 2_2, 2_13, 4_2, 4_4, 4_10, 4_12]",Message-ID: <25473912.1075863420369.JavaMail.e...


In [9]:
list_email = list(df_email.TXT)
list_email = [e.replace("\n", " <LF> ").replace("\t", "   ") for e in list_email]

master_rootids = []
dockets = []
subjects = [] 
dts = [] 
fns = [] 
lns = [] 
emailAddrs = [] 
bodies = []
for txt in list_email:
    master_rootid, docket, subject, dt, fn, ln, emailAddr, body = retrieve_Meta_Data(txt)
    master_rootids.append(master_rootid)
    dockets.append(docket)
    subjects.append(subject) 
    dts.append(dt) 
    fns.append(fn) 
    lns.append(ln) 
    emailAddrs.append(emailAddr) 
    bodies.append(body)


In [10]:
df_out = pd.DataFrame(
    {
    "master_rootid" : master_rootids, 
    "DOCKET" : dockets,
    "SUBJECT" : subjects,
    "LEAD" : " ",
    "RECEIVED" : dts ,
    "PRIORITY" : "GEN",
    "ASGNTO" : df_email.Minor,
    "FNAME" : fns,
    "LNAME" : lns,
    "PROV" : " ",
    "EMAIL" : emailAddrs,
    "TEL1" : " ",
    "ADOC_REF" : df_email.MessageID,
    "DESCRIPTION" : "",
    "FILESUFFIX" : "PDF",
    "TRECS_Added_Date" : dts,
    "filename" : "",
    "text" : bodies,
    "replyfname" : "",
    "reply" : ""
    }
)
df_out.head(2)

Unnamed: 0,master_rootid,DOCKET,SUBJECT,LEAD,RECEIVED,PRIORITY,ASGNTO,FNAME,LNAME,PROV,EMAIL,TEL1,ADOC_REF,DESCRIPTION,FILESUFFIX,TRECS_Added_Date,filename,text,replyfname,reply
0,14452264,1075858883447,Re: EBS Article for eBiz,,07/13/2001,GEN,"[1_6, 2_2, 2_13]",steven,kean,,steven.kean@enron.com,,177817,,PDF,07/13/2001,,\n See attached \n \n \n \n \n From: ...,,
1,22747723,1075858707662,Bingaman Draft On Transparency -- Amendment I...,,09/11/2001,GEN,"[1_6, 3_1]",john,shelk,,john.shelk@enron.com,,136389,,PDF,09/11/2001,,"\n \n Last night, Linda and I spent a fairl...",,


In [11]:
os.makedirs('./out', exist_ok=True)  # to_csv does not create missing folders
df_out.to_csv('./out/emails.csv', index=True)
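
One thing to keep in mind when reading `emails.csv` back: list-valued cells such as `ASGNTO` are stored as their string representation. The sketch below uses only the standard library to show the round trip and recovers the list with `ast.literal_eval`. The two-column layout is a simplified stand-in for the real file.

```python
import ast
import csv
import io

# A list stored in a CSV cell is written as its string representation.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ADOC_REF", "ASGNTO"])
writer.writerow(["177817", str(["1_6", "2_2", "2_13"])])

# Reading the cell back yields a plain string, not a list.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_value = rows[0]["ASGNTO"]

# literal_eval safely parses the Python literal back into a list.
labels = ast.literal_eval(raw_value)
print(type(raw_value).__name__, labels)
```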

# Part 2: Get Contact Information  

In [12]:
df_emailaddress = pd.read_csv(path.joinpath("people.csv"),encoding = "ISO-8859-1")
df_emailaddress.head(1)

Unnamed: 0,personid,email,name,title,enron,msgsent,msgrec
0,2,mktstathourahead@caiso.com,Market Status: Hour-Ahead/Real-Time,,0,,


In [13]:
firstnames = []
lastnames = []
organizations = []

for email_addr in df_emailaddress['email']:
    fn, ln, org = get_firstname_lastname(email_addr)
    firstnames.append(fn)
    lastnames.append(ln)
    organizations.append(org)

In [14]:
df_contact = pd.DataFrame({
    "DetailID": df_emailaddress.personid,
    "FNAME": firstnames,
    "LNAME": lastnames,
    "PROV": "",
    "EMAIL": df_emailaddress.email,
    "TEL1": "",
    "ORGANIZATION": organizations,
})

In [15]:
df_contact.head(5)

Unnamed: 0,DetailID,FNAME,LNAME,PROV,EMAIL,TEL1,ORGANIZATION
0,2,,,,mktstathourahead@caiso.com,,caiso
1,3,,,,marketopsrealtimebeep@caiso.com,,caiso
2,4,,,,crcommunications@caiso.com,,caiso
3,5,,,,20participants@caiso.com,,caiso
4,6,,,,isoclientrelations@caiso.com,,caiso


In [16]:
os.makedirs('./out', exist_ok=True)  # to_csv does not create missing folders
df_contact.to_csv('./out/contacts.csv', index=False)