# Project: Enron dataset preparation for text analysis
## Citation

1. Students in the ANLP course annotated a subset of about 1,700 email messages with category labels.  
   http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz
2. MySQL dataset (219 MB compressed) of the Enron email collection, built by Andrew Fiore and Jeff Heer.  
   http://bailando.sims.berkeley.edu/enron/enron.sql.gz
     

## Dependencies  
    Python 3.6.7+  
    pandas  

## Business Problems
* Generate a labeled Enron dataset for topic classification
* Create an email address list

## Descriptions

### enron_with_categories.tar.gz

Format of each line in .cats file:  
n1,n2,n3  

n1 = top-level category  
n2 = second-level category  
n3 = frequency with which this category was assigned to this message  
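
As a minimal sketch of this format, the snippet below parses a few `n1,n2,n3` lines with the standard library only. The sample values and the helper name `parse_cats` are illustrative assumptions, not taken from the dataset or the notebook.

```python
import csv
import io

# Hypothetical .cats content: one "n1,n2,n3" triple per line.
sample = "1,1,2\n2,13,1\n3,9,1\n"

def parse_cats(text):
    """Return a list of (top_level, second_level, frequency) integer tuples."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # skip blank lines
            continue
        n1, n2, n3 = (int(field) for field in record)
        rows.append((n1, n2, n3))
    return rows

parsed = parse_cats(sample)
# Combine the two category levels into dotted labels such as "2.13".
labels = [f"{n1}.{n2}" for n1, n2, _ in parsed]
print(parsed)
print(labels)
```
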

Here are the categories:  

1 Coarse genre  
 
1.1 Company Business, Strategy, etc. (elaborate in Section 3 [Topics])  
1.2 Purely Personal  
1.3 Personal but in professional context (e.g., it was good working with you)  
1.4 Logistic Arrangements (meeting scheduling, technical support, etc)  
1.5 Employment arrangements (job seeking, hiring, recommendations, etc)  
1.6 Document editing/checking (collaboration)  
1.7 Empty message (due to missing attachment)  
1.8 Empty message  


2 Included/forwarded information  

2.1 Includes new text in addition to forwarded material  
2.2 Forwarded email(s) including replies  
2.3 Business letter(s) / document(s)  
2.4 News article(s)  
2.5 Government / academic report(s)  
2.6 Government action(s) (such as results of a hearing, etc)  
2.7 Press release(s)  
2.8 Legal documents (complaints, lawsuits, advice)  
2.9 Pointers to url(s)  
2.10 Newsletters  
2.11 Jokes, humor (related to business)  
2.12 Jokes, humor (unrelated to business)  
2.13 Attachment(s) (assumed missing)  


3 Primary topics (if coarse genre 1.1 is selected)  

3.1 regulations and regulators (includes price caps)  
3.2 internal projects -- progress and strategy  
3.3 company image -- current  
3.4 company image -- changing / influencing  
3.5 political influence / contributions / contacts  
3.6 california energy crisis / california politics  
3.7 internal company policy  
3.8 internal company operations  
3.9 alliances / partnerships  
3.10 legal advice  
3.11 talking points  
3.12 meeting minutes  
3.13 trip reports  


4 Emotional tone (if not neutral)   

4.1 jubilation  
4.2 hope / anticipation  
4.3 humor  
4.4 camaraderie  
4.5 admiration  
4.6 gratitude  
4.7 friendship / affection  
4.8 sympathy / support  
4.9 sarcasm  
4.10 secrecy / confidentiality  
4.11 worry / anxiety  
4.12 concern  
4.13 competitiveness / aggressiveness  
4.14 triumph / gloating  
4.15 pride  
4.16 anger / agitation  
4.17 sadness / despair  
4.18 shame  
4.19 dislike / scorn  
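
To make labels such as `1.6` or `4_7` readable, a lookup table can be built from the list above. The sketch below covers only a handful of second-level entries; the names `TOP_LEVEL`, `SECOND_LEVEL` and `describe` are illustrative and do not appear in the notebook.

```python
# Top-level category names, copied from the list above.
TOP_LEVEL = {
    "1": "Coarse genre",
    "2": "Included/forwarded information",
    "3": "Primary topics",
    "4": "Emotional tone",
}

# A few second-level names for illustration; extend with the full list as needed.
SECOND_LEVEL = {
    "1.1": "Company Business, Strategy, etc.",
    "1.6": "Document editing/checking (collaboration)",
    "2.13": "Attachment(s) (assumed missing)",
    "3.9": "alliances / partnerships",
    "4.7": "friendship / affection",
}

def describe(label):
    """Turn a label such as '1.6' or '1_6' into 'top-level: second-level'."""
    key = label.replace("_", ".")
    major = key.split(".")[0]
    top = TOP_LEVEL.get(major, "unknown")
    second = SECOND_LEVEL.get(key, "unknown")
    return f"{top}: {second}"

print(describe("1_6"))
print(describe("4.7"))
```
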

### enron.sql.gz > table people 
CREATE TABLE people (  
  personid INTEGER PRIMARY KEY AUTOINCREMENT,  
  email varchar default NULL,  
  name varchar default NULL,  
  title varchar default NULL,  
  enron int default NULL,  
  msgsent int default NULL,  
  msgrec int default NULL,  
  CONSTRAINT email_unique UNIQUE (email)  
);
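
Part 2 reads this table from a `people.csv` file, but the export step is not documented here. The `AUTOINCREMENT` keyword in the definition above is SQLite syntax, so one possible route, sketched under the assumption that the table lives in a SQLite database, is to query it with `sqlite3` and write the rows out with `csv`. The in-memory database and the two sample rows (copied from the Part 2 output) are for illustration only.

```python
import csv
import io
import sqlite3

SCHEMA = """
CREATE TABLE people (
  personid INTEGER PRIMARY KEY AUTOINCREMENT,
  email varchar default NULL,
  name varchar default NULL,
  title varchar default NULL,
  enron int default NULL,
  msgsent int default NULL,
  msgrec int default NULL,
  CONSTRAINT email_unique UNIQUE (email)
)
"""

# Assumption: an in-memory SQLite database stands in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO people (personid, email, name, enron) VALUES (?, ?, ?, ?)",
    [
        (8, "bill.williams@enron.com", "Bill Williams III", 1),
        (9, "dcarter@allmort.com", "Dawn Carter", 0),
    ],
)

cursor = conn.execute("SELECT * FROM people ORDER BY personid")
header = [col[0] for col in cursor.description]  # column names
rows = cursor.fetchall()
conn.close()

# Write header and rows as CSV text; NULL values become empty fields.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```
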



## Output files  
1. Email Body and Classification  
    Variables:
    * abspath - string - absolute path of the email text file  
    * text - string - email content  
    * classlabel - array - category labels, e.g. ["1.1", "2.1"]
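
A minimal sketch of one such record, assuming it is held as a plain dictionary: the helper name `build_record`, the sample file and its content are hypothetical, and the file format of the final output is not specified here.

```python
import json
import os
import tempfile

def build_record(txt_path, labels):
    """Return a dict with the three output variables for one email."""
    with open(txt_path, "r") as handle:
        text = handle.read()
    return {
        "abspath": os.path.abspath(txt_path),
        "text": text,
        # "1_1" style labels become "1.1" to match the classlabel example.
        "classlabel": [label.replace("_", ".") for label in labels],
    }

# Demo with a throwaway file; the content is made up.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "54600.txt")
    with open(sample_path, "w") as handle:
        handle.write("Subject: demo\n\nHello")
    record = build_record(sample_path, ["1_1", "2_1"])

print(json.dumps(record["classlabel"]))
```
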


# Part 1: Get Email Text

In [1]:
import pandas as pd

In [2]:
from pathlib import Path
import os
path = Path("./data")


In [3]:

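def readCatsFiles(filename):
    """Read one .cats file and return (major categories, "major_minor" labels)."""
    # Each line is "n1,n2,n3": top-level category, second-level category, frequency.
    df = pd.read_csv(filename, header=None)
    df.columns = ['major', 'minor', 'freq']
    df['major'] = df['major'].astype(int).astype(str)
    df['minor'] = df['minor'].astype(int).astype(str)
    df['label'] = df['major'] + "_" + df['minor']
    # set() removes duplicate majors; its iteration order is not guaranteed.
    return list(set(df['major'])), df['label'].tolist()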

readCatsFiles("/home/wk/myProjects/EnronDataset/data/4/54600.cats")

(['3', '1', '2'], ['1_1', '1_3', '1_4', '2_1', '2_2', '2_13', '3_9'])

In [4]:
def absoluteFilePaths(directory):
    """Yield the absolute path of every .cats file under directory."""
    for dirpath, _, filenames in os.walk(directory):
        for f in filenames:
            if f.endswith(".cats"):
                yield os.path.abspath(os.path.join(dirpath, f))

mygenerator = absoluteFilePaths(path)
majors = []
minors = []
txts = []
filenames = []

for fn in mygenerator:
    major, minor = readCatsFiles(fn)
    majors.append(major)
    minors.append(minor)
    # splitext strips only the final extension, so dots elsewhere in the path are safe.
    base = os.path.splitext(fn)[0]
    filenames.append(os.path.basename(base))
    # Each .cats file has a matching .txt file holding the email text.
    with open(base + '.txt', 'r') as file:
        txts.append(file.read())

df_email = pd.DataFrame({"MessageID": filenames, "Major": majors, "Minor": minors, "TXT": txts})
df_email.head(30)
    

| | MessageID | Major | Minor | TXT |
|---|---|---|---|---|
| 0 | 177817 | [1, 2] | [1_6, 2_2, 2_13] | Message-ID: <14452264.1075858883447.JavaMail.e... |
| 1 | 136389 | [3, 1] | [1_6, 3_1] | Message-ID: <22747723.1075858707662.JavaMail.e... |
| 2 | 54663 | [1, 4, 2] | [1_6, 2_1, 2_2, 2_13, 4_7] | Message-ID: <10462332.1075863429489.JavaMail.e... |
| 3 | 54679 | [1] | [1_6] | Message-ID: <31003117.1075863429910.JavaMail.e... |
| 4 | 176583 | [1, 4, 2] | [1_6, 2_1, 2_2, 2_7, 2_13, 4_12] | Message-ID: <16142741.1075849870380.JavaMail.e... |
| 5 | 175676 | [1, 2] | [1_6, 2_1, 2_2, 2_13] | Message-ID: <647171.1075847619009.JavaMail.eva... |
| 6 | 81420 | [3, 1, 4, 2] | [1_6, 2_4, 2_5, 2_13, 3_2, 4_5, 4_10] | Message-ID: <29691642.1075863611766.JavaMail.e... |
| 7 | 175323 | [1] | [1_6] | Message-ID: <25313634.1075847587139.JavaMail.e... |
| 8 | 175558 | [1, 2] | [1_6, 2_13] | Message-ID: <12499440.1075847612101.JavaMail.e... |
| 9 | 218920 | [1, 2] | [1_6, 2_13] | Message-ID: <6316972.1075852466973.JavaMail.ev... |
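
The loading loop above pairs each `.cats` file with its `.txt` sibling by editing the path string. An alternative sketch with `pathlib` is shown below; the helper name `iter_email_pairs` and the throwaway files are assumptions for illustration, and `.cats` files without a matching `.txt` are skipped rather than raising an error.

```python
import tempfile
from pathlib import Path

def iter_email_pairs(root):
    """Yield (cats_path, txt_path) for every .cats file that has a sibling .txt."""
    for cats_path in sorted(Path(root).rglob("*.cats")):
        txt_path = cats_path.with_suffix(".txt")
        if txt_path.is_file():
            yield cats_path, txt_path

# Demo on a temporary directory with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "4"
    folder.mkdir()
    (folder / "54600.cats").write_text("1,1,2\n")
    (folder / "54600.txt").write_text("Message-ID: <demo>\n")
    (folder / "99999.cats").write_text("1,3,1\n")  # no matching .txt
    pairs = [(c.stem, t.name) for c, t in iter_email_pairs(tmp)]

print(pairs)
```
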


In [5]:
df_email.tail(50)

| | MessageID | Major | Minor | TXT |
|---|---|---|---|---|
| 1652 | 173884 | [1, 2] | [1_3, 2_1, 2_2] | Message-ID: <8165738.1075846161635.JavaMail.ev... |
| 1653 | 54642 | [1, 2] | [1_3, 2_1, 2_2] | Message-ID: <13361293.1075863428997.JavaMail.e... |
| 1654 | 174137 | [1, 4, 2] | [1_3, 2_1, 2_2, 4_4] | Message-ID: <16698040.1075846169047.JavaMail.e... |
| 1655 | 176591 | [1, 4, 2] | [1_3, 1_4, 2_2, 4_6, 4_7] | Message-ID: <30106710.1075849870589.JavaMail.e... |
| 1656 | 175240 | [1, 2] | [1_3, 2_1, 2_2] | Message-ID: <13881087.1075847582172.JavaMail.e... |
| 1657 | 54671 | [1, 2] | [1_3, 2_9] | Message-ID: <5644939.1075863429705.JavaMail.ev... |
| 1658 | 54537 | [1, 2] | [1_3, 1_5, 2_1, 2_2] | Message-ID: <24575622.1075863420436.JavaMail.e... |
| 1659 | 53536 | [1, 4, 2] | [1_3, 2_1, 2_2, 4_8] | Message-ID: <3454095.1075840788231.JavaMail.ev... |
| 1660 | 54263 | [1, 4, 2] | [1_3, 1_5, 2_2, 4_4] | Message-ID: <14136486.1075858478980.JavaMail.e... |
| 1661 | 9159 | [1, 4] | [1_3, 4_10] | Message-ID: <16962899.1075852653726.JavaMail.e... |


In [48]:
list_email = list(df_email.TXT)
list_email = [e.replace("\n", " ").replace("\t", " ").replace("\\", '') for e in list_email]
list_email

hesitated over Mr. Lieberman is that he\'s Jewish. Mr. > Gore decided that was just fine. I think that I have never seen Al Gore do > such an elegant, intelligent and original thing. Well done, Mr. Gore. > I have to tell you, this really does feel like a 1960 moment to me. I was > a > little girl when a Catholic got chosen to run for president, and I had > gathered from the conversation of grownups that You Don\'t Elect Catholics > to the Presidency. When it happened, it\'s hard to describe how exciting > and > moving and idealism-inspiring it was. It gave a lot of people a lot of > joy. > It opened things up more. That was a good thing. So is this. > And because this is such a good thing, I hope everyone of whatever > politics > or persuasion sits back for a few days and feels good about it. Everyone > should be nice and not do any political bashing until . . . Friday. > However, I think it\'s okay and maybe even helpful to note the following. > Network producers are going to decide, 
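
The cleaning step above swaps each newline and tab for a single space, which can leave runs of several spaces in the result. A possible refinement, sketched here with a hypothetical helper `flatten_email` and a made-up sample string, collapses every run of whitespace into one space.

```python
import re

def flatten_email(text):
    """Drop backslashes, then collapse newlines, tabs and repeated spaces into one space."""
    text = text.replace("\\", "")
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample containing a tab, a newline, repeated spaces and a backslash.
sample = "Mr. Gore decided\tthat was\n> just   fine.\\"
cleaned = flatten_email(sample)
print(cleaned)
```

Quote markers such as `>` are kept, as in the original step.
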

# Part 2: Get Contact Information  

In [7]:
df_emailaddress = pd.read_csv(path.joinpath("people.csv"), encoding="ISO-8859-1")
df_emailaddress.head(100)



| | personid | email | name | title | enron | msgsent | msgrec |
|---|---|---|---|---|---|---|---|
| 0 | 2 | mktstathourahead@caiso.com | Market Status: Hour-Ahead/Real-Time | | 0 | | |
| 1 | 3 | marketopsrealtimebeep@caiso.com | CAISO Market Operations - Realtime/BEEP | | 0 | | |
| 2 | 4 | crcommunications@caiso.com | CRCommunications | | 0 | | |
| 3 | 5 | 20participants@caiso.com | ISO Market Participants | | 0 | | |
| 4 | 6 | isoclientrelations@caiso.com | ISO Client Relations | | 0 | | |
| 5 | 7 | kalmeida@caiso.com | Keoni" "Almeida | | 0 | | |
| 6 | 8 | bill.williams@enron.com | Bill Williams III | | 1 | | |
| 7 | 9 | dcarter@allmort.com | Dawn Carter | | 0 | | |
| 8 | 10 | -nikole@excite.com | nikki cole | | 0 | | |
| 9 | 11 | monika.causholli@enron.com | Monika Causholli | | 1 | | |


In [26]:
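import re

def get_firstname_lastname(txt):
    """Split an address such as first.last@org.tld into (firstname, lastname, organization)."""
    firstname = ""
    lastname = ""
    # Capture every chunk that is followed by "." or "@"; the trailing TLD is not captured.
    nm = re.findall(r"(.+?)(?:\.|@)", txt)
    if not nm:
        # No "." or "@" in the input, so nothing can be extracted.
        return firstname, lastname, ""
    # The last captured chunk is the label before the TLD, e.g. "enron".
    organization = nm[-1]
    # With three or more chunks, the first is the first name and the rest
    # (minus the organization) form the last name.
    if len(nm) >= 3:
        nm = nm[:-1]
        firstname = nm[0]
        lastname = " ".join(nm[1:])
    return firstname, lastname, organization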
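
The function above treats every dot-separated chunk before the final domain label as part of the name, so a subdomain (for example `mail` in `a.b@mail.enron.com`) would end up in the last name. An alternative sketch, using a hypothetical helper `split_address`, separates the local part from the domain first. It reproduces the results shown below for the sample addresses but is not the notebook's own method.

```python
def split_address(address):
    """Return (firstname, lastname, organization) from 'first.last@org.tld'."""
    local, _, domain = address.partition("@")
    name_parts = local.split(".")
    domain_parts = domain.split(".")
    # Organization: the label just before the top-level domain, if there is one.
    organization = domain_parts[-2] if len(domain_parts) >= 2 else domain
    if len(name_parts) >= 2:
        return name_parts[0], " ".join(name_parts[1:]), organization
    # A local part without dots gives no first/last name split.
    return "", "", organization

for addr in ["bill.williams@enron.com", "kalmeida@caiso.com", "a.b@mail.enron.com"]:
    print(addr, split_address(addr))
```
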


In [27]:
firstnames = []
lastnames = []
organizations = []

for email_addr in df_emailaddress['email']:
    fn, ln, org = get_firstname_lastname(email_addr)
    firstnames.append(fn)
    lastnames.append(ln)
    organizations.append(org)

In [28]:
df_contact = pd.DataFrame({"DetailID":df_emailaddress.personid, "FNAME": firstnames, "LNAME": lastnames, "PROV": "", "EMAIL":df_emailaddress.email, "TEL1":"", "ORGANIZATION":organizations})

In [29]:
df_contact

| | DetailID | FNAME | LNAME | PROV | EMAIL | TEL1 | ORGANIZATION |
|---|---|---|---|---|---|---|---|
| 0 | 2 | | | | mktstathourahead@caiso.com | | caiso |
| 1 | 3 | | | | marketopsrealtimebeep@caiso.com | | caiso |
| 2 | 4 | | | | crcommunications@caiso.com | | caiso |
| 3 | 5 | | | | 20participants@caiso.com | | caiso |
| 4 | 6 | | | | isoclientrelations@caiso.com | | caiso |
| 5 | 7 | | | | kalmeida@caiso.com | | caiso |
| 6 | 8 | bill | williams | | bill.williams@enron.com | | enron |
| 7 | 9 | | | | dcarter@allmort.com | | allmort |
| 8 | 10 | | | | -nikole@excite.com | | excite |
| 9 | 11 | monika | causholli | | monika.causholli@enron.com | | enron |
