# Finding emails relevant to meetings in Enron dataset

Plan:

1. Read Enron dataset
    - Read raw messages, parse them and convert into easily accessible format
    - Split meetings and emails
    - Exploratory data analysis
2. L1 ranking: find candidate emails (up to 200) to each meeting
    - Build document frequency for each word in email
    - Build text query for each meeting (up to 5 most important words from subject + body)
    - Filter emails by text query or by people intersection
    - Replicate same idea using PyTerrier
3. L2 ranking: rank candidate emails using OpenAI GPT-3
    - Send data to OpenAI GPT-3 for evaluation

Stretch goals:

- Process data using PySpark


In [1]:
import email
from email import policy
import bs4
import pickle
import openai
import spacy
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
from collections import Counter, defaultdict
import json
import os
import re

In [2]:
# Create new `pandas` methods which use `tqdm` progress
# https://stackoverflow.com/questions/18603270/progress-indicator-during-pandas-operations
# https://pypi.org/project/tqdm/#pandas-integration
tqdm.pandas()

In [36]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.0.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
import pyterrier as pt
if not pt.started():
    pt.init()

PyTerrier 0.5.0 has loaded Terrier 5.4 (built by craigm on 2021-01-16 14:17)


## 0. Helper functions

Tokenization references:

- https://towardsdatascience.com/benchmarking-python-nlp-tokenizers-3ac4735100c5
- https://stackoverflow.com/questions/48691087/which-is-the-fastest-tokenization-function-in-python-3
- https://realpython.com/natural-language-processing-spacy-python/

In [3]:
sample_text = "John:\n?\nI'm not really sure what happened between us.? I was  under the impression \nafter my visit to Houston that we were about to enter into  a trial agreement \nfor my advisory work.? Somehow,?this never  occurred.? Did I say or do \nsomething wrong to screw this  up???\n?\nI don't know if you've blown this whole thing off, but I still  hope you are \ninterested in trying?to create an arrangement.? As a  courtesy, here is my \nreport from this past weekend.? If you are no longer  interested in my work, \nplease tell me so.??Best wishes,\n?\nMark Sagel\nPsytech Analytics\n(410)308-0245? \n - energy2000-1112.doc"

In [4]:
nlp = spacy.load("en_core_web_sm")
def tokenize_spacy(text):
    doc = nlp(text)
    return [token.lemma_ for token in doc]

In [5]:
" ".join(tokenize_spacy(sample_text))

"John : \n ? \n I be not really sure what happen between we . ? I be   under the impression \n after my visit to Houston that we be about to enter into   a trial agreement \n for my advisory work . ? somehow,?this never   occur . ? do I say or do \n something wrong to screw this   up ? ? ? \n ? \n I do n't know if you 've blow this whole thing off , but I still   hope you be \n interested in trying?to create an arrangement . ? as a   courtesy , here be my \n report from this past weekend . ? if you be no long   interested in my work , \n please tell I so.??Best wish , \n ? \n Mark Sagel \n Psytech Analytics \n ( 410)308 - 0245 ? \n  - energy2000-1112.doc"

In [6]:
t_spacy = %timeit -o tokenize_spacy(sample_text)

34 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [7]:
print(f"tokenize_spacy could take about {(500_000 * t_spacy.average / 60 / 60):.2f} hours")

tokenize_spacy could take about 4.72 hours


In [8]:
WORD = re.compile(r'\w+')
def tokenize_regexp(text):
    return WORD.findall(text)

In [9]:
" ".join(tokenize_regexp(sample_text))

'John I m not really sure what happened between us I was under the impression after my visit to Houston that we were about to enter into a trial agreement for my advisory work Somehow this never occurred Did I say or do something wrong to screw this up I don t know if you ve blown this whole thing off but I still hope you are interested in trying to create an arrangement As a courtesy here is my report from this past weekend If you are no longer interested in my work please tell me so Best wishes Mark Sagel Psytech Analytics 410 308 0245 energy2000 1112 doc'

In [10]:
t_regexp = %timeit -o tokenize_regexp(sample_text)

41.6 µs ± 1.93 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [11]:
print(f"tokenize_regexp could take about {(500_000 * t_regexp.average / 60 / 60):.2f} hours")

tokenize_regexp could take about 0.01 hours


In [12]:
# Choosing regexp tokenization for this moment, since it is faster
tokenize = tokenize_regexp

## 1. Read Enron dataset

In [14]:
def extract_msg_body(msg):
    body = msg.get_body()
    if not body['content-type']:
        return body.get_content()
    if body['content-type'].maintype == 'text':
        if body['content-type'].subtype == 'plain':
            return body.get_content()
        elif body['content-type'].subtype == 'html':
            content = DOCTYPE_REGEX.sub("", body.get_content())
            soup = bs4.BeautifulSoup(content, 'lxml')
            [s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
            return soup.get_text()
    return ""

def parse_message(msg):
    return {
        'ID': msg['Message-ID'],
        'Date': msg['Date'],
        'Subject': msg['Subject'],
        'To': msg['To'],
        'From': msg['From'],
        'Cc': msg['Cc'],
        'Bcc': msg['Bcc'],
        'X-To': msg['X-To'],
        'X-From': msg['X-From'],
        'X-Cc': msg['X-cc'],
        'X-Bcc': msg['X-bcc'],
        'Body': extract_msg_body(msg)
    }

In [15]:
enron_input_path = r"/Users/anton/Downloads/maildir"
enron_processed_path = r"/Users/anton/datasets/enron.jsonl"

In [16]:
def preprocess_enron_data():
    print("Counting number of messages in Enron dataset...")
    msg_count = sum([len(files) for _, _, files in os.walk(enron_input_path)])
    print("Found {} messages".format(msg_count))

    print("Reading messages in Enron dataset...")
    with open(enron_processed_path, "w", encoding="utf-8") as output_file:
        with tqdm(total=msg_count) as pbar:
            for root, _, files in os.walk(enron_input_path):
                rel_dir = os.path.relpath(root, enron_input_path)
                pbar.set_description(f'Processing "{rel_dir}"')
                for file_name in files:
                    rel_file = os.path.join(rel_dir, file_name)

                    with open(os.path.join(enron_input_path, rel_file), 'rb') as f:
                        try:
                            msg = email.parser.BytesParser(policy=policy.default).parse(f)
                            parsed_msg = parse_message(msg)
                        except Exception as e:
                            pbar.write(f'Ignoring {rel_file}, parsing failed with exception: {e}')

                        if not parsed_msg['ID']:
                            pbar.write(f'Ignoring {rel_file}, does not look like proper a message')
                        else:
                            parsed_msg['File'] = rel_file
                            output_file.write(json.dumps(parsed_msg) + "\n")

                    pbar.update(1)
    print("Done")

In [28]:
preprocess_enron_data()

Counting number of messages in Enron dataset...
Found 517404 messages
Reading messages in Enron dataset...


  0%|          | 0/517404 [00:00<?, ?it/s]

Ignoring ./.DS_Store, does not look like proper a message
Ignoring kitchen-l/sent_items/24., parsing failed with exception: 'ValueTerminal' object does not support item assignment
Ignoring kitchen-l/sent_items/29., parsing failed with exception: 'ValueTerminal' object does not support item assignment
Ignoring kitchen-l/sent_items/20., parsing failed with exception: 'ValueTerminal' object does not support item assignment
Ignoring kitchen-l/_americas/netco_eol/83., parsing failed with exception: 'ValueTerminal' object does not support item assignment
Ignoring kitchen-l/_americas/netco_eol/82., parsing failed with exception: 'ValueTerminal' object does not support item assignment
Ignoring kitchen-l/_americas/netco_eol/1., parsing failed with exception: 'ValueTerminal' object does not support item assignment
Ignoring kitchen-l/_americas/esvl/87., parsing failed with exception: 'ValueTerminal' object does not support item assignment
Ignoring kitchen-l/_americas/netco_restart/3., parsing fai

In [30]:
df = pd.read_json(enron_processed_path, lines=True)
df.to_parquet(r"/Users/anton/datasets/enron.parquet")

In [13]:
df = pd.read_parquet(r"/Users/anton/datasets/enron.parquet")

In [14]:
print(f"Total number of messages: {df.shape[0]}")

Total number of messages: 517401


In [19]:
df.memory_usage(deep=True)

Index            128
ID          52603300
Date        45531288
Subject     44153383
To         103506888
From        41142712
Cc          29974977
Bcc         29974977
X-To       133026209
X-From      44442913
X-Cc        46754093
X-Bcc       29634964
Body       977529990
File        42845664
dtype: int64

In [18]:
df.head()

Unnamed: 0,ID,Date,Subject,To,From,Cc,Bcc,X-To,X-From,X-Cc,X-Bcc,Body,File
0,<17334447.1075857585446.JavaMail.evans@thyme>,"Thu, 16 Nov 2000 09:30:00 -0800",Status,jarnold@enron.com,msagel@home.com,,,"""John Arnold"" <jarnold@enron.com>","""Mark Sagel"" <msagel@home.com>",,,John:\n?\nI'm not really sure what happened be...,arnold-j/notes_inbox/36.
1,<19171686.1075857585034.JavaMail.evans@thyme>,"Fri, 08 Dec 2000 05:05:00 -0800",re:summer inverses,john.arnold@enron.com,slafontaine@globalp.com,,,John.Arnold@enron.com,slafontaine@globalp.com,,,i suck-hope youve made more money in natgas la...,arnold-j/notes_inbox/19.
2,<29887033.1075857630725.JavaMail.evans@thyme>,"Tue, 15 May 2001 09:43:00 -0700",The WTI Bullet swap contracts,"icehelpdesk@intcx.com, internalmarketing@intcx...",iceoperations@intcx.com,,,"**ICEHELPDESK <**ICEHELPDESK@intcx.com>, **Int...",ICE Operations <ICEOperations@intcx.com>,,,"Hi,\n\n\n Following the e-mail you have rece...",arnold-j/notes_inbox/50.
3,<29084893.1075849630138.JavaMail.evans@thyme>,"Mon, 27 Nov 2000 01:49:00 -0800",Invitation: EBS/GSS Meeting w/Bristol Babcock ...,"anthony.gilmore@enron.com, colleen.koenig@enro...",jeff.youngflesh@enron.com,,,"Anthony Gilmore, Colleen Koenig, Jennifer Stew...",Jeff Youngflesh,,,Conference Room TBD. \n\nThis meeting will be...,arnold-j/notes_inbox/3.
4,<30248874.1075857584813.JavaMail.evans@thyme>,"Tue, 12 Dec 2000 09:33:00 -0800",Harvard Mgmt,mike.grigsby@enron.com,caroline.abramo@enron.com,john.arnold@enron.com,john.arnold@enron.com,Mike Grigsby,Caroline Abramo,John Arnold,,Mike- I have their trader coming into the offi...,arnold-j/notes_inbox/9.


In [15]:
filter = df.File.str.contains("calendar")
df_meetings = df[filter]
df_emails = df[~filter]

In [16]:
print(f"Number of meetings: {df_meetings.shape[0]}")
print(f"Number of emails: {df_emails.shape[0]}")
print(f"Emails / Meetings ratio: {df_emails.shape[0] / df_meetings.shape[0]}")

Number of meetings: 6232
Number of emails: 511169
Emails / Meetings ratio: 82.02326700898588


## 2. L1 ranking: find candidate emails (up to 200) to each meeting
### 2.1. Build document frequency for each word in email

In [22]:
doc_freq = Counter()
for column in ['Subject', 'Body']:
    for text in tqdm(df[column]):
        doc_freq.update(set(tokenize(text)))

  0%|          | 0/517401 [00:00<?, ?it/s]

  0%|          | 0/517401 [00:00<?, ?it/s]

In [17]:
enron_df_path = r"/Users/anton/datasets/enron_df.p"

In [23]:
pickle.dump(doc_freq, open(enron_df_path, "wb"))
print(os.path.getsize(enron_df_path))

18254379


In [18]:
doc_freq = pickle.load(open(enron_df_path, "rb"))

In [19]:
len(doc_freq)

698155

In [24]:
doc_freq.most_common(20)  # most common words

[('the', 423056),
 ('to', 413287),
 ('and', 368052),
 ('for', 363579),
 ('of', 345295),
 ('on', 331826),
 ('you', 326065),
 ('a', 321187),
 ('is', 305843),
 ('in', 305037),
 ('I', 274604),
 ('To', 256965),
 ('have', 253450),
 ('this', 252973),
 ('be', 250607),
 ('that', 244642),
 ('with', 238975),
 ('by', 225529),
 ('s', 225062),
 ('Subject', 224242)]

In [27]:
doc_freq.most_common()[:-20-1:-1]   # least common words

[('Correctable', 1),
 ('PACIFICORP_OU', 1),
 ('emis_transmission_scheduling', 1),
 ('EMISOASIS', 1),
 ('SCHDOASIS', 1),
 ('Robert_L', 1),
 ('singlebus', 1),
 ('ARef', 1),
 ('ETags', 1),
 ('SLC_CN', 1),
 ('Larocco', 1),
 ('40wscc', 1),
 ('ISASarians', 1),
 ('WON_Security_Policy', 1),
 ('01CL', 1),
 ('01RL', 1),
 ('ATC_posting_background_10', 1),
 ('ATCmethodolgy_10', 1),
 ('VancouverMinutes', 1),
 ('ESIBW', 1)]

In [22]:
subject = df_meetings.iloc[0].Subject
body = df_meetings.iloc[0].Body
print(subject)
print(body)
Counter({word: doc_freq[word] for word in tokenize(subject + " " + body)}).most_common()[:-11:-1]

Mtg: Budget Meeting - Whalley
Greg Whalley will be holding a budget meeting on Tuesday, Feb. 6 @ 3:00 p.m. in EB3321.  This invitation is extended to all CEOs and/or COOs reporting to Wholesale Services.  Should you have any questions, please contact me immediately.  

Confirmed Attendees:
Louise Kitchen
Dave Delainey
Jeff McMahon
Jeff Shankman
Jim Hughes
Wes Colwell



[('COOs', 16),
 ('EB3321', 258),
 ('Confirmed', 534),
 ('Attendees', 981),
 ('CEOs', 1014),
 ('Colwell', 1780),
 ('Wes', 2308),
 ('Budget', 2327),
 ('Mtg', 2834),
 ('McMahon', 2923)]

In [25]:
pd.DataFrame([{"Word": word, "Freq": doc_freq[word]} for word in tokenize(subject + " " + body)]).sort_values(by="Freq")

Unnamed: 0,Word,Freq
31,COOs,16
21,EB3321,258
45,Confirmed,534
46,Attendees,981
28,CEOs,1014
58,Colwell,1780
57,Wes,2308
1,Budget,2327
0,Mtg,2834
52,McMahon,2923


In [20]:
MAX_DOC_FREQ = 100

postings = defaultdict(list)

for index, email in tqdm(df_emails.iterrows(), total=df_emails.shape[0]):
    email_words = set(tokenize(email.Subject + " " + email.Body))
    for word in email_words:
        if doc_freq[word] < MAX_DOC_FREQ:
            postings[word].append(index)

  0%|          | 0/511169 [00:00<?, ?it/s]

In [41]:
def get_query(meeting):
    return tuple({word: (1 / (doc_freq[word] or 1)) for word in tokenize(meeting[column])} for column in ['Subject', 'Body'])

df_meetings_with_query = df_meetings.assign(Query = df_meetings.apply(lambda meeting: get_query(meeting), axis=1))

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

ERROR! Session/line number was not unique in database. History logging moved to new session 89
Traceback (most recent call last):
  File "/Users/anton/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 98, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index_class_helper.pxi", line 89, in pandas._libs.index.Int64Engine._check_type
KeyError: 'Subject'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  

In [23]:
EMAILS_PER_MEETING = 200

def get_email_to_score(query):
    email_to_score = defaultdict(float)
    for word in tokenize(query):
        for email in postings.get(word, []):
            email_to_score[email] += 1 / (doc_freq[word] or 1)
    return email_to_score

subject_query_emails = []
body_query_emails = []
for index, meeting in tqdm(df_meetings.iterrows(), total=df_meetings.shape[0]):
    subject_query_emails.append(len(get_email_to_score(meeting.Subject)))
    body_query_emails.append(len(get_email_to_score(meeting.Body)))

  0%|          | 0/6232 [00:00<?, ?it/s]

In [26]:
(sum(subject_query_emails) + sum(body_query_emails)) / df_meetings.shape[0]

194.44608472400515

In [29]:
EMAILS_PER_MEETING = 200

def compute_score(query, email_words):
    #score = sum([(query.get(word) or 0) for word in email_words])
    query_keys = query.keys()
    common_words = email_words.intersection(query_keys)
    score = sum([query.get(word) for word in common_words])
    return score

# for meeting in df_meetings.iterrows():
#     meeting_subject = meeting.Subject
#     meeting_body = meeting.Body

#     scores = [(compute_score(meeting_subject, email), compute_score(meeting_body, email)) for email in df_emails.iterrows()]
#     numpy.argsort(scores)[:EMAILS_PER_MEETING]

def test():
    for index, email in tqdm(df_emails.head(10).iterrows(), total=10):
        email_words = set(tokenize(email.Subject + " " + email.Body))
        scores = [(compute_score(meeting.Query[0], email_words), compute_score(meeting.Query[1], email_words)) for _, meeting in df_meetings_with_query.iterrows()]

    # meeting_subject = df_meetings.iloc[0].Subject
    # meeting_body = df_meetings.iloc[0].Body

    # scores = [(index, compute_score(meeting_subject, email), compute_score(meeting_body, email)) for index, email in ]
    # sorted(scores, key=lambda element: (element[1], element[2]), reverse=True)
    # return None

#numpy.argsort(scores)[:EMAILS_PER_MEETING]

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
  File "/Users/anton/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-29-64290846851a>", line 25, in <module>
    scores = [(index, compute_score(meeting_subject, email), compute_score(meeting_body, email)) for index, email in df_emails.iterrows()]
  File "<ipython-input-29-64290846851a>", line 25, in <listcomp>
    scores = [(index, compute_score(meeting_subject, email), compute_score(meeting_body, email)) for index, email in df_emails.iterrows()]
  File "<ipython-input-29-64290846851a>", line 8, in compute_score
   

In [60]:
test()

  0%|          | 0/10 [00:00<?, ?it/s]

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

{'blown', 'impression', 'create', 'enter', 'interested', 'no', 'so', 'agreement', 've', 'do', 'hope', 'from', 'what', 'in', 'between', 'Psytech', 'here', 'If', 'past', 'for', 'or', 'still', 'into', 'wrong', 'under', 'Analytics', 'screw', 'were', 'wishes', 'whole', 'a', 'John', 'weekend', 'are', 'was', 'don', 'Somehow', 'say', 'you', 'sure', 'know', 'but', 'my', 'trial', 'if', 'off', '1112', 'longer', 'work', 'not', 'Best', '0245', 'we', 't', 'arrangement', 'visit', 'something', 'Did', 'Mark', 'report', 'happened', 'that', 'courtesy', 'to', 'Sagel', 'doc', 'advisory', 'after', 'this', 'is', 'm', 'really', '308', 'As', '410', 'I', 'occurred', 'Houston', 'us', 'about', 'me', '

In [55]:
500_000/60/60

138.88888888888889

In [63]:
%lprun -f compute_score test()

  0%|          | 0/10 [00:00<?, ?it/s]

Timer unit: 1e-06 s

Total time: 1.5725 s
File: <ipython-input-62-298096d3f635>
Function: compute_score at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
     3                                           def compute_score(query, email_words):
     4                                               #score = sum([(query.get(word) or 0) for word in email_words])
     5    124640     152213.0      1.2      9.7      query_keys = query.keys()
     6    124640     643617.0      5.2     40.9      common_words = email_words.intersection(query_keys)
     7    124640     708258.0      5.7     45.0      score = sum([query.get(word) for word in common_words])
     8    124640      68416.0      0.5      4.4      return score

In [3]:
import pandas as pd

# lets not truncate output too much
pd.set_option('display.max_colwidth', 150)

docs_df = pd.DataFrame([
        ["d1", "this is the first document of many documents"],
        ["d2", "this is another document"],
        ["d3", "the topic of this document is unknown"]
    ], columns=["docno", "text"])

docs_df

Unnamed: 0,docno,text
0,d1,this is the first document of many documents
1,d2,this is another document
2,d3,the topic of this document is unknown


In [4]:
indexer = pt.DFIndexer("/Users/Anton/datasets/index_3docs", overwrite=True)
index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index_ref.toString()

'/Users/Anton/datasets/index_3docs/data.properties'

In [5]:
!ls -al /Users/Anton/datasets/index_3docs

total 72
drwxr-xr-x  10 anton  staff   320 Apr  5 17:22 [1m[36m.[m[m
drwxr-xr-x   5 anton  staff   160 Apr  5 17:22 [1m[36m..[m[m
-rw-r--r--   1 anton  staff     3 Apr  5 17:22 data.direct.bf
-rw-r--r--   1 anton  staff    51 Apr  5 17:22 data.document.fsarrayfile
-rw-r--r--   1 anton  staff     4 Apr  5 17:22 data.inverted.bf
-rw-r--r--   1 anton  staff   344 Apr  5 17:22 data.lexicon.fsomapfile
-rw-r--r--   1 anton  staff   249 Apr  5 17:22 data.lexicon.fsomaphash
-rw-r--r--   1 anton  staff    24 Apr  5 17:22 data.meta.idx
-rw-r--r--   1 anton  staff    36 Apr  5 17:22 data.meta.zdata
-rw-r--r--   1 anton  staff  4193 Apr  5 17:22 data.properties


In [6]:
index = pt.IndexFactory.of(index_ref)

#lets see what type index is.
type(index)

jnius.reflect.org.terrier.structures.Index

In [7]:
br = pt.BatchRetrieve(index, wmodel="Tf")
br.search("document")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,0,d1,0,2.0,document
1,1,2,d3,1,1.0,document
2,1,1,d2,2,1.0,document


In [59]:
with open("/Users/anton/openai_api_key.txt", "r") as f:
    openai.api_key = f.read().strip()

3313


In [6]:
mail_files[150]

'hyatt-k/deleted_items/atoka_lateral/3.'

In [28]:
calendar_messages = []

for user in os.listdir(enron_path):
    calendar_dir = os.path.join(enron_path, user, "calendar")
    if os.path.isdir(calendar_dir):
        for root, _, files in os.walk(calendar_dir):
            for file_name in files:
                f = os.path.join(root, file_name)
                with open(os.path.join(calendar_dir, f), 'rb') as file:
                    msg = email.parser.BytesParser(policy=policy.default).parse(file)
                    calendar_messages.append(parse_message(msg))

print(len(calendar_messages))

6133


In [29]:
df_calendar = pd.DataFrame(calendar_messages)

In [30]:
calendar_messages[0]

{'Date': 'Wed, 21 Feb 2001 12:39:49 -0800',
 'Subject': 'Mtg: Budget Meeting - Whalley',
 'To': None,
 'From': 'lavorato@enron.com',
 'Cc': None,
 'Bcc': None,
 'X-To': '',
 'X-From': 'Lavorato, John',
 'X-Cc': '',
 'X-Bcc': '',
 'Body': 'Greg Whalley will be holding a budget meeting on Tuesday, Feb. 6 @ 3:00 p.m. in EB3321.  This invitation is extended to all CEOs and/or COOs reporting to Wholesale Services.  Should you have any questions, please contact me immediately.  \n\nConfirmed Attendees:\nLouise Kitchen\nDave Delainey\nJeff McMahon\nJeff Shankman\nJim Hughes\nWes Colwell\n'}

In [31]:
df_calendar

Unnamed: 0,Date,Subject,To,From,Cc,Bcc,X-To,X-From,X-Cc,X-Bcc,Body
0,"Wed, 21 Feb 2001 12:39:49 -0800",Mtg: Budget Meeting - Whalley,,lavorato@enron.com,,,,"Lavorato, John",,,"Greg Whalley will be holding a budget meeting on Tuesday, Feb. 6 @ 3:00 p.m. in EB3321. This invitation is extended to all CEOs and/or COOs repor..."
1,"Thu, 14 Jun 2001 13:16:25 -0700",Lunch: Dan Leff,,kimberly.hillis@enron.com,,,,"Hillis, Kimberly </O=ENRON/OU=NA/CN=RECIPIENTS/CN=KHILLIS>",,,Esmerelda Hinojosa - x57390
2,"Thu, 26 Jul 2001 07:05:22 -0700",Mtg: Heath Schiesser - Xcellorator,,kimberly.hillis@enron.com,,,,"Hillis, Kimberly </O=ENRON/OU=NA/CN=RECIPIENTS/CN=KHILLIS>",,,Annitta Granado - 39724
3,"Wed, 23 May 2001 12:08:24 -0700",Mtg: Wes Colwell - Flash to Actual,,hillis@enron.com,,,,"Hillis, Kimberly",,,"\n\n -----Original Message-----\nFrom: \tTijerina, Shirley \nSent:\tWednesday, May 23, 2001 2:34 PM\nTo:\tBlack, Don; Ruffer, Mary Lynne; Herndon..."
4,"Wed, 21 Mar 2001 06:16:25 -0800",Golf: With Greg Whalley (Noon T-time),,hillis@enron.com,,,,"Hillis, Kimberly",,,"Kim,\n\nJust blk off from 11:00a.m. til the rest of the day. Memorial Park Golf Course. (Florida Scramble) I forward more details later. \nNoo..."
...,...,...,...,...,...,...,...,...,...,...,...
6128,"Mon, 05 Nov 2001 11:24:29 -0800",Air Products Interruptible Proposal,"mike.curry@enron.com, doug.gilbert-smith@enron.com, m..forney@enron.com, l..day@enron.com, madhup.kumar@enron.com, jeffrey.miller@enron.com, judy....",judy.martinez@enron.com,,,"Curry, Mike </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Mcurry>, Gilbert-smith, Doug </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Dsmith3>, Forney, John M. </O=ENRON/OU=...","Martinez, Judy </O=ENRON/OU=NA/CN=RECIPIENTS/CN=JMARTIN2>",,,"When: Wednesday, November 07, 2001 9:00 AM-10:00 AM (GMT-06:00) Central Time (US & Canada).\nWhere: EB - 3143B\n\n*~*~*~*~*~*~*~*~*~*\n"
6129,"Thu, 18 Oct 2001 16:35:17 -0700",Action Required: Attend West Power Staff Meeting,"debra.davidson@enron.com, tim.belden@enron.com, tom.alonso@enron.com, mark.fischer@enron.com, sean.crandall@enron.com, robert.badeer@enron.com, h....",debra.davidson@enron.com,anna.mehrer@enron.com,anna.mehrer@enron.com,"Davidson, Debra </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Ddavids3>, Belden, Tim </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Tbelden>, Alonso, Tom </O=ENRON/OU=NA/CN=...","Davidson, Debra </O=ENRON/OU=NA/CN=RECIPIENTS/CN=DDAVIDS3>","Mehrer, Anna </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Amehrer>",,"Please make plans to attend a West Power Trading luncheon staff meeting.\n\nDate: Monday, October 22\nPlace: Mt. Hood Conference Room\nTime: 11..."
6130,"Mon, 01 Oct 2001 12:49:10 -0700",Revised: Randy Hardy,"anna.mehrer@enron.com, tim.belden@enron.com, sean.crandall@enron.com, diana.scholtes@enron.com, mike.swerzbin@enron.com, h..foster@enron.com, paul...",anna.mehrer@enron.com,debra.davidson@enron.com,debra.davidson@enron.com,"Mehrer, Anna </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Amehrer>, Belden, Tim </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Tbelden>, Crandall, Sean </O=ENRON/OU=NA/CN=R...","Mehrer, Anna </O=ENRON/OU=NA/CN=RECIPIENTS/CN=AMEHRER>","Davidson, Debra </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Ddavids3>",,This has replaced the previously scheduled meeting.
6131,"Mon, 01 Oct 2001 12:49:10 -0700",Revised: Randy Hardy,"anna.mehrer@enron.com, tim.belden@enron.com, sean.crandall@enron.com, diana.scholtes@enron.com, mike.swerzbin@enron.com, h..foster@enron.com, paul...",anna.mehrer@enron.com,debra.davidson@enron.com,debra.davidson@enron.com,"Mehrer, Anna </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Amehrer>, Belden, Tim </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Tbelden>, Crandall, Sean </O=ENRON/OU=NA/CN=R...","Mehrer, Anna </O=ENRON/OU=NA/CN=RECIPIENTS/CN=AMEHRER>","Davidson, Debra </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Ddavids3>",,This has replaced the previously scheduled meeting.


In [33]:
df_calendar.Subject.nunique()

5154

In [16]:
print(msg)

Message-ID: <21156393.1075857700266.JavaMail.evans@thyme>
Date: Wed, 21 Feb 2001 12:39:49 -0800 (PST)
From: lavorato@enron.com
Subject: Mtg: Budget Meeting - Whalley
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Lavorato, John
X-To: 
X-cc: 
X-bcc: 
X-Folder: \jlavora\Calendar
X-Origin: Lavorado-J
X-FileName: jlavora.pst

Greg Whalley will be holding a budget meeting on Tuesday, Feb. 6 @ 3:00 p.m. in EB3321.  This invitation is extended to all CEOs and/or COOs reporting to Wholesale Services.  Should you have any questions, please contact me immediately.  

Confirmed Attendees:
Louise Kitchen
Dave Delainey
Jeff McMahon
Jeff Shankman
Jim Hughes
Wes Colwell



In [11]:
msg.get_body()

<email.message.EmailMessage at 0x7ff32dda4710>

In [11]:
def extract_msg_body(msg):
    body = msg.get_body()
    if not body['content-type']:
        return body.get_content()
    if body['content-type'].maintype == 'text':
        if body['content-type'].subtype == 'plain':
            return body.get_content()
        elif body['content-type'].subtype == 'html':
            content = DOCTYPE_REGEX.sub("", body.get_content())
            soup = bs4.BeautifulSoup(content, 'lxml')
            [s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
            return soup.get_text()
    return ""

In [19]:
extract_msg_body(msg)

'Greg Whalley will be holding a budget meeting on Tuesday, Feb. 6 @ 3:00 p.m. in EB3321.  This invitation is extended to all CEOs and/or COOs reporting to Wholesale Services.  Should you have any questions, please contact me immediately.  \n\nConfirmed Attendees:\nLouise Kitchen\nDave Delainey\nJeff McMahon\nJeff Shankman\nJim Hughes\nWes Colwell\n'

In [20]:
msg['Subject']

'Mtg: Budget Meeting - Whalley'

In [21]:
msg.keys()

['Message-ID',
 'Date',
 'From',
 'Subject',
 'Mime-Version',
 'Content-Type',
 'Content-Transfer-Encoding',
 'X-From',
 'X-To',
 'X-cc',
 'X-bcc',
 'X-Folder',
 'X-Origin',
 'X-FileName']

In [22]:
with open("/Users/anton/Downloads/maildir/campbell-l/sent/7.", 'rb') as file:
    random_msg = email.parser.BytesParser(policy=policy.default).parse(file)

In [24]:
print(random_msg)

Message-ID: <26730414.1075851918003.JavaMail.evans@thyme>
Date: Thu, 15 Jul 1999 09:09:00 -0700 (PDT)
From: larry.campbell@enron.com
To: team.monahans@enron.com
Subject: Groundwater Monitoring to be Discontinued at the Puckett Plant
Cc: william.kendrick@enron.com, michael.terraso@enron.com, 
	butch.russell@enron.com, rich.jolly@enron.com, bob.bandel@enron.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Bcc: william.kendrick@enron.com, michael.terraso@enron.com, 
	butch.russell@enron.com, rich.jolly@enron.com, bob.bandel@enron.com
X-From: Larry Campbell
X-To: Team Monahans
X-cc: William Kendrick, Michael Terraso, Butch Russell, Rich Jolly, Bob Bandel
X-bcc: 
X-Folder: \Larry_Campbell_Nov2001_1\Notes Folders\Sent
X-Origin: CAMPBELL-L
X-FileName: lcampbe.nsf

Approval was received from the Texas Railroad Commission to discontinue 
groundwater monitoring at the Pucket Plant.  As you may remember, 
Transwestern constructed a permitted landfil

In [8]:
with open("/Users/anton/Downloads/maildir/campbell-l/importantstuff/5.", 'rb') as file:
    random_msg2 = email.parser.BytesParser(policy=policy.default).parse(file)

In [9]:
print(random_msg2)

Message-ID: <17123273.1075857873789.JavaMail.evans@thyme>
Date: Tue, 5 Sep 2000 04:34:00 -0700 (PDT)
From: kayne.coulter@enron.com
To: larry.jester@enron.com, jay.wills@enron.com, cyril.price@enron.com, 
	john.kinser@enron.com, rudy.acevedo@enron.com, 
	richard.hrabal@enron.com, wayne.herndon@enron.com, 
	jason.choate@enron.com, juan.hernandez@enron.com, 
	greg.trefz@enron.com, miguel.garcia@enron.com, 
	russell.ballato@enron.com, joe.stepenovitch@enron.com, 
	joe.errigo@enron.com, doug.miller@enron.com, 
	larry.campbell@enron.com, juan.hernandez@enron.com, 
	keller.mayeaux@enron.com, chad.starnes@enron.com, 
	dean.laurent@enron.com, don.baughman@enron.com, 
	lawrence.clayton@enron.com
Subject: NERC Training schedule
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Kayne Coulter
X-To: Larry Jester, Jay Wills, Cyril Price, John Kinser, Rudy Acevedo, Richard
 Hrabal, Wayne Herndon, Jason Choate, Juan Hernandez, Greg Trefz, Miguel
 Garci

In [26]:
len(pickle.dumps(random_msg2))

2373

In [12]:
def parse_message(msg):
    return {
        'ID': msg['Message-ID'],
        'Date': msg['Date'],
        'Subject': msg['Subject'],
        'To': msg['To'],
        'From': msg['From'],
        'Cc': msg['Cc'],
        'Bcc': msg['Bcc'],
        'X-To': msg['X-To'],
        'X-From': msg['X-From'],
        'X-Cc': msg['X-cc'],
        'X-Bcc': msg['X-bcc'],
        'Body': extract_msg_body(msg)
    }

{'Date': 'Tue, 05 Sep 2000 04:34:00 -0700', 'Subject': 'NERC Training schedule', 'To': 'larry.jester@enron.com, jay.wills@enron.com, cyril.price@enron.com, john.kinser@enron.com, rudy.acevedo@enron.com, richard.hrabal@enron.com, wayne.herndon@enron.com, jason.choate@enron.com, juan.hernandez@enron.com, greg.trefz@enron.com, miguel.garcia@enron.com, russell.ballato@enron.com, joe.stepenovitch@enron.com, joe.errigo@enron.com, doug.miller@enron.com, larry.campbell@enron.com, juan.hernandez@enron.com, keller.mayeaux@enron.com, chad.starnes@enron.com, dean.laurent@enron.com, don.baughman@enron.com, lawrence.clayton@enron.com', 'From': 'kayne.coulter@enron.com', 'Cc': None, 'Bcc': None, 'X-To': 'Larry Jester, Jay Wills, Cyril Price, John Kinser, Rudy Acevedo, Richard Hrabal, Wayne Herndon, Jason Choate, Juan Hernandez, Greg Trefz, Miguel Garcia, Russell Ballato, Joe Stepenovitch, Joe Errigo, Doug Miller, Larry F Campbell, Juan Hernandez, Keller Mayeaux, Chad Starnes, Dean Laurent, Don Baughm

In [74]:
def number_of_words(s):
    return len(s.split())

In [75]:
subject = random_msg2['Subject']
body = extract_msg_body(random_msg2)
print(subject, "\n", body, "\n", number_of_words(body))

NERC Training schedule 
 ---------------------- Forwarded by Kayne Coulter/HOU/ECT on 09/05/2000 11:37 
AM ---------------------------


Keith Comeaux
09/05/2000 11:03 AM
To: mitch.robinson@enron.com, kayne.coulter@enron.com
cc: Lloyd Will/HOU/ECT@ECT 
Subject: NERC Training schedule

Mitch attached you will find the schedule for the NERC training . I will need 
room 3109 next to the control room for the weeks listed as classroom 
training. Kayne and Lloyd please share this schedule with all interested 
parties in your group. I will rely on you to have your people attend the 
training as listed or advise me of any problems.
Thanks,
Keith

  
 
 90


In [65]:
def query_openai_search(query, documents):
    return openai.Engine("davinci-msft").search(query=query, documents=documents)

In [82]:
%%time
result = query_openai_search(query=subject + body, documents=[subject + body for i in range(100)])

CPU times: user 8.97 ms, sys: 2.46 ms, total: 11.4 ms
Wall time: 5.22 s


In [79]:
result.data

[<OpenAIObject search_result at 0x7fc74f7b67d8> JSON: {
   "document": 0,
   "object": "search_result",
   "score": 301.315
 }, <OpenAIObject search_result at 0x7fc74f7b64c0> JSON: {
   "document": 1,
   "object": "search_result",
   "score": 301.227
 }, <OpenAIObject search_result at 0x7fc74f7b6728> JSON: {
   "document": 2,
   "object": "search_result",
   "score": 301.359
 }, <OpenAIObject search_result at 0x7fc74f7b6780> JSON: {
   "document": 3,
   "object": "search_result",
   "score": 301.19
 }, <OpenAIObject search_result at 0x7fc74f7b6830> JSON: {
   "document": 4,
   "object": "search_result",
   "score": 301.202
 }, <OpenAIObject search_result at 0x7fc74f7b6888> JSON: {
   "document": 5,
   "object": "search_result",
   "score": 301.301
 }, <OpenAIObject search_result at 0x7fc74f7b6990> JSON: {
   "document": 6,
   "object": "search_result",
   "score": 301.335
 }, <OpenAIObject search_result at 0x7fc74f7b69e8> JSON: {
   "document": 7,
   "object": "search_result",
   "scor