We will implement a set of search strategies on the Enron data set.

There is a database `enron.sqlite` in the same directory as this jupyter notebook.

Connect to it, and make a data frame from the `email_text` column. Use just the "ham"
documents.

In [1]:
import pandas
import sqlite3
conn = sqlite3.connect("enron.sqlite")
df = pandas.read_sql("select email_text from enron where spam_or_ham = 'ham'", conn)
df.sample(5)

Unnamed: 0,email_text
2408,Subject: hpl meter # 987195 tatton central poi...
1753,Subject: cpr pipeline exchange activity report...
1198,Subject: king ranch gas plant - 12 / 2000 elec...
2007,"Subject: duplicates\nsorry for any duplicate ,..."
2873,"Subject: hpl nom for march 17 , 2001\n( see at..."


## Exact substring match

Let's be super-naive first. Create a function that takes a string and a Series
as argument. It should search through the  dataframe looking for it anywhere in the email_text. Have it
return a boolean Series aligned with the dataframe. (i.e. it should return a series
with True if that row was a match, and False if it wasn't). We'll use this 
pattern in other search methods.

In [2]:
def substring_match(series, string_to_find):
    return series.str.contains(string_to_find)

Test it out. Try the word `energy` for example. You can use
`df[your_search_func(df.email_text, "energy")].email_text` to display the results.

In [3]:
df[substring_match(df.email_text,'energy')].email_text

9       Subject: sarco lateral and crow o ' connor met...
29      Subject: energy operations promotions\ni am pl...
35      Subject: revised nom for copano ' s . . . smal...
39      Subject: re : no / actual vols for 5 / 22 / 01...
40      Subject: meter 981594 - san jacinto low pressu...
                              ...                        
3647    Subject: new noms\n- - - - - - - - - - - - - -...
3648    Subject: seacrest meter # 0435 - april , 2001\...
3651    Subject: oct noms\n- - - - - - - - - - - - - -...
3666    Subject: 6 th noms\n- - - - - - - - - - - - - ...
3671    Subject: pennzenergy property details\n- - - -...
Name: email_text, Length: 415, dtype: object

What should we do about ranking?

The more of the email that our search string covers, the more likely it is to be useful.
So that should return a small number. If our search string is a tiny part of the email,
that should return a big number.

Write a function that takes a pandas Series (of emails) and a string and returns the
number of times bigger the email is compared to the string. 

(If the search string has zero length, it's meaningless as a search, so let's ignore that.)

In [4]:
def substring_match_ranking(series, string_to_find):
    return series.str.len() / len(string_to_find)

What should we do about creating a displayable snippet?

Let's break it into two parts. First, a function that takes a 
successfully-matched string, and the string to highlight, 
and returns some Markdown just for that.

For example, if you are searching for `six` in `The sixth sick sheik's sixth sick sheep.`
it should return `The  **six** th sick sheik's  **six** th sick sheep.`

(A really good algorithm would clean up any email that had Markdown elements in it first. But
we won't worry about that too much.)

In [5]:
def highlight_simple_substring(corpus_text, string_to_find):
    return f' **{string_to_find}** '.join(corpus_text.split(string_to_find))

highlight_simple_substring("The sixth sick sheik's sixth sick sheep.", 'six')

"The  **six** th sick sheik's  **six** th sick sheep."

Secondly, let's make a general purpose `snippet_viewer`. It should take
a `ranking_series` and a `snippet_series` and create a Markdown document
out of the snippets, starting with the highest-ranked, working down. Just
show the top three results.

Then it can use IPython.display.Markdown to show the document nicely.

In [6]:
import IPython.display

def snippet_viewer(ranking_series, snippet_series):
    answer = ""
    i = 0
    for idx in ranking_series.sort_values().index:
        snippet = snippet_series.loc[idx]
        answer += f"### Document {idx}\n\n{snippet}\n\n"
        i += 1
        if i == 3:
            break
    return IPython.display.Markdown(answer)
    
snippet_viewer(pandas.Series([1,0,3,2]), 
               pandas.Series(['Middle **search result**', '_Best search result_', '#### Worst search result',
                             'Last result shown']))

### Document 1

_Best search result_

### Document 0

Middle **search result**

### Document 3

Last result shown



------------------

# Testing out what we've done so far

Create a variable for the search term.

In [7]:
search_term = 'cat'

Create a pandas Series which contains only successful search results. 

In [8]:
search_results = df[substring_match(df.email_text, search_term)].email_text

Calculate ranks for each of them.

In [9]:
search_ranking = substring_match_ranking(search_results, search_term)

Create a series containing snippets of the search results (using the highlighter you wrote a moment ago)

In [10]:
search_snippets = search_results.map(lambda x: highlight_simple_substring(x, search_term))

Call the snippet viewer function with your ranking and snippet Series.

In [11]:
snippet_viewer(search_ranking, search_snippets)

### Document 2919

Subject: is this fri feb 11 a problem for taking va **cat** ion ?


### Document 2007

Subject: dupli **cat** es
sorry for any dupli **cat** e , having problems with lotus
notes .
gary green

### Document 2689

Subject: june va **cat** ion
please submit your june va **cat** ion to me asap .
thank you !
yvette
x 3 . 5953



-------

# Improvements

It would be nice to be able to search case-insensitively, to get whole words, and it would be nice to search for multiple terms in an email.

That means we'll need a query interpreter (there was no point in having one in the previous section). Fortunately,
this query interpreter should be pretty simple: take a string, lowercase it, and split it up into words.

We can use the NLTK library (`nltk.word_tokenize()` to split the sentences up into words.

Write the multiword query interpreter.

In [12]:
def multiword_query_interpreter(search_term):
    return [x.lower() for x in nltk.word_tokenize(search_term)]

As part of the indexing (data preparation) step, we'll need to apply the corpus of emails in lowercase
and word-separated as well. Create a new column in the dataframe for this.

The result should look like this:

```
0       [subject, :, tue, ,, 23, mar, 2004, 12, :, 06,...
1       [subject, :, slutty, milf, wants, to, meet, yo...
2       [subject, :, better, s, ., e, ., x, guar, ., a...
3       [subject, :, urgent, message, mr, francis, oma...
4       [subject, :, do, you, feel, safe, as, an, amer...
```

If you're feeling lazy, you could observe that the query interpreter would do the job for you.
(It's not common that the data prep and query interpreter are the same.)

In [13]:
import nltk

df['lowercase_words'] = df.email_text.map(multiword_query_interpreter)
df.lowercase_words

0       [subject, :, hpl, nom, for, may, 4, ,, 2001, (...
1       [subject, :, january, -, meter, 2186, clear, l...
2       [subject, :, revised, :, eastrans, nomination,...
3       [subject, :, re, :, fuel, application, of, the...
4       [subject, :, re, :, nominations, we, agree, ''...
                              ...                        
3667    [subject, :, bayer, -, march, 2001, volumes, j...
3668    [subject, :, revision, #, 1, -, hpl, nom, for,...
3669    [subject, :, re, :, hpl, discrepancy, is, this...
3670    [subject, :, daren, ,, equistar, will, be, bri...
3671    [subject, :, pennzenergy, property, details, -...
Name: lowercase_words, Length: 3672, dtype: object

Now write a function that takes a lower case word and searches through a Series of 
lowercase'd words, reporting True for the elements where it is present, and False otherwise.

So if you search for the word `cat` in this:
```
pandas.Series([['cat', 'banana', 'crab'], 
               ['cat', 'dog', 'elephant'], 
               ['frog', 'goat']]
             )
```
it should return:
```
0     True
1     True
2    False
dtype: bool
```

In [14]:
def exact_word_search(lowercase_wordlist_series, word_to_search_for):
    def word_is_present(where):
        return word_to_search_for in where
    return lowercase_wordlist_series.map(word_is_present)

exact_word_search(pandas.Series([['cat', 'banana', 'crab'], 
                                 ['cat', 'dog', 'elephant'], 
                                 ['frog', 'goat']]), 
                  'cat')

0     True
1     True
2    False
dtype: bool

Now write a function that can take a wordlist series, and a list of lower case words and find places in 
the wordlist series where all of them are present.

Remeber that `&` can be used on a pair of pandas Series objects to do a logical `and` on each
element.

The result of search for `['cat', 'crab']` in this series
```
pandas.Series([['cat', 'banana', 'crab'], 
               ['cat', 'dog', 'elephant'], 
               ['frog', 'goat']]
             )
```
should be
```
0     True
1    False
2    False
dtype: bool
```

In [15]:
def multiword_exact_search(lowercase_wordlist_series, words_to_search_for):
    survivors = pandas.Series(index=lowercase_wordlist_series.index, data=True)
    for word_to_search_for in words_to_search_for:
        survivors = survivors & exact_word_search(lowercase_wordlist_series, word_to_search_for)
    return survivors

multiword_exact_search(pandas.Series([['cat', 'banana', 'crab'], 
                                 ['cat', 'dog', 'elephant'], 
                                 ['frog', 'goat']]), 
                  ['cat', 'crab'])

0     True
1    False
2    False
dtype: bool

There are many ways to rank multi-word searches. Usually the key is to look for documents where
the words appear close together.

For now, let's assume that the words in a multi-word search only appear once each. Then the
standard deviation of their positions in the document captures the idea of closeness --- a
smaller standard deviation puts them closer together and we want smaller to mean "better match".

If there is only one distinct word, the standard deviation isn't well-defined, but
we could use the length of the email divided by the number of occurences as the ranking.
A shorter email that uses that word many times is likely to be a good candidate

In [16]:
def multiword_ranking(lowercase_wordlist_series, words_to_search_for):
    if len(words_to_search_for) == 1:
        return (lowercase_wordlist_series.map(len) /
                lowercase_wordlist_series.map(lambda x: len([t for t in x if t in words_to_search_for])))
    def stddev_ranker(where):
        return pandas.Series([i for (i,w) in enumerate(where) if w in words_to_search_for]).std()
    return lowercase_wordlist_series.map(stddev_ranker)

Create a snippet display function: **bold**-ify any word that is in the search terms. If you feel
like being fancy, you could make sure that _, * and $ are backslash escaped too.

In [17]:
def highlight_multiword_ranking(email_word_list, words_to_search_for):
    def boldify(word):
        if word in words_to_search_for:
            return f"**{word}**"
        else:
            return word.replace('_', '\_').replace('*', '\*').replace('$', '\$')
    return " ".join([boldify(x) for x in email_word_list])

highlight_multiword_ranking(['This', 'is', 'a', 'sample', 'sentence'], ['is', 'sample'])

'This **is** a **sample** sentence'

### Putting it all together.

This will look a lot like the process we did in the previous section.

- Set a variable for the search term (so that you can rerun different searches easily)

- Process that search term using the query interpreter

- Pass the interpreted query to your multi-word exact match search (remember to use the lowercase'd words as the series to view)

- Take the result of that and run it through the ranking algorithm and the snippet maker

- Put those results through the snippet viewer you created in the previous section (it should still work here)

In [18]:
search_term = "charges"
interpreted = multiword_query_interpreter(search_term)
relevant_emails = df.lowercase_words[multiword_exact_search(df.lowercase_words, interpreted)]
ranked_emails = multiword_ranking(relevant_emails, interpreted)
snippets = relevant_emails.map(lambda x: highlight_multiword_ranking(x, interpreted))
snippet_viewer(ranked_emails, snippets)

### Document 1849

subject : defs 2001 i have some changes to the defs deals for 2001 . we need to add demand fees for the over delivery and excess **charges** . i have attached the spreadsheets in case you need them . prod . deal demand fee feb 2001 157278 \$ 11 , 903 . 17 march 2001 157278 \$ 294 . 85 april 2001 229758 \$ 308 . 53 thanks , megan

### Document 1441

subject : defs may 2001 daren : please enter a demand fee on deal 157278 for may 2001 in the amount of \$ 369 . 69 . we need to bill defs for the remaining excess and over delivery **charges** . also , i was going back over my calc sheets and i found an error in oct 2000 . please enter a demand fee for \$ 647 . 35 on deal 157278 for oct 2000 . thanks , megan

### Document 2158

subject : duke exchange deal daren : i have several months that need to have the demand **charges** either added or adjusted . when katherine gave you the numbers the first time , there were some spot deals included in the exchange deals . the volumes have now been moved to the correct deals and the demand **charges** need to be corrected . i have listed the changes below . let me know if you would like to see the spreadsheets . deal 157278 3 / 00 add demand charge of \$ 73 , 403 . 47 for excess charge 4 / 00 change demand fee from \$ 1 , 507 . 56 to \$ 1 , 966 . 93 6 / 00 change demand fee from \$ 1 , 129 . 99 to \$ 359 . 97 deal 157288 3 / 00 change demand fee from \$ 3 , 526 . 98 to \$ 245 . 82 thanks , megan



--------

# Lemmatized search

It should be smart enough to understand *charge* and *charges* are the same thing. For this
we need to *lemmatize* each word. 

The NLTK library has a `nltk.stem.WordNetLemmatizer()` class with a `.lemmatize()` method
we can use for this. Remember that we will need to lemmatize the query term as well as the
corpus of emails.

You might need to download some NLTK models.
```
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
```

In [19]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Create lemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package punkt to /Users/gregb/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/gregb/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/gregb/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Do you data preparation and make your query interpreter.

You will probably find that the search code and ranking code are the same as the
previous sections.

In [20]:
def lemmatizing_query(search_term):
    return [lemmatizer.lemmatize(x).lower() for x in nltk.word_tokenize(search_term)]

df['lemmatized_words'] = df.email_text.map(lemmatizing_query)

The snippet highlighter is trickier. We might need to highlight a word that wasn't exactly the
way it was in the search query. A neat way to resolve this is to find words in the lemmatized
word lists that matched, and then highlight the text in the unlemmatized version.

In [21]:
def higlighter_for_lemmatized_query(unlemmatized_word_list, lemmatized_word_list, terms_to_search_for):
    def boldify(unlemmatized_word, lemmatized_word):
        if lemmatized_word in terms_to_search_for:
            return f"**{unlemmatized_word}**"
        else:
            return unlemmatized_word.replace('_', '\_').replace('*', '\*').replace('$', '\$')
    return " ".join([boldify(x,y) for (x,y) in zip(unlemmatized_word_list,unlemmatized_word_list)])

When you search for `charges` or `charge` you should get the same results.

In [22]:
search_term = "charge"
interpreted = lemmatizing_query(search_term)
search_matches = multiword_exact_search(df.lemmatized_words, interpreted)
relevant_lemmatized_emails = df.lowercase_words[search_matches]
ranked_emails = multiword_ranking(relevant_lemmatized_emails, interpreted)
relevant_unlemmatized_emails = df.lemmatized_words[search_matches]
relevant_emails = pandas.DataFrame({'unlemmatized': relevant_unlemmatized_emails,
                                   'lemmatized': relevant_lemmatized_emails})
snippets = relevant_emails.apply(
    lambda row: higlighter_for_lemmatized_query(row['unlemmatized'], row['lemmatized'], interpreted),
    axis=1)
snippet_viewer(ranked_emails, snippets)

### Document 138

subject : pg & e texas pipeline kellie , pg & e will probably try to bill u for parking a volume of 10 , 263 for august 8 and 9 at \$ . 03 / mmbtu each day . please do not pay this **charge** when we receive this invoice . pg & e wa unable to make delivery into el paso because of high sulfur content in their gas and is trying to **charge** u with a parking **charge** for gas that they could not deliver . please let me know if you need any additional information . thanks .

### Document 88

subject : dec 2000 prod : panther pipeline demand **charge** please let me know if this is a vaild demand **charge** . the deal is under a gtc contract and daren indicated that he wa not aware of this deal unless it ha something to do with entex . fyi the volume flow at this meter ( # 981598 ) for dec 2000 wa zero . please advise - katherine

### Document 1572

subject : duke energy field 9 / 00 please add the demand **charge** for excess fee for 9 / 00 on sale deal 157278 in the amount of \$ 1 , 175 . 51 . thanks , megan



# Bag-of-words vectorization

For search, a bag-of-words and bag-of-ngrams vectorization can be quite effective. Often this is done at
the sentence level (high ranking) and also at the document level (low ranking). We'll just do sentences.

Create a new dataframe, which has each sentence from the original email dataframe as a separate row, and
has a cross-reference back to the original dataframe's index. NLTK has a `nltk.sent_tokenize()` function
that will be helpful for this.

It's often nice to have a sentence-number-within-email number as well. This is useful in snippet
creation.

In [23]:
sentence_df_prep = []
for idx, row in df.iterrows():
    sentences = nltk.sent_tokenize(row['email_text'])
    for sentence_number, sentence in enumerate(sentences):
        sentence_df_prep.append({'crossref': idx, 'sentence': sentence, 'sentence_number': sentence_number})
sentence_df = pandas.DataFrame.from_records(sentence_df_prep)
sentence_df.sample(5)

Unnamed: 0,crossref,sentence,sentence_number
17064,2047,2 . change the meter # and drn # at facility #...,11
14521,1750,the second will\nimmediately follow .,1
11602,1393,"18 or $ 33 , 022 .",20
11826,1415,"960 & 18 , 218 mmbtu @ $ 4 .",26
1725,215,"once the 6 "" line has been pigged , the produc...",9


In [24]:
sentence_df.shape

(31705, 3)

Calculate tfidf vectors for the sentences in this sentence dataframe. In an ideal world, you 
would want to use BM25 instead of tfidf, but neither Keras nor scikit-learn offers this as 
an option.

Including bigrams helps: when someone searches for a two-word phrase that appears a few times
in the corpus, it will get a huge boost.

Vectorization adaption may be a slow process: expect to wait a minute or two for it to complete.

If you have a GPU enabled, and your GPU doesn't have much memory, you might need to limit the vocabulary
size. (e.g. with `keras.layers.TextVectorization(max_tokens=20000, output_mode='tf_idf', ngrams=2)`).
In real life you wouldn't limit the vocabulary, since often someone wants to search for the document
that has the word "..." in it.

You will need to keep the tokenizer around for your query processing.

In [25]:
import keras

bow_vectorizer = keras.layers.TextVectorization(output_mode='tf_idf',
                                                #ngrams=2,
                                                max_tokens=20000
                                               )

Metal device set to: Apple M1


2023-10-09 00:08:30.497429: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-10-09 00:08:30.497534: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


In [26]:
%%time
bow_vectorizer.adapt(sentence_df.sentence)

2023-10-09 00:08:30.549236: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2023-10-09 00:08:30.613795: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.


CPU times: user 17.9 s, sys: 12.4 s, total: 30.4 s
Wall time: 27.3 s


In [27]:
%%time
sentence_vectors = pandas.DataFrame(
    data=bow_vectorizer(sentence_df.sentence),
    index=sentence_df.index,
    columns=bow_vectorizer.get_vocabulary()
)
sentence_vectors.sample(5)

CPU times: user 296 ms, sys: 1.25 s, total: 1.55 s
Wall time: 1.91 s


Unnamed: 0,[UNK],the,to,ect,for,and,hou,enron,subject,on,...,135958,135895,135842,135708,1350,134987,134755,1344,1343,1342
9496,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
28795,0.0,5.3797,1.396265,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1125,0.0,0.0,1.396265,2.950585,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20232,0.0,0.0,0.0,0.0,1.779899,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
30240,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Your query interpreter will take some text, and pass it through the tokenizer.

In [28]:
def bow_query_interpreter(search_term):
    return bow_vectorizer(search_term)

bow_query_interpreter("charges for gas")

<tf.Tensor: shape=(20000,), dtype=float32, numpy=array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)>

Search and ranking can often be joined into one step: simply ask about the cosine similarity between
the query vector and the sentences in the email.

There are many libraries that include functionality for calculating the cosine similarity.

 - `sklearn.metrics.pairwise.cosine_similarity()`
 - `keras.losses.CosineSimilarity(reduction=tf.keras.losses.Reduction.NONE)`
 
Whatever way you choose, calculate the cosine similarity, find the sentences that match
most closely, and return the references to the top email documents (and which sentence
was the hit).

In [37]:
import keras.losses
import tensorflow as tf

def bow_query_search_and_ranking_function(sentence_vectors, query_vector):
    cosine_loss = keras.losses.CosineSimilarity(reduction=tf.keras.losses.Reduction.NONE)
    similarity = cosine_loss(query_vector, sentence_vectors)
    sentence_indexes = pandas.DataFrame({'crossref': sentence_df.crossref, 
                                         'sentence_number': sentence_df.sentence_number,
                                         'sentence': sentence_df.sentence,
                                         'similarity': similarity})
    sentence_indexes = sentence_indexes[sentence_indexes.similarity.abs() > 0.0001]
    best_documents = sentence_indexes.groupby('crossref').similarity.min().nsmallest(3)
    best_sentences = sentence_indexes[sentence_indexes.crossref.isin(list(best_documents.index))
                                     ].set_index('sentence_number').groupby('crossref').similarity.idxmin()
    return pandas.DataFrame({'document_scores': best_documents, 
                             'best_sentence_number': best_sentences}).sort_values('document_scores')

bow_query_search_and_ranking_function(sentence_vectors, 
                                     bow_query_interpreter("apache agreement"))

Unnamed: 0_level_0,document_scores,best_sentence_number
crossref,Unnamed: 1_level_1,Unnamed: 2_level_1
15,-0.606621,24
2906,-0.606621,21
216,-0.422475,8


We could get smart and display the words that were matched, but since we're operating at the sentence
level, we'll just make the right sentence bold. Create a highlight function that takes
an email and a sentence number and returns Markdown that highlights that sentence in the email.

In [38]:
def highlight_bow(email, highlight_sentence_number):
    def boldify(this_sentence_number, sentence):
        if this_sentence_number == highlight_sentence_number:
            return f"**{sentence}**"
        else:
            return sentence.replace('*', '\*').replace('$', '\$').replace('_', '\_')
    return " ".join([boldify(i,s) for i,s in enumerate(nltk.sent_tokenize(email))])

Let's pull it all together into one function:

- Take a search term, bag-of-words vectorize it

- Simultaneously search-and-rank that vectorized search term against the vectorized sentences

- Select the relevant documents from the results

- Make a Markdown snippet for them

- Display the resulting Markdown

In [39]:
def bow_search(search_term):
    query_vector = bow_query_interpreter(search_term)
    results = bow_query_search_and_ranking_function(sentence_vectors, query_vector)
    markdown_output = ""
    for idx,row in results.iterrows():
        email = df.loc[idx].email_text
        markdown_output += f"### Document {idx}\n\n"
        markdown_output += highlight_bow(email, row['best_sentence_number'])
        markdown_output += '\n\n'
    return IPython.display.Markdown(markdown_output)

Try it out! Some interesting phrases to search for:

- apache agreement

- contract volumes

- elephant

`Elephant` is out of vocabulary, so it matches the documents with the most out-of-vocabulary content.

In [44]:
bow_search("elephant")

### Document 1369

Subject: spinaker / n . padre island block 883 : allocations
per our various discussions , i am sending an email to reiterate the
disposition of volumes at meter 9862 , the lehman spinaker pay meter . at this
time , the general land office is transporting their share of production
( 21 . 8945 % interest ) and hpl is purchasing the remaining 78 . **1055 % .** the glo has nominated a transport volume at meter 9848 , effective september
1 , 2000 . during this time , some test gas flowed . i have had the hpl
purchase back - dated to coincide with the transport nomination , and have had
both deals moved to 9862 . the meter should probably be designated a
callout / swing so that the proper equity percentages can be allocated after
the production month by volume management . for october production forward , i am going to re - rank / confirm the pay meter
to ensure that the glo receives a volume which should be close to their
percentage after actuals close for the month . this should alleviate any
large balance swings on their agreement . anita : for september , i need an accounting arrangement on deal ticket
379424 ; hpl gathering agmt . at meter 9862 . please unallocate the transport
and purchase at 9848 . the meter is a daily swing right now . let me know
when you are ready and i will change the allocation methodology . i apologize for the length of this email , i want to make sure that we are all
on the same page , to the extent that we need to be , prior to this deal
getting too far down the road . please do not hesitate to call me if you have
any questions , comments , concerns , etc . i am at extension 35251 .
thank you all for your time and patience ,
mary

### Document 3171

Subject: fw : " red , white and blue out "
- - - - - original message - - - - -
from : carter , rhonda [ mailto : rcarter @ cooperinst . org ]
sent : friday , september 14 , 2001 12 : 33 pm
to : ' al \_ abbott @ compuserve . com ' ; ' mabner @ sprintmail . com ' ;
' aggiebob @ hotmail . com ' ; ' adamsck @ flash . net ' ; ' gadams @ promus . com ' ;
' pjadell @ yahoo . com ' ; ' bob @ cybersitebuilders . com ' ;
' worml 998 @ hotmail . com ' ; ' janie . beth @ prodigy . net ' ; ' gakin @ mccarthy . com ' ;
' vja @ flash . net ' ; ' locke . alder @ gte . net ' ; ' calexaol @ 7 - 11 . com ' ;
' erika @ publish . no . irs . gov ' ; ' ali @ buz . net ' ; ' brada @ ticnet . com ' ;
' svallen @ aol . com ' ; ' jand 30 @ aol . com ' ; ' allan @ stratsolgroup . com ' ;
' chuck \_ anderson @ oxy . com ' ; ' mdqsga 96 @ aol . com ' ;
' brian \_ anhalt @ bigfoot . com ' ; ' aranda @ nbstx . com ' ; ' aggiemom @ archer . cx ' ;
' jard @ nortelnetworks . com ' ; ' abarch @ airmail . net ' ; ' narguello @ yahoo . com ' ;
' jarmstrong @ tqtx . com ' ; ' mikie @ aggie . zzn . com ' ; ' ag 85 @ home . com ' ;
' kmarnold @ home . com ' ; ' hollya @ cyber - designs . com ' ;
' hughashburn @ netscape . net ' ; ' bob @ cybersitebuilders . com ' ;
' olinatkinson @ dellnet . com ' ; ' papaayres @ aol . com ' ; ' abackof 68 @ aol . com ' ;
' badgett @ ti . com ' ; ' kbailie @ nortel . com ' ; ' wjbaird @ mapsco . com ' ;
' jbaker @ ecomtrading . com ' ; ' tim . banigan @ nortelnetworks . com ' ;
' atbarlow @ mail . smu . edu ' ; ' arnonvic @ aol . com ' ;
' john \_ laurabarr @ email . msn . com ' ; ' jillmbarrow @ hotmail . com ' ;
' b - barton @ ti . com ' ; ' tbates @ why . net ' ; ' normabautista @ worldnet . att . net ' ;
' baweja @ aol . com ' ; ' gbaxley @ nt . com ' ; ' dabayers @ juno . com ' ;
' jbeard @ halff . com ' ; ' bearden . e @ grainger . com ' ; ' tbeaslel @ tuelectric . com ' ;
' kayebeatty @ aol . com ' ; ' triciabeaudreau @ hotmail . com ' ;
' abeckley @ executrain - dal . com ' ; ' scott . r . bellamy @ marshmc . com ' ;
' chiaggie @ aol . com ' ; ' dbenefield @ merit . com ' ; ' bryan @ dalmac . com ' ;
' bennie @ flash . net ' ; ' bergerd @ earthlink . net ' ; ' ted . e . bernard @ ac . com ' ;
' sberry @ nortel . com ' ; ' jody . bingham @ ps . net ' ; ' bobird @ att . com ' ;
' keithbird @ yahoo . com ' ; ' mbish @ nortel . com ' ; ' dawn . bitar @ ps . net ' ;
' bittners @ swbell . net ' ; ' akbjerke @ postoffice . swbell . net ' ;
' michael . blahitka @ intervoice - brite . com ' ; ' blairsl @ juno . com ' ;
' bnlblake @ flash . net ' ; ' gbock 2 @ excite . com ' ; ' bobb 761 @ worldnet . att . net ' ;
' jbond @ genuity . com ' ; ' bonsai 2 @ flash . net ' ; ' bonerhk @ earthlink . net ' ;
' warrenlb @ aol . com ' ; ' dbb @ sa - inc . com ' ; ' lynnbottlinger @ hotmail . com ' ;
' dkboughton @ home . com ' ; ' mbouma @ pgbpike . com ' ; ' bowden \_ rap @ msn . com ' ;
' jfbowen @ swbell . net ' ; ' cbowersl @ airmail . net ' ; ' scott . bowers @ eds . com ' ;
' mbag 92 @ aol . com ' ; ' lorna @ . com ' ;
' sheryl . bradley @ eds . com ' ; ' andybradshaw @ home . com ' ; ' bramlett @ home . com ' ;
' mwbranch @ aol . com ' ; ' tbrandish @ bigfoot . com ' ; ' kbrannon @ flash . net ' ;
' bebe - tx @ mindspring . com ' ; ' devere @ flash . net ' ; ' lgbrennan @ earthlink . net ' ;
' nicole @ dalmac . com ' ; ' tmbreeze @ gte . net ' ; ' gwb 2 @ flash . net ' ;
' john @ smithsummers . com ' ; ' bmbrinkl @ aol . com ' ; ' nateb 7899 @ aol . com ' ;
' melissabrooks @ mindspring . com ' ; ' rhbrooks @ vartec . net ' ;
' bbrooks @ sbair . com ' ; ' dbrosey @ airmail . net ' ; ' bbrown @ micron . com ' ;
' klbo 2 @ cs . com ' ; ' erich . browne @ central . sun . com ' ; ' deniseb @ ticnet . com ' ;
' jbrozovi @ usa . alcatel . com ' ; ' bruckm @ airmail . net ' ;
' bbruton @ scan - direct . com ' ; ' david . a . bryant @ bigfoot . com ' ;
' ccb @ nortelnetworks . com ' ; ' jnkbull @ netzero . com ' ; ' burchta 330 @ aol . com ' ;
' drburdenjr @ aol . com ' ; ' jburnett @ foxsports . net ' ; ' haleburr @ aol . com ' ;
' burrow @ nortel . ca ' ; ' rbl 419 @ aol . com ' ; ' mikebusch @ mail . com ' ;
' cbyrum @ goodmanfamily . com ' ; ' calkfamf @ home . com ' ;
' kcameron @ yahoo - inc . com ' ; ' jsmiley @ pisd . edu ' ; ' jajasoup @ aol . com ' ;
' laurie . canning @ ericsson . com ' ; ' jjcantwell @ worldnet . att . net ' ;
' djcarr @ texas . net ' ; ' richardjcarroll @ yahoo . com ' ; ' drviv @ yahoo . com ' ;
' rob @ startech . org ' ; ' tcarson 98 @ yahoo . com ' ; ' brandacarter @ microlabs . com ' ;
' lee \_ carter @ seha . com ' ; ' dcarter 768 @ aol . com ' ; carter , rhonda ;
' todd . carter @ fnc . fujitsu . com ' ; ' cwc 68 @ swbell . net ' ; ' jcash @ firstam . com ' ;
' tcastellanos @ usa . net ' ; ' wcaudi @ concentric . net ' ;
' cavanaug @ gustafson . com ' ; ' cschamberlin @ mindspring . com ' ;
' ebeth @ airmail . net ' ; ' cookie \_ chambers @ pagenet . com ' ;
' jchamp 5626 @ aol . com ' ; ' sherriel @ flash . net ' ; ' smchamp @ dhc . net ' ;
' svchandl @ garlandisd . net ' ; ' fectac @ aol . com ' ; ' chris . chastain @ ey . com ' ;
' ccchatham @ aol . com ' ; ' kevin . chilcoat @ fritolay . com ' ; ' mattc @ dallas . net ' ;
' jchoc @ msn . com ' ; ' shannon @ thechristianfamily . com ' ;
' christian @ medicine . tamu . edu ' ; ' jcipolla @ hotmail . com ' ; ' dclark @ dhc . net ' ;
' sclark @ dhc . net ' ; ' mclary @ elux . com ' ; ' brad @ bigl 2 sports . com ' ;
' clemmons @ home . com ' ; ' beth 2047 @ aol . com ' ; ' acoble @ cisco . com ' ;
' jjcoburn @ aol . com ' ; ' dbclaw @ hotmail . com ' ; ' matt \_ cole @ yahoo . com ' ;
' cac 75442 @ aol . com ' ; ' jcoll 75442 @ aol . com ' ; ' jorubycol @ aol . com ' ;
' collins 587 @ hotmail . com ' ; ' condoaggie @ aol . com ' ; ' swcbox @ aol . com ' ;
' crcandmac @ aol . com ' ; ' crcook @ gte . net ' ; ' martha \_ cook @ publicis - usa . com ' ;
' mustrdsd @ flash . net ' ; ' acooper 401 @ aol . com ' ; ' jcooper 95 @ yahoo . com ' ;
' karen . m . cope @ dal . frb . org ' ; ' kellyandamy @ sprintmail . com ' ;
' vc 4445 @ earthlink . com ' ; ' brendyc @ aol . com ' ; ' kdcornell @ compuserve . com ' ;
' sc 93 @ hotmail . com ' ; ' bcorrell @ aol . com ' ; ' mcortino @ swbell . net ' ;
' cowan 95 @ aol . com ' ; ' coxl 997 @ yahoo . com ' ; ' monarch @ usa . net ' ;
' jason \_ cox @ hotmail . com ' ; ' lacoyne @ flashcom . net ' ; ' garycl 2345 @ aol . com ' ;
' julesag 95 @ flash . net ' ; ' crawfordsl @ cdm . com ' ; ' phantom 495 @ aol . com ' ;
' ccriswel @ pisd . edu ' ; ' sec @ inetinc . com ' ; ' holly . a . cromack @ ac . com ' ;
' kcudlipp @ arimail . net ' ; ' mrculp @ home . com ' ; ' cathy . cupps @ eds . com ' ;
' curranc @ diamtech . com ' ; ' andyc @ gwmail . plano . gov ' ;
' lindsay \_ daigle @ yahoo . com ' ; ' ag 93 whoop @ hotmail . com ' ;
' dtddtd 444 @ aol . com ' ; ' edaniel @ flash . net ' ; ' tamidarby @ home . com ' ;
' cagladan @ usa . net ' ; ' smitadas @ ix . netcom . com ' ; ' cgwd 94 @ aol . com ' ;
' kay . daugherty 3 @ gte . net ' ; ' bob \_ daughrity @ cabp . com ' ;
' aggiel 984 @ juno . com ' ; ' jdd . rad @ gte . net ' ; ' riverl @ flash . net ' ;
' rogercdavis @ home . com ' ; ' stefaniedavis @ yahoo . com ' ; ' bamadavis @ aol . com ' ;
' dawsonsix @ aol . com ' ; ' heather @ icsi . net ' ; ' stephanie \_ s \_ day @ compusa . com ' ;
' cdelarios @ home . com ' ; ' cdeangulo @ msn . com ' ; ' ldeardurff @ mckinneyisd . net ' ;
' mdeardurff 62 @ msn . com ' ; ' victor . de . hoyos @ fritolay . com ' ;
' mrichmnd @ ix . netcom . com ' ; ' kelly 95 ag @ aol . com ' ; ' ivan 53 @ aol . com ' ;
' bdempsey @ dnaent . com ' ; ' macdeth @ swbell . net ' ; ' allandl @ airmail . net ' ;
' deweesw @ ttc . com ' ; ' tgd @ ffhm . com ' ; ' to \_ ronda @ airmail . net ' ;
' jmditrapani @ nextlink . com ' ; ' ledlugos @ aol . com ' ; ' dlugosch @ home . com ' ;
' melvausa @ netscape . net ' ; ' aol 93775 @ dlemail . itg . ti . com ' ;
' chas 41 @ airmail . net ' ; ' mdorsett @ uni - bell . org ' ;
' m - mdouglas @ worldnet . att . net ' ; ' dovers @ sprintmail . com ' ;
' lorip @ rsn . hp . com ' ; ' michelle . drawert @ gte . net ' ; ' tisdalel @ flash . net ' ;
' sdrotma @ pisd . edu ' ; ' ndsouza @ unt . edu ' ; ' gary . dubois @ pizzahut . com ' ;
' madudko @ aol . com ' ; ' fduewall @ wmcobb . com ' ; ' jduffy @ cisco . com ' ;
' dduffy @ mis - world . com ' ; ' blakey @ flash . net ' ; ' greg \_ dupree @ bigfoot . com ' ;
' michael . duran @ ps . net ' ; ' g - durham @ ti . coin ' ; ' crma @ flash . net ' ;
' travisdye @ home . com ' ; ' earnshaw @ flash . net ' ; ' mechols @ fastlane . net ' ;
' jason . eggl @ indsys . ge . com ' ; ' reicher @ tell . net ' ; ' eiland @ ti . com ' ;
' tome @ gwmail . plano . gov ' ; ' s \_ elliott @ hotmail . com ' ; ' lellis @ ch 2 m . com ' ;
' tedcarles @ earthlink . net ' ; ' stephen . elmendorf @ teradyne . com ' ;
' bembrey @ ccgmail . com ' ; ' rempey @ waymark . net ' ; ' mengels @ airmail . net ' ;
' mentrop @ yahoo . com ' ; ' jepps @ intecom . com ' ; ' donerb @ home . com ' ;
' lucy \_ vsi @ ix . netcom . com ' ; ' mike @ estesfinancial . com ' ; ' setch @ onebox . com ' ;
' kathyeudy @ yahoo . com ' ; ' pevers @ home . com ' ; ' kewing @ airmail . net ' ;
' deon . b . fair @ ac . com ' ; ' lisalynn 98 @ hotmail . com ' ; ' drjuiceplus @ home . com ' ;
' sfaseler @ lg . com ' ; ' rfeldman @ ascend . com ' ; ' j . felkner @ worldnet . att . net ' ;
' jferguso @ mony . com ' ; ' roger . ferguson @ fluor . com ' ; ' jtferrarol @ home . com ' ;
' tfiedler @ flash . net ' ; ' ififfick @ hharchitects . com ' ;
' davidfinley 82 @ yahoo . com ' ; ' duke . fisher @ wcom . com ' ;
' j \_ fishero @ yahoo . com ' ; ' dfitzgerald @ mesquiteisd . org ' ; ' lpfitz @ wt . net ' ;
' 102372 . 2423 @ compuserve . com ' ; ' fleck @ concentric . net ' ;
' jannet @ dallas . net ' ; ' fleitman @ msn . com ' ; ' samf @ dallas . net ' ;
' rjflorio @ worldnet . att . net ' ; ' gulfview @ gateway . net ' ;
' d - forbes @ rtis . ray . com ' ; ' bgfort @ earthlink . net ' ; ' clfoster @ airmail . net ' ;
' r . foster @ prelude . com ' ; ' gfoyt @ hdrinc . com ' ; ' sfrancis @ everdream . com ' ;
' halgodal @ flash . net ' ; ' hedgehogracing @ msn . com ' ;
' steve . french @ aggies . org ' ; ' jfreytag @ airmail . net ' ; ' blakef @ msn . com ' ;
' michael . froman @ octel . com ' ; ' afruhling @ metasolv . com ' ;
' fuentes @ noval . net ' ; ' hiroko @ rsn . hp . com ' ; ' fulkfamily @ home . com ' ;
' ron . fuqua @ usa . alcatel . com ' ; ' debra . galarde @ eds . com ' ;
' txhoss @ ix . netcom . com ' ; ' jared . galloway @ fnc . fujitsu . com ' ;
' aubree . garrett @ fnc . fujitsu . com ' ; ' toniandmikeg @ home . com ' ;
' cwgary @ ont . com ' ; ' 2 ags @ flash . net ' ; ' sgaster @ kpmg . com ' ;
' zgoner @ airmail . net ' ; ' dgedeon @ vectrix . com ' ;
' tara . gedeon @ brannforbes . com ' ; ' jcjones @ rsn . hp . com ' ;
' tageo @ mindspring . com ' ; ' tgeorge @ flash . net ' ; ' teresagill @ email . com ' ;
' rglover @ halff . com ' ; ' dfglynnl @ msn . com ' ; ' mgolaboff @ eqrworld . com ' ;
' judie \_ good @ yahoo . com ' ; ' gorski @ aggies . com ' ; ' algough @ yahoo . com ' ;
' neilgould @ usa . net ' ; ' sallsgraham @ hotmail . com ' ; ' pgranier @ portal . com ' ;
' begrant @ flash . net ' ; ' rgrantham @ worldnet . att . net ' ; ' tgravett @ wans . net ' ;
' chris \_ greer @ hp . com ' ; ' hgreer @ alldata . net ' ; ' chrisg @ micrografx . com ' ;
' donindfw @ ix . netcom . com ' ; ' dan @ productcentre . com ' ;
' jgroce @ lasercomm - inc . com ' ; ' juggernaut @ connect . net ' ;
' katie \_ gruebel @ hotmail . com ' ; ' amynurse @ hotmail . com ' ; ' bag 2 @ airmail . net ' ;
' kenneth \_ guest @ hp . com ' ; ' jgump @ mail . arco . com ' ; ' wylie . gunter @ eds . com ' ;
' tim . gutschlag @ fnc . fujitsu . com ' ; ' guzmans @ home . com ' ;
' cherihaby @ home . com ' ; ' julie \_ halloran @ yahoo . com ' ;
' chaltom @ conedrive . textron . com ' ; ' hamelb 21 @ ont . com ' ;
' talana 99 @ hotmail . com ' ; ' greg . hanks @ hanksbrokerage . com ' ;
' rharbin @ aris . com ' ; ' carrie . l . hardy @ fritolay . com ' ;
' scott \_ harkins @ msn . com ' ; ' jharper @ flash . net ' ;
' jharrington @ sagetelecom . net ' ; ' roynteri @ mail . com ' ;
' steveharrod @ msn . com ' ; ' hartfield @ ti . com ' ;
' terry . k . hartzog @ us . arthurandersen . com ' ; ' kharvey @ pcrrent . com ' ;
' marji . j . harvey @ mail . sprint . com ' ; ' b - haskettl @ ti . com ' ;
' kelly \_ hayes @ harwoodmarketing . com ' ; ' mhaye @ amkor . com ' ;
' alanh @ alliancearch . com ' ; ' mheath @ nextlink . com ' ; ' mheffner @ home . com ' ;
' jets @ ti . com ' ; ' glenn @ hc - cpa . com ' ; ' toddmel @ texoma . net ' ;
' kimberly . henderson @ ey . com ' ; ' dah 85 @ mindspring . com ' ;
' shenley @ flash . net ' ; ' rahennessy @ earthlink . com ' ; ' jherblin @ onramp . net ' ;
' carynlynn @ msn . com ' ; ' nascar @ mikeh . net ' ; ' travis @ herringangus . com ' ;
' anandted @ msn . com ' ; ' dherron @ pisd . edu ' ; ' mitchherzog @ yahoo . com ' ;
' sc \_ hester @ hotmail . com ' ; ' jheye @ psp . com ' ; ' mhickox @ fiskrob . com ' ;
' lori @ efficient . com ' ; ' phinojos @ micro . honeywell . com ' ;
' d . hirt @ dialogic . com ' ; ' danetami @ airmail . net ' ; ' randyhobert @ msn . com ' ;
' blakekimhodge @ yahoo . com ' ; ' will @ . com ' ;
' choldrid @ airmail . net ' ; ' jnh @ ti . com ' ; ' tholman @ gte . net ' ;
' sholmeso 0 @ msn . com ' ; ' jholstea @ jpi . com ' ; ' sholton @ ticnet . com ' ;
' holyoak @ flash . net ' ; ' phorton @ usa . alcatel . com ' ; ' scott @ horton . net ' ;
' thowes @ mail . arco . com ' ; ' chad @ tice . com ' ; ' hugghins @ gte . net ' ;
' jhummel @ memc . com ' ; ' markhunt @ bigfoot . com ' ; ' thehurd @ hex . net ' ;
' b - hutcheson @ ti . com ' ; ' j . r . iacoponelli @ mciworld . com ' ;
' billirish @ hotmail . com ' ; ' czjkjj @ msn . com ' ;
' pat . jackson @ fnc . fujitsu . com ' ; ' sjackson @ opsos . net ' ; ' ararat @ flash . net ' ;
' stevejames @ home . com ' ; ' chellejanow @ hotmail . com ' ;
' asmith \_ scuba @ yahoo . com ' ; ' bjehu @ yahoo . com ' ;
' ashlea \_ jenkins @ hotmail . com ' ; ' cjenson @ flash . net ' ; ' mljideas @ home . com ' ;
' slj @ waymark . net ' ; ' rjolly \_ 1 @ yahoo . com ' ; ' bsjones 50 @ hotmail . com ' ;
' craig - charlottejones @ worldnet . att . net ' ; ' danny @ lanyx . com ' ;
' mattjones @ ccgmail . com ' ; ' wjones @ clearsail . net ' ;
' sjordan 3 @ compuserve . com ' ; ' ryanjust @ hotmail . com ' ;
' chip @ cscfinancial . com ' ; ' lkcbsl @ home . com ' ; ' makall 5 @ flash . net ' ;
' twk @ msg . ti . com ' ; ' mkaplan @ augustmail . com ' ; ' ckarlik @ swbell . net ' ;
' shafia 30 @ hotmail . com ' ; ' mlkawas @ hotmail . com ' ; ' chipk @ nortel . com ' ;
' markkelley \_ wurzburg @ yahoo . com ' ; ' mkelly 2575 @ juno . com ' ;
' danken 8765 @ home . com ' ; ' wolfcamp @ hotmail . com ' ; ' mjkereluk @ msn . com ' ;
' ckerley @ apclink . com ' ; ' lkerr @ evl . net ' ; ' r . kessel @ ssss . com ' ;
' dkessler @ waymark . net ' ; ' mkessner @ hotmail . com ' ; ' troykey @ peoplepc . com ' ;
' akilpatrick @ kurion . com ' ; ' kings 2 @ flash . net ' ; ' kingsr @ home . com ' ;
' jkingston @ ti . com ' ; ' skirchner @ worldnet . att . net ' ;
' chuck @ digitalpilot . com ' ; ' michael . kleppe @ ericsson . com ' ;
' jklouda @ flash . net ' ; ' dennis \_ kniery @ hp . com ' ; ' sschulz @ mail . smu . edu ' ;
' lakohler @ raytheon . com ' ; ' james . kornegay @ eds . com ' ; ' knrkrause @ aol . com ' ;
' kckuddes @ altavista . com ' ; ' sakula @ flash . net ' ; ' skutchin @ leaelliott . com ' ;
' bladdusaw @ ti . **com ' ; ' 103745 .** 342 @ compuserve . com ' ; ' mellake @ yahoo . com ' ;
' paul . lake @ ps . net ' ; ' slakie @ texas . net ' ; ' jplane @ gte . net ' ;
' mlangloys @ aol . com ' ; ' rlanicek @ home . com ' ;
' barrett . lankford @ painewebber . com ' ; ' mlara @ pisd . edu ' ; ' gsl @ msn . com ' ;
' mikepl @ bnr . ca ' ; ' klavergne @ earthling . net ' ;
' winner @ sportsstandings . com ' ; ' mlecrone @ aol . com ' ; ' banglee @ ti . com ' ;
' coyote 97 @ swbell . net ' ; ' robertlee @ poboxes . com ' ;
' j . lemmons @ worldnet . att . net ' ; ' lesliel @ airmail . net ' ; ' lerich @ flash . net ' ;
' rlessmann @ home . com ' ; ' elethe @ gte . net ' ; ' mikelew @ nortelnetworks . com ' ;
' lewisr 691 @ home . com ' ; ' laliefer @ aol . com ' ; ' hkl 5320 @ dcccd . edu ' ;
' gmlz @ msg . ti . com ' ; ' blightsey @ systemdesk . com ' ;
' 74464 . 2612 @ compuserve . com ' ; ' jlind 2402 @ aol . com ' ;
' dlindstrom @ icidallas . com ' ; ' glinebaugh @ prodigy . net ' ;
' eflinhoff @ aol . com ' ; ' the . lisewskys @ prodigy . net ' ;
' mlish @ kennedywilson . com ' ; ' heidident @ aol . com ' ; ' katie 96 ag @ yahoo . com ' ;
' john \_ london @ acs - inc . com ' ; ' ro 219 @ aol . com ' ; ' balott @ aol . com ' ;
' wadel @ swbell . net ' ; ' tglovell @ onramp . net ' ; ' tlovell @ ticnet . com ' ;
' rmlowry 4 @ yahoo . com ' ; ' mploya @ ti . com ' ; ' aggie 97 @ hotmail . com ' ;
' jlugo @ rhaaia . com ' ; ' klukshin @ kpmg . com ' ; ' dluna @ raltron . com ' ;
' ped @ nortel . ca ' ; ' clyons @ metasolv . com ' ; ' paulandkarin @ msn . com ' ;
' rlyttons @ aol . com ' ; ' emaas 94 @ yahoo . com ' ; ' spam . bait @ worldnet . att . net ' ;
' neardal @ airmail . net ' ; ' mmachesney @ aol . com ' ; ' netaces @ airmail . net ' ;
' richard . maddox @ mci . com ' ; ' betty . magee @ homesbybetty . com ' ;
' jmagrude @ jpi . com ' ; ' pxm @ msg . ti . com ' ; ' tracemajor @ hotmail . com ' ;
' mmalakoff @ aol . com ' ; ' judy . j . manning @ fritolay . com ' ;
' norris @ mantooth . com ' ; ' marchand \_ darryl @ msn . com ' ; ' nrm 2000 @ hotmail . com ' ;
' mike . marino @ usoncology . com ' ; ' branonmarsh @ hotmail . com ' ;
' rmartin @ coserv . net ' ; ' dmason @ highpointtravel . com ' ; ' seanab @ gte . net ' ;
' debm 394 @ aol . com ' ; ' mathews - amy @ yahoo . com ' ; ' ags 84 @ aol . com ' ;
' equestlnm @ excite . com ' ; ' mjmattson @ home . com ' ; ' bmatulal @ airmail . net ' ;
' ilvjesus @ flash . net ' ; ' kmay 4001 @ aol . com ' ; ' cmayber @ pisd . edu ' ;
' jasonmayes @ earthlink . net ' ; ' jimbobq 88 @ aol . com ' ; ' mccaff @ anet - dfw . com ' ;
' bmccainl 62 @ aol . com ' ; ' almac @ wans . net ' ; ' mike . mcdonald @ ey . com ' ;
' dhm @ mcdowelllabel . com ' ; ' tmcevoy @ wordware . com ' ;
' trisheeey @ hotmail . com ' ; ' bmcgrego @ metrogroup . com ' ; ' rmckee @ ti . com ' ;
' jim \_ mcmahan @ ctxmort . com ' ; ' bmcmillan @ motion - dynamics . com ' ;
' pmeggs @ aol . com ' ; ' mendezn @ nortelnetworks . com ' ; ' jenabug @ flash . net ' ;
' m \_ mentzer @ hotmail . com ' ; ' sandymergen @ hotmail . com ' ;
' smerrill @ flash . net ' ; ' tiffany \_ merrill @ yahoo . com ' ;
' jmersiovsky @ metasolv . com ' ; ' emetting @ hntb . com ' ;
' sue \_ middleton @ juno . com ' ; ' jbm 326 @ aol . com ' ; ' barbmiller @ qualtx . com ' ;
' michael \_ c . \_ miller @ ac . com ' ; ' miller @ dallas . net ' ;
' hdjemills @ earthlink . net ' ; ' jmills @ dallas . net ' ;
' dminaldi @ contactdallas . com ' ; ' rminney @ entercon . com ' ; ' mlm @ ti . com ' ;
' jenmizar @ yahoo . com ' ; ' moonaggie @ cs . com ' ; ' jason @ aggies . org ' ;
' pipkins @ gateway . net ' ; ' danny . morris @ mciworld . com ' ;
' jcipolla @ hotmail . com ' ; ' cmorse @ waymark . net ' ; ' aggietx @ swbell . net ' ;
' m \_ muecke @ hotmail . com ' ; ' jeff . mundt @ wcom . com ' ; ' amurphy 96 @ hotmail . com ' ;
' jmurphy 4 @ hotmail . com ' ; ' dannym @ churchrealty . com ' ; ' cmyers @ mycon . com ' ;
' greg @ lsil . com ' ; ' jnlzaza @ earthlink . net ' ; ' erinsneedham @ hotmail . com ' ;
' tpneeley @ worldnet . att . net ' ; ' jnelson @ source . com ' ;
' jnerwich @ mindspring . com ' ; ' goonet @ hotmail . com ' ; ' rpnew @ aol . com ' ;
' jeff . newton @ fritolay . com ' ; ' chrisgnichols @ yahoo . com ' ;
' r . niedenfuehr @ worldnet . att . net ' ; ' nielsonc @ sprynet . com ' ;
' rniesen @ ti . com ' ; ' jnobll @ jcpenney . com ' ; ' timcathy @ flash . net ' ;
' merkicpa @ gte . net ' ; ' rnorris @ joefunkconstr . com ' ; ' aggiel @ airmail . net ' ;
' cnorton @ brierley . com ' ; ' janicen @ architeriors . com ' ;
' nnowik @ mhagroup . com ' ; ' toconnor @ varo . com ' ; ' melody . oliver @ eds . com ' ;
' s - oliverl @ ti . com ' ; ' adrienneolsen @ hotmail . com ' ; ' roneal @ ins - inc . com ' ;
' tfonofrio @ aol . com ' ; ' b - orem @ rtis . ray . com ' ; ' jetpilot @ sprintmail . com ' ;
' orr @ caprock . net ' ; ' kwunsch @ ci . garland . tx . us ' ; ' jott @ rsn . hp . com ' ;
' tamc 66 @ aol . com ' ; ' atm 97 @ aol . com ' ; ' powen 94 @ yahoo . com ' ;
' yohanp @ netscape . net ' ; ' palitza @ att . net ' ; ' dpalmer @ pisd . edu ' ;
' cparker @ garlandpower - light . org ' ; ' wanda . parker @ wjpenterprises . com ' ;
' tamu 97 @ airmail . net ' ; ' jpatoskie @ home . com ' ;
' judy . peacock @ worldnet . att . net ' ; ' david . a . pearl @ travelers . com ' ;
' katie @ lifelinehomehealth . com ' ; ' ppedison @ aol . com ' ;
' lpeichel @ nortelnetworks . com ' ; ' mpell @ uswebcks . com ' ; ' dannyp 83 @ gte . net ' ;
' david \_ perry @ 3 com . com ' ; ' picardl 999 @ hotmail . com ' ;
' friscoattorney @ aol . com ' ; ' dphillips @ pfsoutsourcing . com ' ;
' cpierce @ lee - eng . com ' ; ' kurtpifer @ hotmail . com ' ;
' stephen \_ pilcher @ yahoo . com ' ; ' wpindar 3 @ email . msn . com ' ;
' pingenot @ gte . net ' ; ' pinzon @ nortel . com ' ; ' mwpiper @ onramp . net ' ;
' dpitts @ acm . org ' ; ' ppjp @ airmail . net ' ; ' mplumer @ synhrgy . com ' ;
' randy @ pogueinc . com ' ; ' jerrypoin @ home . com ' ;
' tony . pollacia @ fritolay . com ' ; ' tammypon @ hmhs . com ' ;
' kent @ webdelight . net ' ; ' cporter @ nortelnetworks . com ' ;
' sporter @ texas . net ' ; ' porterfields @ prodigy . net ' ; ' texas \_ anm @ yahoo . com ' ;
' poteet @ dmans . com ' ; ' billpowello 4 @ home . com ' ; ' ammy 5 @ aol . com ' ;
' joshp @ thisco . com ' ; ' marykpowl @ syscodallas . com ' ;
' prater 2 @ earthlink . net ' ; ' dprattl @ home . com ' ; ' d - presley @ tamu . edu ' ;
' musicgrl 68 @ aol . com ' ; ' pauld @ homemail . com ' ; ' kpruitt @ gasequipment . com ' ;
' pprzada @ aol . com ' ; ' beckyp @ bmisystems . com ' ; ' impurdy @ 5 pillars . com ' ;
' mrpyatt @ airmail . net ' ; ' jlqjr @ gte . net ' ; ' scradford @ aol . com ' ;
' melissa \_ ragan @ richards . com ' ; ' eric . ragle @ cisco - eagle . com ' ;
' maheswaran \_ rajasekharan @ i 2 . com ' ; ' kikiaggie @ webcombo . net ' ;
' mramsey @ unitedad . com ' ; ' michael . rasmussen @ ps . net ' ; ' j - read @ tamu . edu ' ;
' jlreadpa @ aol . com ' ; ' reasor @ rsn . hp . com ' ; ' reck @ gateway . net ' ;
' cindy . redman @ eds . com ' ; ' dreed @ is . arco . com ' ; ' reedl 00 @ msn . com ' ;
' tdreed @ airmail . net ' ; ' solutionhr @ aol . com ' ; ' jreeves @ agave . com ' ;
' cremmele @ aol . com ' ; ' rrestivo @ eversoft . com ' ; ' erice 8 @ aol . com ' ;
' sanrice @ aol . com ' ; ' ct \_ richard @ hotmail . com ' ; ' mrichard @ arcmail . com ' ;
' krichards @ acsdallas . com ' ; ' paula . g . richmond @ fritolay . com ' ;
' jrickman @ hppclaw . com ' ; ' tlrigby @ home . com ' ; ' kcriggs @ yahoo . com ' ;
' mrightm @ mail . arco . com ' ; ' jriha @ businessobjects . com ' ; ' rrinker @ wtd . net ' ;
' rippees @ swbell . net ' ; ' rippel @ utdallas . edu ' ; ' writchie @ ci . irving . tx . us ' ;
' bradyroberts @ hotmail . com ' ; ' laserbaker @ worldnet . att . net ' ;
' frobert @ aol . com ' ; ' krisaggi @ aol . com ' ; ' ker @ ti . com ' ; ' roco @ nortel . com ' ;
' kjroeker @ airmail . net ' ; ' jimroseo 3 @ home . com ' ;
' suzanne \_ ross @ campbellsoup . com ' ; ' jim \_ rountree @ logiclsales . com ' ;
' eddie . rueffer @ mci . com ' ; ' srupprecht @ chubb . com ' ; ' jennyr @ wtd . net ' ;
' kimed @ hotmail . com ' ; ' jryan @ uswebcks . com ' ; ' emsalazar 25 @ hotmail . com ' ;
' k - salazarl @ ti . com ' ; ' jlsales @ waymark . net ' ;
' msanchez @ mckinneytexas . org ' ; ' steven . sarkissian @ painwebber . com ' ;
' danna @ nortelnetworks . com ' ; ' tsawyers @ aol . com ' ; ' scheumack @ juno . com ' ;
' dschmidt @ camozzi - usa . com ' ; ' pschmidt @ connect . net ' ; ' tammyms @ yahoo . com ' ;
' nathan . schockmel @ usa . alcatel . com ' ; ' kschoenhals @ metasolv . com ' ;
' schuelerjs @ aol . com ' ; ' diana \_ p \_ seal @ email . mobil . com ' ;
' pkemper @ 3 dfx . com ' ; ' sherri . seeger @ wylieisd . net ' ; ' maseeley @ avaya . com ' ;
' tseely @ attglobal . net ' ; ' rshackelford @ home . com ' ;
' shannons @ websurfer . net ' ; ' jtshannon @ ticnet . com ' ;
' loren . sharkey @ brinker . com ' ; ' rehan @ computer . org ' ; ' gryffynn @ aol . com ' ;
' xosloren @ ti . com ' ; ' roger . shellenberger @ exscol . exch . eds . com ' ;
' kshelton @ amfm . com ' ; ' samleannshields @ aol . com ' ; ' sbshin @ evl . net ' ;
' alsikes @ pbsj . com ' ; ' glenn \_ silva @ gmaccm . com ' ;
' frank . silva @ industrialrisk . com ' ; ' simmonds @ marykay . com ' ;
' atmrick @ aol . com ' ; ' isivin @ aol . com ' ; ' rskaggs @ hksinc . com ' ;
' bskalberg @ aol . com ' ; ' todd @ nkn . net ' ; ' dsmart @ dttus . com ' ;
' amy . l . smith @ eds . com ' ; ' egsmith @ home . com ' ;
' john \_ charles \_ smith @ compuserve . com ' ; ' john - h - smith @ raytheon . com ' ;
' ksmith @ kma - rjfs . com ' ; ' shanda @ wans . net ' ;
' michael . smith @ usa . alcatel . com ' ; ' rjsmith @ minutemaid . com ' ;
' rsmith @ metasolv . com ' ; ' dick \_ smith @ pagenet . com ' ;
' agent \_ maroon @ hotmail . com ' ; ' enviropure @ home . com ' ; ' unclewil @ home . com ' ;
' jsmitherman @ cinemark . com ' ; ' jim \_ snow @ millipore . com ' ;
' gpsparks @ hotmail . com ' ; ' tspo 92891 @ aol . com ' ; ' dspencer @ dbssystems . com ' ;
' lspielel @ txu . com ' ; ' txagl 987 @ aol . com ' ; ' g - stanford @ raytheon . com ' ;
' petgeoguru @ hotmail . com ' ; ' jstara @ arcmail . com ' ;
' kgstavin @ garlandisd . net ' ; ' tbstebbins @ aol . com ' ; ' jsteck @ ti . com ' ;
' steffler @ mindspring . com ' ; ' gsteglich @ home . com ' ; ' shane @ computer . org ' ;
' sastephen @ home . com ' ; ' dereks @ us . ibm . com ' ; ' tsteudtner @ aol . com ' ;
' jill . stevens @ risd . org ' ; ' jnelwyn @ aol . com ' ; ' dons @ gwmail . plano . gov ' ;
' mstewart 70 @ aol . com ' ; ' pstewart 86 @ hotmail . com ' ;
' rastewartl 2 @ hotmail . com ' ; ' msticken @ airmail . net ' ;
' cstockmoe @ yahoo . com ' ; ' k - stokes @ tamu . edu ' ; ' michael \_ stone @ nt . com ' ;
' mcstrietzel @ home . com ' ; ' staceys @ omassociates . com ' ;
' sstroth @ glitsch . com ' ; ' h - r . strozewski @ worldnet . att . net ' ;
' astryker @ swbell . net ' ; ' macecs @ hotmail . com ' ; ' smsturgeon @ kpmg . com ' ;
' sullivan 22 @ home . com ' ; ' normas @ airmail . net ' ; ' wswanson @ cyberramp . net ' ;
' rtank 20 @ aol . com ' ; ' matt \_ tanner @ txu . com ' ; ' ftargac @ hotmail . com ' ;
' taylorgr @ nortel . com ' ; ' taylorl @ airmail . net ' ; ' aggierob @ hotmail . com ' ;
' teresa . taylor @ st . com ' ; ' ticaw @ hotmail . com ' ; ' wst @ flash . net ' ;
' caceett @ hotmail . com ' ; ' punt 3442 @ aol . com ' ; ' chris . t @ prodigy . net ' ;
' denise . thatcher @ eds . com ' ; ' mjthed @ earthlink . net ' ;
' brandon . theis @ eds . com ' ; ' steve \_ thelen @ cushwake . com ' ;
' arthur . thomas @ ace - ina . com ' ; ' tthomas @ cooperinst . org ' ;
' dthomps 2 @ pisd . edu ' ; ' nthompson @ swst . com ' ; ' psthompson @ mindspring . com ' ;
' rthompso @ kofax . com ' ; ' tierney \_ thompson @ winston - school . org ' ;
' b - tinker @ ti . com ' ; ' tipp @ airmail . net ' ; ' atokarz @ usa . alcatel . com ' ;
' bevtoney @ aol . com ' ; ' kstowery @ mindspring . com ' ;
' patrick . traubert @ tripointglobal . com ' ; ' heidigigem 96 @ yahoo . com ' ;
' tiffanytrox @ yahoo . com ' ; ' dddtruitt @ juno . com ' ;
' tschetter @ worldnet . att . net ' ; ' 9 mtucker @ home . com ' ;
' oxymomloree @ aol . com ' ; ' cturner @ entest . net ' ; ' aisdal @ aol . com ' ;
' rachturney @ yahoo . com ' ; ' gulteig @ starrunner . net ' ; ' gutay @ airmail . net ' ;
' hutay @ yahoo . com ' ; ' valls @ earthlink . net ' ; ' hvanpelt @ msn . net ' ;
' dwv @ vanderburg . org ' ; ' timv @ cheerful . com ' ; ' r - cvaughn @ juno . com ' ;
' annette . vela @ homecomings . com ' ; ' jvetkoetter @ pipeline . com ' ;
' diane \_ vetter @ hotmail . com ' ; ' vicem @ hdvest . com ' ; ' jennifer @ vilches . org ' ;
' pvilches @ home . com ' ; ' spvill @ flash . net ' ; ' ssv @ attglobal . net ' ;
' kristi . l . vitek @ fritolay . com ' ; ' voltin @ airmail . net ' ; ' wagso 0 @ yahoo . com ' ;
' walessc @ nortelnetworks . com ' ; ' bwalker @ fmtinv . com ' ;
' brian . walker @ exchange - point . com ' ; ' deanwalker @ computer . org ' ;
' ken . walker @ mscsoftware . com ' ; ' shannon . wallace @ usa . net ' ;
' b \_ wallace @ prodigy . net ' ; ' wwallenl @ airmail . net ' ; ' mkwalle @ yahoo . com ' ;
' twaller @ excite . com ' ; ' drwaller @ hotmail . com ' ;
' maxwalters @ worldnet . att . net ' ; ' kwalzel @ cisco . com ' ;
' julie . warden @ mhmr . state . tx . us ' ; ' warner @ ont . com ' ; ' watersco @ flash . net ' ;
' julie . watkins @ eds . com ' ; ' apwo 397 @ juno . com ' ; ' patricia . watsono 2 @ ey . com ' ;
' jwebb @ dalsemi . com ' ; ' jkwebb 41 @ gateway . net ' ; ' debbiew @ mciworld . com ' ;
' robin - w @ juno . com ' ; ' sally \_ welch @ excite . com ' ;
' twelch @ nortelnetworks . com ' ; ' awaller @ csc . com ' ;
' gregwemhoener @ home . com ' ; ' susanwempe @ hotmail . com ' ; ' jwest 78 @ aol . com ' ;
' mike . west @ usa . alcatel . com ' ; ' joannwest @ earthlink . net ' ;
' mweynand @ flash . net ' ; ' weynandken @ johndeere . com ' ; ' mweynand @ flash . net ' ;
' mmw @ airmail . net ' ; ' ewheal @ jcpenney . com ' ; ' swhite @ tx . pathnet . net ' ;
' txag 93 sw @ flash . net ' ; ' chris @ iex . com ' ; ' wiegard @ nortel . com ' ;
' cindyw @ pobox . com ' ; ' jbouldin @ teleteam . com ' ; ' mike @ paragon - tx . com ' ;
' rtwilkinsn @ aol . com ' ; ' jerriw @ arn . net ' ; ' joeaggie 93 @ msn . com ' ;
' kristilw @ swbell . net ' ; ' a 50 @ flash . net ' ; ' rkwbdw 580 @ cs . com ' ;
' kswccw @ swbell . net ' ; ' designpath @ sprynet . com ' ; ' drw 58 ag @ aol . com ' ;
' wilsonaggies @ home . com ' ; ' normabautista @ worldnet . att . net ' ;
' skisheri @ aol . com ' ; ' emajo . wilson @ gte . net ' ; ' jenkonquin @ aol . com ' ;
' wvicw @ aol . com ' ; ' teresa . wood @ st . com ' ; ' lwood 963 @ flash . net ' ;
' tcwoolley @ writeme . com ' ; ' sworsham @ supermovers . com ' ;
' brad \_ worth @ csicontrols . com ' ; ' wrightson . family @ gte . net ' ;
' dwu @ tqtx . com ' ; ' dlwylie @ swbell . net ' ; ' christa . yakel @ sap - ag . de ' ;
' yarbrough \_ james @ hotmail . com ' ; ' suzan @ guyyork . com ' ;
' karen \_ znoj @ merck . com '
subject : " red , white and blue out "
subject : the osu game and aggie spirit
this just in ags ! if you are going to the osu game on sept 22 , a red , white
& blue out is being planned , just like the maroon out games for the osu
game . ags , what better statement can we aggies make , than to celebrate the love
and support for our country ' s freedom , and our patriotic nature , than this
way :
imagine . . . the fightin ' texas aggie band playing " the star spangled banner , "
playing military drills as they walk around the stadium , and we celebrate
our love for our country , and our support for all the heroes , alive and
deceased . color assignments are as follows :
3 rd deck : red
2 nd deck : white lst deck : blue
pass the word on . . . . we have 1 1 / 2 weeks ! ! the spirit of america , and the aggie spirit is still alive . god bless
america , the free nation . pass it on .

### Document 2579

Subject: transport on koch , beginning wednesday
christina ,
the indian springs plant delivery into hpl will be taken out of service for
repiping work effective wednesday morning the 15 th . ena will be moving
texas desk purchases via koch gateway from teco polk co . plant ( 12780 ) to
midcon needville ( 6350 ) . i have confirmed available capacity with koch and
made sure that the it agreement is evergreen and will work for these
purposes . please let me know what i can do to ensure that the nomination process works
as smoothly as possible . i will work with daren and liz to make sure that
the purchase and sales tickets are moved to koch . please let me know who is
responsible for the transport usage ticket on koch , if necessary , i will put
the ticket in this afternoon . we will be responsible for a \$ . 02 commodity
fee , aca of \$ . **0022 and 1 .** 6 % fuel ( monetizes to \$ . 09 approximately ) to move
these volumes . ena will be receiving 84 , 987 dth at the plant and delivering 83 , 627 into
midcon texas . we also have the option to deliver by displacement into midcon
at goodrich or edna , or displace deliveries to koch bayside . please let me know if you have any questions or need anything further from
me . mary
ext . 35251



# Optional 1

If you have access to OpenAI's API, and you can afford the charges (it will be about 15 US-cents, or about $0.25),
try using it for creating the embedding for the query and the emails.

Otherwise, try using another embedding from the MTEB leaderboard.

Your code will be almost the same as for the bag-of-words model.

# Optional 2

Python's standard library includes `difflib` which has a function `difflib.get_close_matches()` 

This lets you do approximate matching of words instead of exact matching. It's an alternative to 
doing lemmatization. Try it out.