In [3]:
import numpy as np 
import pandas as pd
import os

# Pre-Processing
Here, we pre-process data in the email bodies.

### Read in the manipulated, unlabeled data

In [4]:
train = pd.read_csv("../enron/unlabeled/train.csv")
train.head()

Unnamed: 0,Message-ID,Date,From,To,Subject,X-From,X-To,X-cc,X-bcc,X-Folder,X-Origin,X-FileName,content,user
0,<30054600.1075841565738.JavaMail.evans@thyme>,2002-01-31 03:37:11,frozenset({'pete.davis@enron.com'}),frozenset({'pete.davis@enron.com'}),Start Date: 1/30/02; HourAhead hour: 22;,Schedule Crawler<pete.davis@enron.com>@ENRON,"Davis, Pete </O=ENRON/OU=NA/CN=RECIPIENTS/CN=P...","Meyers, Albert </O=ENRON/OU=NA/CN=RECIPIENTS/C...",,"\ExMerge - Solberg, Geir\Deleted Items",SOLBERG-G,geir solberg 6-26-02.PST,\r\r\n\r\r\nStart Date: 1/30/02; HourAhead hou...,solberg-g
1,<33136589.1075852829115.JavaMail.evans@thyme>,2001-10-12 03:36:25,frozenset({'402075.57130981.1@1.americanexpres...,frozenset({'kpresto@enron.com'}),Help the Sept. 11 disaster relief effort,Membership Rewards <membershiprewards+402075.5...,kpresto@enron.com,,,\KPRESTO (Non-Privileged)\Deleted Items,Presto-K,KPRESTO (Non-Privileged).pst,----------------------------------------------...,presto-k
2,<4916370.1075861400181.JavaMail.evans@thyme>,2001-11-05 14:37:15,frozenset({'harry.arora@enron.com'}),frozenset({'iris.mack@enron.com'}),FW: Synthetic Peaker,"Arora, Harry </O=ENRON/OU=NA/CN=RECIPIENTS/CN=...","Mack, Iris </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Im...",,,"\HARORA (Non-Privileged)\Arora, Harry\Sent Items",Arora-H,HARORA (Non-Privileged).pst,\r\r\n-----Original Message-----\r\r\nFrom: B...,arora-h
3,<24276539.1075856274091.JavaMail.evans@thyme>,2000-10-19 20:25:00,frozenset({'enron.announcements@enron.com'}),frozenset({'all_ena_egm_eim@enron.com'}),Ameriflash Newsletter,Enron Announcements,All_ENA_EGM_EIM,,,\Vincent_Kaminski_Jun2001_1\Notes Folders\All ...,Kaminski-V,vkamins.nsf,NOTE FROM MARK FREVERT\r\r\n\r\r\nWith the wid...,kaminski-v
4,<28413439.1075854360990.JavaMail.evans@thyme>,2001-04-05 13:17:00,frozenset({'darron.giron@enron.com'}),frozenset({'phillip.love@enron.com'}),FW: the mullet!:,Darron C Giron,Phillip M Love,,,\Darron_Giron_Jun2001\Notes Folders\All documents,Giron-D,dgiron.nsf,---------------------- Forwarded by Darron C G...,giron-d


## Stop Word Removal, Lemmatization, and Stemming
Stop words are common words, such as "the" or "a", that do not contribute to the overall meaning of our text.

Lemmatization and stemming are special cases of normalization. They use different rulesets to reduce a word to its base form. Lemmatization uses dictionary lookup while stemming attempts to remove the end of a word (such as "ing" or "ies"), sometimes producing segments of words that are not linguistically correct. Lemmatization can also, theoretically, choose a base form from the context of word. For example, "meeting" can either be a noun or a verb and therefore requires context to determine its meaning. 

Our initial approach will use lemmatization over stemming, due to the tendency for stemming to leave word stems rather than valid words.

__Important! To use spaCy, run in terminal/prompt:__

```
pip install spacy
python -m spacy download en
```

In [160]:
import spacy
spacy.prefer_gpu()
nlp = spacy.load('en')

from spacy.lang.en.stop_words import STOP_WORDS

The goal of the clean function is to minimize different forms of words within the text. 

First, the function needs to convert all of the text to lowercase characters. The models we will use later would see "Banana" and "banana" as two different words due to the difference in capitalization. We do not want this kind of case-sensitivity and therefore must remove it by changing all letters to lowercase.

The function must also remove forwarded text within a message. We want to avoid any message showing up in our dataset more than once; if we were to leave in text that was forwarded to multiple people, then that text would have more weight in our models.

We will then lemmatize the text before applying stopword removal.

 First, we want to add some common punctuation to the list of stop words.

In [178]:
# Adding some common punctuation to the spaCy stopwords
for word in [".", ",", "!", "?", "\\r\\n", "\\r\\r\\n", "-PRON-"]:
    spacy.lang.en.stop_words.STOP_WORDS.add(word)
    nlp.vocab[word].is_stop = True

In [179]:
def clean(email_body):
    # This converts all characters in the email to lowercase.
    email_lower = " ".join([i for i in email_body.lower().split()])
    
    # This line removes all forwarded text from an email.
    no_forwards = email_lower.split('>from',1)[0].split('---',1)[0]
    
    # Lemmatization and stopword removal
    # initialize spacy 'en' model
    nlp = spacy.load('en', disable=['parser', 'ner'])
    # Convert the sentence to a spacy document
    document = nlp(email_body)
    # Extract the lemma for each token
    # If the lemma is not a stop word, then add it to the message using join()
    # If the lemma is a stop word, then it is simply left out of the cleaned message
    cleaned_message = " ".join([token.lemma_ for token in document if token.lemma_ not in STOP_WORDS])

    return cleaned_message

Let's test our function. If we pick an email at random, say number 13, this is the original text:

In [180]:
print(train['content'][13])

you're a doll.  thanks. on this one, feel free to control!  even if Cameron 
doesn't go up, prentice and i likely will, which means that we'll be stopping 
by napa (if you guys don't mind).


Now here is what the same email looks like after having been cleaned:

In [181]:
result = clean(train['content'][13])
print(result)

doll   thank feel free control   cameron 
 prentice likely mean stop 
 napa ( guy mind )


Whenever spaCy encounters a word it thinks is a pronoun, it replaces it with -PRON-. Because we added this to the list of stopwords, we will not see this token in our cleaned messages.

### Clean all the messages in our dataset. Then save them locally for faster loading in the future

In [None]:
# TODO