# Analyzing text


#### AN IMPORTANT ANNOUNCEMENT
Part 3 of the mini-project (downloading the dataset), which is further down in this notebook, takes a long time. You may want to do that step when you are taking a break from the rest of the notebook or, alternatively, you can download the file outside of this Python notebook.

<img src="http://zacharski.org/files/courses/cs419/tiles.jpg" width="500"/>

So far we have been dealing with **structured data**. Structured data is ... well ... structured. This means that an instance of our data has nice attributes that can be represented in a DataFrame or a table:

make | mpg | cylinders | HP | 0-60 |
---- | :---: | :---: | :---: | :---: |
Fiat | 38 | 4 | 157 | 6.9
Ford F150 | 19 | 6 | 386 | 6.3
Mazda 3 | 37 | 4 | 155 | 7.5
Ford Escape | 27 | 4 | 245 | 7.1
Kia Soul | 31 | 4 | 164 | 8.5

The majority of data in the world is **unstructured**. Take text for example. Suppose I have a corpus of twitter posts from President Trump and the Dalai Lama and my goal is to create a classifier that takes a tweet and tells me if it was produced by Trump or the Dalai Lama:

*The purpose of education is to build a happier society, we need a more holistic approach that promotes the practice of love and compassion.*

*How low has President Obama gone to tapp my phones during the very sacred election*

We might consider the columns of a table to be things like *first word of the tweet*, *second word of the tweet* and so on:


id | word 1 | word 2 | word 3 | word 4 | word 5 | word 6 | ... |
---- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
1 | The | purpose | of | education | is | to | ...
2 | How | low | has | President | Obama | gone | ...

So we would be counting how many times the word *President* occurred as the fourth word of a tweet. **But that would be the wrong way to go**. 

A more common way to represent text is to treat the text as an unordered set of words, which is called the **bag of words** approach. 

## Bag of words
<img src="http://zacharski.org/files/courses/cs419/BagofWords.jpg" width="350"/>

With the bag of words approach we count word occurrences and the features (what we might think of as columns) are the words. For example, we take a bunch of Trump tweets and count word occurrences and do the same with the Dalai Lama tweets and we might get something like:

id | a | the | compassion | love | sad | fake | ... |
---- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
Trump | 42 | 27 | 1 | 5 | 311 | 227 | ...
Dalai Lama | 72 | 103 | 176 | 159 | 5 | 1 | ...

So, for example, Trump has used the word *compassion* once in all his tweets but the Dalai Lama used it 176 times (this data is made-up).

This 'bag of words' allows us to use the classification methods we have been using. 
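
To make the idea concrete, here is a minimal sketch that uses only the standard library. The helper name `bag_of_words` and the simple regular expression used to pick out words are assumptions made for illustration; they are not part of sklearn.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase word occurs, ignoring word order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

tweet = ("The purpose of education is to build a happier society, we need a more "
         "holistic approach that promotes the practice of love and compassion.")
bag = bag_of_words(tweet)
print(bag["of"], bag["the"], bag["compassion"])   # 2 2 1

# Word order is thrown away: both phrases give the same bag.
print(bag_of_words("love and compassion") == bag_of_words("compassion and love"))  # True
```

Each distinct word becomes a feature and its count becomes the value of that feature.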

Converting **unstructured** text to something **structured** is a multistep process. Let's learn the bits before putting it together. And we will start with the last step first-- creating the bag of words.

## We are going to need a vectorizer.
We don't know what that is yet, but let's import the library:

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

# now create a vectorizer

In [2]:
vectorizer = CountVectorizer()

Now we will create some text:

In [3]:
trump1 = "How low has President Obama gone to tapp my phones during the very sacred election process. This is Nixon/Watergate. Obama bad (or sick) guy! Sad"
trump2 = "Our wonderful new Healthcare Bill is now out for review and negotiation. ObamaCare is a complete and total disaster - is imploding fast! Sad"
trump3 = "Don't let the FAKE NEWS tell you that there is big infighting in the Trump Admin. We are getting along great, and getting major things done!"
trump4 = "Russia talk is FAKE NEWS put out by the Dems, and played up by the media, in order to mask the big election defeat and the illegal leaks! Sad"
dalaiLama1 = "The purpose of education is to build a happier society, we need a more holistic approach that promotes the practice of love and compassion."
dalaiLama2 = "Be a kind and compassionate person. This is the inner beauty that is a key factor to making a better world."
dalaiLama3 = "If our goal is a happier, more peaceful world in the future, only education will bring change."
dalaiLama4 = "Love and compassion are important, because they strengthen us. This is a source of hope"
tinyCorpus = [trump1, trump2, trump3, trump4, dalaiLama1, dalaiLama2, dalaiLama3, dalaiLama4]
tinyCorpus

['How low has President Obama gone to tapp my phones during the very sacred election process. This is Nixon/Watergate. Obama bad (or sick) guy! Sad',
 'Our wonderful new Healthcare Bill is now out for review and negotiation. ObamaCare is a complete and total disaster - is imploding fast! Sad',
 "Don't let the FAKE NEWS tell you that there is big infighting in the Trump Admin. We are getting along great, and getting major things done!",
 'Russia talk is FAKE NEWS put out by the Dems, and played up by the media, in order to mask the big election defeat and the illegal leaks! Sad',
 'The purpose of education is to build a happier society, we need a more holistic approach that promotes the practice of love and compassion.',
 'Be a kind and compassionate person. This is the inner beauty that is a key factor to making a better world.',
 'If our goal is a happier, more peaceful world in the future, only education will bring change.',
 'Love and compassion are important, because they strengthe

## fit the corpus
We have used `fit` before so this looks similar to what we have done but the results are a bit different.

In [4]:
vectorizer.fit(tinyCorpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

When we use the fit method, the vectorizer goes through the corpus, collects all the unique words and gives them an index number.
We can find out the index of a word by using the `.vocabulary_.get` method (`vocabulary_` is an ordinary Python dictionary):


In [5]:
print(vectorizer.vocabulary_.get('love'))
print(vectorizer.vocabulary_.get('obama'))

55
69


So the word *love* has index 55 and *Obama* has index 69.
Now let's create the bag of words:

In [6]:
text = vectorizer.transform(tinyCorpus)
print(text)


  (0, 5)	1
  (0, 24)	1
  (0, 26)	1
  (0, 34)	1
  (0, 36)	1
  (0, 38)	1
  (0, 42)	1
  (0, 50)	1
  (0, 56)	1
  (0, 62)	1
  (0, 67)	1
  (0, 69)	2
  (0, 73)	1
  (0, 79)	1
  (0, 82)	1
  (0, 83)	1
  (0, 89)	1
  (0, 90)	1
  (0, 91)	1
  (0, 96)	1
  (0, 99)	1
  (0, 103)	1
  (0, 104)	1
  (0, 109)	1
  (0, 110)	1
  :	:
  (6, 37)	1
  (6, 43)	1
  (6, 47)	1
  (6, 50)	1
  (6, 61)	1
  (6, 72)	1
  (6, 75)	1
  (6, 77)	1
  (6, 99)	1
  (6, 112)	1
  (6, 114)	1
  (7, 2)	1
  (7, 4)	1
  (7, 8)	1
  (7, 16)	1
  (7, 41)	1
  (7, 46)	1
  (7, 50)	1
  (7, 55)	1
  (7, 71)	1
  (7, 93)	1
  (7, 94)	1
  (7, 101)	1
  (7, 103)	1
  (7, 108)	1


That `print` was unnecessary but it does show the format of the text. So

    (0, 69)	2

means that in tweet 0, the word with index 69 (*Obama*) occurred twice.

This `(0, 69)` is a sparse matrix representation. Before talking about that representation, let's review the DataFrame method we have been using. Suppose we have a large email corpus. The columns of our table represent all the unique words in the corpus. We have columns for typical words like *Obama*, *compassion*, and *mess*. We also have a column for the word *elephant* (perhaps Trump mentioned it in exactly one utterance, referring to his son's hunting expedition) and a column for *xylophone*. We might have 50,000 columns.

The rows of our table are the individual email messages. So if an email contained 2 occurrences of *mess*, that row will have a 2 in the *mess* column. But most email messages don't mention *Obama*, *compassion* or *mess*, much less rarer words like *elephant* or *xylophone*, so the vast majority of cells in our table will be zero. Representing this data in a full table is therefore inefficient. With the sparse matrix representation we just list the cells that are not zero. So in the example above `(0, 69)	2` represented a '2' in row 0, column 69.
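
The following is a rough pure-Python imitation of what `fit` and `transform` produce, meant only to show the idea. The function names (`tokenize`, `fit_vocabulary`, `to_sparse`) and the two-sentence toy corpus are assumptions for illustration. The token pattern is the one shown in the `CountVectorizer` output above, and the vocabulary is indexed in alphabetical order.

```python
import re
from collections import Counter

# Same token pattern as in the CountVectorizer output: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(corpus):
    """Map every unique word to an index, in alphabetical order."""
    words = sorted({token for doc in corpus for token in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}

def to_sparse(corpus, vocabulary):
    """Store only the non-zero cells as {(row, column): count}."""
    cells = {}
    for row, doc in enumerate(corpus):
        for word, count in Counter(tokenize(doc)).items():
            if word in vocabulary:
                cells[(row, vocabulary[word])] = count
    return cells

toy_corpus = ["Obama bad, Obama sad", "love and compassion"]
vocab = fit_vocabulary(toy_corpus)
sparse = to_sparse(toy_corpus, vocab)

print(vocab["obama"])       # 4
print(sparse[(0, 4)])       # 2  -> 'obama' occurs twice in document 0
print(len(sparse), "of", len(toy_corpus) * len(vocab), "cells are non-zero")  # 6 of 12
```

Even in this tiny example half of the cells are zero; with tens of thousands of columns the savings become much larger.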

#### Instead of doing fit and transform as separate steps, we can do fit and transform in one step:


In [27]:
features = vectorizer.fit_transform(tinyCorpus)


Let's move on to other pre-processing steps we might need.

## Low information words
For some applications, some words provide less information than others. For example, the word *this* may be informative for some tasks. But for other tasks, like deciding if the text is about pianos or motorcycles, the word is considered uninformative. Other examples of low-information words might be *the, a, this, that, on, of*.

Some data scientists believe that these low-information, high-frequency words constitute noise and they remove them in a pre-processing step. The words we remove are called **stop words**.

For example, if the stop words are *a, and, be, the, will* and we have the sentence

         be a kind and compassionate person
         
we will end up with

              kind    compassionate  person
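
The example above can be sketched in a few lines. The function name `remove_stop_words` is hypothetical, and the stop word set is just the five words from the example.

```python
STOP_WORDS = {"a", "and", "be", "the", "will"}

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Return the words of the sentence that are not stop words."""
    return [word for word in sentence.lower().split() if word not in stop_words]

print(remove_stop_words("Be a kind and compassionate person"))
# ['kind', 'compassionate', 'person']
```

A set is used for the stop words because membership tests on a set are fast, which matters once the list grows to a hundred or more entries.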
              
## Do this exactly once
When you installed Anaconda you did not install the stopword lists. To do so (and **you only need to do this once**) execute the following:

In [8]:
# ONLY DO THIS ONCE
import nltk
nltk.download('all', halt_on_error=False)

Now that we downloaded the lists to our laptops let's take a look at the English stopwords.

In [9]:
from nltk.corpus import stopwords

In [10]:
swords = stopwords.words('english')

In [11]:
len(swords)

179

In [12]:
print(swords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

We are going to use these stopwords shortly, but let's continue the tour of pre-processing steps.

# stemming

You may know that the structure of a sentence is called syntax. So the sentence *The dogs chased the ball* consists of a noun phrase (NP - *the dogs*) followed by a verb phrase (VP - *chased the ball*). The VP consists of a verb (*chased*) followed by an NP (*the ball*), and that NP in turn consists of a determiner *the* followed by a noun *ball*. We get a syntactic structure that looks like:

                                    S
                                  /   \
                                 /     \
                                /       \
                               NP        VP
                             /   \      |   \
                            Det   N     V    \   
                            |     |     |      NP  
                           the   dogs  chased  |  \
                                               |   \
                                              Det   N
                                               |    |
                                              the  ball
                                              
Similarly, words have internal structure. So *dogs* is really `dog+PLURAL` and *chased* is `chase+PAST`. This structure is called morphology and the analysis step is called morphological analysis. For many classification tasks, we don't care whether the person wrote *dogs* or *dog*. Or *chasing*, *chased*, or *chases* instead of *chase*. We might want to count all those variants of *chase* simply as *chase*. So instead of having separate attributes for *chase*, *chasing*, *chased*, and *chases*, we reduce it to 1. 

The absolute best way to do this task is with a morphological analyzer but it turns out that writing a good morphological analyzer is extremely tricky so data scientists use a much simpler solution called **stemming**. There are a number of stemming algorithms available to us. Here is how to use one called the Snowball Stemmer:

In [13]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
stemmer.stem('dogs')

'dog'

In [14]:
print(stemmer.stem('chasing'))
print(stemmer.stem('chased'))
print(stemmer.stem('chases'))

chase
chase
chase


The above look like real words, but this is not always the case with a stemmer:

In [15]:
print(stemmer.stem('everyone'))
print(stemmer.stem('please'))


everyon
pleas
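
To see why stems need not be real words, here is a deliberately crude suffix stripper. It is **not** the Snowball algorithm, just a toy with a handful of assumed rules (the name `toy_stem` and the suffix list are made up for this sketch):

```python
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["dogs", "chasing", "chased", "chases", "chase"]:
    print(word, "->", toy_stem(word))
# dogs -> dog
# chasing -> chas
# chased -> chas
# chases -> chas
# chase -> chase
```

The three inflected forms collapse to the same stem, which is all a classifier needs, even though *chas* is not a word. Note that the toy rules leave *chase* itself untouched, so it does not merge with the others; real stemmers carry many more rules to handle cases like this.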


## QUIZ - sort of

Let's say we want to do all of the following:

* create a bag of words
* stem the words
* remove stop words

What order would we do this in?

Would we create the bag of words and then stem? Or vice versa?




# TF-IDF representation

We could represent a document as a bag of words and their probabilities. For example, in *Tom Sawyer* 4.6% of the words are *the* and 0.95% are *Tom*. But *the* probably occurs in most novels with that frequency. So in some sense *the* is uninteresting. On the other hand *Tom* probably occurs much more frequently in *Tom Sawyer* than it does in *Moby Dick*. One way to discount words that occur evenly throughout our document collection is to use TF-IDF.  

* TF: Term Frequency - each word is weighted up by how often it occurs in the document.
* IDF: Inverse Document Frequency - each word is weighted down by how many documents in the entire corpus contain it.

and the formula is

### $$ tfidf(t, d) = tf(t,d) \times idf(t) $$

where *t* is the term (the word) and *d* is the document.

To explain this I will use some made up data--the word counts of 5 emails:

id | the | sad | compassion |
---- | :---: | :---: | :---: |
1 | 3 | 0 | 1
2 | 3 | 0 | 0
3 | 4 | 0 | 0
4 | 3 | 2 | 0
5 | 3 | 0 | 2


The intuition is this. Even though the word *the* occurs frequently in each email, it is unlikely to help us classify email because it occurs in **every** email. The words *sad* and *compassion* are more interesting as they don't occur uniformly in our collection. 

The TF part of TF-IDF refers to how often the word occurs in the document. So for example, the TF of *the* in document 1 is 3. And IDF is defined as:

### $$ idf(t)=\log\frac{1+n_d}{1+df(d,t)}+ 1 $$

$n_d$ is the total number of documents and $df(d,t)$ is how many documents the term *t* occurred in. 

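So, using a base-2 logarithm to keep the arithmetic simple:

### $$ idf(the)=\log_2\frac{1+5}{1+5}+ 1 = 0 + 1 = 1 $$

### $$ idf(compassion)=\log_2\frac{1+5}{1+2}+ 1 = \log_2{2} + 1 =  2 $$

So, *the* in document 1 has a tf-idf of $3 \times 1 = 3$ and *compassion* has a tf-idf of $1 \times 2 = 2$. Each occurrence of *compassion* now counts twice as much as each occurrence of *the*. (sklearn uses the natural logarithm, so its idf for *compassion* would be $\ln 2 + 1 \approx 1.69$.)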
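
The formula can be checked by hand with a short script. This is a sketch under two assumptions: it uses a base-2 logarithm so the numbers come out round, and it skips the row normalization that sklearn's `TfidfVectorizer` applies by default. The dictionary `counts` holds the made-up word counts of the five emails, and the function names are hypothetical.

```python
import math

# Made-up word counts for the 5 emails (document id -> word -> count).
counts = {
    1: {"the": 3, "sad": 0, "compassion": 1},
    2: {"the": 3, "sad": 0, "compassion": 0},
    3: {"the": 4, "sad": 0, "compassion": 0},
    4: {"the": 3, "sad": 2, "compassion": 0},
    5: {"the": 3, "sad": 0, "compassion": 2},
}

def idf(term, docs):
    """Smoothed inverse document frequency: log((1 + n_d) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = sum(1 for word_counts in docs.values() if word_counts[term] > 0)
    return math.log2((1 + n_docs) / (1 + doc_freq)) + 1   # base 2 is an assumption

def tfidf(term, doc_id, docs):
    """Raw term count in the document times the term's idf."""
    return docs[doc_id][term] * idf(term, docs)

print(idf("the", counts))               # 1.0
print(idf("compassion", counts))        # 2.0
print(tfidf("the", 1, counts))          # 3.0
print(tfidf("compassion", 1, counts))   # 2.0
```

A word that appears in every document gets the minimum idf of 1, while rarer words get larger weights.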

Here is how to do it in sklearn:

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english')
features_train_transformed = vectorizer.fit_transform(tinyCorpus)
features_train = features_train_transformed.toarray()

# removing punctuation
Suppose I have the following important email:

In [18]:
important_email = """
To: Ron Zacharski <ron.zacharski@gmail.com>
From: Susan Williams <desmondwilliams614@yahoo.com>
Reply-To: Susan Williams <deswill0119@yahoo.fr>
Message-ID: <1860373470.1061917.1488479328300@mail.yahoo.com>
Subject: Hello,
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8

Hello,

Greetings. With warm heart I offer my friendship and greetings, and I hope that this mail 
will meets you in good time.

However strange or surprising this contact might seem to you as we have not meet personally 
or had any dealings in the past. I humbly ask that you take due consideration of its 
importance and immense benefit.

My name is Susan Williams from Republic of Sierra-Leone. I have something very important 
that i would like to confide in you please,I have a reasonable amount of money which 
i inherited from my late father (Nine Million Five Hundred thousand United States Dollar}.
US$9.500.000.00.which I want to invest in your country with you and again in a very 
profitable venture.
"""


The first thing we could do is get the body of the email and strip out the header. We can do that as follows:


In [19]:
content = important_email.split("Content-Type: text/plain; charset=UTF-8")
print(content[1])



Hello,

Greetings. With warm heart I offer my friendship and greetings, and I hope that this mail 
will meets you in good time.

However strange or surprising this contact might seem to you as we have not meet personally 
or had any dealings in the past. I humbly ask that you take due consideration of its 
importance and immense benefit.

My name is Susan Williams from Republic of Sierra-Leone. I have something very important 
that i would like to confide in you please,I have a reasonable amount of money which 
i inherited from my late father (Nine Million Five Hundred thousand United States Dollar}.
US$9.500.000.00.which I want to invest in your country with you and again in a very 
profitable venture.
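
Splitting on a fixed header line works for this message but depends on that exact line being present. As an alternative sketch, Python's standard `email` package can separate headers from body. The short `raw_email` string below is a made-up stand-in for the message above.

```python
from email import message_from_string

# A shortened, made-up message: headers, a blank line, then the body.
raw_email = (
    "To: Ron <ron@example.com>\n"
    "Subject: Hello,\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "\n"
    "Greetings. With warm heart I offer my friendship.\n"
)

message = message_from_string(raw_email)
body = message.get_payload()     # for a non-multipart message this is the body text

print(message["Subject"])        # Hello,
print(body)
```

The parser expects the headers to start on the first line, so a string that begins with a newline, like `important_email` above, would likely need `.lstrip()` before parsing.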



## Mindfulness
Okay, and here is my warning. So before you do some machine learning task, THINK. For example, don't blindly, automatically, delete stop words from a text -- think if that is required for your task. Similarly, before you strip out punctuation, decide on whether your task requires it. For example, maybe Spam detection would be improved by examining punctuation!!!! With that in mind, here is how you would remove punctuation.

### removing punctuation

In [20]:
import string
translator = str.maketrans('', '', string.punctuation)
text_string = content[1].translate(translator)
text_string

'\n\nHello\n\nGreetings With warm heart I offer my friendship and greetings and I hope that this mail \nwill meets you in good time\n\nHowever strange or surprising this contact might seem to you as we have not meet personally \nor had any dealings in the past I humbly ask that you take due consideration of its \nimportance and immense benefit\n\nMy name is Susan Williams from Republic of SierraLeone I have something very important \nthat i would like to confide in you pleaseI have a reasonable amount of money which \ni inherited from my late father Nine Million Five Hundred thousand United States Dollar\nUS950000000which I want to invest in your country with you and again in a very \nprofitable venture\n'
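
Notice that deleting punctuation outright fused some words together (*pleaseI*, *SierraLeone*). One possible variation, sketched below, is to replace each punctuation character with a space and then collapse the extra whitespace. The function name is hypothetical, and whether this is better depends on the task: contractions such as *don't* become two tokens.

```python
import string

# Map every punctuation character to a space instead of deleting it.
to_spaces = str.maketrans(string.punctuation, " " * len(string.punctuation))

def punctuation_to_spaces(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return " ".join(text.translate(to_spaces).split())

print(punctuation_to_spaces("confide in you please,I have"))
# confide in you please I have
print(punctuation_to_spaces("Republic of Sierra-Leone."))
# Republic of Sierra Leone
print(punctuation_to_spaces("Don't blindly delete!"))
# Don t blindly delete
```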

<h1 style="color:red">Partner Mini-Project</h1>
This project is from the Udacity course *Introduction to Machine Learning*.

We are going to work through, step-by-step, the pre-processing we must do to prepare a text for a classification task like Naive Bayes.

## Part 1. Remove punctuation
Below, the Udacity team gives us the function `parseOutText()`

`parseOutText()` takes the opened email and returns only the text part, stripping away any metadata that may occur at the beginning of the email, so what's left is the text of the message. 

The first step we need to do is remove the punctuation from the text part of the email.  Please make that change. 


In [21]:
def parseOutText(f):
    """ given an opened email file f, parse out all text below the
        metadata block at the top
        (in Part 2, you will also add stemming capabilities)
        and return a string that contains all the words
        in the email (space-separated) 
        
        example use case:
        f = open("email_file_name.txt", "r")
        text = parseOutText(f)
        
        """
    f.seek(0)  ### go back to beginning of file (annoying)
    all_text = f.read()
    ### split off metadata
    content = all_text.split("X-FileName:")
    words = ""
    if len(content) > 1:
        
        text_string = content[1]
        ### project part 1: remove punctuation
        
        
        
        ### project part 2: comment out the line below
        words = text_string
        

        ### split the text string into individual words, stem each word,
        ### and append the stemmed word to words (make sure there's a single
        ### space between each stemmed word)
        




    return words

def main():
    ff = open("../data/test_email.txt", "r")
    text = parseOutText(ff)
    print (text)



main()




Hi Everyone!  If you can read this message, you're properly using parseOutText.  Please proceed to the next part of the project!



Your function should return

    Hi Everyone  If you can read this message youre properly using parseOutText  Please proceed to the next part of the project
    
## Part 2. Stemming

In `parseOutText()`, comment out the following line: 

`words = text_string` 

Augment `parseOutText()` so that the string it returns has all the words stemmed using a SnowballStemmer (use the nltk package, some examples that I found helpful can be found here: [http://www.nltk.org/howto/stem.html](http://www.nltk.org/howto/stem.html) ). Rerun parse_out_email_text.py, which will use your updated parseOutText() function--what’s your output now?

Hint: you'll need to break the string down into individual words, stem each word, then recombine all the words into one string.  Suppose I have a list of words:



In [6]:
lacksZs = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
lacksZs 

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

and suppose I have a function `addZ` that appends a `z` to a word. I can apply that function to each element of a list and create a string result as follows:


In [8]:
def addZ(word):
    return word + 'z'

tmp = map(addZ, lacksZs)
result = ' '.join(tmp)
result

'thez quickz brownz foxz jumpsz overz thez lazyz dogz'
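
The same pattern can also be written with a generator expression instead of `map`, which some people find easier to read. This sketch redefines the list and the function (as `add_z`, a hypothetical name) so that it runs on its own:

```python
def add_z(word):
    return word + "z"

words = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

# Apply the function to each word and join the results with single spaces.
result = " ".join(add_z(word) for word in words)
print(result)   # thez quickz brownz foxz jumpsz overz thez lazyz dogz
```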

That hint should help a bit when adding the stemming code.
Either edit your code above, or copy and paste it below and make the edit here:

In [3]:
def parseOutText(f):
    """ given an opened email file f, parse out all text below the
        metadata block at the top
        (in Part 2, you will also add stemming capabilities)
        and return a string that contains all the words
        in the email (space-separated) 
        
        example use case:
        f = open("email_file_name.txt", "r")
        text = parseOutText(f)
        
        """
    f.seek(0)  ### go back to beginning of file (annoying)
    all_text = f.read()
    ### split off metadata
    content = all_text.split("X-FileName:")
    words = ""
    if len(content) > 1:
        
        text_string = content[1]
        ### project part 1: remove punctuation
        
        
        
        ### project part 2: comment out the line below
        words = text_string
        

        ### split the text string into individual words, stem each word,
        ### and append the stemmed word to words (make sure there's a single
        ### space between each stemmed word)
        




    return words

def main():
    ff = open("../data/test_email.txt", "r")
    text = parseOutText(ff)
    print (text)



main()



hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project


Your function should return:

    hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project
    
## Part 3 - Downloading 400MB of the Enron Email Dataset
You only need to do this once (so using the 'Run All Cells' command on this notebook might not be a wise decision).

Prior to the Hillary Clinton email dataset, the Enron dataset was the largest publicly available email set in the known universe. According to Wikipedia: "The Enron Corpus is a large database of over 600,000 emails generated by 158 employees of the Enron Corporation and acquired by the Federal Energy Regulatory Commission during its investigation after the company's collapse."

Our goal for this mini-project is to use emails from two people to see if we can build a classifier that can identify the author of an email. 

Because of its size and the resulting length of time it will take, let's divide the code into 2 cells:

#### first download ...

Feel free to change the location of the download.

In [17]:
import urllib.request  # a bare "import urllib" does not reliably load the request submodule
url = "https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz"
urllib.request.urlretrieve(url, filename="../enron_mail_20150507.tgz") 
print ("download complete!")



download complete!
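
Since the download only needs to happen once, one option is to guard it so that re-running the cell does nothing when the file is already there. This is a sketch; the function name `download_once` is hypothetical.

```python
import os
import urllib.request

def download_once(url, filename):
    """Download url to filename unless something already exists at that path.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(filename):
        return False
    urllib.request.urlretrieve(url, filename=filename)
    return True

# Example use (not run here, because it would start the 400MB download):
# download_once("https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz",
#               "../enron_mail_20150507.tgz")
```

The check only looks at whether the file exists, so a partially downloaded file from an interrupted run would have to be deleted by hand.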


#### now uncompress it!

In [19]:
import tarfile
import os
os.chdir("..")
tfile = tarfile.open("enron_mail_20150507.tgz", "r:gz")
tfile.extractall(".")
os.chdir("notebooks")
print ("you're ready to go!")

you're ready to go!


## Part 4 - read the data and stem it
Just to reiterate, this mini-project and the majority of the write-up is from Udacity. They did a great job in putting this together.

In the next code block, you will iterate through all the emails from Chris and from Sara. For each email, do three things:

1. apply parseOutText to the opened email to get the stemmed text string
2. remove signature words (“sara”, “shackleton”, “chris”, “germani”--bonus points if you can figure out why it's "germani" and not "germany")
3. append the updated text string to word_data -- if the email is from Sara, append 0 (zero) to from_data, or append a 1 if Chris wrote the email.

Once this step is complete, you should have two lists: one contains the stemmed text of each email, and the second should contain the labels that encode (via a 0 or 1) who the author of that email is.

Running over all the emails can take a little while (5 minutes or more), so we've added a temp_counter to cut things off after the first 200 emails. Of course, once everything is working, you'd want to run over the full dataset.
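
The signature-removal step can be sketched with `str.replace()`, as the starter code comments suggest. The helper name `remove_signature_words` and the sample string are made up for illustration.

```python
SIGNATURE_WORDS = ["sara", "shackleton", "chris", "germani"]

def remove_signature_words(text, signature_words=SIGNATURE_WORDS):
    """Delete every occurrence of each signature word from the text."""
    for word in signature_words:
        text = text.replace(word, "")
    return text

cleaned = remove_signature_words("pleas call me thank sara shackleton")
print(cleaned.split())   # ['pleas', 'call', 'me', 'thank']

# Caveat: str.replace() works on substrings, not whole words.
print(remove_signature_words("christma"))   # tma
```

The removed words leave extra spaces behind, which a vectorizer will ignore when it splits the text into tokens.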

In [4]:
# CODE FROM UDACITY

import os
import pickle
import re
import sys
import string # raz



"""
    Starter code to process the emails from Sara and Chris to extract
    the features and get the documents ready for classification.

    The list of all the emails from Sara are in the from_sara list
    likewise for emails from Chris (from_chris)

    The actual documents are in the Enron email dataset, which
    you downloaded/unpacked in Part 0 of the first mini-project. If you have
    not obtained the Enron email corpus, run startup.py in the tools folder.

    The data is stored in lists and packed away in pickle files at the end.
"""



from_sara  = open("../data/from_sara.txt", "r")
from_chris = open("../data/from_chris.txt", "r")

from_data = []
word_data = []

### temp_counter is a way to speed up the development--there are
### thousands of emails from Sara and Chris, so running over all of them
### can take a long time
### temp_counter helps you only look at the first 200 emails in the list so you
### can iterate your modifications quicker
temp_counter = 0


for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    print('PROCESSING  ' + name)
    temp_counter = 0
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
        temp_counter += 1
        if temp_counter < 200 or 1 == 1:   # raz
            path = os.path.join('/Users/raz/Documents/machineLearning/', path[:-1])
            #print(path)
            email = open(path, "r")
            

            ### use parseOutText to extract the text from the opened email

            
            ### use str.replace() to remove any instances of the words
            ### ["sara", "shackleton", "chris", "germani"]

            ### append the text to word_data
            
            ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris

            email.close()    

print ("emails processed")
from_sara.close()
from_chris.close()
    
# uncomment this out when you have the above working. 
#with open('../data/word_data.pkl', 'wb') as word_file:
#    pickle.dump( word_data, word_file)
#with open('../data/email_authors.pkl', 'wb') as author_file:
#    pickle.dump( from_data, author_file)

print('Processing Complete')


PROCESSING  sara
PROCESSING  chris
emails processed
  0:  sbaile2 nonprivilegedpst susan pleas sen
  0:  sbaile2 nonprivilegedpst 1 txu energi tr
  0:  sbaile2 nonprivilegedpst all here the se
  0:  sbaile2 nonprivilegedpst   enron wholesa
  0:  sbaile2 nonprivilegedpst origin messag f
  0:  sbaile2 nonprivilegedpst we need to rese
  0:  sbaile2 nonprivilegedpst we cannot locat
  0:  sbaile2 nonprivilegedpst origin messag f
  0:  sbaile2 nonprivilegedpst did domin carol
  0:  sbaile2 nonprivilegedpst weezi pleas ema
  0:  sbaile2 nonprivilegedpst i alreadi spoke
  0:  sbaile2 nonprivilegedpst baltimor gas el
  0:  sbaile2 nonprivilegedpst someon from set
  0:  sbaile2 nonprivilegedpst fyi origin mess
  0:  sbaile2 nonprivilegedpst debbi is send c
  0:  sbaile2 nonprivilegedpst 1 citizen gas u
  0:  sbaile2 nonprivilegedpst susan pleas add
  0:  sbaile2 nonprivilegedpst receiv written 
  0:  sbaile2 nonprivilegedpst 1 receiv a voic
  0:  sbaile2 nonprivilegedpst myra gari 87072
  0:  sb

### Great job!
Just for a check, what do you get for `word_data[152]`? I get

    tjonesnsf stephani and sam need nymex calendar

## Part 5 - TF-IDF




<img src="https://upload.wikimedia.org/wikipedia/commons/3/32/Las_Cruces.jpg" width="700"/>

We are over the crest of the mountain and can see the cityscape of a structured representation!
(okay, just a fancy way of saying we are over the hump)
<img src="http://zacharski.org/files/courses/cs419/baylor.jpg" width="700"/>

#### TF-IDF
Transform the word_data into a tf-idf matrix using the sklearn `TfidfVectorizer`. Remove English stopwords.
You can access the mapping between words and feature numbers using get_feature_names() (renamed get_feature_names_out() in newer versions of scikit-learn), which returns all the words in the vocabulary in index order.

 How many different words are there?  (You should get 38,757 unique words)

38757



What is word number 34597? By that I mean the traditional computer science definition where there is a word number 0.

    stephaniethank

## Part 6: divide the data into a training and test set

When you do the split use the arguments:

    test_size=0.1, random_state=42

How many entries are in the test set?  (I get 1758)

1758

## Part 7: Train a Classifier(SGBoost?) and use it to make predictions


### What is your accuracy?

0.99146757679180886
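
Accuracy here is simply the fraction of test emails whose predicted author matches the true label. A minimal sketch with made-up labels (the function name `accuracy` and the two lists are assumptions for illustration):

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

y_true = [0, 0, 1, 1, 1]   # 0 = Sara, 1 = Chris
y_pred = [0, 1, 1, 1, 1]   # one email from Sara wrongly attributed to Chris
print(accuracy(y_true, y_pred))   # 0.8
```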

# Finished!