# Text Learning via ML

1. Bag Of Words
2. StopWords
3. Stemming
4. NLTK
5. TF-IDF

# Bag of Words

Bag of Words treats every word we encounter in the input as an individual token and records how often it occurs. If a word repeats, the bag still holds a single entry for it, paired with the number of times it appeared.

**Note: "Loves" and "Love" are not treated as the same word. More advanced tooling, studied later, helps cover such cases.**
![ML](images/ml62.png)

**Note: 'Chicago Bulls' is a sports team, but at this stage the system cannot understand that from context and treats the name as two separate words. More advanced tooling, studied later, helps cover such cases.**

![ML](images/ml61.png)
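
A minimal sketch of the idea in plain Python, before any library is involved. The sentence, the `bag_of_words` helper and the letters-only tokenisation are assumptions made for illustration; this is not how sklearn tokenises text.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lower-case the text, split it into alphabetic tokens and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

bag = bag_of_words("Chicago Bulls fan loves Chicago and will always love the Bulls")
for word, count in sorted(bag.items()):
    print("{0} {1}".format(word, count))
```

`chicago` and `bulls` are counted independently of each other, and `loves` and `love` stay separate entries, which matches the two notes above.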


## Let's implement BagOfWords with sklearn

**In sklearn, Bag of Words is implemented as `CountVectorizer`, since all it does is count word frequencies.**

http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction


In [5]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

corpus = [
     'This is the first document.',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?']

# Fit learns the vocabulary: every distinct word in the corpus gets an index
vectorizer.fit(corpus)

# Transform counts how often each vocabulary word occurs in each document
bag_of_words = vectorizer.transform(corpus)

# Alternatively, both steps can be done in one call: vectorizer.fit_transform(corpus)

<4x9 sparse matrix of type '<type 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>
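
To make the two steps less opaque, here is a rough pure-Python imitation of what fitting and transforming do on this corpus. The helper names `tokenize`, `fit` and `transform` are made up for this sketch, and the regular expression (lower-cased words of two or more characters) is an assumption meant to approximate the default `CountVectorizer` settings, not a copy of its implementation.

```python
import re

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?']

def tokenize(doc):
    # Assumed tokenisation: lower-case, words of at least two characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

def fit(docs):
    # "Fit": collect every distinct word and give it an index, alphabetically.
    words = sorted({w for doc in docs for w in tokenize(doc)})
    return {w: i for i, w in enumerate(words)}

def transform(docs, vocabulary):
    # "Transform": one row per document, one count per vocabulary index.
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for w in tokenize(doc):
            row[vocabulary[w]] += 1
        rows.append(row)
    return rows

vocabulary = fit(corpus)
counts = transform(corpus, vocabulary)
print(sorted(vocabulary.items()))
for row in counts:
    print(row)
```

The 4 rows by 9 columns correspond to the `4x9` shape reported above, and only the non-zero counts would be stored in the sparse version.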

In [15]:
print "Example: (0, 1) stands for document 0 (the first document) & word index 1, ..."
print " which is 'document', followed by its occurrence count in that document ..."
print("")
print bag_of_words

Example: (0, 1) stands for document 0 (the first document) & word index 1, ...
 which is 'document', followed by its occurrence count in that document ...

  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (0, 6)	1
  (0, 8)	1
  (1, 1)	1
  (1, 3)	1
  (1, 5)	2
  (1, 6)	1
  (1, 8)	1
  (2, 0)	1
  (2, 4)	1
  (2, 6)	1
  (2, 7)	1
  (3, 1)	1
  (3, 2)	1
  (3, 3)	1
  (3, 6)	1
  (3, 8)	1
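
The `(row, column)  count` listing stores only the non-zero cells. A small sketch of that idea with plain lists: `dense` holds the count rows for this corpus, and `to_triples` is a hypothetical helper, not part of scipy or sklearn.

```python
dense = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 2, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1]]

def to_triples(rows):
    # Keep (document index, word index, count) only where the count is non-zero.
    return [(i, j, c)
            for i, row in enumerate(rows)
            for j, c in enumerate(row) if c]

triples = to_triples(dense)
print(len(triples))
print(triples[:5])
```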


In [8]:
vectorizer.get_feature_names()

[u'and',
 u'document',
 u'first',
 u'is',
 u'one',
 u'second',
 u'the',
 u'third',
 u'this']

In [7]:
vectorizer.vocabulary_

{u'and': 0,
 u'document': 1,
 u'first': 2,
 u'is': 3,
 u'one': 4,
 u'second': 5,
 u'the': 6,
 u'third': 7,
 u'this': 8}

In [33]:
print "WE HAVE A FEATURE LIST (VOCABULARY) OF 9 WORDS AS LISTED ABOVE ..."
print "IT IS SORTED IN ALPHABETICAL ORDER ..."
print "EACH ARRAY VALUE BELOW IS THE COUNT OF THE WORD AT THAT VOCAB INDEX IN THE DOCUMENT ..."
print "HENCE THE ARRAY LENGTH IS 9 ...\n"

print 'This is the first document.'
print bag_of_words.toarray()[0]

print("")

print 'This is the second second document.'
print bag_of_words.toarray()[1]
print("")

print 'And the third one.'
print bag_of_words.toarray()[2]
print("")

print 'Is this the first document?'
print bag_of_words.toarray()[3]
print("")

print bag_of_words.toarray()
print("")

WE HAVE A FEATURE LIST (VOCABULARY) OF 9 WORDS AS LISTED ABOVE ...
IT IS SORTED IN ALPHABETICAL ORDER ...
EACH ARRAY VALUE BELOW IS THE COUNT OF THE WORD AT THAT VOCAB INDEX IN THE DOCUMENT ...
HENCE THE ARRAY LENGTH IS 9 ...

This is the first document.
[0 1 1 1 0 0 1 0 1]

This is the second second document.
[0 1 0 1 0 2 1 0 1]

And the third one.
[1 0 0 0 1 0 1 1 0]

Is this the first document?
[0 1 1 1 0 0 1 0 1]

[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]



In [9]:
vectorizer.vocabulary_.get("third")

7
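
`vocabulary_` maps a word to its column index; going the other way only needs the dictionary inverted. A small sketch with the vocabulary above typed in by hand (`index_to_word` and `present` are names chosen for this example):

```python
vocabulary = {'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4,
              'second': 5, 'the': 6, 'third': 7, 'this': 8}

# Invert word -> index into index -> word.
index_to_word = {index: word for word, index in vocabulary.items()}

row = [0, 1, 0, 1, 0, 2, 1, 0, 1]   # counts for the second document
present = {index_to_word[i]: c for i, c in enumerate(row) if c}
print(sorted(present.items()))
```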

# Stopwords - Low information words which occur often

![ML](images/ml63.png)

# NLTK - Natural Language Toolkit

It is a toolkit for Natural Language Processing.

Let's use it to get the common stop words of the English language.

In [36]:
from nltk.corpus import stopwords

sw = stopwords.words("english")

In [37]:
print "Let's see first 10 stopwords"
sw[:10]

Let's see first 10 stopwords


[u'i',
 u'me',
 u'my',
 u'myself',
 u'we',
 u'our',
 u'ours',
 u'ourselves',
 u'you',
 u'your']

In [38]:
print "We have Total {} Stopwords ".format(len(sw))

We have Total 153 Stopwords 
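
Once a stop-word list is available, dropping those words is a simple filter. A minimal sketch, assuming a tiny hand-written `stop_words` set so that it runs without the NLTK corpus download; with NLTK installed, `set(stopwords.words("english"))` could be used in its place. The sample sentence and the `remove_stop_words` helper are made up for illustration.

```python
stop_words = {"i", "me", "my", "we", "our", "you", "your", "the", "is", "a", "to"}

def remove_stop_words(text, stop_words):
    # Lower-case, split on whitespace and keep only the informative tokens.
    return [w for w in text.lower().split() if w not in stop_words]

kept = remove_stop_words("You will find my notes on the shared drive", stop_words)
print(kept)
```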


# STEMMING 

Some words mean nearly the same thing, for example: response, respond, etc. Counted naively they are separate words, but since they mean the same, why count them multiple times? **Stemming** is a technique that helps solve this.

Stemming is performed by a 'stemmer'. It groups words that share a meaning under a common root, so that all of them can, in theory, be replaced by that one root and counted once in the vocabulary. Example below ..

![ML](images/ml64.png)

Stemming is a complex technique. Professional linguists work very hard to define which words belong in the same group, so ideally we do not do this ourselves but reuse the predefined rules from common public sources such as NLTK.

There are many stemmers available; we will try one of them, called 'SnowballStemmer'.

In [39]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

stemmer.stem("response")

u'respons'

In [40]:
stemmer.stem("responsive")

u'respons'

In [41]:
stemmer.stem("respond")

u'respond'

**The outputs above show that "response" and "responsive" share the stem `respons`, while "respond" keeps its own stem `respond`.**

In [42]:
stemmer.stem("unresponsive")

u'unrespons'

**Ideally "unresponsive" would share a stem with "responsive", but this particular stemmer puts it under a different one. Depending on what our needs demand, we may have to do some fine-tuning afterwards, or we may want to keep it as-is.**
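
The effect on a vocabulary can be sketched by grouping words under their stems. The `stems` dictionary below simply hard-codes the SnowballStemmer outputs recorded above, so the example runs without NLTK; `group_by_stem` is a hypothetical helper.

```python
from collections import defaultdict

# Stems copied from the SnowballStemmer outputs shown above.
stems = {
    "response": "respons",
    "responsive": "respons",
    "respond": "respond",
    "unresponsive": "unrespons",
}

def group_by_stem(words, stem_of):
    # Collect every word under the stem it maps to.
    groups = defaultdict(list)
    for word in words:
        groups[stem_of[word]].append(word)
    return dict(groups)

groups = group_by_stem(["response", "responsive", "respond", "unresponsive"], stems)
for stem, members in sorted(groups.items()):
    print("{0}: {1}".format(stem, members))
```

Four words collapse into three vocabulary entries here; how far the collapsing goes depends entirely on the stemmer's rules.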

## Order of Operation Thus Far

![ML](images/ml65.png)

This makes intuitive sense too: if we stemmed only after building the bag of words, the vocabulary would already contain the duplicate entries, so stemming would add little value in removing that noise.

Here's an example that might help if this is all a little abstract:

Suppose that the text in question is "responsibility is responsive to responsible people" (ok, this doesn't make sense as a sentence, but you know what I mean...)

If you put it into bag of words straightaway, you get something like

    [is: 1
    people: 1
    responsibility: 1
    responsive: 1
    responsible: 1
    to: 1]

and then applying stemming gives you

    [is: 1
    people: 1
    respon: 1
    respon: 1
    respon: 1
    to: 1]

(if you can even find a way to stem the count vectorizer object in sklearn; the most likely outcome of trying would just be that your code would crash...)

Then you would need another post-processing step to get to the following bag of words, which is what you'd get straightaway if you stemmed first:

    [is: 1
    people: 1
    respon: 3
    to: 1]

The merged version is clearly the one you want, so stemming first gets you the right answer here.
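
The same comparison in code. The `toy_stem` function is a stand-in that maps the three related words to the illustrative stem `respon` used above; it is not a real stemmer.

```python
from collections import Counter

def toy_stem(word):
    # Illustrative only: collapse any word starting with "respon" to "respon".
    return "respon" if word.startswith("respon") else word

text = "responsibility is responsive to responsible people"
tokens = text.split()

# Bag of words first: three separate entries for the related words.
counted_first = Counter(tokens)

# Stemming first: the related words are merged before counting.
stemmed_first = Counter(toy_stem(w) for w in tokens)

print(sorted(counted_first.items()))
print(sorted(stemmed_first.items()))
```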

# TF-IDF 

![ML](images/ml66.png)

We will get hands-on with this a little later, as part of the project below ..
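
Until then, a rough sketch of the idea in plain Python, assuming the textbook form tf-idf = tf × ln(N / df), where tf is the count of a word in one document, N the number of documents and df the number of documents containing the word. sklearn's `TfidfVectorizer` uses a smoothed and normalised variant, so its numbers will differ. The `docs` list and the `tf_idf` helper are assumptions for this example.

```python
import math
from collections import Counter

docs = [
    "this is the first document",
    "this is the second second document",
    "and the third one",
    "is this the first document",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    # Term frequency in this document times the log-scaled rarity of the word.
    tf = Counter(tokens)
    return {w: tf[w] * math.log(float(n_docs) / df[w]) for w in tf}

scores = tf_idf(tokenized[1])
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print("{0:10} {1:.3f}".format(word, score))
```

A word such as `the`, which occurs in every document, gets a weight of zero, while `second`, which is frequent in one document and absent from the others, scores highest.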

## MINI-PROJECT

In the beginning of this class, you identified emails by their authors using a number of supervised classification algorithms. In those projects, we handled the preprocessing for you, transforming the input emails into a TfIdf so they could be fed into the algorithms. Now you will construct your own version of that preprocessing step, so that you are going directly from raw data to processed features.

You will be given two text files: one contains the locations of all the emails from Sara, the other has emails from Chris. You will also have access to the parseOutText() function, which accepts an opened email as an argument and returns a string containing all the (stemmed) words in the email.

parseOutText() takes the opened email and returns only the text part, stripping away any metadata that may occur at the beginning of the email, so what's left is the text of the message.

In [92]:
#!/usr/bin/python

from nltk.stem.snowball import SnowballStemmer
import string

def parseOutText(f):
    """ given an opened email file f, parse out all text below the
        metadata block at the top
        (in Part 2, you will also add stemming capabilities)
        and return a string that contains all the words
        in the email (space-separated) 
        
        example use case:
        f = open("email_file_name.txt", "r")
        text = parseOutText(f)
        
        """
    

    f.seek(0)  ### go back to beginning of file (annoying)
    all_text = f.read()

    ### split off metadata
    content = all_text.split("X-FileName:")
    words = ""
    
    #Instantiate Stemmer
    stemmer = SnowballStemmer("english")
    
    if len(content) > 1:
        ### remove punctuation
        text_string = content[1].translate(string.maketrans("", ""), string.punctuation)

        ### project part 2: comment out the line below
        #words = text_string

        ### split the text string into individual words, stem each word,
        ### and append the stemmed word to words (make sure there's a single
        ### space between each stemmed word)
        
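        ### newlines become spaces so the text can be split on single spaces
        to_stem = text_string.replace('\n', ' ')

        ### note: consecutive spaces yield empty tokens here, so the result
        ### can contain runs of spaces (as in the sample output below)
        for word in to_stem.split(" "):
            words += " " + stemmer.stem(word)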
        
    return words
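
The two-argument `translate(...)` call used above is Python 2 style. On Python 3 the same header splitting and punctuation stripping could be sketched as follows; `strip_header_and_punctuation` is a hypothetical name, the sample message is made up, and stemming is left out to keep the example dependency-free.

```python
import string

def strip_header_and_punctuation(all_text):
    # Everything after the "X-FileName:" marker is treated as the message body.
    parts = all_text.split("X-FileName:")
    if len(parts) < 2:
        return ""
    # Python 3: build a table that deletes every punctuation character.
    table = str.maketrans("", "", string.punctuation)
    body = parts[1].translate(table)
    # split() with no argument collapses newlines and repeated spaces.
    return " ".join(body.split())

sample = "To: someone@example.com\nX-FileName:\n\nHi Everyone!  Please proceed."
cleaned = strip_header_and_punctuation(sample)
print(cleaned)
```

Joining on the result of `split()` is a design choice of this sketch: it guarantees a single space between words, whereas splitting on `" "` keeps the empty tokens.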



Quick warm-up to see how it works ..

Original content from the email file:

******
To: Katie_and_Sebastians_Excellent_Students@udacity.com
<br>From: katie@udacity.com
<br>X-FileName:

Hi Everyone!  If you can read this message, you're properly using parseOutText.  Please proceed to the next part of the project!

*******

In [93]:
def main():
    ff = open("../ud120-projects/text_learning/test_email.txt", "r")
    text = parseOutText(ff)
    print text

if __name__ == '__main__':
    main()

   hi everyon  if you can read this messag your proper use parseouttext  pleas proceed to the next part of the project 


### So our parser works fine on the test data; let's delve into the real emails & parse them out ..

In [94]:
#!/usr/bin/python

import os
import pickle
import re
import sys

sys.path.append( "../ud120-projects/tools/" )
#from parse_out_email_text import parseOutText

"""
    Starter code to process the emails from Sara and Chris to extract
    the features and get the documents ready for classification.

    The list of all the emails from Sara are in the from_sara list
    likewise for emails from Chris (from_chris)

    The actual documents are in the Enron email dataset, which
    you downloaded/unpacked in Part 0 of the first mini-project. If you have
    not obtained the Enron email corpus, run startup.py in the tools folder.

    The data is stored in lists and packed away in pickle files at the end.
"""

from_sara  = open("../ud120-projects/text_learning/from_sara.txt", "r")
from_chris = open("../ud120-projects/text_learning/from_chris.txt", "r")

from_data = []
word_data = []

### temp_counter is a way to speed up the development--there are
### thousands of emails from Sara and Chris, so running over all of them
### can take a long time
### temp_counter helps you only look at the first 200 emails in the list so you
### can iterate your modifications quicker
temp_counter = 0


for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
        temp_counter += 1
        # remove counter once code is tested ..
        # if temp_counter < 200:
        path = os.path.join('../ud120-projects/', path[:-1])
        #print path
        email = open(path, "r")

        ### use parseOutText to extract the text from the opened email
        text = parseOutText(email)
        #print "Text -- : ", text 

        ### use str.replace() to remove any instances of the words
        ### ["sara", "shackleton", "chris", "germani"]
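        ### str.replace() returns a new string, so the result must be assigned
        ### back; without the assignment the names stay in the text (the
        ### recorded scores further down, topped by 'sara', come from such a run)
        for rm_word in ["sara", "shackleton", "chris", "germani"]:
            text = text.replace(rm_word, '')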

        ### append the text to word_data
        word_data.append(text)

        ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
        if name=="sara":
            from_data.append(0)
        else:
            from_data.append(1)

        email.close()

print "emails processed"
from_sara.close()
from_chris.close()

pickle.dump( word_data, open("your_word_data.pkl", "w") )
pickle.dump( from_data, open("your_email_authors.pkl", "w") )


### in Part 4, do TfIdf vectorization here
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')

tfidf_result = vectorizer.fit_transform(word_data)

emails processed
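
One caveat if this cell is run on Python 3: pickle files have to be opened in binary mode (`"wb"` / `"rb"`). A small sketch of a round trip, using a made-up stand-in for `word_data` and a temporary directory instead of the real output location:

```python
import os
import pickle
import tempfile

word_data = ["hi everyon", "pleas proceed"]   # stand-in for the parsed emails

out_dir = tempfile.mkdtemp()
word_path = os.path.join(out_dir, "your_word_data.pkl")

# "with" closes the file even if dump() raises.
with open(word_path, "wb") as handle:
    pickle.dump(word_data, handle)

with open(word_path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == word_data)
```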


In [95]:
print word_data[152]

  tjonesnsf  stephani and sam need nymex calendar


In [107]:
vectorizer.get_feature_names()[20000:20010]

[u'forsman',
 u'forster',
 u'forstercorpenronenron',
 u'forsyth',
 u'forsythenronenronxg',
 u'forsythhouect',
 u'forsythhouectect',
 u'fort',
 u'fortemconedcom',
 u'fortemconedcomenron']

In [97]:
len(vectorizer.get_feature_names())

39605

In [113]:
import numpy as np

def display_scores(vectorizer, tfidf_result, top_n=2):
    # http://stackoverflow.com/questions/16078015/
    # Sum each term's TF-IDF weight over all documents
    scores = zip(vectorizer.get_feature_names(),
                 np.asarray(tfidf_result.sum(axis=0)).ravel())
    sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)

    # Print only the top_n highest-scoring terms instead of the whole vocabulary
    for item in sorted_scores[:top_n]:
        print "{0:50} Score: {1}".format(item[0], item[1])
        
        
display_scores(vectorizer, tfidf_result)

sara                                               Score: 703.56324937
cgermannsf                                         Score: 523.276976525
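
What `display_scores` computes can be sketched without numpy: sum each column of the TF-IDF matrix over all documents, then sort the terms by that total. The `terms` and `matrix` values below are invented for illustration.

```python
terms = ["calendar", "nymex", "sara", "the"]
matrix = [            # rows = documents, columns = terms
    [0.0, 0.6, 0.8, 0.0],
    [0.5, 0.0, 0.7, 0.1],
    [0.0, 0.3, 0.9, 0.0],
]

# zip(*matrix) walks the matrix column by column.
totals = [sum(column) for column in zip(*matrix)]
ranked = sorted(zip(terms, totals), key=lambda pair: pair[1], reverse=True)

for term, total in ranked[:2]:
    print("{0:10} Score: {1:.2f}".format(term, total))
```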


In [100]:
vectorizer.get_feature_names()[34598]

u'skeena'