## Today, we're teaching computers how to read

Computers are great at crunching numbers.   But crunching words?  Not so much...   So, today, we're going to send our computer to school and teach it to read.   How? By converting words to numbers.  

In this tutorial I'll cover the most basic parts of 'language processing.' There's a lot to language processing, or text analytics, and this is only a start, but you can do a lot with the things we cover here.

### What's a Corpus?

Let's start with a brief corpus of documents. A corpus is simply a collection of documents.

In [4]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [5]:
docA = "the cat sat on my face"
docB = "the dog sat on my bed"

### Tokenizing
Most of the time when we work on text, we can use the 'Bag Of Words' (BOW) model to represent a document. In the BOW model, each document is treated as a bag of words: we keep track of which words occur, and how often, but ignore the order they appear in.

In [6]:
bowA = docA.split(" ")
bowB = docB.split(" ")

In [41]:
bowB
bowA
a = bowA
b = bowB

['the', 'dog', 'sat', 'on', 'my', 'bed']

['the', 'cat', 'sat', 'on', 'my', 'face']

Splitting a document up into the component words like this is called 'tokenizing.'
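Splitting on a single space works for our tidy example sentences, but real text has capital letters and punctuation. Here is a minimal sketch of a slightly more forgiving tokenizer. The `tokenize` helper and its regular expression are my own illustration, not a standard, and proper tokenizers handle many more cases.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

sentence = "The cat sat on my face."

# Plain split keeps the capital letter and the trailing period.
print(sentence.split(" "))   # ['The', 'cat', 'sat', 'on', 'my', 'face.']

# The regex version normalizes case and drops the punctuation.
print(tokenize(sentence))    # ['the', 'cat', 'sat', 'on', 'my', 'face']
```

For the two documents in this tutorial, both approaches give the same tokens, so I'll stick with `split` below.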

Ok, so the documents are tokenized, but how do we convert a tokenized BOW into numbers?  

There are a few strategies.   One simple strategy could be to create a vector of all possible words, and for each document count how many times each word appears.

In [9]:
#all words in all bags/documents
wordSet = set(bowA).union(set(bowB))
wordSet

{'bed', 'cat', 'dog', 'face', 'my', 'on', 'sat', 'the'}

In [12]:
#I'll create dictionaries to keep my word counts.
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)

In [14]:
#This is what they look like before counting
wordDictA
wordDictB

{'bed': 0, 'cat': 0, 'dog': 0, 'face': 0, 'my': 0, 'on': 0, 'sat': 0, 'the': 0}

{'bed': 0, 'cat': 0, 'dog': 0, 'face': 0, 'my': 0, 'on': 0, 'sat': 0, 'the': 0}

In [16]:
#now I'll count the words in my bags.
for word in bowA:
    wordDictA[word]+=1

for word in bowB:
    wordDictB[word]+=1

In [17]:
wordDictA
wordDictB

{'bed': 0, 'cat': 1, 'dog': 0, 'face': 1, 'my': 1, 'on': 1, 'sat': 1, 'the': 1}

{'bed': 1, 'cat': 0, 'dog': 1, 'face': 0, 'my': 1, 'on': 1, 'sat': 1, 'the': 1}

In [18]:
#Lastly I'll stick those into a matrix.
import pandas as pd
pd.DataFrame([wordDictA, wordDictB])

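|   | bed | cat | dog | face | my | on | sat | the |
|---|-----|-----|-----|------|----|----|-----|-----|
| 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |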


Boom!  We just converted words into a linear algebra problem!  Computers can handle linear algebra, mission accomplished.
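The counting steps above can also be written more compactly. This is just a sketch of the same idea using `collections.Counter` from the standard library. The names `count_vector`, `doc_a`, and `doc_b` are mine, and I sort the vocabulary so the columns always come out in the same order.

```python
from collections import Counter

doc_a = "the cat sat on my face".split(" ")
doc_b = "the dog sat on my bed".split(" ")

# Sorted union of all words, so every vector uses the same column order.
vocabulary = sorted(set(doc_a) | set(doc_b))

def count_vector(tokens, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)  # a Counter returns 0 for words it has not seen
    return [counts[word] for word in vocabulary]

vec_a = count_vector(doc_a, vocabulary)
vec_b = count_vector(doc_b, vocabulary)

print(vocabulary)  # ['bed', 'cat', 'dog', 'face', 'my', 'on', 'sat', 'the']
print(vec_a)       # [0, 1, 0, 1, 1, 1, 1, 1]
print(vec_b)       # [1, 0, 1, 0, 1, 1, 1, 1]
```

Each list is one row of the matrix above.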

### Not So Fast...

Mission almost accomplished. The problem with our counting strategy is that we use a lot of common words that just don't mean much. In fact, the most commonly used word in the English language ('the') makes up about 7% of the words we use, which is roughly double the frequency of the next most popular word ('of'). Word frequencies in natural language follow a power law distribution, an observation known as Zipf's law. [(Wikipedia)](http://en.wikipedia.org/wiki/Zipf%27s_law)

So, if we construct our document matrix out of counts, then we end up with numbers that don't contain much information, unless our goal was to see who uses 'the' most often.  

### TF-IDF - A Better Strategy

Rather than just counting, we can use the [TF-IDF](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) score of a word to rank its importance.

The TF-IDF score of a word, w, in a document is:
$$tf(w) \times idf(w)$$

Where tf(w) = (Number of times the word appears in the document) / (Total number of words in the document)

And where idf(w) = log(Number of documents / Number of documents that contain word w).
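To make the formula concrete, here is the arithmetic for two words in docA, worked by hand. This is only a sketch: the variable names are mine, and I'm assuming the natural log (`math.log`).

```python
import math

n_docs = 2          # our corpus has two documents
words_in_doc_a = 6  # "the cat sat on my face"

# 'cat' appears once in docA and in only one of the two documents.
tf_cat = 1 / words_in_doc_a
idf_cat = math.log(n_docs / 1)
tfidf_cat = tf_cat * idf_cat

# 'the' appears once in docA but in both documents.
tf_the = 1 / words_in_doc_a
idf_the = math.log(n_docs / 2)  # log(1) is 0
tfidf_the = tf_the * idf_the

print(round(tfidf_cat, 4))  # 0.1155
print(tfidf_the)            # 0.0
```

A word that shows up in every document gets an idf of zero, so its TF-IDF score is zero no matter how often it appears.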

In [23]:
def computeTF(wordDict, bow):
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict

In [45]:
wordDictA
wordDictB
tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)
tfBowA
tfBowB

{'bed': 0, 'cat': 1, 'dog': 0, 'face': 1, 'my': 1, 'on': 1, 'sat': 1, 'the': 1}

{'bed': 1, 'cat': 0, 'dog': 1, 'face': 0, 'my': 1, 'on': 1, 'sat': 1, 'the': 1}

{'bed': 0.0,
 'cat': 0.16666666666666666,
 'dog': 0.0,
 'face': 0.16666666666666666,
 'my': 0.16666666666666666,
 'on': 0.16666666666666666,
 'sat': 0.16666666666666666,
 'the': 0.16666666666666666}

{'bed': 0.16666666666666666,
 'cat': 0.0,
 'dog': 0.16666666666666666,
 'face': 0.0,
 'my': 0.16666666666666666,
 'on': 0.16666666666666666,
 'sat': 0.16666666666666666,
 'the': 0.16666666666666666}

In [43]:
def computeIDF(docList):
    import math
    N = len(docList)

    #count the number of documents that contain each word w
    #(assumes every dict in docList has the same keys, i.e. the full vocabulary)
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    print(idfDict)
    for doc in docList:
        print(doc.items())
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1
    print(idfDict)
    #divide N by the document count above, then take the log of that
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))

    return idfDict

In [44]:
[wordDictA, wordDictB]
idfs = computeIDF([wordDictA, wordDictB])
idfs

[{'bed': 0,
  'cat': 1,
  'dog': 0,
  'face': 1,
  'my': 1,
  'on': 1,
  'sat': 1,
  'the': 1},
 {'bed': 1,
  'cat': 0,
  'dog': 1,
  'face': 0,
  'my': 1,
  'on': 1,
  'sat': 1,
  'the': 1}]

{'cat': 0, 'on': 0, 'bed': 0, 'dog': 0, 'face': 0, 'my': 0, 'sat': 0, 'the': 0}
dict_items([('cat', 1), ('on', 1), ('bed', 0), ('dog', 0), ('face', 1), ('my', 1), ('sat', 1), ('the', 1)])
dict_items([('cat', 0), ('on', 1), ('bed', 1), ('dog', 1), ('face', 0), ('my', 1), ('sat', 1), ('the', 1)])
{'cat': 1, 'on': 2, 'bed': 1, 'dog': 1, 'face': 1, 'my': 2, 'sat': 2, 'the': 2}


{'bed': 0.6931471805599453,
 'cat': 0.6931471805599453,
 'dog': 0.6931471805599453,
 'face': 0.6931471805599453,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0,
 'the': 0.0}

In [47]:
def computeTFIDF(tfBow, idfs):
    print (tfBow)
    print (idfs)
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val * idfs[word]
    return tfidf


In [48]:
tfidfBowA =  computeTFIDF(tfBowA, idfs)
tfidfBowA
tfidfBowB = computeTFIDF(tfBowB, idfs)
tfidfBowB

{'cat': 0.16666666666666666, 'on': 0.16666666666666666, 'bed': 0.0, 'dog': 0.0, 'face': 0.16666666666666666, 'my': 0.16666666666666666, 'sat': 0.16666666666666666, 'the': 0.16666666666666666}
{'cat': 0.6931471805599453, 'on': 0.0, 'bed': 0.6931471805599453, 'dog': 0.6931471805599453, 'face': 0.6931471805599453, 'my': 0.0, 'sat': 0.0, 'the': 0.0}


{'bed': 0.0,
 'cat': 0.11552453009332421,
 'dog': 0.0,
 'face': 0.11552453009332421,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0,
 'the': 0.0}

{'cat': 0.0, 'on': 0.16666666666666666, 'bed': 0.16666666666666666, 'dog': 0.16666666666666666, 'face': 0.0, 'my': 0.16666666666666666, 'sat': 0.16666666666666666, 'the': 0.16666666666666666}
{'cat': 0.6931471805599453, 'on': 0.0, 'bed': 0.6931471805599453, 'dog': 0.6931471805599453, 'face': 0.6931471805599453, 'my': 0.0, 'sat': 0.0, 'the': 0.0}


{'bed': 0.11552453009332421,
 'cat': 0.0,
 'dog': 0.11552453009332421,
 'face': 0.0,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0,
 'the': 0.0}

In [88]:
#Lastly I'll stick those into a matrix.
import pandas as pd
pd.DataFrame([tfidfBowA, tfidfBowB])

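|   | bed | cat | dog | face | my | on | sat | the |
|---|-----|-----|-----|------|----|----|-----|-----|
| 0 | 0.0 | 0.115525 | 0.0 | 0.115525 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.115525 | 0.0 | 0.115525 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |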
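To recap, the whole pipeline fits in one small function. This is a rough sketch under the same assumptions as the rest of the tutorial (split on spaces, natural log, no smoothing). The name `tfidf_matrix` is my own, and library implementations often use slightly different formulas, so their numbers may not match these.

```python
import math

def tfidf_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] is the TF-IDF
    score of vocabulary[j] in docs[i]."""
    bags = [doc.split(" ") for doc in docs]
    vocabulary = sorted({word for bag in bags for word in bag})
    n_docs = len(bags)

    # idf: log of (number of documents / documents containing the word)
    idf = {}
    for word in vocabulary:
        doc_count = sum(1 for bag in bags if word in bag)
        idf[word] = math.log(n_docs / doc_count)

    # tf: share of the document's words taken up by this word
    rows = []
    for bag in bags:
        rows.append([bag.count(word) / len(bag) * idf[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, rows = tfidf_matrix(["the cat sat on my face", "the dog sat on my bed"])
print(vocabulary)
for row in rows:
    print([round(score, 6) for score in row])
```

The two printed rows should line up with the matrix above: only 'cat' and 'face' score above zero in the first document, and only 'bed' and 'dog' in the second.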
