# Exercise 07: TF-IDF

The following exercise uses the results of our parsing to calculate a *term frequency - inverse document frequency* (TF-IDF) metric, which we use to construct a *feature vector* for each document. First we'll load a *stopword* list of common words to exclude from the analysis:

In [1]:
import pynlp

stopwords = pynlp.load_stopwords("stop.txt")
print(stopwords)

{'an', 'both', 'get', 'not', 'for', 'each', 'since', 'us', 'or', 'while', 'at', 'your', 'that', 'who', 'now', 'all', 'the', 'about', 'also', 'on', 'come', 'from', 'and', 'how', 'same', 'use', 'out', 'so', 'than', 'new', 'such', 'let', 'up', 'to', 'few', 'find', 'feel', 'what', 'it', 'as', 'its', 'when', 'our', 'you', 'which', 'take', 'try', 'but', 'want', 'much', 'go', 'then', 'there', 'they', 'two', 'other', 'next', 'if', 'with', 'handle', 'do', 'can', 'have', 'be', 'of', 'see', 'where', 'this', 'more', 'write', 'one', 'over', 'like', 'by', 'around', 'in', 'just', 'we', 'a', 'my', "n't", 'some', 'their', 'i'}
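
To make the idea of a stopword set concrete, here is a minimal sketch of how such a set could be built from lines of text. It assumes one word per line and lowercases each entry; `load_stopwords_sketch` is a hypothetical helper for illustration and is not necessarily how `pynlp.load_stopwords` works internally.

```python
def load_stopwords_sketch(lines):
    """Build a stopword set from an iterable of text lines (hypothetical helper)."""
    stopwords = set()
    for line in lines:
        word = line.strip().lower()
        if word:  # skip blank lines
            stopwords.add(word)
    return stopwords


# In-memory stand-in for a stopword file, assuming one word per line.
sample_lines = ["the\n", "And\n", "\n", "of\n", "the\n"]
sample_stopwords = load_stopwords_sketch(sample_lines)
print(sorted(sample_stopwords))
```

A set is a natural fit here because membership tests such as `word in stopwords` are fast and duplicates collapse automatically.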


Next, we'll use a function from our `pynlp` library to iterate through the parsed tokens for one of the HTML documents. Note that the output still includes punctuation and stopwords at this stage:

In [2]:
%sx ls *.json

['a1.json', 'a2.json']

In [3]:
json_file = "a1.json"

for lex in pynlp.lex_iter(json_file):
    print(lex)

WordNode(raw='Almost', root='almost', pos='RB')
WordNode(raw='a', root='a', pos='DT')
WordNode(raw='year', root='year', pos='NN')
WordNode(raw='ago', root='ago', pos='RB')
WordNode(raw=',', root=',', pos='.')
WordNode(raw='we', root='we', pos='PRP')
WordNode(raw='published', root='publish', pos='VBD')
WordNode(raw='our', root='our', pos='PRP$')
WordNode(raw='now', root='now', pos='RB')
WordNode(raw='-', root='-', pos='.')
WordNode(raw='annual', root='annual', pos='JJ')
WordNode(raw='landscape', root='landscape', pos='NN')
WordNode(raw='of', root='of', pos='IN')
WordNode(raw='machine', root='machine', pos='NN')
WordNode(raw='intelligence', root='intelligence', pos='NN')
WordNode(raw='companies', root='company', pos='NNS')
WordNode(raw=',', root=',', pos='.')
WordNode(raw='and', root='and', pos='CC')
WordNode(raw='goodness', root='goodness', pos='NN')
WordNode(raw='have', root='have', pos='VBP')
WordNode(raw='we', root='we', pos='PRP')
WordNode(raw='seen', root='see', pos='VBN')
WordNode
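
The records above carry three fields: `raw`, `root`, and `pos`. As a rough sketch of how they can be filtered, the example below defines a stand-in `WordNode` as a `namedtuple` (an assumption based only on the printed output, not on the `pynlp` source) and keeps tokens that are neither punctuation nor stopwords. The sample tokens and the small stopword set are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the records printed above; field names are
# copied from that output, the actual pynlp definition may differ.
WordNode = namedtuple("WordNode", ["raw", "root", "pos"])

sample_nodes = [
    WordNode(raw="we", root="we", pos="PRP"),
    WordNode(raw="published", root="publish", pos="VBD"),
    WordNode(raw=",", root=",", pos="."),
    WordNode(raw="companies", root="company", pos="NNS"),
]
sample_stopwords = {"we", "our", "of"}


def keep(node, stopwords):
    """True when the token is neither punctuation (pos '.') nor a stopword root."""
    return node.pos != "." and node.root not in stopwords


kept_roots = [n.root for n in sample_nodes if keep(n, sample_stopwords)]
print(kept_roots)
```

Filtering on `root` rather than `raw` means that inflected forms such as "companies" and "company" are counted as the same keyword.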

We need to initialize some data structures for counting keywords. BTW, if you've heard that Big Data projects use [word count](http://spark.apache.org/examples.html) programs to demonstrate their capabilities, this is a major use case for that kind of program. Even so, our examples are conceptually simple, built for relatively small files, and not intended to scale:

In [4]:
from collections import defaultdict

files = ["a2.json", "a1.json"]  # only the parsed files shown in the directory listing above
files_tf = {}

d = len(files)
df = defaultdict(int)

Iterate through each parsed file, tallying counts for `tf` within each document while also tallying counts for `df` across all documents:

In [5]:
for json_file in files:
    tf = defaultdict(int)

    for lex in pynlp.lex_iter(json_file):
        # skip punctuation and stopwords
        if (lex.pos != ".") and (lex.root not in stopwords):
            tf[lex.root] += 1

    files_tf[json_file] = tf

    # each word counts once per document in which it appears
    for word in tf.keys():
        df[word] += 1

# print results for just the last file in the sequence
print(json_file, files_tf[json_file])

Let's take a look at the `df` results overall. If there are low-information common words in the list that you'd like to filter out, move them to your *stopword* list.

In [None]:
for word, count in sorted(df.items(), key=lambda kv: kv[1], reverse=True):
    print(word, count)
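
To see the two tallies side by side without any parsed files, the following sketch runs the same counting pattern on two tiny in-memory documents. The document names and word lists are invented for illustration and are assumed to be already filtered for punctuation and stopwords.

```python
from collections import defaultdict

# Toy documents: lists of word roots (made-up data).
toy_docs = {
    "doc_a": ["machine", "intelligence", "company", "machine"],
    "doc_b": ["machine", "landscape"],
}

toy_files_tf = {}
toy_df = defaultdict(int)

for name, roots in toy_docs.items():
    tf = defaultdict(int)

    # term frequency: how often each word occurs within this document
    for root in roots:
        tf[root] += 1

    toy_files_tf[name] = tf

    # document frequency: in how many documents each word occurs at all
    for word in tf:
        toy_df[word] += 1

for word, count in sorted(toy_df.items(), key=lambda kv: kv[1], reverse=True):
    print(word, count)
```

Here "machine" occurs twice in `doc_a`, so its `tf` there is 2, while its `df` is 2 because it appears in both documents.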

Finally, we make a second pass through the data, using the `df` counts to normalize `tf` counts, calculating the `tfidf` metrics for each keyword:

In [None]:
import math

for json_file in files:
    tf = files_tf[json_file]
    keywords = []

    for word, count in tf.items():
        tfidf = float(count) * math.log((d + 1.0) / (df[word] + 1.0))
        keywords.append((json_file, tfidf, word,))
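
The weighting in the cell above multiplies a word's count by `log((d + 1) / (df + 1))`. As a small worked sketch, the function below wraps that same expression and compares a word found in one document out of four with a word found in all four. The name `tfidf_weight` and the sample numbers are hypothetical, chosen only for illustration.

```python
import math


def tfidf_weight(count, d, doc_freq):
    """Same expression as the cell above: count * log((d + 1) / (doc_freq + 1))."""
    return float(count) * math.log((d + 1.0) / (doc_freq + 1.0))


d_toy = 4  # assumed number of documents

rare = tfidf_weight(3, d_toy, 1)    # 3 * log(5 / 2): appears in one document
common = tfidf_weight(3, d_toy, 4)  # 3 * log(5 / 5): appears in every document
print(rare, common)
```

With this particular smoothing, a word that appears in every document gets a weight of exactly zero no matter how often it occurs, which is why ubiquitous words drop to the bottom of the ranking.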

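Let's take a look at the results for one of the files. Because `keywords` is rebuilt on each pass through the loop above, it holds the results for the last file in `files`: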

In [None]:
for json_file, tfidf, word in sorted(keywords, key=lambda x: x[1], reverse=True):
    print("%s\t%7.4f\t%s" % (json_file, tfidf, word))
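
If you would like to compare the rankings for every file rather than only the last one, one option is to store a ranked list per file. The sketch below is an assumption about how that could be organized, not part of the original exercise: `rank_keywords` is a hypothetical helper, and the toy counts are invented so that the example runs on its own.

```python
import math


def rank_keywords(files_tf, df, d):
    """Return {file name: [(tfidf, word), ...]} sorted by descending tfidf."""
    ranked = {}

    for name, tf in files_tf.items():
        scored = [
            (float(count) * math.log((d + 1.0) / (df[word] + 1.0)), word)
            for word, count in tf.items()
        ]
        ranked[name] = sorted(scored, key=lambda pair: pair[0], reverse=True)

    return ranked


# Toy counts (made-up data) standing in for files_tf and df.
toy_files_tf = {
    "doc_a": {"machine": 2, "company": 1},
    "doc_b": {"machine": 1, "landscape": 3},
}
toy_df = {"machine": 2, "company": 1, "landscape": 1}

ranked = rank_keywords(toy_files_tf, toy_df, d=len(toy_files_tf))

for name, pairs in ranked.items():
    for tfidf, word in pairs:
        print("%s\t%7.4f\t%s" % (name, tfidf, word))
```

Keeping the results keyed by file name makes it easy to print or compare the top keywords for any document afterwards.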

Question: how does that vector of ranked keywords compare with your reading of the text from the HTML file?