# TFIDF (Term Frequency * Inverse Document Frequency) versus Term Counts

This example shows how TFIDF gives a better sense of how important a term is to a particular document than a raw term count does, by discounting terms that also appear in many other documents of the corpus.

The corpus is the set of 9 texts loaded by NLTK's `nltk.book` module, and the document examined is `text3`, the Book of Genesis.

This example computes the TFIDF for both God and Jacob in the Book of Genesis, and then compares them.  Using TFIDF you can argue that for this particular book (`text3`) Jacob is more important than God, since God occurs in many books across the corpus while Jacob occurs in few.

In [1]:
import string
import numpy as np
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [2]:
# Make a list of the terms (words) in the Book of Genesis, stripped of punctuation
terms = [t for t in text3[:] if t not in string.punctuation]
terms[:8]

['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and']

In [3]:
# Make a term count and display the top 50 terms in descending order.  
# Use a collections.Counter object.
from collections import Counter
term_count = Counter(terms)

# The 50 most common words in Genesis are:
term_count.most_common( 50 )

[('and', 2428),
 ('the', 2411),
 ('of', 1358),
 ('And', 1250),
 ('his', 651),
 ('he', 648),
 ('to', 611),
 ('unto', 590),
 ('in', 588),
 ('that', 509),
 ('I', 484),
 ('said', 476),
 ('him', 387),
 ('a', 342),
 ('my', 325),
 ('was', 317),
 ('for', 297),
 ('it', 290),
 ('with', 289),
 ('me', 282),
 ('thou', 272),
 ('is', 267),
 ('thy', 267),
 ('s', 263),
 ('thee', 257),
 ('be', 254),
 ('shall', 253),
 ('they', 249),
 ('all', 245),
 ('God', 231),
 ('them', 230),
 ('not', 224),
 ('which', 198),
 ('father', 198),
 ('will', 195),
 ('land', 184),
 ('Jacob', 179),
 ('came', 177),
 ('her', 173),
 ('LORD', 166),
 ('were', 163),
 ('she', 161),
 ('from', 157),
 ('Joseph', 157),
 ('their', 153),
 ('son', 152),
 ('sons', 142),
 ('upon', 139),
 ('had', 139),
 ('have', 132)]

In [4]:
'''The most common words ("and", "the", etc.) are not very interesting to examine.
They occur in every document in the corpus (I confirmed), so their IDF will be
zero (see formula below).  Instead, note the raw frequencies (term counts) of God
and Jacob shown above.  Genesis has 2780 distinct terms (len(term_count) counts
unique terms, not total words), and the raw counts suggest that God may have a
slightly more important role than Jacob . . . but we'll see what their respective
TFIDF values say.
'''
len(term_count), term_count['God'], term_count['Jacob']

(2780, 231, 179)

In [5]:
God = {} ; Jacob = {}   # empty dicts for storing values

## Compute TF

In [6]:
'''I will use these raw frequencies as their TF (term frequencies) without any 
further normalizing such as "augmented" frequency that would take into account 
the length of Genesis compared to the other documents in the corpus.  That is 
because I only care about the relative importance of the terms "God" and "Jacob"
within the document Genesis.  Within all the other documents I only care whether
these two terms occur or not, not how often they occur in those other documents.  So:
'''
God['TF']   = term_count['God']
Jacob['TF'] = term_count['Jacob']
God['TF'], Jacob['TF']

(231, 179)
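
For comparison, the "augmented" frequency mentioned above can be sketched as follows. This is only an illustration and is not used in the rest of the analysis: it assumes the common definition 0.5 + 0.5 * count / max_count, where max_count is the count of the most frequent term in the document, and the helper name `augmented_tf` is hypothetical. The counts are the ones shown in the Genesis term-count table.

```python
def augmented_tf(count, max_count):
    """Augmented term frequency (assumed definition): 0.5 + 0.5 * count / max_count."""
    return 0.5 + 0.5 * count / max_count

# Counts taken from the Genesis term-count table shown earlier.
max_count = 2428     # 'and', the most frequent term in Genesis
god_count = 231
jacob_count = 179

god_aug = augmented_tf(god_count, max_count)
jacob_aug = augmented_tf(jacob_count, max_count)
print("God: {:.3f}  Jacob: {:.3f}".format(god_aug, jacob_aug))
```

Both terms are scaled by the same `max_count`, so the ordering of their TF values within Genesis does not change; this kind of scaling mainly matters when comparing documents of different lengths.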

## Compute IDF
Across the corpus, a term's document frequency is the number of documents the term appears in, divided by the total number of documents.  We want the *INVERSE* of that, because the inverse fraction grows as the term becomes more *rare* across documents.  If the term appears in nearly every document, then the fraction $\longrightarrow$ 1 and its logarithm $\longrightarrow$ 0.

Formally, define:  IDF = log[ ( total # of documents in corpus ) / ( # of documents the term appears in ) ], where log is the natural logarithm (the code below uses `np.log`).

Note:  This does not ask how many occurrences of the term appear in a document.  Even a single occurrence is enough to count that document in the denominator.  A better measure in practice might be to count a document only if the term appears in it at least a few times.  But for now, I'm proceeding with this standard definition.
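
The definition above can be sketched on a toy corpus. This is an illustration only: the three tiny token lists and the helper name `idf` are made up for the example, and the natural logarithm is assumed.

```python
import math

def idf(term, documents):
    """IDF = log(N / n_t), where n_t counts documents containing the term at least once."""
    n_docs = len(documents)
    n_containing = sum(1 for doc in documents if term in doc)
    if n_containing == 0:
        raise ValueError("term does not occur in any document")
    return math.log(n_docs / n_containing)

# Hypothetical toy corpus: each document is a list of tokens.
toy_corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "sat"],
    ["the", "bird", "flew"],
]

idf_the = idf("the", toy_corpus)   # in all 3 documents  -> log(3/3) = 0
idf_sat = idf("sat", toy_corpus)   # in 2 of 3 documents -> log(3/2)
idf_cat = idf("cat", toy_corpus)   # in 1 of 3 documents -> log(3)
print(idf_the, idf_sat, idf_cat)
```

Note that "sat" occurs twice in the second toy document but that document is still counted only once, which is exactly the behavior described in the note above.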

In [7]:
corpus = text1, text2, text3, text4, text5, text6, text7, text8, text9
n_docs = len(corpus)

# keep track of how many documents contain "God", and how many "Jacob"
God['docs']   = 0
Jacob['docs'] = 0

for i, document in enumerate(corpus):
    
    # Use a separate name so the Genesis `terms` list from In [2] is not overwritten
    doc_terms = [t for t in document[:] if t not in string.punctuation]
    c = Counter(doc_terms)
    print('In text {} there are {:3} instances of God and {:3} instances of Jacob'
                .format( str(i+1), c['God'], c['Jacob']))
    if c['God']   > 0:   God['docs'] += 1
    if c['Jacob'] > 0: Jacob['docs'] += 1

# compute IDFs
God['IDF']   = np.log( n_docs /   God['docs'] )
Jacob['IDF'] = np.log( n_docs / Jacob['docs'] )

print("\nGod's IDF   = {:.3f}".format(God['IDF']))
print("Jacob's IDF = {:.3f}".format(Jacob['IDF']))

In text 1 there are 132 instances of God and   1 instances of Jacob
In text 2 there are  10 instances of God and   0 instances of Jacob
In text 3 there are 231 instances of God and 179 instances of Jacob
In text 4 there are  97 instances of God and   0 instances of Jacob
In text 5 there are   1 instances of God and   0 instances of Jacob
In text 6 there are  11 instances of God and   0 instances of Jacob
In text 7 there are   1 instances of God and   2 instances of Jacob
In text 8 there are   0 instances of God and   0 instances of Jacob
In text 9 there are  33 instances of God and   0 instances of Jacob

God's IDF   = 0.118
Jacob's IDF = 1.099
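
The stricter variant suggested earlier (count a document only if the term appears in it at least a few times) can be tried directly on the counts printed above. This is a minimal sketch: the threshold of 3 is an arbitrary assumption, and `idf_with_threshold` is a hypothetical helper, not part of the original notebook.

```python
import math

# Per-document counts copied from the output above (text1 .. text9).
god_counts   = [132, 10, 231, 97, 1, 11, 1, 0, 33]
jacob_counts = [1, 0, 179, 0, 0, 0, 2, 0, 0]

def idf_with_threshold(counts, min_count=1):
    """IDF where a document counts only if the term occurs at least min_count times."""
    n_docs = len(counts)
    n_containing = sum(1 for c in counts if c >= min_count)
    if n_containing == 0:
        raise ValueError("no document reaches min_count")
    return math.log(n_docs / n_containing)

# min_count=1 reproduces the standard definition used in this notebook.
god_idf_std = idf_with_threshold(god_counts)
jacob_idf_std = idf_with_threshold(jacob_counts)

# min_count=3 (assumed threshold) ignores documents with only 1 or 2 occurrences.
god_idf_strict = idf_with_threshold(god_counts, min_count=3)
jacob_idf_strict = idf_with_threshold(jacob_counts, min_count=3)

print("standard: God {:.3f}, Jacob {:.3f}".format(god_idf_std, jacob_idf_std))
print("strict:   God {:.3f}, Jacob {:.3f}".format(god_idf_strict, jacob_idf_strict))
```

With this threshold, God is counted in 6 documents instead of 8 and Jacob in 1 instead of 3, so both IDFs rise and Jacob's remains the larger of the two.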


In [8]:
# Finally, compute TFIDF = TF * IDF
God['TFIDF']   =   God['TF'] *   God['IDF']
Jacob['TFIDF'] = Jacob['TF'] * Jacob['IDF']

print("\nGod's TFIDF   = {:5.1f}".format(God['TFIDF']))
print("Jacob's TFIDF = {:5.1f}".format(Jacob['TFIDF']))
print("Relative importance (arguably) of Jacob over God in Genesis = "+
      "Jacob['TFIDF'] / God['TFIDF'] = {:2.1f}"
      .format(Jacob['TFIDF'] / God['TFIDF']))


God's TFIDF   =  27.2
Jacob's TFIDF = 196.7
Relative importance (arguably) of Jacob over God in Genesis = Jacob['TFIDF'] / God['TFIDF'] = 7.2


__Conclusion:__  Jacob's TFIDF in Genesis is ~7 times higher than God's, even though God's raw count is higher (231 vs. 179).  This indicates that Jacob arguably plays a particularly important role in `text3`, maybe even more important than God: God is a term that occurs in most books of the corpus (8 of 9), while Jacob rarely appears outside `text3` (3 of 9).
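
The same calculation can be generalized to score every term in a document at once. The sketch below is an illustration under stated assumptions, not the notebook's method: `tfidf_scores` is a hypothetical helper, the toy corpus is invented, raw counts are used as TF, and the target document is assumed to be one of the corpus documents (so every term has a document frequency of at least 1).

```python
import math
from collections import Counter

def tfidf_scores(target, documents):
    """Return {term: raw count in target * log(N / document frequency)}."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))    # each document counts once per term
    tf = Counter(target)
    return {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}

# Hypothetical toy corpus: each document is a list of tokens.
toy_docs = [
    ["god", "jacob", "jacob", "the"],
    ["god", "the", "whale"],
    ["god", "the", "the"],
]

scores = tfidf_scores(toy_docs[0], toy_docs)
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

In this toy corpus, "god" and "the" appear in every document and score zero, so "jacob" ranks first, mirroring the effect seen for Genesis.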