# TFIDF (Term Frequency * Inverse Document Frequency) versus Term Counts
### This is an example of how TFIDF provides a better idea of how important a term is to a particular document, by comparing its frequency across other documents in a corpus.
### The corpus is the set of 9 texts that ship with NLTK's `nltk.book` module.
### The particular document examined is text3, which is the Book of Genesis.
### This example computes the TFIDF for both God and Jacob in the Book of Genesis, and then compares them.  Theology aside, TFIDF argues for Jacob being particularly important to the Book of Genesis, arguably more important than God, given God's greater presence across the corpus.

In [None]:
import string
import numpy as np
from nltk.book import *

In [None]:
# Make a list of the terms (words) in the Book of Genesis, stripped of punctuation
terms = [t for t in text3[:] if t not in string.punctuation]
terms[:8]

In [None]:
# Make a term count and display the top 50 terms in descending order.  
# Use a collections.Counter object.
from collections import Counter
term_count = Counter(terms)

# The 50 most common words in Genesis are:
term_count.most_common( 50 )

In [None]:
'''The most common words ("and", "the", etc.) are not very interesting to examine.
They occur in every document in the corpus (I confirmed), so their IDF is 
zero (see formula below).  Instead, note the raw frequencies 
(term counts) of God and Jacob shown above.  Among the 2780 distinct terms in 
Genesis, their raw frequencies suggest that God may have a slightly more important 
role than Jacob . . . but we'll see what their respective TFIDF values say.
'''
len(term_count), term_count['God'], term_count['Jacob']

In [None]:
God = {} ; Jacob = {}   # empty dicts for storing values

## Compute TF

In [None]:
'''I will use these raw frequencies as their TF (term frequencies) without any 
further normalizing such as "augmented" frequency that would take into account 
the length of Genesis compared to the other documents in the corpus.  That is 
because I only care about the relative importance of the terms "God" and "Jacob"
within the document Genesis.  Within all the other documents I only care whether
these two terms occur or not, not how often they occur in those other documents.  So:
'''
God['TF']   = term_count['God']
Jacob['TF'] = term_count['Jacob']
God['TF'], Jacob['TF']
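
For comparison, the "augmented" frequency mentioned above is commonly defined as 0.5 + 0.5 * f / max_f, where f is the raw count of the term and max_f is the count of the most frequent term in the same document. The sketch below is only an illustration on a small made-up token list, not on Genesis; `augmented_tf` and `toy_tokens` are hypothetical names that do not appear elsewhere in this notebook.

```python
from collections import Counter

def augmented_tf(term, tokens):
    """Augmented term frequency: 0.5 + 0.5 * f / max_f (0.0 for an empty document)."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    max_count = max(counts.values())
    return 0.5 + 0.5 * counts[term] / max_count

# Made-up tokens for illustration only
toy_tokens = ["God", "said", "Let", "there", "be", "light",
              "and", "there", "was", "light"]

print(augmented_tf("light", toy_tokens))  # "light" is tied for most frequent
print(augmented_tf("God", toy_tokens))    # "God" occurs half as often as "light"
```

Because this scaling divides by the largest count in the document, it mainly matters when comparing documents of different lengths, which is why the raw counts are sufficient for the comparison made here.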

## Compute IDF
Across the documents, each term's *document frequency* is the number of documents the term appears in, divided by the total number of documents.  We want the *INVERSE* of that, because the inverse fraction grows as the term becomes more *rare* across documents.  If the term is more common across documents, then the fraction $\longrightarrow$ 1, and its logarithm $\longrightarrow$ 0.

Formally, define:  $IDF = \log\left(\frac{\text{total number of documents in corpus}}{\text{number of documents the term appears in}}\right)$

Note:  This does not ask how many times the term occurs in a document.  Even a single occurrence is enough to count that document in the denominator.  A better measure in practice might count a document only if the term appears in it at least a few times.  For now, I'm proceeding with the standard definition.
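
As a small worked example of the definition, here is a minimal sketch on a made-up three-document corpus. The helper `idf`, its `min_count` parameter, and `toy_docs` are hypothetical names introduced only for this illustration. With `min_count=1` it follows the standard definition; a larger value sketches the stricter variant suggested in the note, under the assumption that a simple occurrence threshold is what is wanted.

```python
import math

def idf(term, documents, min_count=1):
    """log(N / n_t), where n_t counts documents with at least min_count occurrences."""
    n_containing = sum(1 for tokens in documents if tokens.count(term) >= min_count)
    if n_containing == 0:
        raise ValueError(f"{term!r} meets the threshold in no document")
    return math.log(len(documents) / n_containing)

# Made-up documents for illustration only
toy_docs = [
    ["and", "God", "said", "unto", "Jacob", "and", "God", "blessed", "Jacob"],
    ["and", "God", "save", "the", "king"],
    ["and", "the", "whale", "swam"],
]

print(idf("and", toy_docs))               # in 3 of 3 documents: log(1) = 0.0
print(idf("God", toy_docs))               # in 2 of 3 documents: log(1.5)
print(idf("God", toy_docs, min_count=2))  # only the first document qualifies: log(3)
```

A term that appears in every document gets an IDF of exactly zero, which is why words such as "and" and "the" drop out of the comparison.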

In [None]:
corpus = text1, text2, text3, text4, text5, text6, text7, text8, text9
n_docs = len(corpus)

# keep track of how many documents contain "God", and how many "Jacob"
God['docs']   = 0
Jacob['docs'] = 0

for i, document in enumerate(corpus):
    
    terms = [t for t in document[:] if t not in string.punctuation]
    c = Counter(terms)
    print('In text {} there are {:3} instances of God and {:3} instances of Jacob'
                .format( str(i+1), c['God'], c['Jacob']))
    if c['God']   > 0:   God['docs'] += 1
    if c['Jacob'] > 0: Jacob['docs'] += 1

# compute IDFs
God['IDF']   = np.log( n_docs /   God['docs'] )
Jacob['IDF'] = np.log( n_docs / Jacob['docs'] )

print("\nGod's IDF   = {:.3f}".format(God['IDF']))
print("Jacob's IDF = {:.3f}".format(Jacob['IDF']))

In [None]:
# Finally, compute TFIDF = TF * IDF
God['TFIDF']   =   God['TF'] *   God['IDF']
Jacob['TFIDF'] = Jacob['TF'] * Jacob['IDF']

print("\nGod's TFIDF   = {:5.1f}".format(God['TFIDF']))
print("Jacob's TFIDF = {:5.1f}".format(Jacob['TFIDF']))
print("Relative importance (arguably) of Jacob over God in Genesis = "+
      "Jacob['TFIDF'] / God['TFIDF'] = {:2.1f}"
      .format(Jacob['TFIDF'] / God['TFIDF']))

### Jacob's TFIDF in Genesis is ~7 times higher than God's, indicating that Jacob arguably plays a particularly important role, even compared to God who surely must have an important role in Genesis.
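
To score other terms the same way, the steps above (raw count as TF, log of the inverse document fraction as IDF) could be wrapped in one helper. This is a minimal sketch, not a replacement for the cells above: `tfidf`, `toy_corpus`, and `genesis_like` are hypothetical names, and the corpus is made up so that the example runs without downloading the NLTK data.

```python
import math
from collections import Counter

def tfidf(term, doc_tokens, corpus):
    """Raw count of term in doc_tokens times log(N / number of documents containing it)."""
    tf = Counter(doc_tokens)[term]
    n_containing = sum(1 for tokens in corpus if term in tokens)
    if n_containing == 0:
        return 0.0  # the term occurs nowhere, so there is nothing to score
    return tf * math.log(len(corpus) / n_containing)

# Made-up corpus for illustration only
toy_corpus = [
    ["God", "blessed", "Jacob", "and", "Jacob", "went"],
    ["God", "save", "the", "king"],
    ["the", "whale", "and", "the", "sea"],
]
genesis_like = toy_corpus[0]

for term in ("God", "Jacob", "and"):
    print(term, round(tfidf(term, genesis_like, toy_corpus), 3))
```

In this toy corpus "Jacob" scores highest because it is both frequent in the first document and absent from the others, which mirrors the pattern observed for Genesis.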