# Lab 5: TF-IDF Intro

We are going to calculate TF-IDF scores for a small document set.
We will use the TextBlob library we just learned.

Adapted with thanks from http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/

In [4]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /home/ubuntu/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /home/ubuntu/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /home/ubuntu/nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /home/ubuntu/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /home/ubuntu/nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /home/ubuntu/nltk_data...
[nltk_data]    |   Package cess_cat is already up-to-date!
[nltk_data]    | Downloading package

True

## Basic setup

In [5]:
## basic stuff

import math
from textblob import TextBlob as tb
import nltk

# setup nltk
from os.path import expanduser
nltk.data.path.append( expanduser("~") + "/data/nltk_data")

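# term frequency: share of the words in this document that are `word`
def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

# number of documents in bloblist that contain `word`
def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

# inverse document frequency: log(total documents / documents containing `word`)
def idf(word, bloblist):
    containing = n_containing(word, bloblist)
    if containing == 0:
        # word appears in no document; return 0 to avoid dividing by zero
        return 0
    else:
        # smoothed variant, disabled:
        #return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))
        return math.log(len(bloblist) / containing)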

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)
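
The commented-out line in `idf` is a smoothed variant that adds 1 to the document count in the denominator. The sketch below compares the two variants on the three example sentences analyzed in this lab. It is a minimal illustration, not part of the lab code: it assumes plain whitespace tokenization via `str.split()` in place of TextBlob's `words`, and the names `docs`, `idf_plain`, and `idf_smoothed` are hypothetical.

```python
import math

# The three example sentences from this lab, tokenized on whitespace
# (an assumption: TextBlob's tokenizer is not used here).
docs = [
    "the brown dog likes the white cow".split(),
    "the grass is brown".split(),
    "the spotted cow likes green grass".split(),
]

def n_containing(word, docs):
    """Number of documents that contain the word at least once."""
    return sum(1 for doc in docs if word in doc)

def idf_plain(word, docs):
    """Unsmoothed IDF, matching the active return statement in idf()."""
    containing = n_containing(word, docs)
    return math.log(len(docs) / containing) if containing else 0

def idf_smoothed(word, docs):
    """Smoothed IDF, matching the commented-out line (1 added to the count)."""
    return math.log(len(docs) / (1 + n_containing(word, docs)))

# Compare both variants for a rare, a common, and a ubiquitous word.
for word in ["dog", "brown", "the"]:
    print(word, idf_plain(word, docs), idf_smoothed(word, docs))
```

With the smoothed variant, a word that appears in every document (such as `the`) gets a negative score, and a word that appears in two of the three documents (such as `brown`) drops to zero. On a corpus this small, that difference may matter.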


## Calculating TF-IDF

In [6]:
## Execute the basic setup cell above first

# analyzing
d1 = "the brown dog likes the white cow"
d2 = "the grass is brown"
d3 = "the spotted cow likes green grass"


doclist = [tb(d1), tb(d2), tb(d3)]

for i, tbdoc in enumerate(doclist):
    print("===document {}".format(i+1))
    words = tbdoc.words
    print("   # total words in document {}".format(len(words)))
    wordscore = {}
    for w in words:
        print("   "  + w)
        wtf = tf(w, tbdoc)
        widf = idf(w, doclist)
        wtfidf = tfidf(w, tbdoc, doclist)
        wordscore[w] = wtfidf
        print("        tf: {},  idf: {},  tfidf: {}".format(wtf, widf, wtfidf))
    
    # sort the words by TF-IDF score for this document
    wordscore_sorted = sorted(wordscore.items(), key=lambda x: x[1], reverse=True)
    print("   words sorted by TFIDF : ")
    for word, score in wordscore_sorted:
        print("      {} ({}), ".format(word, score))
        
    print("")


===document 1
   # total words in document 7
   the
        tf: 0.2857142857142857,  idf: 0.0,  tfidf: 0.0
   brown
        tf: 0.14285714285714285,  idf: 0.4054651081081644,  tfidf: 0.05792358687259491
   dog
        tf: 0.14285714285714285,  idf: 1.0986122886681098,  tfidf: 0.15694461266687282
   likes
        tf: 0.14285714285714285,  idf: 0.4054651081081644,  tfidf: 0.05792358687259491
   the
        tf: 0.2857142857142857,  idf: 0.0,  tfidf: 0.0
   white
        tf: 0.14285714285714285,  idf: 1.0986122886681098,  tfidf: 0.15694461266687282
   cow
        tf: 0.14285714285714285,  idf: 0.4054651081081644,  tfidf: 0.05792358687259491
   words sorted by TFIDF : 
      dog (0.15694461266687282), 
      white (0.15694461266687282), 
      brown (0.05792358687259491), 
      likes (0.05792358687259491), 
      cow (0.05792358687259491), 
      the (0.0), 

===document 2
   # total words in document 4
   the
        tf: 0.25,  idf: 0.0,  tfidf: 0.0
   grass
        tf: 0.25,  idf: 0.4054
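
The printed scores can be checked by hand. The following minimal sketch recomputes the values for `dog` and `brown` in document 1 directly from the word counts, without TextBlob. The variable names are illustrative only and do not appear in the lab code.

```python
import math

# Counts read off the three example documents:
# document 1 has 7 words; "dog" occurs once in it and in 1 of 3 documents,
# "brown" occurs once in it and in 2 of 3 documents.
n_docs = 3
words_in_d1 = 7

tf_dog = 1 / words_in_d1
idf_dog = math.log(n_docs / 1)
tfidf_dog = tf_dog * idf_dog

tf_brown = 1 / words_in_d1
idf_brown = math.log(n_docs / 2)
tfidf_brown = tf_brown * idf_brown

print("dog:   tf={:.4f} idf={:.4f} tfidf={:.4f}".format(tf_dog, idf_dog, tfidf_dog))
print("brown: tf={:.4f} idf={:.4f} tfidf={:.4f}".format(tf_brown, idf_brown, tfidf_brown))
```

These values agree with the `dog` and `brown` lines in the document 1 output above. Both words occur once in the document, so the difference in score comes entirely from the IDF term.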

## Construct term matrix
When we search for the term 'cow', which document is ranked first?

In [7]:
## Execute the basic setup cell above first

d1 = "the brown dog likes the white cow"
d2 = "the grass is brown"
d3 = "the spotted cow likes green grass"

tb_doclist = [tb(d1), tb(d2), tb(d3)]
term_matrix = {}

for i, tbdoc in enumerate(tb_doclist):
    words = tbdoc.words
    for w in words:
        doc_dict = term_matrix.get(w, {})
        wtfidf = tfidf(w, tbdoc, tb_doclist)
        doc_dict["d{}".format(i+1)] = wtfidf
        term_matrix[w] = doc_dict  # update
        

# sort the term matrix by highest ranking doc
term_matrix_sorted = {}
for term, doc_dict in term_matrix.items():
    #print(term)
    #print(doc_dict)
    doc_list_sorted = sorted(doc_dict.items(), key = lambda x:x[1], reverse=True)
    #print(doc_list_sorted)
    term_matrix_sorted[term] = doc_list_sorted
    
# finally print out the sorted matrix,
# listing the terms in alphabetical order
for term in sorted(term_matrix_sorted.keys()):
    print(term)
    print(term_matrix_sorted[term])
    print()
    


brown
[('d2', 0.1013662770270411), ('d1', 0.05792358687259491)]

cow
[('d3', 0.06757751801802739), ('d1', 0.05792358687259491)]

dog
[('d1', 0.15694461266687282)]

grass
[('d2', 0.1013662770270411), ('d3', 0.06757751801802739)]

green
[('d3', 0.1831020481113516)]

is
[('d2', 0.27465307216702745)]

likes
[('d3', 0.06757751801802739), ('d1', 0.05792358687259491)]

spotted
[('d3', 0.1831020481113516)]

the
[('d1', 0.0), ('d2', 0.0), ('d3', 0.0)]

white
[('d1', 0.15694461266687282)]
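
To answer the question posed at the start of this section, the sorted lists can be queried directly. The sketch below is a minimal illustration that copies two entries from the printed output into a dictionary. The helper `top_document` is a hypothetical name, not part of the lab code.

```python
# Entries copied from the printed term matrix above (each list is already
# sorted by TF-IDF score, highest first).
term_matrix_sorted = {
    "cow": [("d3", 0.06757751801802739), ("d1", 0.05792358687259491)],
    "dog": [("d1", 0.15694461266687282)],
}

def top_document(term, matrix):
    """Return the highest-ranked document id for a term, or None if unseen."""
    ranked = matrix.get(term, [])
    return ranked[0][0] if ranked else None

print(top_document("cow", term_matrix_sorted))    # d3
print(top_document("zebra", term_matrix_sorted))  # None
```

For `cow`, `d3` ranks first. The word occurs once in both `d1` and `d3`, but `d3` has six words against seven in `d1`, so its term frequency is slightly higher.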

