# Lab Objectives

This lab aims to show how tf-idf works by walking you through an example, including:

1. From Documents to Bags-of-words
2. Transforming raw tf to the log term frequency
3. Adding idf to the weighted matrix
4. The final term-document TF-IDF matrix

Please try out the following cells and run the Python code in your notebook.

***
***This is not an assignment and you do not need to submit it***

# From Documents to Bags-of-words

In the Vector Space Model (VSM), we view each document as a *bag of words*. A bag is a multiset: word order is ignored, but the number of occurrences of each word is kept.

In Python, a natural way to represent a bag is `collections.Counter`.
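
As a small illustration of why `Counter` suits a bag, here is a minimal sketch. The toy sentence and the variable names are made up for this example and are not part of the lab data.

```python
from collections import Counter

# A toy sentence, chosen only for illustration.
bag = Counter("to be or not to be".split())

print(bag["to"])     # multiplicity of a word that occurs twice
print(bag["maybe"])  # an absent word counts as 0 instead of raising KeyError

# Adding two Counters sums the multiplicities (multiset sum).
merged = bag + Counter(["be", "happy"])
print(merged["be"])
```

Looking up an absent word does not insert it into the bag, so the bag still contains only the words that were actually seen.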


In [1]:
# This is a helper function.
# The cprint function is defined to print NumPy arrays with a more compact format

from pprint import pprint

import numpy as np
import contextlib

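@contextlib.contextmanager
def printoptions(*args, **kwargs):
    # Temporarily override NumPy's print options and restore them on exit,
    # even if the code inside the `with` block raises an exception.
    original = np.get_printoptions()
    np.set_printoptions(*args, **kwargs)
    try:
        yield
    finally:
        np.set_printoptions(**original)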

## compact print (numpy array) 
def cprint(x):
    with printoptions(precision=3, suppress=True, linewidth=120):
        print(x)        
    
# test
x = np.random.random(10)
cprint(x)

[0.935 0.042 0.914 0.797 0.364 0.79  0.283 0.486 0.39  0.405]


Now we count how many times each word appears in each document:

In [2]:
doc_list = ['Julie loves me more than Linda loves me',
            'Jane likes me more than Julie loves me',
            'He likes basketball more than baseball']

bags = []

from collections import Counter

# we use lower() here so that the sorted vocabulary is easier to see.  
# For simplicity, we ignore the steps of stop words removal and stemming during tokenization.
bags = [Counter(doc.lower().split()) for doc in doc_list]

for bag in bags:
    print(bag)

Counter({'loves': 2, 'me': 2, 'julie': 1, 'more': 1, 'than': 1, 'linda': 1})
Counter({'me': 2, 'jane': 1, 'likes': 1, 'more': 1, 'than': 1, 'julie': 1, 'loves': 1})
Counter({'he': 1, 'likes': 1, 'basketball': 1, 'more': 1, 'than': 1, 'baseball': 1})


Now we construct the vocabulary and the doc-term matrix. 

In [3]:
# Build the sorted vocabulary: the set of all distinct words across the bags.
import itertools

vocabulary = sorted(set(itertools.chain.from_iterable(bags)))
vocabulary

['baseball',
 'basketball',
 'he',
 'jane',
 'julie',
 'likes',
 'linda',
 'loves',
 'me',
 'more',
 'than']

In [4]:
# A helper function to print vocabulary vertically.


def print_vocabulary_vertically(voc, leading_str = '', spacing=2, align=1):
    # align = 0: align top; otherwise, align bottom
    max_len = max([len(v) for v in voc])
    for i in range(max_len):
        if align == 0:
            line = [v[i] if i < len(v) else ' ' for v in voc]
        else:
            line = [' ' if i < max_len - len(v) else v[i-max_len] for v in voc]
        print('{}{}'.format(leading_str, (' '*spacing).join(line)))

print_vocabulary_vertically(vocabulary, align=0)

b  b  h  j  j  l  l  l  m  m  t
a  a  e  a  u  i  i  o  e  o  h
s  s     n  l  k  n  v     r  a
e  k     e  i  e  d  e     e  n
b  e        e  s  a  s         
a  t                           
l  b                           
l  a                           
   l                           
   l                           


In [5]:
# printing out the raw term frequency.



for doc in doc_list:
    print(doc)
print()
vec_list = []
print_vocabulary_vertically(vocabulary, leading_str=' ')
print('-'*70)
for bag in bags:
    vec = [bag[v] for v in vocabulary] # Counter['non-existing-key'] = 0
    vec_list.append(vec)
    print(vec)

Julie loves me more than Linda loves me
Jane likes me more than Julie loves me
He likes basketball more than baseball

    b                           
    a                           
 b  s                           
 a  k                           
 s  e                           
 e  t        j  l  l  l         
 b  b     j  u  i  i  o     m  t
 a  a     a  l  k  n  v     o  h
 l  l  h  n  i  e  d  e  m  r  a
 l  l  e  e  e  s  a  s  e  e  n
----------------------------------------------------------------------
[0, 0, 0, 0, 1, 0, 1, 2, 2, 1, 1]
[0, 0, 0, 1, 1, 1, 0, 1, 2, 1, 1]
[1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
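
An alternative way to build the same doc-term matrix is to map each term to a column index and fill in only the non-zero entries. The sketch below uses plain Python lists. The names `col` and `matrix` are introduced here for illustration and do not appear elsewhere in the lab.

```python
from collections import Counter

docs = ['Julie loves me more than Linda loves me',
        'Jane likes me more than Julie loves me',
        'He likes basketball more than baseball']
bags = [Counter(d.lower().split()) for d in docs]

# Sorted vocabulary: the union of the keys of all bags.
vocabulary = sorted(set().union(*bags))

# Map each term to its column index in the matrix.
col = {term: j for j, term in enumerate(vocabulary)}

# Start from an all-zero matrix and write only the counts that exist.
matrix = [[0] * len(vocabulary) for _ in bags]
for i, bag in enumerate(bags):
    for term, count in bag.items():
        matrix[i][col[term]] = count

for row in matrix:
    print(row)
```

This touches only the words that actually occur in each document, which matters when the vocabulary is much larger than any single document.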


# Transform raw tf to the log term frequency $1 + \log(tf)$

In [6]:
from math import log
def normalize_tf(tf):
    if tf == 0:
        return 0.0
    else:
        return 1.0 + log(tf)

In [8]:
# The log dampens large raw counts: a term that occurs twice as often
# gains only log(2) in weight instead of doubling it.

tf_vec_list = []
for vec in vec_list:
    tf_vec_list.append([normalize_tf(val) for val in vec])

print_vocabulary_vertically(vocabulary, leading_str='   ', spacing=6)
print('-'*80)
cprint(np.matrix(tf_vec_list))

          b                                                               
          a                                                               
   b      s                                                               
   a      k                                                               
   s      e                                                               
   e      t                    j      l      l      l                     
   b      b             j      u      i      i      o             m      t
   a      a             a      l      k      n      v             o      h
   l      l      h      n      i      e      d      e      m      r      a
   l      l      e      e      e      s      a      s      e      e      n
--------------------------------------------------------------------------------
[[0.    0.    0.    0.    1.    0.    1.    1.693 1.693 1.    1.   ]
 [0.    0.    0.    1.    1.    1.    0.    1.    1.693 1.    1.   ]
 [1.    1.    1.    0.    0.    1.    0.    0.    0.    1.    1.   ]]
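
To see what the log does, it helps to compare raw counts with their log-scaled weights. The sketch below re-implements the same weighting under the hypothetical name `log_tf` so that it runs on its own. The sample counts are arbitrary.

```python
from math import log

def log_tf(tf):
    # Same weighting as normalize_tf in this lab: 0 stays 0, otherwise 1 + ln(tf).
    return 0.0 if tf == 0 else 1.0 + log(tf)

# Arbitrary raw counts, chosen to show how slowly the weight grows.
weights = {raw: log_tf(raw) for raw in [0, 1, 2, 10, 100]}

for raw, w in weights.items():
    print(raw, round(w, 3))
```

A term that occurs 100 times ends up with a weight of roughly 5.6 rather than 100, so a single very frequent term cannot dominate the document vector.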

# Adding idf to the weighted matrix

The vectors in `tf_vec_list` above only keep track of (log-scaled) term frequencies (`tf`s). Now we weight each element by the inverse document frequency of its term: $w_{d, t} = tf(d, t) \cdot idf(t)$.

Note that the textbook idf, $\log(\frac{ndocs}{df})$, is 0 when `df == ndocs`, so a term that occurs in every document would get zero weight. To deal with this corner case, we follow an implementation that uses:
$$idf(t) = 1 + \log(\frac{ndocs}{df + 1})$$
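
The difference between the two variants is easiest to see at the corner case. The sketch below compares them for a 3-document collection. The function names `textbook_idf` and `smoothed_idf` are chosen here for illustration only.

```python
from math import log

def textbook_idf(df, ndocs):
    # Plain idf: becomes 0 when the term occurs in every document.
    return log(ndocs / df)

def smoothed_idf(df, ndocs):
    # The variant used in this lab: 1 + log(ndocs / (df + 1)).
    return 1.0 + log(ndocs / (df + 1))

ndocs = 3
for df in [1, 2, 3]:
    print(df, round(textbook_idf(df, ndocs), 3), round(smoothed_idf(df, ndocs), 3))
```

With `df == ndocs == 3`, the textbook value is 0 while the smoothed value stays positive, so the term still contributes to the final weight.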

By the way, you will often need to make adjustments like this to tackle new or specific problems once you join the workforce :) That is why a deep understanding of the principles behind what you learn at uni is important.

In [9]:
from math import log
def idf(cnt, ndocs): 
    return 1.0 + log(ndocs/(cnt+1))  

In [10]:
ndocs = len(doc_list)
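# voc = [ (v, [b[v] for b in bags]) for v in vocabulary] # if you want to see the individual counts
# Note: this sums raw counts over all documents, so a term repeated within
# one document (e.g. 'me') is counted once per occurrence, not once per document.
voc = [(v, sum([b[v] for b in bags])) for v in vocabulary]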
pprint(voc)

[('baseball', 1),
 ('basketball', 1),
 ('he', 1),
 ('jane', 1),
 ('julie', 2),
 ('likes', 2),
 ('linda', 1),
 ('loves', 3),
 ('me', 4),
 ('more', 3),
 ('than', 3)]
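
Note that the counts above are total occurrences across the collection, which is why 'me' gets 4 even though there are only 3 documents. If `df` is instead taken to mean the number of documents that contain the term, it could be computed as in the sketch below. The names `cf` and `df` are introduced here for illustration, and the rest of the lab keeps using the counts shown above.

```python
from collections import Counter

docs = ['Julie loves me more than Linda loves me',
        'Jane likes me more than Julie loves me',
        'He likes basketball more than baseball']
bags = [Counter(d.lower().split()) for d in docs]
vocabulary = sorted(set().union(*bags))

# Collection frequency: total occurrences across all documents.
cf = {t: sum(b[t] for b in bags) for t in vocabulary}

# Document frequency: number of documents containing the term at least once.
df = {t: sum(1 for b in bags if t in b) for t in vocabulary}

# Show only the terms where the two counts differ.
for t in vocabulary:
    if cf[t] != df[t]:
        print(t, cf[t], df[t])
```

The two counts differ only for terms that repeat within a single document, here 'loves' and 'me'.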


Next we replace the counts in `voc` with the corresponding idf values.

In [11]:
voc = [(v, idf( sum([b[v] for b in bags]), ndocs)) for v in vocabulary]
idf_dict = dict(voc)
pprint(voc)
# print(idf_dict['he'])

[('baseball', 1.4054651081081644),
 ('basketball', 1.4054651081081644),
 ('he', 1.4054651081081644),
 ('jane', 1.4054651081081644),
 ('julie', 1.0),
 ('likes', 1.0),
 ('linda', 1.4054651081081644),
 ('loves', 0.7123179275482191),
 ('me', 0.4891743762340093),
 ('more', 0.7123179275482191),
 ('than', 0.7123179275482191)]


# The final term-document TF-IDF matrix

We choose to implement it based on matrix operations supported by `numpy`. For example, to get the elementwise products $\{a_i b_i\}_{i=1}^n$ from two vectors $\vec{A} = \{a_i\}_{i=1}^n$ and $\vec{B} = \{b_i\}_{i=1}^n$, we first construct a diagonal matrix $\mathbf{D_A}$ whose diagonal elements are $a_i$. The desired result is then the standard matrix product $\vec{B} \mathbf{D_A}$, with $\vec{B}$ treated as a row vector.
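
The following sketch checks this identity on a tiny example. The vectors `a` and `b` hold arbitrary values and stand in for the idf vector and one tf row.

```python
import numpy as np

a = np.array([2.0, 0.5, 3.0])  # plays the role of the idf vector
b = np.array([1.0, 4.0, 2.0])  # plays the role of one tf row

D_a = np.diag(a)               # diagonal matrix with a on the diagonal
via_matrix = b @ D_a           # row vector times diagonal matrix
via_elementwise = a * b        # direct elementwise product

print(via_matrix)
print(via_elementwise)
```

Both lines print the same vector. For a large vocabulary, the elementwise form avoids building an n-by-n matrix that is almost entirely zeros.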

In [12]:
import numpy as np

def build_idf_matrix(idf_vector):
    idf_mat = np.zeros((len(idf_vector), len(idf_vector)))
    np.fill_diagonal(idf_mat, idf_vector)
    return idf_mat

idf_matrix = build_idf_matrix([v[1] for v in voc])
cprint(idf_matrix)

[[1.405 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.   ]
 [0.    1.405 0.    0.    0.    0.    0.    0.    0.    0.    0.   ]
 [0.    0.    1.405 0.    0.    0.    0.    0.    0.    0.    0.   ]
 [0.    0.    0.    1.405 0.    0.    0.    0.    0.    0.    0.   ]
 [0.    0.    0.    0.    1.    0.    0.    0.    0.    0.    0.   ]
 [0.    0.    0.    0.    0.    1.    0.    0.    0.    0.    0.   ]
 [0.    0.    0.    0.    0.    0.    1.405 0.    0.    0.    0.   ]
 [0.    0.    0.    0.    0.    0.    0.    0.712 0.    0.    0.   ]
 [0.    0.    0.    0.    0.    0.    0.    0.    0.489 0.    0.   ]
 [0.    0.    0.    0.    0.    0.    0.    0.    0.    0.712 0.   ]
 [0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.712]]


In [13]:
doc_term_matrix_tfidf = []

# Multiply each log-tf row vector by the diagonal idf matrix.
for vec in tf_vec_list:
    doc_term_matrix_tfidf.append(np.dot(vec, idf_matrix))
cprint(np.matrix(doc_term_matrix_tfidf)) # np.matrix() just to make it easier to look at

[[0.    0.    0.    0.    1.    0.    1.405 1.206 0.828 0.712 0.712]
 [0.    0.    0.    1.405 1.    1.    0.    0.712 0.828 0.712 0.712]
 [1.405 1.405 1.405 0.    0.    1.    0.    0.    0.    0.712 0.712]]


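Look at some elements of the resulting matrix, which we call `m` here (rows are documents, columns are vocabulary terms). Here $tf$ denotes the log-scaled term frequency.

- `m[0][6] = 1.405`. This is because $tf(d_0, 'linda') = 1$ and $idf('linda') = 1.4054651081081644$, so $1 \times 1.405 = 1.405$.
- `m[0][8] = 0.828`. This is because $tf(d_0, 'me') = 1.693$ and $idf('me') = 0.4891743762340093$, so $1.693 \times 0.489 \approx 0.828$.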
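
These two entries can also be checked by hand. The sketch below recomputes them directly from the formulas, using the counts the lab uses (1 for 'linda', 4 for 'me'). The variable names are chosen here for illustration.

```python
from math import log

ndocs = 3

# 'linda' occurs once in document 0, and its count in the lab is 1.
tf_linda = 1.0 + log(1)
idf_linda = 1.0 + log(ndocs / (1 + 1))
w_linda = tf_linda * idf_linda

# 'me' occurs twice in document 0, and its count in the lab is 4.
tf_me = 1.0 + log(2)
idf_me = 1.0 + log(ndocs / (4 + 1))
w_me = tf_me * idf_me

print(round(w_linda, 3), round(w_me, 3))
```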

---
***end***