## Term Frequency-Inverse Document Frequency (TF-IDF)

The TF-IDF formula:

$ tf(t,d) = \dfrac{\text{frequency of term } t \text{ in document } d}{\text{total number of terms in } d} $

$ idf(t) = \log{\left(\dfrac{1 + n}{1 + df(t)}\right)} + 1 $

where

- $n$: total number of documents in the corpus
- $df(t)$: number of documents containing term $t$
- $\log$: the natural logarithm, matching the `np.log` calls used in the calculations

Finally,

$ tfidf(t,d) = tf(t,d) \times idf(t) $

**Note:** The IDF formula above is the one sklearn uses with its default setting `smooth_idf=True`. Several variants exist; a common one is:

$ idf(t) = \log{\left(\dfrac{n}{df(t)}\right)} $
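
To make the difference between the two IDF variants concrete, here is a minimal sketch that evaluates both for a two-document corpus. The function names `idf_smooth` and `idf_plain` are illustrative only and are not part of any library.

```python
import math

def idf_smooth(n, df):
    """Smoothed variant: log((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1

def idf_plain(n, df):
    """Plain variant: log(n / df). Undefined when df is 0."""
    return math.log(n / df)

n = 2  # assumed corpus size for this illustration
for df in (1, 2):
    print(f"df={df}: smooth={idf_smooth(n, df):.4f}, plain={idf_plain(n, df):.4f}")
```

With the plain variant, a term that occurs in every document gets an IDF of 0 and drops out of the tf-idf vector entirely; the smoothed variant keeps it with a weight of 1.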

## Example: Manual Calculation of Tf-IDF

In [4]:
# assign documents
d0 = 'I am happy today'

d1 = 'Mohan is happy today'

- d0 contains four terms: 'I', 'am', 'happy' and 'today'.
- Let's calculate the tf-idf of 'I'. It occurs once in d0 and appears in only one of the two documents, so n = 2 and df = 1.

In [5]:
import numpy as np
import math

In [6]:
tf_I_d0 = 1/4  # 'I' occurs once among the four terms of d0

In [7]:
idf_I = np.log((1+2)/(1+1))+1  # n = 2 documents, df('I') = 1

In [8]:
Tfidf_I_d0 = tf_I_d0 * idf_I
Tfidf_I_d0

0.3513662770270411

#### Let's now calculate the tf-idf of 'happy' in d0 as well. It appears in both documents, so df = 2.

In [9]:
tf_idf_happy_d0 = 1/4 * (np.log((1+2)/(1+2))+1)
tf_idf_happy_d0

0.25

#### So the tf-idf values for 'I', 'am', 'happy' and 'today' in d0 are 0.3513662770270411, 0.3513662770270411, 0.25 and 0.25 respectively ('am' matches 'I' because df = 1 for both; 'today' matches 'happy' because df = 2 for both).
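
The per-term calculations above can be repeated for every term in d0 with a short loop. This is a minimal sketch that assumes whitespace tokenization; the helper name `manual_tfidf` is hypothetical.

```python
import math

d0 = 'I am happy today'
d1 = 'Mohan is happy today'

# tokenize by whitespace (an assumption that is adequate for these two sentences)
docs = [d.split() for d in (d0, d1)]
n = len(docs)

def manual_tfidf(term, tokens):
    """tf (relative frequency) times the smoothed idf, without normalization."""
    tf = tokens.count(term) / len(tokens)
    df = sum(1 for doc in docs if term in doc)
    idf = math.log((1 + n) / (1 + df)) + 1
    return tf * idf

scores_d0 = {term: manual_tfidf(term, docs[0]) for term in docs[0]}
for term, score in scores_d0.items():
    print(term, score)
```

Changing `docs[0]` to `docs[1]` gives the corresponding unnormalized values for d1.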

## Calculating Tf-IDF using sklearn 

In [10]:
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
# merge documents into a single corpus
string = [d0, d1]

In [12]:
string

['I am happy today', 'Mohan is happy today']

In [13]:
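# create the vectorizer; this token_pattern keeps one-character tokens such as 'I'
# (the default pattern requires at least two characters); lowercase=False preserves case
tfidf = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b', lowercase=False)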

In [14]:
# get tf-idf values
result = tfidf.fit_transform(string)

In [15]:
# get the mapping from each term to its column index
print('\nWord indexes:')
print(tfidf.vocabulary_)

# in matrix form
print('\ntf-idf values in matrix form:')
print(result.toarray())



Word indexes:
{'I': 0, 'am': 2, 'happy': 3, 'today': 5, 'Mohan': 1, 'is': 4}

tf-idf values in matrix form:
[[0.57615236 0.         0.57615236 0.40993715 0.         0.40993715]
 [0.         0.57615236 0.         0.40993715 0.57615236 0.40993715]]
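
The matrix columns follow the indexes in `vocabulary_`, not the order in which the words appear in the documents. As a rough sketch, the printed output can be relabeled in plain Python; the `vocabulary` and `matrix` literals below are copied from the output above rather than recomputed.

```python
# values copied from the printed output above
vocabulary = {'I': 0, 'am': 2, 'happy': 3, 'today': 5, 'Mohan': 1, 'is': 4}
matrix = [
    [0.57615236, 0.0, 0.57615236, 0.40993715, 0.0, 0.40993715],
    [0.0, 0.57615236, 0.0, 0.40993715, 0.57615236, 0.40993715],
]

# order the terms by their column index
terms = sorted(vocabulary, key=vocabulary.get)

# keep only the non-zero entries of each row, labeled by term
labeled = [
    {term: value for term, value in zip(terms, row) if value > 0}
    for row in matrix
]

for doc_id, scores in enumerate(labeled):
    print(f"d{doc_id}: {scores}")
```

Each row then reads as a term-to-score mapping for one document, which makes it easier to compare against the manual values.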


In [16]:
result.toarray()[0]

array([0.57615236, 0.        , 0.57615236, 0.40993715, 0.        ,
       0.40993715])

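### The tf-idf of 'I' is 0.57615236, which differs from what we calculated manually. Why?
### Because sklearn normalizes each document's tf-idf vector by dividing it by its L2 norm. Let's verify.
$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\|\mathbf{x}\|_2} = \frac{(x_1, x_2, \ldots, x_n)}{\sqrt{x_1^2 + x_2^2 + \ldots + x_n^2}} $

sklearn also uses the raw term count as tf rather than dividing by the document length. That constant factor (1/4 for d0) cancels under L2 normalization, so the manual values can still be used for the check.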

In [17]:
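# unnormalized tf-idf values of 'I', 'am', 'happy' and 'today' in d0
tf_idf_vector_d0 = [0.3513662770270411, 0.3513662770270411, 0.25, 0.25]

# L2 norm: square root of the sum of the squared components
l2_norm_d0 = np.sqrt(2*math.pow(0.3513662770270411,2)+2*math.pow(0.25,2))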

In [18]:
tf_idf_I_normalized = 0.3513662770270411/l2_norm_d0
tf_idf_I_normalized

0.5761523551647353
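
The same check can be extended to the whole d0 vector. The sketch below also illustrates why the choice of tf definition does not affect the normalized result: sklearn uses the raw term count as tf, while the manual calculation divides by the document length, and that constant factor cancels under L2 normalization. The helper `l2_normalize` is hypothetical and written in plain Python.

```python
import math

def l2_normalize(vector):
    """Divide every component by the vector's Euclidean (L2) norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

idf_rare = math.log((1 + 2) / (1 + 1)) + 1    # 'I', 'am': df = 1
idf_common = math.log((1 + 2) / (1 + 2)) + 1  # 'happy', 'today': df = 2

# components ordered as 'I', 'am', 'happy', 'today' (not sklearn's column order)
# tf as relative frequency (1/4 per term), as in the manual calculation
relative = [0.25 * idf_rare, 0.25 * idf_rare, 0.25 * idf_common, 0.25 * idf_common]
# tf as raw count (1 per term)
raw = [idf_rare, idf_rare, idf_common, idf_common]

print(l2_normalize(relative))
print(l2_normalize(raw))
```

Both lines should print the same four values, which agree with the non-zero entries of the first row of the sklearn matrix to the precision shown there.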