# Term Frequency-Inverse Document Frequency (TF-IDF)

In NLP, an independent text entity is known as a document, and the collection of all documents in the project is known as a corpus. TF-IDF stands for Term Frequency-Inverse Document Frequency. The technique is easiest to understand by studying _TF_ and _IDF_ separately.

_Term-Frequency_ measures how often term *t* appears in a document *d*. In other words, it is the probability of finding term *t* in document *d*. Mathematically:

![](./../assets/embedding/tf.jpg)
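As a minimal sketch of this definition, assuming the common form TF(t, d) = (occurrences of *t* in *d*) / (total terms in *d*) and a naive lowercase whitespace tokenizer. The `term_frequency` helper is hypothetical and not part of any library:

```python
from collections import Counter

def term_frequency(document):
    """Return {term: count / total_terms} for one document (assumed TF definition)."""
    # Naive tokenization: lowercase and split on whitespace (an assumption for illustration)
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tf = term_frequency("the bird is the word")
print(tf["the"])   # "the" is 2 of the 5 tokens
print(tf["bird"])  # "bird" is 1 of the 5 tokens
```

Under this definition the TF values of a document sum to 1, which is why TF can be read as a probability.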

*Inverse-Document-Frequency* measures the inverse of the probability of finding a document that contains term _t_ in the corpus. In other words, it measures how informative term _t_ is: the fewer documents a term appears in, the higher its IDF. Mathematically:

![](./../assets/embedding/idf.jpg)
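A small sketch of IDF, assuming the common textbook form IDF(t) = log(N / df(t)) with the natural logarithm, where N is the number of documents and df(t) is the number of documents containing *t*. The `inverse_document_frequency` helper is hypothetical, and the example corpus has its punctuation removed so that whitespace splitting is enough:

```python
import math

def inverse_document_frequency(corpus):
    """Return {term: log(N / df)} where df is the number of documents containing the term."""
    n_docs = len(corpus)
    doc_freq = {}
    for document in corpus:
        # Use a set so each document counts a term at most once
        for term in set(document.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

corpus = [
    "about the bird the bird bird bird bird",
    "you heard about the bird",
    "the bird is the word",
]
idf = inverse_document_frequency(corpus)
print(idf["bird"])  # appears in every document, so log(3 / 3)
print(idf["word"])  # appears in one document, so log(3 / 1)
```

A term found in every document gets an IDF of zero under this assumed form, while a term found in a single document gets the largest value.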

We can now compute the TF-IDF score of each term in each document. Terms with a higher score are more important to that document. The TF-IDF score is high when both the TF and IDF values are high. So, TF-IDF gives more importance to terms that are:
1. Frequent in the document
2. Rare in the rest of the corpus

Conversely, a term that appears in almost every document (such as "the") has a low IDF and therefore a low score, however often it occurs.

This TF-IDF score is used as the value of each cell of the document-term matrix, just as the word frequency is in Bag-of-Words. The formula below is used to compute the TF-IDF score for each cell:

![](./../assets/embedding/tf-idf.jpg)
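To make the product concrete, here is a hedged sketch that multiplies the two quantities, assuming TF = count / document length and IDF = log(N / df) with the natural logarithm. The `tf_idf` helper is hypothetical, and the corpus has its punctuation removed so that whitespace splitting is enough:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: tf * idf} dict per document (assumed textbook definitions)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

corpus = [
    "about the bird the bird bird bird bird",
    "you heard about the bird",
    "the bird is the word",
]
scores = tf_idf(corpus)
# Terms of the third document, highest score first
for term, score in sorted(scores[2].items(), key=lambda item: -item[1]):
    print(f"{term}: {score:.3f}")
```

Under these assumptions "the" and "bird" score zero in every document because they occur in all three, while "is" and "word" get the highest scores in the third document.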

As with Bag-of-Words, `scikit-learn` is one of the best choices for working with TF-IDF. Its `TfidfVectorizer` class does the work for us. The first step is to install the required package (if it is not installed already):

`pip3 install scikit-learn`

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Three short documents make up the corpus
    corpus = [
        'About the bird, the bird, bird bird bird',
        'You heard about the bird',
        'The bird is the word'
    ]

    # Learn the vocabulary and IDF weights, then build the document-term matrix
    vectorizer = TfidfVectorizer()
    output = vectorizer.fit_transform(corpus)

    # Rows are documents; columns are the vocabulary terms in alphabetical order
    print(output.todense())

Output:

    [[0.23256045 0.90301967 0.         0.         0.36120787 0.
      0.        ]
     [0.42018292 0.32630952 0.55249005 0.         0.32630952 0.
      0.55249005]
     [0.         0.30523155 0.         0.51680194 0.61046311 0.51680194
      0.        ]]
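The values printed by `TfidfVectorizer` differ from a plain TF × IDF product because its documented defaults use raw term counts, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2 normalization of each row. The pure-Python sketch below is an illustration of those defaults, not scikit-learn's actual implementation, and should reproduce the matrix above along with the term behind each column:

```python
import math
import re
from collections import Counter

corpus = [
    'About the bird, the bird, bird bird bird',
    'You heard about the bird',
    'The bird is the word'
]

# Lowercase tokens of two or more word characters (mirrors the default token pattern)
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocabulary = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(corpus)
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    # Raw count times smoothed IDF, then scale the row to unit Euclidean length
    row = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in row))
    matrix.append([value / norm for value in row])

print(vocabulary)
for row in matrix:
    print([round(value, 8) for value in row])
```

The sorted vocabulary gives the column order, so the second column corresponds to "bird", which dominates the first document because it appears there five times.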
