### TF-IDF : Term Frequency-Inverse Document Frequency

This [video by ritvik math](https://www.youtube.com/watch?v=OymqCnh-APA) is an amazing explanation of TF-IDF. Highly recommended if you need a refresher!

Also this tutorial is very good : https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html

![image.png](attachment:image.png)

TF-IDF measures how important a word or term is to a document within a given corpus of documents. It tries to identify the most significant words in each document.

TF-IDF is a combination of two metrics: Term Frequency (how often a term appears in a document) and Inverse Document Frequency (how rare that term is across the documents in the corpus). Term Frequency alone is not a good measure of the importance of a word in a document, because certain words like "a", "an", "the", "and", "or" appear many times in every document and are not distinctive or important.

This is why we need IDF (inverse document frequency) to counterbalance the term frequency. IDF gives a negligible score to frequently occurring words and a higher score/weight only to the words that are rarer in the corpus.

TF-IDF can be very helpful for transforming text into a meaningful numeric representation (tfidf features), which can then be used for fitting ML algorithms.

### Some Jargon:

1. $t$ : refers to a term or a specific word
2. $d$ : denotes a specific document
3. $D$ : denotes a corpus/collection of multiple documents

4. Term Frequency : how frequently a given term occurs in a specific document.

$$ tf(t,d) = \frac{\text{Number of occurrences of term t in document d}}{\text{Number of total words in document d}} $$

5. Inverse Document Frequency : a measure of how rare or unique the term is across the entire corpus of documents. It is based on how many documents in the corpus contain the term: the fewer documents contain it, the higher the score.

    Rare terms that occur in few documents of the corpus will have a higher IDF score, whereas terms that occur very frequently, like the articles "a", "an", "the" and other commonly occurring grammar words, will have much lower IDF scores because they're not unique and occur abundantly in all the documents.


$$ idf(t,D) = \log\left( \frac{\text{Number of documents in the corpus D}}{\text{Number of documents that contain the term t}} \right) $$


Note: different sources give slightly different formulas for calculating the __idf__ (for example, a different logarithm base or added smoothing terms), but the general idea remains the same.
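To make the note above concrete, here is a small sketch comparing the plain formula with one common variant. The smoothed form below is the one scikit-learn documents as the default for its `TfidfTransformer` (`smooth_idf=True`); the function names `idf_plain` and `idf_smoothed` are just illustrative, and both use the natural logarithm here.

```python
import math

def idf_plain(n_docs, doc_freq):
    """Plain idf: log(N / df), using the natural logarithm."""
    return math.log(n_docs / doc_freq)

def idf_smoothed(n_docs, doc_freq):
    """Smoothed variant: log((1 + N) / (1 + df)) + 1.

    Adding 1 to both counts avoids division by zero for unseen terms,
    and the trailing +1 keeps terms that occur in every document
    from being zeroed out completely.
    """
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Compare the two for a term found in 1, 50, or all 100 of 100 documents.
for doc_freq in (1, 50, 100):
    print(doc_freq, idf_plain(100, doc_freq), idf_smoothed(100, doc_freq))
```

The absolute values differ, but both variants rank rarer terms above more common ones, which is the part that matters.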

6. TF-IDF : is a combination of both the term frequency and the inverse document frequency. 

$$ tfidf(t,d,D) = tf(t,d) \times idf(t,D) $$
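The three definitions above can be sketched in plain Python. This is a minimal illustration, not a production implementation: it assumes each document is already tokenized into a list of lowercase words, it uses a base-10 logarithm, and the helper names and the tiny corpus are made up for the example.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Fraction of the tokens in one document that are `term`."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus_tokens):
    """log10(N / df), where df = number of documents containing `term`."""
    doc_freq = sum(1 for doc in corpus_tokens if term in doc)
    if doc_freq == 0:
        # The plain formula is undefined for a term that appears nowhere.
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log10(len(corpus_tokens) / doc_freq)

def tfidf(term, doc_tokens, corpus_tokens):
    """Product of the term frequency and the inverse document frequency."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Hypothetical toy corpus: three tiny, pre-tokenized documents.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the government passed the bill".split(),
]

for term in ("the", "cat", "government"):
    scores = [round(tfidf(term, doc, corpus), 4) for doc in corpus]
    print(term, scores)
```

Because "the" occurs in every document, its idf is $\log(1) = 0$ and its score vanishes everywhere, while "government" only scores in the one document that contains it.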

For example, suppose you have 100 documents and the word "the" appears several times in every single one of them. You're looking at a particular document $d_1$ that contains 2000 words, 100 of which are "the".

Then the term frequency of the word "the" in $d_1$ will be:
$$ tf(\text{"the"}, d_1) = \frac{100}{2000} = 0.05 $$

In other words, 5% of all the words in $d_1$ are "the".


The inverse document frequency of "the" will be: 
$$ idf(\text{"the"}, D) = \log(\frac{100}{100}) = \log(1) = 0 $$

Multiply the tf and the idf together to get the __tf-idf__ = $0.05 \times 0 = 0$

Now, suppose the word "government" appears 20 times in $d_1$ and does not appear in any other document in the corpus $D$. Then, taking the logarithm in base 10:
$$ tf(\text{"government"}, d_1) = \frac{20}{2000} = 0.01 $$
$$ idf(\text{"government"}, D) = \log_{10}\left(\frac{100}{1}\right) = \log_{10}(100) = 2 $$

The __tfidf__ will be $0.01 \times 2 = 0.02$, so "government" scores well above "the" even though it occurs five times less often in $d_1$.
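As a quick sanity check, the worked example can be recomputed directly from the counts given above. This is only a sketch: the helper `tfidf_from_counts` is a made-up name, and it assumes the logarithm is base 10.

```python
import math

def tfidf_from_counts(count_in_doc, doc_length, n_docs, docs_with_term):
    """Return (tf, idf, tf * idf) from raw counts, using log base 10."""
    tf = count_in_doc / doc_length
    idf = math.log10(n_docs / docs_with_term)
    return tf, idf, tf * idf

# "the": 100 occurrences in the 2000-word d_1, present in all 100 documents.
the_scores = tfidf_from_counts(100, 2000, 100, 100)

# "government": 20 occurrences in d_1, present in d_1 only.
gov_scores = tfidf_from_counts(20, 2000, 100, 1)

print("the:       ", the_scores)
print("government:", gov_scores)
```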