### What is TF-IDF?
- TF stands for **Term Frequency**: the ratio of the number of times a particular word appears in a document to the total number of words in that document.

        Term Frequency(TF) = [number of times the word appears in a document / total number of words in the document]
- Term Frequency values range between 0 and 1. The more often a word occurs relative to the length of the document, the closer its value is to 1.
- IDF stands for **Inverse Document Frequency**: the log of the ratio of the total number of documents (data points) in the whole dataset to the number of documents that contain the particular word.

        Inverse Document Frequency(IDF) = [log(Total number of documents / number of documents that contain the word)]
- If a word occurs in many documents and is common across the whole dataset, the ratio approaches 1 and its IDF value approaches 0, so the word carries little weight.
- Finally:

        TF-IDF = Term Frequency(TF) * Inverse Document Frequency(IDF)
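
The three formulas above can be sketched in plain Python. This is a minimal illustration, not how any particular library computes it: the toy corpus and the function names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) are hypothetical, the documents are assumed to be pre-tokenized, and the natural log is assumed since the formula does not fix a base.

```python
import math

# Hypothetical toy corpus, already split into lowercase tokens.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred", "loudly"],
]

def term_frequency(word, doc):
    """Occurrences of `word` in `doc` divided by the document length."""
    return doc.count(word) / len(doc)

def inverse_document_frequency(word, docs):
    """Log of (total documents / documents containing `word`).

    Assumes `word` occurs in at least one document; otherwise the
    division below would fail.
    """
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    """Product of the two quantities above."""
    return term_frequency(word, doc) * inverse_document_frequency(word, docs)

# "the" appears in every document, so its IDF (and TF-IDF) is 0.
print(tf_idf("the", docs[0], docs))
# "dog" appears in only one document, so it gets a higher weight there.
print(tf_idf("dog", docs[1], docs))
```

A word shared by every document gets no weight at all under this plain formula, while a word confined to one document is weighted most heavily.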

***Demo***

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes"
]

In [3]:
# Create the vectorizer, fit it on the corpus, and transform the corpus
v = TfidfVectorizer()  # create the vectorizer
v.fit(corpus)  # learn the vocabulary and the IDF weights
transform_output = v.transform(corpus)  # convert the corpus into a TF-IDF matrix

In [4]:
print(v.vocabulary_)

{'thor': 25, 'eating': 10, 'pizza': 22, 'loki': 17, 'is': 16, 'ironman': 15, 'ate': 7, 'already': 0, 'apple': 5, 'announcing': 4, 'new': 20, 'iphone': 14, 'tomorrow': 26, 'tesla': 24, 'model': 19, 'google': 12, 'pixel': 21, 'microsoft': 18, 'surface': 23, 'amazon': 2, 'eco': 11, 'dot': 9, 'am': 1, 'biryani': 8, 'and': 3, 'you': 27, 'are': 6, 'grapes': 13}


In [6]:
all_feature_names = v.get_feature_names_out()  # all feature names, in index order

for word in all_feature_names:
    indx = v.vocabulary_.get(word)  # column index of this word
    idf_score = v.idf_[indx]  # IDF weight learned during fit
    print(f"Word: {word} - Index: {indx} - IDF: {idf_score}")

Word: already - Index: 0 - IDF: 2.386294361119891
Word: am - Index: 1 - IDF: 2.386294361119891
Word: amazon - Index: 2 - IDF: 2.386294361119891
Word: and - Index: 3 - IDF: 2.386294361119891
Word: announcing - Index: 4 - IDF: 1.2876820724517808
Word: apple - Index: 5 - IDF: 2.386294361119891
Word: are - Index: 6 - IDF: 2.386294361119891
Word: ate - Index: 7 - IDF: 2.386294361119891
Word: biryani - Index: 8 - IDF: 2.386294361119891
Word: dot - Index: 9 - IDF: 2.386294361119891
Word: eating - Index: 10 - IDF: 1.9808292530117262
Word: eco - Index: 11 - IDF: 2.386294361119891
Word: google - Index: 12 - IDF: 2.386294361119891
Word: grapes - Index: 13 - IDF: 2.386294361119891
Word: iphone - Index: 14 - IDF: 2.386294361119891
Word: ironman - Index: 15 - IDF: 2.386294361119891
Word: is - Index: 16 - IDF: 1.1335313926245225
Word: loki - Index: 17 - IDF: 2.386294361119891
Word: microsoft - Index: 18 - IDF: 2.386294361119891
Word: model - Index: 19 - IDF: 2.386294361119891
Word: new - Index: 20 - 
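
The IDF scores printed above do not match the plain `log(N / df)` formula from the definition. They are consistent with the smoothed variant that scikit-learn applies by default (`smooth_idf=True`), namely `ln((1 + N) / (1 + df)) + 1`. The sketch below recomputes a few of them by hand under that assumption. The `tokenize` helper and the `smoothed_idf` function are hypothetical names, and the regex is only an approximation of the vectorizer's default token pattern (lowercased tokens of two or more word characters).

```python
import math
import re

corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes",
]

def tokenize(text):
    # Approximation of the default tokenization: lowercase,
    # keep runs of at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

n_docs = len(corpus)

# Document frequency: in how many documents each token occurs.
doc_freq = {}
for text in corpus:
    for token in set(tokenize(text)):
        doc_freq[token] = doc_freq.get(token, 0) + 1

def smoothed_idf(word):
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

for word in ["already", "announcing", "eating", "is"]:
    print(f"Word: {word} - DF: {doc_freq[word]} - IDF: {smoothed_idf(word)}")
```

The added 1 in the numerator and denominator avoids division by zero, and the trailing `+ 1` keeps words that occur in every document from being zeroed out entirely.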

In [8]:
print(transform_output.toarray())  # print the TF-IDF matrix

[[0.24266547 0.         0.         0.         0.         0.
  0.         0.24266547 0.         0.         0.40286636 0.
  0.         0.         0.         0.24266547 0.11527033 0.24266547
  0.         0.         0.         0.         0.72799642 0.
  0.         0.24266547 0.         0.        ]
 [0.         0.         0.         0.         0.30652086 0.5680354
  0.         0.         0.         0.         0.         0.
  0.         0.         0.5680354  0.         0.26982671 0.
  0.         0.         0.30652086 0.         0.         0.
  0.         0.         0.30652086 0.        ]
 [0.         0.         0.         0.         0.30652086 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.26982671 0.
  0.         0.5680354  0.30652086 0.         0.         0.
  0.5680354  0.         0.30652086 0.        ]
 [0.         0.         0.         0.         0.30652086 0.
  0.         0.         0.         0.         0.         0.
  0.