# TF-IDF
AI Solution Architect | CTO and Co-founder at Treeleaf/Anydone

## <a name="count-vectorization"></a> One Hot Encoding

In [1]:
def one_hot_encoding(tokens):
    # dict.fromkeys drops duplicates while keeping first-seen order, so the
    # vector positions are reproducible across runs (a set has no fixed order).
    unique_tokens = list(dict.fromkeys(tokens))
    return {token: [1 if token == t else 0 for t in unique_tokens] for token in unique_tokens}

tokens = ["natural", "language", "processing"]
one_hot = one_hot_encoding(tokens)
print("One-Hot Encoding:", one_hot)

One-Hot Encoding: {'natural': [1, 0, 0], 'language': [0, 1, 0], 'processing': [0, 0, 1]}


## TF-IDF Vectorization

Term frequency (TF): Term frequency is the ratio of the number of times a
word occurs in a sentence to the length of that sentence.

TF captures the importance of a word in a way that is comparable across
documents of different lengths. For example, a word that occurs 3 times in
a 10-word sentence is not equivalent to the same word occurring 3 times in
a 100-word sentence. It should get more importance in the first scenario,
and dividing by the sentence length is what achieves that.
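
To make the 10-word versus 100-word comparison concrete, here is a minimal sketch of the TF ratio. The helper `term_frequency`, the whitespace tokenization, and the two sample sentences are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def term_frequency(sentence):
    """Return {word: count / total_words} using simple whitespace tokenization."""
    words = sentence.lower().split()
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}

# Hypothetical sentences: "fox" occurs 3 times in both.
short_doc = "fox fox fox jumped over the lazy dog again today"  # 10 words
long_doc = short_doc + " filler" * 90                            # 100 words

print("TF of 'fox' in the 10-word sentence: ", term_frequency(short_doc)["fox"])
print("TF of 'fox' in the 100-word sentence:", term_frequency(long_doc)["fox"])
```

The raw count is the same in both sentences, but the ratio is ten times larger in the shorter one.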

Inverse Document Frequency (IDF): The IDF of a word is the log of
the ratio of the total number of rows (documents) to the number of rows
in which that word is present.

IDF = log(N/n), where N is the total number of rows and n is the
number of rows in which the word is present.

IDF measures the rareness of a term. Words like “a” and “the” show
up in nearly all the documents of the corpus, whereas rare words appear
in only a few. A word that appears in almost all documents is of little
use for classification or information retrieval, because it does not help
tell one document from another. IDF addresses this by giving such words a
low weight.
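
As a rough illustration of IDF = log(N/n), the sketch below computes it for a tiny corpus. The function name `inverse_document_frequency`, the whitespace tokenization, the sample documents, and the choice of the natural logarithm are assumptions made for this example (the text does not fix a log base).

```python
import math

def inverse_document_frequency(documents):
    """Return {word: log(N / n)}, where n is the number of documents containing the word."""
    N = len(documents)
    tokenized = [set(doc.lower().split()) for doc in documents]
    vocabulary = set().union(*tokenized)
    return {
        word: math.log(N / sum(word in doc for doc in tokenized))
        for word in sorted(vocabulary)
    }

# Hypothetical corpus: "the" is in every document, "dog" in only one.
docs = ["the quick brown fox", "the dog", "the fox"]
idf = inverse_document_frequency(docs)
for word, value in idf.items():
    print(f"{word}: {value:.4f}")
```

A word present in every document gets log(N/N) = 0, while a word present in a single document gets the largest value, log(N).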

TF-IDF is the product of TF and IDF, so both drawbacks are addressed:
the dependence on document length and the dominance of very common words.
This makes predictions and information retrieval more relevant.
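
The product can be sketched directly from the two definitions above. This is a minimal, unsmoothed version under stated assumptions: the helper `tf_idf`, whitespace tokenization, natural logarithm, and sample corpus are all illustrative and not taken from the notebook.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: TF * IDF} dict per document, with IDF = log(N / n)."""
    tokenized = [doc.lower().split() for doc in documents]
    N = len(tokenized)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        counts = Counter(words)
        scores.append({
            word: (count / len(words)) * math.log(N / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

docs = ["the quick brown fox", "the dog", "the fox"]
scores = tf_idf(docs)
for doc, row in zip(docs, scores):
    print(doc, "->", {word: round(value, 4) for word, value in row.items()})
```

With this plain formula, "the" scores 0 in every document because its IDF is 0, and the word that is unique to a document carries that document's highest weight.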

In [4]:
import numpy as np
np.set_printoptions(linewidth=120)

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Step 1: Prepare the text data
text = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog.",
    "The fox"
]

# Step 2: Create TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Step 3: Fit the vectorizer to the text (build vocabulary)
vectorizer.fit(text)

# Step 4: Print vocabulary
print("Vocabulary:")
print(vectorizer.vocabulary_)

# Step 5: Print IDF values
print("IDF Values:")
print(vectorizer.idf_)

# Step 6: Transform the text to TF-IDF features
tfidf_matrix = vectorizer.transform(text)

# Step 7: Print the TF-IDF matrix
print("TF-IDF Matrix:")
print(tfidf_matrix.toarray())

# Step 8: Get feature names to understand the matrix
print("Feature Names:")
print(vectorizer.get_feature_names_out())

Vocabulary:
{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
IDF Values:
[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]
TF-IDF Matrix:
[[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646
  0.36388646 0.42983441]
 [0.         0.78980693 0.         0.         0.         0.
  0.         0.61335554]
 [0.         0.         0.78980693 0.         0.         0.
  0.         0.61335554]]
Feature Names:
['brown' 'dog' 'fox' 'jumped' 'lazy' 'over' 'quick' 'the']
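
The IDF values above do not match plain log(N/n); for instance, "the" appears in all three documents yet has an IDF of 1 rather than 0. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` uses IDF = ln((1 + N) / (1 + n)) + 1, multiplies it by the raw term count, and then scales each row to unit Euclidean length. The sketch below redoes that arithmetic in plain Python; the `counts` table was written out by hand for this example from the three sentences.

```python
import math

# Raw term counts per document, columns ordered as in Feature Names:
# ['brown', 'dog', 'fox', 'jumped', 'lazy', 'over', 'quick', 'the']
counts = [
    [1, 1, 1, 1, 1, 1, 1, 2],  # "The quick brown fox jumped over the lazy dog."
    [0, 1, 0, 0, 0, 0, 0, 1],  # "The dog."
    [0, 0, 1, 0, 0, 0, 0, 1],  # "The fox"
]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents containing each term.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed IDF, as used by scikit-learn when smooth_idf=True.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

# Weight the raw counts by IDF, then L2-normalize each row.
tfidf = []
for row in counts:
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in weighted))
    tfidf.append([value / norm for value in weighted])

print("IDF:", [round(value, 8) for value in idf])
for row in tfidf:
    print([round(value, 8) for value in row])
```

The printed numbers should agree with the `IDF Values` and `TF-IDF Matrix` shown in the output above, up to rounding.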
