# Vectorization Techniques [2nd June 2025]:

- Vectorization converts textual data into machine-understandable numerical data. The converted (encoded) data are known as vectors.
- Methods covered: TF-IDF, Count Vectorizer, BERT vectorization.
- It is a part of NLP (Natural Language Processing).

## 1. TF-IDF Vectorization:

- TF-IDF stands for Term Frequency - Inverse Document Frequency.
- TF *(w, d)* = (occurrences of word *w* in document *d*) / (total number of words in document *d*)
- IDF *(w, D)* = ln((total number of documents *N* in corpus *D*) / (number of documents containing *w*))

- TF-IDF *(w, d, D)* = TF *(w, d)* × IDF *(w, D)*
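
The formulas above can be checked by hand with a few lines of plain Python. This is only a minimal sketch: the toy corpus `docs` and the helper names `tf`, `idf` and `tf_idf` are made up for illustration, and the documents are assumed to be already tokenized and lower-cased.

```python
import math

# Hypothetical toy corpus: each document is a list of lower-cased tokens.
docs = [
    ["ai", "is", "the", "latest", "trend"],
    ["kirtan", "is", "an", "ai", "enthusiast"],
    ["technology", "is", "growing"],
]

def tf(word, doc):
    """Occurrences of `word` in `doc` divided by the total number of words in `doc`."""
    return doc.count(word) / len(doc)

def idf(word, corpus):
    """ln(N / number of documents containing `word`).

    Assumes `word` occurs in at least one document (otherwise this divides by zero).
    """
    n_containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(word, doc, corpus):
    """TF-IDF score of `word` for one document of the corpus."""
    return tf(word, doc) * idf(word, corpus)

# "is" occurs in every document, so IDF = ln(3/3) = 0 and the score vanishes.
print(tf_idf("is", docs[0], docs))
# "trend" occurs in one document only, so the score is (1/5) * ln(3).
print(tf_idf("trend", docs[0], docs))
```

Note that `TfidfVectorizer` with its default settings will not reproduce these numbers: it uses raw counts for TF, a smoothed IDF of the form ln((1 + N) / (1 + df)) + 1, and L2-normalizes each row.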


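Disadv.:
- Cannot capture semantics (synonyms and word meaning are ignored).
- Can be computationally expensive on large corpora, since each vector is as wide as the vocabulary.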

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd


corpus = [
    "I am Kirtan Ghelani",
    "Kirtan Ghelani is an AI Enthusiast",
    "AI is the latest trend in the technology",
    "Today's growing technology is like a two-sided sword."
]

vectorizer = TfidfVectorizer()

# This is basically transforming the textual data into numerical data so we use .fit_transform()
X = vectorizer.fit_transform(corpus)
print(f"Feature Names: {vectorizer.get_feature_names_out()}")

print(X.shape)

Feature Names: ['ai' 'am' 'an' 'enthusiast' 'ghelani' 'growing' 'in' 'is' 'kirtan'
 'latest' 'like' 'sided' 'sword' 'technology' 'the' 'today' 'trend' 'two']
(4, 18)


## 2. Count Vectorizer:

- Simple bag-of-words implementation: it builds a vocabulary from the corpus and counts how often each token occurs in each document.
- It is effective where the frequency of a word is a key feature.

Disadv.:
- ignores context & order of words...
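
The loss of word order can be seen with a tiny bag-of-words counter. This is a rough sketch using only the standard library, not `CountVectorizer` itself: `tokenize` and `count_vectors` are hypothetical helpers, and the regular expression only approximates the default tokenization (lower-casing, tokens of two or more word characters).

```python
import re
from collections import Counter

def tokenize(text):
    # Lower-case the text and keep tokens of at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def count_vectors(corpus):
    """Return a sorted vocabulary and one count vector per document."""
    vocab = sorted({tok for doc in corpus for tok in tokenize(doc)})
    rows = []
    for doc in corpus:
        counts = Counter(tokenize(doc))
        rows.append([counts[tok] for tok in vocab])
    return vocab, rows

# Two sentences with opposite meanings but the same words.
vocab, rows = count_vectors(["the dog bit the man", "the man bit the dog"])
print(vocab)               # ['bit', 'dog', 'man', 'the']
print(rows[0] == rows[1])  # True
```

Both sentences map to the same vector, so any model built on these counts cannot tell them apart.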


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(f"Features: {vectorizer.get_feature_names_out()}")
X.toarray()

Features: ['ai' 'am' 'an' 'enthusiast' 'ghelani' 'growing' 'in' 'is' 'kirtan'
 'latest' 'like' 'sided' 'sword' 'technology' 'the' 'today' 'trend' 'two']


array([[0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 2, 0, 1, 0],
       [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1]],
      dtype=int64)

## 3. BERT Vectorization:

- Bidirectional Encoder Representations from Transformers.
- Transformer-based model (it pretrains bidirectional representations by jointly conditioning on both left & right context in all layers).
- Can be fine-tuned for specific tasks.


Disadv.:
- Large model size.
- High computational requirements for training & inference.

In [43]:
import random
import torch
from transformers import AutoTokenizer, AutoModel

In [4]:
random_seed = 32
random.seed(random_seed)

# Seed PyTorch's random number generators (CPU, and all GPUs if available):
torch.manual_seed(random_seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(random_seed)

In [66]:
# Load the pretrained BERT tokenizer and model:

model_id = 'bert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

In [67]:
text_data = f"""Hello, my name is Kirtan Ghelani. 
I am 22 year old AI-ML Developer working at SculptSoft Private Limited."""

In [71]:
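# Tokenize the raw text_data into input IDs (returned as PyTorch tensors):
encoded_input = tokenizer(
    text_data,
    return_tensors='pt',
)


# The encoded input can be fed to the model like this (no gradients are needed for inference):
with torch.no_grad():
    encoded_output = model(**encoded_input)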

print(f""">>>>>>>> Shape: {encoded_input['input_ids'].shape}
encoded input (Tokens):
{encoded_input['input_ids']}
{'-'*85}
""")

>>>>>>>> Shape: torch.Size([1, 32])
encoded input (Tokens):
tensor([[  101,  7592,  1010,  2026,  2171,  2003, 11382, 13320,  2078,  1043,
         16001,  7088,  1012,  1045,  2572,  2570,  2095,  2214,  9932,  1011,
         19875,  9722,  2551,  2012,  8040,  5313, 22798, 15794,  2797,  3132,
          1012,   102]])
-------------------------------------------------------------------------------------



In [72]:
# Decode the token IDs back to text [this decodes the tokenizer output, not the model output]:
decoded_txt = tokenizer.decode(encoded_input['input_ids'][0])
decoded_txt

'[CLS] hello, my name is kirtan ghelani. i am 22 year old ai - ml developer working at sculptsoft private limited. [SEP]'
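
The cells above stop at token IDs and never turn the model output into a single vector for the text. One common way to do that, though not the only one, is to average the token embeddings while ignoring padding (mean pooling). The sketch below assumes that approach; `mean_pool` is a hypothetical helper, and the small dummy tensors stand in for a real `last_hidden_state` and `attention_mask` so the example runs without downloading the model.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over the sequence, ignoring padded positions."""
    # (batch, seq_len) -> (batch, seq_len, 1), so the mask broadcasts over the hidden size.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1)            # (batch, 1), guards against division by zero
    return summed / counts

# Dummy stand-ins: one sequence of 3 tokens with hidden size 2; the last token is padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

print(mean_pool(hidden, mask))  # tensor([[2., 3.]])
```

With the objects from the cells above, `mean_pool(encoded_output.last_hidden_state, encoded_input['attention_mask'])` should give a tensor of shape `(1, 768)` for `bert-base-uncased`. Taking the `[CLS]` position, `encoded_output.last_hidden_state[:, 0]`, is another frequently used option.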