# Day 2 — Tokenization (and what actually goes into the model)

**Goal:** Understand what happens *before* an LLM produces text: tokenization → IDs → attention masks → back to text.

**You will learn:**
- What a *tokenizer* does (BERT-style WordPiece vs GPT-2 BPE).
- How strings become **token IDs** and how IDs decode back to text.
- What **special tokens** (`[CLS]`, `[SEP]`) and **attention masks** are.
- (Next steps) How token IDs become **embeddings** the model can understand.

In [None]:
# Install/Import minimal deps
!pip -q install transformers

from transformers import AutoTokenizer, AutoModel
import torch

We import PyTorch (`torch`) because Hugging Face models are implemented using it by default.
This allows us to:

- Run computations on CPU/GPU
- Manage tensors and model weights
- Control device placement (e.g., `cuda` for GPU)

Let’s quickly test whether GPU support is available in our Colab environment.

In [None]:
# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

## A. Tokenization with BERT (WordPiece)

- BERT uses **WordPiece**: splits uncommon words into subwords with the `##` prefix.
- The tokenizer turns text → **token strings** → **token IDs**.
- We also get an **attention mask**: 1 = real token, 0 = padding (for batching).
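
To make the `##` splitting concrete, here is a minimal sketch of greedy longest-match-first subword splitting, the idea behind WordPiece at tokenization time. The vocabulary `toy_vocab` and the helper `wordpiece_tokenize` are hypothetical names made up for illustration; the real tokenizer also normalizes text and splits punctuation first, which this sketch skips.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (the real one has roughly 30,000 entries).
toy_vocab = {"token", "##ization", "##izer", "play", "##ing", "un", "##believ", "##able"}

for w in ["tokenization", "playing", "unbelievable", "xyz"]:
    print(w, "->", wordpiece_tokenize(w, toy_vocab))
```

A word that is in the vocabulary as a whole would come back as a single piece; only words the vocabulary lacks are broken up.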

In [None]:
text = "Transformers are amazing! Neural networks learn patterns from data."

tok_bert = AutoTokenizer.from_pretrained("bert-base-uncased")

enc = tok_bert(
    text,
    add_special_tokens=True,     # add [CLS] ... [SEP]
    return_attention_mask=True,
    return_tensors="pt"
)

print("Input text:\n", text, "\n")
print("Token strings:\n", tok_bert.convert_ids_to_tokens(enc["input_ids"][0]))
print("\nToken IDs:\n", enc["input_ids"][0].tolist())
print("\nAttention mask:\n", enc["attention_mask"][0].tolist())

## B. Decoding back to text & special tokens

- **[CLS]**: classification token at the start (BERT uses it as a sentence summary).
- **[SEP]**: separator token (end of sentence; also used between sentence pairs).
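- **Decoding** reconstructs text from token IDs. The result is not always identical to the input (spacing and punctuation can shift, and `bert-base-uncased` lowercases everything), but the content is equivalent.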
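
As a rough illustration of what decoding has to do, the sketch below glues `##` pieces back onto the previous word and drops special tokens. The helper `decode_wordpieces` and the token list are hypothetical; the library's own decoder additionally cleans up spacing around punctuation, which this naive version does not.

```python
def decode_wordpieces(tokens, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Join WordPiece token strings back into text, skipping special tokens."""
    words = []
    for tok in tokens:
        if tok in special_tokens:
            continue
        if tok.startswith("##") and words:
            words[-1] += tok[2:]   # glue a continuation piece onto the previous word
        else:
            words.append(tok)
    return " ".join(words)

# Hypothetical token strings, in the style BERT's tokenizer produces.
toy_tokens = ["[CLS]", "token", "##ization", "is", "fun", "!", "[SEP]"]
print(decode_wordpieces(toy_tokens))
```

Note the stray space before `!` in the result: this is the kind of spacing difference the bullet above refers to.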

We’ll also map tokens ↔ IDs to see the vocabulary lookup.


In [None]:
ids = enc["input_ids"][0]
decoded = tok_bert.decode(ids, skip_special_tokens=True)

print("Decoded text (no specials):\n", decoded, "\n")

# Token ↔ ID table (compact)
tokens = tok_bert.convert_ids_to_tokens(ids)
pairs = list(zip(tokens, ids.tolist()))
for t, i in pairs[:20]:
    print(f"{t:>15}  -> {i}")
if len(pairs) > 20:
    print("... (truncated)")

## C. Compare with GPT-2 (Byte-Pair Encoding, no pad token)

- GPT-2 uses **byte-level BPE**: instead of a `##` continuation prefix, tokens that start a new word carry a `Ġ` marker that stands for the preceding space.
- It has **no PAD token** by default (so we usually don’t batch with padding unless we set one).
- Notice how the token strings differ and there are **no `[CLS]`/`[SEP]`** special tokens by default.
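
The core of BPE at tokenization time is repeatedly applying learned merge rules. The sketch below shows that loop on characters with a hypothetical merge list (`toy_merges`); GPT-2 itself works on bytes and has tens of thousands of merges, so treat this as an illustration of the mechanism only.

```python
def apply_bpe(word, merges):
    """Apply ranked BPE merge rules to a word, starting from single characters."""
    symbols = list(word)
    ranks = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair whose merge rule was learned earliest (lowest rank).
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge rule is left
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list, ordered from earliest-learned to latest.
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

print(apply_bpe("lower", toy_merges))
print(apply_bpe("lowest", toy_merges))
```

With these merges, a word covered by the rules collapses into one token, while an uncovered suffix stays split into smaller pieces.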

In [None]:
tok_gpt2 = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 has no pad token by default; set one if needed for batching later
tok_gpt2.pad_token = tok_gpt2.eos_token

enc_gpt2 = tok_gpt2(
    text,
    add_special_tokens=True,      # no effect here: GPT-2's tokenizer adds no BOS/EOS tokens by default
    return_attention_mask=True,
    return_tensors="pt",
    padding=True
)

print("GPT-2 token strings:\n", tok_gpt2.convert_ids_to_tokens(enc_gpt2["input_ids"][0]))
print("\nGPT-2 token IDs:\n", enc_gpt2["input_ids"][0].tolist())
print("\nGPT-2 attention mask:\n", enc_gpt2["attention_mask"][0].tolist())

decoded_gpt2 = tok_gpt2.decode(enc_gpt2["input_ids"][0], skip_special_tokens=True)
print("\nGPT-2 decoded:\n", decoded_gpt2)

### Quick observations

- **BERT (WordPiece):** You’ll see `[CLS]` at start and `[SEP]` at end; uncommon words break into `##subwords`.
- **GPT-2 (BPE):** No `[CLS]/[SEP]`, different subword splits; by default no PAD token.
- **Attention mask:** All 1s here (no real padding needed for a single sentence), but essential when batching sequences of different lengths.
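
To see why the mask matters once sequences differ in length, here is a minimal sketch of right-padding a batch and building the matching attention masks. The helper `pad_batch` and the token IDs are illustrative assumptions; in practice the tokenizer does this for you when called with `padding=True`.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad ID lists to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

# Illustrative token IDs for two sequences of different lengths.
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
padded_ids, padded_mask = pad_batch(batch)
print(padded_ids)   # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(padded_mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```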


## D. Hugging Face `pipeline`

The Hugging Face `pipeline` is a high-level API that makes it easy to use pre-trained models for common tasks without worrying about low-level details.  

Some example tasks include:  
- **Sentiment Analysis** → Detect positive/negative tone  
- **Text Generation** → Generate new text based on a prompt  
- **Translation** → Translate text between languages  
- **Question Answering** → Extract answers from a passage  

We’ll start with a **sentiment analysis** pipeline as our first example.

In [None]:
from transformers import pipeline

# Load sentiment-analysis pipeline
sentiment_analyzer = pipeline("sentiment-analysis")

# Test on a sample sentence
result = sentiment_analyzer("I am excited to learn about Hugging Face and LLMs!")
print(result)

What happens when we run a pipeline for the first time?

- When we call `pipeline("sentiment-analysis")`, Hugging Face automatically:  
  1. **Downloads** the pre-trained model (if not already cached).  
  2. **Loads** the model into memory using PyTorch (`torch` comes preinstalled in Colab; we only installed `transformers`).  
  3. **Runs inference** on the input text and gives structured output.  

- The model is stored in a local cache under `~/.cache/huggingface/` (recent `transformers` versions use the `hub/` subdirectory; older ones used `transformers/`),  
  so the next time you run it, it won’t need to download again.


### Running Sentiment Analysis on Multiple Sentences

We can pass a list of sentences to the `pipeline` instead of a single string.  
This allows us to analyze multiple inputs in one go.  

This is useful when you want to quickly check several text samples without looping manually.

In [None]:
texts = [
    "I love learning new things about AI!",
    "This tutorial is a bit challenging, but I’m getting there.",
    "I really don’t enjoy debugging errors late at night..."
]

results = sentiment_analyzer(texts)

# zip() pairs each input text with its corresponding result
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Label: {result['label']}, Score: {result['score']:.4f}\n")

### Using a Specific Pretrained Model

By default, the pipeline uses a standard pretrained model chosen by Hugging Face.  
However, Hugging Face hosts thousands of models, and we can specify which one to use by passing the model name.  

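For example, we'll use **`distilbert-base-uncased-finetuned-sst-2-english`**,  
a lightweight model fine-tuned specifically for sentiment analysis. (At the time of writing this is also the pipeline's default for this task, so naming it mainly pins the choice.)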

In [None]:
custom_classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

result = custom_classifier("Hugging Face makes NLP super easy!")
print(result)

Why Use Custom Models?

- **Flexibility**: Different tasks may need different models. Hugging Face provides thousands of them.  
- **Performance Trade-off**:  
  - Some models are smaller (faster but slightly less accurate).  
  - Others are larger (more accurate but slower).  
- **Domain Specificity**: You might need models fine-tuned for finance, healthcare, legal text, or even social media.  
- **Control**: Choosing a model explicitly ensures reproducibility instead of depending on Hugging Face's defaults (which might change).  


### Batch Processing with Pipelines

- The Hugging Face pipeline can take a **list of inputs** instead of just one string.  
- This is convenient when analyzing many texts together; for actual batched inference, the pipeline call also accepts a `batch_size` argument.  
- Each input is processed independently, and the results are returned in the same order as the inputs.  

In practice:  
- For **real-world applications**, you’ll often deal with a dataset or a collection of sentences.  
- Batch processing allows you to send them together instead of calling the model one-by-one.

In [None]:
# Using the sentiment analysis model on multiple inputs (batch processing)

results = custom_classifier([
    "I love how simple Hugging Face makes NLP!",
    "This is too complicated, I don't like it.",
    "The course is okay, but could be better."
])

for idx, res in enumerate(results, 1):
    print(f"Sentence {idx}: {res}")

### Controlling Model Outputs with `return_all_scores`

- By default, the pipeline only returns the **most likely label** with its score.  
- If we set `return_all_scores=True`, the pipeline will return the **probability for every possible label**.  
- This gives us a more detailed view of how confident the model is across different classes.  

For example:  
- Instead of only seeing the top label (e.g., `POSITIVE`), we also see the probability for the other label (`NEGATIVE`). This SST-2 model has only these two classes, so there is no `Neutral` label.  
- This is useful for applications where you care about **model confidence** and not just the top prediction.
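
The per-label probabilities come from normalizing the model's raw scores (logits). The sketch below assumes a softmax over the logits, which is typical for a single-label classification head; the logit values are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a two-label classifier.
toy_labels = ["NEGATIVE", "POSITIVE"]
toy_logits = [0.3, 1.1]

probs = softmax(toy_logits)
all_scores = sorted(
    ({"label": l, "score": p} for l, p in zip(toy_labels, probs)),
    key=lambda d: d["score"],
    reverse=True,
)
print(all_scores)      # every label with its probability
print(all_scores[0])   # the top prediction only
```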


In [None]:
# Getting detailed scores for all possible labels
# (return_all_scores is deprecated and may emit a warning; see the top_k=None version below)

detailed_results = custom_classifier(
    "I had an average experience, nothing too exciting.",
    return_all_scores=True
)

print(detailed_results)

Getting Probabilities for All Labels (`top_k=None`)

- Earlier, we used `return_all_scores=True`, but this is now deprecated.  
- The new way is to set `top_k=None`.  
- This returns the probability scores for **all possible labels** instead of just the top prediction.  
- Useful when we want to analyze model confidence across all categories, not just the winner.  


In [None]:
# Getting detailed scores for all possible labels using top_k=None
detailed_results = custom_classifier(
    "I had an average experience, nothing too exciting.",
    top_k=None
)

print(detailed_results)

### Formatting Sentiment Analysis Results

- The model output contains both a **label** (`POSITIVE` or `NEGATIVE` for this model) and a **score** (confidence level).  
- By formatting the results, we can make them easier to read.  
- This step is useful when presenting results to end-users or when logging model predictions in a real application.  


In [None]:
# Extracting and formatting results for better readability
texts = [
    "I love Hugging Face, it makes NLP fun!",
    "This is too confusing and frustrating.",
    "The movie was decent, not great but not bad."
]

results = custom_classifier(texts)

for text, res in zip(texts, results):
    label = res['label']
    score = round(res['score'], 4)
    print(f"Text: {text}\n → Sentiment: {label} (Confidence: {score})\n")

### Zero-Shot Classification

- Unlike sentiment analysis, **Zero-Shot Classification** allows us to classify text into **any set of categories** we choose, without needing a custom-trained model.  
- This works using **Natural Language Inference (NLI)**:  
  - Each candidate label is turned into a hypothesis sentence, and the model scores how strongly the text *entails* that hypothesis (as opposed to *contradicting* it).  
- In our example, we provide the labels: `education`, `politics`, `sports`, `technology`.  
- The model then assigns probabilities to each label, showing how well the text fits each category.  


In [None]:
# Load zero-shot classification pipeline
zero_shot_classifier = pipeline("zero-shot-classification")

# Example text
text = "I have been reading a lot about deep learning and transformers lately."

# Candidate labels (categories we want to test against)
candidate_labels = ["education", "politics", "sports", "technology"]

# Perform classification
result = zero_shot_classifier(text, candidate_labels)

print(result)

Exploring Zero-Shot Classification Outputs in Detail

When we use the `zero-shot-classification` pipeline, Hugging Face returns a dictionary containing:
- **`labels`** → the candidate labels we provided (in descending order of probability).
- **`scores`** → the probability assigned to each label.
- **`sequence`** → the original input sentence.

This means the model doesn't just give one label, but instead ranks all the labels we asked it to compare against.

By inspecting the output, we can:
- See which labels the model thinks are most relevant.
- Understand the relative confidence (scores) for each label.
- Verify whether the label order matches our expectations.


In [None]:
# Deeper look into zero-shot classification outputs
detailed_result = zero_shot_classifier(
    "I enjoy solving challenging machine learning problems.",
    candidate_labels=["education", "sports", "technology", "entertainment", "work"],
)

print(detailed_result)  # Raw output

# Pretty printing
for label, score in zip(detailed_result['labels'], detailed_result['scores']):
    print(f"{label:<15} --> {score:.4f}")

In [None]:
import matplotlib.pyplot as plt

# Extract labels and scores
labels = detailed_result["labels"]
scores = detailed_result["scores"]

# Plot
plt.figure(figsize=(8, 5))
plt.barh(labels, scores, color="skyblue")
plt.xlabel("Confidence Score")
plt.title("Zero-Shot Classification Results")
plt.gca().invert_yaxis()  # Highest score on top
plt.show()

### Conclusion on Hugging Face Pipelines

We have explored Hugging Face pipelines for **sentiment analysis** and **zero-shot classification**.  
Similarly, pipelines can be used for a wide range of NLP tasks such as:

- **Question Answering**  
- **Named Entity Recognition (NER)**  
- **Text Summarization**  
- **Translation**  
- **Text Generation**  

The key point is that **pipelines abstract away the complexities** of how transformers work internally.  
Under the hood, every pipeline is performing the following steps:

1. **Tokenization** – Converting input text into tokens and token IDs.  
2. **Model Forward Pass** – Feeding token IDs into a pre-trained model.  
3. **Post-Processing** – Converting model outputs into human-readable results.  

This makes pipelines extremely convenient for quick experimentation, but to truly understand how transformers process text, we need to go one level deeper.  


## E. From Tokens to Model Inputs

When we call a tokenizer, we don’t just want raw tokens.  
For models, the tokenizer prepares a full set of inputs:

- **`input_ids`** → numerical IDs for each token  
- **`attention_mask`** → tells the model which tokens are real and which are padding  
- **`token_type_ids`** (optional, used in models like BERT for sentence pairs)
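
For a sentence pair, the three inputs line up position by position. The sketch below assembles them by hand for the `[CLS] A [SEP] B [SEP]` layout; the helper `build_pair_inputs` and the token IDs are hypothetical, and in practice the tokenizer produces this when given two texts.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Assemble BERT-style inputs for a sentence pair: [CLS] A [SEP] B [SEP]."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A, and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)  # no padding here, so every position is real
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

# Hypothetical token IDs for two short sentences.
inputs_pair = build_pair_inputs([2129, 2024, 2017], [1045, 2572, 2986])
for name, values in inputs_pair.items():
    print(f"{name:>15}: {values}")
```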

Let’s tokenize our sentence into model-ready inputs.


In [None]:
# Load the tokenizer for BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Prepare model-ready inputs
inputs = tokenizer(
    "Hugging Face makes NLP simple and powerful!",
    return_tensors="pt"   # return PyTorch tensors
)

print(inputs)


### Passing Tokenized Inputs to the Model

Now that we have tokenized inputs, we can pass them into a pretrained model.  
For this, we’ll use **BERT base uncased** (`bert-base-uncased`).  

The model outputs:
- **`last_hidden_state`** → embeddings for each token in the input  
- **`pooler_output`** → a single embedding for the whole sentence (derived from the `[CLS]` token's final hidden state)

These are the raw hidden representations (embeddings) before any task-specific head like classification is applied.


In [None]:
# Set manual seed for reproducibility
torch.manual_seed(42)

# If using CUDA
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

Why Set a Manual Seed?

When working with deep learning models, some operations can be non-deterministic  
(e.g., weight initialization, dropout, CUDA kernels).  
By setting a **manual seed**, we make sure that:

- Random number generation is consistent across runs  
- Embeddings and results (when randomness is involved) remain reproducible  
- Debugging and comparing outputs becomes easier

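In PyTorch, we use `torch.manual_seed()` for this. Note that `from_pretrained` returns the model in evaluation mode (dropout disabled), so inference with pretrained weights is already repeatable on CPU; the seed matters mainly once training or sampling is involved.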

In [None]:
from transformers import AutoModel

# Load BERT model (without any classification head)
model = AutoModel.from_pretrained("bert-base-uncased")

# Pass inputs through the model
with torch.no_grad():  # no gradient calculation, just inference
    outputs = model(**inputs)

# Extract embeddings
last_hidden_state = outputs.last_hidden_state
pooler_output = outputs.pooler_output

print("Last hidden state shape:", last_hidden_state.shape)
print("Pooler output shape:", pooler_output.shape)

Visualizing Token Embeddings

Each token in the input sentence is mapped to a high-dimensional vector (embedding).  
The `last_hidden_state` gives us these embeddings:

- Shape: `[batch_size, sequence_length, hidden_size]`
- For **BERT base**, `hidden_size = 768`

We can inspect:
1. The tokens
2. Their corresponding embedding vectors


In [None]:
# Get tokens back from tokenizer
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Pick embeddings for the first sentence in the batch
embeddings = last_hidden_state[0]  # shape: (seq_len, hidden_size)

# Print tokens with first 5 dimensions of their embeddings
for token, vector in zip(tokens, embeddings):
    print(f"{token:10s} -> {vector[:5].numpy()}")

Sentence Embeddings with `pooler_output`

So far, we’ve looked at **token embeddings** (one vector per token).  
But many tasks require a **single vector for the entire sentence** (e.g., similarity, clustering, classification).  

- In BERT-based models, the **`pooler_output`** provides a fixed-size embedding for the whole input.  
- It’s derived from the `[CLS]` token representation, passed through a linear layer + `tanh` activation.  
- It is designed for downstream classification after fine-tuning; without fine-tuning it is often a weak signal for semantic similarity, as the comparison later in this notebook shows.  
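
The "linear layer + `tanh`" step can be sketched in a few lines. The weights below are a hypothetical identity matrix chosen only for readability; the real pooler uses learned 768×768 weights.

```python
import math

def toy_pooler(cls_hidden, weight, bias):
    """Compute tanh(W @ h + b) for the [CLS] hidden vector h."""
    pooled = []
    for row, b in zip(weight, bias):
        z = sum(w * h for w, h in zip(row, cls_hidden)) + b
        pooled.append(math.tanh(z))
    return pooled

# Hypothetical 3-dimensional example (BERT base uses 768 dimensions).
cls_hidden = [0.5, -1.0, 2.0]
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]   # identity matrix, chosen only for readability
bias = [0.0, 0.0, 0.0]

pooled = toy_pooler(cls_hidden, weight, bias)
print(pooled)  # every value lies strictly between -1 and 1 because of tanh
```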

We’ll extract `pooler_output` for a sentence to get its embedding.


In [None]:
from transformers import BertModel, BertTokenizer

# Load BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Example sentence
sentence = "Hugging Face makes working with transformers easy!"

# Tokenize
inputs = tokenizer(sentence, return_tensors="pt")

# Get outputs
with torch.no_grad():
    outputs = model(**inputs)

# Extract sentence embedding
sentence_embedding = outputs.pooler_output  # shape: [1, hidden_size]

print("Sentence Embedding Shape:", sentence_embedding.shape)
print("Sentence Embedding Vector (first 5 values):", sentence_embedding[0][:5])

Comparing `pooler_output` vs Mean Pooling for Sentence Embeddings

BERT provides two common ways to obtain sentence-level embeddings:

1. **`pooler_output`**  
   - Based only on the `[CLS]` token representation.  
   - Passed through a linear layer and `tanh` activation.  
   - Often used for classification tasks in BERT’s original design.

2. **Mean Pooling**  
   - Averages all token embeddings (excluding padding).  
   - Often preferred in sentence-transformers and modern retrieval systems.  
   - Captures more holistic sentence information.

Let’s compute both for the same sentence and compare their shapes and vectors.


In [None]:
# Load model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Example sentence
sentence = "Transformers provide an easy way to work with NLP models."

# Tokenize
inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

# 1. Pooler Output (CLS token)
cls_embedding = outputs.pooler_output  # shape: [1, hidden_size]

# 2. Mean Pooling (excluding padding)
last_hidden = outputs.last_hidden_state  # shape: [1, seq_len, hidden_size]
attention_mask = inputs['attention_mask']
mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden.size()).float()
sum_embeddings = torch.sum(last_hidden * mask_expanded, 1)
sum_mask = torch.clamp(mask_expanded.sum(1), min=1e-9)
mean_embedding = sum_embeddings / sum_mask

print("CLS Pooler Output Shape:", cls_embedding.shape)
print("Mean Pooling Shape:", mean_embedding.shape)
print("\nFirst 5 values of CLS Pooler Output:", cls_embedding[0][:5])
print("\nFirst 5 values of Mean Pooling:", mean_embedding[0][:5])

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Example sentences
sentences = [
    "Transformers are powerful for NLP tasks.",
    "Natural language processing is evolving rapidly.",
    "BERT is a transformer-based model.",
    "I love eating pizza on weekends.",
    "The weather today is sunny and pleasant."
]

# Get embeddings
cls_embeddings = []
mean_embeddings = []

for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # CLS pooler output
    cls_embeddings.append(outputs.pooler_output.squeeze().numpy())

    # Mean pooling
    last_hidden = outputs.last_hidden_state
    attention_mask = inputs['attention_mask']
    mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden.size()).float()
    sum_embeddings = torch.sum(last_hidden * mask_expanded, 1)
    sum_mask = torch.clamp(mask_expanded.sum(1), min=1e-9)
    mean_embeddings.append((sum_embeddings / sum_mask).squeeze().numpy())

# Convert to numpy
import numpy as np
cls_embeddings = np.array(cls_embeddings)
mean_embeddings = np.array(mean_embeddings)

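# PCA for 2D visualization.
# Each embedding type gets its own PCA fit, so the blue and red points do not
# share a coordinate system: compare distances within one color only.
cls_2d = PCA(n_components=2).fit_transform(cls_embeddings)
mean_2d = PCA(n_components=2).fit_transform(mean_embeddings)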

# Plot
plt.figure(figsize=(10, 6))
plt.scatter(cls_2d[:, 0], cls_2d[:, 1], color='blue', label='CLS Pooler')
plt.scatter(mean_2d[:, 0], mean_2d[:, 1], color='red', label='Mean Pooling')

for i, sentence in enumerate(sentences):
    plt.annotate(f"S{i+1}", (cls_2d[i, 0], cls_2d[i, 1]), color='blue')
    plt.annotate(f"S{i+1}", (mean_2d[i, 0], mean_2d[i, 1]), color='red')

plt.title("Sentence Embeddings: CLS Pooler vs Mean Pooling (PCA 2D)")
plt.xlabel("PCA Dimension 1")
plt.ylabel("PCA Dimension 2")
plt.legend()
plt.grid(True)
plt.show()

Visualization of Sentence Embeddings: CLS Pooler vs Mean Pooling

* Blue points = CLS pooler embeddings
* Red points = Mean pooling embeddings
* Labels S1–S5 correspond to sentences

The graph above shows a **2D representation of sentence embeddings** obtained using two different methods from the BERT model:

- **CLS Pooler Output (Blue points):** Uses the final hidden state of the special `[CLS]` token passed through a linear layer and `tanh`; this pooler is trained during BERT’s pretraining on the next-sentence-prediction objective.  
- **Mean Pooling (Red points):** Computes the mean of all token embeddings (considering the attention mask), providing a more generalized representation of the sentence.

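Each point (S1–S5) represents one sentence from the input set. Within one color, sentences that are semantically closer tend to appear nearer in the 2D space, though this is only an approximation. Because PCA is fit separately for the two embedding types, distances between a blue point and a red point are not meaningful.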

---

Why PCA?

**Principal Component Analysis (PCA)** is a dimensionality reduction technique used to project high-dimensional data (here, 768-dimensional BERT embeddings) into a lower-dimensional space (2D for visualization).

Key points about PCA:
- It identifies the directions (principal components) in which the data varies the most.
- The first component explains the highest variance, the second explains the next highest, and so on.
- By using only the first two components, we can plot the embeddings in a way that preserves as much of their variance (structure) as possible while making it human-interpretable.

---

This visualization allows us to see how different pooling strategies (CLS vs Mean) can create slightly different spatial relationships between sentence embeddings.


Semantic Similarity Comparison: CLS Pooler vs Mean Pooling

We compute the **cosine similarity** between sentence embeddings generated using two different methods:

- **CLS Pooler Output:** Uses the `[CLS]` token representation.
- **Mean Pooling:** Averages all token embeddings for the sentence.

---

Why Cosine Similarity?

Cosine similarity measures the cosine of the angle between two vectors in high-dimensional space.  
It ranges from **-1 (opposite direction)** to **+1 (same direction)**; 0 means the vectors are orthogonal. Vector length does not affect it.  
For sentence embeddings:
- **Higher cosine similarity** → more semantically similar.
- **Lower cosine similarity** → less semantically related.
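
As a quick illustration of the definition, here is cosine similarity written out in plain Python on small made-up vectors (the notebook itself uses scikit-learn's `cosine_similarity` on the real embeddings).

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 2.0], [2.0, 4.0]))    # same direction, different length
print(cosine([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

The first pair differs only in length, which is why its similarity is at the top of the range.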

---

Observations

- **CLS Pooler:** Sometimes fails to capture nuanced relationships, as it was originally trained for classification tasks (e.g., NSP) rather than sentence-level similarity.
- **Mean Pooling:** Often produces more stable and semantically meaningful similarities across diverse sentences, especially when fine-tuning is not performed.

---

When to Use Which?

- Use **Mean Pooling** when building general-purpose sentence embeddings (e.g., clustering, semantic search, retrieval).
- Use **CLS Pooler** when fine-tuning the model for a specific task where `[CLS]` is trained to represent sentence meaning.

---

This step helps us decide which embedding strategy is better suited before proceeding to **training or fine-tuning for downstream NLP tasks**.


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns

# Manual seed for reproducibility
torch.manual_seed(42)

# Load pre-trained BERT
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Sample sentences
sentences = [
    "I love machine learning.",
    "Deep learning is a subset of machine learning.",
    "The weather is sunny today.",
    "I enjoy coding in Python.",
    "Artificial intelligence is fascinating."
]

# Tokenize
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True, max_length=32)

# Get model outputs
with torch.no_grad():
    outputs = model(**inputs)
    last_hidden_states = outputs.last_hidden_state   # (batch, seq_len, hidden_dim)
    cls_pooler_output = outputs.pooler_output        # (batch, hidden_dim)

# Mean pooling: average all token embeddings (excluding padding)
attention_mask = inputs['attention_mask'].unsqueeze(-1).float()  # (batch, seq_len, 1)
masked_hidden = last_hidden_states * attention_mask
mean_pooling = masked_hidden.sum(dim=1) / attention_mask.sum(dim=1).clamp(min=1e-9)

# Compute cosine similarity matrices (convert tensors to NumPy for scikit-learn)
cls_similarity = cosine_similarity(cls_pooler_output.numpy())
mean_similarity = cosine_similarity(mean_pooling.numpy())

# Plot heatmaps
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.heatmap(cls_similarity, annot=True, xticklabels=sentences, yticklabels=sentences, cmap="coolwarm", ax=axes[0])
axes[0].set_title("Cosine Similarity - CLS Pooler")

sns.heatmap(mean_similarity, annot=True, xticklabels=sentences, yticklabels=sentences, cmap="coolwarm", ax=axes[1])
axes[1].set_title("Cosine Similarity - Mean Pooling")

plt.tight_layout()
plt.show()

## Clustering Sentence Embeddings with KMeans

After obtaining sentence embeddings using **mean pooling**, we applied **KMeans clustering** to group similar sentences.

### Why KMeans?

KMeans partitions data into *k* clusters by minimizing the within-cluster variance.  
It is commonly used for:
- Grouping semantically similar texts
- Organizing large text corpora
- Topic discovery in unsupervised NLP

### Steps Performed

1. **Embedding Selection:** Used mean-pooled BERT embeddings.
2. **KMeans Clustering:** Chose `k=2` for our small dataset (can be tuned).
3. **Dimensionality Reduction:** Applied PCA to project 768-dimensional embeddings into 2D for visualization.
4. **Cluster Interpretation:** Each cluster groups sentences with related meanings.

### Observations

- Sentences about machine learning and AI formed one cluster.
- Unrelated sentences (e.g., weather, general coding) tended to form another cluster.

This sets the foundation for tasks like **semantic search**, **document organization**, and **topic modeling**.
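
To show what the clustering step does mechanically, here is a minimal plain-Python sketch of the k-means loop (Lloyd's algorithm) on made-up 2-D points. It is an illustration only: the points are hypothetical stand-ins for embeddings, the initialization simply takes the first `k` points, and a real run would more likely use scikit-learn's `KMeans` on the 768-dimensional mean-pooled vectors.

```python
def kmeans(points, k, n_iter=10):
    """Plain Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical 2-D "embeddings": three points near (0, 0) and two near (5, 5).
toy_points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (0.1, 0.2), (5.2, 4.9)]
toy_labels_km, toy_centroids = kmeans(toy_points, k=2)
print(toy_labels_km)
print(toy_centroids)
```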


# **Day 2 Summary: Tokenization & Embeddings**

This summary documents the theoretical and conceptual learning from **Day 2**, covering tokenization and embeddings in Hugging Face.

---

## **1. Tokenization**

### **What is Tokenization?**

* Tokenization is the process of splitting raw text into **tokens** (subwords, words, or characters) that a model can process.
* BERT uses **WordPiece tokenization**, while models like GPT-2 use **Byte-Pair Encoding (BPE)**.

---

### **How is it Performed?**

* Hugging Face provides **`AutoTokenizer`** to automatically load the correct tokenizer:

  * `from transformers import AutoTokenizer`
  * `tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")`

* Key functions:

  * **`tokenizer()`** – Encodes text into input IDs, attention masks, etc.
  * Parameters:

    * `padding='max_length'` – Pads to a fixed length.
    * `truncation=True` – Truncates longer sequences.
    * `return_tensors='pt'` – Returns tensors for PyTorch.

---

### **Where is it Used?**

* **Preprocessing step** in all NLP tasks: classification, QA, summarization, generation, etc.

---

### **When is it Applied?**

* **Before inference or training**, to convert text into model-readable format.

---

### **Why is Tokenization Important?**

* Models work on **token IDs, not raw text**.
* Subword tokenization handles **rare, unknown, or misspelled words** effectively while keeping vocabulary size manageable.

---

## **2. Embeddings**

### **What Are Embeddings?**

* **Dense numerical vectors** representing tokens, words, or entire sentences.
* Capture **semantic meaning and context**.

---

### **How Are They Generated?**

* Use Hugging Face **`AutoModel`** to get hidden states:

  * `from transformers import AutoModel`
  * `model = AutoModel.from_pretrained("bert-base-uncased")`
* Output:

  * **`last_hidden_state`** – Embeddings for each token.
  * **`pooler_output`** – Embedding representing the entire sentence (via `[CLS]` token in BERT).

---

### **Where Are They Used?**

* In **downstream tasks** like:

  * Text classification
  * Semantic search
  * Clustering
  * Question answering
  * Retrieval-Augmented Generation (RAG)

---

### **When to Use CLS vs Mean Pooling?**

* **CLS (`pooler_output`)**: Typically used for classification tasks, usually after fine-tuning.
* **Mean pooling (averaging token embeddings)**: Often the better choice for sentence similarity, clustering, or retrieval, especially without fine-tuning.

---

### **Why Are Embeddings Important?**

* Bridge between **human language** and **machine learning algorithms**.
* Enable computation of **similarity (cosine, Euclidean)** and facilitate **transfer learning**.

---

## **Key Hyperparameters & Parameters**

* **`max_length`**: Controls sequence length during tokenization.
* **`return_tensors`**: Decides the output tensor type (`'pt'` for PyTorch, `'tf'` for TensorFlow).
* **Pooling method**: `mean`, `max`, or `cls`.
* **Normalization**: Optional step (e.g., L2 normalization) for similarity-based tasks.
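
The pooling and normalization options above can be compared side by side on a tiny example. This is a plain-Python sketch with hypothetical 2-D token vectors; note that the `cls` option here takes the raw first-position vector, which is not the same as BERT's `pooler_output` (that adds a linear layer and `tanh`).

```python
import math

def pool(token_vectors, mask, method="mean"):
    """Pool per-token vectors into one vector, ignoring positions where mask is 0."""
    if method == "cls":
        return list(token_vectors[0])          # first position holds [CLS]
    real = [v for v, m in zip(token_vectors, mask) if m == 1]
    if method == "mean":
        return [sum(dim) / len(real) for dim in zip(*real)]
    if method == "max":
        return [max(dim) for dim in zip(*real)]
    raise ValueError(f"unknown pooling method: {method}")

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Hypothetical 2-D token vectors; the last position is padding.
toy_vectors = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]

for method in ("cls", "mean", "max"):
    print(method, pool(toy_vectors, toy_mask, method))
print(l2_normalize(pool(toy_vectors, toy_mask, "mean")))
```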