# Day 3: Semantic Similarity & Clustering

- Use sentence embeddings (from Day 2) to compute semantic similarity.
- Explore cosine similarity as a measure of closeness between sentence meanings.
- Perform clustering using KMeans to group similar sentences.
- Visualize clusters for interpretability.

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
import numpy as np

In [None]:
sentences = [
    "I love playing football.",
    "Soccer is my favorite sport.",
    "The weather is sunny today.",
    "It's raining outside.",
    "I enjoy watching movies on weekends.",
    "Films are a great way to relax."
]

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Naive mean pooling: averages over every position, including [PAD] tokens
embeddings = outputs.last_hidden_state.mean(dim=1)
embeddings.shape

Before we move to cosine similarity, we need sentence embeddings that are not distorted by padding. Instead of taking a plain mean across all token positions (which includes padding), we’ll mask padding tokens using `attention_mask`.

### Why Masked Mean Pooling?

In BERT-based models:
- `last_hidden_state` → gives embeddings for each token, including `[PAD]` tokens.
- If we simply average all token embeddings, padding will skew results.

To fix this:
- Use `attention_mask` (1 for real tokens, 0 for padding).
- Multiply embeddings by the mask.
- Compute the mean only across actual tokens.

This ensures better sentence-level representations for similarity tasks.
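
Written as a formula, the steps above amount to the following, where $\mathbf{h}_t$ is the embedding at token position $t$, $m_t \in \{0, 1\}$ is its attention-mask value, and $T$ is the padded sequence length. The symbol names are chosen here for illustration only.

$$
\mathbf{s} = \frac{\sum_{t=1}^{T} m_t \, \mathbf{h}_t}{\sum_{t=1}^{T} m_t}
$$

The naive mean drops $m_t$ from both sums, so it divides by $T$ and lets the vectors at `[PAD]` positions enter the average.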

In [None]:
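def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state  # [batch_size, seq_len, hidden_dim]
    # Broadcast the mask to the embedding shape and cast to float for the arithmetic below
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = (token_embeddings * mask).sum(1)  # padded positions contribute zero
    sum_mask = torch.clamp(mask.sum(1), min=1e-9)      # guard against division by zero
    return sum_embeddings / sum_mask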

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

sentence_embeddings = mean_pooling(outputs, inputs['attention_mask'])
print("Shape of embeddings:", sentence_embeddings.shape)

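Example explanation
Suppose we have 3 token positions per sentence, each with a 4-dimensional embedding, and padding for unused positions. Padding rows are shown as zeros to keep the arithmetic simple; in real BERT output, `[PAD]` positions usually hold non-zero vectors, which is why the mask multiplication matters.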

---

Token embeddings (last_hidden_state):

Sentence 1: [[1, 2, 3, 4], [2, 2, 2, 2], [0, 0, 0, 0]]  ← last row is padding

Sentence 2: [[1, 1, 1, 1], [3, 3, 3, 3], [5, 5, 5, 5]]  ← all tokens are real

---

Attention mask:

Sentence 1: [1, 1, 0]  → only first 2 tokens are real

Sentence 2: [1, 1, 1]  → all tokens are real

---
Naive Mean (without masking)

For Sentence 1:

Sum all tokens → [1+2+0, 2+2+0, 3+2+0, 4+2+0] = [3, 4, 5, 6]

Divide by 3 (total tokens) → [1, 1.33, 1.67, 2] ← padding skewed the mean downward.

---

Masked Mean Pooling

Multiply embeddings by mask:

Sentence 1: [[1,2,3,4], [2,2,2,2], [0,0,0,0]] × [1,1,0]
→ [[1,2,3,4], [2,2,2,2], [0,0,0,0]]

Sum → [3,4,5,6]

Divide by sum of mask (1+1 = 2) → [1.5, 2, 2.5, 3] ← correct!
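
The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch on the toy numbers from this example, not on real BERT output; the helper names `naive_mean` and `masked_mean` are illustrative and do not come from any library.

```python
# Toy token embeddings for Sentence 1 (last row is padding) and its mask.
tokens = [[1, 2, 3, 4], [2, 2, 2, 2], [0, 0, 0, 0]]
mask = [1, 1, 0]

def naive_mean(tokens):
    """Average over every position, padding included."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def masked_mean(tokens, mask):
    """Average only over positions where the mask is 1."""
    real = sum(mask)
    return [sum(v * m for v, m in zip(col, mask)) / real for col in zip(*tokens)]

print([round(x, 2) for x in naive_mean(tokens)])  # [1.0, 1.33, 1.67, 2.0]
print(masked_mean(tokens, mask))                  # [1.5, 2.0, 2.5, 3.0]
```

Both functions compute the same sum here because the padding row is zero; only the divisor differs.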

### PCA Visualization of Sentence Embeddings

- Each sentence embedding from BERT is 768-dimensional (for `bert-base-uncased`).
- Visualizing in 768 dimensions is impossible, so we use **PCA (Principal Component Analysis)**:
  - PCA reduces dimensionality while preserving as much variance as possible.
  - We project the embeddings from 768D → 2D.
- This lets us visually inspect how sentences group based on meaning.

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Convert to numpy
embeddings_np = sentence_embeddings.detach().numpy()

# Apply PCA
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings_np)

# Plot
plt.figure(figsize=(8, 6))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], color='blue')

for i, sentence in enumerate(sentences):
    plt.annotate(f"{i+1}", (embeddings_2d[i, 0], embeddings_2d[i, 1]))

plt.title("PCA of Sentence Embeddings")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.grid(True)
plt.show()

**Interpreting the PCA Plot**

The PCA plot shows each sentence represented as a **point in 2D space**. Here’s how to read it:

1. **Each point = one sentence embedding**  
   - Numbers/labels correspond to the sentence index.

2. **Distance = similarity (roughly)**  
   - Points closer together mean their embeddings are more similar → sentences likely have similar meaning.
   - Distant points → less semantic similarity.

3. **Clusters = groups of semantically related sentences**  
   - If multiple points form a cluster, the model sees them as related in meaning.
   - For example, "I love playing football" and "Soccer is my favorite sport" should appear near each other.

4. **Axes (PC1 & PC2)**  
   - Do not directly represent any word or meaning.  
   - They are **principal components**, directions that capture the most variance in high-dimensional data.

5. **PCA is only an approximation**  
   - We reduced from 768D → 2D, so some meaning relationships may get compressed or distorted.

In short:
- **Close points → semantically similar.**
- **Far points → different topics.**
- Clusters reveal **latent themes** within your text set.

### Cosine Similarity for Semantic Comparison

Cosine similarity measures how close two vectors are in terms of their **direction**, not their length.

- Formula:
  **cos(θ) = (A · B) / (||A|| ||B||)**
  
- Values range from:
  - **+1** → identical direction (high similarity)
  - **0** → orthogonal (no similarity)
  - **-1** → opposite direction (completely dissimilar)

Why cosine similarity?
- Sentence embeddings may differ in scale/magnitude.
- Cosine similarity ignores magnitude and compares only **orientation**, which makes it well suited to semantic comparison.
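
To make the formula concrete, here is a minimal from-scratch sketch in plain Python. The function name `cosine` and the 2-D vectors are illustrative assumptions; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine([1, 0], [3, 0]))   # 1.0  (same direction, different length)
print(cosine([1, 0], [0, 2]))   # 0.0  (orthogonal)
print(cosine([1, 0], [-2, 0]))  # -1.0 (opposite direction)
```

The first call shows the scale invariance: tripling a vector's length leaves the score at 1.0.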

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute pairwise cosine similarity
cos_sim_matrix = cosine_similarity(sentence_embeddings)

print("Cosine Similarity Matrix:\n", np.round(cos_sim_matrix, 2))

In [None]:
# Show pairwise similarities
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        print(f"Similarity between '{sentences[i]}' and '{sentences[j]}': {cos_sim_matrix[i][j]:.2f}")

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.heatmap(
    cos_sim_matrix,
    annot=True,
    cmap="coolwarm",
    xticklabels=range(1, len(sentences) + 1),
    yticklabels=range(1, len(sentences) + 1),
)
plt.title("Cosine Similarity Matrix (Sentence Embeddings)")
plt.xlabel("Sentence Index")
plt.ylabel("Sentence Index")
plt.show()

**Why Cosine Similarity?**

When working with sentence or word embeddings, each text is represented as a high-dimensional vector.
- Two sentences with similar meaning may have embeddings with **different magnitudes** (lengths) but **similar directions**.
- Example: "I love football" and "I enjoy soccer" may point in a similar direction in the embedding space but differ in vector length due to different words.

**Cosine similarity solves this by focusing only on the angle between vectors:**
- It measures **how close two vectors point to the same direction** regardless of their scale.

---

**How Does It Enable Semantic Similarity?**

- In modern NLP models (like BERT, GPT), embeddings encode semantic meaning.
- If two sentences mean the same thing, their embeddings tend to point in a **similar direction in the latent space**.
- Cosine similarity helps us **quantify this semantic closeness**.

**Example:**
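- "The cat is on the mat" vs "A feline rests on the rug" → Expected to score higher (meanings are close).
- "The cat is on the mat" vs "It’s raining today" → Expected to score lower (meanings are unrelated).
- With mean-pooled BERT embeddings, absolute scores tend to be high for most pairs, so compare scores relative to each other rather than against a fixed threshold.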

This is why cosine similarity is widely used for:
- **Semantic search**
- **Clustering similar texts**
- **Duplicate detection**
- **Recommendation systems based on meaning rather than exact words**

### Clustering Sentence Embeddings with KMeans

- **What is KMeans?**
  - A clustering algorithm that divides data points into *k* clusters.
  - Each cluster is represented by a centroid (mean of points in that group).

- **Why KMeans for text?**
  - Our sentence embeddings are high-dimensional vectors.
  - KMeans groups similar vectors together based on their positions in space.
  - Useful for topic grouping, document clustering, or semantic organization.

- **Key hyperparameter:**
  - `n_clusters`: Number of groups we want. It should reflect how many distinct topics we expect.
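
The assign-then-update loop behind KMeans can be sketched in plain Python. This is a simplified illustration under stated assumptions: it takes fixed starting centroids, whereas scikit-learn's `KMeans` picks its own initial centroids and repeats the run `n_init` times. The function name `kmeans` and the toy points are hypothetical.

```python
def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels, centroids = kmeans(points, centroids=[[0.0, 0.0], [5.0, 5.0]])
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [[0.0, 0.5], [5.0, 5.5]]
```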

In [None]:
from sklearn.cluster import KMeans

# Choose number of clusters (let's assume 3: sports, weather, movies)
num_clusters = 3
kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
clusters = kmeans.fit_predict(sentence_embeddings)

# Show cluster assignments
for sentence, cluster_id in zip(sentences, clusters):
    print(f"[Cluster {cluster_id}] {sentence}")

#### Choosing `n_clusters` for KMeans

The number of clusters (`n_clusters`) is not always obvious, especially for text data.  
Here are common strategies to determine it:

---

##### 1. **Domain Knowledge / Intuition**
- If you already know how many topics or groups to expect, set `n_clusters` directly.
- Example: If your dataset contains sentences about sports, weather, and movies, you might set `n_clusters=3`.

---

##### 2. **Elbow Method**
- Plot the **inertia (within-cluster sum of squared distances)** for different cluster counts.
- Inertia decreases as you add more clusters, but after a certain point, the improvement slows down → forming an "elbow" shape.
- The elbow point suggests a good trade-off between too few and too many clusters.
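
As a rough illustration of what inertia measures, the sketch below computes it by hand on four toy 2-D points. The `inertia` helper and the data are hypothetical; with scikit-learn the same quantity is read from the fitted model's `inertia_` attribute.

```python
def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        sum((p - q) ** 2 for p, q in zip(point, centroids[label]))
        for point, label in zip(points, labels)
    )

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]

# k = 1: a single centroid at the overall mean.
print(inertia(points, [0, 0, 0, 0], [[2.5, 3.0]]))              # 51.0
# k = 2: one centroid per visible group.
print(inertia(points, [0, 0, 1, 1], [[0.0, 0.5], [5.0, 5.5]]))  # 1.0
```

The large drop from k = 1 to k = 2 is the kind of bend the elbow method looks for; further clusters could only reduce inertia by at most 1.0.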

---

##### 3. **Silhouette Score**
- Measures how similar a point is to its own cluster compared to other clusters.
- Ranges from **-1 (bad clustering) to +1 (good clustering)**.
- Try multiple cluster counts and choose the one with the **highest silhouette score**.
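
A from-scratch sketch of the silhouette score may help show what the number captures. It assumes Euclidean distance and scores singleton clusters as 0; the `silhouette` helper and toy points are illustrative, and the notebook itself uses `sklearn.metrics.silhouette_score`.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        own = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        if not own:               # singleton cluster: score taken as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)   # mean distance to the point's own cluster
        b = min(                  # mean distance to the nearest other cluster
            sum(dist(p, q) for q, l in zip(points, labels) if l == other)
            / sum(1 for l in labels if l == other)
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
good = silhouette(points, [0, 0, 1, 1])  # labels follow the visible groups
bad = silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
print(good > bad)  # True
```

A labeling that matches the visible groups scores close to +1, while one that splits them scores below 0.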

---

##### 4. **Practical Simplicity**
- For small datasets (like our 6-sentence example), using intuition is often enough.
- For large datasets, combine the above methods for a more informed choice.

---

**In this project:**
- We will start with **3 clusters** because our sentences intuitively belong to three topics:  
  1. Sports (football, soccer)  
  2. Weather (sunny, raining)  
  3. Movies (films, weekends)


In [None]:
# Apply PCA again if not already available
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(sentence_embeddings.detach().numpy())

# Plot clusters
plt.figure(figsize=(8, 6))
scatter = plt.scatter(
    embeddings_2d[:, 0], embeddings_2d[:, 1],
    c=clusters, cmap='viridis', s=100
)

# Annotate points with sentence index
for i, sentence in enumerate(sentences):
    plt.annotate(f"{i+1}", (embeddings_2d[i, 0] + 0.02, embeddings_2d[i, 1]))

plt.title("KMeans Clusters on Sentence Embeddings (PCA Reduced)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.colorbar(scatter, label='Cluster ID')
plt.grid(True)
plt.show()

### Determining Optimal Number of Clusters

- **Elbow Method**: Plots inertia (sum of squared distances within clusters) vs. number of clusters.
  - Look for a "bend" or "elbow" in the curve.
  - After this point, adding more clusters doesn't reduce inertia significantly.

- **Silhouette Score**: Measures how well samples are clustered.
  - Ranges from -1 (bad) to +1 (good).
  - Higher = better defined clusters.


In [None]:
from sklearn.metrics import silhouette_score

inertias = []
silhouette_scores = []
k_range = range(2, 6)  # testing from 2 to 5 clusters

for k in k_range:
    kmeans_k = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans_k.fit_predict(sentence_embeddings)
    inertias.append(kmeans_k.inertia_)
    silhouette_scores.append(silhouette_score(sentence_embeddings, labels))

# Plot both metrics
fig, ax1 = plt.subplots(figsize=(8, 5))

ax1.plot(list(k_range), inertias, 'b-o', label='Inertia (Elbow Method)')
ax1.set_xlabel('Number of Clusters (k)')
ax1.set_ylabel('Inertia', color='b')

ax2 = ax1.twinx()
ax2.plot(list(k_range), silhouette_scores, 'r-s', label='Silhouette Score')
ax2.set_ylabel('Silhouette Score', color='r')

plt.title("Elbow & Silhouette Analysis for KMeans")
plt.grid(True)
plt.show()

- **Blue Line (Inertia)**:
  - Inertia measures how tightly points are grouped within each cluster.
  - Lower is better, but diminishing returns appear as clusters increase.
  - In your plot, inertia keeps decreasing (as expected), but the "bend" is near **k = 3**.

- **Red Line (Silhouette Score)**:
  - Measures how distinct and well-separated clusters are.
  - Higher is better (closer to +1 means well-defined clusters).
  - Here, the silhouette score is highest at **k = 2** and drops after that.

**Interpretation:**
- If you care about *clear separation* → k = 2 is best.
- If you care about *topic granularity (as per your dataset's 3 themes)* → k = 3 is a reasonable choice.

### Alternative Visualization: Hierarchical Clustering or t-SNE

- **Hierarchical Clustering**: Builds a tree of clusters (dendrogram).
- **t-SNE**: Non-linear dimensionality reduction technique, better for visualizing high-dimensional embeddings.
- **UMAP**: Similar to t-SNE but faster and preserves more global structure.

Here we’ll try t-SNE to better visualize semantic relationships.

In [None]:
from sklearn.manifold import TSNE

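# Apply t-SNE (perplexity must be smaller than the number of samples, 6 here).
# The iteration count is left at its default because the keyword was renamed
# from `n_iter` to `max_iter` in newer scikit-learn releases.
tsne = TSNE(n_components=2, random_state=42, perplexity=5)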
tsne_embeddings = tsne.fit_transform(sentence_embeddings.detach().numpy())

# Plot t-SNE clusters
plt.figure(figsize=(8, 6))
plt.scatter(tsne_embeddings[:, 0], tsne_embeddings[:, 1], c=clusters, cmap='viridis', s=100)
for i, sentence in enumerate(sentences):
    plt.annotate(f"{i+1}", (tsne_embeddings[i, 0] + 0.02, tsne_embeddings[i, 1]))

plt.title("t-SNE Visualization of Sentence Clusters")
plt.grid(True)
plt.show()

- Each point is a sentence embedding reduced to 2D using t-SNE.
- Colors represent KMeans cluster assignments.
- Numbers are sentence indices.

**Observations:**
- Points with similar topics are closer together.
- For example:
  - Sentences 1 & 2 are grouped (likely similar topic).
  - Sentence 5 is slightly isolated (possible outlier or unique context).
- Clusters are reasonably separated but not perfectly, which is normal for small datasets.

**Key Note:**
- t-SNE emphasizes local relationships, not exact distances.
- Distances between far-apart clusters are less meaningful than cluster tightness.

### Semantic Search Demo

We can now use our embeddings to build a **basic semantic search engine**:
- Take a query sentence.
- Compute its embedding.
- Calculate cosine similarity with all sentence embeddings.
- Return the most semantically similar sentence(s).

In [None]:
query = "I enjoy watching sports."
query_inputs = tokenizer(query, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    query_output = model(**query_inputs)
query_embedding = mean_pooling(query_output, query_inputs['attention_mask'])

# Compute similarity with existing sentences
similarities = cosine_similarity(query_embedding, sentence_embeddings).flatten()

# Find top match
top_idx = similarities.argmax()
print(f"Query: {query}")
print(f"Most similar: {sentences[top_idx]} (score: {similarities[top_idx]:.2f})")
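
The cell above returns only the single best match. To return several sentences, the scores can be ranked and the top k kept. The sketch below shows the idea on hypothetical 2-D vectors standing in for real sentence embeddings; `cosine` and `top_k` are illustrative helper names, not library functions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k most similar documents, best first."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 2-D "embeddings" standing in for real sentence vectors.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [2.0, 0.0]

for idx, score in top_k(query_vec, docs, k=2):
    print(idx, round(score, 2))
# 0 1.0
# 2 0.71
```

With the notebook's own arrays, the same ranking could be obtained by sorting `similarities` in descending order and slicing the first k indices.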

# **SUMMARY**

## **What We Explored**
- Learned how to measure **semantic similarity** between sentences using embeddings.
- Used **cosine similarity** to quantify closeness between sentence vectors.
- Performed **KMeans clustering** to group semantically similar sentences.
- Visualized embeddings using **PCA and t-SNE** to observe natural clusters.
- Applied **Elbow Method & Silhouette Analysis** to decide the optimal number of clusters.
- Built a **mini semantic search engine** to retrieve the most relevant sentence to a query.

---

## **Key Concepts**
1. **Cosine Similarity**
   - Measures the angle between two vectors.
   - Closer to **1 → more similar**, closer to **0 → less similar**.

2. **Clustering**
   - **KMeans** groups sentences into clusters by minimizing within-cluster variance.
   - `n_clusters` determines how many groups to form.
   - **Choosing k**:
     - *Elbow Method* → find the bend where inertia reduction slows down.
     - *Silhouette Score* → choose k with the highest cluster separation.

3. **Dimensionality Reduction**
   - **PCA**: Linear technique to project high-dimensional data into 2D for visualization.
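   - **t-SNE**: Non-linear technique that preserves local structure; often clearer than PCA for visualizing high-dimensional embeddings, but distances between far-apart groups are not reliable.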

4. **Semantic Search**
   - Compute embedding for query → Compare with dataset embeddings → Retrieve most similar.

---

## **Insights from Our Data**
- Elbow & silhouette suggest **k=2 or k=3** as reasonable choices.
- t-SNE visualization showed logical grouping but with slight overlap (expected for small datasets).
- Semantic search worked by retrieving the most contextually similar sentence.

---

## **Key Functions & Parameters Used**
- **Hugging Face Trans**
