
# 📗 Notebook 2: Semantic Similarity and Vector Closeness

This notebook explores how **embeddings help compare the meaning of sentences**.

**Semantic similarity** is about comparing the *meaning* of two pieces of text — not just whether they use the same words.

Examples:
- “The dog barked loudly.”
- “The canine made noise.”

These don’t share many exact words, but mean nearly the same thing. A basic keyword match wouldn’t see the connection — but a language model does, by encoding meaning in vectors.
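To make the keyword-matching limitation concrete, here is a minimal sketch that measures word overlap between the two example sentences. The helper names `word_set` and `jaccard_overlap` are hypothetical, introduced only for this illustration, and Jaccard overlap is just one simple stand-in for "basic keyword match".

```python
import string

def word_set(text):
    """Lowercase the text, strip punctuation, and return its distinct words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def jaccard_overlap(a, b):
    """Fraction of distinct words shared by both texts (0.0 to 1.0)."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

s1 = "The dog barked loudly."
s2 = "The canine made noise."

shared = word_set(s1) & word_set(s2)   # words that appear in both sentences
overlap = jaccard_overlap(s1, s2)      # shared words / all distinct words
print(shared, round(overlap, 3))
```

The only shared word is "the", so a keyword-based score stays low even though the sentences mean nearly the same thing.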

When a sentence-embedding model (loaded here through the `SentenceTransformer` class) reads a sentence, it creates an **embedding**: a fixed-length list of numbers that represents the sentence’s meaning. These embeddings are **vectors in a high-dimensional space**.

### 🧲 Analogy:
- Each sentence is a dot in a big space.
- Sentences with similar meaning are nearby dots.
- The distance between them tells us how semantically similar they are.



### 📐 How Do We Measure Closeness?

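The most common method is **cosine similarity**. It is the cosine of the *angle* between two vectors, so it depends on their direction and not on their length.

- Cosine similarity = 1.0 → exactly the same direction (perfect match)
- ≈ 0.7–0.9 → usually similar meaning (a rough guide; useful thresholds vary by model)
- = 0.0 → vectors are perpendicular (no measurable similarity)
- < 0 → vectors point in opposing directions (rare with sentence embeddings)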
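The score ranges above follow from the definition: the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on tiny 2D vectors. The function name `cosine_sim` is hypothetical and the vectors are toy values chosen for illustration, not real embeddings.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b (equal-length sequences)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors, chosen only to show the three reference cases.
same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])    # same direction, different length
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])        # perpendicular
opposite = cosine_sim([1.0, 2.0], [-1.0, -2.0])        # pointing the other way

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Up to floating-point rounding, the three cases come out at 1, 0, and -1. Doubling a vector's length leaves the score unchanged, which is why cosine similarity compares direction only.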


In [1]:

import logging
import warnings
from transformers.utils import logging as hf_logging
import os
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


# Suppress all user warnings, including HF_TOKEN ones
warnings.filterwarnings("ignore", category=UserWarning)

# Suppress Hugging Face logs
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
hf_logging.set_verbosity_error()
logging.getLogger("huggingface_hub").setLevel(logging.ERROR)

model = SentenceTransformer('all-MiniLM-L6-v2')



## 🔍 Quick Warm-Up Example

A quick test with three sentences. We expect sentences 1 and 2 to be close in meaning and sentence 3 to differ from both.


In [None]:

sentences = [
    "The cat sits on the mat.",
    "A feline is on a rug.",
    "I went to the market."
]
embeddings = model.encode(sentences)
similarity_matrix = cosine_similarity(embeddings)

# Display similarity matrix
pd.DataFrame(similarity_matrix, index=sentences, columns=sentences)



## 🧠 Semantic Clustering of Expanded Sentences

Now let’s explore six more varied sentences and see how similar topics cluster together. PCA reduces each embedding to two dimensions so the sentences can be drawn on a scatter plot.


In [None]:

sentences = [
    "The stock market crashed in 2008.",
    "Financial markets collapsed in the late 2000s.",
    "My dog loves playing fetch in the park.",
    "Throwing a ball for my puppy at the park is fun.",
    "I had toast for breakfast today.",
    "Breakfast was just some plain white bread."
]
vectors = model.encode(sentences)
similarity_matrix = cosine_similarity(vectors)
reduced = PCA(n_components=2).fit_transform(vectors)

# Scatter plot
plt.figure(figsize=(10, 7))
for i, sentence in enumerate(sentences):
    plt.scatter(reduced[i, 0], reduced[i, 1])
    plt.annotate(f"{i+1}. {sentence}", (reduced[i, 0]+0.01, reduced[i, 1]+0.01), fontsize=9)
plt.title("2D Visualization of Sentence Embeddings")
plt.grid(True)
plt.show()



### 🔥 Heatmap: Cosine Similarity Between Sentences

Visualizing pairwise similarity scores.

How to Read the Heatmap
* Each cell shows the similarity between a pair of sentences.
* The diagonal (top-left to bottom-right) will always show 1.0 because each sentence is perfectly similar to itself.
* Darker blue = higher similarity
* Lighter blue = lower similarity

📌 Why It’s Useful

This helps you:
* Spot clusters of sentences with similar meanings.
* Identify outliers that are semantically different.
* Visually explore which sentences are most alike — especially helpful in summarization, clustering, or search applications.
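Besides reading the heatmap by eye, the same matrix can be ranked programmatically. The sketch below lists every sentence pair from most to least similar. `rank_pairs` is a hypothetical helper, and the toy matrix holds made-up numbers for illustration only, not real model output.

```python
def rank_pairs(matrix, labels):
    """Return (score, label_i, label_j) for every pair i < j, highest score first."""
    pairs = []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):   # upper triangle only: skips the diagonal and mirrored pairs
            pairs.append((float(matrix[i][j]), labels[i], labels[j]))
    return sorted(pairs, key=lambda p: p[0], reverse=True)

# Made-up similarity values for illustration only, NOT real model output.
toy_labels = ["A", "B", "C"]
toy_matrix = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

ranked = rank_pairs(toy_matrix, toy_labels)
for score, first, second in ranked:
    print(f"{score:.2f}  {first} <-> {second}")
```

Because the helper only indexes `matrix[i][j]`, it should also accept the notebook's `similarity_matrix` together with `sentences`, assuming the matrix is square and ordered the same way as the labels.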


In [None]:

sns.heatmap(similarity_matrix, annot=True, xticklabels=sentences, yticklabels=sentences, cmap="Blues", fmt=".2f")
plt.title("Cosine Similarity Between Sentences")
plt.show()
