<a href="https://colab.research.google.com/github/acastellanos-ie/NLP-MBDS-EN/blob/main/04_semantics/semantics_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Session 4: Dense Semantics and Vector Spaces (Lab)

**Learning Objective:**
Understand how transforming text into high-dimensional numerical vectors (Embeddings) solves the fragility of classical rule-based NLP.

We will revisit the exact failure from Session 2 (synonyms and passive voice) and show numerically how vector geometry and Cosine Similarity bridge the gap that the rules could not.

In [None]:
# 1. Environment Setup
# We use sentence-transformers to generate dense sentence embeddings.
!pip install -q sentence-transformers numpy matplotlib seaborn

### Phase 1: The Geometry of Meaning
We will load a small, highly efficient Embedding model. This model acts as a function: $f: \text{text} \rightarrow \mathbb{R}^{384}$. It projects any text into a 384-dimensional space.

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load a pre-trained model optimized for semantic similarity
model_id = 'all-MiniLM-L6-v2'
embedder = SentenceTransformer(model_id)

print(f"Model loaded. Vector dimensionality: {embedder.get_sentence_embedding_dimension()}")

### Phase 2: Solving the Parsing Failure
In our previous session, our rule-based system failed to connect `Who bought a car?` with `A brand new car was purchased by John` because the syntax and vocabulary were different.

Let's map them into the vector space.

In [None]:
# The query
query = "Who bought a car?"

# The documents (Knowledge base)
documents = [
    "A brand new car was purchased by John.",  # Meaning matches query (Failed in Session 2)
    "Mary gave John a heavy book.",            # Completely irrelevant
    "Alice traveled to Paris by train.",       # Completely irrelevant
    "I want to buy a vehicle.",                # Similar vocabulary, different intent
]

# Encode text into vectors
query_vector = embedder.encode(query)
document_vectors = embedder.encode(documents)

print(f"Query Vector Shape: {query_vector.shape}")
print(f"Document Matrix Shape: {document_vectors.shape}")

### Phase 3: Mathematical Proof (Cosine Similarity)
How do we know if two vectors mean the same thing? We measure the angle between them using **Cosine Similarity**.
$\text{Cosine}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$
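
Before applying the formula to 384-dimensional embeddings, it may help to see it on vectors small enough to check by hand. The sketch below uses plain Python and a hypothetical helper named `toy_cosine` (not part of the lab) on three 2-dimensional pairs: parallel vectors score 1, perpendicular vectors score 0, and opposite vectors score -1. Note that the length of the vectors does not affect the result, only their direction.

```python
import math

# Hypothetical helper for illustration only; the lab defines its own
# NumPy-based cosine_similarity in the next cell.
def toy_cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))        # A . B
    norm_a = math.sqrt(sum(x * x for x in a))     # ||A||
    norm_b = math.sqrt(sum(y * y for y in b))     # ||B||
    return dot / (norm_a * norm_b)

same_direction = toy_cosine([1.0, 2.0], [2.0, 4.0])    # parallel, different lengths
orthogonal = toy_cosine([1.0, 0.0], [0.0, 3.0])        # 90 degrees apart
opposite = toy_cosine([1.0, 2.0], [-1.0, -2.0])        # 180 degrees apart

print(f"same direction: {same_direction:.2f}")
print(f"orthogonal:     {orthogonal:.2f}")
print(f"opposite:       {opposite:.2f}")
```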

In [None]:
from numpy.linalg import norm

# Function to calculate cosine similarity from scratch
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (norm(v1) * norm(v2))

print(f"Query: '{query}'\n")

results = []
for i, doc in enumerate(documents):
    score = cosine_similarity(query_vector, document_vectors[i])
    results.append((doc, score))
    print(f"Score: {score:.4f} | Document: '{doc}'")

# TAKEAWAY FOR THE STUDENT:
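Notice that the sentences built around `purchased` and `bought` are mapped to nearby vectors, even though one is passive, the other is a question, and the verbs differ.
The model also treats `vehicle` and `car` as related, so compare that document's score with the first one: topical similarity alone does not capture intent.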
We didn't write a single syntax rule. Mathematics solved the semantic gap.

### Phase 4: Building a Vector Database from Scratch
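Vector Databases (Pinecone, Milvus, Qdrant) are, at their core, highly optimized engines for the same similarity computation we just did, scaled to millions or billions of vectors. At that scale they typically rely on approximate nearest-neighbor indexes rather than comparing the query against every stored vector.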

Let's do a bulk search using pure matrix multiplication (Dot Product), assuming the vectors are normalized.
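
Why is the normalization assumption enough? If $\lVert A \rVert = \lVert B \rVert = 1$, the denominator of the cosine formula is 1, so the dot product alone gives the same number. The following is a minimal sketch that checks this on random stand-in vectors; the names `toy_query` and `toy_docs` and the 8-dimensional size are illustrative assumptions, not real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is reproducible
toy_query = rng.normal(size=8)          # stand-in for a query embedding
toy_docs = rng.normal(size=(5, 8))      # stand-in for five document embeddings

# Cosine similarity computed one document at a time, straight from the definition
cosine_scores = np.array([
    np.dot(toy_query, d) / (np.linalg.norm(toy_query) * np.linalg.norm(d))
    for d in toy_docs
])

# The same scores from a single matrix-vector product on unit-length vectors
unit_query = toy_query / np.linalg.norm(toy_query)
unit_docs = toy_docs / np.linalg.norm(toy_docs, axis=1, keepdims=True)
dot_scores = unit_docs @ unit_query

print(np.allclose(cosine_scores, dot_scores))
```

Normalizing once at indexing time means each search only pays for the multiplication, which is one reason stored vectors are often kept at unit length.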

In [None]:
# 1. Normalize vectors (Length = 1) so Dot Product equals Cosine Similarity
q_norm = query_vector / norm(query_vector)
doc_norms = document_vectors / norm(document_vectors, axis=1, keepdims=True)

# 2. The Core Engine of a Vector DB: Matrix Multiplication
# Multiply the (384,) query vector by the transposed (4, 384) document matrix -> (4,) scores.
similarity_scores = np.dot(q_norm, doc_norms.T)

# 3. Retrieve the Top-K results (Sorting)
top_k = 2
best_indices = np.argsort(similarity_scores)[::-1][:top_k]

print("--- Vector Database Search Results ---")
for rank, idx in enumerate(best_indices, start=1):
    print(f"Rank {rank}: Score={similarity_scores[idx]:.4f} -> {documents[idx]}")

# Visualizing the Vector Matrix Activity
plt.figure(figsize=(12, 6))
# We map the labels to the document strings using the best_indices
labels = [documents[i] for i in best_indices]
sns.heatmap(doc_norms[best_indices], cmap="coolwarm", center=0, yticklabels=labels)
plt.title("Dimensions of the Document Vector Space")
plt.xlabel("Latent Dimensions")
plt.ylabel("") # Remove y-axis title
plt.yticks(rotation=45) # Tilt the text
plt.show()

# The top-k documents retrieved above become the 'Context' that we will feed into an LLM in Session 9 (RAG).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate the contribution to the dot product (q_i * d_i)
# This shows in which dimensions the vectors align most
contributions = q_norm * doc_norms[best_indices]

# Select the top 20 dimensions with the highest average contribution for the top results
mean_contributions = np.mean(contributions, axis=0)
top_dims = np.argsort(np.abs(mean_contributions))[-20:]

plt.figure(figsize=(15, 5))
labels = [documents[i] for i in best_indices]
# We use xticklabels=top_dims to show the actual dimension indices
sns.heatmap(contributions[:, top_dims], cmap="RdYlGn", center=0, yticklabels=labels, xticklabels=top_dims)

plt.title("Top 20 Dimensions Contributing Most to Similarity")
plt.xlabel("Vector Dimension Index")
plt.ylabel("")
plt.yticks(rotation=0)
plt.show()

print("Green blocks represent dimensions where both the query and the document align strongly.")

### What does this mean?

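These are the dimensions where the query and the best-matching documents point the same way most strongly. Treat them as a breakdown of the score rather than as named features: a single embedding dimension rarely corresponds to one human-readable concept, and notions such as 'transaction' or 'automobiles' are usually spread across many dimensions.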
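
One way to read the heatmap is as an exact decomposition of the similarity score: for unit-length vectors, the per-dimension products $q_i \cdot d_i$ add up to the cosine similarity, so positive cells raise the score and negative cells lower it. The sketch below checks that identity on random stand-in vectors; `toy_q`, `toy_d`, and the 6-dimensional size are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)                # fixed seed for reproducibility

toy_q = rng.normal(size=6)                    # stand-in query vector
toy_q /= np.linalg.norm(toy_q)                # scale to unit length

toy_d = rng.normal(size=(3, 6))               # stand-in document matrix
toy_d /= np.linalg.norm(toy_d, axis=1, keepdims=True)

toy_contributions = toy_q * toy_d             # shape (3, 6): per-dimension products
toy_scores = toy_d @ toy_q                    # shape (3,): cosine similarities

# Summing each row of contributions recovers that document's similarity score
print(np.allclose(toy_contributions.sum(axis=1), toy_scores))
```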
