# Embeddings

This lab is designed to help you solidify your understanding of embeddings by applying them to tasks like semantic similarity, clustering, and building a semantic search system.

### Tasks:
- Task 1: Semantic Similarity Comparison
- Task 2: Document Clustering
- Task 3: Semantic Search System


## Task 1: Semantic Similarity Comparison
### Objective:
Compare semantic similarity between pairs of sentences using cosine similarity and embeddings.

### Steps:
1. Load a pre-trained Sentence Transformer model.
2. Encode the sentence pairs.
3. Compute cosine similarity for each pair.

### Dataset:
- "A dog is playing in the park." vs. "A dog is running in a field."
- "I love pizza." vs. "I enjoy ice cream."
- "What is AI?" vs. "How does a computer learn?"


In [1]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentence pairs
sentence_pairs = [
    ("A dog is playing in the park.", "A dog is running in a field."),
    ("I love pizza.", "I enjoy ice cream."),
    ("What is AI?", "How does a computer learn?")
]

#YOUR CODE HERE
# Compute similarities for each pair
for i, (sent1, sent2) in enumerate(sentence_pairs, 1):
    # model.encode(): Converts text into numerical vectors (embeddings)
    # These embeddings capture semantic meaning in high-dimensional space
    embedding1 = model.encode([sent1])
    embedding2 = model.encode([sent2])
    
    # cosine_similarity(): Cosine of the angle between vectors (-1 = opposite direction, 0 = perpendicular, 1 = same direction)
    # cosine similarity = (A · B) / (|A| × |B|)
    similarity = cosine_similarity(embedding1, embedding2)[0][0]
    
    print(f"Pair {i}:")
    print(f"  Sentence 1: '{sent1}'")
    print(f"  Sentence 2: '{sent2}'")
    print(f"  Cosine Similarity: {similarity:.4f}")
    print()


Pair 1:
  Sentence 1: 'A dog is playing in the park.'
  Sentence 2: 'A dog is running in a field.'
  Cosine Similarity: 0.5220

Pair 2:
  Sentence 1: 'I love pizza.'
  Sentence 2: 'I enjoy ice cream.'
  Cosine Similarity: 0.5281

Pair 3:
  Sentence 1: 'What is AI?'
  Sentence 2: 'How does a computer learn?'
  Cosine Similarity: 0.3194
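
The formula in the code comment above, cos(θ) = (A · B) / (|A| × |B|), can be checked by hand. The following is a minimal sketch using NumPy on toy 2-D vectors; the `cosine` helper and the vectors are illustrative assumptions, not part of the lab code or real sentence embeddings.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: (a . b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D vectors (illustrative only, not real sentence embeddings)
same_direction = cosine([1, 2], [2, 4])    # parallel vectors
perpendicular = cosine([1, 0], [0, 1])     # orthogonal vectors
opposite = cosine([1, 2], [-1, -2])        # anti-parallel vectors

print(same_direction, perpendicular, opposite)
```

Note that the parallel vectors differ in length but still score as maximally similar: cosine similarity depends only on direction, not magnitude.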



### Questions:
- Which sentence pairs are the most semantically similar? Why?
- Can you think of cases where cosine similarity might fail to capture true semantic meaning?


Pair 2 is the most semantically similar pair by score (0.5281), although Pair 1 is nearly tied (0.5220). Likely contributing factors:
- Identical syntactic structure: both follow "I [positive emotion verb] [food noun]".
- Semantic parallel: "love" and "enjoy" are very close in embedding space.
- Same sentiment: both express positive feelings.
- Grammatical similarity: both use a subject + verb + object pattern.
- Concise length: shorter sentences often score higher because they contain less noise.

Key insights from these results (three pairs are too few to generalize, so treat these as hypotheses about what the embeddings capture):

1. Syntactic structure matters
   - Similar sentence patterns can score as high as pairs that share more meaning.
   - "I love X" vs. "I enjoy Y" has strong structural similarity.
2. Exact word matching is important
   - "Dog" appearing in both sentences of Pair 1 helps its score.
   - Direct lexical overlap boosts similarity scores.
3. Sentence length effects
   - Shorter sentences (Pair 2) can achieve higher similarity.
   - There is less opportunity for divergent words to reduce similarity.
4. Question type distinctions
   - The model distinguishes between different types of questions.
   - "What is" vs. "How does" are structurally different, which may contribute to Pair 3's lower score (0.3194).


Examples where it might fail:
1. Negation and Antonyms
2. Sarcasm and Irony
3. Context-Dependent Polysemy (Multiple Meanings)
4. Temporal and Causal Relationships
5. Implicit vs. Explicit Information
6. Cultural References and Idioms
7. Domain-Specific Jargon
8. Quantity and Scale Differences
9. Presupposition and Entailment
10. Emotional Intensity Gradations
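
To make the first item (negation) concrete, here is a minimal sketch using a toy bag-of-words vector instead of the lab's model. This is an assumption made so the example runs without downloading a model: real sentence embeddings are not word counts, but they can show a similar effect because a negated sentence shares most of its words and topic with the original. The `bag_of_words` helper and the vocabulary are hypothetical.

```python
import numpy as np

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    tokens = sentence.lower().replace(".", "").split()
    return np.array([tokens.count(word) for word in vocabulary], dtype=float)

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vocabulary covering both sentences (illustrative only)
vocabulary = ["i", "love", "pizza", "do", "not"]
positive = bag_of_words("I love pizza.", vocabulary)
negated = bag_of_words("I do not love pizza.", vocabulary)

# The two sentences mean opposite things but share three of five words
negation_score = cosine(positive, negated)
print(f"{negation_score:.4f}")
```

To check how the lab's model behaves, encode the same two sentences with `model.encode` and compare them with `cosine_similarity` as in the Task 1 cell.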



## Task 2: Document Clustering
### Objective:
Cluster a set of text documents into similar groups based on their embeddings.

### Steps:
1. Encode the documents using Sentence Transformers.
2. Use KMeans clustering to group the documents.
3. Analyze the clusters for semantic meaning.

In [3]:
from sklearn.cluster import KMeans

# Documents to cluster
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?"
]

# Encode documents
print("Documents to cluster:")
for i, doc in enumerate(documents):
    print(f"  {i}: {doc}")
print()

# model.encode(): Converts all documents into vector representations
doc_embeddings = model.encode(documents)
print(f"Document embeddings shape: {doc_embeddings.shape}")
print(f"Each document is represented as a {doc_embeddings.shape[1]}-dimensional vector\n")



Documents to cluster:
  0: What is the capital of France?
  1: How do I bake a chocolate cake?
  2: What is the distance between Earth and Mars?
  3: How do I change a flat tire on a car?
  4: What is the best way to learn Python?
  5: How do I fix a leaky faucet?

Document embeddings shape: (6, 384)
Each document is represented as a 384-dimensional vector



In [6]:
#YOUR CODE HERE

# Perform KMeans clustering
# KMeans: Groups similar data points together by minimizing within-cluster variance
# n_clusters=3: We expect 3 semantic groups (geography, cooking/DIY, technology)
# random_state=42: Ensures reproducible results
kmeans = KMeans(n_clusters=3, random_state=42)
cluster_labels = kmeans.fit_predict(doc_embeddings)

print("Perform KMeans clustering")
print("Done!")
print("-" * 30)


# Group documents by cluster
clusters = {}
for doc_idx, cluster_id in enumerate(cluster_labels):
    if cluster_id not in clusters:
        clusters[cluster_id] = []
    clusters[cluster_id].append((doc_idx, documents[doc_idx]))



Perform KMeans clustering
Done!
------------------------------


In [7]:
#YOUR CODE HERE
# Print cluster assignments
for cluster_id, cluster_docs in clusters.items():
    print(f"Cluster {cluster_id}:")
    for doc_idx, doc_text in cluster_docs:
        print(f"  - {doc_text}")
    print()

Cluster 2:
  - What is the capital of France?
  - What is the best way to learn Python?

Cluster 0:
  - How do I bake a chocolate cake?
  - What is the distance between Earth and Mars?

Cluster 1:
  - How do I change a flat tire on a car?
  - How do I fix a leaky faucet?



### Questions:
- How many clusters make the most sense? Why?
- Examine the documents in each cluster. Are they semantically meaningful? Can you assign a semantic "theme" to each cluster?
- Try this exercise with a larger dataset of your choice

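1. For the original 6-document dataset, 3 clusters is a reasonable choice, but the run above does not produce a clean topical split.
Why 3 clusters seemed natural: the documents suggest three themes.

- DIY/How-to: "bake cake", "change tire", "fix faucet"
- Factual/Knowledge: "capital of France", "distance Earth-Mars"
- Technology/Learning: "learn Python"

The actual KMeans output differs from this expectation: "bake cake" lands with "distance Earth-Mars", and "learn Python" lands with "capital of France".

2. Partly meaningful. Themes for the clusters as printed above:
- Cluster 1: DIY/repair ("change a flat tire", "fix a leaky faucet"), the most coherent group
- Cluster 2: "What is the ..." questions ("capital of France", "learn Python"), grouped more by phrasing than by topic
- Cluster 0: mixed ("bake a chocolate cake", "distance between Earth and Mars"), with no clear shared theme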
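
One way to check a choice of k rather than eyeballing it is the silhouette score, which is higher when points sit close to their own cluster and far from the others. The following is a minimal sketch, assuming scikit-learn's `silhouette_score`; the `points` array is synthetic stand-in data (three tight groups of 2-D points), not the lab's `doc_embeddings`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for doc_embeddings: three tight groups of 2-D points
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.2, size=(5, 2)) for c in centers])

# Fit KMeans for several values of k and record the silhouette score of each
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# Pick the k with the highest score
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

To try this on the lab data, replace `points` with `doc_embeddings`. With only six documents, the valid values of k are 2 through 5 and the scores will be noisy, so treat the result as a hint rather than a verdict.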



## Task 3: Semantic Search System
### Objective:
Create a semantic search engine:
Given a user query, search the dataset and return the 5 most semantically relevant documents.

### Dataset:
- Use the following set of documents:
    - "What is the capital of France?"
    - "How do I bake a chocolate cake?"
    - "What is the distance between Earth and Mars?"
    - "How do I change a flat tire on a car?"
    - "What is the best way to learn Python?"
    - "How do I fix a leaky faucet?"
    - "What are the best travel destinations in Europe?"
    - "How do I set up a local server?"
    - "What is quantum computing?"
    - "How do I build a mobile app?"


In [8]:
import numpy as np

# Extended documents dataset for search
search_documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?",
    "What are the best travel destinations in Europe?",
    "How do I set up a local server?",
    "What is quantum computing?",
    "How do I build a mobile app?"
]

# Compute document embeddings

#YOUR CODE HERE

print("Search corpus contains", len(search_documents), "documents\n")

# Compute document embeddings for the search corpus
search_doc_embeddings = model.encode(search_documents)




Search corpus contains 10 documents



In [9]:
# Create the search function
# This function should encode the user query and return the top N documents that most resemble it
def semantic_search(query, documents, doc_embeddings, top_n=5):
    """
    Performs semantic search using cosine similarity.

    Parameters:
    - query: User's search query (string)
    - documents: List of documents to search through
    - doc_embeddings: Pre-computed embeddings for documents
    - top_n: Number of top results to return

    Returns:
    - List of tuples: (similarity_score, document_index, document_text)
    """
    # YOUR CODE HERE

    # Encode the user query into the same embedding space
    # (uses the global `model` loaded in Task 1)
    query_embedding = model.encode([query])
    
    # Calculate cosine similarity between query and all documents
    # cosine_similarity returns a matrix, we take the first row
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
    
    # Create list of (similarity, index, document) tuples
    results = []
    for i, similarity in enumerate(similarities):
        results.append((similarity, i, documents[i]))
    
    # Sort by similarity score in descending order
    # key=lambda x: x[0] means sort by the first element (similarity score)
    results.sort(key=lambda x: x[0], reverse=True)
    
    # Return top N results
    return results[:top_n]



In [10]:
# Test the search function
test_queries = [
    "Explain programming languages.",
    "How do I cook something sweet?",
    "Tell me about space and planets.",
    "I need help with car maintenance."
]

for query in test_queries:
    print(f"Query: '{query}'")
    print("-" * 40)
    
    results = semantic_search(query, search_documents, search_doc_embeddings, top_n=3)
    
    for rank, (similarity, doc_idx, doc_text) in enumerate(results, 1):
        print(f"  {rank}. [{similarity:.4f}] {doc_text}")
    print()




Query: 'Explain programming languages.'
----------------------------------------
  1. [0.4352] What is quantum computing?
  2. [0.3188] What is the best way to learn Python?
  3. [0.1104] How do I build a mobile app?

Query: 'How do I cook something sweet?'
----------------------------------------
  1. [0.5428] How do I bake a chocolate cake?
  2. [0.2814] How do I set up a local server?
  3. [0.2382] How do I change a flat tire on a car?

Query: 'Tell me about space and planets.'
----------------------------------------
  1. [0.4337] What is the distance between Earth and Mars?
  2. [0.2180] What is quantum computing?
  3. [0.1801] What is the best way to learn Python?

Query: 'I need help with car maintenance.'
----------------------------------------
  1. [0.2768] How do I change a flat tire on a car?
  2. [0.1232] How do I build a mobile app?
  3. [0.1182] What is the best way to learn Python?
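
The ranking step in `semantic_search` builds a Python list and sorts it. For larger corpora, the same top-N selection can be done with array operations. The following is a minimal sketch using NumPy only; `top_n_by_cosine` and the toy 2-D vectors are hypothetical stand-ins for the query and document embeddings.

```python
import numpy as np

def top_n_by_cosine(query_vec, doc_matrix, top_n=3):
    """Return (index, score) pairs for the top_n rows of doc_matrix most similar to query_vec."""
    # Normalize to unit length so a dot product equals cosine similarity
    query_unit = query_vec / np.linalg.norm(query_vec)
    doc_units = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = doc_units @ query_unit
    # argsort is ascending, so reverse it and keep the first top_n indices
    order = np.argsort(scores)[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in order]

# Toy 2-D vectors standing in for real embeddings
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([2.0, 0.1])
ranking = top_n_by_cosine(toy_query, toy_docs, top_n=2)
print(ranking)
```

In the lab, `search_doc_embeddings` would take the place of `toy_docs` and `model.encode(query)` the place of `toy_query`.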



### Questions:
- What are the top-ranked results for the given queries?
- How can you improve the ranking explanation for users?
- Try this approach with a larger dataset
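
For the second question, one possible improvement is to show users a plain-language label next to each raw score. The following is a minimal sketch; the `explain_results` helper and the `min_score` cutoff are assumptions chosen for illustration, and a real threshold would need tuning on the model and data in use.

```python
def explain_results(results, min_score=0.3):
    """Format (similarity, index, text) tuples with a plain-language confidence label."""
    lines = []
    for rank, (score, _, text) in enumerate(results, 1):
        # Scores below the (hypothetical) cutoff are flagged as weak matches
        label = "likely relevant" if score >= min_score else "weak match"
        lines.append(f"{rank}. {text} (score {score:.2f}, {label})")
    return lines

# Scores copied from the "car maintenance" query output above
sample = [
    (0.2768, 3, "How do I change a flat tire on a car?"),
    (0.1232, 9, "How do I build a mobile app?"),
    (0.1182, 4, "What is the best way to learn Python?"),
]
explained = explain_results(sample, min_score=0.25)
for line in explained:
    print(line)
```

The input tuples have the same shape as the return value of `semantic_search`, so its results can be passed in directly.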

In [11]:
# ===============================
# ANALYSIS AND INSIGHTS
# ===============================

print("=" * 50)
print("ANALYSIS AND INSIGHTS")
print("=" * 50)

print("KEY CONCEPTS EXPLAINED:")
print()

print("1. EMBEDDINGS:")
print("   - Convert text into numerical vectors that capture semantic meaning")
print("   - Similar texts have similar vector representations")
print("   - Enable mathematical operations on text")
print()

print("2. COSINE SIMILARITY:")
print("   - Measures the cosine of the angle between two vectors")
print("   - Range: -1 (opposite) to 1 (identical)")
print("   - Ignores magnitude, focuses on direction/orientation")
print("   - Formula: cos(θ) = (A·B) / (|A|×|B|)")
print()

print("3. KMEANS CLUSTERING:")
print("   - Partitions data into k clusters")
print("   - Minimizes within-cluster sum of squares")
print("   - Each point belongs to cluster with nearest centroid")
print()

print("4. SEMANTIC SEARCH:")
print("   - Uses embeddings to find semantically similar content")
print("   - More powerful than keyword matching")
print("   - Can find relevant results even with different wording")

print("\nFUNCTIONS USED:")
print("- SentenceTransformer.encode(): Text → Vector embeddings")
print("- cosine_similarity(): Compute similarity between vectors")  
print("- KMeans.fit_predict(): Cluster data points")
print("- numpy operations: Array manipulation and sorting")

ANALYSIS AND INSIGHTS
KEY CONCEPTS EXPLAINED:

1. EMBEDDINGS:
   - Convert text into numerical vectors that capture semantic meaning
   - Similar texts have similar vector representations
   - Enable mathematical operations on text

2. COSINE SIMILARITY:
   - Measures the cosine of the angle between two vectors
   - Range: -1 (opposite) to 1 (identical)
   - Ignores magnitude, focuses on direction/orientation
   - Formula: cos(θ) = (A·B) / (|A|×|B|)

3. KMEANS CLUSTERING:
   - Partitions data into k clusters
   - Minimizes within-cluster sum of squares
   - Each point belongs to cluster with nearest centroid

4. SEMANTIC SEARCH:
   - Uses embeddings to find semantically similar content
   - More powerful than keyword matching
   - Can find relevant results even with different wording

FUNCTIONS USED:
- SentenceTransformer.encode(): Text → Vector embeddings
- cosine_similarity(): Compute similarity between vectors
- KMeans.fit_predict(): Cluster data points
- numpy operations: Array man