# Embeddings

This lab is designed to help you solidify your understanding of embeddings by applying them to tasks like semantic similarity, clustering, and building a semantic search system.

### Tasks:
- Task 1: Semantic Similarity Comparison
- Task 2: Document Clustering
- Task 3: Semantic Search System


## Task 1: Semantic Similarity Comparison
### Objective:
Compare semantic similarity between pairs of sentences using cosine similarity and embeddings.

### Steps:
1. Load a pre-trained Sentence Transformer model.
2. Encode the sentence pairs.
3. Compute cosine similarity for each pair.
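
The cosine similarity in step 3 is the dot product of two vectors divided by the product of their lengths. A minimal NumPy sketch of that formula follows; the helper name `cosine_sim` and the toy vectors are illustrative assumptions, not real model output, and the lab itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

# Toy vectors standing in for sentence embeddings (not real model output)
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]    # same direction as v1, twice the length
v3 = [-3.0, 0.0, 1.0]   # orthogonal to v1

print(cosine_sim(v1, v2))  # close to 1: direction matters, length does not
print(cosine_sim(v1, v3))  # close to 0: the vectors share no direction
```

Because only direction matters, two embeddings of different magnitude can still score as nearly identical.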

### Dataset:
- "A dog is playing in the park." vs. "A dog is running in a field."
- "I love pizza." vs. "I enjoy ice cream."
- "What is AI?" vs. "How does a computer learn?"


In [3]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentence pairs
sentence_pairs = [
    ("A dog is playing in the park.", "A dog is running in a field."),
    ("I love pizza.", "I enjoy ice cream."),
    ("What is AI?", "How does a computer learn?")
]

# Compute similarities
for s1, s2 in sentence_pairs:
    emb1 = model.encode([s1])
    emb2 = model.encode([s2])
    sim = cosine_similarity(emb1, emb2)[0][0]
    print(f"Similarity between:\n  '{s1}'\n  '{s2}'\n  = {sim:.4f}\n")

Similarity between:
  'A dog is playing in the park.'
  'A dog is running in a field.'
  = 0.5220

Similarity between:
  'I love pizza.'
  'I enjoy ice cream.'
  = 0.5281

Similarity between:
  'What is AI?'
  'How does a computer learn?'
  = 0.3194



### Questions:
- Which sentence pairs are the most semantically similar? Why?
- Can you think of cases where cosine similarity might fail to capture true semantic meaning?


# Answers to Task 1 Questions

# 1. Which sentence pairs are the most semantically similar? Why?
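# By score, "I love pizza." vs. "I enjoy ice cream." ranks highest (0.5281), narrowly ahead of "A dog is playing in the park." vs. "A dog is running in a field." (0.5220). The gap is too small to treat as meaningful: both pairs share a subject or sentiment and differ in detail, while the AI pair (0.3194) shares only a loose topic.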

# 2. Can you think of cases where cosine similarity might fail to capture true semantic meaning?
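# Cosine similarity over embeddings can mislead in two directions. Negation, sarcasm, or context can change the meaning while the wording stays close, so the score stays high (for example, "I like this." vs. "I do not like this."). Conversely, paraphrases that use very different vocabulary can score lower than their shared meaning warrants.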

## Task 2: Document Clustering
### Objective:
Cluster a set of text documents into similar groups based on their embeddings.

### Steps:
1. Encode the documents using Sentence Transformers.
2. Use KMeans clustering to group the documents.
3. Analyze the clusters for semantic meaning.

In [4]:
from sklearn.cluster import KMeans

# Documents to cluster
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?"
]

# Encode documents
doc_embeddings = model.encode(documents)

In [5]:
# Perform KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(doc_embeddings)

In [6]:
# Print cluster assignments
for i, doc in enumerate(documents):
    print(f"Cluster {clusters[i]}: {doc}")

Cluster 2: What is the capital of France?
Cluster 0: How do I bake a chocolate cake?
Cluster 0: What is the distance between Earth and Mars?
Cluster 1: How do I change a flat tire on a car?
Cluster 2: What is the best way to learn Python?
Cluster 1: How do I fix a leaky faucet?


### Questions:
- How many clusters make the most sense? Why?
- Examine the documents in each cluster. Are they semantically meaningful? Can you assign a semantic "theme" to each cluster?
- Try this exercise with a larger dataset of your choice.

In [None]:
# 1. How many clusters make the most sense? Why?
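# Three clusters is a reasonable starting point for six documents, but the output above only partly supports it: cluster 1 is a clean "repair how-to" group, while clusters 0 and 2 each pair two loosely related questions. With so few documents, a different cluster count or random seed could look equally plausible.

# 2. Examine the documents in each cluster. Are they semantically meaningful? Can you assign a semantic "theme" to each cluster?
# Only partly. Based on the assignments above:
# - Cluster 0: baking a chocolate cake and the Earth-Mars distance (no obvious shared theme)
# - Cluster 1: changing a flat tire and fixing a leaky faucet (repair how-to)
# - Cluster 2: the capital of France and learning Python (loosely, "What is the ..." questions)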

# 3. Try this exercise with a larger dataset of your choice.
# With a larger dataset, clusters may become more refined and reveal additional themes. Try using a dataset like Stack Overflow questions or Wikipedia articles for richer clustering.
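
One way to approach the "how many clusters" question is to compare silhouette scores across several values of k. The sketch below is an illustration under stated assumptions: the helper name `best_k_by_silhouette` is hypothetical, and the data is synthetic (three well-separated groups of random vectors) rather than real sentence embeddings. It uses scikit-learn's `KMeans` and `silhouette_score`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(embeddings, k_values, random_state=42):
    """Return (best_k, scores), where scores maps each k to its silhouette score."""
    scores = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = km.fit_predict(embeddings)
        scores[k] = silhouette_score(embeddings, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic stand-in for embeddings: three well-separated groups in 8 dimensions
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 8, [10.0] * 8, [-10.0] * 4 + [10.0] * 4])
toy_embeddings = np.vstack(
    [c + rng.normal(scale=0.5, size=(20, 8)) for c in centers]
)

best_k, scores = best_k_by_silhouette(toy_embeddings, k_values=range(2, 6))
print("best k:", best_k)
for k, s in scores.items():
    print(f"  k={k}: silhouette={s:.3f}")
```

To try it on the lab data, pass `doc_embeddings` instead of `toy_embeddings`. The silhouette score requires at least 2 clusters and fewer clusters than samples, and with only six documents the scores are likely to be noisy.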

## Task 3: Semantic Search System
### Objective:
Create a semantic search engine: given a user query, search the dataset for the most semantically relevant documents and return the top 5 results.

### Dataset:
- Use the following set of documents:
    - "What is the capital of France?"
    - "How do I bake a chocolate cake?"
    - "What is the distance between Earth and Mars?"
    - "How do I change a flat tire on a car?"
    - "What is the best way to learn Python?"
    - "How do I fix a leaky faucet?"
    - "What are the best travel destinations in Europe?"
    - "How do I set up a local server?"
    - "What is quantum computing?"
    - "How do I build a mobile app?"


In [7]:
import numpy as np

# Documents dataset
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?",
    "What are the best travel destinations in Europe?",
    "How do I set up a local server?",
    "What is quantum computing?",
    "How do I build a mobile app?"
]

# Compute document embeddings
doc_embeddings = model.encode(documents)

In [8]:
# Create the search function.
# It encodes the user query and prints the top N documents most similar to it.
def semantic_search(query, documents, doc_embeddings, top_n=5):
    query_emb = model.encode([query])
    sims = cosine_similarity(query_emb, doc_embeddings)[0]
    top_idx = np.argsort(sims)[::-1][:top_n]
    print(f"Query: {query}\nTop {top_n} results:")
    for idx in top_idx:
        print(f"  ({sims[idx]:.4f}) {documents[idx]}")

In [9]:
# Test the search function
query = "Explain programming languages."
semantic_search(query, documents, doc_embeddings)

Query: Explain programming languages.
Top 5 results:
  (0.4352) What is quantum computing?
  (0.3188) What is the best way to learn Python?
  (0.1104) How do I build a mobile app?
  (0.0911) How do I set up a local server?
  (0.0906) What are the best travel destinations in Europe?


### Questions:
- What are the top-ranked results for the given queries?
- How can you improve the ranking explanation for users?
- Try this approach with a larger dataset.

In [None]:
# 1. What are the top-ranked results for the given queries?
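# The top-ranked results are the documents with the highest cosine similarity to the query embedding. For "Explain programming languages.", the top result is "What is quantum computing?" (0.4352), followed by "What is the best way to learn Python?" (0.3188). The remaining three results score around 0.1 or below and are only weakly related to the query.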

# 2. How can you improve the ranking explanation for users?
# You can display similarity scores, highlight matching keywords, or provide a short summary explaining why each result is relevant to the query.

# 3. Try this approach with a larger dataset
# A larger dataset gives the search more candidates, so a relevant match is more likely to exist. It does not by itself improve ranking quality, and comparing the query against every document gets slower as the collection grows.
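
As one possible take on the second question above, the search could return structured results instead of printing them, so a caller can show ranks and scores or hide weak matches. This is a sketch under assumptions: the function name `rank_documents`, the `min_score` cutoff, and the toy 2-D vectors are all hypothetical, and all embeddings are assumed to be non-zero.

```python
import numpy as np

def rank_documents(query_emb, doc_embs, documents, top_n=5, min_score=0.0):
    """Return up to top_n results as dicts with rank, score, and document text."""
    q = np.asarray(query_emb, dtype=float)
    d = np.asarray(doc_embs, dtype=float)
    # Normalize so that a plain dot product equals cosine similarity
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sims = d @ q
    order = np.argsort(sims)[::-1][:top_n]
    # Scores are in descending order, so the cutoff only trims the tail
    return [
        {"rank": rank, "score": float(sims[i]), "document": documents[i]}
        for rank, i in enumerate(order, start=1)
        if sims[i] >= min_score
    ]

# Toy 2-D vectors standing in for real embeddings (not model output)
toy_docs = ["doc A", "doc B", "doc C"]
toy_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]

results = rank_documents(toy_query, toy_embs, toy_docs, top_n=3, min_score=0.5)
for r in results:
    print(f"{r['rank']}. ({r['score']:.4f}) {r['document']}")
```

With the lab's model, `model.encode(query)` and `doc_embeddings` could be passed in place of the toy vectors. The 0.5 cutoff is arbitrary and would need tuning on real data.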