# Interactive Semantic Search Workshop

## Introduction

Welcome to this interactive notebook on semantic search! This notebook is designed to be a hands-on learning environment where you'll explore, implement, and optimize semantic search algorithms and techniques.

### What is Semantic Search?

Semantic search refers to search techniques that understand the *meaning* of a query, rather than just matching keywords. It allows for more intelligent retrieval of information by understanding the context, intent, and conceptual meaning behind search queries.
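To see why keyword matching alone falls short, here is a minimal sketch. The `keyword_overlap` helper and the example strings are made up for illustration; they are not part of the workshop code.

```python
def keyword_overlap(query: str, document: str) -> int:
    """Count the lowercase words shared by the query and the document."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    return len(query_words & document_words)


query = "how to fix a car"
doc_relevant = "automobile repair guide for beginners"  # same topic, no shared words
doc_irrelevant = "how to train a dog"                    # different topic, shared filler words

overlap_relevant = keyword_overlap(query, doc_relevant)
overlap_irrelevant = keyword_overlap(query, doc_irrelevant)
print(f"Relevant document overlap:   {overlap_relevant}")
print(f"Irrelevant document overlap: {overlap_irrelevant}")
```

Pure keyword overlap ranks the off-topic document higher because it shares filler words with the query. Semantic search aims to close this gap by comparing meaning instead of surface tokens.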

### In this Workshop, You Will:

1. Learn about embeddings and how they capture semantic meaning
2. Explore document similarity using vector representations
3. Build a simple but effective semantic search engine
4. Optimize your search engine for better performance
5. Evaluate and benchmark your implementation

### How to Use This Notebook

- Read through the explanation cells carefully
- Complete the code in cells marked with TODOs
- Check your work against the solution cells (but try to solve problems yourself first!)
- Answer the checkpoint questions to reinforce your understanding
- Feel free to experiment and modify the code to see how it affects results

Let's get started!

## 1. Setup and Required Libraries

First, let's install and import the libraries we'll need for this workshop.

In [1]:
# Install required packages
!pip install -q sentence-transformers numpy scikit-learn pandas matplotlib faiss-cpu nltk

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer, util
import faiss
from functools import lru_cache
import torch
import json
import warnings
warnings.filterwarnings('ignore')

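# Download necessary NLTK data
nltk.download('punkt', quiet=True)
# Assumption: recent NLTK releases look for 'punkt_tab' instead of 'punkt'
# when tokenizing, so we request both to cover old and new versions
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)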

# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

Let's prepare a sample dataset for our exercises. We'll use a collection of news headlines and articles for our semantic search experiments.

In [None]:
# Sample dataset of news headlines and short articles
sample_articles = [
    {
        "title": "New AI Model Breaks Performance Records",
        "content": "Researchers have developed a new artificial intelligence model that surpasses previous benchmarks on multiple tasks. The model demonstrates superior performance in natural language understanding and generation."
    },
    {
        "title": "Global Climate Summit Reaches Historic Agreement",
        "content": "World leaders at the climate summit have reached a historic agreement to reduce carbon emissions by 50% by 2030. The deal includes financial commitments to support developing nations in their transition to clean energy."
    },
    {
        "title": "Tech Company Launches Revolutionary Smartphone",
        "content": "A leading technology company has unveiled its latest smartphone with groundbreaking features. The device includes a foldable screen, week-long battery life, and advanced AI capabilities."
    },
    {
        "title": "Scientists Discover Potential Cancer Treatment",
        "content": "Medical researchers have identified a new compound that shows promise in treating aggressive forms of cancer. Early clinical trials indicate the treatment may be effective with minimal side effects."
    },
    {
        "title": "Stock Market Reaches All-Time High",
        "content": "The stock market closed at a record high yesterday, with technology and healthcare sectors leading the gains. Analysts attribute the surge to strong corporate earnings and positive economic indicators."
    },
    {
        "title": "New Study Links Exercise to Improved Brain Function",
        "content": "A recent study has found that regular exercise is directly linked to enhanced cognitive performance and brain health. Participants who exercised regularly showed significant improvements in memory and problem-solving abilities."
    },
    {
        "title": "Renewable Energy Surpasses Coal for the First Time",
        "content": "For the first time in history, electricity generated from renewable sources has exceeded that from coal on a global scale. Solar and wind power installations have grown exponentially over the past decade."
    },
    {
        "title": "Major Data Breach Exposes Millions of User Accounts",
        "content": "A major online platform has reported a significant data breach affecting millions of users worldwide. The compromised data includes email addresses, passwords, and in some cases, payment information."
    },
    {
        "title": "Archeologists Uncover Ancient City in Remote Jungle",
        "content": "A team of archeologists has discovered the remains of a previously unknown ancient city deep in the jungle. The site includes elaborate temples, plazas, and residential areas dating back over 2,000 years."
    },
    {
        "title": "New Law Aims to Reduce Plastic Pollution",
        "content": "Lawmakers have passed new legislation aimed at drastically reducing single-use plastic waste. The law will ban certain plastic products and impose taxes on others to encourage more sustainable alternatives."
    },
    {
        "title": "AI System Outperforms Human Doctors in Diagnosis",
        "content": "A newly developed artificial intelligence system has demonstrated greater accuracy than human physicians in diagnosing several common diseases. The AI can analyze medical images and patient data to provide fast and accurate diagnoses."
    },
    {
        "title": "Electric Vehicle Sales Double in Past Year",
        "content": "Sales of electric vehicles have more than doubled compared to the previous year. Increased model availability, improved battery technology, and growing charging infrastructure have contributed to the surge in adoption."
    },
    {
        "title": "International Space Station Welcomes New Crew",
        "content": "A new crew of astronauts has successfully docked with the International Space Station. The international team will conduct scientific experiments and maintenance activities during their six-month mission in orbit."
    },
    {
        "title": "Education Reform Bill Passes with Bipartisan Support",
        "content": "A comprehensive education reform bill has passed with support from both major political parties. The legislation includes increased funding for schools, teacher salary improvements, and new standards for curriculum development."
    },
    {
        "title": "Global Pandemic Response Shows Signs of Success",
        "content": "Coordinated global efforts to combat the recent pandemic are showing positive results with declining infection rates in multiple countries. Vaccination campaigns and public health measures have played crucial roles in this progress."
    }
]

# Display the dataset
articles_df = pd.DataFrame(sample_articles)
print(f"Sample dataset contains {len(articles_df)} articles")
articles_df.head(3)

## 2. Basic Embeddings Exploration

### What are Embeddings?

Embeddings are dense vector representations of words, sentences, or documents that capture semantic meaning. Unlike simple one-hot encodings, embeddings place semantically similar items close together in vector space.

Modern embedding models such as BERT and its variants produce *contextual* representations: the vector for a word depends on the text around it. This lets them capture nuanced meanings, which makes them powerful tools for semantic search.
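Before loading a real model, the contrast between one-hot encodings and dense embeddings can be sketched with toy vectors. The 2-D "embeddings" below are hand-picked for illustration only; a trained model learns such vectors from data and uses far more dimensions.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


vocabulary = ["cat", "kitten", "car"]

# One-hot: each word gets its own axis, so distinct words are always orthogonal
one_hot = np.eye(len(vocabulary))

# Dense: hand-picked toy vectors (an assumption for this sketch, not model output)
dense = {
    "cat": np.array([0.9, 0.1]),
    "kitten": np.array([0.8, 0.2]),
    "car": np.array([0.1, 0.9]),
}

one_hot_cat_kitten = cosine(one_hot[0], one_hot[1])
dense_cat_kitten = cosine(dense["cat"], dense["kitten"])
dense_cat_car = cosine(dense["cat"], dense["car"])

print(f"one-hot cat vs kitten: {one_hot_cat_kitten:.2f}")
print(f"dense   cat vs kitten: {dense_cat_kitten:.2f}")
print(f"dense   cat vs car:    {dense_cat_car:.2f}")
```

With one-hot vectors every pair of distinct words has similarity 0, so "cat" is no closer to "kitten" than to "car". The dense vectors can place related words near each other.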

In [None]:
# Initialize a sentence transformer model
# We'll use a lightweight but effective model
model_name = 'paraphrase-MiniLM-L6-v2'
model = SentenceTransformer(model_name)
print(f"Loaded model: {model_name}")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")

Let's create embeddings for some simple sentences to see how semantic meaning is captured.

In [None]:
# Example sentences with different semantic relationships
sentences = [
    "I love programming in Python",               # 0
    "Python is my favorite programming language", # 1 - semantically similar to 0
    "I enjoy coding with Python",                 # 2 - semantically similar to 0 and 1
    "Programming languages are essential tools",  # 3 - related but less similar
    "Python snakes are fascinating reptiles",     # 4 - different meaning despite "Python"
    "The weather is beautiful today"              # 5 - completely different meaning
]

# TODO: Generate embeddings for these sentences using the model
# Your code here:


**Solution:**

In [None]:
# Solution
embeddings = model.encode(sentences)
print(f"Shape of embeddings: {embeddings.shape}")
print(f"Sample of first embedding vector: {embeddings[0][:5]}...")

Now, let's visualize the similarity between these sentence embeddings using a heatmap.

In [None]:
# TODO: Calculate the cosine similarity between all pairs of sentences
# and visualize the result as a heatmap
# Your code here:


**Solution:**

In [None]:
# Solution
# Calculate pairwise cosine similarity in one vectorized call
# (cosine_similarity was imported from sklearn.metrics.pairwise during setup)
similarity_matrix = cosine_similarity(embeddings)

# Create a heatmap visualization
plt.figure(figsize=(10, 8))
plt.imshow(similarity_matrix, cmap='YlOrRd')
plt.colorbar(label='Cosine Similarity')

# Add labels and annotations
# Use 0-based labels so they match the index comments next to each sentence
sentence_labels = [f"Sentence {i}" for i in range(len(sentences))]
plt.xticks(np.arange(len(sentences)), sentence_labels, rotation=45, ha='right')
plt.yticks(np.arange(len(sentences)), sentence_labels)

# Add text annotations in the cells
for i in range(len(sentences)):
    for j in range(len(sentences)):
        plt.text(j, i, f"{similarity_matrix[i, j]:.2f}",
                 ha="center", va="center", color="black" if similarity_matrix[i, j] < 0.8 else "white")

plt.title('Semantic Similarity Between Sentences')
plt.tight_layout()
plt.show()
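The pairwise comparison above can also be written directly in NumPy, which makes the definition of cosine similarity explicit: normalize each vector to unit length, then take dot products. This is a minimal sketch on toy vectors; `cosine_similarity_matrix` and `toy_embeddings` are names introduced here for illustration.

```python
import numpy as np


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Return the matrix of cosine similarities between all rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_vectors = vectors / norms          # each row now has length 1
    return unit_vectors @ unit_vectors.T    # dot products of unit vectors


toy_embeddings = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
])
toy_similarity = cosine_similarity_matrix(toy_embeddings)
print(np.round(toy_similarity, 2))
```

Note that the third vector is twice as long as the first but still has similarity 0 with it: cosine similarity depends only on direction, not on magnitude.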

### Checkpoint Questions:

1. What do you observe in the similarity matrix? Which sentences are most similar to each other and why?
2. Does the presence of the word 'Python' automatically make sentences semantically similar?
3. How well does the model capture the meaning of the sentences?

## 3. Document Similarity Comparison

Now that we've explored basic similarities between sentences, let's apply these concepts to compare longer documents. We'll use our news article dataset for this exercise.

In [None]:
# First, let's combine the title and content of each article to create full documents
documents = []
for article in sample_articles:
    doc = f"{article['title']}. {article['content']}"
    documents.append(doc)

print(f"Created {len(documents)} documents for analysis")
print(f"Sample document: {documents[0][:100]}...")

### Document Embedding Strategies

When dealing with longer documents, there are several embedding strategies to consider:

1. **Full document embedding**: Encode the entire document at once (text beyond the model's maximum input length is truncated)
2. **Chunk-based embedding**: Break document into chunks, encode each chunk, then average/pool
3. **Hierarchical embedding**: Combine sentence-level embeddings with document-level structure

For this exercise, we'll start with the simplest approach - full document embedding.
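For comparison, here is a rough sketch of the chunk-based strategy with mean pooling. To keep it runnable without downloading a model, it uses `toy_embed`, a hypothetical stand-in that hashes words into a small vector; in the notebook you would call `model.encode` on the chunks instead. The chunk size and overlap are arbitrary values chosen for the example.

```python
import zlib

import numpy as np


def toy_embed(text: str, dim: int = 16) -> np.ndarray:
    """Hypothetical stand-in for model.encode: hash each word into a bucket."""
    vector = np.zeros(dim)
    for word in text.lower().split():
        vector[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vector


def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list:
    """Split text into word chunks; consecutive chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def embed_long_document(text: str, chunk_size: int = 8, overlap: int = 2) -> np.ndarray:
    """Embed each chunk, then average the chunk vectors (assumes non-empty text)."""
    chunks = chunk_words(text, chunk_size, overlap)
    chunk_vectors = np.stack([toy_embed(chunk) for chunk in chunks])
    return chunk_vectors.mean(axis=0)


long_text = " ".join(f"word{i}" for i in range(20))
print(chunk_words(long_text))
print(embed_long_document(long_text).shape)
```

Averaging is the simplest pooling choice. It treats every chunk as equally important, which may blur a document that covers several topics.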

In [None]:
# TODO: Generate embeddings for the documents and calculate their similarity
# Implement a function that takes two document indices and returns their similarity score
# Your code here:

def document_similarity(doc_idx1, doc_idx2):
    # Your implementation here
    pass

# Also create a function to find the most similar document to a given document
def find_most_similar(doc_idx, top_k=3):
    # Your implementation here
    pass

**Solution:**

In [None]:
# Generate document embeddings
doc_embeddings = model.encode(documents)
print(f"Generated embeddings for {len(doc_embeddings)} documents with dimension {doc_embeddings.shape[1]}")

def document_similarity(doc_idx1, doc_idx2):
    """Calculate the cosine similarity between two documents"""
    emb1 = doc_embeddings[doc_idx1]
    emb2 = doc_embeddings[doc_idx2]
    similarity = util.pytorch_cos_sim(emb1, emb2).item()
    return similarity

def find_most_similar(doc_idx, top_k=3):
    """Find the top-k most similar documents to the given document"""
    query_embedding = doc_embeddings[doc_idx]
    
    # Calculate similarities with all documents
    similarities = [util.pytorch_cos_sim(query_embedding, doc_emb).item() 
                 for doc_emb in doc_embeddings]
    
    # Get top k indices (excluding the query document itself)
    sorted_indices = np.argsort(similarities)[::-1]
    top_indices = [idx for idx in sorted_indices if idx != doc_idx][:top_k]
    
    return [(idx, similarities[idx]) for idx in top_indices]

# Example usage
print("Example document:")
example_idx = 0
print(f"Title: {sample_articles[example_idx]['title']}")
print("\nMost similar documents:")
similar_docs = find_most_similar(example_idx)
for idx, score in similar_docs:
    print(f"- {sample_articles[idx]['title']} (similarity: {score:.3f})")

## 4. Building a Simple Search Engine

Now let's build a basic semantic search engine that can find relevant articles based on natural language queries.

### Basic Search Implementation

We'll start with a simple implementation that:

1. Takes a query string
2. Converts it to an embedding
3. Finds the most similar documents

In [None]:
class SimpleSemanticSearch:
    def __init__(self, model_name='paraphrase-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None
    
    def index_documents(self, documents):
        """Index the documents by computing their embeddings"""
        self.documents = documents
        self.embeddings = self.model.encode(documents)
        return self
    
    def search(self, query, top_k=3):
        """Search for documents similar to the query"""
        # Encode the query
        query_embedding = self.model.encode(query)
        
        # Calculate similarities
        similarities = [util.pytorch_cos_sim(query_embedding, doc_emb).item() 
                     for doc_emb in self.embeddings]
        
        # Get top k results
        top_indices = np.argsort(similarities)[::-1][:top_k]
        return [(self.documents[idx], similarities[idx]) for idx in top_indices]

# Initialize and index our documents
search_engine = SimpleSemanticSearch()
search_engine.index_documents(documents)

# Try some searches
print("Testing the search engine:\n")
test_queries = [
    "latest developments in artificial intelligence",
    "environmental protection and climate change",
    "medical breakthroughs and healthcare"
]

for query in test_queries:
    print(f"Query: {query}")
    results = search_engine.search(query)
    print("Top results:")
    for doc, score in results:
        print(f"- Score {score:.3f}: {doc[:100]}...\n")
    print("-" * 80 + "\n")
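The search method above scores documents one at a time in a Python loop. As a sketch of an alternative, the document embeddings can be normalized once at indexing time so that each query needs only a single matrix-vector product. The helper names and toy vectors below are assumptions for illustration, not part of `SimpleSemanticSearch`.

```python
import numpy as np


def normalize_rows(matrix: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (guarding against zero vectors)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)


def top_k_search(query_vector: np.ndarray, unit_doc_matrix: np.ndarray, top_k: int = 3) -> list:
    """Return (index, cosine score) pairs for the best-matching documents."""
    query_unit = query_vector / max(np.linalg.norm(query_vector), 1e-12)
    scores = unit_doc_matrix @ query_unit           # cosine similarity for every document
    top_indices = np.argsort(scores)[::-1][:top_k]  # highest scores first
    return [(int(i), float(scores[i])) for i in top_indices]


# Toy 2-D "document embeddings", normalized once up front
toy_docs = normalize_rows(np.array([
    [1.0, 0.0],
    [0.7, 0.7],
    [0.0, 1.0],
    [-1.0, 0.0],
]))

toy_results = top_k_search(np.array([2.0, 0.1]), toy_docs, top_k=2)
for index, score in toy_results:
    print(f"doc {index}: {score:.3f}")
```

For the 15 articles in this workshop the difference is negligible, but the vectorized form tends to matter as the collection grows.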

## 5. Performance Optimization

Let's optimize our search engine for better performance with larger document collections.
Key improvements will include:
- FAISS indexing for faster similarity search
- Batch processing for document encoding
- Result caching

## End of Notebook

