# Day 4 – Semantic Search Mini-Project

This notebook builds a **mini semantic search engine** using the concepts learned earlier:
- Convert a list of sentences into **embeddings** using a Transformer model.
- Store these embeddings in memory (can be extended to a database or FAISS later).
- Take a user query, embed it, and compute **cosine similarity** with stored embeddings.
- Retrieve and rank the most similar sentences.

This is the foundation of modern **semantic search engines, chatbots, and recommendation systems**.

In [None]:
# Import libraries & prepare data
# We'll use a small dataset of sentences for now.

from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Set a manual seed for reproducibility
torch.manual_seed(42)

# Sample dataset (can be replaced with your own text corpus)
sentences = [
    "Machine learning is fascinating.",
    "I love exploring natural language processing.",
    "Transformers are the backbone of modern NLP.",
    "Deep learning enables powerful AI systems.",
    "I enjoy hiking in the mountains.",
    "The weather today is sunny and bright.",
    "Neural networks learn from data."
]

print(f"Loaded {len(sentences)} sentences for semantic search.")

### Generating Sentence Embeddings

To perform semantic search, each sentence must be converted into a numerical vector
(embedding) that captures its meaning.

Steps:
1. Tokenize all sentences with the model's tokenizer.
2. Pass tokens through a Transformer model (e.g., MiniLM, BERT).
3. Apply **mean pooling** to get a single embedding per sentence.
4. Store these embeddings for similarity search.

We'll use a lightweight sentence-transformer model:
`sentence-transformers/all-MiniLM-L6-v2`
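
To make step 3 concrete, here is a minimal pure-Python sketch of masked mean pooling on a toy example. The token vectors and mask are made-up numbers and `mean_pool` is a hypothetical helper name; the notebook's own version works on PyTorch tensors instead of lists.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    real_tokens = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:  # padding positions have mask 0 and are skipped
            real_tokens += 1
            for j, value in enumerate(vector):
                totals[j] += value
    return [total / real_tokens for total in totals]

# Toy sentence: two real tokens followed by one padding token (made-up values).
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector `[9.0, 9.0]` does not affect the result, which is the point of multiplying by the attention mask before averaging.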

In [None]:
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Tokenize sentences
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return (token_embeddings * mask).sum(1) / mask.sum(1)

sentence_embeddings = mean_pooling(outputs, inputs['attention_mask'])

print("Generated sentence embeddings with shape:", sentence_embeddings.shape)

### Visualizing Sentence Embeddings

High-dimensional embeddings (384 dimensions in MiniLM) are hard to interpret directly.
We use **PCA (Principal Component Analysis)** to project them into 2D.

Why?
- Helps see if semantically similar sentences appear close together.
- Serves as a quick sanity check before running search.

Each point represents one sentence from our dataset.
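
A 2D projection can discard much of the information in the original vectors, so it may help to check how much variance the two components retain. The sketch below mirrors what PCA computes, using NumPy's SVD on random stand-in data (the array `X` is an assumption, not the notebook's embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 384))  # stand-in for 7 sentence embeddings of 384 dims

# PCA via SVD: center the data, then project onto the top right-singular vectors.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ Vt[:2].T

# Fraction of the total variance captured by each component.
explained_ratio = (S ** 2) / np.sum(S ** 2)
print(X_2d.shape)
print(f"Variance kept by 2 components: {explained_ratio[:2].sum():.1%}")
```

With scikit-learn, a fitted `PCA` object exposes the same quantity as `explained_variance_ratio_`. A low value suggests the 2D plot should be read as a rough picture only.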

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Apply PCA to reduce to 2D
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(sentence_embeddings)

# Plot
plt.figure(figsize=(8, 6))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], color='blue', s=100)

# Annotate sentences
for i, sentence in enumerate(sentences):
    plt.annotate(f"{i+1}", (embeddings_2d[i, 0] + 0.02, embeddings_2d[i, 1]))

plt.title("Sentence Embeddings Visualized with PCA")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.grid(True)
plt.show()

### Implementing Semantic Search

How it works:
1. **Embed the query** → Convert input text into a vector.
2. **Compute cosine similarity** between query vector and all sentence embeddings.
3. **Sort results** from most to least similar.
4. **Return top N matches**.

Cosine similarity is used because it compares the direction of two vectors (the angle
between them) and ignores their magnitude, so embeddings that point the same way score
highly even when their lengths differ.
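
As a small worked example of why the angle matters more than the magnitude, the sketch below computes cosine similarity by hand for a few made-up vectors. The helper `cosine` is a hypothetical stand-in for `sklearn.metrics.pairwise.cosine_similarity`, written with the standard library only.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]     # same direction, 10x the magnitude
orthogonal = [3.0, 0.0, -1.0]   # dot product with a is 0
opposite = [-1.0, -2.0, -3.0]   # points the other way

print(round(cosine(a, scaled), 6))      # 1.0
print(round(cosine(a, orthogonal), 6))  # 0.0
print(round(cosine(a, opposite), 6))    # -1.0

# Euclidean distance, by contrast, is large for `scaled` despite the identical direction.
print(math.dist(a, scaled))
```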

In [None]:
def semantic_search(query, top_k=3):
    # Tokenize and embed query
    encoded_query = tokenizer(query, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        query_output = model(**encoded_query)
        # ** is Python's argument unpacking operator for dictionaries.
        # encoded_query is a dictionary returned by the tokenizer.
        # The model's forward() method expects separate named arguments.
        # **encoded_query automatically unpacks the dictionary into keyword arguments

    query_embedding = mean_pooling(query_output, encoded_query['attention_mask'])

    # Compute cosine similarity
    similarities = cosine_similarity(query_embedding, sentence_embeddings)[0]

    # Get top-k most similar sentences
    top_indices = similarities.argsort()[::-1][:top_k]

    print(f"\nQuery: {query}\n")
    for idx in top_indices:
        print(f"Score: {similarities[idx]:.4f} | Sentence: {sentences[idx]}")

# Example search
semantic_search("I am interested in artificial intelligence.")

### Using Wikipedia as a Text Source

Instead of manually typing sentences, we'll fetch real-world content
from Wikipedia using the `wikipedia` Python library.

Why?
- Provides diverse and relevant content for AI/ML/Data Science topics.
- Makes our semantic search project more realistic and recruiter-friendly.
- Easy to expand into a larger dataset later.

Steps:
1. Install and import `wikipedia`.
2. Choose relevant topics (e.g., Artificial Intelligence, Machine Learning, Deep Learning).
3. Extract text content.
4. Clean and store it as sentences.
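
The fetch loop in this notebook stores each page summary as a single document. If sentence-level entries are wanted for step 4, one simple option is a regex-based splitter like the sketch below. `split_into_sentences` is a hypothetical helper, and the rule is naive: it will split incorrectly on abbreviations such as "e.g.".

```python
import re

def split_into_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    # Collapse newlines and repeated spaces first.
    cleaned = re.sub(r"\s+", " ", text).strip()
    parts = re.split(r"(?<=[.!?])\s+", cleaned)
    return [part for part in parts if part]

# Made-up text standing in for a Wikipedia summary.
summary = "Machine learning is a field of study.\nIt builds models from data!  Is it new? Not really."
for sentence in split_into_sentences(summary):
    print(sentence)
```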

In [None]:
# Install the wikipedia library (if not already installed)
!pip install wikipedia

import wikipedia

topics = [
    # Core AI/ML topics
    'Artificial Intelligence',
    'Machine Learning',
    'Deep Learning',
    'Neural Networks',
    'Generative AI',
    'Computer Vision',
    'Large Language Model',
    'Retrieval-augmented generation',
    'Object Detection',
    'Face Recognition',
    'Natural Language Processing',
    'Image Processing',

    # Data Science & Analytics topics
    'Data Science',
    'Data Mining',
    'Big Data',
    'Data Analytics',
    'Predictive Analytics',
    'Statistical Modeling',
    'Data Visualization',
    'Exploratory Data Analysis',
    'Data Cleaning',
    'ETL (Extract Transform Load)',
    'Business Intelligence',
    'Data Warehousing',
    'Feature Engineering',
    'Time Series Analysis',
    'Reinforcement Learning',
    'Anomaly Detection',
    'Data Governance',
    'Data Ethics',
    'Cloud Computing for Data Science'
]

In [None]:
# Fetch a summary for each topic.
# corpus holds only the successfully fetched summaries.
# successful_topics holds the matching page titles, in the same order.

successful_topics = []  # track which topics succeeded
corpus = []             # your summaries

for topic in topics:
    try:
        search_results = wikipedia.search(topic)
        if not search_results:
            print(f"No results for: {topic}")
            continue
        page = wikipedia.page(search_results[0])
        corpus.append(page.summary)
        successful_topics.append(page.title)  # store actual page title
        print(f"Fetched: {page.title}")
    except wikipedia.exceptions.DisambiguationError as e:
        print(f"Skipped {topic} due to disambiguation: {e}")
    except wikipedia.exceptions.PageError:
        print(f"Page not found for: {topic}")

print(f"\nTotal successful topics: {len(successful_topics)}")

In [None]:
print(successful_topics)

### Generating Embeddings for Wikipedia Corpus

Now that our dataset is populated with real-world AI/ML/Data Science content,
we need to convert each article summary into an embedding.

Steps:
1. Tokenize all summaries in one batch (summaries longer than the model's maximum input length are truncated).
2. Pass through the Transformer model.
3. Apply mean pooling to get a single embedding per topic.
4. Store them for semantic search.
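
Tokenizing the whole corpus in one call is fine for a few dozen summaries. For a larger corpus, one common option is to process fixed-size batches to keep memory use bounded. The sketch below shows only the batching logic; `batched` is a hypothetical helper and the batch size of 8 is an arbitrary choice.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

docs = [f"summary {i}" for i in range(20)]  # placeholder documents
batch_sizes = [len(batch) for batch in batched(docs, batch_size=8)]
print(batch_sizes)  # [8, 8, 4]
```

Each batch would then go through the tokenizer, the model, and mean pooling, with the pooled tensors joined at the end (for example with `torch.cat`).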

In [None]:
# Tokenize Wikipedia corpus
inputs = tokenizer(corpus, padding=True, truncation=True, return_tensors="pt")

# Forward pass through model
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling for sentence embeddings
wiki_embeddings = mean_pooling(outputs, inputs['attention_mask'])

print("Generated embeddings for Wikipedia corpus:", wiki_embeddings.shape)

### Visualizing Wikipedia Corpus Embeddings & Running Semantic Search

Let's visualize the high-dimensional embeddings in 2D using PCA.

Why?
- To see how topics are distributed in semantic space.
- Verify that related topics appear closer together (e.g., Machine Learning & Deep Learning).

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Make sure topics correspond only to successfully fetched summaries
fetched_topics = [successful_topics[i] for i in range(len(corpus))]  # align with corpus length

# Reduce dimensions to 2D
pca = PCA(n_components=2)
wiki_embeddings_2d = pca.fit_transform(wiki_embeddings)

# Plot the topics
plt.figure(figsize=(10, 7))
plt.scatter(wiki_embeddings_2d[:, 0], wiki_embeddings_2d[:, 1], color='blue', s=80)

# Annotate each point with the topic name (shortened for clarity)
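for i, topic in enumerate(fetched_topics):
    # Only add an ellipsis when the title is actually truncated
    label = topic if len(topic) <= 15 else topic[:15] + "..."
    plt.annotate(label, (wiki_embeddings_2d[i, 0] + 0.02, wiki_embeddings_2d[i, 1]))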

plt.title("Wikipedia AI/ML/Data Science Topics – PCA Visualization")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.grid(True)
plt.show()

### Cosine Similarity Heatmap for Wikipedia Topics

After generating embeddings for each topic, we can compute the pairwise
cosine similarity between all topics.

Why a heatmap?
- Quickly shows which topics are most closely related.
- With the `coolwarm` colormap, warmer (red) cells indicate higher similarity and cooler (blue) cells lower similarity.
- Helps identify natural groupings (e.g., Machine Learning & Deep Learning
are expected to be highly similar).

We will use `sklearn.metrics.pairwise.cosine_similarity` to compute
the similarity matrix and `seaborn` for visualization.
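
To show what the similarity matrix contains before plotting it, here is a small standard-library sketch on made-up 2D vectors. The topic names are taken from the list above, but the vectors and the `cosine` helper are illustrative assumptions, not real embeddings or the scikit-learn function.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Made-up 2D "embeddings" for three topics.
vectors = {
    "Machine Learning": [0.9, 0.1],
    "Deep Learning": [0.8, 0.2],
    "Data Governance": [0.1, 0.9],
}
names = list(vectors)

# Entry [i][j] holds the similarity between topic i and topic j.
matrix = [[cosine(vectors[a], vectors[b]) for b in names] for a in names]

# Find the most similar pair of distinct topics (upper triangle only).
best_score, best_pair = max(
    (matrix[i][j], (names[i], names[j]))
    for i in range(len(names))
    for j in range(i + 1, len(names))
)
print(best_pair, round(best_score, 3))
```

The matrix is symmetric with ones on the diagonal, so only the upper triangle carries new information. That is why the search for the best pair skips `j <= i`.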

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import seaborn as sns
import numpy as np

# Compute cosine similarity matrix for embeddings
similarity_matrix = cosine_similarity(wiki_embeddings)

# Adjust topics list for fetched ones
fetched_topics = [successful_topics[i] for i in range(len(corpus))]

# Plot heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(similarity_matrix, xticklabels=fetched_topics, yticklabels=fetched_topics,
            cmap="coolwarm", annot=False, fmt=".2f", square=True, cbar=True)

plt.title("Cosine Similarity Heatmap – Wikipedia AI/ML/Data Science Topics")
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
#Performing Semantic Search

def semantic_search_wikipedia(query, top_k=5):
    # Tokenize and embed query
    encoded_query = tokenizer(query, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        query_output = model(**encoded_query)
    query_embedding = mean_pooling(query_output, encoded_query['attention_mask'])

    # Compute cosine similarity
    similarities = cosine_similarity(query_embedding, wiki_embeddings)[0]

    # Get top-k most similar topics
    top_indices = similarities.argsort()[::-1][:top_k]

    print(f"\nQuery: {query}\n")
    for idx in top_indices:
        print(f"Score: {similarities[idx]:.4f} | Topic: {successful_topics[idx]}")
        print(f"Summary: {corpus[idx][:300]}...\n")

# Example search
semantic_search_wikipedia("Explain the basics of computer vision")

### Saving Cosine Similarity Matrix for Future Use

To reuse the computed similarity matrix without recalculating embeddings each time,
we will export it in two formats:

1. **CSV** – Easy to open and inspect in Excel or Google Sheets.
2. **JSON** – Convenient for programmatic use in APIs or web apps.
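
As a sketch of the reuse this enables, the example below writes a tiny similarity mapping to JSON and reads it back. The scores are made up, and the nested layout (topic → topic → score) is an assumption about how the file is organized; a temporary directory is used so nothing is left on disk.

```python
import json
import os
import tempfile

# Made-up scores; the nested layout (topic -> topic -> score) is an assumption.
similarity = {
    "Machine Learning": {"Machine Learning": 1.0, "Deep Learning": 0.82},
    "Deep Learning": {"Machine Learning": 0.82, "Deep Learning": 1.0},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "similarity.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(similarity, f, indent=4)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded["Machine Learning"]["Deep Learning"])  # 0.82
```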

In [None]:
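import os
import pandas as pd
import json

# Create folder if it doesn't exist
folder_name = "artifacts"
os.makedirs(folder_name, exist_ok=True)

# Wrap the matrix in a DataFrame labeled with topic titles on both axes
df_similarity = pd.DataFrame(similarity_matrix, index=successful_topics, columns=successful_topics)

# Save similarity matrix as CSV
csv_path = os.path.join(folder_name, "wikipedia_similarity_matrix.csv")
df_similarity.to_csv(csv_path)
print(f"Saved: {csv_path}")

# Build a JSON-friendly nested dict (topic -> topic -> score); this layout is one possible choice.
# float() converts NumPy scalars to plain Python floats, which json can serialize.
similarity_json = {
    row_topic: {col_topic: float(score) for col_topic, score in zip(successful_topics, row)}
    for row_topic, row in zip(successful_topics, similarity_matrix)
}

# Save as JSON
json_path = os.path.join(folder_name, "wikipedia_similarity_matrix.json")
with open(json_path, "w") as f:
    json.dump(similarity_json, f, indent=4)
print(f"Saved: {json_path}")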

In [None]:
# Save Wikipedia corpus as CSV
corpus_df = pd.DataFrame({"Topic": successful_topics, "Summary": corpus})
corpus_csv_path = os.path.join(folder_name, "wikipedia_corpus.csv")
corpus_df.to_csv(corpus_csv_path, index=False)
print(f"Saved: {corpus_csv_path}")

# Save as JSON
corpus_json = [{"topic": t, "summary": s} for t, s in zip(successful_topics, corpus)]
corpus_json_path = os.path.join(folder_name, "wikipedia_corpus.json")
with open(corpus_json_path, "w") as f:
    json.dump(corpus_json, f, indent=4)
print(f"Saved: {corpus_json_path}")

### Semantic Search Function

We will now redefine `semantic_search` so that it searches the Wikipedia corpus and returns its results instead of printing them. The function:
1. Takes a user query as input.
2. Converts the query into an embedding using the same model as the corpus.
3. Computes cosine similarity between the query embedding and each topic's embedding.
4. Returns the top N most semantically similar topics with their scores.

This is the core functionality of any semantic search engine before adding a UI.
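
The ranking logic in steps 3 and 4 can be sketched without any model, using made-up 3D vectors in place of real embeddings. `rank_documents` and `cosine` are hypothetical helper names used only for this illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_documents(query_vector, doc_vectors, top_k=2):
    """Return the top_k (name, score) pairs, highest similarity first."""
    scored = [(name, cosine(query_vector, vec)) for name, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Made-up vectors standing in for topic embeddings and a query embedding.
doc_vectors = {
    "Computer Vision": [0.9, 0.1, 0.0],
    "Time Series Analysis": [0.1, 0.9, 0.1],
    "Data Ethics": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.3, 0.0]

ranked = rank_documents(query_vector, doc_vectors)
for name, score in ranked:
    print(f"{score:.4f}  {name}")
```

Returning `(name, score)` pairs rather than printing keeps the function reusable by both a plotting helper and a UI callback.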

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def semantic_search(query, top_k=5):
    """
    Perform semantic search on the Wikipedia corpus.

    Args:
        query (str): User's search query.
        top_k (int): Number of top results to return.

    Returns:
        List of tuples: [(topic, similarity_score, summary), ...]
    """
    # Tokenize the query
    encoded_query = tokenizer(query, padding=True, truncation=True, return_tensors="pt")

    # Generate query embedding
    with torch.no_grad():
        query_output = model(**encoded_query)

    query_embedding = mean_pooling(query_output, encoded_query['attention_mask'])

    # Compute cosine similarity with all corpus embeddings
    similarities = cosine_similarity(query_embedding, wiki_embeddings)[0]

    # Sort by similarity score (descending)
    top_indices = np.argsort(similarities)[::-1][:top_k]

    # Collect results
    results = [(successful_topics[i], similarities[i], corpus[i]) for i in top_indices]

    return results

# Example query
query = "What is used to make machines learn like humans?"
results = semantic_search(query, top_k=5)

for rank, (topic, score, summary) in enumerate(results, start=1):
    print(f"{rank}. {topic} (score: {score:.4f})")
    print(f"   Summary: {summary[:200]}...\n")

### Visualizing Semantic Search Results

To better understand which topics are most semantically related to the query,
we can plot their similarity scores as a bar chart.

- **Y-axis:** Topics retrieved from the search, with the most similar at the top.
- **X-axis:** Cosine similarity scores.
- Longer bars mean greater semantic relevance to the query.

In [None]:
import matplotlib.pyplot as plt

def visualize_search_results(results, query):
    """
    Visualize semantic search results as a bar chart.
    Args:
        results (list): Output of semantic_search (topic, score, summary)
        query (str): The search query for title
    """
    topics = [r[0] for r in results]
    scores = [r[1] for r in results]

    plt.figure(figsize=(10, 5))
    bars = plt.barh(topics, scores, color='skyblue')
    plt.xlabel("Cosine Similarity")
    plt.title(f"Semantic Search Results for Query: \"{query}\"")
    plt.gca().invert_yaxis()  # Highest similarity at top

    # Annotate scores on bars
    for bar, score in zip(bars, scores):
        plt.text(bar.get_width() + 0.01, bar.get_y() + bar.get_height()/2,
                 f"{score:.2f}", va='center')

    plt.tight_layout()
    plt.show()

# Visualize the previous search results
visualize_search_results(results, query)

## Building a Minimal Gradio Semantic Search UI

We will create a lightweight user interface using Gradio to:
1. Accept a user query.
2. Perform semantic search on the Wikipedia corpus.
3. Display the top N matching topics and their summaries.

This is not the full deployment (that comes on Day 7),
but a functional prototype for interactive testing.
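
The step that turns search results into display text can be tested on its own, without launching a UI. The sketch below is one possible formatter: `format_results` is a hypothetical name, the sample result is made up, and `textwrap.shorten` is used so that the "..." placeholder appears only when a summary is actually cut.

```python
import textwrap

def format_results(results, max_chars=300):
    """Render (topic, score, summary) tuples as a Markdown string."""
    lines = []
    for rank, (topic, score, summary) in enumerate(results, start=1):
        # shorten() collapses whitespace and adds the placeholder only when truncating.
        snippet = textwrap.shorten(summary, width=max_chars, placeholder="...")
        lines.append(f"### {rank}. {topic} (Score: {score:.4f})")
        lines.append(snippet)
        lines.append("")
    return "\n".join(lines)

# Made-up result tuple in the (topic, score, summary) shape.
sample = [("Computer vision", 0.7312, "Computer vision tasks include methods for acquiring images.")]
formatted = format_results(sample)
print(formatted)
```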


In [None]:
!pip install gradio --quiet

In [None]:
import gradio as gr

def semantic_search_interface(query, top_k=5):
    results = semantic_search(query, top_k=top_k)
    display_text = ""
    for rank, (topic, score, summary) in enumerate(results, start=1):
        display_text += f"### {rank}. {topic} (Score: {score:.4f})\n"
        display_text += f"{summary[:300]}...\n\n"
    return display_text

# Define Gradio UI
with gr.Blocks() as demo:
    gr.Markdown("# 🔍 Semantic Search – Wikipedia AI/ML Topics")
    query_input = gr.Textbox(label="Enter your statement related to AI/ML:")
    top_k_slider = gr.Slider(1, 10, value=5, step=1, label="Number of results")
    gr.Markdown("## Output: (NOTE - The output will be a list of topics that "
                "match the context of your input statement)")
    output_box = gr.Markdown()
    search_button = gr.Button("Search")

    search_button.click(fn=semantic_search_interface,
                        inputs=[query_input, top_k_slider],
                        outputs=output_box)

# Launch the app
demo.launch()

## SUMMARY

### What We Did
- Built a **semantic search engine** using Wikipedia topics on AI/ML/Data Science.
- Converted each topic's summary into **sentence embeddings** using a Transformer model.
- Implemented a **search function** to retrieve most relevant topics based on cosine similarity.
- Added **visualization** (bar chart) to interpret results.
- Created a **minimal Gradio UI** for interactive semantic search.

---

### Key Concepts Learned

#### **Semantic Search**
- Instead of keyword matching, we use **semantic meaning** to find relevant documents.
- Queries and documents are both represented as **vectors in the same embedding space**.

#### **Cosine Similarity**
- Measures how close two vectors are, based on the angle between them.
- **1.0 = same direction** (very similar meaning), **0.0 = orthogonal** (unrelated); negative values down to -1.0 indicate opposing directions.

#### **Embeddings**
- Generated using a pre-trained model (`all-MiniLM-L6-v2`, a compact BERT-style encoder).
- Each document (Wikipedia summary) → 384-dimensional vector.

#### **Visualization**
- Applied **PCA (Principal Component Analysis)** to reduce embeddings from 384D → 2D.
- Plotted topics to observe **semantic clusters**.

#### **Gradio Interface**
- Built a minimal UI to:
  - Input query
  - Select number of results (`top_k`)
  - Display ranked results with summaries

---

### Key Functions, Modules & Parameters
- **transformers**
  - `AutoTokenizer`, `AutoModel` for encoding text
- **torch**
  - `torch.no_grad()` for inference efficiency
  - `manual_seed()` for reproducibility
- **sklearn**
  - `cosine_similarity()` for semantic relevance
  - `PCA()` for visualization
- **gradio**
  - `Blocks()`, `Textbox()`, `Slider()`, `Markdown()`, `Button()` for the UI

---

### Outputs
- **Cosine Similarity Matrix** saved as:
  - `artifacts/wikipedia_similarity_matrix.csv`
  - `artifacts/wikipedia_similarity_matrix.json`
- **Corpus (topics + summaries)** saved as:
  - `artifacts/wikipedia_corpus.csv`
  - `artifacts/wikipedia_corpus.json`
- **Interactive Gradio App** for semantic search

---

### Why This Matters
- Demonstrates the foundation of **intelligent search engines** (e.g., web search, retrieval for chatbots, document search).
- Shows how to move from **raw text → embeddings → search interface**.
- Prepares the ground for:
  - **Day 5: Fine-tuning** (LoRA/PEFT)
  - **Day 6: RAG (Retrieval-Augmented Generation)**
  - **Day 7: Deployment with a polished UI**
