# Semantic Search Engine Demo

This notebook demonstrates how to use sentence embeddings to perform semantic search. Semantic search matches text by its *meaning* rather than by exact keywords.

## What You'll Learn
- How to use pre-trained embedding models
- How to convert text into numerical vectors (embeddings)
- How to find semantically similar documents using cosine similarity


## Step 1: Install Required Libraries

First, we need to install the necessary libraries. Run this cell once to install the dependencies.


In [1]:
# Install the necessary library
%pip install sentence-transformers numpy scikit-learn



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.2[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## Step 2: Import Libraries

Import the required libraries for our semantic search implementation.


In [2]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


## Step 3: Load the Embedding Model

We'll use a pre-trained model from Hugging Face. The `all-MiniLM-L6-v2` model is a lightweight but effective model that converts text into 384-dimensional vectors.

### Model Architecture

**It's a Transformer neural network.** Specifically:
- **Type**: Transformer encoder (6 layers)
- **Base**: MiniLM, a compact BERT-style model (BERT = Bidirectional Encoder Representations from Transformers) produced by knowledge distillation
- **Purpose**: Optimized for sentence-level embeddings
- **Size**: ~80MB, produces 384-dimensional vectors

Transformers use **self-attention mechanisms** to understand relationships between words in a sentence, allowing them to capture semantic meaning effectively.
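To make "self-attention" concrete, here is a toy single-head scaled dot-product attention in NumPy. The sizes and random weights are illustrative assumptions: the real model uses learned weights, 384-dimensional hidden states, and several attention heads per layer.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) token vectors; w_q, w_k, w_v: (d_model, d_head) projections.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (seq_len, seq_len) pairwise relevance
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ v, weights                     # each output token mixes all value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8                  # toy sizes, not MiniLM's real ones
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape)          # (4, 8): one updated vector per token
print(attn.sum(axis=-1))  # each row of attention weights sums to 1 (up to rounding)
```

Each row of `attn` says how much one token "looks at" every other token; stacking such layers is what lets the model capture context.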

### What This Model Can and Cannot Do

**✅ What it CAN do:**
- Create embeddings (convert text to numerical vectors)
- Find semantically similar documents/texts
- Calculate similarity scores between texts
- Perform semantic search and retrieval

**❌ What it CANNOT do:**
- Generate text or chat (it's not a generative model)
- Answer questions with generated responses
- Create new content
- Have conversations

**Key Distinction:** This is an **embedding model**, not a **generative/chat model**. It turns text into vectors; comparing those vectors lets you find the closest matches in your corpus, but the model never generates new text. For chat capabilities, you'd need models like GPT, Llama, or Claude.

**Important Notes:**
- **Local Execution**: The model runs **locally** on your machine (not via remote API). All processing happens on your CPU/GPU.
- **First Download**: The first time you run this, it will download the model (about 80MB) from Hugging Face Hub and cache it locally.
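- **Offline Capable**: Subsequent runs load the cached copy, so no internet connection is needed after the initial download. To stop the library from checking the Hub for updates, you can set the environment variable `HF_HUB_OFFLINE=1`.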
- **Privacy**: Since everything runs locally, your data never leaves your machine.


In [3]:
# --- 1. Load the Embedding Model ---
# This loads the model from the Hugging Face Hub (sentence-transformers/all-MiniLM-L6-v2)
print("Loading model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model loaded successfully.")


Loading model...




Model loaded successfully.


## Step 4: Define the Knowledge Base (Corpus)

This is our collection of documents that we want to search through. In a real application, this could be thousands or millions of documents.


In [4]:
# --- 2. Define the Knowledge Base (Corpus) ---
documents = [
    "The sky is a vivid shade of blue today.",
    "The newest iPhone model was released with a powerful new chip.",
    "A majestic hawk was spotted flying high above the forest canopy.",
    "Apple is set to announce its latest mobile device with updated features.",
    "I'm enjoying a picnic on the grass.",
]


## Step 5: Create Embeddings for the Corpus

Convert each document in our corpus into a numerical vector (embedding). These embeddings capture the semantic meaning of the text.

### How Embeddings Work

**One-to-One Mapping:** Each chunk of text (sentence/document) gets converted into **one vector**.
- 5 sentences → 5 vectors
- Each vector has 384 dimensions (for this model)
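- The meaning of the whole text is compressed into that single vector (this model truncates inputs longer than 256 word pieces, so very long texts should be split into chunks first)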

**Key Concept:** Similar meanings will have similar vectors, even if they use different words!

**Example:**
- Sentence: "The newest iPhone model was released..."
- → Single vector: `[0.23, -0.45, 0.12, ..., 0.67]` (384 numbers)
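Internally, the transformer produces one vector per *token*; `sentence-transformers` then combines them into one vector per text. For this model the combination is mean pooling followed by normalization to unit length. The sketch below mimics those two steps on toy numbers (2 "sentences", 4-dimensional vectors); it illustrates the idea and is not the library's actual code.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # avoid divide-by-zero
    return summed / counts

def l2_normalize(vectors):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Toy batch: 2 sentences, up to 3 tokens each, 4-dimensional token vectors.
token_embeddings = np.array([
    [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],  # 2 real tokens + 1 pad
    [[0.0, 1.0, 0.0, 1.0], [0.0, 3.0, 0.0, 1.0], [0.0, 2.0, 0.0, 1.0]],  # 3 real tokens
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

sentence_vectors = l2_normalize(mean_pool(token_embeddings, attention_mask))
print(sentence_vectors.shape)  # (2, 4): one vector per sentence, regardless of length
```

The padding token in the first sentence is excluded by the mask, which is why texts of different lengths still map to exactly one vector each.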


In [5]:
# --- 3. Create Embeddings for the Corpus ---
# The .encode() function converts the text into numerical vectors (embeddings)
document_embeddings = model.encode(documents, convert_to_tensor=True)
print(f"Generated {len(document_embeddings)} embeddings, each with a dimension of {document_embeddings.shape[1]}.")


Generated 5 embeddings, each with a dimension of 384.


### Visualizing the Embedding Structure

Let's examine the shape and structure of our embeddings to understand the one-to-one mapping.


In [6]:
# Examine the embedding structure
print(f"Shape of document_embeddings: {document_embeddings.shape}")
print(f"\nThis means:")
print(f"  - {document_embeddings.shape[0]} documents → {document_embeddings.shape[0]} vectors")
print(f"  - Each vector has {document_embeddings.shape[1]} dimensions")
print(f"\nExample: First document embedding (first 10 values):")
print(document_embeddings[0][:10].cpu().numpy())


Shape of document_embeddings: torch.Size([5, 384])

This means:
  - 5 documents → 5 vectors
  - Each vector has 384 dimensions

Example: First document embedding (first 10 values):
[ 0.02667101  0.02384197  0.08926792  0.03606723  0.02770695  0.00857649
  0.06938126 -0.07234967  0.02515115  0.01264223]


## Step 6: Define a Query and Create its Embedding

Now we'll create a search query. Notice that our query doesn't use the exact same words as the documents, but it should still find the relevant document about phones!
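For contrast, a naive keyword-overlap baseline (not part of the original pipeline) shows why plain keyword matching struggles here. The block redefines `documents` and `query` so it runs on its own, and treats any run of letters or apostrophes as a word. The Apple document, which semantic search ranks highest below, shares no words at all with the query, while every other document shares only "the".

```python
import re

documents = [
    "The sky is a vivid shade of blue today.",
    "The newest iPhone model was released with a powerful new chip.",
    "A majestic hawk was spotted flying high above the forest canopy.",
    "Apple is set to announce its latest mobile device with updated features.",
    "I'm enjoying a picnic on the grass.",
]
query = "Tell me about the recent phone technology releases."

def words(text):
    """Lowercase the text and return its set of words (letters and apostrophes)."""
    return set(re.findall(r"[a-z']+", text.lower()))

query_words = words(query)
overlap_counts = []
for doc in documents:
    shared = sorted(query_words & words(doc))
    overlap_counts.append(len(shared))
    print(f"{len(shared)} shared word(s) {shared}: {doc}")
```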


In [7]:
# --- 4. Define a Query and Create its Embedding ---
query = "Tell me about the recent phone technology releases."
query_embedding = model.encode([query], convert_to_tensor=True)


## Step 7: Perform Semantic Search (Calculate Similarity)

We calculate the cosine similarity between the query embedding and all document embeddings.

**Cosine Similarity:**
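- Ranges from -1 to 1 and measures the angle between two vectors, ignoring their length
- Values close to 1 indicate high semantic similarity
- Values near 0 (or slightly negative) indicate unrelated text; in practice sentence embeddings rarely approach -1, so a negative score does not mean "opposite meaning"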

We'll find the document with the highest similarity score.
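The formula behind the score is $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. A minimal NumPy version on small hand-picked vectors shows the three reference cases; `sklearn`'s `cosine_similarity` computes the same quantity for every pair of rows at once.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 0.0])
print(cosine_sim(a, 3 * a))                       # ≈ 1.0  (same direction; length is ignored)
print(cosine_sim(a, np.array([-2.0, 1.0, 0.0])))  # 0.0   (orthogonal: unrelated directions)
print(cosine_sim(a, -a))                          # ≈ -1.0 (opposite direction)
```

When vectors are already normalized to unit length, as this model's outputs are, the denominator is 1 and cosine similarity reduces to a plain dot product.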


In [8]:
# --- 5. Perform Semantic Search (Calculate Similarity) ---
# We calculate the cosine similarity between the query embedding and ALL document embeddings.
# Cosine similarity ranges from -1 to 1; higher scores mean more similar meaning.
similarities = cosine_similarity(query_embedding.cpu().numpy(), document_embeddings.cpu().numpy())

# Get the index of the most similar document (similarities has shape (1, n_documents))
most_similar_index = int(np.argmax(similarities[0]))
max_similarity_score = similarities[0, most_similar_index]
best_match_document = documents[most_similar_index]


## Step 8: Display Results

Let's see which document was found as the best match for our query!


In [9]:
# --- 6. Print Results ---
print("\n" + "="*50)
print(f"Query: **{query}**")
print("="*50)
print(f"Best Match (Score: {max_similarity_score:.4f}):")
print(f"'{best_match_document}'")



Query: **Tell me about the recent phone technology releases.**
Best Match (Score: 0.5846):
'Apple is set to announce its latest mobile device with updated features.'


## Finding Embedding Dimensions on Hugging Face

When exploring models on Hugging Face, you can find the embedding dimension in several ways:

### 1. Model Card
- Look at the model's main page description
- Check the "Model Card" section for specifications
- Often listed as "embedding dimension", "hidden size", or "model dimension"

### 2. Config File
- Go to the "Files and versions" tab
- Open `config.json`
- Look for parameters like:
  - `hidden_size` - internal representation size
  - `embedding_size` - in some architectures (e.g. ALBERT) the size of the input token embeddings, which can differ from the output dimension
  - `d_model` - model dimension (common in transformers)

### 3. Model Usage Example
- Check code examples in the model card
- Often shows the output shape/dimension

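**Example:** For `all-MiniLM-L6-v2`, the config shows `hidden_size: 384`. That equals the sentence embedding dimension here because the model averages its token vectors without an extra projection layer; models that add a dense projection after pooling can output a different size.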
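Once a model is loaded, `sentence-transformers` can report the output size directly with `model.get_sentence_embedding_dimension()` (384 for this model). If you only have the downloaded `config.json`, a small lookup like the one below works; the JSON string is a trimmed, hypothetical excerpt containing just the field used here, and `find_dimension` is our own helper.

```python
import json

# Hypothetical trimmed excerpt of a config.json; real files contain many more fields.
config_text = '{"hidden_size": 384}'

def find_dimension(config):
    """Return (key, value) for the first dimension-like key present, else None."""
    for key in ("hidden_size", "d_model"):
        if key in config:
            return key, config[key]
    return None

print(find_dimension(json.loads(config_text)))  # ('hidden_size', 384)
```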


## Embedding Dimensions: Embedding Models vs. Chatbot Models

**Important Distinction:** There are two types of models with different purposes:

### Embedding Models (for Semantic Search)
These produce **output embeddings** for similarity search:

| Model | Embedding Dimension | Use Case |
|-------|-------------------|----------|
| `all-MiniLM-L6-v2` | **384** | Lightweight semantic search |
| `all-mpnet-base-v2` | **768** | Higher quality embeddings |
| `BGE-base-en-v1.5` | **768** | General-purpose embeddings |
| `e5-large-v2` | **1024** | High-quality English embeddings |

### Chatbot Models (Generative LLMs)
These have **internal/hidden dimensions**, but they are not trained to produce embeddings for similarity search:

| Model | Hidden Dimension | Type | Note |
|-------|-----------------|------|-----|
| **GPT-3.5** | Not publicly disclosed | Generative | Internal representation, not for embeddings |
| **GPT-4** | Not publicly disclosed | Generative | Architecture details unpublished |
| **Llama 3.1 8B** | **4096** | Generative | Internal hidden size |
| **Llama 3.1 70B** | **8192** | Generative | Internal hidden size |
| **Claude 3** | Not publicly disclosed | Generative | Architecture details unpublished |
| **BERT-base** | **768** | Encoder (not a chat model) | Shown for comparison; can be used for embeddings |

**Key Point:** Chatbot models are **generative** (create text), while embedding models are **retrieval-focused** (find similar text). They serve different purposes!


## Step 9: Explore All Similarity Scores (Optional)

Let's see the similarity scores for all documents to better understand how the semantic search works.


In [9]:
# Display similarity scores for all documents
print("\nSimilarity scores for all documents:")
print("-" * 50)
for i, (doc, score) in enumerate(zip(documents, similarities[0])):
    print(f"\nDocument {i+1} (Score: {score:.4f}):")
    print(f"  '{doc}'")



Similarity scores for all documents:
--------------------------------------------------

Document 1 (Score: 0.0485):
  'The sky is a vivid shade of blue today.'

Document 2 (Score: 0.5450):
  'The newest iPhone model was released with a powerful new chip.'

Document 3 (Score: -0.0547):
  'A majestic hawk was spotted flying high above the forest canopy.'

Document 4 (Score: 0.5846):
  'Apple is set to announce its latest mobile device with updated features.'

Document 5 (Score: -0.0379):
  'I'm enjoying a picnic on the grass.'


## Try It Yourself!

Experiment with different queries to see how semantic search works:

- Try queries about nature, technology, or daily activities
- Notice how the model finds relevant documents even when they don't share exact keywords
- Compare the similarity scores to understand the ranking
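To look beyond the single best match, a small ranking helper makes it easy to print the top few results. `top_k_search` is a hypothetical helper written for this sketch; it works on plain NumPy arrays, so in this notebook you could call `top_k_search(model.encode(query), document_embeddings.cpu().numpy(), documents, k=3)`. The demo below uses toy 2-D vectors in place of real embeddings.

```python
import numpy as np

def top_k_search(query_vec, doc_vecs, docs, k=3):
    """Rank documents by cosine similarity to the query and return the k best."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                          # one cosine score per document
    order = np.argsort(scores)[::-1][:k]    # indices sorted from best to worst
    return [(docs[i], float(scores[i])) for i in order]

# Toy 2-D vectors standing in for real 384-dimensional embeddings.
docs = ["about phones", "about birds", "about picnics"]
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])

results = top_k_search(np.array([0.9, 0.1]), doc_vecs, docs, k=2)
for doc, score in results:
    print(f"{score:.3f}  {doc}")
```

Normalizing inside the helper means it works whether or not the embeddings were already unit length.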


## From Semantic Search to RAG (Retrieval-Augmented Generation)

Combining an embedding model with a chat model creates a **RAG (Retrieval-Augmented Generation)** system, one of the most common applications of semantic embeddings.

### How RAG Works

**Step 1: Retrieval (Embedding Model)**
- User asks a question: *"What are the latest iPhone features?"*
- Embedding model searches your knowledge base (documents, articles, etc.)
- Finds the most relevant text chunks using semantic similarity

**Step 2: Augmentation (Context Injection)**
- The retrieved relevant documents are passed as context to the chat model
- This gives the chat model specific, relevant information to work with

**Step 3: Generation (Chat Model)**
- The chat model (GPT, Claude, Llama, etc.) generates an answer
- It uses the retrieved context + its training knowledge
- Produces a well-informed, contextually relevant response
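The augmentation step is mostly string assembly. Below is a minimal sketch with a hypothetical prompt template: `build_rag_prompt` is our own helper, not part of any library, and the instruction wording is an assumption you would tune. The resulting string would be sent to whichever chat model you use; no particular chat API is assumed.

```python
def build_rag_prompt(question, retrieved_docs):
    """Assemble a prompt that injects retrieved documents as numbered context."""
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs, start=1))
    return (
        "Answer the question using only the context below. "
        "Cite sources by their [number]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What are the latest iPhone features?",
    [
        "The newest iPhone model was released with a powerful new chip.",
        "Apple is set to announce its latest mobile device with updated features.",
    ],
)
print(prompt)
```

Numbering the context passages is what makes the "Transparency" benefit below possible: the model can cite `[1]` or `[2]`, and you can map those back to source documents.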

### Why RAG is Powerful

✅ **Reduces Hallucinations**: Chat models can make up information. RAG grounds answers in retrieved documents, which reduces (but does not eliminate) this risk.

✅ **Domain-Specific Knowledge**: You can use your own documents/knowledge base without retraining the model.

✅ **Up-to-Date Information**: Add new documents without retraining; just embed them and add them to your knowledge base.

✅ **Transparency**: You can see which documents were used to generate the answer (citations).

### Example RAG Pipeline

```
User Question
    ↓
[Embedding Model] → Finds relevant documents from knowledge base
    ↓
[Retrieved Context] → "iPhone 15 Pro features: A17 Pro chip, titanium design..."
    ↓
[Chat Model] → Generates answer using context
    ↓
Final Answer: "Based on the latest information, the iPhone 15 Pro features..."
```

**This notebook demonstrates Step 1 (Retrieval).** To build a full RAG system, you'd add a chat model for Step 3!


In [10]:
# --- 4. Define a Query and Create its Embedding ---
query = "do people see hawks under the rain?"
query_embedding = model.encode([query], convert_to_tensor=True)
# --- 5. Perform Semantic Search (Calculate Similarity) ---
# We calculate the cosine similarity between the query embedding and ALL document embeddings.
# Cosine similarity ranges from -1 to 1; higher scores mean more similar meaning.
similarities = cosine_similarity(query_embedding.cpu().numpy(), document_embeddings.cpu().numpy())

# Get the index of the most similar document (similarities has shape (1, n_documents))
most_similar_index = int(np.argmax(similarities[0]))
max_similarity_score = similarities[0, most_similar_index]
best_match_document = documents[most_similar_index]
# --- 6. Print Results ---
print("\n" + "="*50)
print(f"Query: **{query}**")
print("="*50)
print(f"Best Match (Score: {max_similarity_score:.4f}):")
print(f"'{best_match_document}'")



Query: **do people see hawks under the rain?**
Best Match (Score: 0.4604):
'A majestic hawk was spotted flying high above the forest canopy.'
