# Semantic Search Engine Demo

This notebook demonstrates how to use text embeddings to perform semantic search. Semantic search matches text by its *meaning* rather than by exact keywords.

## What You'll Learn
- How to use pre-trained embedding models
- How to convert text into numerical vectors (embeddings)
- How to find semantically similar documents using cosine similarity


## Step 1: Install Required Libraries

First, we need to install the necessary libraries. Run this cell once to install the dependencies.


In [1]:
# Install the necessary libraries
%pip install sentence-transformers numpy scikit-learn


Note: you may need to restart the kernel to use updated packages.


## Step 2: Import Libraries

Import the required libraries for our semantic search implementation.


In [2]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity




## Step 3: Load the Embedding Model

We'll use `all-MiniLM-L6-v2`, a lightweight pre-trained model from the Hugging Face Hub that converts text into 384-dimensional vectors.

**Note:** The first time you run this cell, it downloads the model (about 80 MB). Subsequent runs use the cached version.


In [3]:
# --- 1. Load the Embedding Model ---
# This loads the model from the Hugging Face Hub (sentence-transformers/all-MiniLM-L6-v2)
print("Loading model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model loaded successfully.")


Loading model...
Model loaded successfully.


## Step 4: Define the Knowledge Base (Corpus)

This is our collection of documents that we want to search through. In a real application, this could be thousands or millions of documents.


In [4]:
# --- 2. Define the Knowledge Base (Corpus) ---
documents = [
    "The sky is a vivid shade of blue today.",
    "The newest iPhone model was released with a powerful new chip.",
    "A majestic hawk was spotted flying high above the forest canopy.",
    "Apple is set to announce its latest mobile device with updated features.",
    "I'm enjoying a picnic on the grass.",
]


## Step 5: Create Embeddings for the Corpus

Convert each document in our corpus into a numerical vector (embedding). These embeddings capture the semantic meaning of the text.

**Key Concept:** Texts with similar meanings produce similar vectors, even when they use different words!


In [5]:
# --- 3. Create Embeddings for the Corpus ---
# The .encode() method converts the text into numerical vectors (embeddings)
document_embeddings = model.encode(documents, convert_to_tensor=True)
print(f"Generated {len(document_embeddings)} embeddings, each with a dimension of {document_embeddings.shape[1]}.")


Generated 5 embeddings, each with a dimension of 384.


## Step 6: Define a Query and Create its Embedding

Now we'll create a search query. Notice that the query doesn't use the same words as the documents, but it should still find the relevant documents about phones!


In [6]:
# --- 4. Define a Query and Create its Embedding ---
query = "Tell me about the recent phone technology releases."
query_embedding = model.encode([query], convert_to_tensor=True)
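To see why plain keyword matching falls short for this query, the sketch below counts the words each document shares with it. The `tokenize` helper is a deliberately simplistic stand-in introduced here for illustration; it is not part of any library used in this notebook.

```python
import re

documents = [
    "The sky is a vivid shade of blue today.",
    "The newest iPhone model was released with a powerful new chip.",
    "A majestic hawk was spotted flying high above the forest canopy.",
    "Apple is set to announce its latest mobile device with updated features.",
    "I'm enjoying a picnic on the grass.",
]
query = "Tell me about the recent phone technology releases."


def tokenize(text):
    """Lowercase the text and return its words as a set (hypothetical helper)."""
    return set(re.findall(r"[a-z']+", text.lower()))


query_words = tokenize(query)
# For each document, the set of words it has in common with the query.
shared_words = [query_words & tokenize(doc) for doc in documents]

for doc, shared in zip(documents, shared_words):
    print(f"{len(shared)} shared word(s) {sorted(shared)}: {doc}")
```

With this tokenizer, the only word any document shares with the query is "the", and the document about Apple's upcoming mobile device shares none at all, so an exact-keyword search would have nothing useful to rank on.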


## Step 7: Perform Semantic Search (Calculate Similarity)

We calculate the cosine similarity between the query embedding and all document embeddings.

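**Cosine Similarity:**
- Ranges from -1 (vectors point in opposite directions) to 1 (vectors point in the same direction)
- Values close to 1 indicate high semantic similarity
- Values close to 0 indicate little or no semantic relationship; for sentence embeddings, unrelated texts typically score near 0 rather than near -1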

We'll find the document with the highest similarity score.
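For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on made-up 2-D vectors; the `cosine` helper and the toy vectors are illustrative assumptions, whereas the notebook itself relies on scikit-learn's `cosine_similarity` and 384-dimensional embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors chosen only to hit the three reference cases.
same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel, different lengths
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # at right angles
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # pointing the opposite way

print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three results are 1, 0, and -1. The first case also shows that vector length does not matter, only direction.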


In [7]:
# --- 5. Perform Semantic Search (Calculate Similarity) ---
# We calculate the cosine similarity between the query embedding and ALL document embeddings.
# Cosine similarity ranges from -1 (opposite directions) to 1 (same direction).
similarities = cosine_similarity(query_embedding.cpu().numpy(), document_embeddings.cpu().numpy())

# Get the index of the most similar document.
# similarities has shape (1, num_documents), so we search its only row.
most_similar_index = int(np.argmax(similarities[0]))
max_similarity_score = similarities[0, most_similar_index]
best_match_document = documents[most_similar_index]


## Step 8: Display Results

Let's see which document was found as the best match for our query!


In [8]:
# --- 6. Print Results ---
print("\n" + "="*50)
print(f"Query: **{query}**")
print("="*50)
print(f"Best Match (Score: {max_similarity_score:.4f}):")
print(f"'{best_match_document}'")



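==================================================
Query: **Tell me about the recent phone technology releases.**
==================================================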
Best Match (Score: 0.5846):
'Apple is set to announce its latest mobile device with updated features.'


## Step 9: Explore All Similarity Scores (Optional)

Let's see the similarity scores for all documents to better understand how the semantic search works.


In [9]:
# Display similarity scores for all documents
print("\nSimilarity scores for all documents:")
print("-" * 50)
for i, (doc, score) in enumerate(zip(documents, similarities[0])):
    print(f"\nDocument {i+1} (Score: {score:.4f}):")
    print(f"  '{doc}'")



Similarity scores for all documents:
--------------------------------------------------

Document 1 (Score: 0.0485):
  'The sky is a vivid shade of blue today.'

Document 2 (Score: 0.5450):
  'The newest iPhone model was released with a powerful new chip.'

Document 3 (Score: -0.0547):
  'A majestic hawk was spotted flying high above the forest canopy.'

Document 4 (Score: 0.5846):
  'Apple is set to announce its latest mobile device with updated features.'

Document 5 (Score: -0.0379):
  'I'm enjoying a picnic on the grass.'
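A search engine usually returns a ranked list rather than a single best match. As a minimal sketch, the scores printed above can be sorted to get the top results; the `top_k` helper is a hypothetical name introduced here, and the scores are copied from this notebook's output rather than recomputed.

```python
documents = [
    "The sky is a vivid shade of blue today.",
    "The newest iPhone model was released with a powerful new chip.",
    "A majestic hawk was spotted flying high above the forest canopy.",
    "Apple is set to announce its latest mobile device with updated features.",
    "I'm enjoying a picnic on the grass.",
]
# Scores copied from the notebook output for the example query.
scores = [0.0485, 0.5450, -0.0547, 0.5846, -0.0379]


def top_k(scores, k):
    """Return (index, score) pairs for the k highest scores, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


top_two = top_k(scores, k=2)
for idx, score in top_two:
    print(f"{score:.4f}  {documents[idx]}")
```

For this query, the two phone-related documents come out on top, with the Apple announcement first and the iPhone release second.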


## Try It Yourself!

Experiment with different queries to see how semantic search works:

- Try queries about nature, technology, or daily activities
- Notice how the model finds relevant documents even when they don't share exact keywords
- Compare the similarity scores to understand the ranking
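One thing to watch for while experimenting: picking the highest score always returns some document, even when nothing in the corpus is relevant to the query. A minimum-score cutoff is one simple way to handle that. In the sketch below, `MIN_SCORE = 0.3` and the `filter_matches` helper are hypothetical choices for illustration; a sensible cutoff depends on the model and the data.

```python
# Scores copied from the notebook output for the example query.
scores = [0.0485, 0.5450, -0.0547, 0.5846, -0.0379]

MIN_SCORE = 0.3  # hypothetical cutoff; tune it for your own model and corpus


def filter_matches(scores, min_score):
    """Keep only (index, score) pairs whose score reaches the cutoff."""
    return [(i, s) for i, s in enumerate(scores) if s >= min_score]


matches = filter_matches(scores, MIN_SCORE)
print(matches if matches else "No sufficiently similar document found.")
```

With these scores, only the two phone-related documents (indices 1 and 3) pass the cutoff, while the sky, hawk, and picnic documents are dropped.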
