# Using Embeddings for Semantic Search

In this notebook, we will explore how **embeddings** can be used to build a simple search engine for text data.  
Instead of relying on keyword matching, embeddings allow us to capture the **semantic meaning** of text. This means that even if the words are not exactly the same, we can still find relevant results.

### What you will learn
- How to **load and preprocess text data** (in this case, a collection of books).  
- How to use a **pretrained embedding model** (`SentenceTransformer`) to convert each book into a vector.  
- How to measure **similarity between a query and the books** using **cosine similarity**.  
- How to build a simple **search function** that returns the most relevant books for a user’s query.  

### Why embeddings?
Traditional search engines often rely on keywords:  
- Searching for *"dog"* will only find documents containing the exact word *"dog"*.  

With embeddings, the model learns meaning:  
- Searching for *"puppy"* may also retrieve documents about *dogs*, because the two are semantically related.  
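To make the idea concrete, here is a minimal sketch of cosine similarity on hand-written toy vectors. The three-dimensional vectors and their values are made up for illustration only. They are not produced by any real model, and a real embedding model typically outputs vectors with hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (illustrative values, not real model output)
toy_vectors = {
    "dog":   [0.9, 0.8, 0.1],
    "puppy": [0.8, 0.9, 0.2],
    "car":   [0.1, 0.2, 0.9],
}

sim_dog_puppy = cosine(toy_vectors["dog"], toy_vectors["puppy"])
sim_dog_car = cosine(toy_vectors["dog"], toy_vectors["car"])

print(f"dog vs puppy: {sim_dog_puppy:.3f}")
print(f"dog vs car:   {sim_dog_car:.3f}")
```

Vectors that point in nearly the same direction score close to 1, so in this toy setup *"puppy"* ranks far above *"car"* for the query *"dog"*, even though the strings share no characters.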

### Limitations of this demo
- For simplicity, we embed only the first part of each book (its first 800 words) as a **single vector**.  
  In real-world systems, texts are usually split into **smaller chunks** (paragraphs or pages) for better retrieval.  
- Computing embeddings for thousands of books is slow, so the embedding is done in advance and stored in a `.pkl` file. At search time, we only need to find the books most similar to your query.
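The chunking approach mentioned above is not used in this notebook, but it can be sketched roughly as follows. The function name `chunk_words` and the `chunk_size` and `overlap` defaults are assumptions chosen for illustration; real systems often chunk by paragraph, page, or token count instead.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of at most `chunk_size` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

# Tiny demonstration with 10 "words" and a window of 4 words, overlapping by 2
demo_text = " ".join(str(i) for i in range(10))
demo_chunks = chunk_words(demo_text, chunk_size=4, overlap=2)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk would then be embedded separately, and a search would return the best-matching chunks rather than whole books. The overlap keeps a sentence that straddles a chunk boundary from being lost to both neighbors.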


In [1]:
!pip install --upgrade sentence-transformers scikit-learn tqdm
!pip install --upgrade transformers

Defaulting to user installation because normal site-packages is not writeable


In [5]:
# Step 1: Import the required libraries.
import warnings

warnings.filterwarnings("ignore")

import os
import re
import pickle
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

<b> First, we load the books and do some simple preprocessing. </b>

In [6]:
# --- Configuration ---
# Define key variables here for easy modification
DATA_DIR = 'data/'
EMBEDDING_FILE = 'book_snippet_embeddings.pkl'
MAX_WORDS_PER_BOOK = 800  # We'll use the first 800 words of each book for the demo

# Step 2: Define Functions for Data Loading and Preprocessing

def load_books(data_dir):
    """Loads all .txt files from the specified directory into a dictionary."""
    books = {}
    print(f"Loading books from '{data_dir}'...")
    for filename in os.listdir(data_dir):
        if filename.endswith(".txt"):
            with open(os.path.join(data_dir, filename), 'r', encoding='utf-8') as f:
                books[filename] = f.read()
    print(f"####Loaded {len(books)} books.")
    return books

def preprocess_and_truncate(books):
    """
    Cleans and truncates the text of each book to MAX_WORDS_PER_BOOK.
    This creates the snippets we will embed.
    """
    book_snippets = {}
    print(f"Processing and truncating books to {MAX_WORDS_PER_BOOK} words...")
    for filename, text in books.items():
        # Remove non-alphabetic characters and convert to lowercase
        processed_text = re.sub(r'[^a-zA-Z\s]', '', text).lower()
        
        # Split into words and take the first MAX_WORDS_PER_BOOK
        words = processed_text.split()
        snippet = " ".join(words[:MAX_WORDS_PER_BOOK])
        book_snippets[filename] = snippet
        
    print("############### Snippets created.")
    return book_snippets

# Step 3: Load and Process the Data

# Load the original full text of the books
all_books_full_text = load_books(DATA_DIR)

# Create the shorter snippets for embedding
book_snippets = preprocess_and_truncate(all_books_full_text)

Loading books from 'data/'...
####Loaded 5689 books.
Processing and truncating books to 800 words...
############### Snippets created.
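To see what the cleaning step does to a piece of text, here is a small standalone sketch that applies the same regular expression and truncation to a made-up sentence. The helper name `clean_and_truncate`, the sample string, and the limit of 6 words are assumptions for illustration only.

```python
import re

def clean_and_truncate(text, max_words):
    """Keep only ASCII letters and whitespace, lowercase, then keep the first max_words words."""
    cleaned = re.sub(r'[^a-zA-Z\s]', '', text).lower()
    return " ".join(cleaned.split()[:max_words])

# Made-up sample text containing digits, punctuation, and a line break
sample = "CHAPTER 1.\nIt was the year 1805; Napoleon's army marched on!"
snippet = clean_and_truncate(sample, max_words=6)
print(snippet)  # chapter it was the year napoleons
```

Note that this pattern removes digits (the year `1805` disappears) and any non-ASCII letters as well, so a word like `café` would become `caf`. That is acceptable for a demo on English text, but it is worth keeping in mind for other languages.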


<b> Next, using SentenceTransformer and NumPy, we embed the book snippets and store the embeddings on disk. </b>

In [7]:
# Step 4: Generate or Load Embeddings

# Initialize the SentenceTransformer model. Make sure it runs on a GPU; otherwise embedding will take a very long time.
model = SentenceTransformer('Qwen/Qwen3-Embedding-0.6B', device="cuda")

# Check if embeddings already exist to avoid recomputing
if os.path.exists(EMBEDDING_FILE):
    print("Loading existing embeddings from disk...")
    with open(EMBEDDING_FILE, 'rb') as f:
        book_embeddings = pickle.load(f)
    print("############ Embeddings loaded.")
else:
    print("Generating new embeddings for book snippets...")
    # Get the list of filenames and their corresponding text snippets
    filenames = list(book_snippets.keys())
    snippets_to_embed = list(book_snippets.values())

    # Generate embeddings for all snippets in one batched call (fast on a GPU)
    embeddings = model.encode(
        snippets_to_embed,
        batch_size=8,  # Can use a larger batch size if you have more memory on GPU
        show_progress_bar=True,
        convert_to_numpy=True
    )
    
    # Map filenames back to their embeddings
    book_embeddings = {fn: emb for fn, emb in zip(filenames, embeddings)}

    # Save the embeddings to a file for future use
    with open(EMBEDDING_FILE, 'wb') as f:
        pickle.dump(book_embeddings, f)
    print(f"############# Embeddings computed and saved to '{EMBEDDING_FILE}'.")


Loading existing embeddings from disk...
############ Embeddings loaded.
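The "compute once, then load from disk" pattern used in the cell above can be isolated into a small helper. The sketch below is one possible way to write it; `load_or_compute` and `fake_embed` are hypothetical names, and `fake_embed` returns hard-coded numbers in place of a real model call so that the example runs without a GPU.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute_fn):
    """Return the cached object if cache_path exists; otherwise compute it and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []  # records how many times the expensive function actually runs

def fake_embed():
    """Stand-in for an expensive embedding step (hard-coded values, no model)."""
    calls.append(1)
    return {"book_a.txt": [0.1, 0.2], "book_b.txt": [0.3, 0.4]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_embeddings.pkl")
    first = load_or_compute(path, fake_embed)   # computes and writes the cache
    second = load_or_compute(path, fake_embed)  # reads the cache, no recomputation

print(f"compute function ran {len(calls)} time(s)")
```

Two caveats apply to this pattern. A cached file is never refreshed automatically, so it has to be deleted by hand if the data, the preprocessing, or the model changes. Also, `pickle.load` can execute arbitrary code, so only load `.pkl` files that you created yourself or that come from a trusted source.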


In [8]:
# Step 5: Define Functions for Semantic Search and Display

def find_top_n_books(query, book_embeddings, n=5):
    """Finds the top N most relevant books for a given query."""
    # Embed the user's query
    query_embedding = model.encode(query, convert_to_numpy=True)
    
    # Reshape for cosine similarity calculation
    query_embedding = query_embedding.reshape(1, -1)

    # Compute cosine similarity between the query and all book snippets
    similarities = {}
    for filename, book_embedding in book_embeddings.items():
        book_embedding = book_embedding.reshape(1, -1)
        similarity = cosine_similarity(query_embedding, book_embedding)[0][0]
        similarities[filename] = similarity

    # Sort the books by similarity score in descending order
    sorted_books = sorted(similarities.items(), key=lambda item: item[1], reverse=True)

    # Return the top N books
    return sorted_books[:n]

def print_top_books(book_list, original_books_text):
    """Prints the search results nicely, showing the start of the original book."""
    print("\n\n---  Top Search Results ---")
    for i, (filename, score) in enumerate(book_list):
        # Get the first 30 lines of the original, unprocessed book for context
        book_content = original_books_text[filename]
        first_30_lines = '\n'.join(book_content.splitlines()[:30])
        
        print(f"\n{i+1}. Filename: {filename}")
        print(f"   Similarity Score: {score:.4f}")
        print("   " + "-" * 40)
        print("   Preview of the book's beginning:")
        print(first_30_lines)
        print("=" * 60)

# Step 6: Run the Interactive Search 🔍

# This loop allows for multiple searches without rerunning the script
while True:
    user_query = input("\nEnter your search query (or type 'exit' to quit): ")
    if user_query.lower() == 'exit':
        break
    
    # Find the top 5 most relevant books
    top_5_books = find_top_n_books(user_query, book_embeddings, n=5)
    
    # Print the results using the original, unprocessed text for better readability
    print_top_books(top_5_books, all_books_full_text)


Enter your search query (or type 'exit' to quit):  book about fictional wars




---  Top Search Results ---

1. Filename: book_3009.txt
   Similarity Score: 0.5323
   ----------------------------------------
   Preview of the book's beginning:

Truth and the Myth, by A.R. Narayanan
(C) Copyright 2000 by A.R. Narayanan



2. Filename: book_2600.txt
   Similarity Score: 0.5082
   ----------------------------------------
   Preview of the book's beginning:




WAR AND PEACE


By Leo Tolstoy/Tolstoi


    Contents

    BOOK ONE: 1805

    CHAPTER I

    CHAPTER II

    CHAPTER III

    CHAPTER IV

    CHAPTER V

    CHAPTER VI

    CHAPTER VII

    CHAPTER VIII


3. Filename: book_1324.txt
   Similarity Score: 0.4959
   ----------------------------------------
   Preview of the book's beginning:





This Etext prepared by Joseph Gallanar
Gallanar@microserve.net





RUSSIA IN 1919
BY ARTHUR RANSOME




PUBLISHER'S NOTE



On August 27, 1914, in London, I made this note in a
memorandum book: "Met Arthur Ransome at_____'s;
discussed a book on the Russian's relation t


Enter your search query (or type 'exit' to quit):  exit
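The search function above computes cosine similarity one book at a time. As an alternative sketch, the same ranking can be computed for all books at once with a single matrix product in NumPy. The function name `top_n_by_cosine` and the tiny two-dimensional vectors are assumptions for illustration and are not real embeddings.

```python
import numpy as np

def top_n_by_cosine(query_vec, names, matrix, n=5):
    """Rank the rows of `matrix` by cosine similarity to `query_vec`."""
    matrix = np.asarray(matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Scale every row and the query to unit length (guarding against zero vectors)
    row_norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_rows = matrix / np.clip(row_norms, 1e-12, None)
    unit_query = query_vec / max(np.linalg.norm(query_vec), 1e-12)
    # For unit vectors, the dot product equals the cosine similarity
    scores = unit_rows @ unit_query
    order = np.argsort(-scores)[:n]  # indices of the highest scores first
    return [(names[i], float(scores[i])) for i in order]

# Made-up 2-D vectors standing in for book embeddings
names = ["book_a.txt", "book_b.txt", "book_c.txt"]
matrix = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
results = top_n_by_cosine([1.0, 0.1], names, matrix, n=2)
for name, score in results:
    print(f"{name}: {score:.4f}")
```

With the dictionary built in this notebook, the inputs could be prepared once with `names = list(book_embeddings.keys())` and `matrix = np.stack(list(book_embeddings.values()))`, after which each query needs only one matrix product instead of thousands of separate `cosine_similarity` calls.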
