# Vector Search with ChromaDB - Deep Dive

This notebook demonstrates how vector search works in the RAG pipeline.

## What is Vector Search?

Instead of keyword matching, we:
1. Convert text to numerical vectors (embeddings)
2. Find vectors that are "close" in meaning
3. Return the most similar sentences

**Example**:
- Query: "What treats diabetes?"
- Finds: "Metformin reduces blood glucose" (shares no keywords with the query, but is semantically related)
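
The three steps above can be sketched without any model at all. The snippet below is a toy illustration, not the notebook's pipeline: the 3-dimensional vectors, the `TOY_INDEX` dictionary, and the `top_k` helper are all made up for the example, whereas a real embedding model produces hundreds of dimensions.

```python
import math

# Hypothetical 3-d "embeddings"; real models produce hundreds of dimensions.
TOY_INDEX = {
    "Metformin reduces blood glucose": [0.9, 0.1, 0.0],
    "Insulin therapy lowers blood sugar": [0.8, 0.3, 0.1],
    "The sky is blue": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Pretend this vector is the embedding of "What treats diabetes?"
query_vec = [1.0, 0.2, 0.0]
for score, text in top_k(query_vec, TOY_INDEX):
    print(f"{score:.3f}  {text}")
```

This brute-force loop compares the query against every stored vector. A vector database does the same job with an index, so it does not have to scan the whole collection.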

## Setup

In [1]:
import sys
sys.path.append('..')

import chromadb
from sentence_transformers import SentenceTransformer
from pathlib import Path
import pandas as pd


## Step 1: Load ChromaDB Collection

In [2]:
# Load ChromaDB
CHROMA_DIR = Path("../data/chroma")

client = chromadb.PersistentClient(path=str(CHROMA_DIR))
collection = client.get_collection("pubmed_index")

print(f"Collection: {collection.name}")
print(f"Total documents: {collection.count()}")

Collection: pubmed_index
Total documents: 298152


## Step 2: Load Embedding Model

In [3]:
# Same model used to build the index
embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

print(f"Model: {embedder}")
print(f"Embedding dimension: {embedder.get_sentence_embedding_dimension()}")

Model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False, 'architecture': 'MPNetModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
Embedding dimension: 768


## Step 3: Understand Embeddings

Let's see what an embedding looks like.

In [4]:
# Example sentence
sentence = "Diabetes is a metabolic disorder"

# Convert to embedding
embedding = embedder.encode(sentence)

print(f"Original text: {sentence}")
print(f"\nEmbedding shape: {embedding.shape}")
print(f"\nFirst 10 dimensions: {embedding[:10]}")
print(f"\nThese 768 numbers capture the semantic meaning!")

Original text: Diabetes is a metabolic disorder

Embedding shape: (768,)

First 10 dimensions: [ 0.0103629   0.03786739  0.02471967  0.02201937  0.05474599  0.02856212
  0.01622538  0.01917951  0.03064471 -0.01876031]

These 768 numbers capture the semantic meaning!


## Step 4: Semantic Similarity

Similar sentences have similar embeddings.

In [5]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Three sentences
sentences = [
    "Diabetes is a metabolic disorder",
    "High blood sugar characterizes diabetes",  # Similar meaning
    "The sky is blue"  # Different meaning
]

# Get embeddings
embeddings = embedder.encode(sentences)

# Calculate similarities
similarities = cosine_similarity(embeddings)

print("Cosine Similarity Matrix:")
print("(1.0 = same direction, 0.0 = unrelated, negative = opposing)\n")

for i, sent1 in enumerate(sentences):
    for j, sent2 in enumerate(sentences):
        if i < j:
            print(f"Sentence {i+1} vs {j+1}: {similarities[i][j]:.3f}")
            print(f"  '{sent1[:40]}...'")
            print(f"  '{sent2[:40]}...'\n")

Cosine Similarity Matrix:
(1.0 = same direction, 0.0 = unrelated, negative = opposing)

Sentence 1 vs 2: 0.649
  'Diabetes is a metabolic disorder...'
  'High blood sugar characterizes diabetes...'

Sentence 1 vs 3: -0.135
  'Diabetes is a metabolic disorder...'
  'The sky is blue...'

Sentence 2 vs 3: -0.070
  'High blood sugar characterizes diabetes...'
  'The sky is blue...'



## Step 5: Search ChromaDB

Now let's search for relevant PubMed sentences.

In [6]:
# Your question
query = "What is asthma?"

# Embed the query
query_embedding = embedder.encode(query).tolist()

# Search ChromaDB
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5  # Top 5 results
)

print(f"Query: {query}\n")
print("Top 5 Most Similar Sentences:\n")
print("="*80)

for i, (doc, metadata, distance) in enumerate(zip(
    results['documents'][0],
    results['metadatas'][0],
    results['distances'][0]
), 1):
    similarity = 1 - distance  # Distance to similarity; assumes the collection uses cosine distance
    print(f"\n{i}. Similarity: {similarity:.3f}")
    print(f"   PMID: {metadata.get('pmid', 'N/A')}")
    print(f"   Text: {doc}")
    print("-"*80)

Query: What is asthma?

Top 5 Most Similar Sentences:


1. Similarity: 0.720
   PMID: deep_learning_1683
   Text: Asthma is a syndrome composed of heterogeneous disease entities.
--------------------------------------------------------------------------------

2. Similarity: 0.608
   PMID: deep_learning_3700
   Text: We focus on a specific syndrome-asthma/difficulty breathing.
--------------------------------------------------------------------------------

3. Similarity: 0.573
   PMID: deep_learning_2215
   Text: Respiratory ailments afflict a wide range of people and manifests itself through conditions like asthma and sleep apnea.
--------------------------------------------------------------------------------

4. Similarity: 0.566
   PMID: deep_learning_5498
   Text: Although the complex disease of asthma has been defined as being heterogeneous, the extent of its endophenotypes remains unclear.
--------------------------------------------------------------------------------

5. Simi

## Step 6: Compare Keyword vs Vector Search

In [7]:
# Query without exact keywords
query = "What treats high blood pressure?"
print(f"Query: {query}")
print("\nKeyword search would look for: 'treats', 'high', 'blood', 'pressure'")
print("Vector search understands: 'medication for hypertension'\n")

# Vector search
query_embedding = embedder.encode(query).tolist()
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3
)

print("Vector Search Results:\n")
for i, doc in enumerate(results['documents'][0], 1):
    print(f"{i}. {doc[:150]}...\n")

Query: What treats high blood pressure?

Keyword search would look for: 'treats', 'high', 'blood', 'pressure'
Vector search understands: 'medication for hypertension'

Vector Search Results:

1. Despite increasing financial and human resources invested, the disappointing rate of hypertension (HT) control continues to pose a challenge to health...

2. Long-term abnormal blood pressure will lead to various cardiovascular diseases, making the early detection and assessment of hypertension profoundly s...

3. Here we present the reasoned statement of the Italian Society of Hypertension to maintain ongoing antihypertensive treatments....
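
Step 6 only runs the vector side of the comparison. As a rough illustration of the keyword side, the sketch below scores two of the sentences returned above by plain term overlap. The `STOPWORDS` set and the `keyword_score` helper are assumptions made for this example; a production keyword engine (BM25, for instance) is more sophisticated, but it still depends on shared terms.

```python
import re

# Small, hand-picked stopword list (an assumption for this example).
STOPWORDS = {"what", "is", "the", "a", "of", "to", "for"}

def tokenize(text):
    """Lowercase, keep alphabetic runs, drop stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def keyword_score(query, doc):
    """Fraction of query terms that appear verbatim in the document."""
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokenize(doc)) / len(query_terms)

query = "What treats high blood pressure?"
docs = [
    "Long-term abnormal blood pressure will lead to various cardiovascular diseases",
    "Here we present the reasoned statement of the Italian Society of Hypertension "
    "to maintain ongoing antihypertensive treatments.",
]

for doc in docs:
    print(f"{keyword_score(query, doc):.2f}  {doc[:60]}...")
```

The second sentence is about antihypertensive treatment, yet it shares no exact term with the query ("treatments" is not "treats", and "hypertension" is not "blood pressure"), so term overlap scores it at zero. Vector search ranked it in the top three.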



## Step 7: Try Your Own Queries

In [8]:
def search_pubmed(query, n_results=5):
    """Search PubMed sentences using vector similarity"""
    
    # Embed query
    query_embedding = embedder.encode(query).tolist()
    
    # Search
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )
    
    # Display results
    print(f"Query: {query}\n")
    print("="*80)
    
    for i, (doc, metadata, distance) in enumerate(zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    ), 1):
        similarity = 1 - distance  # Assumes the collection uses cosine distance
        print(f"\n{i}. Similarity: {similarity:.3f} | PMID: {metadata.get('pmid', 'N/A')}")
        print(f"   {doc}")
    
    print("\n" + "="*80)

# Try different queries
search_pubmed("What causes COVID-19?")
# search_pubmed("How does aspirin work?")
# search_pubmed("What are symptoms of diabetes?")

Query: What causes COVID-19?


1. Similarity: 0.818 | PMID: covid_19_4916
   What is COVID-19?

2. Similarity: 0.794 | PMID: covid_19_7383
   The coronavirus disease-19 (COVID-19) is caused by the novel severe acute respiratory syndrome coronavirus that was first detected at the end of December 2019.

3. Similarity: 0.792 | PMID: covid_19_4111
   The COVID-19 is a global pandemic infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).

4. Similarity: 0.785 | PMID: covid_19_81
   COVID-19 is caused by a novel coronavirus, named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [previously provisionally known as 2019 novel coronavirus (2019-nCoV)].

5. Similarity: 0.783 | PMID: covid_19_5045
   COVID-19 is a highly infectious respiratory disease caused by a new coronavirus known as SARS-CoV-2 (severe acute respiratory syndrome-coronavirus-2).
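
`search_pubmed` prints its results, which makes them awkward to reuse. One option is to flatten the nested result lists into plain records. The helper below is a sketch: `flatten_results` is a hypothetical name, and `mock_results` is a hand-written stand-in that mimics the list-of-lists shape indexed in the cells above (`results['documents'][0]` and so on), not real output from the collection.

```python
def flatten_results(results, query_index=0):
    """Turn nested query results into a list of flat dicts, one per hit."""
    rows = []
    hits = zip(
        results["ids"][query_index],
        results["documents"][query_index],
        results["metadatas"][query_index],
        results["distances"][query_index],
    )
    for rank, (doc_id, doc, meta, dist) in enumerate(hits, 1):
        rows.append({
            "rank": rank,
            "id": doc_id,
            # Metadata may be missing for a hit, so fall back to an empty dict.
            "pmid": (meta or {}).get("pmid", "N/A"),
            "distance": dist,
            "text": doc,
        })
    return rows

# Made-up stand-in for a query result (one query, two hits).
mock_results = {
    "ids": [["s1", "s2"]],
    "documents": [["First example sentence.", "Second example sentence."]],
    "metadatas": [[{"pmid": "example_1"}, None]],
    "distances": [[0.21, 0.25]],
}

rows = flatten_results(mock_results)
for row in rows:
    print(row)
```

In the notebook, the same list of dicts could be passed to `pd.DataFrame(rows)` for sorting or filtering, since `pandas` is already imported in the setup cell.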



## Understanding the Technology

### How It Works:

1. **Embedding Model**: `sentence-transformers/all-mpnet-base-v2`
   - Converts text → 768-dimensional vector
   - Trained on 1B+ sentence pairs
   - Captures semantic meaning

2. **Vector Database**: ChromaDB
   - Stores 298,152 PubMed sentence embeddings
   - Uses the HNSW algorithm for fast approximate nearest-neighbor search
   - Millisecond-scale search over ~300K vectors (about 7 ms in the measurement below)

3. **Similarity Metric**: Cosine Similarity
   - Measures angle between vectors
   - Ranges from -1.0 to 1.0: 1.0 = same direction, 0.0 = unrelated, negative = opposing
   - Efficient to compute
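
How a returned distance maps back to a similarity depends on the metric the collection was built with, which this notebook does not show. The sketch below uses made-up vectors and hypothetical helper names to check two facts about unit-length vectors (the model printed earlier ends in a `Normalize()` layer): cosine distance is `1 - cos`, while squared Euclidean distance is `2 - 2*cos`.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up vectors, normalized to unit length.
a = normalize([3.0, 4.0, 0.0])
b = normalize([4.0, 3.0, 1.0])

cos_sim = dot(a, b)               # for unit vectors, dot product == cosine similarity
cosine_distance = 1.0 - cos_sim
l2_squared = squared_l2(a, b)

# Recover the similarity from each kind of distance.
sim_from_cosine = 1.0 - cosine_distance
sim_from_l2 = 1.0 - l2_squared / 2.0

print(f"cosine similarity:   {cos_sim:.4f}")
print(f"cosine distance:     {cosine_distance:.4f}")
print(f"squared L2 distance: {l2_squared:.4f}")
print(f"recovered (cosine):  {sim_from_cosine:.4f}")
print(f"recovered (L2):      {sim_from_l2:.4f}")
```

So the `1 - distance` conversion in the search cells yields cosine similarity only if the collection reports cosine distance. If it reports squared L2 distance instead, `1 - distance / 2` would be the matching conversion for normalized embeddings.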

### Why Vector Search?

**Keyword Search**:
- Misses synonyms ("treats" vs "therapy")
- Misses paraphrases
- Requires exact matches

**Vector Search**:
- Finds semantic similarity
- Handles synonyms naturally
- Works with paraphrases
- Understands context

## Performance Analysis

In [9]:
import time

# Measure search latency
query = "What is diabetes?"

# Embedding time
start = time.perf_counter()  # monotonic, high-resolution timer
query_embedding = embedder.encode(query).tolist()
embed_time = time.perf_counter() - start

# Search time
start = time.perf_counter()
results = collection.query(query_embeddings=[query_embedding], n_results=5)
search_time = time.perf_counter() - start

print("Performance Metrics:")
print(f"  Embedding time: {embed_time*1000:.2f}ms")
print(f"  Search time:    {search_time*1000:.2f}ms")
print(f"  Total:          {(embed_time + search_time)*1000:.2f}ms")
print(f"\nSearching over {collection.count():,} documents")

Performance Metrics:
  Embedding time: 14.91ms
  Search time:    7.46ms
  Total:          22.37ms

Searching over 298,152 documents
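
A single timed call can be skewed by caching, model warm-up, or other activity on the machine. A more stable estimate repeats the call and reports a median. The helper below is a minimal sketch: `time_call` and `fake_search` are hypothetical names, and `fake_search` is only a stand-in workload so the snippet runs without the collection.

```python
import statistics
import time

def time_call(fn, *args, repeats=20, warmup=3, **kwargs):
    """Return per-call latencies in milliseconds, after a few warm-up calls."""
    for _ in range(warmup):
        fn(*args, **kwargs)
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

def fake_search(n):
    """Stand-in workload; swap in the real embed or query call."""
    return sum(i * i for i in range(n))

latencies = time_call(fake_search, 10_000)
print(f"median: {statistics.median(latencies):.3f}ms")
print(f"max:    {max(latencies):.3f}ms")
```

In this notebook, the search step could be measured with `time_call(collection.query, query_embeddings=[query_embedding], n_results=5)`, and the embedding step with `time_call(embedder.encode, query)`.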
