# **Building an Index and a Retriever**

## **What's Covered?**
1. Building an Index
    - Step 1: Initialize an Embedding Model
    - Step 2: Setting a Connection with the ChromaDB
    - Step 3: Load the Documents
    - Step 4: Prepare Documents by Splitting them into Chunks
    - Step 5: Add Chunks to Vector DB
    - Step 6: Retrieving Documents from VectorDB (Similarity Search With Score)
2. Building a Retriever
    - Simple Similarity Search
    - Maximal Marginal Relevance (MMR)
3. Reranking (Coming Soon)

## **Building an Index**
1. Initialize an Embedding Model
2. Setting a Connection with the ChromaDB
3. Load the Documents
4. Prepare Documents by Splitting them into Chunks
5. Add Chunks to Vector DB
6. Retrieving Documents from VectorDB (Similarity Search With Score)

### **Step 1: Initialize an Embedding Model**

In [1]:
# Step 1 - Initialize an embedding_model
# We are just loading OpenAIEmbeddings

from langchain_openai import OpenAIEmbeddings

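# Read the key with a context manager so the file is closed, and strip the trailing newline
with open('keys/.openai_api_key.txt') as f:
    OPENAI_API_KEY = f.read().strip()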

embedding_model = OpenAIEmbeddings(api_key=OPENAI_API_KEY, 
                                   model="text-embedding-3-large")



### **Step 2: Setting a Connection with the ChromaDB**

In [3]:
# Step 2 - Initialize a ChromaDB Connection
from langchain_chroma import Chroma

# Initialize the database connection
# If the database already exists, this connects to the given collection_name in persist_directory
# Otherwise, a new collection is created
db = Chroma(collection_name="vector_database", 
            embedding_function=embedding_model, 
            persist_directory="./chroma_db_")

In [4]:
# Initially the database is empty

db.get()

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': []}

### **Step 3: Load the Documents**

In [6]:
# Step 3 - Load the documents
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader('data/subtitles', glob="*.srt", show_progress=True, loader_cls=TextLoader)

data = loader.load()

100%|█████████████████████████████████████████| 10/10 [00:00<00:00, 2499.29it/s]


### **Step 4: Prepare Documents by Splitting them into Chunks**

In [8]:
# Step 4 - Split the documents into chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)

chunks = text_splitter.split_documents(data)

print("Number of Chunks: ", len(chunks))
print()
print("Datatype of Chunks: ", type(chunks[0]))
print()
print("Chunk Text:\n", chunks[0])

Number of Chunks:  920

Datatype of Chunks:  <class 'langchain_core.documents.base.Document'>

Chunk Text:
 page_content='1
00:00:01,435 --> 00:00:04,082
This is pretty much
what's happened so far.

2
00:00:04,395 --> 00:00:07,179
Ross was in love
with Rachel since forever.

3
00:00:07,423 --> 00:00:10,437
Every time he tried to tell her,
something got in the way...' metadata={'source': 'data/subtitles/Friends_2x01.srt'}


In [9]:
print(chunks[-1])

page_content='354
00:22:33,943 --> 00:22:36,104
They're ribbed for your pleasure!' metadata={'source': 'data/subtitles/Friends_2x09.srt'}
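
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as blank lines and newlines, so its chunks are usually shorter than the limit. The helper name `sliding_window_chunks` and the sample string are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Hypothetical helper
    for illustration only; it ignores separators entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each printed piece starts with the last four characters of the previous one, which is the role `chunk_overlap=50` plays for the subtitle chunks: text that straddles a boundary still appears whole in at least one chunk.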


### **Step 5: Add Chunks to Vector DB**

In [10]:
db.add_documents(chunks)

['6f95367b-f5b8-4636-aa11-9b8fbb0db5fe',
 'ebf6548a-99d1-427a-be7d-907fd0247277',
 'e04c1479-2b07-4705-aa9f-a1629ca6bf45',
 '58d23c46-0899-47da-ad66-048a44ddfc1c',
 'bb8d95dd-7024-4c2c-b411-fee6eecf94e2',
 'eb07b49f-2fcd-49a6-a8fd-c6dad96c4f9d',
 'fe04f5b1-b9eb-4642-a4ea-2d32ab5f99d6',
 'f1c3c496-0065-4959-b876-6d3bc559c59e',
 '4fede1bd-01b9-4d4f-adf3-8d771da0d371',
 '718f8b85-c94d-402d-bd23-df996c171a1b',
 '5295c944-4da1-4fe0-88c0-7133d0b2c20e',
 '4ea1c832-1369-45ba-835e-285ae424d41c',
 '291ddbdd-01e6-4982-8f72-921c946e7aad',
 '7b8c4f1b-7410-44ff-ad0a-dec5214c643d',
 'f9047604-b585-4747-b069-2ff460c9d99a',
 '8beb989b-fd05-49bb-9b46-4cd43a5c0c7d',
 '4dfc6a1a-f0e1-4b1b-820c-5ab56aa2c4a5',
 'ccba6460-f22a-4389-bf4a-6da2caa0f015',
 'ff531791-2322-484f-95dd-5cf9cba18d50',
 '1d804255-d6e0-46c6-8d94-cee08952f959',
 'd2c0ade3-b72c-4223-b73d-86e146bb8cad',
 '0562fceb-5a6d-4772-a875-b356800aeb15',
 '349a77b7-15cf-47b0-a22f-92c8e449ccd5',
 '1ed05387-a565-42ac-afcb-874b01edf620',
 '87d88d9e-31ec-

In [11]:
# Check that the documents were indexed.
# db.get() returns the full contents of the collection; here we only count the stored ids.
print(len(db.get()["ids"]))

920


### **Step 6: Retrieving Documents from VectorDB (Similarity Search With Score)**

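**Note: The score is a distance, so lower is always better. For cosine distance the range is `[0, 2]`; Chroma's default metric is squared L2, for which unit-length embeddings give a range of `[0, 4]`.**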

**Syntax**
```python
relevant_docs = vectordb.similarity_search_with_score(
    query=query, 
    k=10,
    filter={}
)
```

In [39]:
query = "What is their on Julie vs Rachels List?"

relevant_docs = db.similarity_search_with_score(query=query, k=10)

print("Number of Relevant Documents: ", len(relevant_docs))

Number of Relevant Documents:  10


In [42]:
print("Type of output:", type(relevant_docs))
print()
print("Type of each item in output:", type(relevant_docs[0]))

Type of output: <class 'list'>

Type of each item in output: <class 'tuple'>


In [43]:
print("Let's look at one document: \n", relevant_docs[0])

Let's look at one document: 
 (Document(id='77028be3-d450-436f-b1ab-12da794ac056', metadata={'source': 'data/subtitles/Friends_2x08.srt'}, page_content='247\n00:15:09,082 --> 00:15:10,242\nNo! I\n\n248\n00:15:10,483 --> 00:15:13,611\nOkay, look at the other side.\nLook at Julie\'s column.\n\n249\n00:15:14,487 --> 00:15:15,954\n"She\'s not Rachem"?\n\n250\n00:15:17,423 --> 00:15:18,822\nWhat the hell\'s a Rachem?'), 0.9598516225814819)


In [44]:
for doc in relevant_docs:
    print(f"{doc[0].metadata} -> {doc[1]}")

{'source': 'data/subtitles/Friends_2x08.srt'} -> 0.9598516225814819
{'source': 'data/subtitles/Friends_2x01.srt'} -> 1.0160479545593262
{'source': 'data/subtitles/Friends_2x08.srt'} -> 1.049673318862915
{'source': 'data/subtitles/Friends_2x04.srt'} -> 1.1232709884643555
{'source': 'data/subtitles/Friends_2x02.srt'} -> 1.1275694370269775
{'source': 'data/subtitles/Friends_2x09.srt'} -> 1.1634116172790527
{'source': 'data/subtitles/Friends_2x08.srt'} -> 1.1881942749023438
{'source': 'data/subtitles/Friends_2x08.srt'} -> 1.1927765607833862
{'source': 'data/subtitles/Friends_2x07.srt'} -> 1.1949961185455322
{'source': 'data/subtitles/Friends_2x08.srt'} -> 1.2044466733932495
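
The scores returned by `similarity_search_with_score` are distances, so smaller means closer. As a rough guide to reading them, the sketch below compares cosine distance with squared L2 distance on toy unit-length vectors. Which metric a given collection uses depends on how it was configured, so the vectors and helper names here are illustrative assumptions rather than anything taken from the database above.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 2 for opposite directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def squared_l2_distance(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy 2-D "embeddings" (made up for illustration), scaled to unit length.
query = normalize([1.0, 0.0])
candidates = {
    "same direction": normalize([2.0, 0.0]),
    "orthogonal": normalize([0.0, 3.0]),
    "opposite": normalize([-1.0, 0.0]),
}

for name, vec in candidates.items():
    d_cos = cosine_distance(query, vec)
    d_l2 = squared_l2_distance(query, vec)
    # For unit-length vectors, squared L2 distance equals 2 * cosine distance.
    print(f"{name:>14}: cosine={d_cos:.2f}  squared_l2={d_l2:.2f}")
```

For unit-length vectors the two metrics always produce the same ranking, so the order of the results does not depend on which one is in use; only the numeric range of the scores differs.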


## **Vector Store as Retriever**

There are multiple approaches to retrieving documents from vector stores.
1. Simple Similarity Search
2. Maximal Marginal Relevance (MMR)

**Syntax:**
```python
# All options are shown together for reference; pass only those that apply to the chosen search_type.
retriever = vectordb.as_retriever(
    search_type="similarity_score_threshold",   # One of 'similarity' (default), 'mmr', or 'similarity_score_threshold'
    search_kwargs={
        "k": 5,                                   # Number of documents to return (default: 4)
        "score_threshold": 0.85,                  # Minimum relevance score for 'similarity_score_threshold'
        "fetch_k": 20,                            # Number of candidates passed to the MMR algorithm (default: 20)
        "lambda_mult": 0.5,                       # Diversity of MMR results: 1 for minimum diversity, 0 for maximum (default: 0.5)
        "filter": {"paper_title": "GPT-4 Technical Report"}  # Filter by document metadata
    }
)
```

### **Simple Similarity Search**
Simple retrieval returns the k documents whose embeddings are closest to the query embedding.

#### **How does it work?**
1. The query is embedded: "What is machine learning?" → `[0.234, -0.456, 0.789, ...]`
2. The vector store computes the distance (for example, cosine distance) between the query embedding and the document embeddings
3. The k closest documents are returned
4. The results are ordered from most to least similar: `[highest_similarity, 2nd_highest, 3rd_highest, ...]`
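
The four steps above can be sketched without a vector store. The snippet below is a brute-force illustration on made-up 3-dimensional vectors; a real store such as Chroma uses an approximate nearest-neighbor index instead of scanning every document, and `top_k_similar` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, doc_vecs, k):
    """Return (doc_id, similarity) pairs for the k most similar documents."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return scored[:k]


# Made-up embeddings; in practice these come from the embedding model.
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.8, 0.2, 0.1],
    "doc_c": [0.0, 1.0, 0.0],
    "doc_d": [0.1, 0.0, 0.9],
}
query_vec = [1.0, 0.0, 0.0]

top = top_k_similar(query_vec, doc_vecs, k=2)
for doc_id, score in top:
    print(f"{doc_id}: {score:.3f}")
```

Here `doc_a` and `doc_b` point in almost the same direction as the query, so they fill both slots, which is exactly the behavior the next section examines.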


#### **The Problem with Simple Similarity:**

```python
# Imagine you have 6 documents about machine learning:
docs = [
    "Machine learning is AI...",              # 0.95 similarity
    "ML is a subset of AI...",                # 0.94 similarity (VERY similar to #1!)
    "Deep learning uses neural networks...",  # 0.93 similarity (VERY similar to #1!)
    "Python is a programming language...",    # 0.45 similarity (not relevant)
    "AI in healthcare applications...",       # 0.42 similarity (not relevant)
    "Classification vs regression...",        # 0.41 similarity (not relevant)
]

# Simple retrieval fills the top slots with near-duplicates:
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
results = retriever.invoke("What is machine learning?")
# Returns: [#1, #2, #3]
# Problem: #2 and #3 are REDUNDANT - they repeat what #1 already says
# All k slots are spent on repetitive information
```

**This is where Maximal Marginal Relevance (MMR) comes in.**

In [30]:
simple_retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 4
    }
)

In [32]:
results = simple_retriever.invoke("What is their on Julie vs Rachels List?")

print(results)

[Document(id='77028be3-d450-436f-b1ab-12da794ac056', metadata={'source': 'data/subtitles/Friends_2x08.srt'}, page_content='247\n00:15:09,082 --> 00:15:10,242\nNo! I\n\n248\n00:15:10,483 --> 00:15:13,611\nOkay, look at the other side.\nLook at Julie\'s column.\n\n249\n00:15:14,487 --> 00:15:15,954\n"She\'s not Rachem"?\n\n250\n00:15:17,423 --> 00:15:18,822\nWhat the hell\'s a Rachem?'), Document(id='f1c3c496-0065-4959-b876-6d3bc559c59e', metadata={'source': 'data/subtitles/Friends_2x01.srt'}, page_content='27\n00:02:08,857 --> 00:02:12,602\nCome on, I wanna hear everything!\nEverything!\n\n28\n00:02:14,121 --> 00:02:16,654\nWell, where do I start?\n\n29\n00:02:16,870 --> 00:02:19,815\nThis is Julie.\nJulie, this is Rachel.\n\n30\n00:02:25,015 --> 00:02:25,881\nThese are....'), Document(id='f09b02bc-6779-404d-a5b8-1278dde7d826', metadata={'source': 'data/subtitles/Friends_2x08.srt'}, page_content='141\n00:08:28,614 --> 00:08:31,048\n"Rachel and Julie: Pros and Cons."\n\n142\n00:08:35,254

In [33]:
for result in results:
    print(result.metadata)

{'source': 'data/subtitles/Friends_2x08.srt'}
{'source': 'data/subtitles/Friends_2x01.srt'}
{'source': 'data/subtitles/Friends_2x08.srt'}
{'source': 'data/subtitles/Friends_2x04.srt'}


### **Maximal Marginal Relevance (MMR)**

MMR balances relevance with diversity by penalizing documents that are too similar to already-selected documents.

Instead of just returning the k most similar documents, MMR iteratively selects documents that are:
- Highly relevant to the query (high similarity to query)
- Low redundancy with already-selected documents (low similarity to previous picks)

How MMR works mathematically:
```
For each candidate document:
  Relevance_Score = similarity(document, query)
  Redundancy_Penalty = max over already_selected_docs d of similarity(document, d)
  
  MMR_Score = λ * Relevance_Score - (1 - λ) * Redundancy_Penalty
  
  Select document with highest MMR_Score
  Repeat k times
```
Where λ (lambda) controls the tradeoff:
- λ = 1.0: Pure relevance (same as simple similarity)
- λ = 0.5: Balanced (50% relevance, 50% diversity)
- λ = 0.0: Pure diversity (maximize differences)
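
A minimal, self-contained sketch of this greedy selection loop follows. It uses made-up 2-D vectors and takes the redundancy penalty as the maximum similarity to any document picked so far, which is a common formulation. It is not LangChain's implementation, and `mmr_select` and the document ids are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k document ids, trading relevance against redundancy."""
    remaining = dict(doc_vecs)
    selected = []
    while remaining and len(selected) < k:
        best_id, best_score = None, -math.inf
        for doc_id, vec in remaining.items():
            relevance = cosine_similarity(query_vec, vec)
            # Redundancy: similarity to the closest already-selected document
            # (0.0 while nothing has been selected yet).
            redundancy = max(
                (cosine_similarity(vec, doc_vecs[s]) for s in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_id, best_score = doc_id, score
        selected.append(best_id)
        del remaining[best_id]
    return selected


# Made-up embeddings for illustration.
doc_vecs = {
    "ml_intro": [1.0, 0.3],
    "ml_intro_copy": [1.0, 0.32],   # near-duplicate of ml_intro
    "ml_in_practice": [1.0, -0.5],  # relevant, but a different angle
    "unrelated": [-0.2, 1.0],
}
query_vec = [1.0, 0.0]

pure_relevance = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0)
balanced = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5)
print(pure_relevance)
print(balanced)
```

With `lambda_mult=1.0` the near-duplicate takes the second slot, as in simple similarity search. With `lambda_mult=0.5` the redundancy penalty pushes it out in favor of `ml_in_practice`.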

```python
# Understanding lambda_mult (the diversity knob)

# Conservative: More relevant, less diverse
search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.8}

# Balanced: Mix of relevance and diversity (RECOMMENDED)
search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.5}

# Exploratory: More diverse, some less relevant
search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.2}

# fetch_k consideration:
# fetch_k = 20, k = 5 → Consider 20, return best 5 diverse ones
# Higher fetch_k = more computation but better diversity selection
```

In [35]:
# Use MMR retrieval - balances relevance and diversity
mmr_retriever = db.as_retriever(
    search_type="mmr",           # Enable MMR
    search_kwargs={
        "k": 3,                  # Return 3 documents
        "fetch_k": 10,           # Consider the top 10 candidates for the diversity check
        "lambda_mult": 0.5       # Balance relevance (0.5) vs diversity (0.5)
    }
)

In [36]:
# Retrieve with diversity considered
results = mmr_retriever.invoke("What is their on Julie vs Rachels List?")

print(results)

[Document(id='77028be3-d450-436f-b1ab-12da794ac056', metadata={'source': 'data/subtitles/Friends_2x08.srt'}, page_content='247\n00:15:09,082 --> 00:15:10,242\nNo! I\n\n248\n00:15:10,483 --> 00:15:13,611\nOkay, look at the other side.\nLook at Julie\'s column.\n\n249\n00:15:14,487 --> 00:15:15,954\n"She\'s not Rachem"?\n\n250\n00:15:17,423 --> 00:15:18,822\nWhat the hell\'s a Rachem?'), Document(id='4a8c659a-1992-44ab-a3cd-fbba3e85460e', metadata={'source': 'data/subtitles/Friends_2x02.srt'}, page_content="45\n00:03:31,171 --> 00:03:34,841\nYou can't go shopping with her.\nWhat about Rachel?\n\n46\n00:03:35,751 --> 00:03:37,015\nWill it be a problem?\n\n47\n00:03:37,221 --> 00:03:39,318\nYou're going to\nBloomingdale's with Julie."), Document(id='b116094f-f9b2-41fa-8233-6f46f40f340a', metadata={'source': 'data/subtitles/Friends_2x09.srt'}, page_content='141\n00:08:57,639 --> 00:09:00,506\nMonica, pigeons learn faster than you.\n\n142\n00:09:08,817 --> 00:09:09,875\nHey, Rach.\n\n143\n00

In [37]:
for result in results:
    print(result.metadata)

{'source': 'data/subtitles/Friends_2x08.srt'}
{'source': 'data/subtitles/Friends_2x02.srt'}
{'source': 'data/subtitles/Friends_2x09.srt'}


## **Reranking (More Coming Soon)**

Rerankers are typically cross-encoders: they score each query-document pair jointly, rather than embedding the query and the documents independently and comparing the vectors.

Flow:  
- Stage 1: Vector Search (k=100) → 100 docs, fast, cheap
- Stage 2: Reranking (top 100 → top 5) → 5 best docs, accurate, expensive
- Stage 3: Generation (5 docs → answer) → Final answer
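
The first two stages of this flow can be sketched in plain Python. Both scoring functions below are stand-ins chosen so the example runs without any model: `cheap_score` (shared-word count) plays the role of vector search, and `pairwise_score` (Jaccard word overlap) plays the role of a cross-encoder. Neither reflects how a real reranker such as Cohere Rerank computes its scores, and the sentences are made up for illustration.

```python
def tokens(text):
    """Lowercase word set with basic punctuation removed."""
    return set(text.lower().replace("?", "").replace(".", "").split())


def cheap_score(query, doc):
    """Stage 1 stand-in: number of words shared with the query."""
    return len(tokens(query) & tokens(doc))


def pairwise_score(query, doc):
    """Stage 2 stand-in: Jaccard overlap computed on the query-document pair."""
    q, d = tokens(query), tokens(doc)
    return len(q & d) / len(q | d)


def retrieve_then_rerank(query, docs, fetch_k, top_n):
    # Stage 1: keep a broad candidate set using the cheap score.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:fetch_k]
    # Stage 2: rescore only those candidates with the more expensive scorer.
    reranked = sorted(candidates, key=lambda d: pairwise_score(query, d), reverse=True)
    return reranked[:top_n]


docs = [
    "Ross made a list of pros and cons about Rachel and Julie and many other things.",
    "The list about Rachel and Julie.",
    "Monica cooked dinner.",
    "Joey went to an audition.",
]
query = "list about Rachel and Julie"
best = retrieve_then_rerank(query, docs, fetch_k=2, top_n=1)
print(best)
```

Both of the first two sentences share the same number of words with the query, so stage 1 cannot separate them; the pairwise scorer prefers the shorter, more focused one. Stage 3 would pass `best` to the language model as context.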


Ref: https://docs.langchain.com/oss/python/integrations/retrievers/cohere-reranker

Note:
- ContextualCompressionRetriever has been moved to langchain_classic
- Top Rerankers: Cohere Rerank, HuggingFace (bge-reranker) and FlashRank