In [None]:
# RAG - Types of Retriever in Lanchain V1

| ID | Text                            | Year | Type       |
| -- | ------------------------------- | ---- | ---------- |
| D1 | ‚ÄúAspirin treats pain and fever‚Äù | 2020 | medicine   |
| D2 | ‚ÄúCancer drug improves survival‚Äù | 2023 | medicine   |
| D3 | ‚ÄúAI detects cancer early‚Äù       | 2022 | technology |
| D4 | ‚ÄúFDA approved new cancer drug‚Äù  | 2024 | regulation |



In [7]:
# üìö Docs
from langchain_core.documents import Document

docs = [
    Document(page_content="Aspirin treats pain and fever", metadata={"year": 2020, "type": "medicine"}),
    Document(page_content="Cancer drug improves survival", metadata={"year": 2023, "type": "medicine"}),
    Document(page_content="AI detects cancer early", metadata={"year": 2022, "type": "technology"}),
    Document(page_content="FDA approved new cancer drug", metadata={"year": 2024, "type": "regulation"}),
]

# üîê LLM
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

load_dotenv(".env")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# üß† FREE embeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# üß† Vector DB
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(docs, embeddings)

retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

print("Vector Search:")
for d in retriever.invoke("new cancer medicine"):
    print("-", d.page_content)



  embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
  from .autonotebook import tqdm as notebook_tqdm


Vector Search:
- FDA approved new cancer drug
- Cancer drug improves survival


In [None]:
Now the student asks:

‚ÄúTell me about new cancer medicine‚Äù

In [None]:
üß† 1Ô∏è‚É£ VectorStoreRetriever

(Meaning-based search)

Like: Google but smarter

It doesn‚Äôt match words.
It matches meaning.

Even if the question says
‚Äúnew cancer medicine‚Äù
it will find:

‚úî ‚ÄúCancer drug improves survival‚Äù
‚úî ‚ÄúFDA approved new cancer drug‚Äù

Because embeddings understand that:

medicine ‚âà drug
new ‚âà recent

Flow
Question
   ‚Üì
Converted to numbers (embedding)
   ‚Üì
Compared with document embeddings
   ‚Üì
Most similar documents returned

In [8]:
print("Vector Search:")
for d in retriever.invoke("Tell me about new cancer medicine"):
    print("-", d.page_content)

Vector Search:
- FDA approved new cancer drug
- Cancer drug improves survival


In [None]:
üß† 2Ô∏è‚É£ MultiQueryRetriever

(Ask in many ways)

The LLM rewrites the question:

User:

‚Äúnew cancer medicine‚Äù

LLM creates:

‚ÄúWhat cancer drugs were recently released?‚Äù
‚ÄúLatest FDA approved cancer drugs‚Äù
‚ÄúNew treatment for cancer‚Äù


Each one searches the vector DB.

Then results are merged.

So even if one phrasing misses D4,
another will find it.

This increases recall üìà

In [15]:
rewrite_prompt = """
Rewrite the following question into 3 different search queries
that mean the same thing.

Question: {q}

Return only the queries, one per line.
"""

from langchain_core.prompts import PromptTemplate

rewrite_chain = PromptTemplate.from_template(rewrite_prompt) | llm

queries = rewrite_chain.invoke({"q": "new cancer medicine"}).content.split("\n")

print("Generated queries:")
for q in queries:
    print("-", q)



Generated queries:
- new cancer treatment  
- latest cancer drug  
- recent cancer therapy


In [None]:
üß† 3Ô∏è‚É£ SelfQueryRetriever

(Smart filtering)

User asks:

‚ÄúFDA cancer drugs after 2022‚Äù

LLM converts this into:

type = "medicine"
AND year > 2022
AND text contains "cancer"


Now it filters table:

Text	Year	Type
Cancer drug improves survival	2023	medicine
FDA approved new cancer drug	2024	regulation

Then vector search happens on only these.

Perfect for compliance & audit.

In [22]:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain_community.tools.tavily_search import TavilySearchResults


# ------------------- Setup -------------------
load_dotenv(".env")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

docs = [
    Document(page_content="Aspirin treats pain and fever", metadata={"year": 2020}),
    Document(page_content="Cancer drug improves survival", metadata={"year": 2023}),
    Document(page_content="AI detects cancer early", metadata={"year": 2022}),
    Document(page_content="FDA approved new cancer drug", metadata={"year": 2024}),
]

# Vector DB
vectorstore = Chroma.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

# ------------------- 1Ô∏è‚É£ Vector Search -------------------
print("\nVector Search:")
for d in retriever.invoke("new cancer medicine"):
    print("-", d.page_content)

# ------------------- 2Ô∏è‚É£ MultiQuery -------------------
rewrite_prompt = PromptTemplate.from_template(
    "Rewrite the question in 3 different ways:\n{q}"
)

queries = (rewrite_prompt | llm).invoke({"q": "new cancer medicine"}).content.split("\n")

results = []
for q in queries:
    results += retriever.invoke(q)

unique = {d.page_content: d for d in results}.values()

print("\nMultiQuery Results:")
for d in unique:
    print("-", d.page_content)

# ------------------- 3Ô∏è‚É£ SelfQuery -------------------
filter_prompt = PromptTemplate.from_template(
    "Extract filters.\nOnly return:\nyear > number\nkeyword = word\nQuestion: {q}"
)

filters = (filter_prompt | llm).invoke(
    {"q": "FDA cancer drugs after 2022"}
).content

print("\nGenerated Filters:")
print(filters)

filtered = []
for d in docs:
    if d.metadata["year"] > 2022 and "cancer" in d.page_content.lower():
        filtered.append(d)

print("\nSelfQuery Results:")
for d in filtered:
    print("-", d.page_content)




Vector Search:
- FDA approved new cancer drug
- FDA approved new cancer drug

MultiQuery Results:
- FDA approved new cancer drug

Generated Filters:
year > 2022  
keyword = cancer drugs

SelfQuery Results:
- Cancer drug improves survival
- FDA approved new cancer drug


In [None]:
üß† 4Ô∏è‚É£ ParentDocumentRetriever

(Chunk ‚Üí full book)

PDF example:

Page 1: cancer
Page 2: dosage
Page 3: side effects

User asks:

‚ÄúWhat are side effects?‚Äù

Retriever finds Page 3 (chunk)
But returns the whole PDF

So the LLM sees full context.

In [34]:
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# -------------------------
# 1. Fake PDF pages
# -------------------------
pages = [
    Document(page_content="[pdf1_page1] Cancer is a disease caused by uncontrolled cell growth."),
    Document(page_content="[pdf1_page2] Dosage depends on patient weight and age."),
    Document(page_content="[pdf1_page3] Side effects include nausea, hair loss, and fatigue."),
]

# -------------------------
# 2. Chunk them
# -------------------------
splitter = RecursiveCharacterTextSplitter(chunk_size=40, chunk_overlap=0)
chunks = splitter.split_documents(pages)

# -------------------------
# 3. Vector DB
# -------------------------
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever()

# -------------------------
# 4. Parent lookup
# -------------------------
parent_docs = {
    "pdf1_page1": pages[0],
    "pdf1_page2": pages[1],
    "pdf1_page3": pages[2],
}

# -------------------------
# 5. Ask a question
# -------------------------
print("User: What are side effects?\n")

chunk_hits = retriever.invoke("What are side effects?")

# extract parent id from text
parent_ids = set()
for d in chunk_hits:
    if "[" in d.page_content:
        pid = d.page_content.split("]")[0][1:]
        parent_ids.add(pid)

print("Returned full pages:")
for pid in parent_ids:
    print(parent_docs[pid].page_content)


User: What are side effects?

Returned full pages:


In [None]:
üß† 5Ô∏è‚É£ EnsembleRetriever

(Smart mix)

Uses:

‚Ä¢ Vector search (meaning)
‚Ä¢ BM25 (keyword)

So if user types:

‚ÄúFDA‚Äù

Keyword finds D4
Meaning search finds D2

Best of both worlds üî•

In [23]:
# ------------------- 4Ô∏è‚É£ Ensemble -------------------
bm25 = BM25Retriever.from_documents(docs)

vector_results = retriever.invoke("FDA cancer")
bm25_results = bm25.invoke("FDA cancer")

combined = {d.page_content: d for d in vector_results + bm25_results}.values()

print("\nEnsemble Results:")
for d in combined:
    print("-", d.page_content)





Ensemble Results:
- FDA approved new cancer drug
- AI detects cancer early
- Cancer drug improves survival
- Aspirin treats pain and fever


In [None]:
üß† 6Ô∏è‚É£ TimeWeightedRetriever

(Recent memory)

If user keeps asking about cancer,
and recently talked about FDA,

D4 gets boosted

So newer & frequent topics are preferred.

Like your brain remembering recent topics üß†

In [None]:
üß† 7Ô∏è‚É£ BM25Retriever

(Classic keyword search)

User types:

‚ÄúFDA‚Äù

Only documents with exact word FDA come.

Finds:
‚úî ‚ÄúFDA approved new cancer drug‚Äù

No embeddings.
Fast and accurate for legal, tables.

In [None]:
üß† 8Ô∏è‚É£ ContextualCompressionRetriever

(Summarizer before LLM)

It takes:

‚ÄúFDA approved new cancer drug after trials‚Ä¶‚Äù

and shrinks it to:

‚ÄúFDA approved a new cancer drug in 2024.‚Äù

So LLM gets only what matters.
Less tokens, less hallucination.

In [35]:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# ------------------------
# Setup
# ------------------------
load_dotenv(".env")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# ------------------------
# Documents (long text)
# ------------------------
docs = [
    Document(page_content="FDA approved a new cancer drug after multiple clinical trials showed improved survival and fewer side effects."),
    Document(page_content="Cancer drug research has increased significantly in recent years, with many therapies entering late-stage trials."),
]

# ------------------------
# Vector DB
# ------------------------
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever()

# ------------------------
# User query
# ------------------------
query = "new FDA cancer drug"

found_docs = retriever.invoke(query)

print("Retrieved Documents:")
for d in found_docs:
    print("-", d.page_content)

# ------------------------
# Contextual Compression
# ------------------------
compress_prompt = PromptTemplate.from_template(
    "Summarize this for answering the question:\n{doc}"
)

compressed_docs = []

for d in found_docs:
    summary = (compress_prompt | llm).invoke({"doc": d.page_content}).content
    compressed_docs.append(summary)

print("\nCompressed Context:")
for c in compressed_docs:
    print("-", c)

# ------------------------
# Final Answer
# ------------------------
final_context = "\n".join(compressed_docs)

final_prompt = PromptTemplate.from_template(
    "Use the following compressed context to answer:\n{context}\n\nQuestion: {q}"
)

answer = (final_prompt | llm).invoke({
    "context": final_context,
    "q": "What did the FDA do about cancer drugs?"
})

print("\nFinal Answer:")
print(answer.content)


Retrieved Documents:
- FDA approved new cancer drug
- FDA approved new cancer drug
- FDA approved new cancer drug
- FDA approved new cancer drug

Compressed Context:
- The FDA has recently approved a new cancer drug, which is expected to provide new treatment options for patients. This approval signifies a significant advancement in cancer therapy, potentially improving outcomes for those affected by the disease. Further details about the drug's specific indications, efficacy, and safety profile may be available in the official announcement or accompanying documentation.
- The FDA has recently approved a new cancer drug, which is expected to provide new treatment options for patients. This approval signifies a significant advancement in cancer therapy, potentially improving outcomes for those affected by the disease. Further details about the drug's specific indications, efficacy, and safety profile may be available in the official announcement or accompanying studies.
- The FDA has re

In [None]:
üåê Tavily (Live Web Search)

Now suppose user asks:

‚ÄúLatest cancer drug news today‚Äù

Your vector DB is old.

So Tavily does:

Search internet
   ‚Üì
Get live articles
   ‚Üì
Pass to LLM


So RAG becomes:

Vector DB (internal)
 + Tavily (internet)
 ‚Üí Best answer


This is how ChatGPT stays current üîç

üß© Final Big Picture
User Question
   ‚Üì
Retriever (one or many)
   ‚Üì
Relevant documents
   ‚Üì
(Optional) Tavily Web Search
   ‚Üì
Context to LLM
   ‚Üì
Final Answer

In [24]:
# ------------------- 5Ô∏è‚É£ Tavily -------------------
tavily = TavilySearchResults()

print("\nTavily (Live Web):")
print(tavily.invoke("latest cancer drug news"))


Tavily (Live Web):
[{'title': 'MSK-Led Research Resulted in New FDA Approvals for Cancer ...', 'url': 'https://www.mskcc.org/news/msk-led-research-resulted-in-11-new-fda-approvals-for-cancer-drugs-in-2025', 'content': 'By\nJulie Grisham\nMonday, December 15, 2025\n\nMSK patient Ellen Coopersmith smiling in a rowboat in Central Park\n\nThe U.S. Food and Drug Administration (FDA) approved 10 cancer drugs in 2025, based on clinical trials in which MSK played a pivotal role.\n\nThese approvals spanned therapies for a range of solid tumors and blood cancers and included treatments for both common and rare types of tumors. They are listed in chronological order.\n\n(Note:\u202fThe FDA uses the term ‚Äúaccelerated approval‚Äù for drugs that treat diseases lacking effective treatments. These criteria are slightly different from a standard approval. The term ‚Äúfull approval‚Äù applies to drugs that had previously been granted ‚Äúconditional approval.‚Äù)\n\n## Sarcoma [...] Read more.\n\n## L