## 📘 **Workflow of a Reliable RAG System**

A reliable RAG pipeline does not just *retrieve* and *generate* responses;  
it also **checks retrieval quality**, **checks for hallucinations**, and **tracks which context influenced the answer**.

---

#### 🔹 **1) Query → Retriever → Retrieved Documents**

##### ✔️ *LLM-Based Relevancy Check*
After retrieving documents, use an LLM to evaluate:

- Are the retrieved documents actually relevant to the query?
- Are they sufficient to answer it?
- Do we need a retry with different retrieval parameters?

This prevents garbage context from reaching the generator.
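
The check-then-retry idea above can be sketched without any LLM dependency. In this minimal sketch, `keyword_grade` is a hypothetical stand-in for the LLM grader, `toy_retrieve` is a hypothetical retriever, and the retry policy (doubling `k`) is an assumption for illustration, not something this notebook prescribes.

```python
from typing import Callable, List


def keyword_grade(query: str, doc: str) -> bool:
    """Hypothetical stand-in for the LLM grader: relevant if any longer query word appears in the doc."""
    words = {w.lower().strip("?.,!") for w in query.split()}
    words = {w for w in words if len(w) > 3}
    doc_lower = doc.lower()
    return any(w in doc_lower for w in words)


def retrieve_with_relevancy_check(
    retrieve: Callable[[str, int], List[str]],
    grade: Callable[[str, str], bool],
    query: str,
    k: int = 2,
    max_retries: int = 2,
) -> List[str]:
    """Keep only documents the grader accepts; retry with a larger k if none pass."""
    for _ in range(max_retries + 1):
        docs = retrieve(query, k)
        relevant = [d for d in docs if grade(query, d)]
        if relevant:
            return relevant
        k *= 2  # assumed retry policy: widen the search
    return []


# Toy corpus and retriever (hypothetical): simply returns the first k documents.
corpus = [
    "Weekly newsletter subscription footer",
    "Cookie banner text",
    "Reflection is an agentic design pattern where the model critiques its output",
]


def toy_retrieve(query: str, k: int) -> List[str]:
    return corpus[:k]


result = retrieve_with_relevancy_check(toy_retrieve, keyword_grade, "What is the reflection pattern?")
print(result)
```

The first attempt (`k=2`) returns only boilerplate, so nothing passes the grader; the retry with a larger `k` reaches the one relevant document.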

---

#### 🔹 **2) Retrieved Docs + Query → System Prompt → LLM → Response**

##### ✔️ *Hallucination Check*
Use another LLM pass to verify:

- Does the **generated response** align with the **retrieved context**?
- Are there statements not supported by context?
- Should the response be revised?

This ensures factual grounding.
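
As a rough illustration of what "supported by context" means, the check can be run sentence by sentence. The sketch below is an assumption-laden stand-in: `token_overlap_supported` replaces the LLM judge with a simple token-overlap test (threshold 0.6 chosen arbitrarily), and `find_unsupported` is a hypothetical helper name.

```python
import re
from typing import Callable, List


def token_overlap_supported(sentence: str, context: str, threshold: float = 0.6) -> bool:
    """Hypothetical stand-in for an LLM judge: share of sentence tokens found in the context."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    if not tokens:
        return True
    context_tokens = set(re.findall(r"[a-z0-9]+", context.lower()))
    hits = sum(1 for t in tokens if t in context_tokens)
    return hits / len(tokens) >= threshold


def find_unsupported(
    response: str,
    context: str,
    is_supported: Callable[[str, str], bool] = token_overlap_supported,
) -> List[str]:
    """Return the response sentences for which the judge finds no support in the context."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    return [s for s in sentences if not is_supported(s, context)]


context = "Reflection lets the model critique its own output. Tool use lets the model call external tools."
response = "Reflection lets the model critique its output. Agents were invented in 1950 by robots."

unsupported = find_unsupported(response, context)
print(unsupported)
```

A non-empty result is the signal that the response should be revised; swapping `is_supported` for an LLM-backed judge keeps the same control flow.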

---

#### 🔹 **3) Evidence Tracking → Highlight the Context Used**

Finally, ask the LLM to identify **which specific lines/snippets** from the retrieved documents were actually used to generate the response.
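
The shape of this step can be sketched as follows. This is not an LLM call: `find_evidence` is a hypothetical helper that approximates "was this snippet used?" with token overlap against the response, and the 0.5 threshold is an arbitrary assumption.

```python
import re
from typing import List, Set, Tuple


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def find_evidence(response: str, docs: List[str], min_overlap: float = 0.5) -> List[Tuple[int, str]]:
    """Return (doc_index, snippet) pairs whose tokens mostly appear in the response."""
    response_tokens = _tokens(response)
    evidence = []
    for i, doc in enumerate(docs):
        # Treat each sentence of a retrieved document as a candidate snippet.
        for snippet in re.split(r"(?<=[.!?])\s+", doc):
            snippet_tokens = _tokens(snippet)
            if not snippet_tokens:
                continue
            overlap = len(snippet_tokens & response_tokens) / len(snippet_tokens)
            if overlap >= min_overlap:
                evidence.append((i, snippet.strip()))
    return evidence


docs = [
    "Subscribe to The Batch. Reflection lets the model critique its own output.",
    "Planning breaks a task into steps.",
]
response = "With reflection, the model can critique its own output."

evidence = find_evidence(response, docs)
print(evidence)
```

Returning the document index alongside each snippet makes it possible to highlight the evidence in its source document.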

---

In [1]:
## Specify LLM 
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.2",
    temperature=0,
    verbose=True
)

llm.invoke("How are you?")

AIMessage(content="I'm just a language model, so I don't have feelings or emotions like humans do. However, I'm functioning properly and ready to assist you with any questions or tasks you may have! How can I help you today?", additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2025-12-03T15:45:34.062656Z', 'done': True, 'done_reason': 'stop', 'total_duration': 21391342750, 'load_duration': 3658086167, 'prompt_eval_count': 29, 'prompt_eval_duration': 12826979792, 'eval_count': 47, 'eval_duration': 3510671920, 'logprobs': None, 'model_name': 'llama3.2', 'model_provider': 'ollama'}, id='lc_run--14b6c7c9-7444-458c-8de7-be184513c802-0', usage_metadata={'input_tokens': 29, 'output_tokens': 47, 'total_tokens': 76})

In [3]:
# login to huggingface
import os
from huggingface_hub import login 

hf_token = os.environ['HF_TOKEN']
login(token=hf_token)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [None]:
# for embedding model we'll use sentence-transformers
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')

# sample embedding 
embeddings = embedding_model.embed_query("Hey How are you?")
print(f"Length of embeddings : {len(embeddings)}")
print(f"Embedding : {embeddings[:100]}")

Length of embeddings : 384
Embedding : [-0.013380538672208786, 0.003255972173064947, 0.10806030035018921, 0.08322358131408691, 0.02040085941553116, -0.049066152423620224, 0.0722508355975151, 0.002980925841256976, -0.08823534101247787, 0.016058299690485, -0.03367079421877861, -4.332493062975118e-06, -0.02510129101574421, 0.0007887802203185856, 0.060331884771585464, -0.0415474958717823, 0.07702311128377914, -0.14256997406482697, -0.13958506286144257, 0.06023767963051796, 0.003192346775904298, 0.018982844427227974, 0.02300790697336197, 0.06056844815611839, -0.07911035418510437, -0.05399537831544876, -0.0008475205395370722, 0.03202424943447113, -0.029674910008907318, -0.04484577104449272, -0.10411098599433899, 0.06399180740118027, -0.05713418126106262, -0.02695028856396675, -0.028776653110980988, 0.00333896791562438, -0.0355900302529335, -0.13525626063346863, 0.009469274431467056, 0.0003555373114068061, 0.009924577549099922, -0.0014938903041183949, -0.009747199714183807, -0.002170604653656

## Loading Docs -> Chunking (Preparing for the VectorDB)

This time we'll use `WebBaseLoader` from LangChain to fetch content from URLs.

In [4]:
# we'll use some URLs from Andrew Ng's newsletter 'The Batch'
urls = [
    "https://www.deeplearning.ai/the-batch/how-agents-can-improve-llm-performance/?ref=dl-staging-website.ghost.io",
    "https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-2-reflection/?ref=dl-staging-website.ghost.io",
    "https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-3-tool-use/?ref=dl-staging-website.ghost.io",
    "https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-4-planning/?ref=dl-staging-website.ghost.io",
    "https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-5-multi-agent-collaboration/?ref=dl-staging-website.ghost.io"
]

In [9]:
# use WebBaseLoader to load the content from URLs
from langchain_community.document_loaders import WebBaseLoader

docs = [WebBaseLoader(url).load() for url in urls]

In [17]:
from pprint import pprint

pprint(docs[0][0].page_content)

('Four AI Agent Strategies That Improve GPT-4 and GPT-3.5 Performance✨ New '
 'course! Enroll in Building Coding Agents with Tool ExecutionExplore '
 "CoursesAI NewsletterThe BatchAndrew's LetterData PointsML ResearchBlog✨ AI "
 'Dev x SF 26CommunityForumEventsAmbassadorsAmbassador '
 "SpotlightResourcesMembershipStart LearningWeekly IssuesAndrew's LettersData "
 'PointsML ResearchBusinessScienceCultureHardwareAI CareersAboutSubscribeThe '
 'BatchLettersArticleAgentic Design Patterns Part 1 Four AI agent strategies '
 'that improve GPT-4 and GPT-3.5 performanceLettersTechnical '
 'InsightsPublishedMar 20, 2024Reading time2 min readShareDear friends,I think '
 'AI agent workflows will drive massive AI progress this year ‚Äî perhaps even '
 'more than the next generation of foundation models. This is an important '
 'trend, and I urge everyone who works in AI to pay attention to it.Today, we '
 'mostly use LLMs in zero-shot mode, prompting a model to generate final '
 'output token b

In [22]:
# list of docs
docs_list = [item for sublist in docs for item in sublist]

In [23]:
docs_list

[Document(metadata={'source': 'https://www.deeplearning.ai/the-batch/how-agents-can-improve-llm-performance/?ref=dl-staging-website.ghost.io', 'title': 'Four AI Agent Strategies That Improve GPT-4 and GPT-3.5 Performance', 'description': 'I think AI agent workflows will drive massive AI progress this year ‚Äî perhaps even more than the next generation of foundation models. This is an important...', 'language': 'en'}, page_content='Four AI Agent Strategies That Improve GPT-4 and GPT-3.5 Performance‚ú® New course! Enroll in Building Coding Agents with Tool ExecutionExplore CoursesAI NewsletterThe BatchAndrew\'s LetterData PointsML ResearchBlog‚ú® AI Dev x SF 26CommunityForumEventsAmbassadorsAmbassador SpotlightResourcesMembershipStart LearningWeekly IssuesAndrew\'s LettersData PointsML ResearchBusinessScienceCultureHardwareAI CareersAboutSubscribeThe BatchLettersArticleAgentic Design Patterns Part 1 Four AI agent strategies that improve GPT-4 and GPT-3.5 performanceLettersTechnical Insig

In [36]:
# use RecursiveCharacterTextSplitter to split the documents into overlapping chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=50)
doc_chunks = text_splitter.split_documents(docs_list)

print(f"Total chunks : {len(doc_chunks)}")
pprint(f"Example chunk : {doc_chunks[0]}")

Total chunks : 44
("Example chunk : page_content='Four AI Agent Strategies That Improve GPT-4 "
 'and GPT-3.5 Performance✨ New course! Enroll in Building Coding Agents with '
 "Tool ExecutionExplore CoursesAI NewsletterThe BatchAndrew's LetterData "
 'PointsML ResearchBlog✨ AI Dev x SF '
 '26CommunityForumEventsAmbassadorsAmbassador '
 "SpotlightResourcesMembershipStart LearningWeekly IssuesAndrew's LettersData "
 'PointsML ResearchBusinessScienceCultureHardwareAI CareersAboutSubscribeThe '
 'BatchLettersArticleAgentic Design Patterns Part 1 Four AI agent strategies '
 'that improve GPT-4 and GPT-3.5 performanceLettersTechnical '
 "InsightsPublishedMar 20, 2024Reading time2 min' metadata={'source': "
 "'https://www.deeplearning.ai/the-batch/how-agents-can-improve-llm-performance/?ref=dl-staging-website.ghost.io', "
 "'title': 'Four AI Agent Strategies That Improve GPT-4 and GPT-3.5 "
 "Performance', 'description': 'I think AI agent workflows will drive massive "
 'AI progress this 

## Creating a VectorDB

This time we'll use ChromaDB (locally) with persistence.

In [48]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="reliable_rag",
    embedding_function=embedding_model,
    persist_directory="../persistent_vectordb/chroma_langchain_db1",
)

In [49]:
# adding chunks to our DB
vector_store.add_documents(documents=doc_chunks)

['ada2fdcd-14e2-4204-8979-c64826de3cfe',
 '93ad62a6-57a1-427d-93b4-03fbc504251a',
 '5d0c709b-3ecd-4650-b2ea-b11d6f046b03',
 'd38b5dad-3951-48b5-9232-08732261093c',
 '46a83631-36ae-4624-ac9c-97b9ec63fba6',
 '9aa86f82-76e2-478b-b016-83a9cf4edd83',
 'd0784f62-bdaf-4c12-ae49-3001a9e5eedd',
 '2ca788cd-3239-4c06-a5c1-fb81b8a344ac',
 'e9f305ab-a51f-4871-9ade-67def2c78ce7',
 'caabe9b5-23ab-46b7-862b-b86c4f15c4db',
 'c39af50d-e94a-4dfa-8f4f-12f31dd5ace5',
 'b0c4080c-5369-4fde-89b6-630d458ebf15',
 'b4e749f3-b370-4f98-ac95-6d301855d29f',
 '495e8702-c89a-4d91-8826-f99503483b18',
 'd4301986-987f-444c-984b-0a4567f78d5a',
 '4bb3fe33-44c0-4fc4-9e1c-207cea6bbf20',
 'e22c4bb4-b7e6-45b3-bbac-9267fad8b748',
 '1badaa58-2754-47fb-a530-30704ce03929',
 '561578ec-4a9e-4ba3-a1d8-7cb664fd272d',
 '71f69bf0-a372-4ec2-9b74-fd5168bcf946',
 'b66aaa4b-69ee-4767-a4f9-28044d2ac9b8',
 '1576eba7-f918-4617-91be-662fc36c1680',
 '3f70d674-c35f-4e02-86c0-80bbab2117b8',
 'a62c3db7-8a80-4238-a335-5ae73f0397c8',
 'a9ca5b6d-0edd-

## 🔥 What is MMR?

**MMR (Maximal Marginal Relevance)** is a re-ranking strategy used in retrieval systems to pick documents that are:

- **Highly relevant to the query** (Query Relevance)
- **Not redundant with each other** (Document Diversity)



This ensures the final retrieved set is *diverse* and avoids repetitive or overlapping chunks.

---

### 🔹 Why MMR is useful

Normal vector search often returns many chunks that are very similar to each other.  
MMR fixes this by balancing:

- **How close a chunk is to the query** (Relevance term = Information Gain)
- **How different it is from already selected chunks** (Diversity term = Redundancy Penalty)
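
This balance is commonly written as the selection rule below. The notation is introduced here for illustration: $q$ is the query, $R$ the candidate chunks returned by the vector search, $S$ the chunks already selected, $\mathrm{sim}$ a similarity measure such as cosine similarity, and $\lambda \in [0, 1]$ the trade-off weight.

$$
d^{*} = \arg\max_{d_i \in R \setminus S} \left[ \lambda \, \mathrm{sim}(d_i, q) - (1 - \lambda) \max_{d_j \in S} \mathrm{sim}(d_i, d_j) \right]
$$

With $\lambda = 1$ the rule reduces to plain similarity ranking; smaller values weight diversity more heavily.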

---

### 🔹 How MMR works (simple view)

1. Retrieve the top **`fetch_k`** most similar documents using vector similarity.
2. Apply MMR re-ranking to select the best **`k`** documents that maximize:
   - relevance to the query  
   - diversity among selected documents  

---

### 🔹 In simple words

MMR picks:
1. The most relevant document.
2. The next document that adds the most **new information** (not repetitive).
3. Continues until it selects **k** diverse + relevant chunks.
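
The selection loop described above can be sketched in plain Python. This is a minimal illustration of the standard MMR scoring rule, not the code that Chroma or LangChain runs internally; `mmr_select`, `cosine`, the toy 2-D vectors, and the `lambda_mult` weight of 0.5 are assumptions made for the example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    k: int = 2,
    fetch_k: int = 5,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return the indices of k documents chosen by Maximal Marginal Relevance."""
    # Step 1: shortlist the fetch_k documents most similar to the query.
    sims = [cosine(query_vec, d) for d in doc_vecs]
    candidates = sorted(range(len(doc_vecs)), key=lambda i: sims[i], reverse=True)[:fetch_k]

    # Step 2: greedily pick the candidate with the best relevance-minus-redundancy score.
    selected: List[int] = []
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * sims[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [
    [0.9, 0.1],   # 0: very close to the query
    [0.9, 0.12],  # 1: near-duplicate of doc 0
    [0.8, -0.6],  # 2: less similar to the query, but different from doc 0
    [0.0, 1.0],   # 3: unrelated to the query
]

picked = mmr_select(query, docs, k=2, fetch_k=3, lambda_mult=0.5)
print(picked)  # [0, 2]
```

Plain similarity ranking would return documents 0 and 1, which are near-duplicates; MMR keeps document 0 and selects document 2 instead.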


In [50]:
retriever = vector_store.as_retriever(
    search_type="mmr", search_kwargs={"k": 2, "fetch_k": 5}
)

In [51]:
# sample testing of retriever
sample_context = retriever.invoke("what are the differnt kind of agentic design patterns?")

for i, doc in enumerate(sample_context):
    print(f"Doc : {i+1}")
    print(f"Content : {doc.page_content}")
    print("-"*89)

Doc : 1
Content : Agentic Design Patterns Part 3: Tool Use✨ New course! Enroll in Building Coding Agents with Tool ExecutionExplore CoursesAI NewsletterThe BatchAndrew's LetterData PointsML ResearchBlog✨ AI Dev x SF 26CommunityForumEventsAmbassadorsAmbassador SpotlightResourcesMembershipStart LearningWeekly IssuesAndrew's LettersData PointsML ResearchBusinessScienceCultureHardwareAI CareersAboutSubscribeThe BatchLettersArticleAgentic Design Patterns Part 3, Tool Use How large language models can act as agents by taking advantage of external tools for search, code execution, productivity, ad
-----------------------------------------------------------------------------------------
Doc : 2
Content : performance"Read "Agentic Design Patterns Part 3, Tool Use"Read "Agentic Design Patterns Part 4: Planning"Read "Agentic Design Patterns Part 5: Multi-Agent Collaboration"ShareSubscribe to The BatchStay updated with weekly AI News and Insights delivered to your inboxCoursesThe BatchCommunit

---

### Checking Document Relevancy (Query <-> Retrieved Context)

In [52]:
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from pydantic import BaseModel, Field
from typing import Annotated

# data validation class (LLM output)
class ClassifyContext(BaseModel):
    """
    When context is retrieved for a query, its relevancy value (True or False) is validated by this class.
    """
    value: Annotated[bool, Field(..., description="'True' or 'False', whether context is relevant to query")]

# configure our llm to produce this structured output
structured_llm_grader = llm.with_structured_output(ClassifyContext)

# Prompt
prompt = PromptTemplate.from_template(
    "You are a grader assessing the relevance of a retrieved document : {context} to a user question : {query}. "
    "If the document contains keyword(s) or semantic meaning related to the user question, grade it as relevant. "
    "It does not need to be a stringent test. The goal is to filter out erroneous retrievals. Give a boolean score, 'True' or 'False', to indicate whether the document is relevant to the question."
)

relevancy_check_chain = prompt | structured_llm_grader

In [54]:
# checking relevancy of retrieved docs 
query = "what are the differnt kind of agentic design patterns?"
retrieved_docs = retriever.invoke(query)

print(f"Query : {query}")
print("-"*89)

# we'll collect the context that is truly related to the query as we go
true_context = ""

for doc in retrieved_docs:
    print(f"Context : {doc.page_content}")
    llm_score = relevancy_check_chain.invoke({'query' : query, 'context' : doc.page_content})
    print(f"Response : {llm_score}")
    print("-"*89)
    # store true context
    if llm_score.value:
        true_context += doc.page_content
        true_context += "\n"

Query : what are the differnt kind of agentic design patterns?
-----------------------------------------------------------------------------------------
Context : Agentic Design Patterns Part 3: Tool Use✨ New course! Enroll in Building Coding Agents with Tool ExecutionExplore CoursesAI NewsletterThe BatchAndrew's LetterData PointsML ResearchBlog✨ AI Dev x SF 26CommunityForumEventsAmbassadorsAmbassador SpotlightResourcesMembershipStart LearningWeekly IssuesAndrew's LettersData PointsML ResearchBusinessScienceCultureHardwareAI CareersAboutSubscribeThe BatchLettersArticleAgentic Design Patterns Part 3, Tool Use How large language models can act as agents by taking advantage of external tools for search, code execution, productivity, ad
Response : value=True
-----------------------------------------------------------------------------------------
Context : performance"Read "Agentic Design Patterns Part 3, Tool Use"Read "Agentic Design Patterns Part 4: Planning"Read "Agentic Design Patt

In [64]:
# now create a function that first verifies whether the retrieved context is really helpful for answering the query
import time

def get_and_validate_retrieved_context(retriever, query):
    # get the context
    start = time.time()
    retrieved_context = retriever.invoke(query)
    mid = time.time()
    print(f"Time take to retrieve context : {mid-start} sec")
    print("-"*89)

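    # keep only the documents that the grader marks as relevant
    true_context_docs = []

    # iterate over the documents retrieved in this call (not the notebook-level `retrieved_docs`)
    for doc in retrieved_context:
        is_relevant = relevancy_check_chain.invoke({'query' : query, 'context' : doc.page_content})
        if is_relevant.value:
            true_context_docs.append(doc.page_content)
    
    end = time.time()

    print(f"Time take to validate retrieved context : {end-mid} sec")
    print("-"*89)
        
    # return the list of relevant page contents built above
    return true_context_docs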

In [65]:
# LLM to generate response from query and context
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate

prompt_for_response = PromptTemplate.from_template(
    "You are a helpful assistant. Look at the user query : <query>{query}</query> and " \
    "try to answer it in 2-3 lines using context : <context>{context}</context> " \
    "If no context is provided just respond with No relevant docs found."
)

response_chain = prompt_for_response | llm

In [66]:
# let's try this flow
if __name__ == "__main__":
    query = "what are the differnt kind of agentic design patterns?"
    print(f"Query : {query}")
    print("-"*89)
    context_list = get_and_validate_retrieved_context(retriever, query)
    context = ""
    for doc in context_list:
        context += doc 
        context += "\n\n"
    response = response_chain.invoke({'context' : context, 'query' : query})
    print(f"Response : {response.content}")

Query : what are the differnt kind of agentic design patterns?
-----------------------------------------------------------------------------------------
Time take to retrieve context : 0.4318091869354248 sec
-----------------------------------------------------------------------------------------
Time take to validate retrieved context : 21.599080801010132 sec
-----------------------------------------------------------------------------------------
Response : Based on the Agentic Design Patterns, here are some of the main types:

1. Agent-based systems: These are systems that use multiple autonomous agents to achieve a common goal.
2. Multi-agent systems: These are systems that consist of multiple agents that interact with each other to achieve a common goal.
3. Autonomous intelligent agents: These are agents that have the ability to make decisions and take actions without human intervention.

These patterns are used in various fields such as artificial intelligence, robotics, and comp

---

### Now we need to do hallucination checks (Generated answer <-> context)

In [69]:
from langchain_core.prompts import PromptTemplate
from pydantic import BaseModel, Field
from typing import Annotated

# LLM response validation
class HallucinationValidator(BaseModel):
    """
    Flags whether the generated response contains content that is not supported by the context.
    """
    is_hallucinated: Annotated[bool, Field(..., description="'True' if the generated response contains claims not supported by the context, otherwise 'False'.")]

# configure our llm to produce this structured output
structured_llm_hallucination_checker = llm.with_structured_output(HallucinationValidator)

answer_hallucination_check_prompt = PromptTemplate.from_template(
    "You are provided with a generated response : {generated_response} and some context : {context}. Check whether every statement in the generated response is supported by the context and that it contains no unsupported content. If the generated response is hallucinated return 'True', otherwise 'False'."
)

hallucination_checking_chain = answer_hallucination_check_prompt | structured_llm_hallucination_checker

In [70]:
# let's try this flow
if __name__ == "__main__":
    query = "what are the differnt kind of agentic design patterns?"
    print(f"Query : {query}")
    print("-"*89)
    context_list = get_and_validate_retrieved_context(retriever, query)
    context = ""
    for doc in context_list:
        context += doc 
        context += "\n\n"
    start = time.time()
    response = response_chain.invoke({'context' : context, 'query' : query})
    print(f"Time taken to generate response : {time.time() - start} sec.")
    print("-"*89)

    # check for hallucinations
    start = time.time()
    hallucination_validator = hallucination_checking_chain.invoke({'generated_response' : response.content, 'context' : context})
    print(f"Time take to check hallucination : {time.time() - start} sec")
    print("-"*89)
    # True means the checker found unsupported content in the response
    print(f"Answer is Hallucinated? : {hallucination_validator.is_hallucinated}")
    print("-"*89)
    print(f"Response : {response.content}")

Query : what are the differnt kind of agentic design patterns?
-----------------------------------------------------------------------------------------
Time take to retrieve context : 0.4074101448059082 sec
-----------------------------------------------------------------------------------------
Time take to validate retrieved context : 18.18198299407959 sec
-----------------------------------------------------------------------------------------
Time taken to generate response : 32.927732944488525 sec.
-----------------------------------------------------------------------------------------
Time take to check hallucination : 19.928250074386597 sec
-----------------------------------------------------------------------------------------
Answer is Hallucinated? : False
-----------------------------------------------------------------------------------------
Response : Based on the Agentic Design Patterns, here are some of the main types:

1. Agent-based systems: These are systems that 