# Advanced Retrieval Augmented Generation with LangChain

## Initial Setup

In [76]:
# !uv pip install sentence-transformers

In [77]:
import os, json, re, getpass, warnings
import numpy as np
from dotenv import load_dotenv
from uuid import uuid4
from IPython.display import display, Markdown

In [78]:
warnings.filterwarnings('ignore')
load_dotenv(override=True)

True

In [79]:
#Check for Groq API Key
if "GROQ_API_KEY" not in os.environ:
    os.environ["GROQ_API_KEY"] = getpass.getpass("GROQ API Key: ")

In [80]:
#Set Hugging Face token (as we will be using some of Hugging Face's functionalities)
from huggingface_hub.hf_api import HfFolder
if "HF_TOKEN" not in os.environ:
    os.environ["HF_TOKEN"] = getpass.getpass("Hugging Face token: ")
HfFolder.save_token(os.environ["HF_TOKEN"])

## Defining Components

In [81]:
# Question
question = "What is the PGP AI & DS at Jio Institute all about?"

### Chat Model

In [82]:
from langchain.chat_models import init_chat_model

model_name = "llama-3.1-8b-instant"
llm = init_chat_model(model_name, model_provider="groq") #Other Llama alternatives available are llama3-8b-8192, llama-3.3-70b-versatile

### Embedding Model

In [39]:
!ollama pull llama3.1:8b

[?2026h[?25l[1Gpulling manifest ⠋ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠙ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠹ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠸ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠼ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠴ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠦ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠧ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠇ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠏ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠋ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠙ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest [K
pulling 667b0c1932bc: 100% ▕██████████████████▏ 4.9 GB
pulling 948af2743fc7: 100% ▕██████████████████▏ 1.5 KB
pulling 0ba8f0e314b4: 100% ▕██████████████████▏  12 KB
pulling 56bb8bd477a5: 100% ▕██████████████████▏   96 B

In [40]:
from langchain_ollama import OllamaEmbeddings

embeddings_model = OllamaEmbeddings(model="llama3.1:8b")

### Vector Store

In [41]:
#Import library
from langchain_chroma import Chroma

In [42]:
#Create a vector store
vector_store_chroma = Chroma(
    collection_name="session-4",
    embedding_function=embeddings_model,
    persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

## Indexing

### Loading Documents

In [43]:
import bs4 #import Beautiful Soup
from langchain_community.document_loaders import WebBaseLoader

In [44]:
# Only keep the main content from the full HTML
bs4_strainer = bs4.SoupStrainer(class_=("node__content clearfix", "col-md-9 pl-lg-5"))
loader = WebBaseLoader(
    web_paths=("https://www.jioinstitute.edu.in/about/",
    "https://www.jioinstitute.edu.in/academics/artificial-intelligence-data-science"),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load()

assert len(docs) == 2

print(f"Total characters: {len(docs[0].page_content)}")

Total characters: 4210


In [45]:
# --- Post-processing to clean up excessive newlines and whitespace ---
for doc in docs:
    if doc.page_content:
        # 1. Collapse runs of blank (or whitespace-only) lines into a single blank line
        cleaned_content = re.sub(r'\n\s*\n', '\n\n', doc.page_content)
        # 2. Replace multiple spaces with a single space
        cleaned_content = re.sub(r' {2,}', ' ', cleaned_content)
        # 3. Strip leading/trailing whitespace from each line
        cleaned_content = '\n'.join(line.strip() for line in cleaned_content.split('\n'))
        # 4. Remove leading/trailing whitespace from the whole string
        cleaned_content = cleaned_content.strip()

        doc.page_content = cleaned_content
# --- End of post-processing ---

In [46]:
print(f"Total characters: {len(docs[0].page_content)}")
print("\n--- Cleaned Content Snippet ---")
print(docs[0].page_content[:1000]) # Print a snippet to verify

Total characters: 3548

--- Cleaned Content Snippet ---
About Us

Jio Institute is a multidisciplinary higher education institute set up as a philanthropic initiative by the Reliance Group. The Institute is dedicated to the pursuit of excellence by bringing together global scholars and thought leaders and providing an enriching student experience through world-class education, and a culture of research and innovation.

Our Story
Pursuit of excellence in academics, research and innovation.
We stand at the confluence of the best higher education practices from India and the world. The institute aims to nurture students’ aspirations, and provide a platform to their entrepreneurial spirit.

Read more

Our Vision
In sync with global aspirations. In step with changing times.
We envisage to be a world-class higher education institute through our multi-disciplinary academic programmes, robust research endeavours and a culture of innovation and entrepreneurship.

Read more

Growth Plan
Well tho

### Splitting Documents

In [47]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # chunk size (characters)
    chunk_overlap=200,  # chunk overlap (characters)
    add_start_index=True,  # track index in original document
)
all_splits = text_splitter.split_documents(docs)

print(f"Split blog post into {len(all_splits)} sub-documents.")

Split blog post into 62 sub-documents.
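
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of a fixed-window splitter. It is a simplification, not the splitter's actual algorithm: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and lines, whereas the hypothetical `sliding_chunks` helper below cuts at fixed character offsets. It only illustrates how consecutive chunks share `chunk_overlap` characters and how a start index can be tracked.

```python
from typing import List, Tuple


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[Tuple[int, str]]:
    """Cut text into fixed-size windows; return (start_index, chunk) pairs."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[Tuple[int, str]] = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample_text = "abcdefghijklmnopqrstuvwxyz"
sample_chunks = sliding_chunks(sample_text, chunk_size=10, chunk_overlap=4)
for start, chunk in sample_chunks:
    print(start, chunk)
```

Each chunk begins with the last four characters of the previous one, which is the role `chunk_overlap` plays: context that straddles a boundary appears in both neighbouring chunks.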


In [20]:
for i in range(len(all_splits)):
    print(f"\n--- Split{i} ---\n")
    print(all_splits[i].page_content)


--- Split0 ---

Secure Deployment : VPC | On-Prem | Air-GappedAI Gateway: Fast, Scalable, Enterprise-ReadyEnterprise-Ready AI Gateway for secure, high-performance LLM access, observability, and orchestration.See detailed pricingGet Started for FreeAI Gateway: Unified LLM API AccessSimplify your GenAI stack with a single AI Gateway that integrates all major models.

Connect to OpenAI, Claude, Gemini, Groq, Mistral, and 250+ LLMs through one AI Gateway API

Use the platform to support chat, completion, embedding, and reranking model types.

Centralize API key management and team authentication in one place.

Orchestrate multi-model workloads seamlessly through your infrastructure.Read MoreAI Gateway Observability

Monitor token usage, latency, error rates, and request volumes across your system.

Store and inspect full request/response logs centrally to ensure compliance and simplify debugging.

Tag traffic with metadata like user ID, team, or environment to gain granular insights.

---

### Storing in Vector Store

In [21]:
#Add first 10 chunks to vector DB
uuids = [str(uuid4()) for _ in range(len(all_splits))] #Universally unique identifier
vector_store_chroma.add_documents(documents=all_splits[:10], ids=uuids[:10])

['71535f94-9ea0-408f-b161-82511c4f2f35',
 'bf8ae97d-95b6-4488-b7c1-c01e56d1b477',
 '22d7a904-19e6-46b7-87f7-c0cdfca18d42',
 'e0595d54-3eda-4bcd-98a2-9a4dafc26d49',
 'f293ff4c-e6db-4c63-86ef-27c6119ee3b5',
 'af769fe8-3114-4311-8ed6-ed3e10073295']

## Advanced Retrieval and Reranking Strategies

### Multi Query Retrieval

Distance-based retrieval can return different results when the query wording changes slightly, or when the embeddings do not capture the semantics of the data well. Prompt engineering or tuning can address these problems manually, but it is tedious.

The [`MultiQueryRetriever`](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.multi_query.MultiQueryRetriever.html) automates the process of prompt tuning by using an LLM to generate multiple queries from different perspectives for a given user input query. For each query, it retrieves a set of relevant documents and takes the unique union across all queries to get a larger set of potentially relevant documents.
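
The "unique union" step can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `CORPUS`, `toy_retrieve`, and `multi_query_retrieve` are hypothetical names, the corpus is made-up data, and word overlap stands in for embedding similarity. The query variants are hard-coded here, whereas the retriever would obtain them from an LLM.

```python
import re
from typing import Callable, Dict, List

# Toy corpus standing in for the vector store (made-up data).
CORPUS: Dict[str, str] = {
    "d1": "The programme covers machine learning and data science foundations.",
    "d2": "Admissions require a bachelor's degree and an entrance test.",
    "d3": "The campus offers housing and research facilities.",
}


def tokens(text: str) -> set:
    """Lowercased word tokens of a string."""
    return set(re.findall(r"\w+", text.lower()))


def toy_retrieve(query: str, k: int = 2) -> List[str]:
    """Return the k document ids with the largest word overlap with the query."""
    ranked = sorted(
        CORPUS,
        key=lambda doc_id: len(tokens(query) & tokens(CORPUS[doc_id])),
        reverse=True,
    )
    return ranked[:k]


def multi_query_retrieve(queries: List[str], retrieve: Callable[[str], List[str]]) -> List[str]:
    """Run every query and merge the results, keeping each document only once."""
    seen = set()
    merged: List[str] = []
    for query in queries:
        for doc_id in retrieve(query):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


query_variants = [
    "What does the data science programme cover?",
    "What are the admissions requirements and entrance test?",
]
merged_ids = multi_query_retrieve(query_variants, toy_retrieve)
print(merged_ids)
```

Each query alone returns two documents, but the merged list covers all three, which is the reason for issuing several phrasings of the same question.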

In [22]:
#Import libraries
from langchain.retrievers.multi_query import MultiQueryRetriever
import logging # Set logging for the queries

In [23]:
#Initialize retriever
retriever = vector_store_chroma.as_retriever(search_type="similarity",
                                                search_kwargs={"k": 2})

In [24]:
mq_retriever = MultiQueryRetriever.from_llm(
    retriever=retriever, 
    llm=llm,
    include_original=True
)

In [25]:
logging.basicConfig()
# so we can see what queries are generated by the LLM
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [26]:
docs = mq_retriever.invoke(question)
docs

INFO:langchain.retrievers.multi_query:Generated queries: ['Here are three alternative versions of the user question to retrieve relevant documents from a vector database:', "What is TrueFoundry's mission and values?", ' ', 'What does TrueFoundry specialize in or offer to its users?', 'What key aspects or features of TrueFoundry should I know about?', 'These alternative questions aim to capture different nuances of the original question, allowing the vector database to return relevant documents that may not have been retrieved by a single search query.']


[Document(id='af769fe8-3114-4311-8ed6-ed3e10073295', metadata={'start_index': 0, 'source': 'https://www.truefoundry.com'}, page_content='Orchestrate Agentic AI with AI GatewayEnable intelligent multi-step reasoning, tool usage, and memory with full control and visibility across your AI agents and workflows.AI GatewayManage agent memory, tool orchestration, and action planning through a centralized protocol that supports complex, context-aware workflows.Learn More\n\nMCP & Agents RegistryMaintain a structured, discoverable registry of tools and APIs accessible to agents, complete with schema validation and access control.Learn More\n\nPrompt Lifecycle ManagementVersion, manage, and monitor prompts to ensure high-quality, repeatable behavior across agents and use cases.Learn More'),
 Document(id='bf8ae97d-95b6-4488-b7c1-c01e56d1b477', metadata={'start_index': 790, 'source': 'https://www.truefoundry.com/ai-gateway'}, page_content='Store and inspect full request/response logs centrally to 
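
The log above shows that the LLM's preamble, a blank line, and its closing remark were all treated as queries alongside the three real questions. One possible cleanup, sketched here as a plain heuristic and not as part of `MultiQueryRetriever`, is to keep only the lines that look like questions. The function name `keep_question_lines` is hypothetical, and wiring it into the retriever (for example through a custom output parser) is not shown.

```python
from typing import List


def keep_question_lines(lines: List[str]) -> List[str]:
    """Heuristic: keep only non-empty lines that end with a question mark."""
    stripped = [line.strip() for line in lines]
    return [line for line in stripped if line.endswith("?")]


# Lines copied from the logged "Generated queries" output above.
generated = [
    "Here are three alternative versions of the user question to retrieve relevant documents from a vector database:",
    "What is TrueFoundry's mission and values?",
    " ",
    "What does TrueFoundry specialize in or offer to its users?",
    "What key aspects or features of TrueFoundry should I know about?",
    "These alternative questions aim to capture different nuances of the original question, allowing the vector database to return relevant documents that may not have been retrieved by a single search query.",
]

clean_queries = keep_question_lines(generated)
print(clean_queries)
```

The heuristic would drop a legitimate query that is not phrased as a question, so it is a trade-off and not a general fix.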

### Chained Retrieval with Reranker

This strategy chains retrievers sequentially to narrow the results down to the most relevant documents. The flow is:

*Similarity Retrieval → Reranker Model Retrieval*

**What are rerankers?**

- Rerankers are fine-tuned cross-encoder transformer models
- These models take a (query, document) pair as input and return a relevance score
- Models that are more recent and fine-tuned on more pairs usually perform better
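
The two-stage flow can be sketched without any model. In this illustration, `toy_pair_score` is a hypothetical stand-in for a cross-encoder: it scores a (query, document) pair by token overlap, whereas a real reranker would run both texts through a transformer together. The candidate texts are made-up data, and `rerank` is an assumed helper name.

```python
import re
from typing import Callable, List, Tuple


def toy_pair_score(query: str, document: str) -> float:
    """Stand-in for a cross-encoder: fraction of query tokens found in the document."""
    q_tokens = set(re.findall(r"\w+", query.lower()))
    d_tokens = set(re.findall(r"\w+", document.lower()))
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 2,
) -> List[Tuple[float, str]]:
    """Score every (query, candidate) pair and keep the top_n highest-scoring ones."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Pretend these came back from the first-stage similarity retriever (k=3).
first_stage = [
    "The campus offers housing and sports facilities.",
    "The data science programme includes a capstone project.",
    "Faculty publish research on data science topics.",
]
top_docs = rerank("How long is the data science programme?", first_stage, toy_pair_score)
for score, doc in top_docs:
    print(round(score, 2), doc)
```

The first stage is cheap and recall-oriented, while the second stage scores each pair individually, so it is usually applied only to the handful of candidates the first stage returns.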

In [27]:
#Import libraries
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers import ContextualCompressionRetriever

In [28]:
# Retriever 1 - simple embedding-similarity retriever (the distance metric is whatever the Chroma collection is configured with, not necessarily cosine)
retriever = vector_store_chroma.as_retriever(search_type="similarity",
                                              search_kwargs={"k": 3})

In [30]:
# Download an open-source reranker model - cross-encoder/qnli-electra-base
reranker = HuggingFaceCrossEncoder(model_name="cross-encoder/qnli-electra-base")
reranker_compressor = CrossEncoderReranker(model=reranker, top_n=2)


In [31]:
# Retriever 2 - Uses a Reranker model to rerank retrieval results from the previous retriever
final_retriever = ContextualCompressionRetriever(
    base_compressor=reranker_compressor,
    base_retriever=retriever
)

In [32]:
docs = final_retriever.invoke(question)
docs

[Document(id='bf8ae97d-95b6-4488-b7c1-c01e56d1b477', metadata={'source': 'https://www.truefoundry.com/ai-gateway', 'start_index': 790}, page_content='Store and inspect full request/response logs centrally to ensure compliance and simplify debugging.\n\nTag traffic with metadata like user ID, team, or environment to gain granular insights.\n\nFilter logs and metrics by model, team, or geography to quickly pinpoint root causes and accelerate resolution.Read MoreQuota & Access Control via AI GatewayEnforce governance, control costs, and reduce risk with consistent policy management.\n\nApply rate limits per user, service, or endpoint.\n\nSet cost-based or token-based quotas using metadata filters.\n\nUse role-based access control (RBAC) to isolate and manage usage.\n\nGovern service accounts and agent workloads at scale through centralized rules.Read MoreEnsuring predictable usage, strong access boundaries, and scalable team-level governance for your GenAI infrastructure.Low-Latency Infer

## Context Compression Strategies

Here, we'll explore **LLM prompt-based context compression** strategies. The context compression can happen in the form of:

- **Extractor**: Removes the parts of each retrieved document that are not relevant to the query by extracting only the relevant passages

- **Filter**: Drops documents that are not relevant to the query, but leaves the content of the remaining documents unchanged

For further reading, see [Microsoft LLMLingua Prompt Compression](https://www.microsoft.com/en-us/research/blog/llmlingua-innovating-llm-efficiency-with-prompt-compression/).
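
The difference between the two modes can be sketched in plain Python. This is a toy illustration: `is_relevant` is a hypothetical keyword check standing in for the LLM's judgement, the helper names `extract_relevant` and `filter_relevant` are assumptions, and the documents are made-up data.

```python
import re
from typing import List


def is_relevant(query: str, text: str) -> bool:
    """Toy relevance check: the text shares a keyword (longer than 3 chars) with the query."""
    keywords = {w for w in re.findall(r"\w+", query.lower()) if len(w) > 3}
    return bool(keywords & set(re.findall(r"\w+", text.lower())))


def extract_relevant(query: str, docs: List[str]) -> List[str]:
    """Extractor: keep only the relevant sentences; drop documents left empty."""
    compressed = []
    for doc in docs:
        kept = [s for s in doc.split(". ") if is_relevant(query, s)]
        if kept:
            compressed.append(". ".join(kept))
    return compressed


def filter_relevant(query: str, docs: List[str]) -> List[str]:
    """Filter: keep or drop whole documents; surviving documents are unchanged."""
    return [doc for doc in docs if is_relevant(query, doc)]


retrieved = [
    "The programme teaches machine learning. The cafeteria opens at noon",
    "Parking is available near the main gate",
]
toy_query = "What does the programme teach about machine learning?"

extracted = extract_relevant(toy_query, retrieved)
filtered = filter_relevant(toy_query, retrieved)
print(extracted)
print(filtered)
```

Both modes drop the parking document, but only the extractor also trims the cafeteria sentence from the first one.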

### LLMChainExtractor

Here we look at `LLMChainExtractor`, which will iterate over the initially returned documents and extract from each only the content that is relevant to the query. Totally irrelevant documents might also be dropped.

In [33]:
#Import libraries
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [34]:
#Initialize retriever
retriever = vector_store_chroma.as_retriever(search_type="similarity",
                                              search_kwargs={"k": 3})

In [35]:
# Extracts from each document only the content that is relevant to the query
compressor = LLMChainExtractor.from_llm(llm=llm)

In [36]:
# Retrieves the documents similar to query and then applies the compressor
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

In [37]:
docs = compression_retriever.invoke(question)
docs

[Document(metadata={'start_index': 0, 'source': 'https://www.truefoundry.com'}, page_content='Orchestrate Agentic AI with AI Gateway\nEnable intelligent multi-step reasoning, tool usage, and memory with full control and visibility across your AI agents and workflows.\n \nAI Gateway\nManage agent memory, tool orchestration, and action planning through a centralized protocol that supports complex, context-aware workflows.\nLearn More'),
 Document(metadata={'start_index': 790, 'source': 'https://www.truefoundry.com/ai-gateway'}, page_content='>>> Store and inspect full request/response logs centrally to ensure compliance and simplify debugging.\n>>> Tag traffic with metadata like user ID, team, or environment to gain granular insights.\n>>> Quota & Access Control via AI Gateway\n>>> Enforce governance, control costs, and reduce risk with consistent policy management.\n>>> Apply rate limits per user, service, or endpoint.\n>>> Set cost-based or token-based quotas using metadata filters.\n>

### LLMChainFilter

The `LLMChainFilter` is a slightly simpler but more robust compressor that uses an LLM chain to decide which of the initially retrieved documents to filter out and which ones to return, without manipulating the document contents.

In [38]:
#Import library
from langchain.retrievers.document_compressors import LLMChainFilter

In [39]:
#Initialize retriever
retriever = vector_store_chroma.as_retriever(search_type="similarity",
                                              search_kwargs={"k": 3})

In [40]:
# Decides which of the initially retrieved documents to filter out and which ones to return
_filter = LLMChainFilter.from_llm(llm=llm)

In [41]:
# Retrieves the documents similar to query and then applies the filter
compression_retriever = ContextualCompressionRetriever(
    base_compressor=_filter, base_retriever=retriever
)

In [42]:
docs = compression_retriever.invoke(question)
docs

[Document(id='af769fe8-3114-4311-8ed6-ed3e10073295', metadata={'source': 'https://www.truefoundry.com', 'start_index': 0}, page_content='Orchestrate Agentic AI with AI GatewayEnable intelligent multi-step reasoning, tool usage, and memory with full control and visibility across your AI agents and workflows.AI GatewayManage agent memory, tool orchestration, and action planning through a centralized protocol that supports complex, context-aware workflows.Learn More\n\nMCP & Agents RegistryMaintain a structured, discoverable registry of tools and APIs accessible to agents, complete with schema validation and access control.Learn More\n\nPrompt Lifecycle ManagementVersion, manage, and monitor prompts to ensure high-quality, repeatable behavior across agents and use cases.Learn More'),
 Document(id='bf8ae97d-95b6-4488-b7c1-c01e56d1b477', metadata={'start_index': 790, 'source': 'https://www.truefoundry.com/ai-gateway'}, page_content='Store and inspect full request/response logs centrally to 