# Advanced Retrieval Augmented Generation with LangChain

## Initial Setup

In [76]:
# !uv pip install sentence-transformers

In [77]:
import os, json, re, getpass, warnings
import numpy as np
from dotenv import load_dotenv
from uuid import uuid4
from IPython.display import display, Markdown

In [78]:
warnings.filterwarnings('ignore')
load_dotenv(override=True)

True

In [79]:
#Check for Groq API Key
if "GROQ_API_KEY" not in os.environ:
    os.environ["GROQ_API_KEY"] = getpass.getpass("GROQ API Key: ")

In [80]:
# Set the Hugging Face token (we will use some of Hugging Face's functionality later)
from huggingface_hub.hf_api import HfFolder
HfFolder.save_token(os.environ["HF_TOKEN"])

## Defining Components

In [81]:
# Question
question = "What is the PGP AI & DS at Jio Institute all about?"

### Chat Model

In [82]:
from langchain.chat_models import init_chat_model

model_name = "llama-3.1-8b-instant"
llm = init_chat_model(model_name, model_provider="groq") #Other Llama alternatives available are llama3-8b-8192, llama-3.3-70b-versatile

### Embedding Model

In [39]:
!ollama pull llama3.1:8b

pulling manifest
pulling 667b0c1932bc: 100% ▕██████████████████▏ 4.9 GB
pulling 948af2743fc7: 100% ▕██████████████████▏ 1.5 KB
pulling 0ba8f0e314b4: 100% ▕██████████████████▏  12 KB
pulling 56bb8bd477a5: 100% ▕██████████████████▏   96 B

In [40]:
from langchain_ollama import OllamaEmbeddings

embeddings_model = OllamaEmbeddings(model="llama3.1:8b")

### Vector Store

In [41]:
#Import library
from langchain_chroma import Chroma

In [42]:
#Create a vector store
vector_store_chroma = Chroma(
    collection_name="session-4",
    embedding_function=embeddings_model,
    persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

## Indexing

### Loading Documents

In [43]:
import bs4 #import Beautiful Soup
from langchain_community.document_loaders import WebBaseLoader

In [44]:
# Only keep the main content from the full HTML
bs4_strainer = bs4.SoupStrainer(class_=("node__content clearfix", "col-md-9 pl-lg-5"))
loader = WebBaseLoader(
    web_paths=("https://www.jioinstitute.edu.in/about/",
    "https://www.jioinstitute.edu.in/academics/artificial-intelligence-data-science"),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load()

assert len(docs) == 2

print(f"Total characters: {len(docs[0].page_content)}")

Total characters: 4210


In [45]:
# --- Post-processing to clean up excessive newlines and whitespace ---
for i, doc in enumerate(docs):
    if doc.page_content:
        # 1. Collapse runs of blank (or whitespace-only) lines into a single blank line
        cleaned_content = re.sub(r'\n\s*\n', '\n\n', doc.page_content)
        # 2. Replace multiple spaces with a single space
        cleaned_content = re.sub(r' {2,}', ' ', cleaned_content)
        # 3. Strip leading/trailing whitespace from each line
        cleaned_content = '\n'.join([line.strip() for line in cleaned_content.split('\n')])
        # 4. Remove leading/trailing whitespace from the whole string
        cleaned_content = cleaned_content.strip()
    
        docs[i].page_content = cleaned_content
# --- End of post-processing ---

In [46]:
print(f"Total characters: {len(docs[0].page_content)}")
print("\n--- Cleaned Content Snippet ---")
print(docs[0].page_content[:1000]) # Print a snippet to verify

Total characters: 3548

--- Cleaned Content Snippet ---
About Us

Jio Institute is a multidisciplinary higher education institute set up as a philanthropic initiative by the Reliance Group. The Institute is dedicated to the pursuit of excellence by bringing together global scholars and thought leaders and providing an enriching student experience through world-class education, and a culture of research and innovation.

Our Story
Pursuit of excellence in academics, research and innovation.
We stand at the confluence of the best higher education practices from India and the world. The institute aims to nurture students’ aspirations, and provide a platform to their entrepreneurial spirit.

Read more

Our Vision
In sync with global aspirations. In step with changing times.
We envisage to be a world-class higher education institute through our multi-disciplinary academic programmes, robust research endeavours and a culture of innovation and entrepreneurship.

Read more

Growth Plan
Well tho

### Splitting Documents

In [47]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # chunk size (characters)
    chunk_overlap=200,  # chunk overlap (characters)
    add_start_index=True,  # track index in original document
)
all_splits = text_splitter.split_documents(docs)

print(f"Split documents into {len(all_splits)} sub-documents.")

Split documents into 62 sub-documents.
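
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and lines); `sliding_window_chunks` is a hypothetical helper written only for illustration.

```python
from typing import List, Tuple


def sliding_window_chunks(
    text: str, chunk_size: int, chunk_overlap: int
) -> List[Tuple[int, str]]:
    """Return (start_index, chunk) pairs; neighbouring chunks share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for start_index, chunk in chunks:
    print(start_index, chunk)
```

The start index recorded with each chunk plays the same role as the `start_index` metadata enabled by `add_start_index=True` above.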


In [48]:
for i, split in enumerate(all_splits):
    print(f"\n--- Split{i} ---\n")
    print(split.page_content)


--- Split0 ---

About Us

Jio Institute is a multidisciplinary higher education institute set up as a philanthropic initiative by the Reliance Group. The Institute is dedicated to the pursuit of excellence by bringing together global scholars and thought leaders and providing an enriching student experience through world-class education, and a culture of research and innovation.

Our Story
Pursuit of excellence in academics, research and innovation.
We stand at the confluence of the best higher education practices from India and the world. The institute aims to nurture students’ aspirations, and provide a platform to their entrepreneurial spirit.

Read more

Our Vision
In sync with global aspirations. In step with changing times.
We envisage to be a world-class higher education institute through our multi-disciplinary academic programmes, robust research endeavours and a culture of innovation and entrepreneurship.

Read more

--- Split1 ---

Read more

Growth Plan
Well thought out gro

### Storing in Vector Store

In [49]:
#Add first 10 chunks to vector DB
uuids = [str(uuid4()) for _ in range(len(all_splits))] #Universally unique identifier
vector_store_chroma.add_documents(documents=all_splits[:10], ids=uuids[:10])

['3977720a-f0b4-40c0-9f57-64681369fd60',
 'f95e3681-5c6a-430a-8e80-35f57563b652',
 '7eacb750-6428-40a4-ac1a-311cadb763cf',
 '3dc437a1-f883-4e98-a60a-effa20eb0540',
 'c7abb518-74dd-41d2-9994-b02e8b1e5acb',
 '33678f54-8e55-4d7f-9ebe-cbc8396a4845',
 '62578d0d-26ae-4e30-91ea-aefb5ef5b068',
 '4af879cc-1eea-4fae-bed9-d2f9a9562513',
 'd0dccbba-c8e2-4d2b-b3c9-bd6a8c475f16',
 '43b1f331-1241-4cd2-97cb-6ce56694edc3']

## Advanced Retrieval and Reranking Strategies

### Multi Query Retrieval

Retrieval may produce different results with subtle changes in query wording, or if the embeddings do not capture the semantics of the data well. Prompt engineering / tuning is sometimes done to manually address these problems, but can be tedious.

The [`MultiQueryRetriever`](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.multi_query.MultiQueryRetriever.html) automates the process of prompt tuning by using an LLM to generate multiple queries from different perspectives for a given user input query. For each query, it retrieves a set of relevant documents and takes the unique union across all queries to get a larger set of potentially relevant documents.
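
The "unique union" step can be sketched without any LLM or vector store. The example below is only an illustration under stated assumptions: `CORPUS`, `toy_retrieve`, and `unique_union` are hypothetical names, the word-overlap ranking stands in for a real embedding search, and the query variants are hard-coded rather than generated by an LLM.

```python
from typing import Callable, Dict, List

# Toy corpus standing in for the vector store (hypothetical data).
CORPUS: Dict[str, str] = {
    "d1": "PGP in Artificial Intelligence and Data Science curriculum",
    "d2": "Jio Institute faculty and leadership",
    "d3": "Admissions process and eligibility",
}


def toy_retrieve(query: str, k: int = 2) -> List[str]:
    """Rank documents by the number of lowercase words shared with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda doc_id: len(q_words & set(CORPUS[doc_id].lower().split())),
        reverse=True,
    )
    return ranked[:k]


def unique_union(queries: List[str], retrieve: Callable[[str], List[str]]) -> List[str]:
    """Retrieve for every query and keep each document once, in first-seen order."""
    seen = set()
    merged = []
    for query in queries:
        for doc_id in retrieve(query):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Hard-coded query variants; in the real retriever these come from the LLM.
queries = [
    "What is the curriculum of the data science programme?",
    "Who are the faculty at Jio Institute?",
    "What is the admissions process?",
]
merged_ids = unique_union(queries, toy_retrieve)
print(merged_ids)
```

Each query on its own returns only `k` documents, but the merged list covers more of the corpus without duplicates, which is the benefit described above.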

In [51]:
#Import libraries
from langchain.retrievers.multi_query import MultiQueryRetriever
import logging # Set logging for the queries

In [52]:
#Initialize retriever
retriever = vector_store_chroma.as_retriever(search_type="similarity",
                                                search_kwargs={"k": 2})

In [53]:
mq_retriever = MultiQueryRetriever.from_llm(
    retriever=retriever, 
    llm=llm,
    include_original=True
)

In [54]:
logging.basicConfig()
# so we can see what queries are generated by the LLM
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [55]:
docs = mq_retriever.invoke(question)
docs

INFO:langchain.retrievers.multi_query:Generated queries: ['Here are three alternative perspectives on the original question:', 'What is the program curriculum and course structure of the Post Graduate Program in Artificial Intelligence & Data Science at Jio Institute?', 'Can you provide information about the faculty expertise and research areas of the Post Graduate Program in Artificial Intelligence & Data Science at Jio Institute?', 'What are the career prospects and alumni outcomes of students who have completed the Post Graduate Program in Artificial Intelligence & Data Science at Jio Institute?']


[Document(id='62578d0d-26ae-4e30-91ea-aefb5ef5b068', metadata={'start_index': 1103, 'source': 'https://www.jioinstitute.edu.in/academics/artificial-intelligence-data-science'}, page_content='as well as the know-how to create practical solutions for enterprises and society. Students learn to convert business problems and workflows into AI&DS products and solutions across multiple verticals/industries. Enriched by exposure to real-life AI&DS applications through capstone projects and lectures from industry veterans, students are exposed to hands-on exercises, practical projects and quizzes to reinforce their learning.'),
 Document(id='4af879cc-1eea-4fae-bed9-d2f9a9562513', metadata={'source': 'https://www.jioinstitute.edu.in/academics/artificial-intelligence-data-science', 'start_index': 1537}, page_content='PGP in Artificial Intelligence & Data Science\n\nLeadership\nFaculty\nAdvisors\nCurriculum\nTools\nHighlights\nAdmissions\nFAQ\nBrochure\n\nPGP in Artificial Intelligence & Data Scie
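
In the logged run above, the first generated "query" is the model's introductory sentence rather than a question, so it is passed to the retriever like any other query. One possible mitigation is to filter such lines out before retrieval. The sketch below shows only the filtering heuristic; `drop_preamble_lines` is a hypothetical helper, the "ends with a colon" rule is an assumption that fits this particular output, and wiring it into the retriever is not shown.

```python
from typing import List


def drop_preamble_lines(lines: List[str]) -> List[str]:
    """Keep lines that look like queries: non-empty and not ending with a colon."""
    cleaned = []
    for line in lines:
        text = line.strip()
        if text and not text.endswith(":"):
            cleaned.append(text)
    return cleaned


# Lines copied from the logged output above, plus an empty line for illustration.
generated = [
    "Here are three alternative perspectives on the original question:",
    "What is the program curriculum and course structure of the Post Graduate Program in Artificial Intelligence & Data Science at Jio Institute?",
    "",
    "What are the career prospects and alumni outcomes of students who have completed the Post Graduate Program in Artificial Intelligence & Data Science at Jio Institute?",
]
clean_queries = drop_preamble_lines(generated)
print(len(clean_queries))
```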

### Chained Retrieval with Reranker

This strategy uses a chain of multiple retrievers sequentially to get to the most relevant documents. The following is the flow:

*Similarity Retrieval → Reranker Model Retrieval*

**What are rerankers?**

- Rerankers are typically cross-encoder transformer models fine-tuned for relevance scoring
- They take a (query, document) pair as input and return a relevance score for that pair
- As a rule of thumb, recently released models fine-tuned on more (query, document) pairs tend to rank better
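
The two-stage flow can be sketched in plain Python: a first stage supplies candidates, and a pairwise scorer reorders them and keeps the best few. This is a minimal sketch, not the library's implementation; `rerank` and `toy_score` are hypothetical names, and the word-overlap score is only a stand-in for a real cross-encoder, which would score each (query, document) pair with a transformer.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 2,
) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and keep the top_n highest-scoring documents."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_score(query: str, document: str) -> float:
    """Stand-in for a cross-encoder: Jaccard overlap of lowercase word sets."""
    q_words = set(query.lower().split())
    d_words = set(document.lower().split())
    union = q_words | d_words
    return len(q_words & d_words) / len(union) if union else 0.0


# Candidates as a first-stage similarity retriever might return them (hypothetical data).
candidates = [
    "campus facilities and hostels",
    "ai and data science programme overview",
    "data science programme fees",
]
top_docs = rerank("ai data science programme", candidates, toy_score, top_n=2)
for doc, score in top_docs:
    print(f"{score:.3f}  {doc}")
```

The scorer sees the query and the document together, which is what distinguishes a cross-encoder from the embedding comparison used in the first stage.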

In [56]:
#Import libraries
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers import ContextualCompressionRetriever

In [58]:
# Retriever 1 - simple vector-similarity retriever (returns the top 3 chunks)
retriever = vector_store_chroma.as_retriever(search_type="similarity",
                                              search_kwargs={"k": 3})

In [61]:
# Download an open-source reranker model - cross-encoder/qnli-electra-base
reranker = HuggingFaceCrossEncoder(model_name="cross-encoder/qnli-electra-base")
reranker_compressor = CrossEncoderReranker(model=reranker, top_n=2)

In [63]:
# Retriever 2 - Uses a Reranker model to rerank retrieval results from the previous retriever
final_retriever = ContextualCompressionRetriever(
    base_compressor=reranker_compressor,
    base_retriever=retriever
)

In [64]:
docs = final_retriever.invoke(question)
docs

[Document(id='62578d0d-26ae-4e30-91ea-aefb5ef5b068', metadata={'start_index': 1103, 'source': 'https://www.jioinstitute.edu.in/academics/artificial-intelligence-data-science'}, page_content='as well as the know-how to create practical solutions for enterprises and society. Students learn to convert business problems and workflows into AI&DS products and solutions across multiple verticals/industries. Enriched by exposure to real-life AI&DS applications through capstone projects and lectures from industry veterans, students are exposed to hands-on exercises, practical projects and quizzes to reinforce their learning.'),
 Document(id='3977720a-f0b4-40c0-9f57-64681369fd60', metadata={'source': 'https://www.jioinstitute.edu.in/about/', 'start_index': 0}, page_content='About Us\n\nJio Institute is a multidisciplinary higher education institute set up as a philanthropic initiative by the Reliance Group. The Institute is dedicated to the pursuit of excellence by bringing together global schol

## Context Compression Strategies

Here, we'll explore **LLM prompt-based context compression** strategies. The context compression can happen in the form of:

- **Extractor**: Keep only the parts of each retrieved document that are relevant to the query and discard the rest of its content

- **Filter**: Drop documents that are not relevant to the query, leaving the content of the remaining documents untouched

See also [Microsoft LLMLingua Prompt Compression](https://www.microsoft.com/en-us/research/blog/llmlingua-innovating-llm-efficiency-with-prompt-compression/).
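
The difference between the two modes can be illustrated with a toy relevance check in place of the LLM. This is a sketch under stated assumptions: `is_relevant`, `filter_documents`, and `extract_from_documents` are hypothetical helpers, and keyword matching stands in for the LLM's judgement.

```python
from typing import List


def is_relevant(query: str, text: str) -> bool:
    """Stand-in for the LLM judgement: true when the text contains a query keyword."""
    keywords = {word for word in query.lower().split() if len(word) > 3}
    return any(keyword in text.lower() for keyword in keywords)


def filter_documents(query: str, docs: List[str]) -> List[str]:
    """Filter mode: drop irrelevant documents, keep the others unchanged."""
    return [doc for doc in docs if is_relevant(query, doc)]


def extract_from_documents(query: str, docs: List[str]) -> List[str]:
    """Extractor mode: keep only the relevant sentences of each document."""
    compressed = []
    for doc in docs:
        kept = [s.strip() for s in doc.split(".") if s.strip() and is_relevant(query, s)]
        if kept:  # documents with no relevant sentence are dropped entirely
            compressed.append(". ".join(kept) + ".")
    return compressed


sample_docs = [
    "The programme covers machine learning. The campus has a cafeteria.",
    "The library opens at nine.",
]
sample_query = "machine learning programme"
filtered = filter_documents(sample_query, sample_docs)
extracted = extract_from_documents(sample_query, sample_docs)
print(filtered)
print(extracted)
```

Filtering returns the first document in full, while extraction returns only its relevant sentence; both drop the second document.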

### LLMChainExtractor

Here we look at `LLMChainExtractor`, which iterates over the initially returned documents and extracts from each only the content that is relevant to the query. Documents with no relevant content may be dropped entirely.

In [66]:
#Import libraries
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [67]:
#Initialize retriever
retriever = vector_store_chroma.as_retriever(search_type="similarity",
                                              search_kwargs={"k": 3})

In [68]:
# Extracts from each document only the content that is relevant to the query
compressor = LLMChainExtractor.from_llm(llm=llm)

In [69]:
# Retrieves the documents similar to query and then applies the compressor
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

In [70]:
docs = compression_retriever.invoke(question)
docs

[Document(metadata={'source': 'https://www.jioinstitute.edu.in/academics/artificial-intelligence-data-science', 'start_index': 1103}, page_content='as well as the know-how to create practical solutions for enterprises and society. Students learn to convert business problems and workflows into AI&DS products and solutions across multiple verticals/industries.'),
 Document(metadata={'start_index': 2588, 'source': 'https://www.jioinstitute.edu.in/about/'}, page_content='From reputed academicians and researchers around the world to Indian stalwarts across the spectrum – Jio Institute’s distinguished leadership is committed to guiding the Institute from strength to strength.\n\nJio Institute Leadership\nChancellor\nDr. Raghunath Mashelkar\n\nPadma Vibhushan | Former Director General, CSIR, Government of India\n\nSee Profile\n\nVICE-CHANCELLOR\nDr. Dipak Jain\n\nFormer Dean, Kellogg School of Management, USA | Former Dean, INSEAD, France\n\nSee Profile'),
 Document(metadata={'source': 'https

### LLMChainFilter

The `LLMChainFilter` is a slightly simpler but more robust compressor that uses an LLM chain to decide which of the initially retrieved documents to filter out and which to return, without modifying the document contents.

In [71]:
#Import library
from langchain.retrievers.document_compressors import LLMChainFilter

In [72]:
#Initialize retriever
retriever = vector_store_chroma.as_retriever(search_type="similarity",
                                              search_kwargs={"k": 3})

In [73]:
# Decides which of the initially retrieved documents to filter out and which ones to return
_filter = LLMChainFilter.from_llm(llm=llm)

In [74]:
# Retrieves the documents similar to query and then applies the filter
compression_retriever = ContextualCompressionRetriever(
    base_compressor=_filter, base_retriever=retriever
)

In [75]:
docs = compression_retriever.invoke(question)
docs

[Document(id='62578d0d-26ae-4e30-91ea-aefb5ef5b068', metadata={'start_index': 1103, 'source': 'https://www.jioinstitute.edu.in/academics/artificial-intelligence-data-science'}, page_content='as well as the know-how to create practical solutions for enterprises and society. Students learn to convert business problems and workflows into AI&DS products and solutions across multiple verticals/industries. Enriched by exposure to real-life AI&DS applications through capstone projects and lectures from industry veterans, students are exposed to hands-on exercises, practical projects and quizzes to reinforce their learning.')]