# **Problem Statement:**

Organizations stores the manuals, policy documents, legal contracts or any other documents in unstructured format like pdfs. In this case study use any Life Insurance Policy pdf document like "**Principal-Sample-Life-Insurance-Policy.pdf**", and retrieve instant and accurate answers from the document.

Traditional keyword-based search leads to inefficient contextual information, missed insights and further leads to frustrated users.

The case study should concentrate on the below points.

1.   Understand the context of the question,
2.   Match semantically similar but differently worded content,
3.   And synthesize answers across multiple sections.


# **Implementation:**
To address the challenge of answering questions based on unstructured document data (such as insurance policies), a Retrieval-Augmented Generation (RAG) solution has been implemented. This solution leverages LlamaIndex for document parsing, indexing, and retrieval, and LangChain for managing LLM interactions, prompting, and response generation. These retrieves the most relevant context from documents and uses it to generate accurate, grounded responses via an OpenAI LLM.



In [37]:
# Installing packages
!pip install -U -q llama-index openai llama-index-core llama-index-embeddings-openai
!pip install llama-index-llms-openai



In [38]:
#importing libraries
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage
import os
import openai
from dotenv import load_dotenv
# NOTE: This is ONLY necessary in jupyter notebook.
# Details: Jupyter runs an event-loop behind the scenes.
#          This results in nested event-loops when we start an event-loop to make async queries.
#          This is normally not allowed, we use nest_asyncio to allow it for convenience.
import nest_asyncio
nest_asyncio.apply()

In [39]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
# If you have saved the API key in Google Drive
filepath = "/content/drive/MyDrive/helpmate_ai/"
with open(filepath + "OpenAI_API_Key.txt", "r") as f:
  openai.api_key = ' '.join(f.readlines())

# RAG with Llamaindex

In [5]:
# Importing libraries
# import pymupdf
# Importing libraries for pdf loader and reader
from pathlib import Path
from llama_index.core import download_loader
from llama_index.readers.file import PyMuPDFReader
from collections import Counter

In [6]:
#Installing packages
!pip install pymupdf



In [10]:
# Path to the directory containing your PDF files
pdf_dir = Path('/content/drive/MyDrive/helpmate_ai/Policy_Documents')

In [55]:


# Get all PDF file paths in the directory
pdf_files = list(pdf_dir.glob('*.pdf'))
header=[]
footer=[]
# Load all documents
loader = PyMuPDFReader()
documents = []
header_footer_candidates={}
#Reading all documents in the pdf_dir into documents
for pdf_path in pdf_files:
    #docs = pymupdf.open(pdf_path)
    docs = loader.load(file_path=pdf_path)
    for page in docs:
      page.text.strip()
      for i in page.text.splitlines():
        if not i.isspace():
          header.append(i)
          break
    probable_header = Counter(header).most_common()[0][0]
    print (probable_header)
    for page in docs:
      if "page left blank intentionally" in page.text:
        docs.remove(page)
#      if probable_header in page.text:
#      page.text.strip(probable_header)
#
    documents.extend(docs)

print(f"Loaded {len(documents)} documents from {len(pdf_files)} PDF files.")


This policy has been updated effective  January 1, 2014 
Loaded 61 documents from 1 PDF files.


In [68]:
#Complete documents that are read
documents

[Document(id_='470a2982-011f-4e01-b2e7-a5b6fd1e79b5', embedding=None, metadata={'total_pages': 64, 'file_path': '/content/drive/MyDrive/helpmate_ai/Policy_Documents/Principal-Sample-Life-Insurance-Policy.pdf', 'source': '1'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text=' \n \n \n \n \nGROUP POLICY FOR: \nRHODE ISLAND JOHN DOE \nALL MEMBERS \nGroup Member Life Insurance \nPrint Date: 07/16/2014 \n \nDOROTHEA GLAUSE \nS655\nRHODE ISLAND JOHN DOE \n01/01/2014\n711 HIGH STREET \nGEORGE RI 02903 \n \n \n \n                                       \n', path=None, url=None, mimetype=None), image_resource=None, audio_resource=None, video_resource=None, text_template='{metadata_str}\n\n{content}'),
 Document(id_='529da190-861a-4ce9-bca0-ac185d2c955d', embedding=None, metadata={'total_pages': 64, 'file_path': '/content/drive/MyDrive/helpmat

In [57]:
# Importing libraries for Vector store, Node parser
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core import VectorStoreIndex
from IPython.display import display, HTML

In [58]:
# create parser and parse document into nodes
parser = SimpleNodeParser.from_defaults()
nodes = parser.get_nodes_from_documents(documents)
nodes

[TextNode(id_='bff500e9-bb01-4eda-8a69-c62e4b36a453', embedding=None, metadata={'total_pages': 64, 'file_path': '/content/drive/MyDrive/helpmate_ai/Policy_Documents/Principal-Sample-Life-Insurance-Policy.pdf', 'source': '1'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='470a2982-011f-4e01-b2e7-a5b6fd1e79b5', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'total_pages': 64, 'file_path': '/content/drive/MyDrive/helpmate_ai/Policy_Documents/Principal-Sample-Life-Insurance-Policy.pdf', 'source': '1'}, hash='e1e5643eb3908228ac9d39acb793c4cc99cebf9dab4f27dd6c46e9f8f58f33f6')}, metadata_template='{key}: {value}', metadata_separator='\n', text='GROUP POLICY FOR: \nRHODE ISLAND JOHN DOE \nALL MEMBERS \nGroup Member Life Insurance \nPrint Date: 07/16/2014 \n \nDOROTHEA GLAUSE \nS655\nRHODE ISLAND JOHN DOE \n01/01/2014\n711 HIGH STREET \nGEORGE RI 02903', mimetype='text/plain', start_char_idx=10, end_char_

In [59]:
## Sample test
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=100)
Settings.num_output = 512
Settings.context_window = 3900
# # build index
index = VectorStoreIndex(nodes)

# Construct Query Engine
query_engine = index.as_query_engine(return_source=True)

# Query the engine.
response = query_engine.query("What is the Policy document about?")

# print the synthesized response.
print (response.response)
# Display source documents with page info
print("\nSources:")
for node in response.source_nodes:
    metadata = node.node.metadata
    text_snippet = node.node.text[:100].replace("\n", " ")  # Short preview
    page = metadata.get("source") or "N/A"
    print(f"- Page {page}: {text_snippet}...")



The Policy document is about a Group Policy for Life Insurance, detailing definitions, policy administration, and specifically addressing policy renewal procedures and conditions.

Sources:
- Page 15: This policy has been updated effective  January 1, 2014  GC 6002   PART I - DEFINITIONS, PAGE 7     ...
- Page 25: This policy has been updated effective  January 1, 2014                PART II - POLICY ADMINISTRATI...


In [60]:
# Import necessary libraries and modules
from llama_index.core.service_context import ServiceContext
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.node_parser import TokenTextSplitter

In [61]:
# Initialize an LLMPredictor object with a specific OpenAI model and settings
llm = OpenAI(model='gpt-3.5-turbo', temperature=0, max_tokens=256)

# Initialize an OpenAIEmbedding model
OpenAIEmbedding(model="text-embedding-ada-002")

# Initialize a SimpleNodeParser with custom chunking settings
node_parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=100)

In [62]:
#Importing Settings and Sentencesplitter
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
Settings.text_splitter = SentenceSplitter(chunk_size = 512,chunk_overlap = 100)
Settings.num_output = 512
Settings.context_window = 3900

# Create a VectorStoreIndex from a list of documents
index = VectorStoreIndex.from_documents(documents)

In [63]:
# Initialize a query engine for the index with a specified similarity top-k value
query_engine = index.as_query_engine(similarity_top_k=3,return_source=True)

# Query the engine with a specific question. TESTING the sample output
response = query_engine.query("What is the document about?")

# Display the synthesized response with HTML formatting
display(HTML(f'<p style="font-size:20px">{response.response}</p>'))

for node in response.source_nodes:
    metadata = node.node.metadata
    text_snippet = node.node.text[:200].replace("\n", " ")  # Short preview
    page = metadata.get("source") or "N/A"
    print(f"- Page {page}: {text_snippet}...")



- Page 15: This policy has been updated effective  January 1, 2014  GC 6002   PART I - DEFINITIONS, PAGE 7      A record which is on or transmitted by paper or electronic media, and which is consistent with  app...
- Page 6: This policy has been updated effective January 1, 2014  GC 6001  TABLE OF CONTENTS, PAGE 1    TABLE OF CONTENTS      PART I - DEFINITIONS    PART II - POLICY ADMINISTRATION      Section A – Contract  ...
- Page 19: This policy has been updated effective  January 1, 2014  PART II - POLICY ADMINISTRATION  GC 6003   Section A - Contract, Page 4    The Principal has complete discretion to construe or interpret the p...


In [64]:
# Importing libraries
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers import get_response_synthesizer

In [65]:
# configure retriever
retriever = VectorIndexRetriever(
 index = index,
 similarity_top_k=3
)
# configure response synthesizer
synth = get_response_synthesizer(
    response_mode="refine"
)
# construct query engine
query_engine = RetrieverQueryEngine(
 retriever=retriever,
 response_synthesizer=synth,
)
def Query(query):
  print ("Query : ", query)
  response = query_engine.query("What risks or events are covered under this policy?")
  print("Response : ",response.response)
  references(response)

def references(response):
  for node in response.source_nodes:
    metadata = node.node.metadata
    text_snippet = node.node.text[:200].replace("\n", " ")  # Short preview
    page = metadata.get("source") or "N/A"
    print(f"- Page {page}: {text_snippet}...")

In [66]:
# Query 1
Query("What risks or events are covered under this policy?")

Query :  What risks or events are covered under this policy?
Response :  The policy covers accidental death and dismemberment resulting from events such as willful self-injury, disease or medical treatment complications, participation in criminal activities, certain aeronautic activities, military duty, war, alcohol use exceeding legal limits, drug use without prescription, and injuries not related to employment for wage or profit.
- Page 58: This policy has been updated effective  January 1, 2014        PART IV - BENEFITS  GC 6015   Section B - Member Accidental Death and  Dismemberment Insurance, Page 6      a.  willful self-injury or se...
- Page 27: This policy has been updated effective  January 1, 2014  PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS  GC 6006  Section A - Eligibility, Page 2    If a Member's Dependent is employed and is covered u...
- Page 13: This policy has been updated effective  January 1, 2014  GC 6002   PART I - DEFINITIONS, PAGE 5      a.  A licensed Doctor 

In [67]:
# Query 2
Query("what is the life insurance coverage for disability")

Query :  what is the life insurance coverage for disability
Response :  The policy covers accidental death and dismemberment resulting from events such as willful self-injury, disease or medical treatment complications, participation in criminal activities, certain aeronautic activities, military duty, war, alcohol use exceeding legal limits, drug use without prescription, and injuries not related to employment for wage or profit.
- Page 58: This policy has been updated effective  January 1, 2014        PART IV - BENEFITS  GC 6015   Section B - Member Accidental Death and  Dismemberment Insurance, Page 6      a.  willful self-injury or se...
- Page 27: This policy has been updated effective  January 1, 2014  PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS  GC 6006  Section A - Eligibility, Page 2    If a Member's Dependent is employed and is covered u...
- Page 13: This policy has been updated effective  January 1, 2014  GC 6002   PART I - DEFINITIONS, PAGE 5      a.  A licensed Doctor o

In [114]:
# Query 3
Query("Summarize the key benefits from the insurance policy documents.")

Query :  Summarize the key benefits from the insurance policy documents.
Response :  The risks or events covered under this policy include accidental death and dismemberment that are not a result of willful self-injury, disease or medical treatment complications, participation in criminal activities, certain aeronautic activities, military duty, war, excessive alcohol consumption, drug use, or injuries sustained during employment for wage or profit.
- Page 58: This policy has been updated effective  January 1, 2014        PART IV - BENEFITS  GC 6015   Section B - Member Accidental Death and  Dismemberment Insurance, Page 6      a.  willful self-injury or se...
- Page 27: This policy has been updated effective  January 1, 2014  PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS  GC 6006  Section A - Eligibility, Page 2    If a Member's Dependent is employed and is covered u...
- Page 13: This policy has been updated effective  January 1, 2014  GC 6002   PART I - DEFINITIONS, PAGE 5      a.  

In [115]:
# Query 4
Query("What riders or add-ons are available?")

Query :  What riders or add-ons are available?
Response :  The risks or events covered under this policy include accidental death and dismemberment that are not a result of willful self-injury, disease or medical treatments, participation in criminal activities, certain aeronautic activities, military duty, war, excessive alcohol consumption, drug use, and injuries sustained during employment for wage or profit.
- Page 58: This policy has been updated effective  January 1, 2014        PART IV - BENEFITS  GC 6015   Section B - Member Accidental Death and  Dismemberment Insurance, Page 6      a.  willful self-injury or se...
- Page 27: This policy has been updated effective  January 1, 2014  PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS  GC 6006  Section A - Eligibility, Page 2    If a Member's Dependent is employed and is covered u...
- Page 13: This policy has been updated effective  January 1, 2014  GC 6002   PART I - DEFINITIONS, PAGE 5      a.  A licensed Doctor of Medicine (M.D.) o

Insights:

Used Llamaindex with LLM model=gpt-3.5-turbo", temperature=0.1, embeddings="text-embedding-ada-002", SentenceSplitter with chunk_size = 512,chunk_overlap = 100, num_output = 512, context_window = 3900, similarity_top_k=3, synthesizer response_mode="refine"




#**RAG with Langchain**

In [40]:
# Installing packages
!pip install -U -q langchain pdfplumber langchain-community sentence-transformers google-colab pypdf opentelemetry-sdk

In [41]:
# Importing libraries
from langchain.vectorstores import Chroma,FAISS
from langchain.document_loaders import PyPDFDirectoryLoader # Using standard PDF loader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from sentence_transformers.cross_encoder import CrossEncoder # For reranker
from langchain_community.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI


from langchain.embeddings.openai import OpenAIEmbeddings
#from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import PyPDFLoader
from langchain.chains import (
    create_history_aware_retriever,
    create_retrieval_chain,
)
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

In [42]:
# Load Documents from pdf_dir
print(f"\n--- Loading Documents from {pdf_dir} using PyPDFDirectoryLoader ---")


pdf_directory = Path(pdf_dir)
docs = []

if not pdf_directory.is_dir():
    print(f"Error: Directory not found at {pdf_dir}")
else:
    try:
        # PyPDFDirectoryLoader handles iterating through the directory
        loader = PyPDFDirectoryLoader(
            path=pdf_dir,
            recursive=False # True if PDFs in subfolders
            )
        # Load documents - this loads all pages from all PDFs found
        docs = loader.load()

        if docs:
            print(f"Loaded {len(docs)} documents (pages) from text-based PDFs.")
            # Verify sources
            sources = set(doc.metadata.get('source', 'Unknown') for doc in docs)
            print("Sources loaded:", sources)
            print("Sample document metadata (first page):", docs[0].metadata)
            # Note: PyPDFDirectoryLoader usually correctly populates 'source' and 'page' metadata.
        else:
            print("No text-based PDF documents were loaded.")
            print("Check if the directory contains text-based PDFs or if they are corrupted.")

    except Exception as e:
        print(f"An error occurred loading PDFs with PyPDFDirectoryLoader: {e}")
        print("Ensure only text-based PDFs are in the directory or that 'pypdf' library is working.")
        docs = [] # Ensure docs is empty on error

# Now the 'docs' variable holds pages loaded from text-based PDFs
if not docs:
    print("\nWarning: No documents were loaded. Subsequent steps might fail.")


--- Loading Documents from /content/drive/MyDrive/helpmate_ai/Policy_Documents using PyPDFDirectoryLoader ---
Loaded 64 documents (pages) from text-based PDFs.
Sources loaded: {'/content/drive/MyDrive/helpmate_ai/Policy_Documents/Principal-Sample-Life-Insurance-Policy.pdf'}
Sample document metadata (first page): {'producer': 'ADEP Document Services - PDF Generator', 'creator': 'Acrobat PDFMaker 10.1 for Word', 'creationdate': '2014-07-16T12:50:27-05:00', 'author': 'Apache POI', 'company': 'Principal Financial Group', 'contenttypeid': '0x010100963E1E66C16CFF4188CB6D2716FCB7F5', 'itemretentionformula': '', 'moddate': '2014-07-16T12:50:34-05:00', 'order': '467600.000000', 'paper copies': '1', 'sourcemodified': 'D:20140716175024', 'title': 'Life Policy', '_copysource': '2486961970434790079.docx', '_dlc_policyid': '', 'source': '/content/drive/MyDrive/helpmate_ai/Policy_Documents/Principal-Sample-Life-Insurance-Policy.pdf', 'total_pages': 64, 'page': 0, 'page_label': '1'}


In [43]:
# Split Documents into Chunks using RecursiveCharacterTextSplitter
splits = []
if docs:
    print("\n--- Splitting Documents into Chunks ---")
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2000, # Adjust chunk size based on embedding model limits & desired context
        chunk_overlap=300  # Overlap helps maintain context between chunks
    )
    splits = text_splitter.split_documents(docs)
    print(f"Split into {len(splits)} chunks.")
    if splits:
        print("Sample chunk metadata:", splits[0].metadata)
else:
    print("\nSkipping splitting: No documents loaded.")


--- Splitting Documents into Chunks ---
Split into 86 chunks.
Sample chunk metadata: {'producer': 'ADEP Document Services - PDF Generator', 'creator': 'Acrobat PDFMaker 10.1 for Word', 'creationdate': '2014-07-16T12:50:27-05:00', 'author': 'Apache POI', 'company': 'Principal Financial Group', 'contenttypeid': '0x010100963E1E66C16CFF4188CB6D2716FCB7F5', 'itemretentionformula': '', 'moddate': '2014-07-16T12:50:34-05:00', 'order': '467600.000000', 'paper copies': '1', 'sourcemodified': 'D:20140716175024', 'title': 'Life Policy', '_copysource': '2486961970434790079.docx', '_dlc_policyid': '', 'source': '/content/drive/MyDrive/helpmate_ai/Policy_Documents/Principal-Sample-Life-Insurance-Policy.pdf', 'total_pages': 64, 'page': 0, 'page_label': '1'}


In [44]:
#Installing chromadb
!pip install chromadb



In [45]:
#ChromaDB path
chroma_persist_path= "/content/drive/MyDrive/helpmate_ai/chromadata"

In [46]:
# 1. Install ChromaDB and LangChain dependencies
!pip install -U -q chromadb langchain openai tiktoken langchain-openai

In [47]:
from langchain.vectorstores import Chroma,FAISS
import chromadb

In [48]:
# OpenAI Embedding model with text-embedding-ada-002
embedding_model=OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=openai.api_key)

# Create or Load Vector Store
vectorstore = None
db_creation_needed = True # Flag to check if we need to create the DB

if embedding_model and splits:
    print(f"\n--- Setting up Chroma Vector Store ---")
    print(f"Using persistent path: {chroma_persist_path}")

    # Check if the database directory already exists and isn't empty
    if os.path.exists(chroma_persist_path) and os.listdir(chroma_persist_path):
        print("Existing ChromaDB directory found. Attempting to load...")
        try:
            """vectorstore = Chroma(documents=splits,
                persist_directory=chroma_persist_path,
                embedding_function=embedding_model
            )
            """
            vectorstore = Chroma.from_documents(
                documents=splits,
                embedding=embedding_model,
                persist_directory=chroma_persist_path
              )
            # Quick check to see if it loaded something plausible
            test_search = vectorstore.similarity_search("insurance", k=1)
            if test_search:
                 print("Successfully loaded existing vector store.")
                 db_creation_needed = False
            else:
                 print("Loaded directory, but store seems empty or invalid. Will recreate.")
                 # Consider clearing the directory here if needed: shutil.rmtree(chroma_persist_path)
        except Exception as e:
            print(f"Error loading existing vector store: {e}. Will attempt to recreate.")
            # Consider clearing the directory here if needed: shutil.rmtree(chroma_persist_path)


    if db_creation_needed:
        print("Creating new Chroma vector store...")
        print(f"Embedding {len(splits)} chunks. This may take a significant amount of time...")
        try:
            # Create Chroma vector store FROM the document splits and WITH the Gemini embedding function
            vectorstore = Chroma.from_documents(
                documents=splits,
                embedding=embedding_model,persist_directory=chroma_persist_path
              )
            print("New Chroma vector store created and documents embedded.")
        except Exception as e:
            print(f"Error creating new Chroma vector store: {e}")
            vectorstore = None
    else:
        print("Using previously loaded vector store.")


elif not embedding_model:
     print("\nSkipping vector store setup: Embedding model not initialized.")
elif not splits:
     print("\nSkipping vector store setup: No document splits available.")


--- Setting up Chroma Vector Store ---
Using persistent path: /content/drive/MyDrive/helpmate_ai/chromadata
Creating new Chroma vector store...
Embedding 86 chunks. This may take a significant amount of time...
New Chroma vector store created and documents embedded.


In [49]:
# Create Retriever
retriever = None
if vectorstore:
    print("\n--- Creating Retriever ---")

    base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})  # Fetch more for potential reranking
    print("Base retriever created (fetches top 20).")

    try:
        print("Attempting to set up CrossEncoder reranker using LangChain wrapper...")

        from langchain_community.cross_encoders import HuggingFaceCrossEncoder

        hf_cross_encoder_model = HuggingFaceCrossEncoder(model_name='cross-encoder/ms-marco-MiniLM-L-6-v2',model_kwargs = {'device': 'cpu'})

        reranker = CrossEncoderReranker(
            model=hf_cross_encoder_model,  # Use LangChain wrapper as model
            top_n=3                        # Return top 3 after reranking
        )
        # Use the reranker as a context compression retriever
        reranker_retriever = ContextualCompressionRetriever(
            base_compressor=reranker, base_retriever=base_retriever
        )

        retriever = reranker_retriever  # Use the reranking retriever
        print("Contextual Compression Retriever with reranker created (returns top 3).")

    except ImportError:
        # Handle case where LangChain or sentence-transformers might not be fully installed
        print("ImportError occurred. Ensure 'sentence-transformers' and 'langchain' are installed. Using base retriever instead.")
        retriever = base_retriever

    except Exception as e:
        # Handle other exceptions
        print(f"Could not set up reranker: {e}. Using base retriever instead.")
        retriever = base_retriever  # Fallback to base retriever

else:
    print("\nSkipping retriever creation: Vector store not available.")



--- Creating Retriever ---
Base retriever created (fetches top 20).
Attempting to set up CrossEncoder reranker using LangChain wrapper...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/794 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.33k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/3.66k [00:00<?, ?B/s]

Contextual Compression Retriever with reranker created (returns top 3).


In [50]:
# Installing langchain community packages
!pip install llama-index-llms-langchain
!pip install -U langchain-community

Collecting llama-index-llms-langchain
  Downloading llama_index_llms_langchain-0.6.1-py3-none-any.whl.metadata (1.4 kB)
Downloading llama_index_llms_langchain-0.6.1-py3-none-any.whl (6.1 kB)
Installing collected packages: llama-index-llms-langchain
Successfully installed llama-index-llms-langchain-0.6.1


In [51]:
# Create Retriever (Direct Implementation of Re-ranking)
retriever = None # Initialize retriever variable

if vectorstore: # Only proceed if the vector store was successfully created/loaded
    print("\n--- Creating Retriever with Re-ranking ---")
    try:
        # 1. Import the necessary components
        from langchain_community.cross_encoders import HuggingFaceCrossEncoder
        from langchain.retrievers.document_compressors import CrossEncoderReranker
        from langchain.retrievers import ContextualCompressionRetriever

        # 2. Define the base retriever (fetches initial candidates)
        base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20}) # Fetch 20 candidates
        print("Base retriever created (fetches top 20).")

        # 3. Initialize the LangChain wrapper for the cross-encoder model
        print("Initializing CrossEncoder model...")
        hf_cross_encoder_model = HuggingFaceCrossEncoder(model_name='cross-encoder/ms-marco-MiniLM-L-6-v2')

        # 4. Initialize the reranker component using the wrapper
        reranker = CrossEncoderReranker(
            model=hf_cross_encoder_model,
            top_n=3 # Return the top 3 most relevant documents after reranking
        )
        print("CrossEncoderReranker component created (will return top 3).")

        # 5. Create the Contextual Compression Retriever
        # This wraps the base_retriever and uses the reranker to compress/filter results
        compression_retriever = ContextualCompressionRetriever(
            base_compressor=reranker,
            base_retriever=base_retriever
        )
        retriever = compression_retriever # Assign the final reranking retriever
        print("Contextual Compression Retriever setup complete.")

    except ImportError as e:
         # Handle potential missing libraries
         print(f"ImportError: Could not import necessary components ({e}).")
         print("Ensure 'langchain_community' and 'sentence-transformers' are installed.")
         print("Retriever setup failed.")
    except Exception as e:
        # Catch other potential errors during setup
        print(f"An error occurred during retriever setup: {e}")
        print("Retriever setup failed.")

else:
    print("\nSkipping retriever creation: Vector store not available from previous step.")

# Final check
if retriever:
    print("\nRetriever is ready.")
else:
    print("\nRetriever was not successfully created.")


--- Creating Retriever with Re-ranking ---
Base retriever created (fetches top 20).
Initializing CrossEncoder model...
CrossEncoderReranker component created (will return top 3).
Contextual Compression Retriever setup complete.

Retriever is ready.


In [52]:
# Define RAG Chain system
rag_chain = None
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0, api_key=openai.api_key)

if retriever and llm:
    print("\n--- Defining RAG Chain ---")
    # Define prompt template
    template = """
You are an expert policy assistant. Given the following documents, summarize and answer user queries with factual precision.

Your job is to answer the question strictly based on the provided context.
Do not use any prior knowledge or external sources. If the answer is not in the context, reply:
"The answer is not found in the provided documents."

Instructions:
- Only answer using facts present in the context.
- Summarize the relevant information clearly and accurately.
- When possible, cite the document name and page number like this: [Source: document_name, Page: page_number].

Context:
{context}

Question:
{question}

Answer:
"""
    prompt = ChatPromptTemplate.from_template(template)

    # Function to format retrieved documents
    def format_docs(docs):
        return "\n\n".join(f"Source: {doc.metadata.get('source', 'Unknown')}, Page: {doc.metadata.get('page', 'N/A')}\nContent: {doc.page_content}" for doc in docs)

    # Define the chain using LangChain Expression Language (LCEL)
    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    print("RAG chain defined successfully.")
else:
    print("\nCannot define RAG chain: Retriever or LLM not available.")

  llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0, api_key=openai.api_key)



--- Defining RAG Chain ---
RAG chain defined successfully.


In [54]:
import textwrap

# Run Query as Interactive Chat
if rag_chain:
    print("\n--- Ready to Query ---\n")

    while True:
        try:
            user_query = input("Enter your question (or type 'quit' to exit): ").strip()

            if user_query.lower() == 'quit':
                print("\nExiting the query session. Goodbye!")
                break
            if not user_query:
                print("\nPlease enter a valid query.")
                continue

            print("\nGenerating response...\n")
            final_response = rag_chain.invoke(user_query)

            print("-"*50)
            print("                  Final Answer")
            print("-"*50)

            # Wrap text to fit the output width (80 characters in this case)
            wrapper = textwrap.TextWrapper(width=80, break_long_words=False)

            # Prepare summary
            summary = []
            paragraphs = final_response.split('[Source:')

            # Process paragraphs and add source references inline
            for i, paragraph in enumerate(paragraphs):
                paragraph = paragraph.strip()

                if '[Source:' in paragraph:
                    # Add the source reference inline with the paragraph
                    source_reference = '[Source:' + '[Source:'.join(final_response.split('[Source:')[i + 1:]).split(']')[0] + ']'
                    paragraph += " " + source_reference
                wrapped_paragraph = wrapper.fill(paragraph)
                summary.append(wrapped_paragraph)

            # Print summary (with sources inline)
            print("\n** Summary **\n")
            print("\n".join(summary))

        except Exception as e:
            print(f"\nAn error occurred: {e}")
        except KeyboardInterrupt:
            print("\nExiting the query session. Goodbye!")
            break
else:
    print("\nCannot run queries: RAG chain was not set up successfully.")





--- Ready to Query ---

Enter your question (or type 'quit' to exit): What is the maximum benefit or payout limit?

Generating response...

--------------------------------------------------
                  Final Answer
--------------------------------------------------

** Summary **

The maximum benefit or payout limit varies depending on the specific
circumstances of the termination or loss. For example, in the case of
termination as described in b. (4) above, the maximum amount will be the lesser
of $10,000 or the Dependent Life Insurance benefit in force for the Dependent on
the date of termination, less the amount for which the Dependent becomes
eligible under any group policy within 31 days. In other cases of termination,
the maximum amount will be the Dependent Life Insurance benefit in force for the
Dependent on the date of termination, less any individual policy amount
purchased earlier under the policy. The specific maximum benefit or payout limit
is determined by the ter

Insights:

Used langchain with LLM model=gpt-3.5-turbo", temperature=0, embeddings="text-embedding-ada-002", HuggingFaceCrossEncoder=ms-marco-MiniLM-L-6-v2
