##  Business Objective

The goal of this project is to build an **end-to-end AI-powered document search and question-answering system** using vector embeddings, semantic search, and large language models (LLMs).  
The system will process a **PDF policy document** and deliver accurate, contextually relevant answers to user queries.

---

### 🔹 1. Embedding Layer
- **Objective:** Preprocess, clean, and divide the PDF document into meaningful chunks for embedding.  
- The **chunking strategy** greatly influences retrieval quality, so experiment with multiple strategies and compare their performance.  
- Use either the **OpenAI embedding model** or models from the **SentenceTransformers** library (Hugging Face) for generating embeddings.

---

### 🔹 2. Search Layer
- **Objective:** Design and test your search pipeline with **at least three self-created queries** based on the document's content.  
- Embed each query and perform similarity search against your **ChromaDB vector database**.  
- Implement a **cache mechanism** for efficient query handling.  
- Add a **re-ranking block** using a **cross-encoder model** from Hugging Face to improve result relevance.
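
The re-ranking step can be sketched as a small function that scores every (query, document) pair and keeps the best ones. This is a minimal sketch, not the notebook's implementation: `rerank`, `load_cross_encoder_scorer`, and `word_overlap_scorer` are hypothetical names, and the model name `cross-encoder/ms-marco-MiniLM-L-6-v2` is only an assumed choice of Hugging Face cross-encoder. The toy scorer exists solely so the example runs without downloading a model.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer takes (query, document) pairs and returns one relevance score per pair.
ScoreFn = Callable[[List[Tuple[str, str]]], Sequence[float]]


def rerank(query: str, documents: List[str], score_pairs: ScoreFn, top_k: int = 3) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and return the top_k documents, best first."""
    if not documents:
        return []
    pairs = [(query, doc) for doc in documents]
    scores = [float(s) for s in score_pairs(pairs)]
    ranked = sorted(zip(documents, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def load_cross_encoder_scorer(model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2") -> ScoreFn:
    """Assumed scorer: wraps sentence-transformers' CrossEncoder.predict.

    Imported lazily so the rest of this sketch runs without the library installed.
    """
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(model_name)
    return lambda pairs: model.predict([list(p) for p in pairs])


def word_overlap_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    """Toy stand-in scorer (NOT a cross-encoder): counts shared lowercase words."""
    return [float(len(set(q.lower().split()) & set(d.lower().split()))) for q, d in pairs]


# Hypothetical candidate passages, used only to exercise the function.
candidates = [
    "Premium rates may change on the Policy Anniversary.",
    "Coverage may continue during an approved leave of absence.",
    "Dependent Life Insurance terminates at age 26.",
]
top = rerank("Can coverage continue during an approved leave?", candidates, word_overlap_scorer, top_k=2)
print(top[0][0])
```

In the real pipeline, the documents returned by the vector search would be passed as `documents`, and `load_cross_encoder_scorer()` would replace the toy scorer.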

---

### 🔹 3. Generation Layer
- **Objective:** Design an exhaustive and well-structured **prompt** for the LLM to generate accurate and complete responses.  
- Ensure all relevant context is passed correctly into the prompt.  
- Optionally, include **few-shot examples** to enhance generation quality and consistency.
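
One way to pass the retrieved context into the prompt is to number each passage and attach its source page, so the model can cite what it used. The sketch below is illustrative only: `build_messages`, `SYSTEM_INSTRUCTIONS`, and the `policy`/`page`/`text` keys are hypothetical names, and the passage text is a placeholder rather than real policy content.

```python
from typing import Dict, List

# Hypothetical system instructions; adjust wording to the task at hand.
SYSTEM_INSTRUCTIONS = (
    "You are an assistant answering questions about an insurance policy. "
    "Answer only from the numbered context passages. "
    "Cite the page of every passage you rely on. "
    "If the context does not contain the answer, say so."
)


def build_messages(query: str, chunks: List[Dict[str, str]], max_chunks: int = 3) -> List[Dict[str, str]]:
    """Assemble chat messages from the user query and the top retrieved chunks."""
    passages = [
        f"[{i}] ({chunk['policy']}, {chunk['page']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks[:max_chunks], start=1)
    ]
    user_content = "Context:\n" + "\n\n".join(passages) + f"\n\nQuestion: {query}"
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTIONS},
        {"role": "user", "content": user_content},
    ]


# Placeholder chunk: the text field stands in for a retrieved page.
retrieved = [
    {
        "policy": "Principal-Sample-Life-Insurance-Policy",
        "page": "Page 38",
        "text": "<text of the retrieved page>",
    }
]
messages = build_messages("Can insurance coverage continue during an approved leave?", retrieved)
print(messages[1]["content"])
```

Few-shot examples, if used, could be added as extra user/assistant message pairs between the system message and the final user message.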


## 🔍 Flowchart Overview: Document-Based LLM Search System

The flowchart represents a **three-layer architecture** that drives a document-based search system built on a **Large Language Model (LLM)**.  
The system is modular, emphasizing experimentation at each stage to enhance both relevance and response quality.

---

### 1. 🧠 Embedding Layer  
In this foundational stage, the **documents and user queries** are transformed into **vector embeddings** using pre-trained models such as **OpenAI** or **Hugging Face's SentenceTransformers**.  
These embeddings capture semantic relationships, enabling accurate similarity matching.  
Experimentation may include exploring various **embedding models**, **chunking techniques**, or **metadata enrichment** to optimize retrieval quality.

---

### 2. 🔎 Search Layer  
Once the embeddings are generated, the system performs a **similarity search** within a vector database (e.g., **FAISS**, **Chroma**) to find the most relevant document chunks.  
This ensures that only **contextually meaningful content** is passed to the LLM for response generation.  
You can experiment with **different distance metrics** (e.g., cosine similarity, dot product), **hybrid retrieval methods** (BM25 + vectors), and **metadata-based filters**.
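
To make the difference between the distance metrics concrete, the following sketch computes them by hand on tiny vectors. The helper names are hypothetical and the vectors are made up for illustration; a vector database would perform these computations internally.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: depends on direction only."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = [1.0, 0.0]
b = [2.0, 0.0]   # same direction as a, twice the length
u = [0.6, 0.8]   # unit-length vector

# Same direction: cosine similarity ignores the length difference, dot product does not.
print(cosine_similarity(a, b), dot(a, b))

# For unit-length vectors, squared L2 distance equals 2 - 2 * cosine similarity,
# so both metrics rank neighbours identically once embeddings are normalised.
print(squared_l2(u, a), 2 - 2 * cosine_similarity(u, a))
```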

---

### 3. 🧾 Generation Layer  
In this final stage, the **retrieved document chunks** and **user query** are combined into a structured **prompt** for the LLM.  
The model then generates a **coherent natural language answer**.  
This layer can be refined using **prompt engineering**, **custom instructions**, or **retrieval-augmented generation (RAG)** techniques.


In [4]:
# Import all the required Libraries
import pdfplumber
from pathlib import Path
import pandas as pd
from operator import itemgetter
import json
import tiktoken
import chromadb
import openai
from openai import OpenAI
import google.generativeai as genai

In [5]:
import os
os.environ["ANONYMIZED_TELEMETRY"] = "False"
import warnings
warnings.filterwarnings("ignore", message="Failed to send telemetry event")

In [6]:
# --------------------------
# Gemini client setup
# --------------------------
openai_client = OpenAI(
    api_key=open('GEMINI_API_KEY.txt', 'r').read().strip(),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

In [7]:
genai.configure(api_key=open("GEMINI_API_KEY.txt").read().strip())

In [8]:
import os
os.chdir(".")
!ls

GEMINI_API_KEY.txt
Mr_HelpMate_AI_old.ipynb
Mr_HelpMate_AI.ipynb
Principal-Sample-Life-Insurance-Policy.pdf


##  <font> Read, Process, and Chunk the PDF Files </font>

We'll use pdfplumber for PDF extraction and processing, which offers several advantages over simpler PDF libraries. pdfplumber provides robust capabilities for extracting structured content from PDFs, including:

- Text extraction with positional data
- Table detection and extraction
- Form field identification
- Image extraction capabilities
- Visual debugging tools for development

This library allows us to handle complex document structures by preserving the spatial relationships between text elements, which is crucial for maintaining document context during chunking. It also provides methods to extract text while preserving formatting elements like paragraphs, headers, and lists.

For optimal retrieval performance, we'll implement a strategic chunking approach that balances chunk size with semantic coherence, ensuring that related content stays together while creating chunks that are appropriately sized for our vector database.

In [9]:
pdf_path = "."

In [10]:
# Function to check whether a word is present in a table or not for segregation of regular text and tables

def check_bboxes(word, table_bbox):
    # Check whether word is inside a table bbox.
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]
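
A quick sanity check shows how this containment test behaves. The function is restated so the snippet runs on its own; the bounding-box coordinates are made-up values for illustration, not taken from the policy PDF.

```python
def check_bboxes(word, table_bbox):
    """Return True when the word's box lies strictly inside the table's box."""
    x0, top, x1, bottom = word["x0"], word["top"], word["x1"], word["bottom"]
    tx0, ttop, tx1, tbottom = table_bbox
    return x0 > tx0 and top > ttop and x1 < tx1 and bottom < tbottom


# Hypothetical table bounding box in (x0, top, x1, bottom) order.
table_bbox = (50, 100, 300, 200)

inside = {"x0": 60, "top": 110, "x1": 90, "bottom": 120}
below = {"x0": 60, "top": 250, "x1": 90, "bottom": 260}
on_edge = {"x0": 50, "top": 110, "x1": 90, "bottom": 120}

print(check_bboxes(inside, table_bbox))   # word within the table
print(check_bboxes(below, table_bbox))    # word below the table
print(check_bboxes(on_edge, table_bbox))  # word touching the left border
```

Because the comparisons are strict, a word whose box touches the table border counts as outside the table and is kept as regular text.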

In [11]:
# Function to extract text from a PDF file.
# 1. Declare a variable p to store the iteration of the loop that will help us store page numbers alongside the text
# 2. Declare an empty list 'full_text' to store the text of every page
# 3. Use pdfplumber to open the pdf pages one by one
# 4. Find the tables and their locations in the page
# 5. Extract the text from the tables in the variable 'tables'
# 6. Extract the regular words by calling the function check_bboxes() and checking whether words are present in the table or not
# 7. Use the cluster_objects utility to cluster non-table and table words together so that they retain the same chronology as in the original PDF
# 8. Declare an empty list 'lines' to store the page text
# 9. If a text element is present in the cluster, append it to 'lines', else if a table element is present, append the table
# 10. Append the page number and all lines to full_text, and increment 'p'
# 11. When the function has iterated over all pages, return the 'full_text' list

def extract_text_from_pdf(pdf_path):
    p = 0
    full_text = []


    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_no = f"Page {p+1}"
            text = page.extract_text()

            tables = page.find_tables()
            table_bboxes = [i.bbox for i in tables]
            tables = [{'table': i.extract(), 'top': i.bbox[1]} for i in tables]
            non_table_words = [word for word in page.extract_words() if not any([check_bboxes(word, table_bbox) for table_bbox in table_bboxes])]
            lines = []

            for cluster in pdfplumber.utils.cluster_objects(non_table_words + tables, itemgetter('top'), tolerance=5):

                if 'text' in cluster[0]:
                    try:
                        lines.append(' '.join([i['text'] for i in cluster]))
                    except KeyError:
                        pass

                elif 'table' in cluster[0]:
                    lines.append(json.dumps(cluster[0]['table']))


            full_text.append([page_no, " ".join(lines)])
            p +=1

    return full_text

*Now that we have defined the function for extracting the text and tables from a PDF, let's call it for every PDF in the directory and store the results in a list.*

In [12]:
# Define the directory containing the PDF files
pdf_directory = Path(pdf_path)

# Initialize an empty list to store the extracted texts and document names
data = []

# Loop through all files in the directory
for pdf_path in pdf_directory.glob("*.pdf"):

    # Process the PDF file
    print(f"...Processing {pdf_path.name}")

    # Call the function to extract the text from the PDF
    extracted_text = extract_text_from_pdf(pdf_path)

    # Convert the extracted list to a DataFrame, and add a column to store document names
    extracted_text_df = pd.DataFrame(extracted_text, columns=['Page No.', 'Page_Text'])
    extracted_text_df['Document Name'] = pdf_path.name

    # Append the extracted text and document name to the list
    data.append(extracted_text_df)

    # Print a message to indicate progress
    print(f"Finished processing {pdf_path.name}")

# Print a message to indicate all PDFs have been processed
print("All PDFs have been processed.")

...Processing Principal-Sample-Life-Insurance-Policy.pdf
Finished processing Principal-Sample-Life-Insurance-Policy.pdf
All PDFs have been processed.


In [13]:
# Concatenate all the DFs in the list 'data' together
insurance_pdfs_data = pd.concat(data, ignore_index=True)

In [14]:
print("Shape of the data is :: ", insurance_pdfs_data.shape, "\n")

insurance_pdfs_data.sample(2)

Shape of the data is ::  (64, 3) 



Unnamed: 0,Page No.,Page_Text,Document Name
37,Page 38,Section D - Continuation Article 1 - Member Li...,Principal-Sample-Life-Insurance-Policy.pdf
47,Page 48,c . If a beneficiary dies at the same time or ...,Principal-Sample-Life-Insurance-Policy.pdf


In [15]:
# Let's also check the length of all the texts as there might be some empty pages or pages with very few words that we can drop
insurance_pdfs_data['Text_Length'] = insurance_pdfs_data['Page_Text'].apply(lambda x: len(x.split(' ')))

insurance_pdfs_data.sample(2)

Unnamed: 0,Page No.,Page_Text,Document Name,Text_Length
31,Page 32,(1) marriage or establishment of a Civil Union...,Principal-Sample-Life-Insurance-Policy.pdf,429
11,Page 12,An institution that is licensed as a Hospital ...,Principal-Sample-Life-Insurance-Policy.pdf,352


In [16]:
print("-"*50)
print("Maximum Text Length is :: ", max(insurance_pdfs_data['Text_Length']))
print("-"*50)
print("Minimum Text Length is :: ", min(insurance_pdfs_data['Text_Length']))
print("-"*50)

--------------------------------------------------
Maximum Text Length is ::  462
--------------------------------------------------
Minimum Text Length is ::  5
--------------------------------------------------


To skip pages that are essentially blank (fewer than 10 words, or only a header or footer), we filter them out with the following code.

In [17]:
# Retain only the rows with a text length of at least 10
insurance_pdfs_data = insurance_pdfs_data.loc[insurance_pdfs_data['Text_Length'] >= 10]

print("Shape of the data is :: ", insurance_pdfs_data.shape, "\n")

insurance_pdfs_data.head()

Shape of the data is ::  (60, 4) 



Unnamed: 0,Page No.,Page_Text,Document Name,Text_Length
0,Page 1,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,Principal-Sample-Life-Insurance-Policy.pdf,30
2,Page 3,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,Principal-Sample-Life-Insurance-Policy.pdf,230
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY (called The P...,Principal-Sample-Life-Insurance-Policy.pdf,110
5,Page 6,TABLE OF CONTENTS PART I - DEFINITIONS PART II...,Principal-Sample-Life-Insurance-Policy.pdf,153
6,Page 7,Section A – Eligibility Member Life Insurance ...,Principal-Sample-Life-Insurance-Policy.pdf,176


In [18]:
# Store the metadata for each page in a separate column
insurance_pdfs_data['Metadata'] = insurance_pdfs_data.apply(lambda x: {'Policy_Name': x['Document Name'][:-4], 'Page_No.': x['Page No.']}, axis=1)

print("Shape of the data is :: ", insurance_pdfs_data.shape, "\n")

insurance_pdfs_data.head()

Shape of the data is ::  (60, 5) 



Unnamed: 0,Page No.,Page_Text,Document Name,Text_Length,Metadata
0,Page 1,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,Principal-Sample-Life-Insurance-Policy.pdf,30,{'Policy_Name': 'Principal-Sample-Life-Insuran...
2,Page 3,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,Principal-Sample-Life-Insurance-Policy.pdf,230,{'Policy_Name': 'Principal-Sample-Life-Insuran...
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY (called The P...,Principal-Sample-Life-Insurance-Policy.pdf,110,{'Policy_Name': 'Principal-Sample-Life-Insuran...
5,Page 6,TABLE OF CONTENTS PART I - DEFINITIONS PART II...,Principal-Sample-Life-Insurance-Policy.pdf,153,{'Policy_Name': 'Principal-Sample-Life-Insuran...
6,Page 7,Section A – Eligibility Member Life Insurance ...,Principal-Sample-Life-Insurance-Policy.pdf,176,{'Policy_Name': 'Principal-Sample-Life-Insuran...


This wraps up the chunking process. As observed, pages in this insurance document contain at most a few hundred words (the longest has 462), well below 1000. Further chunking is therefore unnecessary, and we can compute embeddings at the page level. This approach is effective for two key reasons:

1. Insurance documents are typically well-structured, meaning the content within a page is usually coherent and contextually related.
2. Using larger chunks ensures we retain more contextual information, which is beneficial for the LLM during the generation stage.


## Approaches to Document Chunking

Below are several **chunking strategies** commonly used for processing documents before creating embeddings.  
Each approach is suited to different content structures and downstream tasks such as **information retrieval** or **LLM-based generation**.

---

### 1. Fixed-Length Chunking
- **Description:** Divide the text into chunks of a fixed size (e.g., 500 tokens each).  
- **Advantages:** Simple to implement and ensures consistent input size for models.  
- **Limitations:** May cut sentences mid-way or lose contextual coherence across chunks.

---

###  2. Sliding Window Chunking
- **Description:** Create overlapping chunks where each chunk starts slightly before the previous one ends (e.g., 500 tokens with a 100-token overlap).  
- **Advantages:** Maintains context between chunks and reduces information loss at boundaries.  
- **Limitations:** Increases data redundancy and storage requirements.
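
Fixed-length and sliding-window chunking can be sketched with a single function, where an overlap of zero gives plain fixed-length chunks. This is a minimal sketch under one stated assumption: it counts whitespace-separated words rather than model tokens, so the sizes are only a rough proxy for the token counts mentioned above. `sliding_window_chunks` is a hypothetical helper name.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 500, overlap: int = 100) -> List[str]:
    """Split text into word-based chunks of chunk_size words, each sharing
    `overlap` words with the previous chunk (overlap=0 gives fixed-length chunks)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(str(i) for i in range(10))
print(sliding_window_chunks(sample, chunk_size=4, overlap=2))
print(sliding_window_chunks(sample, chunk_size=4, overlap=0))
```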

---

###  3. Semantic Chunking
- **Description:** Use NLP-based techniques such as **sentence segmentation**, **topic modeling**, or **semantic similarity** to divide text meaningfully.  
- **Advantages:** Retains semantic consistency, making it ideal for LLMs.  
- **Limitations:** More complex to implement and can produce variable chunk lengths.

---

###  4. Section-Based (Heading) Chunking
- **Description:** Split the document according to its logical structure, such as **headings**, **paragraphs**, or **sections**.  
- **Advantages:** Works well for structured documents like policies, reports, or legal texts.  
- **Limitations:** Depends heavily on consistent formatting or identifiable section markers.
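
For a document whose headings follow a pattern like `Section D - Continuation`, section-based chunking can be approximated with a regular expression. The pattern below is an assumption based on the heading style visible in the extracted page text, and `split_by_section` is a hypothetical helper. It is a heuristic: running footers or table-of-contents lines that repeat a section name would also trigger a split.

```python
import re
from typing import List

# Zero-width lookahead: split *before* markers such as "Section A - " or "Section A – ".
SECTION_PATTERN = re.compile(r"(?=\bSection [A-Z] [-\u2013] )")


def split_by_section(text: str) -> List[str]:
    """Split text at each section marker, keeping the marker with its section."""
    parts = SECTION_PATTERN.split(text)
    return [part.strip() for part in parts if part.strip()]


# Made-up sample text that mimics the heading style; not real policy content.
sample_text = (
    "PART III Section A - Eligibility Members are eligible. "
    "Section B - Effective Dates Coverage begins."
)
for section in split_by_section(sample_text):
    print(section)
```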

---

### 5. Page-Level Chunking
- **Description:** Treat each page (e.g., from a PDF) as a separate chunk.  
- **Advantages:** Simple and maintains alignment with the original document layout.  
- **Limitations:** Page content length can vary significantly, impacting consistency and relevance.

---

###  6. Hybrid Chunking
- **Description:** Combine multiple strategies (e.g., **semantic + sliding window**) to balance structure and context.  
- **Advantages:** Offers flexibility and can be fine-tuned for diverse document types.  
- **Limitations:** Requires careful tuning and additional processing logic.

---

###  Key Insight
The choice of **chunking strategy** significantly influences retrieval accuracy and response quality.  
Experiment with different methods to find the balance between **semantic coherence**, **context preservation**, and **system efficiency** for your specific dataset.


## <font> Generate and Store Embeddings using Gemini and ChromaDB </font>

This stage transforms our text chunks into numerical vector representations using Google's Gemini `text-embedding-004` model, accessed through ChromaDB's `GoogleGenerativeAiEmbeddingFunction`. Each document fragment is encoded into a high-dimensional embedding that captures its semantic essence. 

We then persist these vectors in ChromaDB, a specialized vector database optimized for similarity search operations. This combination provides an efficient architecture for storing and retrieving documents based on meaning rather than keywords. The ChromaDB collection creates a searchable semantic index of our content, enabling natural language queries to find contextually relevant information even when exact terminology differs between query and documents.

In [19]:
# Import the Google Generative AI (Gemini) embedding function from chroma
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction

## ChromaDB `PersistentClient`

`chromadb.PersistentClient` manages a **persistent vector database**, saving collections and embeddings to disk so they **persist across script or notebook restarts**.  
Unlike the in-memory client, data is **not lost** when the program ends (Chroma 0.4 and later store it in a local SQLite-backed directory; older 0.3.x releases used DuckDB + Parquet).

In [20]:
chroma_db_client = chromadb.PersistentClient(path="../chroma_db")

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given


In [21]:
import chromadb.utils.embedding_functions as embedding_functions
gemini_embedding_model = "models/text-embedding-004"

In [22]:
# Initialize the Gemini embedding function with the model name set above
google_ef = embedding_functions.GoogleGenerativeAiEmbeddingFunction(
    api_key=open("GEMINI_API_KEY.txt").read().strip(),
    model_name=gemini_embedding_model
)


In [23]:
# Convert the page text and metadata from your dataframe to lists to be able to pass it to chroma

documents_list = insurance_pdfs_data["Page_Text"].tolist()
metadata_list = insurance_pdfs_data['Metadata'].tolist()

In [24]:
# Initialise a collection in chroma and pass the embedding_function to it so that it uses Gemini embeddings to embed the documents

insurance_collection = chroma_db_client.get_or_create_collection(name='RAG_on_Insurance', embedding_function=google_ef)

Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


In [31]:
# Add the documents and metadata to the collection along with generic integer IDs. You can also feed the metadata information as IDs by combining the policy name and page no.

insurance_collection.add(
    documents= documents_list,
    ids = [str(i) for i in range(0, len(documents_list))],
    metadatas = metadata_list
)

Add of existing embedding ID: 0
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 3
Add of existing embedding ID: 4
Add of existing embedding ID: 5
Add of existing embedding ID: 6
Add of existing embedding ID: 7
Add of existing embedding ID: 8
Add of existing embedding ID: 9
Add of existing embedding ID: 10
Add of existing embedding ID: 11
Add of existing embedding ID: 12
Add of existing embedding ID: 13
Add of existing embedding ID: 14
Add of existing embedding ID: 15
Add of existing embedding ID: 16
Add of existing embedding ID: 17
Add of existing embedding ID: 18
Add of existing embedding ID: 19
Add of existing embedding ID: 20
Add of existing embedding ID: 21
Add of existing embedding ID: 22
Add of existing embedding ID: 23
Add of existing embedding ID: 24
Add of existing embedding ID: 25
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 29
Add of existing embe
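
The "Add of existing embedding ID" messages suggest that this cell was re-run against a persistent collection that already contained IDs `0`, `1`, `2`, and so on. One way to make re-runs predictable is to derive stable IDs from the metadata, as the comment in the cell hints. The sketch below shows only the ID construction; `make_ids` is a hypothetical helper, and the `::` separator is an arbitrary choice.

```python
from typing import Dict, List


def make_ids(metadata_list: List[Dict[str, str]]) -> List[str]:
    """Build one stable ID per page from the policy name and page number."""
    ids = [f"{meta['Policy_Name']}::{meta['Page_No.']}" for meta in metadata_list]
    if len(set(ids)) != len(ids):
        raise ValueError("metadata does not yield unique IDs")
    return ids


# Two sample metadata entries in the shape produced earlier in the notebook.
sample_metadata = [
    {"Policy_Name": "Principal-Sample-Life-Insurance-Policy", "Page_No.": "Page 1"},
    {"Policy_Name": "Principal-Sample-Life-Insurance-Policy", "Page_No.": "Page 3"},
]
page_ids = make_ids(sample_metadata)
print(page_ids)

# Assumption: with a Chroma collection, these IDs could be passed to
# collection.upsert(ids=page_ids, documents=..., metadatas=...) instead of add(),
# so that re-running the cell overwrites existing entries rather than warning.
```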

In [32]:
# Let's take a look at the first few entries in the collection

insurance_collection.get(
    ids = ['0','1','2'],
    include = ['embeddings', 'documents', 'metadatas']
)

{'ids': ['0', '1', '2'],
 'embeddings': [[0.036509085446596146,
   0.03283742815256119,
   -0.006328749004751444,
   -0.007022167555987835,
   -0.008437707088887691,
   0.010922549292445183,
   0.023095931857824326,
   0.07339763641357422,
   0.039684779942035675,
   0.0435480959713459,
   -4.3180280044907704e-05,
   0.042253874242305756,
   0.022973809391260147,
   0.009878566488623619,
   -0.030323462560772896,
   -0.05536549165844917,
   -0.04178239405155182,
   -0.017812509089708328,
   -0.04312549903988838,
   -0.01883009262382984,
   -0.012936219573020935,
   0.004985112231224775,
   0.03909388557076454,
   -0.0409562923014164,
   0.029109658673405647,
   -0.029867617413401604,
   -0.014902184717357159,
   -0.07830485701560974,
   -0.019402742385864258,
   -0.11540698260068893,
   0.03029187209904194,
   0.051797591149806976,
   -0.00226078019477427,
   -0.02098015882074833,
   0.017559051513671875,
   0.030634663999080658,
   -0.02596590295433998,
   -0.0093044713139534,
   0.03

In [33]:
cache_collection = chroma_db_client.get_or_create_collection(name='Insurance_Cache', embedding_function=google_ef)

Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


In [34]:
cache_collection.peek()

{'ids': ['Are there age-based reductions in benefit amounts?',
  'Can insurance coverage continue during an approved leave?',
  'What are the exclusions under Accidental Death?'],
 'embeddings': [[0.03034435398876667,
   0.07213963568210602,
   -0.04397154599428177,
   -0.017953241243958473,
   -0.004093424417078495,
   0.02899101749062538,
   0.03500279784202576,
   0.031027885153889656,
   0.0031284568831324577,
   -0.01402444951236248,
   0.010446053929626942,
   0.06038113310933113,
   0.04303572699427605,
   0.021845785900950432,
   -0.03395450487732887,
   -0.05200449377298355,
   0.014382394962012768,
   -0.056318800896406174,
   -0.09458132088184357,
   -0.016825266182422638,
   -0.005482219159603119,
   0.03204464912414551,
   0.002947814529761672,
   -0.021913575008511543,
   -0.0015045639593154192,
   0.0039012255147099495,
   -0.006755079608410597,
   -0.012096354737877846,
   0.0053550624288618565,
   -0.024580786004662514,
   0.030766168609261513,
   0.00867383647710085,


#  Stage 2: Semantic Search with Cache Layer

This section explains how semantic search uses a cache layer to improve efficiency. When a query arrives, the system first searches the cache collection for the most similar previously asked query. If that cached query lies within a distance threshold (0.2 in this notebook), its stored top-k results are returned immediately; otherwise, the query is run against the main vector database.

New queries and their results are then stored in the cache, enabling faster retrieval for repeated or similar queries. This approach reduces latency, avoids redundant computation, and ensures quick access to relevant documents through efficient indexing and vector search.
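
The control flow described above can be sketched independently of ChromaDB. Everything here is a stand-in chosen so the example runs by itself: `toy_embed` is a bag-of-words substitute for a real embedding model, `SemanticCache` is a hypothetical in-memory class, and `fake_main_search` replaces the query against the main collection.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Toy bag-of-words 'embedding' (a stand-in for a real embedding model)."""
    return dict(Counter(text.lower().split()))


def cosine_distance(a: Vector, b: Vector) -> float:
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


class SemanticCache:
    """Stores (query embedding, results) pairs and returns results for near-duplicate queries."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.2):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Vector, List[str]]] = []

    def lookup(self, query: str) -> Optional[List[str]]:
        q = self.embed(query)
        best = min(self.entries, key=lambda entry: cosine_distance(q, entry[0]), default=None)
        if best is not None and cosine_distance(q, best[0]) <= self.threshold:
            return best[1]
        return None

    def store(self, query: str, results: List[str]) -> None:
        self.entries.append((self.embed(query), results))


def search(query: str, cache: SemanticCache, main_search: Callable[[str], List[str]]) -> Tuple[List[str], bool]:
    """Return (results, cache_hit). On a miss, query the main store and cache the results."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, True
    results = main_search(query)
    cache.store(query, results)
    return results, False


main_calls: List[str] = []


def fake_main_search(query: str) -> List[str]:
    """Stand-in for the main vector search; records each call."""
    main_calls.append(query)
    return ["Page 38"]


cache = SemanticCache(toy_embed, threshold=0.2)
first = search("Can insurance coverage continue during an approved leave?", cache, fake_main_search)
second = search("Can insurance coverage continue during an approved leave?", cache, fake_main_search)
print(first, second, len(main_calls))
```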

### **QnA - Query 1**

In [35]:
# ------------------------------------------------------------------------------
# Question 1 : Can insurance coverage continue during an approved leave?
# ------------------------------------------------------------------------------

# Read the user query
# query = input()
query = "Can insurance coverage continue during an approved leave?"

In [36]:
# Search the cache collection first
# Query the cache against the user query and return the single closest cached query

cache_results = cache_collection.query(
    query_texts=query,
    n_results=1
)

cache_results

Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given


{'ids': [['Can insurance coverage continue during an approved leave?']],
 'distances': [[3.270449460117601e-16]],
 'metadatas': [[{'distances0': '0.5286916519790423',
    'distances1': '0.5456828110817101',
    'distances2': '0.5474555546245193',
    'distances3': '0.5666263667354581',
    'distances4': '0.5701872896262192',
    'distances5': '0.5760730302139674',
    'distances6': '0.6105253299245013',
    'distances7': '0.6380564140824816',
    'distances8': '0.650510813122267',
    'distances9': '0.651737639962153',
    'documents0': 'Section A – Eligibility Member Life Insurance Article 1 Member Accidental Death and Dismemberment Insurance Article 2 Dependent Life Insurance Article 3 Section B - Effective Dates Member Life Insurance Article 1 Member Accidental Death and Dismemberment Insurance Article 2 Dependent Life Insurance Article 3 Section C - Individual Terminations Member Life Insurance Article 1 Member Accidental Death and Dismemberment Insurance Article 2 Dependent Life

In [37]:
results = insurance_collection.query(
    query_texts=query,
    n_results=10
)
results.items()

Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given


dict_items([('ids', [['4', '37', '34', '38', '24', '35', '22', '30', '21', '32']]), ('distances', [[0.5286916534232825, 0.5456828100194504, 0.5474555555599144, 0.5666263655014665, 0.57018728963293, 0.5760730313999377, 0.6105253289535751, 0.6380564132013675, 0.6505108128113918, 0.6517376418434461]]), ('metadatas', [[{'Page_No.': 'Page 7', 'Policy_Name': 'Principal-Sample-Life-Insurance-Policy'}, {'Page_No.': 'Page 40', 'Policy_Name': 'Principal-Sample-Life-Insurance-Policy'}, {'Page_No.': 'Page 37', 'Policy_Name': 'Principal-Sample-Life-Insurance-Policy'}, {'Page_No.': 'Page 41', 'Policy_Name': 'Principal-Sample-Life-Insurance-Policy'}, {'Page_No.': 'Page 27', 'Policy_Name': 'Principal-Sample-Life-Insurance-Policy'}, {'Page_No.': 'Page 38', 'Policy_Name': 'Principal-Sample-Life-Insurance-Policy'}, {'Page_No.': 'Page 25', 'Policy_Name': 'Principal-Sample-Life-Insurance-Policy'}, {'Page_No.': 'Page 33', 'Policy_Name': 'Principal-Sample-Life-Insurance-Policy'}, {'Page_No.': 'Page 24', 'Pol

In [38]:
# Implementing Cache in Semantic Search

# Set a threshold for cache search
threshold = 0.2

ids = []
documents = []
distances = []
metadatas = []
results_df = pd.DataFrame()


# If the distance is greater than the threshold, then return the results from the main collection.

if cache_results['distances'][0] == [] or cache_results['distances'][0][0] > threshold:
    # Query the collection against the user query and return the top 10 results
    results = insurance_collection.query(
    query_texts=query,
    n_results=10
    )
    
    # Store the query in cache_collection as document w.r.t to ChromaDB so that it can be embedded and searched against later
    # Store retrieved text, ids, distances and metadatas in cache_collection as metadatas, so that they can be fetched easily if a query indeed matches to a query in cache
    Keys = []
    Values = []

    for key, val in results.items():
        # print(f"Key: {key}, Type of val: {type(val)}, Length of val: {len(val) if val else 0}")
        if val is None:
            continue
        if key != 'embeddings':
            # Ensure val[0] exists and is a list-like object
            if len(val) > 0 and isinstance(val[0], (list, tuple)):
                for i in range(min(10, len(val[0]))):  # Avoid IndexError
                    Keys.append(str(key) + str(i))
                    Values.append(str(val[0][i]))
            else:
                # Optional: log or handle unexpected structure
                print(f"Skipping key: {key}, unexpected structure: {type(val[0]) if val else 'empty'}")

    cache_collection.add(
      documents= [query],
      ids = [query],  # Or if you want to assign integers as IDs 0,1,2,.., then you can use "len(cache_results['documents'])" as will return the no. of queries currently in the cache and assign the next digit to the new query."
      metadatas = dict(zip(Keys, Values))
    )

    print("Not found in cache. Found in main collection.")
    
    result_dict = {'Metadatas': results['metadatas'][0], 'Documents': results['documents'][0], 'Distances': results['distances'][0], "IDs":results["ids"][0]}
    results_df = pd.DataFrame.from_dict(result_dict)
    results_df

elif cache_results['distances'][0][0] <= threshold:
    cache_result_dict = cache_results['metadatas'][0][0]

    # Loop through each inner list and then through the dictionary
    for key, value in cache_result_dict.items():
        if 'ids' in key:
            ids.append(value)
        elif 'documents' in key:
            documents.append(value)
        elif 'distances' in key:
            distances.append(value)
        elif 'metadatas' in key:
            metadatas.append(value)

    print("Found in cache!")

    # Create a DataFrame
    results_df = pd.DataFrame({
        'IDs': ids,
        'Documents': documents,
        'Distances': distances,
        'Metadatas': metadatas
    })


Found in cache!


In [39]:
results_df

Unnamed: 0,IDs,Documents,Distances,Metadatas
0,4,Section A – Eligibility Member Life Insurance ...,0.5286916519790423,"{'Page_No.': 'Page 7', 'Policy_Name': 'Princip..."
1,37,Section E - Reinstatement Article 1 - Reinstat...,0.5456828110817101,"{'Page_No.': 'Page 40', 'Policy_Name': 'Princi..."
2,34,b. a business assignment; or c. full-time stud...,0.5474555546245193,"{'Page_No.': 'Page 37', 'Policy_Name': 'Princi..."
3,38,I f coverage for a Member or Dependent termina...,0.5666263667354581,"{'Page_No.': 'Page 41', 'Policy_Name': 'Princi..."
4,24,I f a Member's Dependent is employed and is co...,0.5701872896262192,"{'Page_No.': 'Page 27', 'Policy_Name': 'Princi..."
5,35,Section D - Continuation Article 1 - Member Li...,0.5760730302139674,"{'Page_No.': 'Page 38', 'Policy_Name': 'Princi..."
6,22,Section D - Policy Renewal Article 1 - Renewal...,0.6105253299245013,"{'Page_No.': 'Page 25', 'Policy_Name': 'Princi..."
7,30,a . In no event will Dependent Life Insurance ...,0.6380564140824816,"{'Page_No.': 'Page 33', 'Policy_Name': 'Princi..."
8,21,T he Principal may terminate the Policyholder'...,0.650510813122267,"{'Page_No.': 'Page 24', 'Policy_Name': 'Princi..."
9,32,Section C - Individual Terminations Article 1 ...,0.651737639962153,"{'Page_No.': 'Page 35', 'Policy_Name': 'Princi..."
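
Because the cache stores every value as a string under keys such as `distances0` or `metadatas3`, results read back from the cache arrive as text and in whatever order the keys are iterated. The sketch below shows one possible round trip that restores the original order and types. `flatten_results` and `unflatten_results` are hypothetical helpers, and the sample input only mimics the shape of a Chroma query result.

```python
import ast
import re
from typing import Any, Dict, List

FIELDS = ("ids", "documents", "distances", "metadatas")
KEY_PATTERN = re.compile(r"^(ids|documents|distances|metadatas)(\d+)$")


def flatten_results(results: Dict[str, Any], limit: int = 10) -> Dict[str, str]:
    """Turn nested query results into a flat {field+index: str(value)} dict."""
    flat: Dict[str, str] = {}
    for field in FIELDS:
        values = results.get(field)
        if not values:
            continue
        for i, value in enumerate(values[0][:limit]):
            flat[f"{field}{i}"] = str(value)
    return flat


def unflatten_results(flat: Dict[str, str]) -> Dict[str, List[Any]]:
    """Rebuild ordered lists from the flat dict, restoring floats and dicts."""
    buckets: Dict[str, Dict[int, Any]] = {field: {} for field in FIELDS}
    for key, value in flat.items():
        match = KEY_PATTERN.match(key)
        if not match:
            continue
        field, index = match.group(1), int(match.group(2))
        if field == "distances":
            parsed: Any = float(value)
        elif field == "metadatas":
            parsed = ast.literal_eval(value)  # safely parse the stringified dict
        else:
            parsed = value
        buckets[field][index] = parsed
    # Sort by the numeric index so that item 10 comes after item 9, not after item 1.
    return {field: [items[i] for i in sorted(items)] for field, items in buckets.items()}


# Sample input shaped like a query result; values are illustrative.
sample_results = {
    "ids": [["4", "37"]],
    "documents": [["first page text", "second page text"]],
    "distances": [[0.52, 0.54]],
    "metadatas": [[{"Page_No.": "Page 7"}, {"Page_No.": "Page 40"}]],
    "embeddings": None,
}
restored = unflatten_results(flatten_results(sample_results))
print(restored["distances"], restored["metadatas"])
```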


### **QnA - Query 2**

In [40]:
# ------------------------------------------------------------------------------
# Question 2 : Are there age-based reductions in benefit amounts?
# ------------------------------------------------------------------------------

# Read the user query
# query2 = input()
query2 = "Are there age-based reductions in benefit amounts?"

In [41]:
# Search the cache collection first
# Query the cache against the user query and return the single closest cached query

cache_results2 = cache_collection.query(
    query_texts=query2,
    n_results=1
)

cache_results2

{'ids': [['Are there age-based reductions in benefit amounts?']],
 'distances': [[4.751405453836574e-16]],
 'metadatas': [[{'distances0': '0.633002353766436',
    'distances1': '0.6363616274217812',
    'distances2': '0.6497752943669397',
    'distances3': '0.6673524554760106',
    'distances4': '0.67687468993528',
    'distances5': '0.6798112570864675',
    'distances6': '0.6857670322430902',
    'distances7': '0.687318287187824',
    'distances8': '0.6897947000080202',
    'distances9': '0.697204637172362',
    'documents0': 'Section D - Policy Renewal Article 1 - Renewal Insurance under this Group Policy runs annually to the Policy Anniversary, unless sooner terminated. While this Group Policy is in force, and subject to the provisions in PART II, Section C, the Policyholder may renew at the applicable premium rates in effect on the Policy Anniversary. This policy has been updated effective January 1, 2014 PART II - POLICY ADMINISTRATION GC 6005 A Section D - Policy Renewal, Page 1'

In [42]:
# Implementing Cache in Semantic Search

# Set a threshold for cache search
threshold = 0.2

# Initialize lists and dataframe
ids2 = []
documents2 = []
distances2 = []
metadatas2 = []
results_df2 = pd.DataFrame()

# Check if cache is empty or not useful
if not cache_results2['distances'][0] or cache_results2['distances'][0][0] > threshold:
    # Query the main collection
    results = insurance_collection.query(
        query_texts=query2,
        n_results=10
    )

    # Prepare metadata for caching
    Keys2 = []
    Values2 = []

    for key, val in results.items():
        if val is None or key == 'embeddings':
            continue
        if len(val) > 0 and isinstance(val[0], (list, tuple)):
            for i in range(min(10, len(val[0]))):
                Keys2.append(f"{key}{i}")
                Values2.append(str(val[0][i]))
        else:
            print(f"Skipping key '{key}' due to unexpected structure or empty list.")

    # Add to cache
    cache_collection.add(
        documents=[query2],
        ids=[query2],
        metadatas=dict(zip(Keys2, Values2))
    )

    print("Not found in cache. Found in main collection.")

    # Prepare result DataFrame
    result_dict2 = {
        'Metadatas': results['metadatas'][0],
        'Documents': results['documents'][0],
        'Distances': results['distances'][0],
        'IDs': results['ids'][0]
    }
    results_df2 = pd.DataFrame.from_dict(result_dict2)

# Use cached results
elif cache_results2['distances'][0][0] <= threshold:
    cache_result_dict2 = cache_results2['metadatas'][0][0]

    for key, value in cache_result_dict2.items():
        if 'ids' in key:
            ids2.append(value)
        elif 'documents' in key:
            documents2.append(value)
        elif 'distances' in key:
            distances2.append(value)
        elif 'metadatas' in key:
            metadatas2.append(value)

    print("Found in cache!")

    results_df2 = pd.DataFrame({
        'IDs': ids2,
        'Documents': documents2,
        'Distances': distances2,
        'Metadatas': metadatas2
    })


Found in cache!


In [43]:
results_df2

Unnamed: 0,IDs,Documents,Distances,Metadatas
0,22,Section D - Policy Renewal Article 1 - Renewal...,0.633002353766436,"{'Page_No.': 'Page 25', 'Policy_Name': 'Princi..."
1,43,PART IV - BENEFITS Section A - Member Life Ins...,0.6363616274217812,"{'Page_No.': 'Page 46', 'Policy_Name': 'Princi..."
2,24,I f a Member's Dependent is employed and is co...,0.6497752943669397,"{'Page_No.': 'Page 27', 'Policy_Name': 'Princi..."
3,3,TABLE OF CONTENTS PART I - DEFINITIONS PART II...,0.6673524554760106,"{'Page_No.': 'Page 6', 'Policy_Name': 'Princip..."
4,4,Section A – Eligibility Member Life Insurance ...,0.67687468993528,"{'Page_No.': 'Page 7', 'Policy_Name': 'Princip..."
5,42,(1) If termination is as described in b. (1) a...,0.6798112570864675,"{'Page_No.': 'Page 45', 'Policy_Name': 'Princi..."
6,5,Section A - Member Life Insurance Schedule of ...,0.6857670322430902,"{'Page_No.': 'Page 8', 'Policy_Name': 'Princip..."
7,54,% of Scheduled Covered Loss Benefit Loss of Sp...,0.687318287187824,"{'Page_No.': 'Page 57', 'Policy_Name': 'Princi..."
8,49,(1) only one Accelerated Benefit payment will ...,0.6897947000080202,"{'Page_No.': 'Page 52', 'Policy_Name': 'Princi..."
9,50,Section B - Member Accidental Death and Dismem...,0.697204637172362,"{'Page_No.': 'Page 53', 'Policy_Name': 'Princi..."


### **QnA - Query 3**

In [44]:
# ------------------------------------------------------------------------------
# Question 3 : What are the exclusions under Accidental Death?
# ------------------------------------------------------------------------------

# Read the user query
# query3 = input()
query3 = "What are the exclusions under Accidental Death?"

In [45]:
# Search the cache collection first
# Query the cache against the user query and return the single closest cached query

cache_results3 = cache_collection.query(
    query_texts=query3,
    n_results=1
)

cache_results3

{'ids': [['What are the exclusions under Accidental Death?']],
 'distances': [[3.933661870335439e-16]],
 'metadatas': [[{'distances0': '0.6021791318892578',
    'distances1': '0.6048759108740003',
    'distances2': '0.6208896966127476',
    'distances3': '0.623814030439745',
    'distances4': '0.638855618724952',
    'distances5': '0.6549313261740821',
    'distances6': '0.6682294986486059',
    'distances7': '0.6715736415740987',
    'distances8': '0.6755426669653122',
    'distances9': '0.6819361401091601',
    'documents0': 'Section A - Member Life Insurance Schedule of Insurance Article 1 Death Benefits Payable Article 2 Beneficiary Article 3 Facility of Payment Article 4 Settlement of Proceeds Article 5 Member Life Insurance - Coverage During Disability Article 6 Accelerated Benefits Article 7 Section B - Member Accidental Death and Dismemberment Insurance Schedule of Insurance Article 1 Benefit Qualification Article 2 Benefits Payable Article 3 Seat Belt Benefit Article 4 Loss of

In [46]:
def semantic_cache_search(query, cache_results, collection, cache_collection, threshold=0.2, top_n=10):
    """
    Performs a semantic search with caching mechanism.

    Parameters:
        query (str): The input query string.
        cache_results (dict): The results from the cache collection.
        collection (object): The main ChromaDB collection to query if cache misses.
        cache_collection (object): The cache ChromaDB collection to store/retrieve results.
        threshold (float): Maximum cache distance that counts as a hit; a larger distance is treated as a miss.
        top_n (int): Number of top results to retrieve.

    Returns:
        pd.DataFrame: DataFrame containing search results.
    """
    ids, documents, distances, metadatas = [], [], [], []
    results_df = pd.DataFrame()

    # Check cache miss or low relevance
    if not cache_results['distances'][0] or cache_results['distances'][0][0] > threshold:
        results = collection.query(query_texts=query, n_results=top_n)

        keys, values = [], []

        for key, val in results.items():
            if val is None or key == 'embeddings':
                continue

            if isinstance(val, list) and len(val) > 0 and isinstance(val[0], (list, tuple)):
                for i in range(min(top_n, len(val[0]))):
                    keys.append(f"{key}{i}")
                    values.append(str(val[0][i]))
            else:
                print(f"Skipping key '{key}' due to unexpected structure.")

        cache_collection.add(
            documents=[query],
            ids=[query],
            metadatas=dict(zip(keys, values))
        )

        print("Not found in cache. Found in main collection.")

        result_dict = {
            'Metadatas': results['metadatas'][0],
            'Documents': results['documents'][0],
            'Distances': results['distances'][0],
            'IDs': results['ids'][0]
        }
        results_df = pd.DataFrame.from_dict(result_dict)

    # Cache hit
    elif cache_results['distances'][0][0] <= threshold:
        cache_result_dict = cache_results['metadatas'][0][0]

        for key, value in cache_result_dict.items():
            if 'ids' in key:
                ids.append(value)
            elif 'documents' in key:
                documents.append(value)
            elif 'distances' in key:
                distances.append(value)
            elif 'metadatas' in key:
                metadatas.append(value)

        print("Found in cache!")

        results_df = pd.DataFrame({
            'IDs': ids,
            'Documents': documents,
            'Distances': distances,
            'Metadatas': metadatas
        })

    return results_df

results_df3 = semantic_cache_search(
    query=query3,
    cache_results=cache_results3,
    collection=insurance_collection,
    cache_collection=cache_collection,
    threshold=0.2,
    top_n=10
)

Found in cache!


In [47]:
results_df3

Unnamed: 0,IDs,Documents,Distances,Metadatas
0,5,Section A - Member Life Insurance Schedule of ...,0.6021791318892578,"{'Page_No.': 'Page 8', 'Policy_Name': 'Princip..."
1,51,"f . claim requirements listed in PART IV, Sect...",0.6048759108740003,"{'Page_No.': 'Page 54', 'Policy_Name': 'Princi..."
2,55,"a. willful self-injury or self-destruction, wh...",0.6208896966127476,"{'Page_No.': 'Page 58', 'Policy_Name': 'Princi..."
3,4,Section A – Eligibility Member Life Insurance ...,0.623814030439745,"{'Page_No.': 'Page 7', 'Policy_Name': 'Princip..."
4,32,Section C - Individual Terminations Article 1 ...,0.638855618724952,"{'Page_No.': 'Page 35', 'Policy_Name': 'Princi..."
5,24,I f a Member's Dependent is employed and is co...,0.6549313261740821,"{'Page_No.': 'Page 27', 'Policy_Name': 'Princi..."
6,54,% of Scheduled Covered Loss Benefit Loss of Sp...,0.6682294986486059,"{'Page_No.': 'Page 57', 'Policy_Name': 'Princi..."
7,23,PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ...,0.6715736415740987,"{'Page_No.': 'Page 26', 'Policy_Name': 'Princi..."
8,50,Section B - Member Accidental Death and Dismem...,0.6755426669653122,"{'Page_No.': 'Page 53', 'Policy_Name': 'Princi..."
9,52,Exposure Exposure to the elements will be pres...,0.6819361401091601,"{'Page_No.': 'Page 55', 'Policy_Name': 'Princi..."
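
On a cache hit, `semantic_cache_search` returns whatever was stored in the cache metadata. Those values were written with `str(...)`, so `Distances` and `Metadatas` come back as strings, and the row order depends on the order of the metadata keys. The following is a minimal sketch of rebuilding typed rows by explicit index, assuming the `ids{i}` / `documents{i}` / `distances{i}` / `metadatas{i}` key layout used above. `restore_cached_rows` is a hypothetical helper, not part of the notebook, and the sample entry is shortened for illustration.

```python
import ast


def restore_cached_rows(cached_metadata):
    """Rebuild typed result rows from a flattened, stringified cache entry."""
    # Count how many results were cached by counting the 'ids{i}' keys.
    n_rows = sum(1 for key in cached_metadata if key.startswith("ids"))
    rows = []
    for i in range(n_rows):
        rows.append({
            "IDs": cached_metadata[f"ids{i}"],
            "Documents": cached_metadata[f"documents{i}"],
            # Distances were stored as strings; convert back to float.
            "Distances": float(cached_metadata[f"distances{i}"]),
            # Metadata dicts were stored via str(dict); parse them back safely.
            "Metadatas": ast.literal_eval(cached_metadata[f"metadatas{i}"]),
        })
    return rows


# Illustrative (shortened) cache entry with a single cached result.
sample_entry = {
    "ids0": "5",
    "documents0": "Section A - Member Life Insurance Schedule of ...",
    "distances0": "0.6021791318892578",
    "metadatas0": "{'Page_No.': 'Page 8'}",
}
rows = restore_cached_rows(sample_entry)
print(rows[0]["Distances"], rows[0]["Metadatas"]["Page_No."])
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` if a DataFrame is needed downstream.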


### What is a Semantic Cache?

A **semantic cache** stores the **meaning** (semantic representation) of a query or request, not just the raw data, along with the corresponding responses.  

This caching mechanism reduces the number of database queries by **recalling previously processed queries and their results**.

---

### How It Works

1. **New query processing:**
   - A **vector representation** of the query is generated.
   - The system first **searches this vector in the cache**.

2. **If the query is found** in the cache:
   - The system **skips the semantic search layer**, which is often a performance bottleneck.
   - The result is **retrieved instantly** from the cache.

3. **If the query is not found**:
   - The system queries the **main collection**.
   - It retrieves the **top *k* closest documents or chunks**.
   - These results are returned to the user and **stored in the cache** for future use.
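
The steps above can be sketched without any vector database. This is an illustration only: `embed` is a toy bag-of-words stand-in for a real embedding model, and `SemanticCache`, `search`, and `fake_main_search` are hypothetical names rather than the notebook's ChromaDB-based implementation.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").split())


def cosine_distance(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


class SemanticCache:
    def __init__(self, threshold=0.2):
        self.threshold = threshold
        self.entries = []  # list of (query_vector, cached_results)

    def lookup(self, query):
        """Return cached results if the nearest cached query is close enough."""
        vector = embed(query)
        best = min(self.entries, key=lambda e: cosine_distance(vector, e[0]), default=None)
        if best is not None and cosine_distance(vector, best[0]) <= self.threshold:
            return best[1]
        return None

    def store(self, query, results):
        self.entries.append((embed(query), results))


def search(query, cache, main_search):
    """Try the cache first; on a miss, query the main collection and cache the result."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, True
    results = main_search(query)
    cache.store(query, results)
    return results, False


def fake_main_search(query):
    # Placeholder for the expensive search against the main collection.
    return ["chunk-1", "chunk-2"]


cache = SemanticCache(threshold=0.2)
_, first_hit = search("What are the exclusions under Accidental Death?", cache, fake_main_search)
_, second_hit = search("what are the exclusions under accidental death", cache, fake_main_search)
print(first_hit, second_hit)
```

The threshold plays the same role as `threshold=0.2` in the notebook: a smaller value makes the cache stricter, a larger one lets more paraphrases reuse a cached result.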

---

### Benefits

- **Faster response times**  
- **Reduced load on the main database**  
- **Customizable and monitorable** for optimal performance  
- **Improved user experience** through quicker result retrieval  

By remembering previous queries and their results, a semantic cache can significantly enhance the efficiency and responsiveness of your application.

---

### **QnA - Query 3.1 - Checking if found in Cache**

In [48]:
query3 = "What exclusions apply to Accidental Death coverage?"

# Query the cache collection against the user query and return the single closest cached query
cache_results3 = cache_collection.query(
    query_texts=query3,
    n_results=1
)

distances = cache_results3['distances'][0][0]
print("Threshold Distance :: ", distances)

results_df3 = semantic_cache_search(
    query=query3,
    cache_results=cache_results3,
    collection=insurance_collection,
    cache_collection=cache_collection,
    threshold=0.2,
    top_n=10
)

Threshold Distance ::  0.08890453867705937
Found in cache!


### **QnA - Query 3.2 - Checking if found in Cache for Another Similar Question**

In [49]:
query3 = "What are the policy exclusions under Accidental Death insurance?"

# Query the cache collection against the user query and return the single closest cached query
cache_results3 = cache_collection.query(
    query_texts=query3,
    n_results=1
)

distances = cache_results3['distances'][0][0]
print("Threshold Distance :: ", distances)

results_df3 = semantic_cache_search(
    query=query3,
    cache_results=cache_results3,
    collection=insurance_collection,
    cache_collection=cache_collection,
    threshold=0.2,
    top_n=10
)

Threshold Distance ::  0.07049857063945829
Found in cache!


In [50]:
results_df

Unnamed: 0,IDs,Documents,Distances,Metadatas
0,4,Section A – Eligibility Member Life Insurance ...,0.5286916519790423,"{'Page_No.': 'Page 7', 'Policy_Name': 'Princip..."
1,37,Section E - Reinstatement Article 1 - Reinstat...,0.5456828110817101,"{'Page_No.': 'Page 40', 'Policy_Name': 'Princi..."
2,34,b. a business assignment; or c. full-time stud...,0.5474555546245193,"{'Page_No.': 'Page 37', 'Policy_Name': 'Princi..."
3,38,I f coverage for a Member or Dependent termina...,0.5666263667354581,"{'Page_No.': 'Page 41', 'Policy_Name': 'Princi..."
4,24,I f a Member's Dependent is employed and is co...,0.5701872896262192,"{'Page_No.': 'Page 27', 'Policy_Name': 'Princi..."
5,35,Section D - Continuation Article 1 - Member Li...,0.5760730302139674,"{'Page_No.': 'Page 38', 'Policy_Name': 'Princi..."
6,22,Section D - Policy Renewal Article 1 - Renewal...,0.6105253299245013,"{'Page_No.': 'Page 25', 'Policy_Name': 'Princi..."
7,30,a . In no event will Dependent Life Insurance ...,0.6380564140824816,"{'Page_No.': 'Page 33', 'Policy_Name': 'Princi..."
8,21,T he Principal may terminate the Policyholder'...,0.650510813122267,"{'Page_No.': 'Page 24', 'Policy_Name': 'Princi..."
9,32,Section C - Individual Terminations Article 1 ...,0.651737639962153,"{'Page_No.': 'Page 35', 'Policy_Name': 'Princi..."


In [54]:
# Import the CrossEncoder library from sentence_transformers
from sentence_transformers import CrossEncoder, util

# Initialise the cross encoder model
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

In [55]:
# Test the cross encoder model

scores = cross_encoder.predict([['Does the insurance cover diabetic patients?', 'The insurance policy covers some pre-existing conditions including diabetes, heart diseases, etc. The policy does not howev'],
                                ['Does the insurance cover diabetic patients?', 'The premium rates for various age groups are given as follows. Age group (<18 years): Premium rate']])

In [56]:
scores

array([  4.4608426, -11.197129 ], dtype=float32)
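
The scores above are unbounded relevance scores (higher means more relevant), not probabilities. If a 0 to 1 scale is easier to read, one option is to pass them through a sigmoid, which preserves the ranking. This sketch assumes the model emits a single logit per (query, passage) pair and reuses the two values printed above.

```python
import math


def sigmoid(x):
    """Map an unbounded score to the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-x))


raw_scores = [4.4608426, -11.197129]  # values printed by the test cell above
probabilities = [sigmoid(score) for score in raw_scores]

for score, prob in zip(raw_scores, probabilities):
    print(f"raw={score:.4f} -> sigmoid={prob:.6f}")
```

The first pair (the passage mentioning diabetes) maps close to 1, while the unrelated premium-rate passage maps close to 0.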

In [57]:
results_df2

Unnamed: 0,IDs,Documents,Distances,Metadatas
0,22,Section D - Policy Renewal Article 1 - Renewal...,0.633002353766436,"{'Page_No.': 'Page 25', 'Policy_Name': 'Princi..."
1,43,PART IV - BENEFITS Section A - Member Life Ins...,0.6363616274217812,"{'Page_No.': 'Page 46', 'Policy_Name': 'Princi..."
2,24,I f a Member's Dependent is employed and is co...,0.6497752943669397,"{'Page_No.': 'Page 27', 'Policy_Name': 'Princi..."
3,3,TABLE OF CONTENTS PART I - DEFINITIONS PART II...,0.6673524554760106,"{'Page_No.': 'Page 6', 'Policy_Name': 'Princip..."
4,4,Section A – Eligibility Member Life Insurance ...,0.67687468993528,"{'Page_No.': 'Page 7', 'Policy_Name': 'Princip..."
5,42,(1) If termination is as described in b. (1) a...,0.6798112570864675,"{'Page_No.': 'Page 45', 'Policy_Name': 'Princi..."
6,5,Section A - Member Life Insurance Schedule of ...,0.6857670322430902,"{'Page_No.': 'Page 8', 'Policy_Name': 'Princip..."
7,54,% of Scheduled Covered Loss Benefit Loss of Sp...,0.687318287187824,"{'Page_No.': 'Page 57', 'Policy_Name': 'Princi..."
8,49,(1) only one Accelerated Benefit payment will ...,0.6897947000080202,"{'Page_No.': 'Page 52', 'Policy_Name': 'Princi..."
9,50,Section B - Member Accidental Death and Dismem...,0.697204637172362,"{'Page_No.': 'Page 53', 'Policy_Name': 'Princi..."


In [58]:
# Build (query, response) pairs for each of the top 10 responses returned by the semantic search
# and generate the cross-encoder scores for these pairs

cross_inputs = [[query, response] for response in results_df['Documents']]
cross_rerank_scores = cross_encoder.predict(cross_inputs)
cross_rerank_scores

array([-4.3488383,  4.2650614, -8.656078 , -1.8442657, -6.599899 ,
        4.4922757, -6.8963056, -1.559392 , -4.415044 , -6.4406357],
      dtype=float32)

In [61]:
# Store the rerank_scores in results_df
results_df['Reranked_scores'] = cross_rerank_scores

print("First Query :: ", query , "\n")

results_df

First Query ::  Can insurance coverage continue during an approved leave? 



Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
0,4,Section A – Eligibility Member Life Insurance ...,0.5286916519790423,"{'Page_No.': 'Page 7', 'Policy_Name': 'Princip...",-4.348838
1,37,Section E - Reinstatement Article 1 - Reinstat...,0.5456828110817101,"{'Page_No.': 'Page 40', 'Policy_Name': 'Princi...",4.265061
2,34,b. a business assignment; or c. full-time stud...,0.5474555546245193,"{'Page_No.': 'Page 37', 'Policy_Name': 'Princi...",-8.656078
3,38,I f coverage for a Member or Dependent termina...,0.5666263667354581,"{'Page_No.': 'Page 41', 'Policy_Name': 'Princi...",-1.844266
4,24,I f a Member's Dependent is employed and is co...,0.5701872896262192,"{'Page_No.': 'Page 27', 'Policy_Name': 'Princi...",-6.599899
5,35,Section D - Continuation Article 1 - Member Li...,0.5760730302139674,"{'Page_No.': 'Page 38', 'Policy_Name': 'Princi...",4.492276
6,22,Section D - Policy Renewal Article 1 - Renewal...,0.6105253299245013,"{'Page_No.': 'Page 25', 'Policy_Name': 'Princi...",-6.896306
7,30,a . In no event will Dependent Life Insurance ...,0.6380564140824816,"{'Page_No.': 'Page 33', 'Policy_Name': 'Princi...",-1.559392
8,21,T he Principal may terminate the Policyholder'...,0.650510813122267,"{'Page_No.': 'Page 24', 'Policy_Name': 'Princi...",-4.415044
9,32,Section C - Individual Terminations Article 1 ...,0.651737639962153,"{'Page_No.': 'Page 35', 'Policy_Name': 'Princi...",-6.440636


In [62]:
# Return the top 3 results from semantic search
top_3_semantic = results_df.sort_values(by='Distances')
top_3_semantic[:3]


Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
0,4,Section A – Eligibility Member Life Insurance ...,0.5286916519790423,"{'Page_No.': 'Page 7', 'Policy_Name': 'Princip...",-4.348838
1,37,Section E - Reinstatement Article 1 - Reinstat...,0.5456828110817101,"{'Page_No.': 'Page 40', 'Policy_Name': 'Princi...",4.265061
2,34,b. a business assignment; or c. full-time stud...,0.5474555546245193,"{'Page_No.': 'Page 37', 'Policy_Name': 'Princi...",-8.656078


In [63]:
# Return the top 3 results after reranking
top_3_rerank = results_df.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank[:3]

Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
5,35,Section D - Continuation Article 1 - Member Li...,0.5760730302139674,"{'Page_No.': 'Page 38', 'Policy_Name': 'Princi...",4.492276
1,37,Section E - Reinstatement Article 1 - Reinstat...,0.5456828110817101,"{'Page_No.': 'Page 40', 'Policy_Name': 'Princi...",4.265061
7,30,a . In no event will Dependent Life Insurance ...,0.6380564140824816,"{'Page_No.': 'Page 33', 'Policy_Name': 'Princi...",-1.559392


In [64]:
top_3_RAG_query1 = top_3_rerank[["Documents", "Metadatas"]][:3]

top_3_RAG_query1

Unnamed: 0,Documents,Metadatas
5,Section D - Continuation Article 1 - Member Li...,"{'Page_No.': 'Page 38', 'Policy_Name': 'Princi..."
1,Section E - Reinstatement Article 1 - Reinstat...,"{'Page_No.': 'Page 40', 'Policy_Name': 'Princi..."
7,a . In no event will Dependent Life Insurance ...,"{'Page_No.': 'Page 33', 'Policy_Name': 'Princi..."
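
The same score, sort, and slice steps are repeated for each query below. A small helper can capture the pattern. In this sketch, `rerank` and `overlap_score` are hypothetical names: `score_fn` is any callable that takes a list of `[query, document]` pairs and returns one score per pair (in the notebook this would be `cross_encoder.predict`), and the word-overlap scorer and sample documents are toy stand-ins so the example runs without downloading a model.

```python
def rerank(query, documents, score_fn, top_k=3):
    """Score each (query, document) pair and return the top_k documents with scores."""
    pairs = [[query, doc] for doc in documents]
    scores = list(score_fn(pairs))
    ranked = sorted(zip(documents, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def overlap_score(pairs):
    """Hypothetical toy scorer: number of words shared by query and document."""
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]


docs = [
    "Policy renewal terms",
    "Coverage may continue during an approved leave of absence",
    "Coverage ends at termination",
]
top = rerank("coverage during approved leave", docs, overlap_score, top_k=2)
for doc, score in top:
    print(score, doc)
```

Passing the scorer in as a parameter keeps the ranking logic independent of the specific cross-encoder model.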


### Re-rank with Cross Encoder for Query-2

In [None]:
results_df2

In [66]:
# Build (query, response) pairs for each of the top 10 responses returned by the semantic search
# and generate the cross-encoder scores for these pairs

cross_inputs2 = [[query2, response] for response in results_df2['Documents']]
cross_rerank_scores2 = cross_encoder.predict(cross_inputs2)
cross_rerank_scores2

array([-11.292296 ,  -0.9980262, -11.227061 , -11.319546 , -11.191068 ,
        -7.997163 , -11.009435 ,  -7.6251597,  -3.6773458,  -3.8884573],
      dtype=float32)

In [67]:
# Store the rerank scores in results_df2
results_df2['Reranked_scores'] = cross_rerank_scores2

print("Second Query :: ", query2 , "\n")

results_df2

Second Query ::  Are there age-based reductions in benefit amounts? 



Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
0,22,Section D - Policy Renewal Article 1 - Renewal...,0.633002353766436,"{'Page_No.': 'Page 25', 'Policy_Name': 'Princi...",-11.292296
1,43,PART IV - BENEFITS Section A - Member Life Ins...,0.6363616274217812,"{'Page_No.': 'Page 46', 'Policy_Name': 'Princi...",-0.998026
2,24,I f a Member's Dependent is employed and is co...,0.6497752943669397,"{'Page_No.': 'Page 27', 'Policy_Name': 'Princi...",-11.227061
3,3,TABLE OF CONTENTS PART I - DEFINITIONS PART II...,0.6673524554760106,"{'Page_No.': 'Page 6', 'Policy_Name': 'Princip...",-11.319546
4,4,Section A – Eligibility Member Life Insurance ...,0.67687468993528,"{'Page_No.': 'Page 7', 'Policy_Name': 'Princip...",-11.191068
5,42,(1) If termination is as described in b. (1) a...,0.6798112570864675,"{'Page_No.': 'Page 45', 'Policy_Name': 'Princi...",-7.997163
6,5,Section A - Member Life Insurance Schedule of ...,0.6857670322430902,"{'Page_No.': 'Page 8', 'Policy_Name': 'Princip...",-11.009435
7,54,% of Scheduled Covered Loss Benefit Loss of Sp...,0.687318287187824,"{'Page_No.': 'Page 57', 'Policy_Name': 'Princi...",-7.62516
8,49,(1) only one Accelerated Benefit payment will ...,0.6897947000080202,"{'Page_No.': 'Page 52', 'Policy_Name': 'Princi...",-3.677346
9,50,Section B - Member Accidental Death and Dismem...,0.697204637172362,"{'Page_No.': 'Page 53', 'Policy_Name': 'Princi...",-3.888457


In [68]:
# Return the top 3 results from semantic search
top_3_semantic2_query2 = results_df2.sort_values(by='Distances')
top_3_semantic2_query2[:3]

Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
0,22,Section D - Policy Renewal Article 1 - Renewal...,0.633002353766436,"{'Page_No.': 'Page 25', 'Policy_Name': 'Princi...",-11.292296
1,43,PART IV - BENEFITS Section A - Member Life Ins...,0.6363616274217812,"{'Page_No.': 'Page 46', 'Policy_Name': 'Princi...",-0.998026
2,24,I f a Member's Dependent is employed and is co...,0.6497752943669397,"{'Page_No.': 'Page 27', 'Policy_Name': 'Princi...",-11.227061


In [69]:
# Return the top 3 results after reranking
top_3_rerank_query2 = results_df2.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank_query2[:3]

Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
1,43,PART IV - BENEFITS Section A - Member Life Ins...,0.6363616274217812,"{'Page_No.': 'Page 46', 'Policy_Name': 'Princi...",-0.998026
8,49,(1) only one Accelerated Benefit payment will ...,0.6897947000080202,"{'Page_No.': 'Page 52', 'Policy_Name': 'Princi...",-3.677346
9,50,Section B - Member Accidental Death and Dismem...,0.697204637172362,"{'Page_No.': 'Page 53', 'Policy_Name': 'Princi...",-3.888457


In [70]:
top_3_RAG_query2 = top_3_rerank_query2[["Documents", "Metadatas"]][:3]

top_3_RAG_query2

Unnamed: 0,Documents,Metadatas
1,PART IV - BENEFITS Section A - Member Life Ins...,"{'Page_No.': 'Page 46', 'Policy_Name': 'Princi..."
8,(1) only one Accelerated Benefit payment will ...,"{'Page_No.': 'Page 52', 'Policy_Name': 'Princi..."
9,Section B - Member Accidental Death and Dismem...,"{'Page_No.': 'Page 53', 'Policy_Name': 'Princi..."


### Re-rank with Cross Encoder for Query-3

In [71]:
# Build (query, response) pairs for each of the top 10 responses returned by the semantic search
# and generate the cross-encoder scores for these pairs
cross_inputs3 = [[query3, response] for response in results_df3['Documents']]
cross_rerank_scores3 = cross_encoder.predict(cross_inputs3)
cross_rerank_scores3

array([-1.6799723 , -3.4060574 , -2.7560284 , -1.2106376 ,  0.40321475,
       -9.708836  , -3.5169992 , -0.20807043,  0.4042089 ,  0.03237244],
      dtype=float32)

In [72]:
# Store the rerank scores in results_df3
results_df3['Reranked_scores'] = cross_rerank_scores3

print("Third Query :: ", query3 , "\n")

results_df3

Third Query ::  What are the policy exclusions under Accidental Death insurance? 



Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
0,5,Section A - Member Life Insurance Schedule of ...,0.6021791318892578,"{'Page_No.': 'Page 8', 'Policy_Name': 'Princip...",-1.679972
1,51,"f . claim requirements listed in PART IV, Sect...",0.6048759108740003,"{'Page_No.': 'Page 54', 'Policy_Name': 'Princi...",-3.406057
2,55,"a. willful self-injury or self-destruction, wh...",0.6208896966127476,"{'Page_No.': 'Page 58', 'Policy_Name': 'Princi...",-2.756028
3,4,Section A – Eligibility Member Life Insurance ...,0.623814030439745,"{'Page_No.': 'Page 7', 'Policy_Name': 'Princip...",-1.210638
4,32,Section C - Individual Terminations Article 1 ...,0.638855618724952,"{'Page_No.': 'Page 35', 'Policy_Name': 'Princi...",0.403215
5,24,I f a Member's Dependent is employed and is co...,0.6549313261740821,"{'Page_No.': 'Page 27', 'Policy_Name': 'Princi...",-9.708836
6,54,% of Scheduled Covered Loss Benefit Loss of Sp...,0.6682294986486059,"{'Page_No.': 'Page 57', 'Policy_Name': 'Princi...",-3.516999
7,23,PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ...,0.6715736415740987,"{'Page_No.': 'Page 26', 'Policy_Name': 'Princi...",-0.20807
8,50,Section B - Member Accidental Death and Dismem...,0.6755426669653122,"{'Page_No.': 'Page 53', 'Policy_Name': 'Princi...",0.404209
9,52,Exposure Exposure to the elements will be pres...,0.6819361401091601,"{'Page_No.': 'Page 55', 'Policy_Name': 'Princi...",0.032372


In [74]:
# Return the top 3 results from semantic search
top_3_semantic_query3 = results_df3.sort_values(by='Distances')
top_3_semantic_query3[:3]

Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
0,5,Section A - Member Life Insurance Schedule of ...,0.6021791318892578,"{'Page_No.': 'Page 8', 'Policy_Name': 'Princip...",-1.679972
1,51,"f . claim requirements listed in PART IV, Sect...",0.6048759108740003,"{'Page_No.': 'Page 54', 'Policy_Name': 'Princi...",-3.406057
2,55,"a. willful self-injury or self-destruction, wh...",0.6208896966127476,"{'Page_No.': 'Page 58', 'Policy_Name': 'Princi...",-2.756028


In [75]:
# Return the top 3 results after reranking
top_3_rerank_query3 = results_df3.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank_query3[:3]

Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
8,50,Section B - Member Accidental Death and Dismem...,0.6755426669653122,"{'Page_No.': 'Page 53', 'Policy_Name': 'Princi...",0.404209
4,32,Section C - Individual Terminations Article 1 ...,0.638855618724952,"{'Page_No.': 'Page 35', 'Policy_Name': 'Princi...",0.403215
9,52,Exposure Exposure to the elements will be pres...,0.6819361401091601,"{'Page_No.': 'Page 55', 'Policy_Name': 'Princi...",0.032372


In [76]:
top_3_RAG_query3 = top_3_rerank_query3[["Documents", "Metadatas"]][:3]

top_3_RAG_query3

Unnamed: 0,Documents,Metadatas
8,Section B - Member Accidental Death and Dismem...,"{'Page_No.': 'Page 53', 'Policy_Name': 'Princi..."
4,Section C - Individual Terminations Article 1 ...,"{'Page_No.': 'Page 35', 'Policy_Name': 'Princi..."
9,Exposure Exposure to the elements will be pres...,"{'Page_No.': 'Page 55', 'Policy_Name': 'Princi..."



### Generation Layer

In this stage, the top search results and the user's query are sent to a Large Language Model (LLM), such as GPT-4o or, as in this notebook, Gemini 2.5 Flash accessed through an OpenAI-compatible API. The model uses a carefully designed prompt to generate a natural and accurate response.  

Instead of returning full pages or document chunks, the model delivers concise answers, often with citations. This layer benefits from **prompt engineering**, **custom instructions**, and techniques like **Retrieval-Augmented Generation (RAG)**.  

The goal is to provide **smart, direct, and context-aware responses**, improving clarity, speed, and overall user experience.


In [79]:
from openai import OpenAI
import pandas as pd



def generate_response(query, top_3_RAG: pd.DataFrame):
    """
    Generate a response using Gemini (OpenAI-compatible) Chat model
    based on the user query and top 3 retrieved insurance document chunks.
    """
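    # Convert the DataFrame into a readable text block
    document_text = ""
    # Number the excerpts by rank (1-3) rather than by the DataFrame's original
    # index labels, which are no longer sequential after re-ranking
    for rank, (_, row) in enumerate(top_3_RAG.iterrows(), start=1):
        document_text += (
            f"\n---\nDocument #{rank}\n"
            f"Policy Name & Page: {row['Metadatas']}\n"
            f"Content:\n{row['Documents']}\n"
        )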

    # Create the user prompt
    prompt = f"""
You are an expert assistant in the insurance domain. You accurately answer user questions using provided excerpts 
from insurance policy documents. Always be helpful, clear, and provide relevant citations.

Customer question: "{query}"

Below are 3 potentially relevant insurance document excerpts. Each document contains the policy text and its metadata (policy name and page number). Use only information relevant to the query.

{document_text}

Instructions:
1. Review all 3 documents to find information that helps answer the query.
2. If useful information is inside a table (formatted as a list of lists), convert it into a readable table.
3. Answer the question clearly and concisely using relevant content only.
4. At the end of your response, list all cited policies and their page numbers in a "Citations" section.
5. If no documents are useful, say: "None of the provided documents contain relevant details to answer your query."
6. Do not include any technical, implementation, or system details; only answer the query.

Respond only with the final answer and citations.
"""

    # Call Gemini chat model via OpenAI-compatible API
    response = openai_client.chat.completions.create(
        model="gemini-2.5-flash",
        messages=[
            {"role": "system", "content": "You are an expert insurance assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.0
    )

    # Return the model's content as a list of lines
    return response.choices[0].message.content.split('\n')
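
The project objectives mention optional few-shot examples. With a chat-style API, one way to add them is to place example question and answer turns before the real prompt. The sketch below assumes the same `messages` format used in `generate_response`. `build_messages` is a hypothetical helper, and the example turn is a placeholder; real examples would need to come from verified answers for this policy.

```python
def build_messages(prompt, few_shot=None):
    """Assemble chat messages: system role, optional example turns, then the real prompt."""
    messages = [{"role": "system", "content": "You are an expert insurance assistant."}]
    for example_question, example_answer in (few_shot or []):
        messages.append({"role": "user", "content": example_question})
        messages.append({"role": "assistant", "content": example_answer})
    messages.append({"role": "user", "content": prompt})
    return messages


# Placeholder example turn, for illustration only.
examples = [("<example question>", "<example answer with a Citations section>")]
messages = build_messages("<prompt built from the query and top 3 chunks>", examples)
print([m["role"] for m in messages])
```

The returned list could be passed as the `messages` argument in place of the two-element list built inside `generate_response`.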


In [80]:
# Generate the response - For Query 1
response = generate_response(query, top_3_RAG_query1)
print("-"*80,"\n","-"*78)
print("Query 1: ",query)
print("-"*80,"\n","-"*78,"\n")

# Print the response
print("\n".join(response))

-------------------------------------------------------------------------------- 
 ------------------------------------------------------------------------------
Query 1:  Can insurance coverage continue during an approved leave?
-------------------------------------------------------------------------------- 
 ------------------------------------------------------------------------------ 

Yes, insurance coverage for a member may be continued during an approved leave of absence.

Specifically:
*   If active work ends due to an approved leave of absence, insurance may be continued until the earliest of:
    *   The date insurance would otherwise cease.
    *   The date the approved leave of absence ends.
    *   The date the member becomes eligible for any other group life coverage.
    *   One month after the date active work ends.
*   If a member ceases active work due to an approved leave of absence under the Family and Medical Leave Act (FMLA), the Policyholder has the option to co

In [81]:
# Generate the response - For Query 2
response2 = generate_response(query2, top_3_RAG_query2)
print("-"*80,"\n","-"*78)
print("Query 2: ",query2)
print("-"*80,"\n","-"*78,"\n")

# Print the response
print("\n".join(response2))

-------------------------------------------------------------------------------- 
 ------------------------------------------------------------------------------
Query 2:  Are there age-based reductions in benefit amounts?
-------------------------------------------------------------------------------- 
 ------------------------------------------------------------------------------ 

Yes, there are age-based reductions in benefit amounts for both Member Life Insurance and Member Accidental Death and Dismemberment Insurance. The amount of a Member's insurance will be a percentage of the Scheduled Benefit (or approved amount, if applicable) based on their age, as follows:

| Age                       | % of Scheduled Benefit (or approved amount, whichever applies) |
| :------------------------ | :----------------------------------------------------------- |
| Age 70 but less than age 75 | 65%                                                          |
| Age 75 and over           | 45%    

In [82]:
# Generate the response - For Query 3
response3 = generate_response(query3, top_3_RAG_query3)
print("-"*80,"\n","-"*78)
print("Query 3: ",query3)
print("-"*80,"\n","-"*78,"\n")

# Print the response
print("\n".join(response3))

-------------------------------------------------------------------------------- 
 ------------------------------------------------------------------------------
Query 3:  What are the policy exclusions under Accidental Death insurance?
-------------------------------------------------------------------------------- 
 ------------------------------------------------------------------------------ 

The provided documents indicate that benefit payments for Accidental Death and Dismemberment Insurance are subject to limitations listed in Section B, Article 9. However, the specific details of these limitations (exclusions) are not provided in the excerpts.

Citations:
* Principal-Sample-Life-Insurance-Policy, Page 53


# ✅ Conclusion

**Mr.HelpMate AI** 🤖 changes how policyholders interact with complex insurance documents 📄. By combining semantic search 🔍, retrieval-augmented generation (RAG) 📚➕🧠, and large language models 🗣️, the system gives users fast, accurate, and easy-to-understand answers to their specific questions ❓, without the frustration of navigating dense paperwork or enduring long customer service waits ⏳📞.

This assistant can improve customer satisfaction 😊 and reduce operational overhead for insurance providers 💼📉. Its modular architecture, spanning document processing, retrieval, and generative response layers, supports scalability, adaptability, and precision 🎯.

Moreover, the learnings and architecture of **Mr.HelpMate AI** pave the way for applications in other domains such as legal tech ⚖️, finance 💰, and enterprise knowledge management 🏢. With a focus on factual accuracy, personalized interactions, and seamless user experience, this solution demonstrates the practical value of AI in solving real-world, document-heavy challenges 🚀.

**Mr.HelpMate AI** isn't just a tool; it's a step toward smarter, more human-centric digital experiences 🌐✨.
