## Business Objective

Build an **AI-powered document search** using vector embeddings, semantic search, and LLMs. The system processes **PDF policy documents** and provides accurate, context-aware answers to user queries.

---

### 1. Embedding Layer
- **Goal:** Preprocess, clean, and split PDFs into meaningful chunks for embedding.  
- Test different **chunking strategies** to optimize retrieval quality.  
- Use **Gemini embeddings** or **SentenceTransformers** models (Hugging Face).

---

### 2. Search Layer
- **Goal:** Test the search pipeline with **at least three sample queries**.  
- Embed queries and perform similarity search in a **ChromaDB vector store**.  
- Implement a **cache** for faster repeated queries.  
- Include a **re-ranking step** with a **cross-encoder model** to improve relevance.

---

### 3. Generation Layer
- **Goal:** Create a clear, structured **prompt** for the LLM to generate accurate answers.  
- Include all relevant context.  
- Optionally, use **few-shot examples** to improve response quality and consistency.


## Flowchart Overview: Document-Based LLM Search System

The system follows a **three-layer architecture** for document-based search using a Large Language Model (LLM). Each stage is modular, allowing experimentation to improve relevance and response quality.

---

### 1. Embedding Layer
- Convert **documents and user queries** into **vector embeddings** using embedding models such as **Gemini**, **OpenAI**, or **SentenceTransformers**.  
- Embeddings capture semantic meaning for accurate similarity matching.  
- Experiment with **different models**, **chunking strategies**, and **metadata enrichment** to optimize retrieval.

---

### 2. Search Layer
- Perform **similarity search** in a vector database (e.g., **FAISS**, **Chroma**) to retrieve the most relevant document chunks.  
- Experiment with **distance metrics** (cosine, dot product), **hybrid retrieval methods**, and **metadata filters** to improve results.
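
The choice of distance metric can change the ranking. The sketch below is a minimal illustration with invented two-dimensional vectors (real embeddings have hundreds of dimensions); the helper names `dot` and `cosine_distance` are assumptions for this example, not part of any library used in the notebook.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, larger means less similar."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    return 1.0 - dot(a, b) / (norm_a * norm_b)

# Invented vectors: short_doc points the same way as the query,
# long_doc points elsewhere but has a larger magnitude.
query = [1.0, 0.0]
short_doc = [2.0, 0.0]
long_doc = [3.0, 3.0]

print("cosine distance:", cosine_distance(query, short_doc), cosine_distance(query, long_doc))
print("dot product:    ", dot(query, short_doc), dot(query, long_doc))
```

Cosine distance prefers `short_doc` because it ignores vector length, while the raw dot product prefers `long_doc`. With normalized embeddings the two metrics give the same ordering.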

---

### 3. Generation Layer
- Combine retrieved document chunks with the **user query** into a structured **prompt** for the LLM.  
- Generate **coherent, context-aware answers**.  
- Refine using **prompt engineering**, **custom instructions**, or **retrieval-augmented generation (RAG)**.
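
As a rough sketch of the prompt-assembly step, the function below joins retrieved chunks and the user query into one prompt string. The function name `build_prompt`, the `(page_label, text)` layout, and the instruction wording are all assumptions made for illustration; the chunk texts are placeholders, not real policy content.

```python
def build_prompt(query, chunks):
    """Assemble a grounded prompt from retrieved chunks.

    `chunks` is a list of (page_label, text) pairs (hypothetical layout).
    """
    context = "\n\n".join(f"[{page}] {text}" for page, text in chunks)
    return (
        "You are an assistant answering questions about an insurance policy.\n"
        "Use only the context below. If the answer is not in the context, say so.\n"
        "Cite the page labels you relied on.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

# Placeholder chunks standing in for search results
example_chunks = [
    ("Page 7", "placeholder text of the first retrieved page"),
    ("Page 26", "placeholder text of the second retrieved page"),
]
prompt = build_prompt("Who is eligible as a Member?", example_chunks)
print(prompt)
```

Keeping the page label next to each chunk lets the model cite its sources, which makes answers easier to verify against the PDF.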



In [1]:
#!pip install pdfplumber tiktoken openai chromadb sentence_transformers google-generativeai pandas  # or install them in a virtual environment

In [2]:
# Import all the required Libraries
import pdfplumber
from pathlib import Path
import pandas as pd
from operator import itemgetter
import json
import tiktoken
import chromadb
import openai
from openai import OpenAI
import google.generativeai as genai


In [3]:
import os
os.environ["ANONYMIZED_TELEMETRY"] = "False"
import warnings
warnings.filterwarnings("ignore", message="Failed to send telemetry event")

In [4]:
# --------------------------
# Gemini client setup
# --------------------------
# Gemini exposes an OpenAI-compatible endpoint, so the OpenAI client is pointed at it.
# Path.read_text() closes the key file after reading.
gemini_api_key = Path("GEMINI_API_KEY.txt").read_text().strip()
openai_client = OpenAI(
    api_key=gemini_api_key,
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

In [5]:
genai.configure(api_key=gemini_api_key)

In [6]:
import os
os.chdir(".")
!ls

GEMINI_API_KEY.txt
Mr_HelpMate_AI.ipynb
Principal-Sample-Life-Insurance-Policy.pdf


## Read, Process, and Chunk the PDF Files

We'll use pdfplumber for PDF extraction and processing, which offers several advantages over simpler PDF libraries. pdfplumber provides robust capabilities for extracting structured content from PDFs, including:

- Text extraction with positional data
- Table detection and extraction
- Form field identification
- Image extraction capabilities
- Visual debugging tools for development

This library allows us to handle complex document structures by preserving the spatial relationships between text elements, which is crucial for maintaining document context during chunking. It also provides methods to extract text while preserving formatting elements like paragraphs, headers, and lists.
For optimal retrieval performance, we'll implement a strategic chunking approach that balances chunk size with semantic coherence, ensuring that related content stays together while creating chunks that are appropriately sized for our vector database.

In [7]:
pdf_path = "."

In [8]:
# Function to check whether a word is present in a table or not for segregation of regular text and tables

def check_bboxes(word, table_bbox):
    # Check whether word is inside a table bbox.
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]
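
To make the containment test concrete, here is a small self-contained check with invented coordinates. It restates the same comparison with named variables; the bounding-box values are made up for illustration.

```python
def check_bboxes(word, table_bbox):
    """Return True when the word's box lies strictly inside the table's box."""
    x0, top, x1, bottom = word["x0"], word["top"], word["x1"], word["bottom"]
    tx0, ttop, tx1, tbottom = table_bbox
    return x0 > tx0 and top > ttop and x1 < tx1 and bottom < tbottom

table_bbox = (50, 100, 300, 200)  # (x0, top, x1, bottom), invented values
inside = {"x0": 60, "top": 110, "x1": 90, "bottom": 120}
outside = {"x0": 60, "top": 210, "x1": 90, "bottom": 220}
on_edge = {"x0": 50, "top": 110, "x1": 90, "bottom": 120}

print(check_bboxes(inside, table_bbox))   # fully inside the table
print(check_bboxes(outside, table_bbox))  # below the table
print(check_bboxes(on_edge, table_bbox))  # touches the left border
```

Because the comparisons are strict, a word whose box touches the table border is treated as non-table text.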

In [9]:
# Function to extract text from a PDF file.
# 1. Declare a variable p to store the iteration of the loop that will help us store page numbers alongside the text
# 2. Declare an empty list 'full_text' to store all the text files
# 3. Use pdfplumber to open the pdf pages one by one
# 4. Find the tables and their locations in the page
# 5. Extract the text from the tables in the variable 'tables'
# 6. Extract the regular words by calling the function check_bboxes() and checking whether words are present in the table or not
# 7. Use the cluster_objects utility to cluster non-table and table words together so that they retain the same chronology as in the original PDF
# 8. Declare an empty list 'lines' to store the page text
# 9. If a text element in present in the cluster, append it to 'lines', else if a table element is present, append the table
# 10. Append the page number and all lines to full_text, and increment 'p'
# 11. When the function has iterated over all pages, return the 'full_text' list

def extract_text_from_pdf(pdf_path):
    p = 0
    full_text = []


    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_no = f"Page {p+1}"
            text = page.extract_text()

            tables = page.find_tables()
            table_bboxes = [i.bbox for i in tables]
            tables = [{'table': i.extract(), 'top': i.bbox[1]} for i in tables]
            non_table_words = [word for word in page.extract_words() if not any([check_bboxes(word, table_bbox) for table_bbox in table_bboxes])]
            lines = []

            for cluster in pdfplumber.utils.cluster_objects(non_table_words + tables, itemgetter('top'), tolerance=5):

                if 'text' in cluster[0]:
                    try:
                        lines.append(' '.join([i['text'] for i in cluster]))
                    except KeyError:
                        pass

                elif 'table' in cluster[0]:
                    lines.append(json.dumps(cluster[0]['table']))


            full_text.append([page_no, " ".join(lines)])
            p +=1

    return full_text
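
The grouping of words into lines by their `top` coordinate can be pictured with the simplified stand-in below. It is not pdfplumber's implementation, only a sketch of the idea that objects whose vertical positions differ by no more than a tolerance end up in the same cluster; `cluster_by_key` and the sample words are hypothetical.

```python
from operator import itemgetter

def cluster_by_key(objects, key_fn, tolerance):
    """Group objects whose sorted key values are within `tolerance` of the previous one."""
    clusters = []
    last_value = None
    for obj in sorted(objects, key=key_fn):
        value = key_fn(obj)
        if last_value is None or value - last_value > tolerance:
            clusters.append([])  # gap too large: start a new line
        clusters[-1].append(obj)
        last_value = value
    return clusters

# Invented words with slightly different vertical positions
words = [
    {"text": "Section", "top": 100.0},
    {"text": "A", "top": 101.5},
    {"text": "Eligibility", "top": 130.0},
]
lines = [" ".join(w["text"] for w in cluster)
         for cluster in cluster_by_key(words, itemgetter("top"), tolerance=5)]
print(lines)
```

A tolerance of 5 absorbs small baseline differences within one line while keeping separate lines apart.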

In [10]:
# Define the directory containing the PDF files
pdf_directory = Path(pdf_path)

# Initialize an empty list to store the extracted texts and document names
data = []

# Loop through all files in the directory
for pdf_path in pdf_directory.glob("*.pdf"):

    # Process the PDF file
    print(f"...Processing {pdf_path.name}")

    # Call the function to extract the text from the PDF
    extracted_text = extract_text_from_pdf(pdf_path)

    # Convert the extracted list to a DataFrame, and add a column to store document names
    extracted_text_df = pd.DataFrame(extracted_text, columns=['Page No.', 'Page_Text'])
    extracted_text_df['Document Name'] = pdf_path.name

    # Append the extracted text and document name to the list
    data.append(extracted_text_df)

    # Print a message to indicate progress
    print(f"Finished processing {pdf_path.name}")

# Print a message to indicate all PDFs have been processed
print("All PDFs have been processed.")

...Processing Principal-Sample-Life-Insurance-Policy.pdf
Finished processing Principal-Sample-Life-Insurance-Policy.pdf
All PDFs have been processed.


In [11]:
# Concatenate all the DFs in the list 'data' together
insurance_pdfs_data = pd.concat(data, ignore_index=True)

In [12]:
print("Shape of the data is :: ", insurance_pdfs_data.shape, "\n")

insurance_pdfs_data.sample(2)

Shape of the data is ::  (64, 3) 



| | Page No. | Page_Text | Document Name |
|---|---|---|---|
| 44 | Page 45 | (1) If termination is as described in b. (1) a... | Principal-Sample-Life-Insurance-Policy.pdf |
| 42 | Page 43 | Any individual policy issued will then be in f... | Principal-Sample-Life-Insurance-Policy.pdf |


In [13]:
# Let's also check the length of all the texts as there might be some empty pages or pages with very few words that we can drop
insurance_pdfs_data['Text_Length'] = insurance_pdfs_data['Page_Text'].apply(lambda x: len(x.split(' ')))

insurance_pdfs_data.sample(2)

| | Page No. | Page_Text | Document Name | Text_Length |
|---|---|---|---|---|
| 53 | Page 54 | f . claim requirements listed in PART IV, Sect... | Principal-Sample-Life-Insurance-Policy.pdf | 368 |
| 42 | Page 43 | Any individual policy issued will then be in f... | Principal-Sample-Life-Insurance-Policy.pdf | 392 |


In [14]:
print("-"*50)
print("Maximum Text Length is :: ", max(insurance_pdfs_data['Text_Length']))
print("-"*50)
print("Minimum Text Length is :: ", min(insurance_pdfs_data['Text_Length']))
print("-"*50)

--------------------------------------------------
Maximum Text Length is ::  462
--------------------------------------------------
Minimum Text Length is ::  5
--------------------------------------------------


Pages with fewer than 10 words are essentially blank or contain only a header or footer, so the next cell filters them out.

In [15]:
# Retain only the rows with a text length of at least 10
insurance_pdfs_data = insurance_pdfs_data.loc[insurance_pdfs_data['Text_Length'] >= 10]

print("Shape of the data is :: ", insurance_pdfs_data.shape, "\n")

insurance_pdfs_data.head()

Shape of the data is ::  (60, 4) 



| | Page No. | Page_Text | Document Name | Text_Length |
|---|---|---|---|---|
| 0 | Page 1 | DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/... | Principal-Sample-Life-Insurance-Policy.pdf | 30 |
| 2 | Page 3 | POLICY RIDER GROUP INSURANCE POLICY NO: S655 C... | Principal-Sample-Life-Insurance-Policy.pdf | 230 |
| 4 | Page 5 | PRINCIPAL LIFE INSURANCE COMPANY (called The P... | Principal-Sample-Life-Insurance-Policy.pdf | 110 |
| 5 | Page 6 | TABLE OF CONTENTS PART I - DEFINITIONS PART II... | Principal-Sample-Life-Insurance-Policy.pdf | 153 |
| 6 | Page 7 | Section A – Eligibility Member Life Insurance ... | Principal-Sample-Life-Insurance-Policy.pdf | 176 |


In [16]:
# Store the metadata for each page in a separate column
insurance_pdfs_data['Metadata'] = insurance_pdfs_data.apply(lambda x: {'Policy_Name': x['Document Name'][:-4], 'Page_No.': x['Page No.']}, axis=1)

print("Shape of the data is :: ", insurance_pdfs_data.shape, "\n")

insurance_pdfs_data.head()

Shape of the data is ::  (60, 5) 



| | Page No. | Page_Text | Document Name | Text_Length | Metadata |
|---|---|---|---|---|---|
| 0 | Page 1 | DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/... | Principal-Sample-Life-Insurance-Policy.pdf | 30 | {'Policy_Name': 'Principal-Sample-Life-Insuran... |
| 2 | Page 3 | POLICY RIDER GROUP INSURANCE POLICY NO: S655 C... | Principal-Sample-Life-Insurance-Policy.pdf | 230 | {'Policy_Name': 'Principal-Sample-Life-Insuran... |
| 4 | Page 5 | PRINCIPAL LIFE INSURANCE COMPANY (called The P... | Principal-Sample-Life-Insurance-Policy.pdf | 110 | {'Policy_Name': 'Principal-Sample-Life-Insuran... |
| 5 | Page 6 | TABLE OF CONTENTS PART I - DEFINITIONS PART II... | Principal-Sample-Life-Insurance-Policy.pdf | 153 | {'Policy_Name': 'Principal-Sample-Life-Insuran... |
| 6 | Page 7 | Section A – Eligibility Member Life Insurance ... | Principal-Sample-Life-Insurance-Policy.pdf | 176 | {'Policy_Name': 'Principal-Sample-Life-Insuran... |


### Chunking Summary

Pages in this policy contain at most 462 whitespace-separated words (the maximum text length computed above), so additional chunking is not required and embeddings can be computed directly at the **page level**.

This approach is effective for two main reasons:

1. **Structured Content:** Insurance documents are well-organized, and the content within a page is generally coherent and contextually consistent.  
2. **Preserved Context:** Using larger chunks retains more contextual information, which improves the LLM’s understanding.



### Approaches to Document Chunking

Below are several **chunking strategies** commonly used for processing documents before creating embeddings.  
Each approach is suited to different content structures and downstream tasks such as **information retrieval** or **LLM-based generation**.

---

#### 1. Fixed-Length Chunking
- **Description:** Divide the text into chunks of a fixed size (e.g., 500 tokens each).  
- **Advantages:** Simple to implement and ensures consistent input size for models.  
- **Limitations:** May cut sentences mid-way or lose contextual coherence across chunks.
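
A minimal sketch of fixed-length chunking follows. It counts whitespace-separated words as a stand-in for tokens, which is an assumption; a tokenizer such as `tiktoken` would give exact token counts. The function name is hypothetical.

```python
def fixed_length_chunks(text, chunk_size=500):
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

# Synthetic 12-word text to show the chunk boundaries
sample = " ".join(f"w{i}" for i in range(12))
chunks = fixed_length_chunks(sample, chunk_size=5)
print([len(c.split()) for c in chunks])
```

The last chunk simply holds whatever words remain, so it may be much shorter than the others.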

---

####  2. Sliding Window Chunking
- **Description:** Create overlapping chunks where each chunk starts slightly before the previous one ends (e.g., 500 tokens with a 100-token overlap).  
- **Advantages:** Maintains context between chunks and reduces information loss at boundaries.  
- **Limitations:** Increases data redundancy and storage requirements.
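
The overlap idea can be sketched as below, again using words in place of tokens (an assumption). The function name and parameters are illustrative only.

```python
def sliding_window_chunks(text, chunk_size=500, overlap=100):
    """Split text into word windows where each window repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Synthetic 10-word text with small sizes so the overlap is visible
sample = " ".join(f"w{i}" for i in range(10))
chunks = sliding_window_chunks(sample, chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - overlap` words after the previous one, so the storage cost grows as the overlap approaches the chunk size.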

---

####  3. Semantic Chunking
- **Description:** Use NLP-based techniques such as **sentence segmentation**, **topic modeling**, or **semantic similarity** to divide text meaningfully.  
- **Advantages:** Retains semantic consistency, making it ideal for LLMs.  
- **Limitations:** More complex to implement and can produce variable chunk lengths.
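
The simplest of these techniques, sentence segmentation, might look like the sketch below. It uses a naive regular expression to find sentence boundaries (an assumption; abbreviations such as "e.g." would be split incorrectly) and does not attempt topic modeling or embedding similarity. The sample text is invented.

```python
import re

def sentence_chunks(text, max_words=120):
    """Greedily pack whole sentences into chunks of at most `max_words` words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))  # budget exceeded: close the chunk
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = ("Coverage starts on the effective date. "
          "Premiums are due monthly. Claims need written proof.")
chunks = sentence_chunks(sample, max_words=10)
for chunk in chunks:
    print(chunk)
```

A sentence longer than `max_words` is kept whole rather than cut, which is why chunk lengths vary.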

---

#### 4. Section-Based (Heading-Based) Chunking
- **Description:** Split the document according to its logical structure, such as **headings**, **paragraphs**, or **sections**.  
- **Advantages:** Works well for structured documents like policies, reports, or legal texts.  
- **Limitations:** Depends heavily on consistent formatting or identifiable section markers.
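
For a policy whose page text contains markers like `PART I -` and `Section A -`, section-based splitting could be sketched as follows. The heading pattern is an assumption modeled on those markers, and the sample text is invented; a real document would need its pattern checked against the actual headings.

```python
import re

# Assumed heading pattern: "PART <roman numeral> -" or "Section <letter> -",
# followed by a hyphen or an en dash.
HEADING = re.compile(r"(?=\b(?:PART [IVX]+|Section [A-Z]) [-\u2013] )")

def section_chunks(text):
    """Split text so that each chunk starts at a recognised heading."""
    return [part.strip() for part in HEADING.split(text) if part.strip()]

sample = (
    "PART I - DEFINITIONS Terms are defined here. "
    "Section A - Eligibility rules. "
    "Section B - Premium rules."
)
chunks = section_chunks(sample)
for chunk in chunks:
    print(chunk)
```

The lookahead keeps each heading at the start of its own chunk instead of discarding it at the split point.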

---

#### 5. Page-Level Chunking
- **Description:** Treat each page (e.g., from a PDF) as a separate chunk.  
- **Advantages:** Simple and maintains alignment with the original document layout.  
- **Limitations:** Page content length can vary significantly, impacting consistency and relevance.

---

####  6. Hybrid Chunking
- **Description:** Combine multiple strategies (e.g., **semantic + sliding window**) to balance structure and context.  
- **Advantages:** Offers flexibility and can be fine-tuned for diverse document types.  
- **Limitations:** Requires careful tuning and additional processing logic.
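
One possible hybrid, sketched under the same naive sentence-splitting assumption as before, slides a window over whole sentences so that chunks respect sentence boundaries and still overlap. Names and defaults are illustrative.

```python
import re

def hybrid_chunks(text, sentences_per_chunk=3, overlap_sentences=1):
    """Sliding window over whole sentences instead of raw words."""
    if not 0 <= overlap_sentences < sentences_per_chunk:
        raise ValueError("overlap_sentences must be smaller than sentences_per_chunk")
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    step = sentences_per_chunk - overlap_sentences
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
        if start + sentences_per_chunk >= len(sentences):
            break  # the final window reached the last sentence
    return chunks

sample = "One. Two. Three. Four. Five."
chunks = hybrid_chunks(sample)
print(chunks)
```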

---

####  Key Insight
The choice of **chunking strategy** significantly influences retrieval accuracy and response quality.  
Experiment with different methods to find the balance between **semantic coherence**, **context preservation**, and **system efficiency** for your specific dataset.


## Generate and Store Embeddings using Gemini and ChromaDB

In this stage, text chunks are converted into high-dimensional vector representations using **Gemini embeddings**, which capture their semantic meaning.  
These embeddings are stored in **ChromaDB**, a vector database optimized for similarity search.  
This setup enables efficient retrieval of contextually relevant information based on meaning rather than exact keywords, allowing natural language queries to match related document content effectively.

In [17]:
# Import the Gemini Embedding Function into chroma
from chromadb.utils.embedding_functions import GoogleGenerativeAiEmbeddingFunction

## ChromaDB `PersistentClient`

`chromadb.PersistentClient` manages a **persistent vector database**, saving collections and embeddings to disk so they **persist across script or notebook restarts**.  
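Unlike the in-memory client, data is **not lost** when the program ends (Chroma versions that provide `PersistentClient` store the data in SQLite files under the given path; DuckDB + Parquet was the older storage backend).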

In [18]:
chroma_db_client = chromadb.PersistentClient(path="../chroma_db")

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given


In [19]:
# List all existing collections
all_collections = chroma_db_client.list_collections()

# Loop through and delete each one
for collection in all_collections:
    chroma_db_client.delete_collection(name=collection.name)

print("All collections have been deleted from ChromaDB.")


All collections have been deleted from ChromaDB.


In [20]:
import chromadb.utils.embedding_functions as embedding_functions
gemini_embedding_model = "models/text-embedding-004"

In [21]:
# Initialize the Gemini embedding function with the model selected above
google_ef = embedding_functions.GoogleGenerativeAiEmbeddingFunction(
    api_key=Path("GEMINI_API_KEY.txt").read_text().strip(),
    model_name=gemini_embedding_model
)


In [22]:
# Convert the page text and metadata from your dataframe to lists to be able to pass it to chroma

documents_list = insurance_pdfs_data["Page_Text"].tolist()
metadata_list = insurance_pdfs_data['Metadata'].tolist()

In [23]:
# Initialise a collection in chroma and pass the embedding_function to it so that it uses Gemini embeddings to embed the documents

insurance_collection = chroma_db_client.get_or_create_collection(name='RAG_on_Insurance', embedding_function=google_ef)


In [24]:
# Add the documents and metadata to the collection along with generic integer IDs. You can also build the IDs from the metadata by combining the policy name and page number.

insurance_collection.add(
    documents= documents_list,
    ids = [str(i) for i in range(0, len(documents_list))],
    metadatas = metadata_list
)
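
As a sketch of the alternative mentioned in the comment above, IDs can be derived from the metadata instead of a running integer. The helper `make_id` and the `::` separator are assumptions; the metadata keys match those built earlier in the notebook.

```python
def make_id(metadata):
    """Build a readable ID from the policy name and page number."""
    return f"{metadata['Policy_Name']}::{metadata['Page_No.']}"

# Two metadata entries in the same shape as the 'Metadata' column
metadata_list = [
    {"Policy_Name": "Principal-Sample-Life-Insurance-Policy", "Page_No.": "Page 1"},
    {"Policy_Name": "Principal-Sample-Life-Insurance-Policy", "Page_No.": "Page 3"},
]
ids = [make_id(m) for m in metadata_list]

# IDs have to be unique within a collection, so check before adding
assert len(ids) == len(set(ids)), "duplicate IDs found"
print(ids)
```

Readable IDs make it easier to trace a search hit back to its page, and re-running the ingestion produces the same ID for the same page.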


In [25]:
# Let's take a look at the first few entries in the collection

insurance_collection.get(
    ids = ['0','1','2'],
    include = ['embeddings', 'documents', 'metadatas']
)


{'ids': ['0', '1', '2'],
 'embeddings': [[0.036509085,
   0.03283743,
   -0.006328749,
   ...

In [26]:
cache_collection = chroma_db_client.get_or_create_collection(name='Insurance_Cache', embedding_function=google_ef)


In [27]:
cache_collection.peek()

{'ids': [],
 'embeddings': [],
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None,
 'included': ['embeddings', 'metadatas', 'documents']}

#  Stage 2: Semantic Search with Cache Layer

This section explains how semantic search uses a cache layer to improve efficiency. When a query arrives, the system first looks in the cache for the most similar previously asked query. If that query lies within a distance threshold, its stored results are returned immediately; otherwise, the query is run against the main vector database.

New queries and their results are then stored in the cache, enabling faster retrieval for repeated or similar queries. This approach reduces latency, avoids redundant computation, and ensures quick access to relevant documents through efficient indexing and vector search.

### **QnA - Query 1**

In [28]:
def semantic_cache_search(query, cache_results, collection, cache_collection, threshold=0.2, top_n=10):
    """
    Performs a semantic search with caching mechanism.

    Parameters:
        query (str): The input query string.
        cache_results (dict): The results from the cache collection.
        collection (object): The main ChromaDB collection to query if cache misses.
        cache_collection (object): The cache ChromaDB collection to store/retrieve results.
        threshold (float): Maximum distance to the nearest cached query for a cache hit (lower means more similar).
        top_n (int): Number of top results to retrieve.

    Returns:
        pd.DataFrame: DataFrame containing search results.
    """
    ids, documents, distances, metadatas = [], [], [], []
    results_df = pd.DataFrame()

    # Check cache miss or low relevance
    if not cache_results['distances'][0] or cache_results['distances'][0][0] > threshold:
        results = collection.query(query_texts=query, n_results=top_n)

        keys, values = [], []

        for key, val in results.items():
            if val is None or key == 'embeddings':
                continue

            if isinstance(val, list) and len(val) > 0 and isinstance(val[0], (list, tuple)):
                for i in range(min(top_n, len(val[0]))):
                    keys.append(f"{key}{i}")
                    values.append(str(val[0][i]))
            else:
                print(f"Skipping key '{key}' due to unexpected structure.")

        cache_collection.add(
            documents=[query],
            ids=[query],
            metadatas=dict(zip(keys, values))
        )

        print("Not found in cache. Found in main collection.")

        result_dict = {
            'Metadatas': results['metadatas'][0],
            'Documents': results['documents'][0],
            'Distances': results['distances'][0],
            'IDs': results['ids'][0]
        }
        results_df = pd.DataFrame.from_dict(result_dict)

    # Cache hit
    elif cache_results['distances'][0][0] <= threshold:
        cache_result_dict = cache_results['metadatas'][0][0]

        for key, value in cache_result_dict.items():
            if 'ids' in key:
                ids.append(value)
            elif 'documents' in key:
                documents.append(value)
            elif 'distances' in key:
                distances.append(value)
            elif 'metadatas' in key:
                metadatas.append(value)

        print("Found in cache!")

        results_df = pd.DataFrame({
            'IDs': ids,
            'Documents': documents,
            'Distances': distances,
            'Metadatas': metadatas
        })

    return results_df
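
The function above stores search results in the cache as flat metadata keys such as `ids0` or `documents1`, with every value converted to a string. The round trip can be sketched in isolation as follows; `flatten_results` and `unflatten_results` are hypothetical helper names, and the document texts are placeholders.

```python
def flatten_results(results, top_n):
    """Turn {'ids': [[...]], ...} into {'ids0': '...', 'ids1': '...', ...}."""
    flat = {}
    for key, val in results.items():
        # Only nested lists (one inner list per query) are flattened
        if isinstance(val, list) and val and isinstance(val[0], (list, tuple)):
            for i, item in enumerate(val[0][:top_n]):
                flat[f"{key}{i}"] = str(item)
    return flat

def unflatten_results(flat):
    """Rebuild per-field lists, ordered by the numeric suffix of each key."""
    fields = {"ids": [], "documents": [], "distances": [], "metadatas": []}
    for name, values in fields.items():
        keys = sorted((k for k in flat if k.startswith(name)),
                      key=lambda k: int(k[len(name):]))
        values.extend(flat[k] for k in keys)
    return fields

results = {
    "ids": [["24", "23"]],
    "documents": [["first page text", "second page text"]],
    "distances": [[0.57085, 0.603818]],
    "metadatas": [[{"Page_No.": "Page 27"}, {"Page_No.": "Page 26"}]],
    "included": ["documents", "distances", "metadatas"],  # flat list, skipped
}
flat = flatten_results(results, top_n=10)
restored = unflatten_results(flat)
print(restored["distances"])
```

Because values are stored as strings, distances and metadata come back from the cache as text. Converting them back (for example with `float()` or `ast.literal_eval`) is needed before sorting or filtering on them.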

In [29]:
# ------------------------------------------------------------------------------
# Question 1 : What are the specific eligibility requirements for a person to be considered a 'Member'?
# ------------------------------------------------------------------------------

# Read the user query
# query = input()
query = "What are the specific eligibility requirements for a person to be considered a 'Member'?"

In [30]:
# Search the cache collection first
# Query the cache against the user query and return the single closest cached query

cache_results = cache_collection.query(
    query_texts=query,
    n_results=1
)


In [31]:
#cache_results

In [32]:
results = insurance_collection.query(
    query_texts=query,
    n_results=10
)



In [33]:
#results.items()

In [34]:
results_df = semantic_cache_search(
    query=query,
    cache_results=cache_results,
    collection=insurance_collection,
    cache_collection=cache_collection
)


Skipping key 'included' due to unexpected structure.


Not found in cache. Found in main collection.


In [35]:
results_df

| | Metadatas | Documents | Distances | IDs |
|---|---|---|---|---|
| 0 | {'Page_No.': 'Page 27', 'Policy_Name': 'Princi... | I f a Member's Dependent is employed and is co... | 0.57085 | 24 |
| 1 | {'Page_No.': 'Page 26', 'Policy_Name': 'Princi... | PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ... | 0.603818 | 23 |
| 2 | {'Page_No.': 'Page 6', 'Policy_Name': 'Princip... | TABLE OF CONTENTS PART I - DEFINITIONS PART II... | 0.658628 | 3 |
| 3 | {'Page_No.': 'Page 7', 'Policy_Name': 'Princip... | Section A – Eligibility Member Life Insurance ... | 0.661206 | 4 |
| 4 | {'Page_No.': 'Page 15', 'Policy_Name': 'Princi... | A record which is on or transmitted by paper o... | 0.666961 | 12 |
| 5 | {'Page_No.': 'Page 16', 'Policy_Name': 'Princi... | PART II - POLICY ADMINISTRATION Section A - Co... | 0.686768 | 13 |
| 6 | {'Page_No.': 'Page 34', 'Policy_Name': 'Princi... | provided The Principal has been notified of th... | 0.732336 | 31 |
| 7 | {'Page_No.': 'Page 37', 'Policy_Name': 'Princi... | b. a business assignment; or c. full-time stud... | 0.734358 | 34 |
| 8 | {'Page_No.': 'Page 17', 'Policy_Name': 'Princi... | a. be actively engaged in business for profit ... | 0.73645 | 14 |
| 9 | {'Page_No.': 'Page 35', 'Policy_Name': 'Princi... | Section C - Individual Terminations Article 1 ... | 0.747997 | 32 |


### **QnA - Query 2**

In [36]:
# ------------------------------------------------------------------------------
# Question 2 : Under what circumstances can The Principal or the Policyholder terminate the group policy?
# ------------------------------------------------------------------------------

# Read the user query
# query2 = input()
query2 = "Under what circumstances can The Principal or the Policyholder terminate the group policy?"

In [37]:
# Search the cache collection first
# Query the cache against the user query and return the single closest cached query

cache_results2 = cache_collection.query(
    query_texts=query2,
    n_results=1
)



In [38]:
# cache_results2

In [39]:
results_df2 = semantic_cache_search(
    query=query2,
    cache_results=cache_results2,
    collection=insurance_collection,
    cache_collection=cache_collection
)


Skipping key 'included' due to unexpected structure.
Not found in cache. Found in main collection.


In [40]:
results_df2

| | Metadatas | Documents | Distances | IDs |
|---|---|---|---|---|
| 0 | {'Page_No.': 'Page 24', 'Policy_Name': 'Princi... | T he Principal may terminate the Policyholder'... | 0.324605 | 21 |
| 1 | {'Page_No.': 'Page 16', 'Policy_Name': 'Princi... | PART II - POLICY ADMINISTRATION Section A - Co... | 0.435124 | 13 |
| 2 | {'Page_No.': 'Page 23', 'Policy_Name': 'Princi... | Section C - Policy Termination Article 1 - Fai... | 0.438009 | 20 |
| 3 | {'Page_No.': 'Page 19', 'Policy_Name': 'Princi... | T he Principal has complete discretion to cons... | 0.449229 | 16 |
| 4 | {'Page_No.': 'Page 36', 'Policy_Name': 'Princi... | A Member's insurance under this Group Policy f... | 0.499732 | 33 |
| 5 | {'Page_No.': 'Page 6', 'Policy_Name': 'Princip... | TABLE OF CONTENTS PART I - DEFINITIONS PART II... | 0.511799 | 3 |
| 6 | {'Page_No.': 'Page 35', 'Policy_Name': 'Princi... | Section C - Individual Terminations Article 1 ... | 0.518675 | 32 |
| 7 | {'Page_No.': 'Page 27', 'Policy_Name': 'Princi... | I f a Member's Dependent is employed and is co... | 0.533114 | 24 |
| 8 | {'Page_No.': 'Page 25', 'Policy_Name': 'Princi... | Section D - Policy Renewal Article 1 - Renewal... | 0.550025 | 22 |
| 9 | {'Page_No.': 'Page 5', 'Policy_Name': 'Princip... | PRINCIPAL LIFE INSURANCE COMPANY (called The P... | 0.558822 | 2 |


### **QnA - Query 3**

In [41]:
# ------------------------------------------------------------------------------
# Question 3 : What are the premium rates for Member Life Insurance?
# ------------------------------------------------------------------------------

# Read the user query
# query3 = input()
query3 = "What are the premium rates for Member Life Insurance?"

In [42]:
# Search the cache collection first
# Query the cache against the user query and return the single closest cached query

cache_results3 = cache_collection.query(
    query_texts=query3,
    n_results=1
)



In [43]:
# cache_results3

In [44]:


results_df3 = semantic_cache_search(
    query=query3,
    cache_results=cache_results3,
    collection=insurance_collection,
    cache_collection=cache_collection,
    threshold=0.2,
    top_n=10
)

Skipping key 'included' due to unexpected structure.
Not found in cache. Found in main collection.


In [45]:
results_df3

| | Metadatas | Documents | Distances | IDs |
|---|---|---|---|---|
| 0 | {'Page_No.': 'Page 22', 'Policy_Name': 'Princi... | The number of Members insured for Dependent Li... | 0.521012 | 19 |
| 1 | {'Page_No.': 'Page 52', 'Policy_Name': 'Princi... | (1) only one Accelerated Benefit payment will ... | 0.550875 | 49 |
| 2 | {'Page_No.': 'Page 46', 'Policy_Name': 'Princi... | PART IV - BENEFITS Section A - Member Life Ins... | 0.57727 | 43 |
| 3 | {'Page_No.': 'Page 8', 'Policy_Name': 'Princip... | Section A - Member Life Insurance Schedule of ... | 0.580156 | 5 |
| 4 | {'Page_No.': 'Page 7', 'Policy_Name': 'Princip... | Section A – Eligibility Member Life Insurance ... | 0.589136 | 4 |
| 5 | {'Page_No.': 'Page 25', 'Policy_Name': 'Princi... | Section D - Policy Renewal Article 1 - Renewal... | 0.595903 | 22 |
| 6 | {'Page_No.': 'Page 21', 'Policy_Name': 'Princi... | b . on any date the definition of Member or De... | 0.598417 | 18 |
| 7 | {'Page_No.': 'Page 35', 'Policy_Name': 'Princi... | Section C - Individual Terminations Article 1 ... | 0.637666 | 32 |
| 8 | {'Page_No.': 'Page 20', 'Policy_Name': 'Princi... | Section B - Premiums Article 1 - Payment Respo... | 0.641679 | 17 |
| 9 | {'Page_No.': 'Page 26', 'Policy_Name': 'Princi... | PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ... | 0.646611 | 23 |


### What is a Semantic Cache?

A **semantic cache** stores the **meaning** (semantic representation) of a query or request — not just the raw data — along with the corresponding responses.  

This caching mechanism reduces the number of database queries by **recalling previously processed queries and their results**.

---

### How It Works

1. **New query processing:**
   - A **vector representation** of the query is generated.
   - The system first **searches this vector in the cache**.

2. **If the query is found** in the cache:
   - The system **skips the semantic search layer**, which is often a performance bottleneck.
   - The result is **retrieved instantly** from the cache.

3. **If the query is not found**:
   - The system queries the **main collection**.
   - It retrieves the **top *k* closest documents or chunks**.
   - These results are returned to the user and **stored in the cache** for future use.
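
The steps above can be sketched without any external service. In the toy version below, a bag-of-words counter stands in for the embedding model and a plain list stands in for the cache collection; both are simplifying assumptions, as are the class and function names and the placeholder search results.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").split())

def cosine_distance(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0

class SemanticCache:
    def __init__(self, threshold=0.2):
        self.threshold = threshold
        self.entries = []  # list of (query_vector, cached_results)

    def lookup(self, query):
        """Return the results of the nearest cached query, or None on a miss."""
        vector = embed(query)
        best = min(self.entries,
                   key=lambda e: cosine_distance(vector, e[0]), default=None)
        if best is not None and cosine_distance(vector, best[0]) <= self.threshold:
            return best[1]
        return None

    def store(self, query, results):
        self.entries.append((embed(query), results))

def search(query, cache, main_search):
    """Check the cache first; fall back to the main search and cache its results."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, "hit"
    results = main_search(query)
    cache.store(query, results)
    return results, "miss"

cache = SemanticCache(threshold=0.2)
fake_main_search = lambda q: ["Page 46", "Page 8"]  # placeholder for the vector database
for q in ["What are the premium rates for Member Life Insurance?",
          "what are the premium rates for member life insurance",
          "Can the policy be terminated?"]:
    print(search(q, cache, fake_main_search)[1], "<-", q)
```

The threshold controls the trade-off: a larger value yields more cache hits but risks returning results for a query that only looks similar.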

---

### Benefits

- **Faster response times**  
- **Reduced load on the main database**  
- **Customizable and monitorable** for optimal performance  
- **Improved user experience** through quicker result retrieval  

By remembering previous queries and their results, a semantic cache can significantly enhance the efficiency and responsiveness of your application.

---

### **QnA - Query 3.1 - Checking if found in Cache**

In [46]:
query3 = "What exclusions apply to Accidental Death coverage?"

# Query the cache collection against the user query and return the single closest cached query
cache_results3 = cache_collection.query(
    query_texts=query3,
    n_results=1
)

distances = cache_results3['distances'][0][0]
print("Threshold Distance :: ", distances)

results_df3 = semantic_cache_search(
    query=query3,
    cache_results=cache_results3,
    collection=insurance_collection,
    cache_collection=cache_collection,
    threshold=0.2,
    top_n=10
)

Threshold Distance ::  0.5222050366914388
Skipping key 'included' due to unexpected structure.
Not found in cache. Found in main collection.


### **QnA - Query 3.2 - Checking if found in Cache for Another Similar Question**

In [47]:
query3 = "What are the policy exclusions under Accidental Death insurance?"

# Query the cache collection against the user query and return the single closest cached query
cache_results3 = cache_collection.query(
    query_texts=query3,
    n_results=1
)

distances = cache_results3['distances'][0][0]
print("Threshold Distance :: ", distances)

results_df3 = semantic_cache_search(
    query=query3,
    cache_results=cache_results3,
    collection=insurance_collection,
    cache_collection=cache_collection,
    threshold=0.2,
    top_n=10
)

Threshold Distance ::  0.07421278422696857
Found in cache!
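
The two runs above differ only in how close the query is to its nearest cached entry: about 0.52 for the first phrasing (a miss) and about 0.07 for the rephrased question (a hit), against a threshold of 0.2. The decision can be sketched as below. `is_cache_hit` is a hypothetical helper, and the use of `<=` is an assumption, since the comparison inside `semantic_cache_search` is not shown in this section.

```python
def is_cache_hit(distance: float, threshold: float = 0.2) -> bool:
    """Treat the nearest cached query as a match when it lies within the threshold."""
    return distance <= threshold


# Distances printed by the two cells above
first_phrasing = 0.5222050366914388
rephrased = 0.07421278422696857

print(is_cache_hit(first_phrasing))  # False: fall back to the main collection
print(is_cache_hit(rephrased))       # True: serve the stored results
```

A lower threshold makes the cache stricter (fewer hits, less risk of returning results for a different question); a higher one does the opposite.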


In [48]:
results_df

Unnamed: 0,Metadatas,Documents,Distances,IDs
0,"{'Page_No.': 'Page 27', 'Policy_Name': 'Princi...",I f a Member's Dependent is employed and is co...,0.57085,24
1,"{'Page_No.': 'Page 26', 'Policy_Name': 'Princi...",PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ...,0.603818,23
2,"{'Page_No.': 'Page 6', 'Policy_Name': 'Princip...",TABLE OF CONTENTS PART I - DEFINITIONS PART II...,0.658628,3
3,"{'Page_No.': 'Page 7', 'Policy_Name': 'Princip...",Section A – Eligibility Member Life Insurance ...,0.661206,4
4,"{'Page_No.': 'Page 15', 'Policy_Name': 'Princi...",A record which is on or transmitted by paper o...,0.666961,12
5,"{'Page_No.': 'Page 16', 'Policy_Name': 'Princi...",PART II - POLICY ADMINISTRATION Section A - Co...,0.686768,13
6,"{'Page_No.': 'Page 34', 'Policy_Name': 'Princi...",provided The Principal has been notified of th...,0.732336,31
7,"{'Page_No.': 'Page 37', 'Policy_Name': 'Princi...",b. a business assignment; or c. full-time stud...,0.734358,34
8,"{'Page_No.': 'Page 17', 'Policy_Name': 'Princi...",a. be actively engaged in business for profit ...,0.73645,14
9,"{'Page_No.': 'Page 35', 'Policy_Name': 'Princi...",Section C - Individual Terminations Article 1 ...,0.747997,32


In [49]:
# Import the CrossEncoder class from sentence_transformers
from sentence_transformers import CrossEncoder

# Initialise the cross encoder model
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')

In [50]:
# Test the cross encoder model

scores = cross_encoder.predict([['Does the insurance cover diabetic patients?', 'The insurance policy covers some pre-existing conditions including diabetes, heart diseases, etc. The policy does not howev'],
                                ['Does the insurance cover diabetic patients?', 'The premium rates for various age groups are given as follows. Age group (<18 years): Premium rate']])

In [51]:
scores

array([  4.4608426, -11.197129 ], dtype=float32)
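
The scores printed above are unbounded: a higher value means the model judges the passage more relevant to the query. If a bounded value is more convenient for display or thresholding, a sigmoid can squash each score into (0, 1) without changing the ranking. The sketch below is illustrative; treating the squashed value as a calibrated probability is an assumption, not something the notebook establishes.

```python
import math


def sigmoid(score: float) -> float:
    """Map an unbounded score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


raw_scores = [4.4608426, -11.197129]  # values printed in the cell above
squashed = [sigmoid(s) for s in raw_scores]
print(squashed)
```

The relevant passage maps to roughly 0.99 and the irrelevant one to a value very close to 0, so the ordering is unchanged.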

In [52]:
# Build (query, document) pairs for each result returned by the semantic search
# and score every pair with the cross-encoder

def rerank_with_cross_encoder(query, results_df, cross_encoder):
    """
    Scores retrieved documents for relevance to the query using a cross-encoder model.
    The caller sorts by these scores to obtain the re-ranked order.

    Parameters:
        query (str): The user query.
        results_df (pd.DataFrame): DataFrame containing at least a 'Documents' column.
        cross_encoder: A trained cross-encoder model (e.g., from SentenceTransformers).

    Returns:
        numpy.ndarray: One cross-encoder relevance score per document, in row order.
    """
    if 'Documents' not in results_df.columns or results_df.empty:
        raise ValueError("results_df must contain a non-empty 'Documents' column.")

    # Create query-response pairs
    cross_inputs = [[query, doc] for doc in results_df['Documents']]

    # Predict similarity scores
    cross_rerank_scores = cross_encoder.predict(cross_inputs)

    return cross_rerank_scores
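
`rerank_with_cross_encoder` returns scores only; the sorting happens in later cells. Scoring and selecting the top results can also be done in one step, as in the sketch below. It uses plain lists instead of a DataFrame, and `OverlapScorer` is a hypothetical stand-in for the cross-encoder so the example runs without downloading a model. It assumes only that the scorer exposes a `predict(pairs)` method.

```python
from typing import List, Sequence, Tuple


class OverlapScorer:
    """Hypothetical stand-in for a cross-encoder: scores by shared lowercase words."""

    def predict(self, pairs: Sequence[Sequence[str]]) -> List[float]:
        scores = []
        for query, doc in pairs:
            q_words = set(query.lower().split())
            d_words = set(doc.lower().split())
            scores.append(float(len(q_words & d_words)))
        return scores


def rerank_top_k(query: str, docs: Sequence[str], scorer, k: int = 3) -> List[Tuple[str, float]]:
    """Score every (query, doc) pair and return the k best documents with their scores."""
    if not docs:
        raise ValueError("docs must not be empty.")
    scores = scorer.predict([[query, doc] for doc in docs])
    # Higher score means more relevant, so sort in descending order
    ranked = sorted(zip(docs, scores), key=lambda item: item[1], reverse=True)
    return ranked[:k]


docs = [
    "premium rates by age group",
    "the policy covers diabetes and heart disease",
    "table of contents",
]
top = rerank_top_k("does the policy cover diabetes", docs, OverlapScorer(), k=2)
print(top)
```

Word overlap is a crude proxy chosen only to keep the example small; a real cross-encoder reads the query and passage together and produces far better relevance scores.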


In [53]:

cross_rerank_scores = rerank_with_cross_encoder(query, results_df, cross_encoder)


In [54]:
# Store the rerank_scores in results_df
results_df['Reranked_scores'] = cross_rerank_scores

In [55]:
# results_df

In [56]:
print("First Query :: ", query , "\n")

First Query ::  What are the specific eligibility requirements for a person to be considered a 'Member'? 



In [57]:
# Return the top 3 results from semantic search
top_3_semantic = results_df.sort_values(by='Distances')
top_3_semantic[:3]


Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Page_No.': 'Page 27', 'Policy_Name': 'Princi...",I f a Member's Dependent is employed and is co...,0.57085,24,0.183803
1,"{'Page_No.': 'Page 26', 'Policy_Name': 'Princi...",PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ...,0.603818,23,1.642594
2,"{'Page_No.': 'Page 6', 'Policy_Name': 'Princip...",TABLE OF CONTENTS PART I - DEFINITIONS PART II...,0.658628,3,-7.846834


In [58]:
# Return the top 3 results after reranking
top_3_rerank = results_df.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
1,"{'Page_No.': 'Page 26', 'Policy_Name': 'Princi...",PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ...,0.603818,23,1.642594
0,"{'Page_No.': 'Page 27', 'Policy_Name': 'Princi...",I f a Member's Dependent is employed and is co...,0.57085,24,0.183803
8,"{'Page_No.': 'Page 17', 'Policy_Name': 'Princi...",a. be actively engaged in business for profit ...,0.73645,14,-1.257841


In [59]:
top_3_RAG_query1 = top_3_rerank[["Documents", "Metadatas"]][:3]

top_3_RAG_query1

Unnamed: 0,Documents,Metadatas
1,PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ...,"{'Page_No.': 'Page 26', 'Policy_Name': 'Princi..."
0,I f a Member's Dependent is employed and is co...,"{'Page_No.': 'Page 27', 'Policy_Name': 'Princi..."
8,a. be actively engaged in business for profit ...,"{'Page_No.': 'Page 17', 'Policy_Name': 'Princi..."
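
The effect of re-ranking on Query 1 can be read off the tables above. The sketch below uses only the four rows whose distance and re-rank score both appear in those outputs; the remaining rows are omitted because their re-rank scores are not shown, which does not change either top 3.

```python
# (ID, distance, rerank score) for rows whose values appear in the outputs above
rows = [
    (24, 0.57085, 0.183803),
    (23, 0.603818, 1.642594),
    (3, 0.658628, -7.846834),
    (14, 0.73645, -1.257841),
]

# Semantic search: smaller distance is better
by_distance = [r[0] for r in sorted(rows, key=lambda r: r[1])][:3]
# Cross-encoder: larger score is better
by_rerank = [r[0] for r in sorted(rows, key=lambda r: r[2], reverse=True)][:3]

promoted = [i for i in by_rerank if i not in by_distance]
dropped = [i for i in by_distance if i not in by_rerank]

print(by_distance)  # [24, 23, 3]
print(by_rerank)    # [23, 24, 14]
print(promoted)     # [14]
print(dropped)      # [3]
```

Re-ranking moves the chunk with ID 23 (page 26) to the top, drops the table-of-contents chunk (ID 3, page 6), and promotes the chunk with ID 14 (page 17) into the top 3.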


#### Re-rank with Cross Encoder for Query-2

In [60]:
# results_df2

In [61]:
# Build (query, document) pairs for each result returned by the semantic search
# and score every pair with the cross-encoder
 
cross_rerank_scores2 = rerank_with_cross_encoder(query2, results_df2, cross_encoder)


In [62]:
# Store the rerank scores in results_df2
results_df2['Reranked_scores'] = cross_rerank_scores2

In [63]:
#results_df2

In [64]:
print("Second Query :: ", query2 , "\n")

Second Query ::  Under what circumstances can The Principal or the Policyholder terminate the group policy? 



In [65]:
# Return the top 3 results from semantic search
top_3_semantic2_query2 = results_df2.sort_values(by='Distances')
top_3_semantic2_query2[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Page_No.': 'Page 24', 'Policy_Name': 'Princi...",T he Principal may terminate the Policyholder'...,0.324605,21,8.025837
1,"{'Page_No.': 'Page 16', 'Policy_Name': 'Princi...",PART II - POLICY ADMINISTRATION Section A - Co...,0.435124,13,4.941992
2,"{'Page_No.': 'Page 23', 'Policy_Name': 'Princi...",Section C - Policy Termination Article 1 - Fai...,0.438009,20,7.518888


In [66]:
# Return the top 3 results after reranking
top_3_rerank_query2 = results_df2.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank_query2[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Page_No.': 'Page 24', 'Policy_Name': 'Princi...",T he Principal may terminate the Policyholder'...,0.324605,21,8.025837
2,"{'Page_No.': 'Page 23', 'Policy_Name': 'Princi...",Section C - Policy Termination Article 1 - Fai...,0.438009,20,7.518888
1,"{'Page_No.': 'Page 16', 'Policy_Name': 'Princi...",PART II - POLICY ADMINISTRATION Section A - Co...,0.435124,13,4.941992


In [67]:
top_3_RAG_query2 = top_3_rerank_query2[["Documents", "Metadatas"]][:3]

top_3_RAG_query2

Unnamed: 0,Documents,Metadatas
0,T he Principal may terminate the Policyholder'...,"{'Page_No.': 'Page 24', 'Policy_Name': 'Princi..."
2,Section C - Policy Termination Article 1 - Fai...,"{'Page_No.': 'Page 23', 'Policy_Name': 'Princi..."
1,PART II - POLICY ADMINISTRATION Section A - Co...,"{'Page_No.': 'Page 16', 'Policy_Name': 'Princi..."


#### Re-rank with Cross Encoder for Query-3

In [68]:
# Build (query, document) pairs for each result returned by the semantic search
# and score every pair with the cross-encoder
cross_rerank_scores3 = rerank_with_cross_encoder(query3, results_df3, cross_encoder)

In [69]:
# Store the rerank scores in results_df3
results_df3['Reranked_scores'] = cross_rerank_scores3


In [70]:
# results_df3

In [71]:
print("Third Query :: ", query3 , "\n")

Third Query ::  What are the policy exclusions under Accidental Death insurance? 



In [72]:
# Return the top 3 results from semantic search
top_3_semantic_query3 = results_df3.sort_values(by='Distances')
top_3_semantic_query3[:3]

Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
0,4,Section A – Eligibility Member Life Insurance ...,0.5230893531756438,"{'Page_No.': 'Page 7', 'Policy_Name': 'Princip...",-1.210638
1,5,Section A - Member Life Insurance Schedule of ...,0.5339254244021225,"{'Page_No.': 'Page 8', 'Policy_Name': 'Princip...",-1.679972
2,32,Section C - Individual Terminations Article 1 ...,0.5703461560200235,"{'Page_No.': 'Page 35', 'Policy_Name': 'Princi...",0.403215


In [73]:
# Return the top 3 results after reranking
top_3_rerank_query3 = results_df3.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank_query3[:3]

Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
2,32,Section C - Individual Terminations Article 1 ...,0.5703461560200235,"{'Page_No.': 'Page 35', 'Policy_Name': 'Princi...",0.403215
8,52,Exposure Exposure to the elements will be pres...,0.6161296850710246,"{'Page_No.': 'Page 55', 'Policy_Name': 'Princi...",0.032372
6,23,PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ...,0.5985395633181447,"{'Page_No.': 'Page 26', 'Policy_Name': 'Princi...",-0.20807


In [74]:
top_3_RAG_query3 = top_3_rerank_query3[["Documents", "Metadatas"]][:3]

top_3_RAG_query3

Unnamed: 0,Documents,Metadatas
2,Section C - Individual Terminations Article 1 ...,"{'Page_No.': 'Page 35', 'Policy_Name': 'Princi..."
8,Exposure Exposure to the elements will be pres...,"{'Page_No.': 'Page 55', 'Policy_Name': 'Princi..."
6,PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ...,"{'Page_No.': 'Page 26', 'Policy_Name': 'Princi..."



### Generation Layer

In this stage, the top three re-ranked chunks and the user's query are sent to a Large Language Model (LLM). The code below calls `gemini-2.5-flash` through an OpenAI-compatible client; other chat models, such as GPT-4o, could be substituted. A structured prompt instructs the model to answer only from the supplied excerpts.  

Instead of returning full pages or document chunks, the model delivers a concise answer followed by citations (policy name and page number). Retrieval followed by generation is what makes this a **Retrieval-Augmented Generation (RAG)** pipeline, and answer quality depends largely on **prompt engineering** and **custom instructions**.  

The goal is to provide **direct, context-aware responses** grounded in the retrieved policy text, improving clarity, speed, and overall user experience.


In [75]:
from openai import OpenAI
import pandas as pd

# `openai_client` is used below and is assumed to have been created earlier in the notebook.


def generate_response(query, top_3_RAG: pd.DataFrame):
    """
    Generate a response using Gemini (OpenAI-compatible) Chat model
    based on the user query and top 3 retrieved insurance document chunks.
    """
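    # Convert the DataFrame into a readable text block.
    # Number the excerpts by rank (1-3); the DataFrame index still holds the
    # pre-rerank row labels (e.g. 1, 0, 8), so it is not used for numbering.
    document_text = ""
    for rank, (_, row) in enumerate(top_3_RAG.iterrows(), start=1):
        document_text += (
            f"\n---\nDocument #{rank}\n"
            f"Policy Name & Page: {row['Metadatas']}\n"
            f"Content:\n{row['Documents']}\n"
        )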

    # Create the user prompt
    prompt = f"""
You are an expert assistant in the insurance domain. You accurately answer user questions using provided excerpts 
from insurance policy documents. Always be helpful, clear, and provide relevant citations.

Customer question: "{query}"

Below are 3 potentially relevant insurance document excerpts. Each document contains the policy text and its metadata (policy name and page number). Use only information relevant to the query.

{document_text}

Instructions:
1. Review all 3 documents to find information that helps answer the query.
2. If useful information is inside a table (formatted as a list of lists), convert it into a readable table.
3. Answer the question clearly and concisely using relevant content only.
4. At the end of your response, list all cited policies and their page numbers in a "Citations" section.
5. If no documents are useful, say: “None of the provided documents contain relevant details to answer your query.”
6. Do not include any technical, implementation, or system details — only answer the query.

Respond only with the final answer and citations.
"""

    # Call Gemini chat model via OpenAI-compatible API
    response = openai_client.chat.completions.create(
        model="gemini-2.5-flash",
        messages=[
            {"role": "system", "content": "You are an expert insurance assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.0
    )

    # Return the model's content as a list of lines
    return response.choices[0].message.content.split('\n')
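
The prompt above embeds each chunk's raw metadata dictionary. As a possible refinement, the metadata could be rendered as a short citation string before it goes into the prompt. The sketch below is illustrative only: `format_citation` and `build_context` are hypothetical helpers, plain dictionaries stand in for DataFrame rows, and `Example-Policy` is a placeholder value. Only the `Policy_Name` and `Page_No.` keys come from the outputs shown earlier.

```python
from typing import Dict, List


def format_citation(metadata: Dict[str, str]) -> str:
    """Render a metadata dict as 'Policy name, Page N'."""
    policy = metadata.get("Policy_Name", "Unknown policy")
    page = metadata.get("Page_No.", "page unknown")
    return f"{policy}, {page}"


def build_context(chunks: List[Dict]) -> str:
    """Join ranked chunks into one text block, numbering them from 1."""
    parts = []
    for rank, chunk in enumerate(chunks, start=1):
        parts.append(
            f"---\nDocument #{rank}\n"
            f"Source: {format_citation(chunk['Metadatas'])}\n"
            f"Content:\n{chunk['Documents']}"
        )
    return "\n".join(parts)


# Placeholder chunks; real rows would come from the re-ranked results
chunks = [
    {"Documents": "Example excerpt about eligibility.",
     "Metadatas": {"Policy_Name": "Example-Policy", "Page_No.": "Page 26"}},
    {"Documents": "Example excerpt about terminations.",
     "Metadatas": {"Policy_Name": "Example-Policy", "Page_No.": "Page 35"}},
]
context = build_context(chunks)
print(context)
```

A compact source line gives the model the exact string it is expected to repeat in the "Citations" section, rather than leaving it to parse a Python dictionary.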


In [76]:
# Generate the response - For Query 1
response = generate_response(query, top_3_RAG_query1)
print("-"*80,"\n","-"*78)
print("Query 1: ",query)
print("-"*80,"\n","-"*78,"\n")

# Print the response
print("\n".join(response))

-------------------------------------------------------------------------------- 
 ------------------------------------------------------------------------------
Query 1:  What are the specific eligibility requirements for a person to be considered a 'Member'?
-------------------------------------------------------------------------------- 
 ------------------------------------------------------------------------------ 

To be eligible for Member Life Insurance, a person must complete 30 consecutive days of continuous Active Work with the Policyholder as a Member. Additionally, a person is not eligible if they are already eligible under any other Group Term Life Insurance policy underwritten by The Principal.

Citations:
* Principal-Sample-Life-Insurance-Policy, Page 26


In [77]:
# Generate the response - For Query 2
response2 = generate_response(query2, top_3_RAG_query2)
print("-"*80,"\n","-"*78)
print("Query 2: ",query2)
print("-"*80,"\n","-"*78,"\n")

# Print the response
print("\n".join(response2))

-------------------------------------------------------------------------------- 
 ------------------------------------------------------------------------------
Query 2:  Under what circumstances can The Principal or the Policyholder terminate the group policy?
-------------------------------------------------------------------------------- 
 ------------------------------------------------------------------------------ 

The Principal or the Policyholder can terminate the group policy under the following circumstances:

**The Policyholder can terminate the group policy if:**
*   The total premium due has not been received by The Principal before the end of the Grace Period. Failure to pay the premium within the Grace Period is considered notice by the Policyholder to discontinue the Group Policy.
*   The Policyholder provides Written notice to The Principal prior to any premium due date, making the termination effective the day before that premium due date.
*   The Policyholder issue

In [78]:
# Generate the response - For Query 3
response3 = generate_response(query3, top_3_RAG_query3)
print("-"*80,"\n","-"*78)
print("Query 3: ",query3)
print("-"*80,"\n","-"*78,"\n")

# Print the response
print("\n".join(response3))

-------------------------------------------------------------------------------- 
 ------------------------------------------------------------------------------
Query 3:  What are the policy exclusions under Accidental Death insurance?
-------------------------------------------------------------------------------- 
 ------------------------------------------------------------------------------ 

None of the provided documents contain relevant details to answer your query regarding policy exclusions under Accidental Death insurance. The documents discuss policy terminations, eligibility, and benefits, but do not list specific exclusions.
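
The answer to Query 3 is the fallback sentence requested by instruction 5 of the prompt, which indicates that the model found no exclusions in the three chunks it was given. Responses like this can be flagged automatically so that retrieval for the query can be reviewed. The sketch below is a hypothetical helper that works on the list of lines returned by `generate_response`; the prefix it matches is taken from the prompt text above.

```python
from typing import List

# Opening words of the fallback sentence defined in the prompt
FALLBACK_PREFIX = "None of the provided documents contain relevant details"


def is_unanswered(response_lines: List[str]) -> bool:
    """Return True when the first non-empty line starts with the fallback sentence."""
    for line in response_lines:
        # Strip whitespace and any straight or curly quotes the model may echo
        text = line.strip().strip('"\u201c\u201d')
        if text:
            return text.startswith(FALLBACK_PREFIX)
    return False


query3_lines = [
    "None of the provided documents contain relevant details to answer your query "
    "regarding policy exclusions under Accidental Death insurance.",
]
query1_lines = [
    "To be eligible for Member Life Insurance, a person must complete 30 consecutive "
    "days of continuous Active Work with the Policyholder as a Member.",
]

print(is_unanswered(query3_lines))  # True
print(is_unanswered(query1_lines))  # False
```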


## Conclusion  

**Mr.HelpMate AI** shows how policyholders can query complex insurance documents in plain language instead of reading them end to end.  
By integrating **semantic search**, **retrieval-augmented generation (RAG)**, and **large language models (LLMs)**, the system returns concise, cited answers, reducing the need to manually sift through lengthy policy documents or rely on time-consuming customer support.  

Such an assistant can improve **customer satisfaction** and reduce **operational costs** for insurers.  
Its **modular architecture**, combining document processing, retrieval, and generation layers, supports **scalability** and **adaptability**, and lets each layer be tuned independently for **precision**.  

Beyond insurance, the design and learnings from **Mr.HelpMate AI** can be extended to domains such as **legal tech**, **finance**, and **enterprise knowledge management**.  
With its focus on **accuracy**, **personalization**, and **efficiency**, it exemplifies how AI can transform document-intensive workflows into seamless, human-centric digital experiences.  

