# **HelpMateAI Project - Generative AI**
This project focuses on building a generative search system in the insurance domain. Inspired by the approach demonstrated in the "Retrieval Augmented Generation (RAG)" session, the goal is to design a system that can effectively and accurately answer questions derived from a policy document.

The project will utilize a single comprehensive life insurance policy document as the primary knowledge source. By leveraging retrieval and generation techniques, the system will ensure that user queries are grounded in the document’s content, providing precise and contextually relevant answers.


# 1. <font color = Green> Install and Import necessary libraries



In [122]:
# Install all the required libraries
!pip install -U -q pdfplumber tiktoken openai chromadb sentence-transformers


In [123]:
# Import necessary libraries
import pandas as pd
import numpy as np
import pdfplumber
from pathlib import Path
from operator import itemgetter
import json
import tiktoken
import openai
from sentence_transformers import CrossEncoder, util
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
import chromadb
import os, json, ast
from google.colab import userdata
from enum import Enum

# <font color = Green>  **2. Data PreProcessing Layer**
Read, process, and chunk the PDF file using pdfplumber.

**pdfplumber** allows for better parsing of the PDF file because it can read various elements of the PDF apart from the plain text, such as tables and images. It also offers a wide range of functionality and visual debugging features to help with advanced preprocessing.

<font color=orange>**2.1 Open and Read PDF File and Extract text into dataframe**

In [124]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [125]:
# Define the path of the PDF
pdf_path = '/content/drive/MyDrive/upgrad/HelpMateAI/Principal-Sample-Life-Insurance-Policy.pdf'

In [126]:
# Open the PDF file
with pdfplumber.open(pdf_path) as pdf:

    # Get one of the pages from the PDF and examine it (index 6 is the seventh page)
    single_page = pdf.pages[6]

    # Extract text from the selected page
    text = single_page.extract_text()

    # Extract tables from the selected page
    tables = single_page.extract_tables()

    # Print the extracted text
    print(text)

Section A – Eligibility
Member Life Insurance Article 1
Member Accidental Death and Dismemberment Insurance Article 2
Dependent Life Insurance Article 3
Section B - Effective Dates
Member Life Insurance Article 1
Member Accidental Death and Dismemberment Insurance Article 2
Dependent Life Insurance Article 3
Section C - Individual Terminations
Member Life Insurance Article 1
Member Accidental Death and Dismemberment Insurance Article 2
Dependent Life Insurance Article 3
Termination for Fraud Article 4
Coverage While Outside of the United States Article 5
Section D - Continuation
Member Life Insurance Article 1
Dependent Insurance - Developmentally Disabled or
Physically Handicapped Children Article 2
Section E - Reinstatement
Reinstatement Article 1
Federal Required Family and Medical Leave Act (FMLA) Article 2
Reinstatement of Coverage for a Member or Dependent When
Coverage Ends due to Living Outside of the United States Article 3
Section F - Individual Purchase Rights
Member Life In

In [127]:
# Function to check whether a word is present in a table or not for segregation of regular text and tables
def check_bboxes(word, table_bbox):
    # Check whether word is inside a table bbox.
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]
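
The containment test above can be tried on a few hand-made word boxes. This is a minimal sketch: the coordinates and the helper name `word_in_bbox` are made up for illustration, and only the dictionary keys (`x0`, `top`, `x1`, `bottom`) mirror what the function above reads.

```python
# Toy data: all coordinates below are invented for illustration.
def word_in_bbox(word, bbox):
    """Return True when the word's box lies strictly inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return (word["x0"] > x0 and word["top"] > top
            and word["x1"] < x1 and word["bottom"] < bottom)

table_bbox = (50, 100, 400, 300)
inside = {"text": "Premium", "x0": 60, "top": 120, "x1": 110, "bottom": 132}
outside = {"text": "Section", "x0": 60, "top": 40, "x1": 110, "bottom": 52}
straddling = {"text": "Total", "x0": 380, "top": 120, "x1": 420, "bottom": 132}

words = [inside, outside, straddling]
# Keep only the words that are NOT fully inside the table box.
non_table_words = [w["text"] for w in words if not word_in_bbox(w, table_bbox)]
print(non_table_words)
```

Because the comparison is strict, a word that straddles the table border is treated as regular text rather than table content.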

In [128]:
# Function to extract text from a PDF file.
# 1. Declare a variable p to store the iteration of the loop that will help us store page numbers alongside the text
# 2. Declare an empty list 'full_text' to store all the text files
# 3. Use pdfplumber to open the pdf pages one by one
# 4. Find the tables and their locations in the page
# 5. Extract the text from the tables in the variable 'tables'
# 6. Extract the regular words by calling the function check_bboxes() and checking whether words are present in the table or not
# 7. Use the cluster_objects utility to cluster non-table and table words together so that they retain the same chronology as in the original PDF
# 8. Declare an empty list 'lines' to store the page text
# 9. If a text element is present in the cluster, append it to 'lines', else if a table element is present, append the table
# 10. Append the page number and all lines to full_text, and increment 'p'
# 11. When the function has iterated over all pages, return the 'full_text' list

def extract_text_from_pdf(pdf_path):

    full_text = []

    with pdfplumber.open(pdf_path) as pdf:

        for p, page in enumerate(pdf.pages, start=1):

            page_no = f"Page {p}"
            text = page.extract_text()

            # Extract heading (if text exists)
            heading = text.split('\n')[0].strip() if text else None

            tables = page.find_tables()
            table_bboxes = [i.bbox for i in tables]
            tables = [{'table': i.extract(), 'top': i.bbox[1]} for i in tables]
            non_table_words = [word for word in page.extract_words() if not any(
                check_bboxes(word, table_bbox) for table_bbox in table_bboxes)]
            lines = []

            for cluster in pdfplumber.utils.cluster_objects(non_table_words + tables, itemgetter('top'), tolerance=5):
                if 'text' in cluster[0]:
                    try:
                        lines.append(' '.join([i['text'] for i in cluster]))
                    except KeyError:
                        pass
                elif 'table' in cluster[0]:
                    lines.append(json.dumps(cluster[0]['table']))

            full_text.append([page_no, heading, " ".join(lines)])

    # Convert the extracted data to a DataFrame
    df = pd.DataFrame(full_text, columns=['Page_Number', 'Heading', 'Text'])

    return df

In [129]:
df = extract_text_from_pdf(pdf_path)

In [130]:
df.head(10)

Unnamed: 0,Page_Number,Heading,Text
0,Page 1,DOROTHEA GLAUSE S655,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...
1,Page 2,This page left blank intentionally,This page left blank intentionally
2,Page 3,POLICY RIDER,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...
3,Page 4,This page left blank intentionally,This page left blank intentionally
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY,PRINCIPAL LIFE INSURANCE COMPANY (called The P...
5,Page 6,TABLE OF CONTENTS,TABLE OF CONTENTS PART I - DEFINITIONS PART II...
6,Page 7,Section A – Eligibility,Section A – Eligibility Member Life Insurance ...
7,Page 8,Section A - Member Life Insurance,Section A - Member Life Insurance Schedule of ...
8,Page 9,P ART I - DEFINITIONS,P ART I - DEFINITIONS When used in this Group ...
9,Page 10,T he legally recognized union of two eligible ...,T he legally recognized union of two eligible ...


In [132]:
# Check one of the extracted page texts to ensure that the text has been correctly read
df['Text'][5]

'TABLE OF CONTENTS PART I - DEFINITIONS PART II - POLICY ADMINISTRATION Section A – Contract Entire Contract Article 1 Policy Changes Article 2 Policyholder Eligibility Requirements Article 3 Policy Incontestability Article 4 Individual Incontestability Article 5 Information to be Furnished Article 6 Certificates Article 7 Assignments Article 8 Dependent Rights Article 9 Policy Interpretation Article 10 Electronic Transactions Article 11 Section B – Premium Payment Responsibility; Due Dates; Grace Period Article 1 Premium Rates Article 2 Premium Rate Changes Article 3 Premium Amount Article 4 Contributions from Members Article 5 Section C - Policy Termination Failure to Pay Premium Article 1 Termination Rights of the Policyholder Article 2 Termination Rights of The Principal Article 3 Policyholder Responsibility to Members Article 4 Section D - Policy Renewal Renewal Article 1 PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS This policy has been updated effective January 1, 2014 GC 6001 T

<font color=orange>**2.2 Extract the text length of each page and store it in a separate column of the dataframe**



In [133]:
# Let's also check the length of all the texts as there might be some empty pages or pages with very few words that we can drop
df['Text_Length'] = df['Text'].apply(lambda x: len(x.split(' ')))

In [134]:
df.head(10)

Unnamed: 0,Page_Number,Heading,Text,Text_Length
0,Page 1,DOROTHEA GLAUSE S655,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,30
1,Page 2,This page left blank intentionally,This page left blank intentionally,5
2,Page 3,POLICY RIDER,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,230
3,Page 4,This page left blank intentionally,This page left blank intentionally,5
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY,PRINCIPAL LIFE INSURANCE COMPANY (called The P...,110
5,Page 6,TABLE OF CONTENTS,TABLE OF CONTENTS PART I - DEFINITIONS PART II...,153
6,Page 7,Section A – Eligibility,Section A – Eligibility Member Life Insurance ...,176
7,Page 8,Section A - Member Life Insurance,Section A - Member Life Insurance Schedule of ...,171
8,Page 9,P ART I - DEFINITIONS,P ART I - DEFINITIONS When used in this Group ...,387
9,Page 10,T he legally recognized union of two eligible ...,T he legally recognized union of two eligible ...,251


In [135]:
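# Retain only the rows with a text length of at least 10
# (.copy() gives an independent dataframe, so columns can be added later without a SettingWithCopyWarning)
df = df.loc[df['Text_Length']>=10].copy()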
df.head(10)

Unnamed: 0,Page_Number,Heading,Text,Text_Length
0,Page 1,DOROTHEA GLAUSE S655,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,30
2,Page 3,POLICY RIDER,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,230
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY,PRINCIPAL LIFE INSURANCE COMPANY (called The P...,110
5,Page 6,TABLE OF CONTENTS,TABLE OF CONTENTS PART I - DEFINITIONS PART II...,153
6,Page 7,Section A – Eligibility,Section A – Eligibility Member Life Insurance ...,176
7,Page 8,Section A - Member Life Insurance,Section A - Member Life Insurance Schedule of ...,171
8,Page 9,P ART I - DEFINITIONS,P ART I - DEFINITIONS When used in this Group ...,387
9,Page 10,T he legally recognized union of two eligible ...,T he legally recognized union of two eligible ...,251
10,Page 11,(2) has been placed with the Member or spouse ...,(2) has been placed with the Member or spouse ...,299
11,Page 12,An institution that is licensed as a Hospital ...,An institution that is licensed as a Hospital ...,352


<font color=orange>**2.3 Generate metadata for each page and store it in a separate column of the dataframe**



In [136]:
# Store the metadata for each page in a separate column
df['Metadata'] = df.apply(
    lambda x: {
        'Section': (x['Heading'][:20] if x['Heading'] else ''),
        'Page_No.': x['Page_Number']
    },
    axis=1
)
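
The metadata rule used above (first 20 characters of the heading, plus the page label) can be checked in isolation. This is a small sketch; the helper name `build_page_metadata` and its `max_section_chars` parameter are hypothetical and only restate the lambda above.

```python
def build_page_metadata(page_number, heading, max_section_chars=20):
    """Build the per-page metadata dict; a missing heading becomes an empty section."""
    section = heading[:max_section_chars] if heading else ""
    return {"Section": section, "Page_No.": page_number}

print(build_page_metadata("Page 5", "PRINCIPAL LIFE INSURANCE COMPANY"))
print(build_page_metadata("Page 2", None))
```

Since the heading is simply the first line of the page text, a page that begins mid-sentence ends up with a sentence fragment as its section label.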

In [137]:
df.head(10)

Unnamed: 0,Page_Number,Heading,Text,Text_Length,Metadata
0,Page 1,DOROTHEA GLAUSE S655,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,30,"{'Section': 'DOROTHEA GLAUSE S655', 'Page_No.'..."
2,Page 3,POLICY RIDER,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,230,"{'Section': 'POLICY RIDER', 'Page_No.': 'Page 3'}"
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY,PRINCIPAL LIFE INSURANCE COMPANY (called The P...,110,"{'Section': 'PRINCIPAL LIFE INSUR', 'Page_No.'..."
5,Page 6,TABLE OF CONTENTS,TABLE OF CONTENTS PART I - DEFINITIONS PART II...,153,"{'Section': 'TABLE OF CONTENTS', 'Page_No.': '..."
6,Page 7,Section A – Eligibility,Section A – Eligibility Member Life Insurance ...,176,"{'Section': 'Section A – Eligibil', 'Page_No.'..."
7,Page 8,Section A - Member Life Insurance,Section A - Member Life Insurance Schedule of ...,171,"{'Section': 'Section A - Member L', 'Page_No.'..."
8,Page 9,P ART I - DEFINITIONS,P ART I - DEFINITIONS When used in this Group ...,387,"{'Section': 'P ART I - DEFINITION', 'Page_No.'..."
9,Page 10,T he legally recognized union of two eligible ...,T he legally recognized union of two eligible ...,251,"{'Section': 'T he legally recogni', 'Page_No.'..."
10,Page 11,(2) has been placed with the Member or spouse ...,(2) has been placed with the Member or spouse ...,299,"{'Section': '(2) has been placed ', 'Page_No.'..."
11,Page 12,An institution that is licensed as a Hospital ...,An institution that is licensed as a Hospital ...,352,"{'Section': 'An institution that ', 'Page_No.'..."


<font color=orange>**2.4 Generate chunks for each page and store them in a separate column of the dataframe**



In [138]:
# Splits a large text into smaller overlapping chunks.
# Each chunk contains up to 'chunk_size' words, with 'overlap_size' words
# carried over from the previous chunk to preserve context.
def chunk_text(text, chunk_size=300, overlap_size=50):
    # Split the text into individual words
    words = text.split()
    chunks = []

    # Iterate over the words to create chunks with overlap
    for i in range(0, len(words), chunk_size - overlap_size):
        # Create a chunk from the current position
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks
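
One property of the sliding window above is worth noting: when a page is only slightly longer than the step (`chunk_size - overlap_size`), the last window contains only words that the previous chunk already covered. The variant below is a sketch of one possible way to avoid that; the function name `chunk_text_no_redundant_tail` is hypothetical, and it is not the version used in the rest of this notebook.

```python
def chunk_text_no_redundant_tail(text, chunk_size=300, overlap_size=50):
    """Overlapping word chunks that stop as soon as a chunk reaches the end of the text."""
    if overlap_size >= chunk_size:
        raise ValueError("overlap_size must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap_size
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Once this chunk has reached the last word, any later window
        # would only repeat words that are already covered.
        if start + chunk_size >= len(words):
            break
    return chunks

short_page = " ".join(f"w{i}" for i in range(251))
long_page = " ".join(f"w{i}" for i in range(387))
print(len(chunk_text_no_redundant_tail(short_page)))
print(len(chunk_text_no_redundant_tail(long_page)))
```

With the defaults, a 251-word page stays in one chunk, while a 387-word page is still split into two overlapping chunks.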

In [139]:
df['Chunks'] = df['Text'].apply(lambda x: chunk_text(x))

# Flatten the DataFrame to have one row per chunk
chunked_df = df.explode('Chunks').reset_index(drop=True)

# Add a running identifier (starting at 1) to each chunk
chunked_df['Chunk_ID'] = chunked_df.index + 1

In [140]:
chunked_df.head(20)

Unnamed: 0,Page_Number,Heading,Text,Text_Length,Metadata,Chunks,Chunk_ID
0,Page 1,DOROTHEA GLAUSE S655,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,30,"{'Section': 'DOROTHEA GLAUSE S655', 'Page_No.'...",DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,1
1,Page 3,POLICY RIDER,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,230,"{'Section': 'POLICY RIDER', 'Page_No.': 'Page 3'}",POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,2
2,Page 5,PRINCIPAL LIFE INSURANCE COMPANY,PRINCIPAL LIFE INSURANCE COMPANY (called The P...,110,"{'Section': 'PRINCIPAL LIFE INSUR', 'Page_No.'...",PRINCIPAL LIFE INSURANCE COMPANY (called The P...,3
3,Page 6,TABLE OF CONTENTS,TABLE OF CONTENTS PART I - DEFINITIONS PART II...,153,"{'Section': 'TABLE OF CONTENTS', 'Page_No.': '...",TABLE OF CONTENTS PART I - DEFINITIONS PART II...,4
4,Page 7,Section A – Eligibility,Section A – Eligibility Member Life Insurance ...,176,"{'Section': 'Section A – Eligibil', 'Page_No.'...",Section A – Eligibility Member Life Insurance ...,5
5,Page 8,Section A - Member Life Insurance,Section A - Member Life Insurance Schedule of ...,171,"{'Section': 'Section A - Member L', 'Page_No.'...",Section A - Member Life Insurance Schedule of ...,6
6,Page 9,P ART I - DEFINITIONS,P ART I - DEFINITIONS When used in this Group ...,387,"{'Section': 'P ART I - DEFINITION', 'Page_No.'...",P ART I - DEFINITIONS When used in this Group ...,7
7,Page 9,P ART I - DEFINITIONS,P ART I - DEFINITIONS When used in this Group ...,387,"{'Section': 'P ART I - DEFINITION', 'Page_No.'...",f. Continence - the ability to voluntarily con...,8
8,Page 10,T he legally recognized union of two eligible ...,T he legally recognized union of two eligible ...,251,"{'Section': 'T he legally recogni', 'Page_No.'...",T he legally recognized union of two eligible ...,9
9,Page 10,T he legally recognized union of two eligible ...,T he legally recognized union of two eligible ...,251,"{'Section': 'T he legally recogni', 'Page_No.'...",2,10


# <font color=Green>**3. Generate and Store Embeddings using OpenAI and ChromaDB**
In this section, we will embed the chunks in the dataframe with OpenAI's <font color=Green>text-embedding-ada-002</font> model and store them in a ChromaDB collection.

<font color=orange>**3.1 Retrieve OpenAI Key**

In [141]:
from google.colab import userdata
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

In [142]:
chroma_data_path = '/content/drive/MyDrive/upgrad/HelpMateAI'

<font color=orange>**3.2 Initialize Vector Database - ChromaDB Persistent**

In [143]:
# Create a persistent Chroma client that stores its data under chroma_data_path
client = chromadb.PersistentClient(path=chroma_data_path)

In [144]:
# Set up the embedding function using the OpenAI embedding model
model = "text-embedding-ada-002"
embedding_function = OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_API_KEY"),model_name=model)

In [145]:
# Initialise a collection in Chroma and pass the embedding_function to it so that it uses OpenAI embeddings to embed the documents
insurance_collection = client.get_or_create_collection(name='RAG_on_Insurance', embedding_function=embedding_function)

In [146]:
# Convert the page text and metadata from your dataframe to lists to be able to pass it to chroma
documents_list = chunked_df["Chunks"].tolist()
metadata_list = chunked_df['Metadata'].tolist()

<font color=orange>**3.3 Store Documents into Vector Database - ChromaDB**

In [147]:
# Add the documents and metadata to the collection along with generic integer IDs.
insurance_collection.add(
    documents= documents_list,
    ids = [str(i) for i in range(0, len(documents_list))],
    metadatas = metadata_list
)

In [148]:
# Take a look at the first few entries in the collection
insurance_collection.get(
    ids = ['0','1','2'],
    include = ['embeddings', 'documents', 'metadatas']
)

{'ids': ['0', '1', '2'],
 'embeddings': array([[-2.25940198e-02,  1.86782442e-02, -2.72537917e-02, ...,
         -3.69388051e-02,  2.90256785e-03, -1.38683687e-03],
        [-1.31580187e-02,  8.86348262e-03, -4.63755010e-03, ...,
         -1.56551618e-02, -9.10059171e-05,  7.27875810e-03],
        [-1.20378779e-02,  1.40740369e-02, -3.30295507e-03, ...,
         -2.85194907e-02, -9.43796150e-03,  1.02139572e-02]]),
 'documents': ['DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/01/2014 711 HIGH STREET GEORGE RI 02903 GROUP POLICY FOR: RHODE ISLAND JOHN DOE ALL MEMBERS Group Member Life Insurance Print Date: 07/16/2014',
  'POLICY RIDER GROUP INSURANCE POLICY NO: S655 COVERAGE: Life EMPLOYER: RHODE ISLAND JOHN DOE Effective on the later of the Date of Issue of this Group Policy or March 1, 2005, the following will apply to your Policy: From time to time The Principal may offer or provide certain employer groups who apply for coverage with The Principal a Financial Services Hotline and Gri

<font color=orange>**3.4 Initialise Vector Database Cache**

In [195]:
cache_collection = client.get_or_create_collection(name='Insurance_Cache', embedding_function=embedding_function)
cache_collection.peek()

{'ids': [],
 'embeddings': array([], dtype=float64),
 'documents': [],
 'uris': None,
 'included': ['metadatas', 'documents', 'embeddings'],
 'data': None,
 'metadatas': []}


# **<font color=Green >4.Semantic Search with Cache**
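In this section, we will run a semantic search for a query against the collection's embeddings and retrieve the top semantically similar results. A cache collection is checked first so that a repeated or near-identical query does not have to hit the main collection again.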
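
As a toy illustration of what ranking by embedding distance means, the sketch below orders three made-up 3-dimensional vectors by cosine distance to a query vector. The vectors, labels, and helper names are invented for illustration; real embeddings have far more dimensions, and the distance metric actually used depends on how the Chroma collection is configured.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, distance) pairs closest to the query."""
    scored = [(doc_id, cosine_distance(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Made-up "embeddings" for three chunks and one query.
toy_docs = {
    "disability": [0.9, 0.1, 0.0],
    "premiums": [0.1, 0.9, 0.1],
    "definitions": [0.3, 0.3, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

ranked = top_k(toy_query, toy_docs)
print([doc_id for doc_id, _ in ranked])
```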


In [152]:
# Perform semantic search with a cache layer.
# If the query is found in cache (distance <= threshold), return cached results.
# Otherwise, query the main collection, store results in cache, and return them.
# query (str): User query text.
# cache_results (dict): Results from searching the cache.
# cache_collection: Cache collection object (ChromaDB).
# insurance_collection: Main collection object (ChromaDB).
# threshold (float): Distance threshold for cache hit/miss. Default 0.2.
def semantic_search_with_cache(query, cache_results, cache_collection, insurance_collection, threshold=0.2, n_results=10):
    ids, documents, distances, metadatas = [], [], [], []

    # Check if cache has valid results and avoid index out of range
    if (
        cache_results.get('distances')
        and cache_results['distances'][0]
        and cache_results['distances'][0][0] <= threshold
    ):
        print("Found in cache!")

        cache_result_dict = cache_results['metadatas'][0][0]
        for key, value in cache_result_dict.items():
            if 'ids' in key:
                ids.append(value)
            elif 'documents' in key:
                documents.append(value)
            elif 'distances' in key:
                distances.append(value)
            elif 'metadatas' in key:
                metadatas.append(value)

        results_df = pd.DataFrame({
            'IDs': ids,
            'Documents': documents,
            'Distances': distances,
            'Metadatas': metadatas
        })

    else:
        print("Not found in cache. Found in main collection.")

        results = insurance_collection.query(query_texts=query, n_results=n_results)

        # Store results in cache for future use
        Keys, Values = [], []
        for key, val in results.items():
            if val is None:
                continue
            for i in range(min(len(val[0]), 10)):  # safer than fixed 10
                Keys.append(str(key) + str(i))
                Values.append(str(val[0][i]))

        cache_collection.add(
            documents=[query],
            ids=[query],
            metadatas=dict(zip(Keys, Values))
        )

        result_dict = {
            'IDs': results['ids'][0],
            'Documents': results['documents'][0],
            'Distances': results['distances'][0],
            'Metadatas': results['metadatas'][0]
        }
        results_df = pd.DataFrame.from_dict(result_dict)

    return results_df
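
The function above caches results by flattening every field into string-valued metadata keys such as `ids0` or `documents0`, which means distances and metadata come back as strings on a cache hit. One alternative, sketched here under the assumption that a single JSON string is an acceptable metadata value for the cache collection, is to serialise the whole result once and parse it back on a hit. The helper names and the `cached_results` key are hypothetical.

```python
import json

RESULT_KEYS = ("ids", "documents", "distances", "metadatas")

def pack_results_for_cache(results):
    """Serialise the first query's result lists into one string-valued metadata dict."""
    payload = {key: results[key][0] for key in RESULT_KEYS if results.get(key)}
    return {"cached_results": json.dumps(payload)}

def unpack_results_from_cache(metadata):
    """Restore the lists (with their original types) from the cached metadata."""
    return json.loads(metadata["cached_results"])

# Toy result in the nested-list shape returned by a collection query;
# the texts and metadata values are invented for illustration.
toy_results = {
    "ids": [["74", "77"]],
    "documents": [["first chunk text", "second chunk text"]],
    "distances": [[0.256554, 0.278729]],
    "metadatas": [[{"Section": "toy section A", "Page_No.": "Page 1"},
                   {"Section": "toy section B", "Page_No.": "Page 2"}]],
    "embeddings": None,
}

cached = pack_results_for_cache(toy_results)
restored = unpack_results_from_cache(cached)
print(restored["distances"])
```

The restored dictionary keeps distances as floats and metadata as dictionaries, so a cache hit can be turned into a dataframe with the same column types as a fresh query.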

In [196]:
# Read the user query
query = input()

what is the life insurance coverage for disability?


In [197]:
# Search the cache collection first
cache_results = cache_collection.query(
    query_texts=query,
    n_results=1
)

In [198]:
cache_results

{'ids': [[]],
 'embeddings': None,
 'documents': [[]],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[]],
 'distances': [[]]}
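
An empty cache returns nested empty lists, as in the output above, so the hit test has to guard against indexing into an empty list before comparing distances. The sketch below isolates that check; the helper name `is_cache_hit` is hypothetical, and the toy dictionaries only imitate the shape of the query output.

```python
def is_cache_hit(cache_results, threshold=0.2):
    """True only when the cache returned at least one match within the distance threshold."""
    distances = cache_results.get("distances") or [[]]
    first_query = distances[0] if distances else []
    return bool(first_query) and first_query[0] <= threshold

# Toy inputs shaped like the cache query output.
empty_cache = {"ids": [[]], "documents": [[]], "metadatas": [[]], "distances": [[]]}
near_match = {"ids": [["q1"]], "distances": [[0.05]]}
far_match = {"ids": [["q1"]], "distances": [[0.35]]}

print(is_cache_hit(empty_cache), is_cache_hit(near_match), is_cache_hit(far_match))
```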

In [199]:
results_df = semantic_search_with_cache(
    query=query,
    cache_results=cache_results,
    cache_collection=cache_collection,
    insurance_collection=insurance_collection,
    threshold=0.2,
    n_results=10
)
results_df

Not found in cache. Found in main collection.


Unnamed: 0,IDs,Documents,Distances,Metadatas
0,74,Payment of benefits will be subject to the Ben...,0.256554,"{'Section': 'Payment of benefits ', 'Page_No.'..."
1,77,pay the Member's beneficiary the Member Life I...,0.278729,"{'Section': 'The Principal may re', 'Page_No.'..."
2,54,Section D - Continuation Article 1 - Member Li...,0.281515,"{'Page_No.': 'Page 38', 'Section': 'Section D ..."
3,78,Coverage During Disability will cease on the e...,0.305581,"{'Page_No.': 'Page 51', 'Section': 'Coverage D..."
4,55,2 - Dependent Insurance - Developmentally Disa...,0.307007,"{'Section': 'Section D - Continua', 'Page_No.'..."
5,5,Section A - Member Life Insurance Schedule of ...,0.308737,"{'Page_No.': 'Page 8', 'Section': 'Section A -..."
6,38,to an individual policy; or (2) were eligible ...,0.312167,"{'Section': 'Section B - Effectiv', 'Page_No.'..."
7,76,The Principal may require that a ADL Disabled ...,0.314496,"{'Page_No.': 'Page 50', 'Section': 'The Princi..."
8,75,ADL Disability or Total Disability has continu...,0.32233,"{'Section': 'Payment of benefits ', 'Page_No.'..."
9,62,or (4) the Member's Accelerated Benefits Premi...,0.326764,"{'Page_No.': 'Page 42', 'Section': 'Section F ..."


# **<font color=Green>5. Re-Ranking with a Cross Encoder**

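Re-ranking the results obtained from semantic search with a cross-encoder can sometimes significantly improve the relevance of the retrieved results, because the cross-encoder scores each query and document pair jointly instead of comparing separately computed embeddings.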

In [200]:
# Import the CrossEncoder library from sentence_transformers
from sentence_transformers import CrossEncoder, util

In [201]:
# Initialise the cross encoder model
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [202]:
# Prepare input pairs [query, document] for cross-encoder reranking.
def generate_cross_inputs(query, results_df):
    return [[query, response] for response in results_df['Documents']]

# Compute rerank scores using cross-encoder and add them to results_df.
def compute_rerank_scores(cross_encoder, cross_inputs, results_df):
    scores = cross_encoder.predict(cross_inputs)
    results_df['Reranked_scores'] = scores
    return results_df

# Return top N results from semantic search (sorted by distance ascending)
def get_top_semantic(results_df, top_n=3):
    return results_df.sort_values(by='Distances').head(top_n)

# Return top N results after reranking (sorted by rerank score descending).
def get_top_rerank(results_df, top_n=3):
    return results_df.sort_values(by='Reranked_scores', ascending=False).head(top_n)

# Return top N RAG-ready results (documents + metadata only).
def get_top_rag(results_df, top_n=3):
    return results_df[["Documents", "Metadatas"]].head(top_n)

In [203]:
cross_inputs = generate_cross_inputs(query,results_df )
results_df = compute_rerank_scores(cross_encoder, cross_inputs, results_df)
results_df

Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
0,74,Payment of benefits will be subject to the Ben...,0.256554,"{'Section': 'Payment of benefits ', 'Page_No.'...",2.50594
1,77,pay the Member's beneficiary the Member Life I...,0.278729,"{'Section': 'The Principal may re', 'Page_No.'...",-0.145489
2,54,Section D - Continuation Article 1 - Member Li...,0.281515,"{'Page_No.': 'Page 38', 'Section': 'Section D ...",-0.18451
3,78,Coverage During Disability will cease on the e...,0.305581,"{'Page_No.': 'Page 51', 'Section': 'Coverage D...",-0.938965
4,55,2 - Dependent Insurance - Developmentally Disa...,0.307007,"{'Section': 'Section D - Continua', 'Page_No.'...",1.854577
5,5,Section A - Member Life Insurance Schedule of ...,0.308737,"{'Page_No.': 'Page 8', 'Section': 'Section A -...",-0.163983
6,38,to an individual policy; or (2) were eligible ...,0.312167,"{'Section': 'Section B - Effectiv', 'Page_No.'...",-2.223355
7,76,The Principal may require that a ADL Disabled ...,0.314496,"{'Page_No.': 'Page 50', 'Section': 'The Princi...",-2.505526
8,75,ADL Disability or Total Disability has continu...,0.32233,"{'Section': 'Payment of benefits ', 'Page_No.'...",-3.928953
9,62,or (4) the Member's Accelerated Benefits Premi...,0.326764,"{'Page_No.': 'Page 42', 'Section': 'Section F ...",-0.172405


In [204]:
# Return the top 3 results from semantic search
top_3_semantic = get_top_semantic(results_df)
top_3_semantic

Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
0,74,Payment of benefits will be subject to the Ben...,0.256554,"{'Section': 'Payment of benefits ', 'Page_No.'...",2.50594
1,77,pay the Member's beneficiary the Member Life I...,0.278729,"{'Section': 'The Principal may re', 'Page_No.'...",-0.145489
2,54,Section D - Continuation Article 1 - Member Li...,0.281515,"{'Page_No.': 'Page 38', 'Section': 'Section D ...",-0.18451


In [205]:
# Return the top 3 results after reranking
top_3_rerank = get_top_rerank(results_df)
top_3_rerank

Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
0,74,Payment of benefits will be subject to the Ben...,0.256554,"{'Section': 'Payment of benefits ', 'Page_No.'...",2.50594
4,55,2 - Dependent Insurance - Developmentally Disa...,0.307007,"{'Section': 'Section D - Continua', 'Page_No.'...",1.854577
1,77,pay the Member's beneficiary the Member Life I...,0.278729,"{'Section': 'The Principal may re', 'Page_No.'...",-0.145489
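
The difference between the two orderings can be reproduced without pandas. The sketch below copies the IDs, distances, and re-rank scores of the first six rows from the recorded output above into plain dictionaries (the key names are chosen for illustration) and sorts them both ways.

```python
# IDs, distances, and cross-encoder scores copied from the first six rows
# of the recorded output for the disability query.
rows = [
    {"id": "74", "distance": 0.256554, "rerank": 2.50594},
    {"id": "77", "distance": 0.278729, "rerank": -0.145489},
    {"id": "54", "distance": 0.281515, "rerank": -0.18451},
    {"id": "78", "distance": 0.305581, "rerank": -0.938965},
    {"id": "55", "distance": 0.307007, "rerank": 1.854577},
    {"id": "5", "distance": 0.308737, "rerank": -0.163983},
]

# Semantic search: smaller distance is better.
top_semantic = [r["id"] for r in sorted(rows, key=lambda r: r["distance"])[:3]]
# Re-ranking: larger cross-encoder score is better.
top_rerank = [r["id"] for r in sorted(rows, key=lambda r: r["rerank"], reverse=True)[:3]]

print(top_semantic)
print(top_rerank)
```

Chunk 55 sits fifth by distance but second by cross-encoder score, which is why it enters the top three only after re-ranking.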


In [206]:
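# Keep only the Documents and Metadatas columns of the first 3 rows for the generation step.
# results_df is still in semantic-search (distance) order here; pass top_3_rerank to use the re-ranked order.
top_3_RAG = get_top_rag(results_df)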
top_3_RAG

Unnamed: 0,Documents,Metadatas
0,Payment of benefits will be subject to the Ben...,"{'Section': 'Payment of benefits ', 'Page_No.'..."
1,pay the Member's beneficiary the Member Life I...,"{'Section': 'The Principal may re', 'Page_No.'..."
2,Section D - Continuation Article 1 - Member Li...,"{'Page_No.': 'Page 38', 'Section': 'Section D ..."


# <font Color=Green>6. Retrieval Augmented Generation


In [207]:
# Define the function to generate the response. Provide a comprehensive prompt that passes the user query and the top 3 results to the model

def generate_response(query, results_df):
    """
    Generate a response using GPT-4o-mini ChatCompletion based on the user query and retrieved information.
    """
    messages = [
                {"role": "system", "content":  "You are a helpful assistant in the insurance domain who can effectively answer user queries about insurance policies and documents."},
                {"role": "user", "content": f"""You are a helpful assistant in the insurance domain who can effectively answer user queries about insurance policies and documents.
                                                You have a question asked by the user in '{query}' and you have some search results from a corpus of insurance documents in the dataframe '{results_df}'. These search results are essentially chunks of pages from an insurance document that may be relevant to the user query.

                                                The column 'Documents' inside this dataframe contains the actual text from the policy document and the column 'Metadatas' contains the section heading and source page. The text inside the document may also contain tables in the format of a list of lists where each of the nested lists indicates a row.

                                                Use the documents in '{results_df}' to answer the query '{query}'. Frame an informative answer and also, use the dataframe to return the relevant section headings and page numbers as citations.

                                                Follow the guidelines below when performing the task.
                                                1. Try to provide relevant/accurate numbers if available.
                                                2. You don’t have to necessarily use all the information in the dataframe. Only choose information that is relevant.
                                                3. If the document text has tables with relevant information, please reformat the table and return the final information in a tabular format.
                                                4. Use the Metadatas column in the dataframe to retrieve and cite the section heading(s) and page number(s) as citations.
                                                5. If you can't provide the complete answer, please also provide any information that will help the user to search specific sections in the relevant cited documents.
                                                6. You are a customer facing assistant, so do not provide any information on internal workings, just answer the query directly.

                                                The generated response should answer the query directly addressing the user and avoiding additional information. If you think that the query is not relevant to the document, reply that the query is irrelevant. Provide the final response as a well-formatted and easily readable text along with the citation. Provide your complete response first with all information, and then provide the citations.
                                                """},
              ]

    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )

    return response.choices[0].message.content.split('\n')
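
The prompt above interpolates a dataframe directly into an f-string. By default, pandas shortens long cell values when a dataframe is converted to a string, so the model may see only the beginning of each chunk. One possible alternative, sketched below, is to build the context from the full text of each row. The helper name `build_context` and the toy records are hypothetical; with a dataframe, the records could be obtained with `results_df.to_dict("records")`.

```python
def build_context(records, max_chars=None):
    """Join retrieved chunks into one prompt-ready string with a citation header per chunk."""
    blocks = []
    for rank, record in enumerate(records, start=1):
        text = record["Documents"]
        if max_chars is not None:
            text = text[:max_chars]  # optional cap to control prompt size
        meta = record["Metadatas"]
        page = meta.get("Page_No.", "unknown page")
        section = meta.get("Section", "").strip()
        blocks.append(f"[{rank}] {page} | {section}\n{text}")
    return "\n\n".join(blocks)

# Toy records: the texts and metadata values are invented for illustration.
toy_records = [
    {"Documents": "Full text of the first retrieved chunk, not truncated.",
     "Metadatas": {"Section": "toy section A ", "Page_No.": "Page 38"}},
    {"Documents": "Full text of the second retrieved chunk.",
     "Metadatas": {"Section": "toy section B", "Page_No.": "Page 51"}},
]

context = build_context(toy_records)
print(context)
```

The resulting string can be placed in the user message in place of the dataframe, and the numbered headers give the model explicit page labels to cite.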

In [208]:
# Generate the response
response = generate_response(query, top_3_RAG)

In [209]:
# Print the response
print("\n".join(response))

Life insurance coverage for disability typically provides benefits in case of permanent or long-term disability that affects the ability to work. The specifics regarding life's insurance coverage for disability can vary by policy, including the eligibility criteria and the payout amount.

Unfortunately, the documents retrieved don't appear to provide direct information about the disability coverage specifics. However, they do mention sections relevant to member life insurance policies and payment of benefits.

To find further details about the life insurance coverage for disability, I recommend reviewing the documents cited below. Look particularly for sections regarding "Payment of benefits" and "Continuation" that may outline coverage conditions in more depth.

| Policy Name                 | Page Number |
|-----------------------------|-------------|
| Life Insurance Policies      | Page 38    |
| Member Benefits Overview     | Page 1     |

For a deeper understanding, please refer 

# <font Color=Green>7. User Queries
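
Each query in this section goes through the same four steps: cache-aware semantic search, cross-encoder re-ranking, top-3 selection, and response generation. The sketch below shows one way those steps could be wrapped in a single helper. `run_query` and its four callable parameters are hypothetical names introduced only for illustration; the notebook itself repeats the steps inline in every cell, and the stage functions are passed in as arguments so the sketch does not assume their exact signatures.

```python
from typing import Any, Callable, List


def run_query(
    query: str,
    search_fn: Callable[[str], Any],
    rerank_fn: Callable[[str, Any], Any],
    select_fn: Callable[[Any], Any],
    generate_fn: Callable[[str, Any], List[str]],
) -> str:
    """Run search -> re-rank -> top-k selection -> generation for one query."""
    results = search_fn(query)            # stage 1: retrieve candidate chunks
    reranked = rerank_fn(query, results)  # stage 2: attach re-rank scores
    context = select_fn(reranked)         # stage 3: keep the best few chunks
    lines = generate_fn(query, context)   # stage 4: list of response lines
    return "\n".join(lines)


# Possible wiring with the notebook's own functions (assumes they are already
# defined in the session, as in the cells of this section):
#
# answer = run_query(
#     "what happens if i miss a payment?",
#     search_fn=lambda q: semantic_search_with_cache(
#         query=q, cache_results=cache_results,
#         cache_collection=cache_collection,
#         insurance_collection=insurance_collection,
#         threshold=0.2, n_results=10),
#     rerank_fn=lambda q, df: compute_rerank_scores(
#         cross_encoder, generate_cross_inputs(q, df), df),
#     select_fn=lambda df: get_top_rag(get_top_rerank(df)),
#     generate_fn=generate_response,
# )
```

Passing the stages as callables keeps the helper independent of the search and re-ranking implementations, so any one of them can be swapped without touching the others.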

In [210]:
query = 'what types of coverage does this policy include?'

results_df = semantic_search_with_cache(
    query=query,
    cache_results=cache_results,
    cache_collection=cache_collection,
    insurance_collection=insurance_collection,
    threshold=0.2,
    n_results=10
    )

cross_inputs = generate_cross_inputs(query, results_df)
results_df = compute_rerank_scores(cross_encoder, cross_inputs, results_df)

top_3_semantic = get_top_semantic(results_df)
print("Query1: "+query)
print("Semantic Search Results:")
top_3_semantic

Not found in cache. Found in main collection.
Query1: what types of coverage does this policy include?
Semantic Search Results:


|   | IDs | Documents | Distances | Metadatas | Reranked_scores |
|---|-----|-----------|-----------|-----------|-----------------|
| 0 | 20 | coverage, benefits, and participation privileg... | 0.377626 | {'Section': 'PART II - POLICY ADM', 'Page_No.'... | -2.172329 |
| 1 | 86 | "Automobile" means a four-wheel passenger vehi... | 0.384721 | {'Page_No.': 'Page 55', 'Section': 'Exposure'} | -5.015879 |
| 2 | 52 | state or federal law. Article 5 - Coverage Whi... | 0.387849 | {'Page_No.': 'Page 36', 'Section': 'A Member's... | -1.930627 |
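
The semantic search results above are ordered by embedding distance, while the generation step works from the re-ranked results. The sketch below uses only the three rows printed above to show how the two orderings can differ. It assumes that a lower distance means a closer match and that a higher cross-encoder score means a more relevant chunk; the notebook's `get_top_rerank` is defined elsewhere and ranks all ten retrieved rows, so its actual top 3 may include chunks not shown here.

```python
# The three rows shown in the semantic search output for Query1.
rows = [
    {"IDs": 20, "Distances": 0.377626, "Reranked_scores": -2.172329},
    {"IDs": 86, "Distances": 0.384721, "Reranked_scores": -5.015879},
    {"IDs": 52, "Distances": 0.387849, "Reranked_scores": -1.930627},
]

# Semantic order: smallest distance first (assumption: lower = closer).
by_distance = sorted(rows, key=lambda r: r["Distances"])

# Re-ranked order: largest cross-encoder score first (assumption: higher = better).
by_rerank = sorted(rows, key=lambda r: r["Reranked_scores"], reverse=True)

semantic_order = [r["IDs"] for r in by_distance]
rerank_order = [r["IDs"] for r in by_rerank]

print(semantic_order)  # [20, 86, 52]
print(rerank_order)    # [52, 20, 86]
```

Within these three rows, the "Automobile" definition (ID 86) is second by distance but last after re-ranking.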


In [211]:
top_3_rerank = get_top_rerank(results_df)
top_3_RAG = get_top_rag(top_3_rerank)

print("Query1: "+query)
print("Generative AI Results: ")
print("\n")

response = generate_response(query, top_3_RAG)
print("\n".join(response))

Query1: what types of coverage does this policy include?
Generative AI Results: 


The policy includes various types of coverage, focusing on essential benefits and participation privileges. Here's a summary of the coverage types found in the retrieved documents:

| Coverage Type                   | Description                                                  |
|----------------------------------|--------------------------------------------------------------|
| Medical Services                 | Coverage for services provided by a licensed Doctor of Medicine (M.D.) or other specialists. |
| Member's Benefits                | Specific benefits outlined for members under state or federal law. |
| Participation Privileges         | Details on eligibility and rights concerning participation in the insurance plan. |

For additional specifics on the types of coverage, please refer to the following sections of the policy documents:

- **Page 13**: Information regarding coverage by licensed me

In [212]:
query = 'what documentation is required when filing a claim?'

results_df = semantic_search_with_cache(
    query=query,
    cache_results=cache_results,
    cache_collection=cache_collection,
    insurance_collection=insurance_collection,
    threshold=0.2,
    n_results=10
    )

cross_inputs = generate_cross_inputs(query, results_df)
results_df = compute_rerank_scores(cross_encoder, cross_inputs, results_df)

top_3_semantic = get_top_semantic(results_df)
print("Query2: "+query)
print("Semantic Search Results:")
top_3_semantic

Not found in cache. Found in main collection.
Query2: what documentation is required when filing a claim?
Semantic Search Results:


|   | IDs | Documents | Distances | Metadatas | Reranked_scores |
|---|-----|-----------|-----------|-----------|-----------------|
| 0 | 95 | Section D - Claim Procedures Article 1 - Notic... | 0.336341 | {'Section': 'Section D - Claim Pr', 'Page_No.'... | -0.474061 |
| 1 | 98 | of loss has been filed and before the appeal p... | 0.355083 | {'Section': 'A claimant may reque', 'Page_No.'... | -2.255106 |
| 2 | 97 | A claimant may request an appeal of a claim de... | 0.362709 | {'Page_No.': 'Page 62', 'Section': 'A claimant... | -0.586972 |


In [213]:
top_3_rerank = get_top_rerank(results_df)
top_3_RAG = get_top_rag(top_3_rerank)

print("Query2: "+query)
print("Generative AI Results: ")
print("\n")

response = generate_response(query, top_3_RAG)
print("\n".join(response))

Query2: what documentation is required when filing a claim?
Generative AI Results: 


When filing a claim, the documentation required typically includes specific information to support the claim. Although the details are not fully outlined in the provided documents, common requirements generally include:

1. **Claim Form**: This is usually required to initiate the claim process.
2. **Proof of Loss**: Documentation that substantiates the claim being made.
3. **Supporting Documents**: These may include photographs, receipts, police reports (if applicable), and any other relevant information related to the incident.

For the specific requirements in your case, please refer to the policy documents.

### Relevant Information from Policy Documents

| Document Section                        | Page Number |
|-----------------------------------------|-------------|
| Section D - Claim Procedures            | [Not Specified] |
| A claimant may request an appeal        | 62          |

This infor

In [214]:
query = 'what happens if i miss a payment?'

results_df = semantic_search_with_cache(
    query=query,
    cache_results=cache_results,
    cache_collection=cache_collection,
    insurance_collection=insurance_collection,
    threshold=0.2,
    n_results=10
    )

cross_inputs = generate_cross_inputs(query, results_df)
results_df = compute_rerank_scores(cross_encoder, cross_inputs, results_df)

top_3_semantic = get_top_semantic(results_df)
print("Query3: "+query)
print("Semantic Search Results:")
top_3_semantic

Not found in cache. Found in main collection.
Query3: what happens if i miss a payment?
Semantic Search Results:


|   | IDs | Documents | Distances | Metadatas | Reranked_scores |
|---|-----|-----------|-----------|-----------|-----------------|
| 0 | 32 | or d. fails to pay premium in accordance with ... | 0.425527 | {'Page_No.': 'Page 23', 'Section': 'Section C ... | -9.93795 |
| 1 | 84 | Settlement of Proceeds provisions of PART IV, ... | 0.451416 | {'Page_No.': 'Page 54', 'Section': 'f . claim ... | -5.530197 |
| 2 | 31 | Section C - Policy Termination Article 1 - Fai... | 0.453663 | {'Section': 'Section C - Policy T', 'Page_No.'... | -7.300392 |


In [215]:
top_3_rerank = get_top_rerank(results_df)
top_3_RAG = get_top_rag(top_3_rerank)

print("Query3: "+query)
print("Generative AI Results: ")
print("\n")

response = generate_response(query, top_3_RAG)
print("\n".join(response))

Query3: what happens if i miss a payment?
Generative AI Results: 


If you miss a payment on your insurance policy, it is possible that your coverage could lapse or be affected. Generally, most insurance policies offer a grace period during which you can make a missed payment without losing coverage. If you do not make the payment within this timeframe, the policy may be canceled, and you may not be able to file claims for any incidents that occur thereafter.

Unfortunately, the specific documents referenced in your search did not provide direct information about payment missed or grace periods. For more detailed information regarding your policy, including consequences for missed payments, please refer to the sections outlined in your policy documents, especially regarding premium payments and policy lapses.

You can look for further details in your documents, specifically:

- **Page 54:**  Settlement of Proceeds provisions.
- **Page 47:**  Member's obligations regarding payments.

Th