## 1. <font color = red> Install and Import the Required Libraries

In [None]:
# Install all the required libraries

!pip install pdfplumber
!pip install tiktoken
!pip install openai
!pip install chromadb
!pip install sentence-transformers



In [None]:
# Import all the required Libraries

import pdfplumber
from pathlib import Path
import pandas as pd
from operator import itemgetter
import json
import tiktoken
import openai
import chromadb

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## 2. <font color = red> Read, Process, and Chunk the PDF Files

We will be using [pdfplumber](https://pypi.org/project/pdfplumber/) to read and process the PDF files.

`pdfplumber` allows for better parsing of the PDF file because it can read elements other than plain text, such as tables and images. It also offers a wide range of functionality and visual debugging features that help with more advanced preprocessing.

In [None]:
# Define the path of the PDF
single_pdf_path = "/content/drive/My Drive/GenerativeAI/MateAI/Policy_Documents/Policy-Document.pdf"

#### <font color = red>  2.1 Reading a single PDF file and exploring it through pdfplumber

In [None]:
# Open the PDF file
with pdfplumber.open(single_pdf_path) as pdf:

    # Get one of the pages from the PDF and examine it (index 6 is the seventh page)
    single_page = pdf.pages[6]

    # Extract text from the selected page
    text = single_page.extract_text()

    # Extract tables from the selected page
    tables = single_page.extract_tables()

    # Print the extracted text
    print(text)

Part C
1. Benefits:
(1) Benefits on Death or diagnosis of contingency covered –
Plan Option Events Benefit
Life Death In the event of the death of the Scheme Member, the
benefit payable shall be the Sum Assured.
Extra Life Option Death In the event of the death of the Scheme Member, the
benefit payable shall be the Sum Assured.
Accidental Death In event of the Scheme Member’s death due to
Accident, an additional death benefit equal to the Sum
Assured will be payable.
This is in addition to the death benefit mentioned
above
Accelerated Critical Illness Death In the event of the death of the Scheme Member, the
Option benefit payable shall be the Sum Assured.
Diagnosis of a In the event of Scheme Member being diagnosed with
Critical Illness any of the covered Critical Illnesses during the Policy
Term, the benefit payable shall be the Sum Assured
and the policy will terminate.
a. The Policy Term, Sum Assured, Cover option, and Mode of Premium Payment will be chosen by Scheme
Member and the

In [None]:
# View the table in the page, if any

tables[0]

[['Plan Option', 'Events', 'Benefit'],
 ['Life',
  'Death',
  'In the event of the death of the Scheme Member, the\nbenefit payable shall be the Sum Assured.'],
 ['Extra Life Option',
  'Death',
  'In the event of the death of the Scheme Member, the\nbenefit payable shall be the Sum Assured.'],
 [None,
  'Accidental Death',
  'In event of the Scheme Member’s death due to\nAccident, an additional death benefit equal to the Sum\nAssured will be payable.\nThis is in addition to the death benefit mentioned\nabove'],
 ['Accelerated Critical Illness\nOption',
  'Death',
  'In the event of the death of the Scheme Member, the\nbenefit payable shall be the Sum Assured.'],
 [None,
  'Diagnosis of a\nCritical Illness',
  'In the event of Scheme Member being diagnosed with\nany of the covered Critical Illnesses during the Policy\nTerm, the benefit payable shall be the Sum Assured\nand the policy will terminate.']]

#### <font color = red> 2.2 Extracting text from multiple PDFs

Let's now read multiple documents, extract their text with appropriate preprocessing, and store it in a dataframe.


In [None]:
# Define the path where all pdf documents are present

pdf_path = "/content/drive/My Drive/GenerativeAI/MateAI/Policy_Documents/"

In [None]:
# Function to check whether a word is present in a table or not for segregation of regular text and tables

def check_bboxes(word, table_bbox):
    # Check whether word is inside a table bbox.
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]
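
To see how the containment test behaves, here is a minimal sketch that calls the same function on hand-made inputs. The bounding box and the word dictionaries are hypothetical values, shaped like what `page.find_tables()` and `page.extract_words()` return, and are not taken from the policy documents.

```python
# Same containment test as check_bboxes above, repeated so this sketch runs on its own
def check_bboxes(word, table_bbox):
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]

# Hypothetical table bounding box in the order (x0, top, x1, bottom)
table_bbox = (50, 100, 500, 300)

# Hypothetical words: one inside the table, one above it, one touching its left edge
word_inside = {'text': 'Benefit', 'x0': 60, 'top': 110, 'x1': 100, 'bottom': 120}
word_outside = {'text': 'Part', 'x0': 60, 'top': 40, 'x1': 90, 'bottom': 50}
word_on_edge = {'text': 'Plan', 'x0': 50, 'top': 110, 'x1': 80, 'bottom': 120}

inside = check_bboxes(word_inside, table_bbox)
outside = check_bboxes(word_outside, table_bbox)
on_edge = check_bboxes(word_on_edge, table_bbox)

print(inside, outside, on_edge)  # True False False
```

Because the comparisons are strict, a word that sits exactly on the table border is treated as regular text rather than table content.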

In [None]:
# Function to extract text from a PDF file.
# 1. Declare a counter 'p' that tracks the loop iteration so that page numbers can be stored alongside the text
# 2. Declare an empty list 'full_text' to store the text of every page
# 3. Use pdfplumber to open the pdf pages one by one
# 4. Find the tables and their locations in the page
# 5. Extract the text from the tables in the variable 'tables'
# 6. Extract the regular words by calling the function check_bboxes() and checking whether words are present in the table or not
# 7. Use the cluster_objects utility to cluster non-table and table words together so that they retain the same chronology as in the original PDF
# 8. Declare an empty list 'lines' to store the page text
# 9. If a text element is present in the cluster, append it to 'lines'; else if a table element is present, append the table
# 10. Append the page number and all lines to full_text, and increment 'p'
# 11. When the function has iterated over all pages, return the 'full_text' list

def extract_text_from_pdf(pdf_path):
    p = 0
    full_text = []


    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_no = f"Page {p+1}"

            tables = page.find_tables()
            table_bboxes = [i.bbox for i in tables]
            tables = [{'table': i.extract(), 'top': i.bbox[1]} for i in tables]
            non_table_words = [word for word in page.extract_words() if not any(
                [check_bboxes(word, table_bbox) for table_bbox in table_bboxes])]
            lines = []

            for cluster in pdfplumber.utils.cluster_objects(non_table_words + tables, itemgetter('top'), tolerance=5):

                if 'text' in cluster[0]:
                    try:
                        lines.append(' '.join([i['text'] for i in cluster]))
                    except KeyError:
                        pass

                elif 'table' in cluster[0]:
                    lines.append(json.dumps(cluster[0]['table']))


            full_text.append([page_no, " ".join(lines)])
            p +=1

    return full_text
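
The ordering step is easier to follow on toy data. The sketch below is a simplified, pure-Python stand-in for the clustering performed by `cluster_objects`, not pdfplumber's own implementation: it sorts words and tables by their `top` coordinate and groups neighbours that lie within a tolerance. The function name `order_page_elements` and the sample coordinates are assumptions made for illustration.

```python
import json
from operator import itemgetter

def order_page_elements(words, tables, tolerance=5):
    """Group words and tables into lines by vertical position (simplified stand-in)."""
    elements = sorted(words + tables, key=itemgetter('top'))
    clusters = []
    for element in elements:
        # Join the current cluster if the vertical gap to the previous element is small
        if clusters and element['top'] - clusters[-1][-1]['top'] <= tolerance:
            clusters[-1].append(element)
        else:
            clusters.append([element])

    lines = []
    for cluster in clusters:
        if 'text' in cluster[0]:
            # Regular text: join the words of the line
            lines.append(' '.join(item['text'] for item in cluster if 'text' in item))
        elif 'table' in cluster[0]:
            # Table: serialise it as JSON so that its row structure survives
            lines.append(json.dumps(cluster[0]['table']))
    return lines

# Hypothetical page content: a heading, a table, and a note below the table
words = [
    {'text': 'Part', 'top': 10},
    {'text': 'C', 'top': 11},
    {'text': 'Note', 'top': 200},
    {'text': 'applies', 'top': 202},
]
tables = [{'table': [['Plan Option', 'Events'], ['Life', 'Death']], 'top': 50}]

lines = order_page_elements(words, tables)
print(lines)
```

The table ends up between the heading and the note, which is the reading order of the hypothetical page.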

In [None]:
# Define the directory containing the PDF files
pdf_directory = Path(pdf_path)

# Initialize an empty list to store the extracted texts and document names
data = []

# Loop through all files in the directory
for pdf_path in pdf_directory.glob("*.pdf"):

    # Process the PDF file
    print(f"...Processing {pdf_path.name}")

    # Call the function to extract the text from the PDF
    extracted_text = extract_text_from_pdf(pdf_path)

    # Convert the extracted list to a DataFrame, and add a column to store document names
    extracted_text_df = pd.DataFrame(extracted_text, columns=['Page No.', 'Page_Text'])
    extracted_text_df['Document Name'] = pdf_path.name

    # Append the extracted text and document name to the list
    data.append(extracted_text_df)

    # Print a message to indicate progress
    print(f"Finished processing {pdf_path.name}")

# Print a message to indicate all PDFs have been processed
print("All PDFs have been processed.")

...Processing HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1).pdf
Finished processing HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1).pdf
...Processing HDFC-Life-Sanchay-Plus-Life-Long-Income-Option-101N134V19-Policy-Document.pdf
Finished processing HDFC-Life-Sanchay-Plus-Life-Long-Income-Option-101N134V19-Policy-Document.pdf
...Processing HDFC-Surgicare-Plan-101N043V01.pdf
Finished processing HDFC-Surgicare-Plan-101N043V01.pdf
...Processing HDFC-Life-Group-Term-Life-Policy.pdf
Finished processing HDFC-Life-Group-Term-Life-Policy.pdf
...Processing HDFC-Life-Group-Poorna-Suraksha-101N137V02-Policy-Document.pdf
Finished processing HDFC-Life-Group-Poorna-Suraksha-101N137V02-Policy-Document.pdf
...Processing HDFC-Life-Easy-Health-101N110V03-Policy-Bond-Single-Pay.pdf
Finished processing HDFC-Life-Easy-Health-101N110V03-Policy-Bond-Single-Pay.pdf
...Processing HDFC-Life-Smart-Pension-Plan-Policy-Document-Online.pdf
Finished processing HDFC-Life-Smart-Pension-Plan-Policy-

In [None]:
# Concatenate all the DFs in the list 'data' together
insurance_pdfs_data = pd.concat(data, ignore_index=True)

In [None]:
insurance_pdfs_data

Unnamed: 0,Page No.,Page_Text,Document Name
0,Page 1,PART A: Covering Letter with Policy Schedule _...,HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-D...
1,Page 2,A.1. Policy Preamble HDFC Life Sampoorna Jeeva...,HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-D...
2,Page 3,Stamp Duty of Rs«ADDAMT» /- is paid as provide...,HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-D...
3,Page 4,11. Guaranteed Surrender Value (GSV)means the ...,HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-D...
4,Page 5,30. Regulations mean the laws and Regulations ...,HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-D...
...,...,...,...
212,Page 33,HDFC Life Smart Pension Plan 101L164V02 – Term...,HDFC-Life-Smart-Pension-Plan-Policy-Document-O...
213,Page 34,HDFC Life Smart Pension Plan 101L164V02 – Term...,HDFC-Life-Smart-Pension-Plan-Policy-Document-O...
214,Page 35,HDFC Life Smart Pension Plan 101L164V02 – Term...,HDFC-Life-Smart-Pension-Plan-Policy-Document-O...
215,Page 36,HDFC Life Smart Pension Plan 101L164V02 – Term...,HDFC-Life-Smart-Pension-Plan-Policy-Document-O...


In [None]:
# Check one of the extracted page texts to ensure that the text has been correctly read
insurance_pdfs_data.Page_Text[2]

'Stamp Duty of Rs«ADDAMT» /- is paid as provided under Article 47D(iii) of Indian Stamp Act, 1899 and included in Consolidated Stamp Duty Paid to the Government of Maharashtra Treasury vide Order of Addl. Controller Of Stamps, Mumbai at General Stamp Office, Fort, Mumbai - 400001., vide this Order No. (_/_/_/_/Validity Period Dt.__ To Dt.__ (O/w.No.__)/Date: __/_/__). The Modal Premium shown in the Policy Schedule above is exclusive of Goods and Services Tax (GST). GST at the applicable rates will be charged on Premiums paid. «Legend_schedule»«ADDAMT» PART B Important Terms and Definitions B.1. DEFINITIONS In this Policy, unless the context requires otherwise, the following words and expressions shall have the meaning ascribed to them respectively herein below: 1. Age shall be Age of Life Assured at Policy Commencement Date as at last birthday i.e. the Age in completed years and is recorded in the Policy Schedule based on the details provided by the Policyholder. 2. Basic Sum Assured m

In [None]:
# Also checking the length of all the texts as there might be some empty pages or pages with very few words that we can drop
insurance_pdfs_data['Text_Length'] = insurance_pdfs_data['Page_Text'].apply(lambda x: len(x.split(' ')))

In [None]:
insurance_pdfs_data['Text_Length']

0      351
1      395
2      564
3      512
4      568
      ... 
212    677
213    565
214    236
215    548
216    288
Name: Text_Length, Length: 217, dtype: int64

In [None]:
# Retain only the rows with a text length of at least 10
insurance_pdfs_data = insurance_pdfs_data.loc[insurance_pdfs_data['Text_Length'] >= 10]
insurance_pdfs_data

Unnamed: 0,Page No.,Page_Text,Document Name,Text_Length
0,Page 1,PART A: Covering Letter with Policy Schedule _...,HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-D...,351
1,Page 2,A.1. Policy Preamble HDFC Life Sampoorna Jeeva...,HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-D...,395
2,Page 3,Stamp Duty of Rs«ADDAMT» /- is paid as provide...,HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-D...,564
3,Page 4,11. Guaranteed Surrender Value (GSV)means the ...,HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-D...,512
4,Page 5,30. Regulations mean the laws and Regulations ...,HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-D...,568
...,...,...,...,...
212,Page 33,HDFC Life Smart Pension Plan 101L164V02 – Term...,HDFC-Life-Smart-Pension-Plan-Policy-Document-O...,677
213,Page 34,HDFC Life Smart Pension Plan 101L164V02 – Term...,HDFC-Life-Smart-Pension-Plan-Policy-Document-O...,565
214,Page 35,HDFC Life Smart Pension Plan 101L164V02 – Term...,HDFC-Life-Smart-Pension-Plan-Policy-Document-O...,236
215,Page 36,HDFC Life Smart Pension Plan 101L164V02 – Term...,HDFC-Life-Smart-Pension-Plan-Policy-Document-O...,548


In [None]:
# Store the metadata for each page in a separate column
# Take an explicit copy first: the dataframe was produced by a filter, and assigning a new column to such a slice raises pandas' SettingWithCopyWarning
insurance_pdfs_data = insurance_pdfs_data.copy()
insurance_pdfs_data['Metadata'] = insurance_pdfs_data.apply(lambda x: {'Policy_Name': x['Document Name'][:-4], 'Page_No.': x['Page No.']}, axis=1)


In [None]:
insurance_pdfs_data

Unnamed: 0,Page No.,Page_Text,Document Name,Text_Length,Metadata
0,Page 1,PART A: Covering Letter with Policy Schedule _...,HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-D...,351,{'Policy_Name': 'HDFC-Life-Sampoorna-Jeevan-10...
1,Page 2,A.1. Policy Preamble HDFC Life Sampoorna Jeeva...,HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-D...,395,{'Policy_Name': 'HDFC-Life-Sampoorna-Jeevan-10...
2,Page 3,Stamp Duty of Rs«ADDAMT» /- is paid as provide...,HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-D...,564,{'Policy_Name': 'HDFC-Life-Sampoorna-Jeevan-10...
3,Page 4,11. Guaranteed Surrender Value (GSV)means the ...,HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-D...,512,{'Policy_Name': 'HDFC-Life-Sampoorna-Jeevan-10...
4,Page 5,30. Regulations mean the laws and Regulations ...,HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-D...,568,{'Policy_Name': 'HDFC-Life-Sampoorna-Jeevan-10...
...,...,...,...,...,...
212,Page 33,HDFC Life Smart Pension Plan 101L164V02 – Term...,HDFC-Life-Smart-Pension-Plan-Policy-Document-O...,677,{'Policy_Name': 'HDFC-Life-Smart-Pension-Plan-...
213,Page 34,HDFC Life Smart Pension Plan 101L164V02 – Term...,HDFC-Life-Smart-Pension-Plan-Policy-Document-O...,565,{'Policy_Name': 'HDFC-Life-Smart-Pension-Plan-...
214,Page 35,HDFC Life Smart Pension Plan 101L164V02 – Term...,HDFC-Life-Smart-Pension-Plan-Policy-Document-O...,236,{'Policy_Name': 'HDFC-Life-Smart-Pension-Plan-...
215,Page 36,HDFC Life Smart Pension Plan 101L164V02 – Term...,HDFC-Life-Smart-Pension-Plan-Policy-Document-O...,548,{'Policy_Name': 'HDFC-Life-Smart-Pension-Plan-...


This also settles the chunking question. Most pages contain a few hundred words, with the longest running up to about 1,000, so we don't need to chunk the documents further and can embed individual pages. This strategy makes sense for two reasons:
1. The way insurance documents are generally structured, you will not have a lot of extraneous information in a page, and all the text pieces in that page will likely be interrelated.
2. We want to have larger chunk sizes to be able to pass appropriate context to the LLM during the generation layer.
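
If a corpus did contain much longer pages, a fallback splitter could be added at this point. The following is only a sketch under assumed limits: `split_into_chunks`, the `max_words` budget of 1000 and the `overlap` of 100 words are illustrative choices, not values used elsewhere in this notebook.

```python
def split_into_chunks(text, max_words=1000, overlap=100):
    """Return the page unchanged if it fits the budget, else overlapping word windows."""
    words = text.split()
    if len(words) <= max_words:
        return [text]

    chunks = []
    step = max_words - overlap  # each new window starts 'overlap' words before the previous one ends
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the page
    return chunks

# A short page stays whole
short_page = 'Part C Benefits on Death'
short_chunks = split_into_chunks(short_page)

# A hypothetical 2500-word page is split into three overlapping windows
long_page = ' '.join(f'w{i}' for i in range(2500))
long_chunks = split_into_chunks(long_page)

print(len(short_chunks), [len(c.split()) for c in long_chunks])  # 1 [1000, 1000, 700]
```

The overlap keeps a clause that straddles a window boundary available in both neighbouring chunks.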

## 3. <font color = red> Generate and Store Embeddings using OpenAI and ChromaDB

We will now embed the pages in the dataframe through OpenAI's `text-embedding-ada-002` model, and store them in a ChromaDB collection.

In [None]:
# Set the API key
filepath = '/content/drive/My Drive/GenerativeAI/MateAI/'

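with open(filepath + "Jasper_OpenAI_API_Key.txt", "r") as f:
  # Read the whole file and strip surrounding whitespace so that a trailing newline does not end up in the key
  openai.api_key = f.read().strip()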

In [None]:
# Confirm that a key was loaded without echoing the secret into the notebook output
bool(openai.api_key)

True

In [None]:
# Import the OpenAI Embedding Function into chroma
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

In [None]:
# Define the path where chroma collections will be stored
chroma_data_path = '/content/drive/My Drive/GenerativeAI/MateAI/Policy_Documents/ChromaDB_Data'

In [None]:
# Call PersistentClient()
client = chromadb.PersistentClient(path=chroma_data_path)

In [None]:
# Set up the embedding function using the OpenAI embedding model
model = "text-embedding-ada-002"
embedding_function = OpenAIEmbeddingFunction(api_key=openai.api_key, model_name=model)

In [None]:
embedding_function

<chromadb.utils.embedding_functions.OpenAIEmbeddingFunction at 0x7ebab9fb7490>

In [None]:
# Initialise a collection in chroma and pass the embedding_function to it so that it uses OpenAI embeddings to embed the documents
insurance_collection = client.get_or_create_collection(name='RAG_on_Insurance', embedding_function=embedding_function)

In [None]:
# peek() has to be called (with parentheses) to return the first few items in the collection
insurance_collection.peek()

In [None]:
# Convert the page text and metadata from your dataframe to lists to be able to pass it to chroma
documents_list = insurance_pdfs_data["Page_Text"].tolist()
metadata_list = insurance_pdfs_data['Metadata'].tolist()

In [None]:
metadata_list

[{'Policy_Name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1)',
  'Page_No.': 'Page 1'},
 {'Policy_Name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1)',
  'Page_No.': 'Page 2'},
 {'Policy_Name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1)',
  'Page_No.': 'Page 3'},
 {'Policy_Name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1)',
  'Page_No.': 'Page 4'},
 {'Policy_Name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1)',
  'Page_No.': 'Page 5'},
 {'Policy_Name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1)',
  'Page_No.': 'Page 6'},
 {'Policy_Name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1)',
  'Page_No.': 'Page 7'},
 {'Policy_Name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1)',
  'Page_No.': 'Page 8'},
 {'Policy_Name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1)',
  'Page_No.': 'Page 9'},
 {'Policy_Name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (

In [None]:
# Add the documents and metadata to the collection along with generic integer IDs. You can also build the IDs from the metadata by combining the policy name and page number.

insurance_collection.add(
    documents= documents_list,
    ids = [str(i) for i in range(0, len(documents_list))],
    metadatas = metadata_list
)
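
As the comment above suggests, the IDs can also be derived from the metadata instead of a running integer. Below is a minimal sketch of that idea; the helper `make_id`, the `::` separator and the sample metadata are assumptions for illustration. IDs need to be unique within a collection, so the sketch checks for duplicates before anything would be added.

```python
# Hypothetical metadata entries with the same keys as metadata_list above
sample_metadata = [
    {'Policy_Name': 'Policy-A', 'Page_No.': 'Page 1'},
    {'Policy_Name': 'Policy-A', 'Page_No.': 'Page 2'},
    {'Policy_Name': 'Policy-B', 'Page_No.': 'Page 1'},
]

def make_id(metadata):
    """Combine the policy name and page number into a readable ID."""
    page = metadata['Page_No.'].replace(' ', '_')
    return f"{metadata['Policy_Name']}::{page}"

readable_ids = [make_id(m) for m in sample_metadata]

# Guard against collisions, e.g. two files that share a name
if len(readable_ids) != len(set(readable_ids)):
    raise ValueError("Generated IDs are not unique")

print(readable_ids)  # ['Policy-A::Page_1', 'Policy-A::Page_2', 'Policy-B::Page_1']
```

Readable IDs make it possible to tell from a search result alone which policy and page a hit came from.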



In [None]:
# Let's take a look at the first few entries in the collection

insurance_collection.get(
    ids = ['0','1','2'],
    include = ['embeddings', 'documents', 'metadatas']
)

{'ids': ['0', '1', '2'],
 'embeddings': [[-0.014016683213412762,
   0.010641542263329029,
   0.008998112753033638,
   -0.0385458804666996,
   -0.011130495928227901,
   0.027870381250977516,
   0.008081324398517609,
   0.0034736113157123327,
   -0.020373085513710976,
   0.0019099768251180649,
   0.014682204462587833,
   0.041887067258358,
   -0.018987717106938362,
   0.01575518772006035,
   0.0039048416074365377,
   0.005802934058010578,
   0.016855333000421524,
   -0.006305470131337643,
   0.029636049643158913,
   -0.011884300038218498,
   -0.003908236976712942,
   0.015619366429746151,
   -0.006787633057683706,
   0.014464891515672207,
   0.002716411603614688,
   -0.016963990405201912,
   0.025710834190249443,
   -0.014913098886609077,
   -0.0013293438823893666,
   0.007585579063743353,
   -0.0070694610476493835,
   0.0037282747216522694,
   -0.0221794992685318,
   -0.012855417095124722,
   -0.00230555422604084,
   -0.0015059106517583132,
   -0.007042296696454287,
   -0.01534772478044

In [None]:
# Initialise a second collection that acts as the query cache; it uses the same OpenAI embedding function so that cached and new queries are embedded consistently
cache_collection = client.get_or_create_collection(name='RAG_on_Insurance_Cache', embedding_function=embedding_function)

In [None]:
cache_collection.peek()

{'ids': ['what are the accidental death benefits received in a life insurance'],
 'embeddings': [[-0.01305423490703106,
   0.005004005506634712,
   0.0016088089905679226,
   -0.04325496405363083,
   -0.03497149422764778,
   0.02387504279613495,
   -0.028667014092206955,
   -0.002954342169687152,
   -0.02739480882883072,
   0.022051548585295677,
   0.02418602630496025,
   0.031776849180459976,
   -0.010757198557257652,
   0.02687179110944271,
   -0.0077109746634960175,
   0.020312869921326637,
   0.038392312824726105,
   -0.0002411888272035867,
   0.019549546763300896,
   -0.013782218098640442,
   -0.00037967361276969314,
   0.018446968868374825,
   -0.02085002325475216,
   0.010375536978244781,
   0.0001631116756470874,
   -0.0013040099292993546,
   0.008318806067109108,
   -0.02657494507730007,
   0.006491778418421745,
   0.006293879821896553,
   0.02134476974606514,
   -0.01725957728922367,
   -0.028695285320281982,
   -0.021669888868927956,
   0.011004570871591568,
   0.011930453591

## 4. <font color = red> Semantic Search with Cache

We will now run a semantic search of a query against the collection's embeddings to retrieve the top semantically similar results.

In [None]:
# Read the user query
query = input()

What are the accidental death benefits and policy that are provided by the life insurance?


In [None]:
# Search the cache collection first
# Query the cache against the user query and return the single closest cached query

cache_results = cache_collection.query(
    query_texts=query,
    n_results=1
)

In [None]:
cache_results

{'ids': [['what are the accidental death benefits received in a life insurance']],
 'distances': [[0.1284160030439106]],
 'metadatas': [[{'distances0': '0.2894439697265625',
    'distances1': '0.322196900844574',
    'distances2': '0.32245951890945435',
    'distances3': '0.332907497882843',
    'distances4': '0.34236782789230347',
    'distances5': '0.34330233931541443',
    'distances6': '0.34550556540489197',
    'distances7': '0.3493587076663971',
    'distances8': '0.3595898151397705',
    'distances9': '0.3597981333732605',
    'documents0': 'Note: For the purpose of waiting period, Date of commencement or inception of coverage for a benefit option shall mean the date from which the member is covered under that benefit option. iv. Accidental Death Benefit Exclusions: 1. No Accidental Death Benefit will be payable if the death of the Scheme Members occurs after 180 days from the date of Accident. Specific Exclusions for this benefit are listed below We will not pay Accidental Deat

In [None]:
# Query the collection against the user query and return the top 10 results
results = insurance_collection.query(
          query_texts=query,
          n_results=10
          )

In [None]:
results

{'ids': [['124', '50', '18', '90', '116', '5', '195', '180', '117', '51']],
 'distances': [[0.2800048887729645,
   0.31305503845214844,
   0.3144375681877136,
   0.31846752762794495,
   0.32026463747024536,
   0.3227227032184601,
   0.3258749842643738,
   0.329505980014801,
   0.3329979181289673,
   0.3418991267681122]],
 'metadatas': [[{'Page_No.': 'Page 15',
    'Policy_Name': 'HDFC-Life-Group-Poorna-Suraksha-101N137V02-Policy-Document'},
   {'Page_No.': 'Page 13',
    'Policy_Name': 'HDFC-Life-Sanchay-Plus-Life-Long-Income-Option-101N134V19-Policy-Document'},
   {'Page_No.': 'Page 19',
    'Policy_Name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1)'},
   {'Page_No.': 'Page 11', 'Policy_Name': 'HDFC-Life-Group-Term-Life-Policy'},
   {'Page_No.': 'Page 7',
    'Policy_Name': 'HDFC-Life-Group-Poorna-Suraksha-101N137V02-Policy-Document'},
   {'Page_No.': 'Page 6',
    'Policy_Name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1)'},
   {'Page_No.': 'Page 23',
  

In [None]:
# Implementing Cache in Semantic Search

# Set a threshold for cache search
threshold = 0.2

ids = []
documents = []
distances = []
metadatas = []
results_df = pd.DataFrame()


# If the distance is greater than the threshold, then return the results from the main collection.

if cache_results['distances'][0] == [] or cache_results['distances'][0][0] > threshold:
      # Query the collection against the user query and return the top 10 results
      results = insurance_collection.query(
      query_texts=query,
      n_results=10
      )

      # Store the query in cache_collection as the document so that it can be embedded and searched against later
      # Store the retrieved text, ids, distances and metadatas in cache_collection as metadata, so that they can be fetched easily if a later query matches a cached one
      Keys = []
      Values = []

      for key, val in results.items():
        # Skip embeddings and any field that the query did not return as a list of results
        if key != 'embeddings' and val and isinstance(val[0], list):
          for i, item in enumerate(val[0]):
            if item is not None:
              # Append the key and value together so that the two lists stay aligned
              Keys.append(str(key)+str(i))
              Values.append(str(item))

      cache_collection.add(
          documents= [query],
          ids = [query],
          metadatas = [dict(zip(Keys, Values))]
      )

      print("Not found in cache. Found in main collection.")

      result_dict = {'Metadatas': results['metadatas'][0], 'Documents': results['documents'][0], 'Distances': results['distances'][0], "IDs":results["ids"][0]}
      results_df = pd.DataFrame.from_dict(result_dict)
      results_df


# If the distance is, however, less than the threshold, we can return the results from cache

elif cache_results['distances'][0][0] <= threshold:
      cache_result_dict = cache_results['metadatas'][0][0]

      # Loop through each inner list and then through the dictionary
      for key, value in cache_result_dict.items():
          if 'ids' in key:
              ids.append(value)
          elif 'documents' in key:
              documents.append(value)
          elif 'distances' in key:
              distances.append(value)
          elif 'metadatas' in key:
              metadatas.append(value)

      print("Found in cache!")

      # Create a DataFrame
      results_df = pd.DataFrame({
        'IDs': ids,
        'Documents': documents,
        'Distances': distances,
        'Metadatas': metadatas
      })


Found in cache!
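
The cache logic boils down to three small steps: decide whether the nearest cached query is close enough, flatten search results into a single metadata dictionary, and rebuild the lists from that dictionary on a hit. The sketch below isolates these steps in plain Python so they can be checked without a ChromaDB client. The helper names and the sample results are hypothetical; the 0.2 threshold matches the cell above.

```python
def use_cache(cache_distances, threshold=0.2):
    """True when a cached query exists and is within the distance threshold."""
    return bool(cache_distances) and cache_distances[0] <= threshold

def flatten_results(results):
    """Turn {'ids': [[...]], ...} into {'ids0': ..., 'ids1': ...} with string values."""
    flat = {}
    for key, val in results.items():
        if key == 'embeddings' or not val or not isinstance(val[0], list):
            continue
        for i, item in enumerate(val[0]):
            if item is not None:
                flat[f"{key}{i}"] = str(item)
    return flat

def unflatten_results(flat, fields=('ids', 'documents', 'distances', 'metadatas')):
    """Rebuild one list per field, ordered by the numeric suffix of each key."""
    restored = {}
    for field in fields:
        items = [(int(k[len(field):]), v) for k, v in flat.items() if k.startswith(field)]
        restored[field] = [v for _, v in sorted(items)]
    return restored

# Hypothetical search results in the shape returned by collection.query()
sample_results = {
    'ids': [['124', '50']],
    'distances': [[0.28, 0.31]],
    'documents': [['doc a', 'doc b']],
    'metadatas': [[{'Page_No.': 'Page 15'}, {'Page_No.': 'Page 13'}]],
    'embeddings': None,
}

flat = flatten_results(sample_results)
restored = unflatten_results(flat)

print(use_cache([0.128]), use_cache([0.29]), use_cache([]))  # True False False
print(restored['ids'], restored['distances'])
```

Note that every value comes back as a string after the round trip, which is why the distances and metadata in a cache hit are strings rather than floats and dictionaries.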


In [None]:
results_df

Unnamed: 0,IDs,Documents,Distances,Metadatas
0,124,"Note: For the purpose of waiting period, Date ...",0.2894439697265625,"{'Page_No.': 'Page 15', 'Policy_Name': 'HDFC-L..."
1,90,PART C PRODUCT CORE BENEFITS BENEFITS PAYABLE ...,0.322196900844574,"{'Page_No.': 'Page 11', 'Policy_Name': 'HDFC-L..."
2,5,PART C Product Core Benefits BENEFITS PAYABLE ...,0.3224595189094543,"{'Page_No.': 'Page 6', 'Policy_Name': 'HDFC-Li..."
3,50,HDFC Life Sanchay Plus (UIN – 101N134V19) – Ap...,0.332907497882843,"{'Page_No.': 'Page 13', 'Policy_Name': 'HDFC-L..."
4,116,Part C 1. Benefits: (1) Benefits on Death or d...,0.3423678278923034,"{'Page_No.': 'Page 7', 'Policy_Name': 'HDFC-Li..."
5,180,HDFC Life Smart Pension Plan 101L164V02 – Term...,0.3433023393154144,"{'Page_No.': 'Page 8', 'Policy_Name': 'HDFC-Li..."
6,18,"(i) Death Certificate, in original, issued by ...",0.3455055654048919,"{'Page_No.': 'Page 19', 'Policy_Name': 'HDFC-L..."
7,117,"[[""21. Progressive\nScleroderma"", ""22. Muscula...",0.3493587076663971,"{'Page_No.': 'Page 8', 'Policy_Name': 'HDFC-Li..."
8,113,Part B Definitions The following capitalized t...,0.3595898151397705,"{'Page_No.': 'Page 4', 'Policy_Name': 'HDFC-Li..."
9,195,HDFC Life Smart Pension Plan 101L164V02 – Term...,0.3597981333732605,"{'Page_No.': 'Page 23', 'Policy_Name': 'HDFC-L..."


## 5. <font color = red> Re-Ranking with a Cross Encoder

Re-ranking the results obtained from the semantic search can sometimes significantly improve the relevance of the retrieved results. This is often done by pairing the query with each retrieved response and passing the pairs to a cross-encoder, which scores the relevance of each response with respect to the query.
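
The re-ranking step itself is independent of the model that produces the scores. As a rough sketch, the function below scores each (query, document) pair with a pluggable `score_fn` and keeps the highest-scoring documents. The word-overlap scorer is a toy stand-in used only so the example runs without downloading a model; it is not a cross-encoder, and the sample documents are made up.

```python
def rerank(query, documents, score_fn, top_k=3):
    """Score every (query, document) pair and return the top_k pairs, best first."""
    scored = [(score_fn(query, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

def overlap_score(query, document):
    """Toy relevance score: number of distinct words shared by query and document."""
    return len(set(query.lower().split()) & set(document.lower().split()))

sample_query = 'accidental death benefit'
sample_documents = [
    'premium rates by age group',
    'accidental death benefit is payable in addition to the sum assured',
    'death benefit payable shall be the sum assured',
]

top_results = rerank(sample_query, sample_documents, overlap_score, top_k=2)
for score, doc in top_results:
    print(score, doc)
```

With a real cross-encoder, `score_fn` would wrap a call to the model's `predict` method on a single `[query, document]` pair, and higher scores would again mean higher relevance.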

In [None]:
# Import the CrossEncoder class from sentence_transformers

from sentence_transformers import CrossEncoder, util

In [None]:
# Initialise the cross encoder model

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')



In [None]:
# Test the cross encoder model

scores = cross_encoder.predict([['Does the insurance cover diabetic patients?', 'The insurance policy covers some pre-existing conditions including diabetes, heart diseases, etc. The policy does not howev'],
                                ['Does the insurance cover diabetic patients?', 'The premium rates for various age groups are given as follows. Age group (<18 years): Premium rate']])

In [None]:
scores

array([  3.84676 , -11.252879], dtype=float32)

In [None]:
# Input (query, response) pairs for each of the top 10 responses received from the semantic search to the cross encoder
# Generate the cross_encoder scores for these pairs

cross_inputs = [[query, response] for response in results_df['Documents']]
cross_rerank_scores = cross_encoder.predict(cross_inputs)

In [None]:
cross_rerank_scores

array([ 4.0051785 ,  1.391854  , -0.4556378 , -0.0562063 ,  3.2025957 ,
       -1.5104895 , -1.7928175 , -1.8404415 ,  2.366757  , -0.37114334],
      dtype=float32)

In [None]:
# Store the rerank_scores in results_df

results_df['Reranked_scores'] = cross_rerank_scores

In [None]:
results_df

Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
0,124,"Note: For the purpose of waiting period, Date ...",0.2894439697265625,"{'Page_No.': 'Page 15', 'Policy_Name': 'HDFC-L...",4.340906
1,90,PART C PRODUCT CORE BENEFITS BENEFITS PAYABLE ...,0.322196900844574,"{'Page_No.': 'Page 11', 'Policy_Name': 'HDFC-L...",-0.221117
2,5,PART C Product Core Benefits BENEFITS PAYABLE ...,0.3224595189094543,"{'Page_No.': 'Page 6', 'Policy_Name': 'HDFC-Li...",-1.86434
3,50,HDFC Life Sanchay Plus (UIN – 101N134V19) – Ap...,0.332907497882843,"{'Page_No.': 'Page 13', 'Policy_Name': 'HDFC-L...",-2.41257
4,116,Part C 1. Benefits: (1) Benefits on Death or d...,0.3423678278923034,"{'Page_No.': 'Page 7', 'Policy_Name': 'HDFC-Li...",2.687955
5,180,HDFC Life Smart Pension Plan 101L164V02 – Term...,0.3433023393154144,"{'Page_No.': 'Page 8', 'Policy_Name': 'HDFC-Li...",-3.532527
6,18,"(i) Death Certificate, in original, issued by ...",0.3455055654048919,"{'Page_No.': 'Page 19', 'Policy_Name': 'HDFC-L...",-3.926629
7,117,"[[""21. Progressive\nScleroderma"", ""22. Muscula...",0.3493587076663971,"{'Page_No.': 'Page 8', 'Policy_Name': 'HDFC-Li...",-5.017279
8,113,Part B Definitions The following capitalized t...,0.3595898151397705,"{'Page_No.': 'Page 4', 'Policy_Name': 'HDFC-Li...",0.410579
9,195,HDFC Life Smart Pension Plan 101L164V02 – Term...,0.3597981333732605,"{'Page_No.': 'Page 23', 'Policy_Name': 'HDFC-L...",-2.686286


In [None]:
# Return the top 3 results from semantic search

top_3_semantic = results_df.sort_values(by='Distances')
top_3_semantic[:3]

Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
0,124,"Note: For the purpose of waiting period, Date ...",0.2894439697265625,"{'Page_No.': 'Page 15', 'Policy_Name': 'HDFC-L...",4.005178
1,90,PART C PRODUCT CORE BENEFITS BENEFITS PAYABLE ...,0.322196900844574,"{'Page_No.': 'Page 11', 'Policy_Name': 'HDFC-L...",1.391854
2,5,PART C Product Core Benefits BENEFITS PAYABLE ...,0.3224595189094543,"{'Page_No.': 'Page 6', 'Policy_Name': 'HDFC-Li...",-0.455638


In [None]:
# Return the top 3 results after reranking

top_3_rerank = results_df.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank[:3]

Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
0,124,"Note: For the purpose of waiting period, Date ...",0.2894439697265625,"{'Page_No.': 'Page 15', 'Policy_Name': 'HDFC-L...",4.005178
4,116,Part C 1. Benefits: (1) Benefits on Death or d...,0.3423678278923034,"{'Page_No.': 'Page 7', 'Policy_Name': 'HDFC-Li...",3.202596
8,113,Part B Definitions The following capitalized t...,0.3595898151397705,"{'Page_No.': 'Page 4', 'Policy_Name': 'HDFC-Li...",2.366757
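
The two cells above sort the same results in opposite directions: a smaller distance is better for semantic search, while a larger cross-encoder score is better after reranking. A minimal sketch of that difference in plain Python, using only the five rows whose distances and scores are visible in the two outputs above (the variable names are illustrative and not part of the notebook):

```python
# (ID, distance, reranked score) copied from the two result tables above.
rows = [
    (124, 0.2894439697265625, 4.005178),
    (90, 0.322196900844574, 1.391854),
    (5, 0.3224595189094543, -0.455638),
    (116, 0.3423678278923034, 3.202596),
    (113, 0.3595898151397705, 2.366757),
]

# Semantic search: ascending distance (closest embeddings first).
top_semantic = [row[0] for row in sorted(rows, key=lambda row: row[1])[:3]]

# Reranking: descending cross-encoder score (most relevant first).
top_rerank = [row[0] for row in sorted(rows, key=lambda row: row[2], reverse=True)[:3]]

# IDs that both rankings agree on.
shared = set(top_semantic) & set(top_rerank)

print(top_semantic)  # [124, 90, 5]
print(top_rerank)    # [124, 116, 113]
print(shared)        # {124}
```

Only one chunk survives in both top-3 lists here, which is why the reranked set, not the raw semantic set, is carried forward.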


In [None]:
top_3_RAG = top_3_rerank[["Documents", "Metadatas"]][:3]

In [None]:
top_3_RAG

Unnamed: 0,Documents,Metadatas
0,"Note: For the purpose of waiting period, Date ...","{'Page_No.': 'Page 15', 'Policy_Name': 'HDFC-L..."
4,Part C 1. Benefits: (1) Benefits on Death or d...,"{'Page_No.': 'Page 7', 'Policy_Name': 'HDFC-Li..."
8,Part B Definitions The following capitalized t...,"{'Page_No.': 'Page 4', 'Policy_Name': 'HDFC-Li..."


## 6. Retrieval Augmented Generation

Now that we have the final top search results, we can pass them to GPT-3.5 along with the user query and a well-engineered prompt to generate a direct answer to the query with citations, rather than returning whole pages or chunks.
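
One way to prepare the retrieved chunks for the prompt is to join them into a single plain-text context string, so that each chunk carries its policy name and page number for citation. The sketch below is only an illustration: `build_context`, the sample text, and the sample policy name are hypothetical, and it assumes each metadata entry is a dictionary with the `Policy_Name` and `Page_No.` keys seen in the results above.

```python
def build_context(chunks):
    """Join (document_text, metadata) pairs into one prompt-ready string."""
    blocks = []
    for number, (text, metadata) in enumerate(chunks, start=1):
        # Fall back to a placeholder if a metadata key is missing.
        policy = metadata.get("Policy_Name", "unknown policy")
        page = metadata.get("Page_No.", "unknown page")
        blocks.append(f"[Result {number}] Source: {policy}, {page}\n{text}")
    return "\n\n".join(blocks)


# Hypothetical sample data, standing in for the top 3 reranked chunks.
sample_chunks = [
    ("Sample text of the first chunk.", {"Policy_Name": "Example-Policy", "Page_No.": "Page 7"}),
    ("Sample text of the second chunk.", {"Policy_Name": "Example-Policy", "Page_No.": "Page 4"}),
]

context = build_context(sample_chunks)
print(context)
```

With the notebook's dataframe, the pairs could be produced with `zip(top_3_RAG["Documents"], top_3_RAG["Metadatas"])`, provided the `Metadatas` column holds dictionaries rather than their string form.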

In [None]:
!pip install openai==0.28



In [None]:
# Define the function to generate the response. The prompt passes the user query and the full text of the top 3 results to the model

def generate_response(query, results_df):
    """
    Generate a response using GPT-3.5's ChatCompletion based on the user query and retrieved information.
    """
    # Render the dataframe passed in as an argument (not the global top_3_RAG) as full text.
    # Formatting a dataframe directly inside an f-string uses its truncated display form,
    # so the model would only see the first few words of each document.
    context = results_df.to_string(index=False)

    messages = [
                {"role": "system", "content":  "You are a helpful assistant in the insurance domain who can effectively answer user queries about insurance policies and documents."},
                {"role": "user", "content": f"""You have a question asked by the user in '{query}' and you have some search results from a corpus of insurance documents in the dataframe '{context}'. These search results are essentially one page of an insurance document that may be relevant to the user query.

                                                The column 'Documents' inside this dataframe contains the actual text from the policy document and the column 'Metadatas' contains the policy name and source page. The text inside the document may also contain tables in the format of a list of lists where each of the nested lists indicates a row.

                                                Use the documents in the dataframe above to answer the query '{query}'. Frame an informative answer and also use the dataframe to return the relevant policy names and page numbers as citations.

                                                Follow the guidelines below when performing the task.
                                                1. Try to provide relevant/accurate numbers if available.
                                                2. You don’t have to necessarily use all the information in the dataframe. Only choose information that is relevant.
                                                3. If the document text has tables with relevant information, please reformat the table and return the final information in a tabular format.
                                                4. Use the 'Metadatas' column in the dataframe to retrieve and cite the policy name(s) and page number(s) as citations.
                                                5. If you can't provide the complete answer, please also provide any information that will help the user to search specific sections in the relevant cited documents.
                                                6. You are a customer-facing assistant, so do not provide any information on internal workings, just answer the query directly.

                                                The generated response should answer the query directly, addressing the user and avoiding additional information. If you think that the query is not relevant to the document, reply that the query is irrelevant. Provide the final response as well-formatted and easily readable text along with the citations. Provide your complete response first with all information, and then provide the citations.

                                                """},
              ]

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages
    )

    # Return the answer as a list of lines
    return response['choices'][0]['message']['content'].split('\n')

In [None]:
# Generate the response

response = generate_response(query, top_3_RAG)

In [None]:
# Print the response

print("\n".join(response))

Accidental death benefits are a type of coverage provided by life insurance policies. In the context of life insurance, accidental death benefits refer to the additional coverage provided in case the insured dies as a result of an accident. This coverage is designed to provide financial support to the beneficiary in the event of accidental death.

Based on the search results from the insurance documents, it appears that the documents contain relevant information about accidental death benefits and policies. However, without access to the actual text of the documents, it is not possible to provide specific details.

To find the specific information on accidental death benefits and policies provided by life insurance, I recommend referring to the following policy documents:

1. HDFC-Life Insurance Policy Document (Page 7): This document contains information regarding the benefits on death or disability, which may include accidental death benefits. Please refer to this document for detail