## 1. Install Required Libraries

In [160]:
!pip install pdfplumber tiktoken openai chromadb sentence_transformers



In [161]:
import pdfplumber
from pathlib import Path
import pandas as pd
from operator import itemgetter
import json
import tiktoken
import openai
from sentence_transformers import CrossEncoder, util
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
import chromadb

## 2. Read, Process, and Chunk the PDF Files

We will be using [pdfplumber](https://pypi.org/project/pdfplumber/) to read and process the PDF files.

`pdfplumber` allows for better parsing of the PDF file because it can read elements other than plain text, such as tables and images. It also offers a wide range of functionality, including visual debugging features, to help with advanced preprocessing.

#### 2.1 Reading a single PDF file and exploring it through pdfplumber

In [162]:
pdf_path='./Principal-Sample-Life-Insurance-Policy.pdf'
with pdfplumber.open(pdf_path) as pdf:
    table_of_content = pdf.pages[5]
    text = table_of_content.extract_text()
    print(text)

TABLE OF CONTENTS
PART I - DEFINITIONS
PART II - POLICY ADMINISTRATION
Section A – Contract
Entire Contract Article 1
Policy Changes Article 2
Policyholder Eligibility Requirements Article 3
Policy Incontestability Article 4
Individual Incontestability Article 5
Information to be Furnished Article 6
Certificates Article 7
Assignments Article 8
Dependent Rights Article 9
Policy Interpretation Article 10
Electronic Transactions Article 11
Section B – Premium
Payment Responsibility; Due Dates; Grace Period Article 1
Premium Rates Article 2
Premium Rate Changes Article 3
Premium Amount Article 4
Contributions from Members Article 5
Section C - Policy Termination
Failure to Pay Premium Article 1
Termination Rights of the Policyholder Article 2
Termination Rights of The Principal Article 3
Policyholder Responsibility to Members Article 4
Section D - Policy Renewal
Renewal Article 1
PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS
This policy has been updated effective January 1, 2014
GC 6001 TA

#### 2.2 Extracting text from multiple PDFs

Let's now read multiple documents, extract their text with appropriate preprocessing, and store the results in a DataFrame.


In [163]:
def check_bboxes(word, bbox):
    """
    Function to check if a word's bbox is within a table's bbox.
    """
    word_bbox = (word['x0'], word['top'], word['x1'], word['bottom'])
    return (word_bbox[0] >= bbox[0] and word_bbox[2] <= bbox[2] and
            word_bbox[1] >= bbox[1] and word_bbox[3] <= bbox[3])
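
As a quick sanity check of the containment test, the sketch below calls the same function on three hand-made word dictionaries. The coordinates and the table bounding box are made-up values for illustration only; the tuple order `(x0, top, x1, bottom)` is assumed to match what the function above indexes into.

```python
def check_bboxes(word, bbox):
    """Return True if the word's bounding box lies entirely inside bbox."""
    word_bbox = (word['x0'], word['top'], word['x1'], word['bottom'])
    return (word_bbox[0] >= bbox[0] and word_bbox[2] <= bbox[2] and
            word_bbox[1] >= bbox[1] and word_bbox[3] <= bbox[3])

# Hypothetical table region: (x0, top, x1, bottom)
table_bbox = (50, 100, 300, 200)

# Made-up words with the keys that check_bboxes reads
inside_word = {'text': 'Premium', 'x0': 60, 'top': 110, 'x1': 120, 'bottom': 125}
outside_word = {'text': 'Heading', 'x0': 60, 'top': 40, 'x1': 120, 'bottom': 55}
straddling_word = {'text': 'Total', 'x0': 280, 'top': 110, 'x1': 320, 'bottom': 125}

inside = check_bboxes(inside_word, table_bbox)          # True
outside = check_bboxes(outside_word, table_bbox)        # False
straddling = check_bboxes(straddling_word, table_bbox)  # False: x1 exceeds the right edge

print(inside, outside, straddling)
```

Because a word must lie fully inside the box, a word that straddles a table border is treated as non-table text.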


In [164]:
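def extract_text_from_pdf(pdf_path):
    """
    Extract text page by page, serializing tables as JSON strings,
    and return a DataFrame with one row per page.
    """
    full_text = []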
    with pdfplumber.open(pdf_path) as pdf:
        for p, page in enumerate(pdf.pages, start=1):
            page_no = f"Page {p}"
            text = page.extract_text()
            
            # Extract heading (if text exists)
            heading = text.split('\n')[0].strip() if text else None

            tables = page.find_tables()
            table_bboxes = [i.bbox for i in tables]
            tables = [{'table': i.extract(), 'top': i.bbox[1]} for i in tables]
            non_table_words = [word for word in page.extract_words() if not any(
                check_bboxes(word, table_bbox) for table_bbox in table_bboxes)]
            lines = []

            for cluster in pdfplumber.utils.cluster_objects(non_table_words + tables, itemgetter('top'), tolerance=5):
                if 'text' in cluster[0]:
                    try:
                        lines.append(' '.join([i['text'] for i in cluster]))
                    except KeyError:
                        pass
                elif 'table' in cluster[0]:
                    lines.append(json.dumps(cluster[0]['table']))

            full_text.append([page_no, heading, " ".join(lines)])

    # Convert the extracted data to a DataFrame
    df = pd.DataFrame(full_text, columns=['Page_Number', 'Heading', 'Text'])
    
    return df

In [165]:
df = extract_text_from_pdf(pdf_path)

In [166]:
df.head(10)

Unnamed: 0,Page_Number,Heading,Text
0,Page 1,DOROTHEA GLAUSE S655,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...
1,Page 2,This page left blank intentionally,This page left blank intentionally
2,Page 3,POLICY RIDER,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...
3,Page 4,This page left blank intentionally,This page left blank intentionally
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY,PRINCIPAL LIFE INSURANCE COMPANY (called The P...
5,Page 6,TABLE OF CONTENTS,TABLE OF CONTENTS PART I - DEFINITIONS PART II...
6,Page 7,Section A – Eligibility,Section A – Eligibility Member Life Insurance ...
7,Page 8,Section A - Member Life Insurance,Section A - Member Life Insurance Schedule of ...
8,Page 9,P ART I - DEFINITIONS,P ART I - DEFINITIONS When used in this Group ...
9,Page 10,T he legally recognized union of two eligible ...,T he legally recognized union of two eligible ...


In [167]:
df['Text'][8]

"P ART I - DEFINITIONS When used in this Group Policy the terms listed below will mean: Active Work; Actively at Work A Member will be considered Actively at Work if he or she is able and available for active performance of all of his or her regular duties. Short term absence because of a regularly scheduled day off, holiday, vacation day, jury duty, funeral leave, or personal time off is considered Active Work provided the Member is able and available for active performance of all of his or her regular duties and was working the day immediately prior to the date of his or her absence. Activities of Daily Living (ADL) a. Bathing - the ability to wash oneself in the tub or shower or by sponge with or without equipment or adaptive devices. b. Dressing - the ability to put on and take off garments and medically necessary braces or artificial limbs usually worn and to fasten or unfasten them. c. Eating/Feeding - the ability to get nourishment into the body by any means once it has been pre

In [168]:
df['Text_Length'] = df['Text'].apply(lambda x: len(x.split(' ')))

In [169]:
df.head(10)

Unnamed: 0,Page_Number,Heading,Text,Text_Length
0,Page 1,DOROTHEA GLAUSE S655,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,30
1,Page 2,This page left blank intentionally,This page left blank intentionally,5
2,Page 3,POLICY RIDER,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,230
3,Page 4,This page left blank intentionally,This page left blank intentionally,5
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY,PRINCIPAL LIFE INSURANCE COMPANY (called The P...,110
5,Page 6,TABLE OF CONTENTS,TABLE OF CONTENTS PART I - DEFINITIONS PART II...,153
6,Page 7,Section A – Eligibility,Section A – Eligibility Member Life Insurance ...,176
7,Page 8,Section A - Member Life Insurance,Section A - Member Life Insurance Schedule of ...,171
8,Page 9,P ART I - DEFINITIONS,P ART I - DEFINITIONS When used in this Group ...,387
9,Page 10,T he legally recognized union of two eligible ...,T he legally recognized union of two eligible ...,251


In [170]:
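# Drop near-empty pages; .copy() avoids a SettingWithCopyWarning when columns are added later
df = df.loc[df['Text_Length']>=10].copy()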

In [171]:
df.head(10)

Unnamed: 0,Page_Number,Heading,Text,Text_Length
0,Page 1,DOROTHEA GLAUSE S655,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,30
2,Page 3,POLICY RIDER,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,230
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY,PRINCIPAL LIFE INSURANCE COMPANY (called The P...,110
5,Page 6,TABLE OF CONTENTS,TABLE OF CONTENTS PART I - DEFINITIONS PART II...,153
6,Page 7,Section A – Eligibility,Section A – Eligibility Member Life Insurance ...,176
7,Page 8,Section A - Member Life Insurance,Section A - Member Life Insurance Schedule of ...,171
8,Page 9,P ART I - DEFINITIONS,P ART I - DEFINITIONS When used in this Group ...,387
9,Page 10,T he legally recognized union of two eligible ...,T he legally recognized union of two eligible ...,251
10,Page 11,(2) has been placed with the Member or spouse ...,(2) has been placed with the Member or spouse ...,299
11,Page 12,An institution that is licensed as a Hospital ...,An institution that is licensed as a Hospital ...,352


In [172]:
df['Metadata'] = df.apply(
    lambda x: {
        'Section': (x['Heading'][:20] if x['Heading'] else ''),
        'PageNo.': x['Page_Number']
    },
    axis=1
)

In [173]:
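def chunk_text(text, chunk_size=300, overlap_size=50):
    """
    Split text into chunks of up to chunk_size words, where consecutive
    chunks share up to overlap_size words.
    """
    # Split the text into individual words
    words = text.split()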
    chunks = []
    
    # Iterate over the words to create chunks with overlap
    for i in range(0, len(words), chunk_size - overlap_size):
        # Create a chunk from the current position
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    
    return chunks
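
To make the overlap behaviour concrete, here is a small sketch that runs the same chunking logic with tiny parameters. It also includes `chunk_text_no_tail`, a hypothetical variant (not part of the notebook) that stops once a chunk already reaches the last word, so that no short trailing chunk made up entirely of overlap words is produced.

```python
def chunk_text(text, chunk_size=300, overlap_size=50):
    """Same logic as the notebook function, repeated here to keep the sketch self-contained."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap_size):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks

def chunk_text_no_tail(text, chunk_size=300, overlap_size=50):
    """Hypothetical variant: stop as soon as a chunk reaches the last word."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap_size
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # everything after this point is already covered
    return chunks

# Ten placeholder words: w0 ... w9
sample = " ".join(f"w{n}" for n in range(10))

original_chunks = chunk_text(sample, chunk_size=4, overlap_size=1)
trimmed_chunks = chunk_text_no_tail(sample, chunk_size=4, overlap_size=1)

print(original_chunks)  # ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9', 'w9']
print(trimmed_chunks)   # ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

With the default settings the step is 250 words, so a page of 251 words produces a second chunk that contains only its last word. Whether to drop such tails is a design choice; keeping them is harmless but adds near-empty entries to the index.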


In [174]:
df['Chunks'] = df['Text'].apply(lambda x: chunk_text(x))

# Flatten the DataFrame to have one row per chunk
chunked_df = df.explode('Chunks').reset_index(drop=True)

# Add a 1-based running identifier to each chunk
chunked_df['Chunk_ID'] = chunked_df.index + 1


In [175]:
chunked_df.head(15)

Unnamed: 0,Page_Number,Heading,Text,Text_Length,Metadata,Chunks,Chunk_ID
0,Page 1,DOROTHEA GLAUSE S655,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,30,"{'Section': 'DOROTHEA GLAUSE S655', 'PageNo.':...",DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,1
1,Page 3,POLICY RIDER,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,230,"{'Section': 'POLICY RIDER', 'PageNo.': 'Page 3'}",POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,2
2,Page 5,PRINCIPAL LIFE INSURANCE COMPANY,PRINCIPAL LIFE INSURANCE COMPANY (called The P...,110,"{'Section': 'PRINCIPAL LIFE INSUR', 'PageNo.':...",PRINCIPAL LIFE INSURANCE COMPANY (called The P...,3
3,Page 6,TABLE OF CONTENTS,TABLE OF CONTENTS PART I - DEFINITIONS PART II...,153,"{'Section': 'TABLE OF CONTENTS', 'PageNo.': 'P...",TABLE OF CONTENTS PART I - DEFINITIONS PART II...,4
4,Page 7,Section A – Eligibility,Section A – Eligibility Member Life Insurance ...,176,"{'Section': 'Section A – Eligibil', 'PageNo.':...",Section A – Eligibility Member Life Insurance ...,5
5,Page 8,Section A - Member Life Insurance,Section A - Member Life Insurance Schedule of ...,171,"{'Section': 'Section A - Member L', 'PageNo.':...",Section A - Member Life Insurance Schedule of ...,6
6,Page 9,P ART I - DEFINITIONS,P ART I - DEFINITIONS When used in this Group ...,387,"{'Section': 'P ART I - DEFINITION', 'PageNo.':...",P ART I - DEFINITIONS When used in this Group ...,7
7,Page 9,P ART I - DEFINITIONS,P ART I - DEFINITIONS When used in this Group ...,387,"{'Section': 'P ART I - DEFINITION', 'PageNo.':...",f. Continence - the ability to voluntarily con...,8
8,Page 10,T he legally recognized union of two eligible ...,T he legally recognized union of two eligible ...,251,"{'Section': 'T he legally recogni', 'PageNo.':...",T he legally recognized union of two eligible ...,9
9,Page 10,T he legally recognized union of two eligible ...,T he legally recognized union of two eligible ...,251,"{'Section': 'T he legally recogni', 'PageNo.':...",2,10



## 3. Generate and Store Embeddings using OpenAI and ChromaDB

In this section, we will embed the chunks in the dataframe with OpenAI's `text-embedding-ada-002` model and store them in a ChromaDB collection.

In [176]:
filepath = "./"

with open(filepath + "api_key.txt", "r") as f:
    # strip() removes surrounding whitespace, such as a trailing newline, from the key
    openai.api_key = f.read().strip()

In [177]:

chroma_data_path = './chromadb'
client = chromadb.PersistentClient(path=chroma_data_path)

In [178]:
embedding_function = OpenAIEmbeddingFunction(api_key=openai.api_key, model_name='text-embedding-ada-002')
insurance_collection = client.get_or_create_collection(name='InsurancePolicyDoc', embedding_function=embedding_function)

In [179]:
documents_list = chunked_df["Chunks"].tolist()
metadata_list = chunked_df['Metadata'].tolist()


In [180]:
insurance_collection.add(
    documents= documents_list,
    ids = [str(i) for i in range(0, len(documents_list))],
    metadatas = metadata_list
)

In [181]:

cache_collection = client.get_or_create_collection(name='Insurance_Cache_Coll', embedding_function=embedding_function)
cache_collection.peek()

{'ids': [],
 'embeddings': array([], dtype=float64),
 'documents': [],
 'uris': None,
 'included': ['metadatas', 'documents', 'embeddings'],
 'data': None,
 'metadatas': []}

## 4. Semantic Search with Cache

In this section, we will run a semantic search for a query against the collection's embeddings and retrieve the top semantically similar results.
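
The idea behind semantic search can be sketched without any external service: embed the query and the documents as vectors, then rank the documents by their distance to the query. The example below is a toy illustration only. The three-dimensional vectors are invented, and cosine distance is an assumed metric; the distance actually used by a ChromaDB collection depends on how the collection is configured.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; smaller values mean more similar vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, distance) pairs closest to the query."""
    scored = [(doc_id, cosine_distance(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Invented toy "embeddings" keyed by a made-up document label
doc_vecs = {
    "eligibility": [0.9, 0.1, 0.0],
    "premium":     [0.1, 0.9, 0.1],
    "termination": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

nearest = top_k(query_vec, doc_vecs, k=2)
print([doc_id for doc_id, _ in nearest])  # ['eligibility', 'premium']
```

In the notebook, the collection's `query` call plays the role of `top_k`, and the embedding function supplies the vectors.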

In [182]:
query_1=input()

 What are the policy holders eligibility requirements?


In [183]:
# Query the cache collection
cache_results = cache_collection.query(
    query_texts=query_1,
    n_results=1
)
print(cache_results)

# Query the main collection (unconditionally here; the next cell adds the cache check)
results = insurance_collection.query(
    query_texts=query_1,
    n_results=10
)
print(results)

{'ids': [[]], 'embeddings': None, 'documents': [[]], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[]], 'distances': [[]]}
{'ids': [['22', '23', '42', '40', '37', '16', '35', '5', '21', '25']], 'embeddings': None, 'documents': [["coverage, benefits, and participation privileges, may be made without the consent of any Member or Dependent. Payment of premium beyond the effective date of the change constitutes the Policyholder's consent to the change. Article 3 - Policyholder Eligibility Requirements To be an eligible group and to remain an eligible group, the Policyholder must: This policy has been updated effective January 1, 2014 PART II - POLICY ADMINISTRATION GC 6003 Section A - Contract, Page 1", "a. be actively engaged in business for profit within the meaning of the Internal Revenue Code, or be established as a legitimate nonprofit corporation within the meaning of the Internal Revenue Code; and b. make at least the level of premium

In [184]:
import json

threshold = 0.2

results_df_1 = pd.DataFrame()

# Query the cache collection
cache_results = cache_collection.query(
    query_texts=query_1,
    n_results=1
)

print(cache_results)

# Check if the cache is empty or if the distance exceeds the threshold
if not cache_results['distances'][0] or cache_results['distances'][0][0] > threshold:
    # Query the main collection for the top 10 results
    results = insurance_collection.query(
        query_texts=query_1,
        n_results=10
    )

    # Prepare keys and values for storing in cache
    cache_data = {}
    for key, val in results.items():
        if val is None:
            continue
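        # Store up to 10 entries per field under keys such as 'ids_0' or 'documents_3'.
        # Note: 'included' holds field names rather than results, so val[0][i] stores
        # single characters of its first entry (e.g. 'included_2': 't').
        for i in range(min(len(val[0]), 10)):
            cache_data[f"{key}_{i}"] = val[0][i]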

    # Flatten the metadata for storage in ChromaDB
    flat_cache_data = {}
    for k, v in cache_data.items():
        if isinstance(v, dict):
            # Convert the dictionary to a JSON string
            flat_cache_data[k] = json.dumps(v)
        else:
            flat_cache_data[k] = v

    # Store the query in cache
    cache_collection.add(
        documents=[query_1],
        ids=[query_1],  # Alternatively, you can use a unique ID
        metadatas=flat_cache_data
    )

    print("Not found in cache. Found in main collection.")

    # Convert the results to a DataFrame
    result_dict = {
        'Metadatas_1': results['metadatas'][0],
        'Documents_1': results['documents'][0],
        'Distances_1': results['distances'][0],
        'IDs': results['ids'][0]
    }
    results_df_1 = pd.DataFrame.from_dict(result_dict)

# If the distance is within the threshold, retrieve results from the cache
elif cache_results['distances'][0][0] <= threshold:
    # Extract data from the cache
    cache_result_dict = cache_results['metadatas'][0][0]

    # Rebuild each list by index: dictionary order is not guaranteed to match the
    # original ranking, so iterating over the keys could misalign the columns
    n_cached = sum(key.startswith('ids_') for key in cache_result_dict)
    ids = [cache_result_dict[f'ids_{i}'] for i in range(n_cached)]
    documents = [cache_result_dict[f'documents_{i}'] for i in range(n_cached)]
    distances = [cache_result_dict[f'distances_{i}'] for i in range(n_cached)]
    metadatas = [cache_result_dict[f'metadatas_{i}'] for i in range(n_cached)]

    print("Found in cache!")

    # Convert to a DataFrame with the same column names as the cache-miss branch
    results_df_1 = pd.DataFrame({
        'Metadatas_1': metadatas,
        'Documents_1': documents,
        'Distances_1': distances,
        'IDs': ids
    })

# Display the DataFrame with results
#print(results_df_1)


{'ids': [[]], 'embeddings': None, 'documents': [[]], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[]], 'distances': [[]]}
Not found in cache. Found in main collection.
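
The cache cell above packs a nested query result into a single flat metadata dictionary and later unpacks it. A minimal sketch of that round trip, using only the standard library, is shown below. The helper names (`flatten_results`, `unflatten_results`, `is_cache_hit`) and the sample data are hypothetical and are not part of the notebook; the sample dictionary only imitates the shape of the printed query output.

```python
import json

# Only these fields carry per-result data worth caching (an assumption of this sketch)
RESULT_KEYS = ('ids', 'documents', 'distances', 'metadatas')

def flatten_results(results, max_items=10):
    """Turn a nested query result into one flat dict of scalar values."""
    flat = {}
    for key in RESULT_KEYS:
        val = results.get(key)
        if val is None:
            continue
        for i, item in enumerate(val[0][:max_items]):
            # Dicts are stored as JSON strings so that every value stays a scalar
            flat[f"{key}_{i}"] = json.dumps(item) if isinstance(item, dict) else item
    return flat

def unflatten_results(flat):
    """Rebuild aligned lists by index instead of relying on dict order."""
    n = sum(key.startswith('ids_') for key in flat)
    rebuilt = {key: [flat[f"{key}_{i}"] for i in range(n)] for key in RESULT_KEYS}
    rebuilt['metadatas'] = [json.loads(m) for m in rebuilt['metadatas']]
    return rebuilt

def is_cache_hit(cache_results, threshold=0.2):
    """A hit needs at least one cached query within the distance threshold."""
    distances = cache_results['distances'][0]
    return bool(distances) and distances[0] <= threshold

# Made-up sample in the same shape as the printed query output
sample_results = {
    'ids': [['7', '3']],
    'documents': [['first chunk text', 'second chunk text']],
    'distances': [[0.25, 0.31]],
    'metadatas': [[{'Section': 'A', 'PageNo.': 'Page 1'},
                   {'Section': 'B', 'PageNo.': 'Page 2'}]],
    'included': ['metadatas', 'documents', 'distances'],
    'embeddings': None,
}

flat = flatten_results(sample_results)
restored = unflatten_results(flat)
print(restored['ids'], restored['distances'])
```

Restricting the loop to the four result fields avoids storing stray entries derived from `included`, and reading keys by index keeps each id aligned with its document, distance, and metadata. Wrapping the logic in functions would also remove the need to repeat the same cell for every query.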


In [185]:
results_df_1.head()

Unnamed: 0,Metadatas_1,Documents_1,Distances_1,IDs
0,"{'Section': 'PART II - POLICY ADM', 'PageNo.':...","coverage, benefits, and participation privileg...",0.248706,22
1,"{'Section': 'a. be actively engag', 'PageNo.':...",a. be actively engaged in business for profit ...,0.259511,23
2,"{'PageNo.': 'Page 29', 'Section': 'Insurance f...",by The Principal. A Member must submit Proof o...,0.303865,42
3,"{'PageNo.': 'Page 28', 'Section': 'Section B -...",to an individual policy; or (2) were eligible ...,0.317705,40
4,"{'Section': 'PART III - INDIVIDUA', 'PageNo.':...",PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ...,0.31788,37


In [186]:
query_2=input()

 What type of coverage does this policy include?


In [187]:
cache_results = cache_collection.query(
    query_texts=query_2,
    n_results=1
)
cache_results

{'ids': [['What are the policy holders eligibility requirements?']],
 'embeddings': None,
 'documents': [['What are the policy holders eligibility requirements?']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[{'metadatas_5': '{"PageNo.": "Page 13", "Section": "a . A licensed Docto"}',
    'metadatas_7': '{"Section": "TABLE OF CONTENTS", "PageNo.": "Page 6"}',
    'documents_5': "a . A licensed Doctor of Medicine (M.D.) or Osteopathy (D.O.); or b. any other licensed health care practitioner that state law requires be recognized as a Physician under this Group Policy. The term Physician does not include the Member, an employee of the Member, a business or professional partner or associate of the Member, any person who has a financial affiliation or business interest with the Member, anyone related to the Member by blood or marriage, or anyone living in the Member's household. Policy Anniversary November 1, 2014 and the same day of e

In [188]:
threshold = 0.2

results_df_2 = pd.DataFrame()

# Query the cache collection to check if the results are already stored
cache_results = cache_collection.query(
    query_texts=query_2,
    n_results=1
)

# Print the results from the cache query for debugging
print(cache_results)

# Check if the cache is empty or if the distance exceeds the threshold
if not cache_results['distances'][0] or cache_results['distances'][0][0] > threshold:
    # Query the main collection for the top 10 results
    results = insurance_collection.query(
        query_texts=query_2,
        n_results=10
    )

    # Prepare keys and values for storing in cache
    cache_data = {}
    for key, val in results.items():
        if val is None:
            continue
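        # Store up to 10 entries per field under keys such as 'ids_0' or 'documents_3'.
        # Note: 'included' holds field names rather than results, so val[0][i] stores
        # single characters of its first entry (e.g. 'included_2': 't').
        for i in range(min(len(val[0]), 10)):
            cache_data[f"{key}_{i}"] = val[0][i]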

    # Flatten the metadata for storage in ChromaDB
    flat_cache_data = {}
    for k, v in cache_data.items():
        if isinstance(v, dict):
            # Convert the dictionary to a JSON string
            flat_cache_data[k] = json.dumps(v)
        else:
            flat_cache_data[k] = v

    # Store the query in cache
    cache_collection.add(
        documents=[query_2],
        ids=[query_2],  # Alternatively, you can use a unique ID
        metadatas=flat_cache_data
    )

    print("Not found in cache. Found in main collection.")

    # Convert the results to a DataFrame
    result_dict = {
        'Metadatas_2': results['metadatas'][0],
        'Documents_2': results['documents'][0],
        'Distances_2': results['distances'][0],
        'IDs': results['ids'][0]
    }
    results_df_2 = pd.DataFrame.from_dict(result_dict)

# If the distance is within the threshold, retrieve results from the cache
elif cache_results['distances'][0][0] <= threshold:
    # Extract data from the cache
    cache_result_dict = cache_results['metadatas'][0][0]

    # Rebuild each list by index: dictionary order is not guaranteed to match the
    # original ranking, so iterating over the keys could misalign the columns
    n_cached = sum(key.startswith('ids_') for key in cache_result_dict)
    ids = [cache_result_dict[f'ids_{i}'] for i in range(n_cached)]
    documents = [cache_result_dict[f'documents_{i}'] for i in range(n_cached)]
    distances = [cache_result_dict[f'distances_{i}'] for i in range(n_cached)]
    metadatas = [cache_result_dict[f'metadatas_{i}'] for i in range(n_cached)]

    print("Found in cache!")

    # Convert the cache data to a DataFrame with the same column names as the cache-miss branch
    results_df_2 = pd.DataFrame({
        'Metadatas_2': metadatas,
        'Documents_2': documents,
        'Distances_2': distances,
        'IDs': ids
    })

{'ids': [['What are the policy holders eligibility requirements?']], 'embeddings': None, 'documents': [['What are the policy holders eligibility requirements?']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[{'ids_6': '35', 'metadatas_3': '{"PageNo.": "Page 28", "Section": "Section B - Effectiv"}', 'included_4': 'd', 'metadatas_8': '{"Section": "PART II - POLICY ADM", "PageNo.": "Page 16"}', 'distances_8': 0.329443097114563, 'included_2': 't', 'distances_3': 0.3177047073841095, 'documents_7': 'TABLE OF CONTENTS PART I - DEFINITIONS PART II - POLICY ADMINISTRATION Section A – Contract Entire Contract Article 1 Policy Changes Article 2 Policyholder Eligibility Requirements Article 3 Policy Incontestability Article 4 Individual Incontestability Article 5 Information to be Furnished Article 6 Certificates Article 7 Assignments Article 8 Dependent Rights Article 9 Policy Interpretation Article 10 Electronic Transactions Article 11 Section B

In [189]:
results_df_2

Unnamed: 0,Metadatas_2,Documents_2,Distances_2,IDs
0,"{'PageNo.': 'Page 16', 'Section': 'PART II - P...","coverage, benefits, and participation privileg...",0.368972,22
1,"{'Section': 'Coverage During Disa', 'PageNo.':...",This policy has been updated effective January...,0.378372,81
2,"{'Section': '(2) has been placed ', 'PageNo.':...",dependent on the Member for principal support....,0.380819,13
3,"{'Section': 'Exposure', 'PageNo.': 'Page 55'}","""Automobile"" means a four-wheel passenger vehi...",0.380867,88
4,"{'Section': 'A Member's insurance', 'PageNo.':...",state or federal law. Article 5 - Coverage Whi...,0.380963,54
5,"{'Section': 'a. be actively engag', 'PageNo.':...",a. be actively engaged in business for profit ...,0.380973,23
6,"{'PageNo.': 'Page 13', 'Section': 'a . A licen...",a . A licensed Doctor of Medicine (M.D.) or Os...,0.382814,16
7,"{'Section': 'Exposure', 'PageNo.': 'Page 55'}",Exposure Exposure to the elements will be pres...,0.383495,87
8,"{'Section': 'T he Principal has c', 'PageNo.':...",T he Principal has complete discretion to cons...,0.387429,27
9,"{'Section': 'c . a copy of the fo', 'PageNo.':...",c . a copy of the form which contains the stat...,0.389168,25


In [190]:
query_3 = input()

 What If I miss payment?


In [219]:
cache_results = cache_collection.query(
    query_texts=query_3,
    n_results=1
)
cache_results

{'ids': [['What If I miss payment?']],
 'embeddings': None,
 'documents': [['What If I miss payment?']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[{'metadatas_2': '{"Section": "f . claim requiremen", "PageNo.": "Page 54"}',
    'ids_5': '96',
    'distances_1': 0.4594007730484009,
    'documents_9': 'Reinstatement, Page 2',
    'included_2': 't',
    'documents_0': 'or d. fails to pay premium in accordance with the requirements of PART II, Section B; or e. has performed an act or practice that constitutes fraud or has made an intentional misrepresentation of material fact under the terms of this Group Policy; or f. does not promptly provide The Principal with information that is reasonably required; or g. fails to perform any of its obligations that relate to this Group Policy. This policy has been updated effective January 1, 2014 PART II - POLICY ADMINISTRATION GC 6005 Section C - Policy Termination, Page 1',
    'included_5':

In [192]:
threshold = 0.2

results_df_3 = pd.DataFrame()

# Query the cache collection to check if the results are already stored
cache_results = cache_collection.query(
    query_texts=query_3,
    n_results=1
)

# Print the results from the cache query for debugging
#print(cache_results)

# Check if the cache is empty or if the distance exceeds the threshold
if not cache_results['distances'][0] or cache_results['distances'][0][0] > threshold:
    # Query the main collection for the top 10 results
    results = insurance_collection.query(
        query_texts=query_3,
        n_results=10
    )

    # Prepare keys and values for storing in cache
    cache_data = {}
    for key, val in results.items():
        if val is None:
            continue
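        # Store up to 10 entries per field under keys such as 'ids_0' or 'documents_3'.
        # Note: 'included' holds field names rather than results, so val[0][i] stores
        # single characters of its first entry (e.g. 'included_2': 't').
        for i in range(min(len(val[0]), 10)):
            cache_data[f"{key}_{i}"] = val[0][i]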

    # Flatten the metadata for storage in ChromaDB
    flat_cache_data = {}
    for k, v in cache_data.items():
        if isinstance(v, dict):
            # Convert the dictionary to a JSON string
            flat_cache_data[k] = json.dumps(v)
        else:
            flat_cache_data[k] = v

    # Store the query in cache
    cache_collection.add(
        documents=[query_3],
        ids=[query_3],  # Alternatively, you can use a unique ID
        metadatas=flat_cache_data
    )

    print("Not found in cache. Found in main collection.")

    # Convert the results to a DataFrame
    result_dict = {
        'Metadatas_3': results['metadatas'][0],
        'Documents_3': results['documents'][0],
        'Distances_3': results['distances'][0],
        'IDs': results['ids'][0]
    }
    results_df_3 = pd.DataFrame.from_dict(result_dict)

# If the distance is within the threshold, retrieve results from the cache
elif cache_results['distances'][0][0] <= threshold:
    # Extract data from the cache
    cache_result_dict = cache_results['metadatas'][0][0]

    # Rebuild each list by index: dictionary order is not guaranteed to match the
    # original ranking, so iterating over the keys could misalign the columns
    n_cached = sum(key.startswith('ids_') for key in cache_result_dict)
    ids = [cache_result_dict[f'ids_{i}'] for i in range(n_cached)]
    documents = [cache_result_dict[f'documents_{i}'] for i in range(n_cached)]
    distances = [cache_result_dict[f'distances_{i}'] for i in range(n_cached)]
    metadatas = [cache_result_dict[f'metadatas_{i}'] for i in range(n_cached)]

    print("Found in cache!")

    # Convert the cache data to a DataFrame with the same column names as the cache-miss branch
    results_df_3 = pd.DataFrame({
        'Metadatas_3': metadatas,
        'Documents_3': documents,
        'Distances_3': distances,
        'IDs': ids
    })

Not found in cache. Found in main collection.


In [193]:
results_df_3

Unnamed: 0,Metadatas_3,Documents_3,Distances_3,IDs
0,"{'PageNo.': 'Page 23', 'Section': 'Section C -...",or d. fails to pay premium in accordance with ...,0.43673,34
1,"{'Section': 'Section C - Policy T', 'PageNo.':...",Section C - Policy Termination Article 1 - Fai...,0.459401,33
2,"{'Section': 'f . claim requiremen', 'PageNo.':...","Settlement of Proceeds provisions of PART IV, ...",0.461815,86
3,"{'PageNo.': 'Page 20', 'Section': 'Section B -...",Section B - Premiums Article 1 - Payment Respo...,0.473405,28
4,"{'Section': 'T he Principal may t', 'PageNo.':...",T he Principal may terminate the Policyholder'...,0.475949,35
5,"{'Section': 'I f a Dependent who ', 'PageNo.':...","before a change request is received, that paym...",0.482655,96
6,"{'PageNo.': 'Page 54', 'Section': 'f . claim r...","f . claim requirements listed in PART IV, Sect...",0.483409,85
7,"{'Section': 'M ember's death, the', 'PageNo.':...",insurance and recorded by the Policyholder or ...,0.488051,73
8,"{'PageNo.': 'Page 20', 'Section': 'Section B -...",Premium Rate Changes The Principal may change ...,0.490084,29
9,"{'Section': 'I f coverage for a M', 'PageNo.':...","Reinstatement, Page 2",0.490146,62


## 5. Re-Ranking with a Cross Encoder

Re-ranking the results obtained from your semantic search can sometimes significantly improve the relevance of the retrieved results. This is often done by passing the query, paired with each of the retrieved responses, into a cross-encoder that scores the relevance of each response with respect to the query.


In [194]:
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [195]:
results_df_3

Unnamed: 0,Metadatas_3,Documents_3,Distances_3,IDs
0,"{'PageNo.': 'Page 23', 'Section': 'Section C -...",or d. fails to pay premium in accordance with ...,0.43673,34
1,"{'Section': 'Section C - Policy T', 'PageNo.':...",Section C - Policy Termination Article 1 - Fai...,0.459401,33
2,"{'Section': 'f . claim requiremen', 'PageNo.':...","Settlement of Proceeds provisions of PART IV, ...",0.461815,86
3,"{'PageNo.': 'Page 20', 'Section': 'Section B -...",Section B - Premiums Article 1 - Payment Respo...,0.473405,28
4,"{'Section': 'T he Principal may t', 'PageNo.':...",T he Principal may terminate the Policyholder'...,0.475949,35
5,"{'Section': 'I f a Dependent who ', 'PageNo.':...","before a change request is received, that paym...",0.482655,96
6,"{'PageNo.': 'Page 54', 'Section': 'f . claim r...","f . claim requirements listed in PART IV, Sect...",0.483409,85
7,"{'Section': 'M ember's death, the', 'PageNo.':...",insurance and recorded by the Policyholder or ...,0.488051,73
8,"{'PageNo.': 'Page 20', 'Section': 'Section B -...",Premium Rate Changes The Principal may change ...,0.490084,29
9,"{'Section': 'I f coverage for a M', 'PageNo.':...","Reinstatement, Page 2",0.490146,62


In [196]:
cross_inputs_1 = [[query_1, response] for response in results_df_1['Documents_1']]
cross_rerank_scores_1 = cross_encoder.predict(cross_inputs_1)
cross_rerank_scores_1

array([ 6.58327  , -1.0143627, -3.4940925, -1.9038168, -0.1226889,
       -3.1502113, -5.9020834,  1.3004377, -0.681476 , -9.010473 ],
      dtype=float32)

In [197]:
results_df_1['Reranked_scores'] = cross_rerank_scores_1
results_df_1

Unnamed: 0,Metadatas_1,Documents_1,Distances_1,IDs,Reranked_scores
0,"{'Section': 'PART II - POLICY ADM', 'PageNo.':...","coverage, benefits, and participation privileg...",0.248706,22,6.58327
1,"{'Section': 'a. be actively engag', 'PageNo.':...",a. be actively engaged in business for profit ...,0.259511,23,-1.014363
2,"{'PageNo.': 'Page 29', 'Section': 'Insurance f...",by The Principal. A Member must submit Proof o...,0.303865,42,-3.494092
3,"{'PageNo.': 'Page 28', 'Section': 'Section B -...",to an individual policy; or (2) were eligible ...,0.317705,40,-1.903817
4,"{'Section': 'PART III - INDIVIDUA', 'PageNo.':...",PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ...,0.31788,37,-0.122689
5,"{'PageNo.': 'Page 13', 'Section': 'a . A licen...",a . A licensed Doctor of Medicine (M.D.) or Os...,0.318569,16,-3.150211
6,"{'Section': 'T he Principal may t', 'PageNo.':...",T he Principal may terminate the Policyholder'...,0.32545,35,-5.902083
7,"{'Section': 'TABLE OF CONTENTS', 'PageNo.': 'P...",TABLE OF CONTENTS PART I - DEFINITIONS PART II...,0.325484,5,1.300438
8,"{'Section': 'PART II - POLICY ADM', 'PageNo.':...",PART II - POLICY ADMINISTRATION Section A - Co...,0.329443,21,-0.681476
9,"{'Section': 'c . a copy of the fo', 'PageNo.':...",c . a copy of the form which contains the stat...,0.333937,25,-9.010473


In [198]:
# Top 3 by semantic search: smallest embedding distance first
top_3_semantic_1 = results_df_1.sort_values(by='Distances_1')
top_3_semantic_1[:3]

Unnamed: 0,Metadatas_1,Documents_1,Distances_1,IDs,Reranked_scores
0,"{'Section': 'PART II - POLICY ADM', 'PageNo.':...","coverage, benefits, and participation privileg...",0.248706,22,6.58327
1,"{'Section': 'a. be actively engag', 'PageNo.':...",a. be actively engaged in business for profit ...,0.259511,23,-1.014363
2,"{'PageNo.': 'Page 29', 'Section': 'Insurance f...",by The Principal. A Member must submit Proof o...,0.303865,42,-3.494092


In [199]:
# Top 3 after reranking: highest cross-encoder score first
top_3_rerank_1 = results_df_1.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank_1[:3]

Unnamed: 0,Metadatas_1,Documents_1,Distances_1,IDs,Reranked_scores
0,"{'Section': 'PART II - POLICY ADM', 'PageNo.':...","coverage, benefits, and participation privileg...",0.248706,22,6.58327
7,"{'Section': 'TABLE OF CONTENTS', 'PageNo.': 'P...",TABLE OF CONTENTS PART I - DEFINITIONS PART II...,0.325484,5,1.300438
4,"{'Section': 'PART III - INDIVIDUA', 'PageNo.':...",PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ...,0.31788,37,-0.122689


In [200]:
cross_inputs_2 = [[query_2, response] for response in results_df_2['Documents_2']]
cross_rerank_scores_2 = cross_encoder.predict(cross_inputs_2)
results_df_2['Reranked_scores'] = cross_rerank_scores_2

In [201]:
top_3_semantic_2 = results_df_2.sort_values(by='Distances_2')
top_3_semantic_2[:3]

Unnamed: 0,Metadatas_2,Documents_2,Distances_2,IDs,Reranked_scores
0,"{'PageNo.': 'Page 16', 'Section': 'PART II - P...","coverage, benefits, and participation privileg...",0.368972,22,-1.230197
1,"{'Section': 'Coverage During Disa', 'PageNo.':...",This policy has been updated effective January...,0.378372,81,-4.356529
2,"{'Section': '(2) has been placed ', 'PageNo.':...",dependent on the Member for principal support....,0.380819,13,-1.817021


In [202]:
top_3_rerank_2 = results_df_2.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank_2[:3]

Unnamed: 0,Metadatas_2,Documents_2,Distances_2,IDs,Reranked_scores
4,"{'Section': 'A Member's insurance', 'PageNo.':...",state or federal law. Article 5 - Coverage Whi...,0.380963,54,-0.963687
6,"{'PageNo.': 'Page 13', 'Section': 'a . A licen...",a . A licensed Doctor of Medicine (M.D.) or Os...,0.382814,16,-0.973411
0,"{'PageNo.': 'Page 16', 'Section': 'PART II - P...","coverage, benefits, and participation privileg...",0.368972,22,-1.230197


In [203]:
cross_inputs_3 = [[query_3, response] for response in results_df_3['Documents_3']]
cross_rerank_scores_3 = cross_encoder.predict(cross_inputs_3)

In [204]:
results_df_3['Reranked_scores'] = cross_rerank_scores_3
top_3_semantic_3 = results_df_3.sort_values(by='Distances_3')
top_3_semantic_3[:3]

Unnamed: 0,Metadatas_3,Documents_3,Distances_3,IDs,Reranked_scores
0,"{'PageNo.': 'Page 23', 'Section': 'Section C -...",or d. fails to pay premium in accordance with ...,0.43673,34,-10.301773
1,"{'Section': 'Section C - Policy T', 'PageNo.':...",Section C - Policy Termination Article 1 - Fai...,0.459401,33,-7.991843
2,"{'Section': 'f . claim requiremen', 'PageNo.':...","Settlement of Proceeds provisions of PART IV, ...",0.461815,86,-6.399134


In [205]:
top_3_rerank_3 = results_df_3.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank_3[:3]

Unnamed: 0,Metadatas_3,Documents_3,Distances_3,IDs,Reranked_scores
2,"{'Section': 'f . claim requiremen', 'PageNo.':...","Settlement of Proceeds provisions of PART IV, ...",0.461815,86,-6.399134
6,"{'PageNo.': 'Page 54', 'Section': 'f . claim r...","f . claim requirements listed in PART IV, Sect...",0.483409,85,-7.258598
7,"{'Section': 'M ember's death, the', 'PageNo.':...",insurance and recorded by the Policyholder or ...,0.488051,73,-7.543719


In [206]:
# Keep only the document text and metadata of the top 3 reranked results for each query
top_3_RAG_1 = top_3_rerank_1[["Documents_1", "Metadatas_1"]][:3]
top_3_RAG_2 = top_3_rerank_2[["Documents_2", "Metadatas_2"]][:3]
top_3_RAG_3 = top_3_rerank_3[["Documents_3", "Metadatas_3"]][:3]
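
The three queries above repeat the same steps: build (query, document) pairs, score them, and keep the highest-scoring rows. A minimal sketch of how those steps could be wrapped in one function follows. `rerank_top_k` and its parameters are hypothetical names, not part of the original notebook, and the scorer is passed in as a callable so that `cross_encoder.predict` or any other scoring function can be used.

```python
import pandas as pd


def rerank_top_k(results_df, query, doc_col, score_fn, k=3):
    """Score (query, document) pairs and return the k highest-scoring rows.

    score_fn is any callable that maps a list of [query, document] pairs
    to one relevance score per pair, e.g. cross_encoder.predict
    (hypothetical helper, assumed interface).
    """
    pairs = [[query, doc] for doc in results_df[doc_col]]
    scored = results_df.copy()  # leave the caller's dataframe untouched
    scored["Reranked_scores"] = list(score_fn(pairs))
    # nlargest sorts by the score column in descending order and keeps k rows
    return scored.nlargest(k, "Reranked_scores")
```

With the objects defined earlier in the notebook, a call might look like `rerank_top_k(results_df_1, query_1, 'Documents_1', cross_encoder.predict)`.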

## 6. Retrieval Augmented Generation

Now that we have the final top search results, we can pass them to an OpenAI chat model (`gpt-4o-mini` in the code below), along with the user query and a well-engineered prompt, to generate a direct answer to the query with citations, rather than returning whole pages or chunks.

In [207]:
# Define the function to generate the response. The prompt passes the user query and the top 3 reranked results to the model.

def generate_response(query, top_3_RAG):
    """
    Generate a response with the OpenAI Chat Completions API (gpt-4o-mini) based on the user query and retrieved information.
    """
    messages = [
        {"role": "system", "content": "You are a helpful assistant in the insurance domain who can effectively answer user queries about insurance policies and documents."},
        {"role": "user", "content": f"""
            You are a helpful assistant in the insurance domain who can effectively answer user queries about insurance policies and documents.
            You have a question asked by the user in '{query}' and you have some search results from a corpus of insurance documents in the dataframe '{top_3_RAG}'. These search results are essentially one page of an insurance document that may be relevant to the user query.

            The column 'documents' inside this dataframe contains the actual text from the policy document and the column 'metadata' contains the policy name and source page. The text inside the document may also contain tables in the format of a list of lists where each of the nested lists indicates a row.

            Use the documents in '{top_3_RAG}' to answer the query '{query}'. Frame an informative answer and also, use the dataframe to return the relevant policy names and page numbers as citations.

            Follow the guidelines below when performing the task:
            1. Try to provide relevant/accurate numbers if available.
            2. You don’t have to necessarily use all the information in the dataframe. Only choose information that is relevant.
            3. If the document text has tables with relevant information, please reformat the table and return the final information in a tabular format.
            4. Use the 'metadata' columns in the dataframe to retrieve and cite the policy name(s) and page number(s) as citation.
            5. If you can't provide the complete answer, please also provide any information that will help the user to search specific sections in the relevant cited documents.
            6. You are a customer-facing assistant, so do not provide any information on internal workings, just answer the query directly.

            The generated response should answer the query directly addressing the user and avoiding additional information. If you think that the query is not relevant to the document, reply that the query is irrelevant. Provide the final response as a well-formatted and easily readable text along with the citation. Provide your complete response first with all information, and then provide the citations.

            ### Few-Shot Examples

            ### Example 1: Basic Query about Coverage
            **Query:**  
            What does the policy say about coverage for accidental death?

            **Top 3 RAG Results:**  
            - **Document 1:** "This policy provides coverage for accidental death. The insured amount for accidental death is 200% of the base coverage amount if the death occurs within 90 days of the accident..."
            - **Document 2:** "Accidental death benefits are payable under this policy if the insured dies as a result of an accident. The benefit amount equals double the coverage amount, provided the death is a direct result of the accident and occurs within a specified time frame..."
            - **Document 3:** "In the event of accidental death, the policy pays an additional benefit, which is equal to twice the original coverage amount. This benefit is contingent on the death occurring within 180 days from the date of the accident..."

            **Response:**  
            The policy provides coverage for accidental death, where the benefit amount is typically 200% of the base coverage. The death must occur as a direct result of an accident and within a specified period, which varies between 90 to 180 days depending on the policy.  
            **Citations:**  
            Document 1: Policy X, Page 5  
            Document 2: Policy Y, Page 12  
            Document 3: Policy Z, Page 7

            ### Example 2: Query about Exclusions
            **Query:**  
            Are there any exclusions for pre-existing conditions in this policy?

            **Top 3 RAG Results:**  
            - **Document 1:** "This policy excludes coverage for any conditions that were diagnosed or treated within 12 months prior to the policy's start date. However, if the condition remains stable for 24 months after the policy's start date, it may be eligible for coverage..."
            - **Document 2:** "Pre-existing conditions are generally not covered under this policy unless explicitly stated otherwise. Any condition that has shown symptoms or required medical attention in the 12 months before the policy start date is excluded..."
            - **Document 3:** "Exclusions apply to pre-existing conditions if they were present within a 12-month window before the policy commencement. After a waiting period of 24 months, these conditions may be reconsidered for coverage..."

            **Response:**  
            The policy excludes coverage for pre-existing conditions that were diagnosed or treated within 12 months before the policy start date. However, if the condition remains stable and does not require treatment for 24 months after the policy start date, it may be eligible for coverage.  
            **Citations:**  
            Document 1: Policy X, Page 8  
            Document 2: Policy Y, Page 15  
            Document 3: Policy Z, Page 10

            ### Example 3: Query about Beneficiaries
            **Query:**  
            How can I update the beneficiary for my life insurance policy?

            **Top 3 RAG Results:**  
            - **Document 1:** "To update the beneficiary, the policyholder must submit a written request to the insurance company. The request should include the policy number, current beneficiary, and the new beneficiary's details..."
            - **Document 2:** "Beneficiary changes can be made by filling out the 'Beneficiary Change Form,' which must be signed by the policyholder and submitted to the insurance provider. The change takes effect upon the company's receipt and acceptance of the form..."
            - **Document 3:** "The policyholder has the right to change the beneficiary at any time by providing written notice to the insurer. The change will be recorded and acknowledged by the insurer upon receipt..."

            **Response:**  
            To update the beneficiary of your life insurance policy, you need to submit a written request or complete a 'Beneficiary Change Form' provided by your insurer. This request must be signed by you as the policyholder, and the change will take effect once the insurance company has received and acknowledged it.  
            **Citations:**  
            Document 1: Policy X, Page 3  
            Document 2: Policy Y, Page 6  
            Document 3: Policy Z, Page 4

            ### Example 4: Complex Query about Premium Payments
            **Query:**  
            What are the options if I miss a premium payment?

            **Top 3 RAG Results:**  
            - **Document 1:** "If a premium payment is missed, the policyholder typically has a 30-day grace period to make the payment without penalty. If payment is not received within this period, the policy may lapse, and coverage could be lost..."
            - **Document 2:** "The policy includes a 30-day grace period for missed payments. During this time, coverage remains in force. If payment is not made by the end of the grace period, the policyholder may request reinstatement of the policy, subject to underwriting..."
            - **Document 3:** "In the event of a missed premium, a 30-day grace period is granted. If the premium is not paid within this period, the policyholder may choose to reinstate the policy, which may require proof of insurability and payment of overdue premiums..."

            **Response:**  
            If you miss a premium payment, your policy provides a 30-day grace period during which you can make the payment without losing coverage. If the payment is not made within this period, the policy may lapse. However, you may have the option to reinstate the policy by providing proof of insurability and paying the overdue premiums.  
            **Citations:**  
            Document 1: Policy X, Page 10  
            Document 2: Policy Y, Page 11  
            Document 3: Policy Z, Page 9
        """},
    ]

    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )

    return response.choices[0].message.content.split('\n')
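
One thing to watch in `generate_response`: interpolating a DataFrame into an f-string inserts its printed representation, and pandas shortens long cell values in that representation by default, so the model may only see the beginning of each chunk. A minimal sketch of one way around this is to render the rows as plain text before building the prompt. It assumes each metadata cell is a dict with the `Section` and `PageNo.` keys displayed above; `build_context` is a hypothetical helper, not part of the original notebook.

```python
def build_context(rows, doc_key, meta_key):
    """Render retrieved rows as plain text, one numbered block per document.

    rows is a list of dicts, e.g. top_3_RAG_1.to_dict("records").
    Assumes row[meta_key] is a dict; missing keys fall back to 'unknown'.
    """
    blocks = []
    for number, row in enumerate(rows, start=1):
        meta = row[meta_key]
        blocks.append(
            f"Document {number}\n"
            f"Section: {meta.get('Section', 'unknown')}\n"
            f"Page: {meta.get('PageNo.', 'unknown')}\n"
            f"Text: {row[doc_key]}"
        )
    # Blank line between documents keeps the blocks easy to tell apart
    return "\n\n".join(blocks)
```

The resulting string could then be interpolated into the prompt in place of the dataframe, for example `build_context(top_3_RAG_1.to_dict("records"), "Documents_1", "Metadatas_1")`.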

In [208]:
# Generate the response - For Query 1

response = generate_response(query_1, top_3_RAG_1)
print("Query 1: ","\n",query_1,"\n_________________________________________________________________________________________________________________\n_________________________________________________________________________________________________________________\n")
# Print the response
print("\n".join(response))

Query 1:  
 What are the policy holders eligibility requirements? 
_________________________________________________________________________________________________________________
_________________________________________________________________________________________________________________

**Response:**  
The eligibility requirements for policyholders typically include specific criteria relating to age, residency, and health status. Unfortunately, the detailed eligibility requirements were not directly visible in the provided excerpts of the documents. However, one section does mention "Individual Requirements and Rights," which may contain relevant information regarding eligibility. 

To find specific eligibility requirements, you should refer to the section on "Individual Requirements and Rights" in the relevant policy documents.

**Citations:**  
Document 1: Policy (part of), Page No. - N/A (Section reference indicates it is part of a broader context)  
Document 2: Policy (part

In [216]:
# Generate the response - For Query 2

response = generate_response(query_2, top_3_RAG_2)
print("Query 2: ","\n",query_2,"\n_________________________________________________________________________________________________________________\n_________________________________________________________________________________________________________________\n")
# Print the response
print("\n".join(response))

Query 2:  
 What type of coverage does this policy include? 
_________________________________________________________________________________________________________________
_________________________________________________________________________________________________________________

The policy covers various types of insurance benefits, including standard coverage for essential services. Specific provisions related to these offerings can be found throughout the document. Here are some key coverage details:

1. **Coverage Type:** The policy includes coverage for mean medical treatments as prescribed by licensed healthcare providers.
2. **Benefits and Participation:** The policy defines participation privileges for members, indicating that certain coverage types may depend on the member's status and compliance with specified criteria.

To assist you further, you can look into sections discussing these aspects in the relevant documents.

**Citations:**  
Document 1: Policy on Covera

In [210]:
top_3_RAG_3['Metadatas_3']

2    {'Section': 'f . claim requiremen', 'PageNo.':...
6    {'PageNo.': 'Page 54', 'Section': 'f . claim r...
7    {'Section': 'M ember's death, the', 'PageNo.':...
Name: Metadatas_3, dtype: object

In [214]:
# Generate the response - For Query 3

response = generate_response(query_3, top_3_RAG_3)
print("Query 3: ","\n",query_3,"\n_________________________________________________________________________________________________________________\n_________________________________________________________________________________________________________________\n")
# Print the response
print("\n".join(response))

Query 3:  
 What If I miss payment? 
_________________________________________________________________________________________________________________
_________________________________________________________________________________________________________________

**Response:**  
If you miss a premium payment, typically, there is a 30-day grace period during which you can make your payment without losing coverage. If the payment is not made within this period, there is a risk that your policy may lapse, leading to potential loss of coverage. However, many policies allow for reinstatement after the grace period under certain conditions, which could include providing evidence of insurability and paying any overdue premiums.

It's important to review your specific policy documents for precise terms regarding grace periods and reinstatement. 

**Citations:**  
Document 1: Policy Section, Page 54  
Document 2: Policy Section, Page 54  
Document 3: Policy Section, Page 54
