# Insurance Documents RAG (Retrieval-Augmented Generation) System

This notebook demonstrates a complete RAG implementation for processing and querying insurance policy documents. The system uses OpenAI embeddings, a Chroma vector database, and cross-encoder reranking to answer insurance-related questions from the retrieved policy text.

## 🎯 Objectives
- Process PDF insurance documents
- Create vector embeddings for semantic search
- Implement advanced retrieval with reranking
- Build a question-answering system using RAG architecture

## 📋 Table of Contents
1. [Environment Setup](#environment-setup)
2. [Document Loading and Processing](#document-loading)
3. [Text Splitting and Chunking](#text-splitting)
4. [Embeddings and Vector Database](#embeddings)
5. [Advanced Retrieval System](#retrieval)

---

## 1. Environment Setup

### Installing Required Dependencies
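This cell installs only `langchainhub`. The other modules imported below (`langchain`, `langchain_openai`, `langchain_community`, `langchain_text_splitters`, `openai`) are assumed to be installed already.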

In [None]:
!pip install langchainhub --user

### Import Required Libraries
Importing all necessary libraries for document processing, embeddings, vector storage, and retrieval.

In [1]:
# Core libraries
import openai
import os
from pathlib import Path

# LangChain components
from langchain_openai import OpenAI, ChatOpenAI
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.storage import InMemoryStore 
from langchain.embeddings import CacheBackedEmbeddings
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker 
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_text_splitters import RecursiveCharacterTextSplitter

# RAG chain components
from langchain import hub
from langchain_core.runnables import RunnablePassthrough 
from langchain_core.output_parsers import StrOutputParser

import warnings
warnings.filterwarnings('ignore')

### API Key Configuration
Setting up OpenAI API key for accessing embeddings and chat models.

In [2]:
def read_api_key_from_file(file_path):
    """Read API key from a text file"""
    try:
        with open(file_path, 'r') as file:
            api_key = file.read().strip()
        return api_key
    except FileNotFoundError:
        print(f"Error: File {file_path} not found")
        return None
    except Exception as e:
        print(f"Error reading file: {e}")
        return None

# Load API key
api_key = read_api_key_from_file('openai_key.txt')

if api_key:
    os.environ['OPENAI_API_KEY'] = api_key

### Initialize Language Model
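Instantiating `ChatOpenAI` with its default model settings. Note that `create_rag_chain` later builds its own `ChatOpenAI` instance unless a model is passed in.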

In [3]:
llm_chat = ChatOpenAI()

---

## 2. Document Loading and Processing

### Load PDF Documents
Loading insurance policy documents from the specified directory.

In [4]:
# Load PDF documents from directory
pdf_directory_loader = PyPDFDirectoryLoader('./InsuranceDocuments')
documents = pdf_directory_loader.load()

### Document Analysis
Analyzing the loaded documents to understand our dataset.

In [5]:
# Extract unique document sources and filenames
unique_sources = list(set(doc.metadata['source'] for doc in documents))
filenames = [Path(source).name for source in sorted(unique_sources)]

In [6]:
print(filenames)

['Motor Vehicle Insurance Policy Against Loss and Damage.pdf', 'Motor Vehicle Insurance Policy Against Third Party Liability.pdf']


In [7]:
# Display document loading summary
print(f"Loaded {len(documents)} document pages from {len(unique_sources)} PDF files:")
for i, filename in enumerate(filenames, 1):
    page_count = sum(1 for doc in documents if Path(doc.metadata['source']).name == filename)
    print(f"{i:2d}. {filename} ({page_count} pages)")

Loaded 34 document pages from 2 PDF files:
 1. Motor Vehicle Insurance Policy Against Loss and Damage.pdf (16 pages)
 2. Motor Vehicle Insurance Policy Against Third Party Liability.pdf (18 pages)
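
The per-file page counts above can also be computed in a single pass with `collections.Counter`, rather than rescanning all documents once per filename. This is a minimal sketch: `count_pages_by_file` is a hypothetical helper, and the source strings are made-up stand-ins for `doc.metadata['source']` values.

```python
from collections import Counter
from pathlib import PureWindowsPath


def count_pages_by_file(sources):
    """Count how many page-level documents came from each PDF file name."""
    # PureWindowsPath accepts both '\\' and '/' as separators on any OS.
    return Counter(PureWindowsPath(source).name for source in sources)


# Hypothetical stand-ins for doc.metadata['source'] values.
example_sources = [
    r"InsuranceDocuments\policy_a.pdf",
    r"InsuranceDocuments\policy_a.pdf",
    "InsuranceDocuments/policy_b.pdf",
]
page_counts = count_pages_by_file(example_sources)
for name, count in sorted(page_counts.items()):
    print(f"{name}: {count} pages")
```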


---

## 3. Text Splitting and Chunking

### Configure Text Splitter
Breaking down large documents into manageable chunks for better retrieval performance.

In [8]:
# Configure the text splitter: 1000-character chunks with a 200-character overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,         # Maximum characters per chunk
    chunk_overlap=200,       # Overlap between chunks to maintain context
    length_function=len,
    is_separator_regex=False
)

# Split documents into chunks
splits = text_splitter.split_documents(documents)
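
To make the roles of `chunk_size` and `chunk_overlap` concrete, the sketch below splits a string into fixed-size windows that share their boundary characters. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and lines, so its chunks are usually shorter than the limit. `sliding_window_chunks` is a hypothetical helper written for illustration.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])           # sizes of the windows
print(chunks[0][-200:] == chunks[1][:200])  # neighbouring windows share 200 characters
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.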

### Chunking Analysis
Checking how many chunks were produced and how their sizes are distributed. Each "document" here is one PDF page, as loaded above.

In [9]:
print(f"Split {len(documents)} documents into {len(splits)} chunks")
print(f"Average chunks per document: {len(splits) / len(documents):.1f}")

Split 34 documents into 85 chunks
Average chunks per document: 2.5


In [10]:
# Analyze chunk size distribution
chunk_sizes = [len(chunk.page_content) for chunk in splits]
print(f"Chunk sizes - Min: {min(chunk_sizes)}, Max: {max(chunk_sizes)}, Avg: {sum(chunk_sizes) / len(chunk_sizes):.0f}")

Chunk sizes - Min: 112, Max: 998, Avg: 790


### Preview Document Chunks
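Examining the first few chunks to understand the content structure. The filename extraction in this cell splits on `/`, so Windows-style paths (with backslashes) are printed in full, as in the output shown.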

In [11]:
print("Quick Text Preview:")
print("=" * 50)

for i in range(min(6, len(splits))):
    content = splits[i].page_content
    source = splits[i].metadata.get('source', 'Unknown')
    filename = source.split('/')[-1] if '/' in source else source
    
    print(f"Chunk {i+1} from {filename} ({len(content)} chars):")
    print(content[:200] + "..." if len(content) > 200 else content)
    print("-" * 40)

Quick Text Preview:
Chunk 1 from InsuranceDocuments\Motor Vehicle Insurance Policy Against Loss and Damage.pdf (964 chars):
Insurance Authority  
 
The Unified Motor Vehicle Insurance Policy Against Loss and Damage  
issued  pursuant to  the Regulation of Unified  Motor Vehicle Insurance Policies 
according to Insurance Au...
----------------------------------------
Chunk 2 from InsuranceDocuments\Motor Vehicle Insurance Policy Against Loss and Damage.pdf (268 chars):
the accident or was an injured party ; 
 
Therefore,  this Policy was entered into to cover the damages that befall on the 
Insured Motor Vehicle in the UAE during the insurance period according to th...
----------------------------------------
Chunk 3 from InsuranceDocuments\Motor Vehicle Insurance Policy Against Loss and Damage.pdf (959 chars):
Definitions:  
The following terms and phrases shall have the meanings indicated beside  each  of 
them  unless the context provides otherwise:  
 
Policy:  The Unified Motor Veh

---

## 4. Embeddings and Vector Database

### Initialize Embedding Model
Setting up OpenAI embeddings for converting text to vector representations.

In [12]:
# Initialize OpenAI embeddings model
embeddings_model = OpenAIEmbeddings(
    model="text-embedding-ada-002"  # Produces 1536-dimensional embeddings (see the output below)
)

# Test embeddings with sample text
sample_texts = [splits[0].page_content]
embeddings = embeddings_model.embed_documents(sample_texts)

### Embedding Model Testing
Verifying that embeddings are generated correctly.

In [13]:
print(f"Number of embeddings: {len(embeddings)}")
print(f"Embedding dimension: {len(embeddings[0])}")
print(f"Sample embedding (first 10 values): {embeddings[0][:10]}")

Number of embeddings: 1
Embedding dimension: 1536
Sample embedding (first 10 values): [0.0013451644002890914, -0.00765103255596719, -0.004609648689306264, -0.03582730583135431, -0.04965953248310294, 0.009724553854536255, -0.027087016532736485, -0.013635373926829458, 0.00594497017395262, -0.0019143985473220376]


In [14]:
# Test multiple embeddings
if len(splits) >= 3:
    multi_embeddings = embeddings_model.embed_documents([
        splits[0].page_content,
        splits[1].page_content,
        splits[2].page_content
    ])
    print(f"\nMultiple embeddings created: {len(multi_embeddings)} embeddings")
    print(f"All embeddings have same dimension: {all(len(emb) == len(multi_embeddings[0]) for emb in multi_embeddings)}")


Multiple embeddings created: 3 embeddings
All embeddings have same dimension: True
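
Semantic search compares embedding vectors by a distance or similarity measure. The sketch below computes cosine similarity in plain Python on toy vectors. It is only an illustration of the idea: the metric a vector store actually uses depends on its configuration, and `cosine_similarity` is a helper defined here, not a library call.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy 2-D vectors standing in for real 1536-dimensional embeddings.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite vectors
```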


### Create Vector Database
Building a Chroma vector database persisted to disk. The embedding cache uses `InMemoryStore`, so cached vectors avoid recomputation only within the current session.

In [16]:
# Set up cached embeddings to avoid recomputation
openai_embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    show_progress_bar=True
)

cache_store = InMemoryStore()
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings=openai_embeddings,
    document_embedding_cache=cache_store,
    namespace="insurance_docs_embeddings"
)

# Create persistent vector database
persist_dir = "./chroma_db"
db = Chroma.from_documents(
    documents=splits,
    embedding=cached_embeddings,
    persist_directory=persist_dir,
    collection_name="insurance_collection"
)

# Display creation results
print(f"Vector database created with {len(splits)} document chunks")
print(f"Database persisted to: {os.path.abspath(persist_dir)}")
print(f"Collection name: insurance_collection")

# Test the database
query_result = db.similarity_search("insurance policy", k=2)
print(f"\nTest query returned {len(query_result)} similar documents")

  0%|          | 0/1 [00:00<?, ?it/s]

Vector database created with 85 document chunks
Database persisted to: C:\Users\aatir\OneDrive\Documents\upGrad_MLAI\SemanticSpotter\chroma_db
Collection name: insurance_collection


  0%|          | 0/1 [00:00<?, ?it/s]


Test query returned 2 similar documents
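
The idea behind a cache-backed embedder can be sketched in a few lines: hash each text, look the hash up in a store, and call the underlying model only for texts not seen before. The class below is a simplified illustration and not LangChain's implementation; `CachingEmbedder` and `fake_embed` are hypothetical names, and the fake embedder just returns the text length.

```python
import hashlib


class CachingEmbedder:
    """Wrap an embedding function so each distinct text is embedded only once."""

    def __init__(self, embed_fn, namespace=""):
        self._embed_fn = embed_fn
        self._namespace = namespace
        self._cache = {}  # in-memory only: lost when the process ends

    def _key(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self._namespace}:{digest}"

    def embed_documents(self, texts):
        keys = [self._key(text) for text in texts]
        missing = {}
        for key, text in zip(keys, texts):
            if key not in self._cache:
                missing.setdefault(key, text)
        if missing:
            vectors = self._embed_fn(list(missing.values()))
            for key, vector in zip(missing, vectors):
                self._cache[key] = vector
        return [self._cache[key] for key in keys]


calls = []  # records every batch sent to the underlying embedder


def fake_embed(texts):
    calls.append(list(texts))
    return [[float(len(text))] for text in texts]


embedder = CachingEmbedder(fake_embed, namespace="insurance_docs_embeddings")
first = embedder.embed_documents(["clause a", "clause b"])
second = embedder.embed_documents(["clause a", "clause c"])
print(calls)  # the second batch contains only the text not seen before
```

Because the store here is a plain dictionary, the cache disappears when the kernel restarts, which mirrors the in-memory store used in the cell above.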


### Basic Similarity Search Function
Creating a simple search function for testing the vector database.

In [17]:
def similarity_search(query): 
    """Perform basic similarity search on the vector database"""
    return db.similarity_search(query)

### Enhanced Search Function with Formatting
Redefining `similarity_search` with a `k` parameter and formatted output for better readability. This definition replaces the basic version above.

In [18]:
def similarity_search(query, k=3):
    """Search for similar documents with pretty formatted output"""
    results = db.similarity_search(query, k=k)
    
    print("=" * 80)
    print(f"🔍 SEARCH QUERY: {query}")
    print("=" * 80)
    print(f"Found {len(results)} relevant documents:\n")
    
    for i, doc in enumerate(results, 1):
        # Extract source file name
        source = doc.metadata.get('source', 'Unknown')
        filename = Path(source).name if source != 'Unknown' else 'Unknown'
        
        # Get page number if available
        page = doc.metadata.get('page', 'N/A')
        
        print(f"📄 RESULT {i}")
        print(f"   Source: {filename}")
        print(f"   Page: {page}")
        print(f"   Content Length: {len(doc.page_content)} characters")
        print(f"   Content Preview:")
        print("   " + "-" * 60)
        
        content = doc.page_content.strip()
        lines = content.split('\n')
        for line in lines[:100]:  # Show first 100 lines
            print(f"   {line}")
        
        if len(lines) > 100:
            print(f"   ... ({len(lines) - 100} more lines)")
        
        print("   " + "-" * 60)
        print()
    
    return results

### Testing Enhanced Search
Testing the search functionality with a sample query.

In [19]:
docs = similarity_search("What happens if the motor vehicle becomes unroadworthy due to damage?", k=1)

  0%|          | 0/1 [00:00<?, ?it/s]

🔍 SEARCH QUERY: What happens if the motor vehicle becomes unroadworthy due to damage?
Found 1 relevant documents:

📄 RESULT 1
   Source: Motor Vehicle Insurance Policy Against Loss and Damage.pdf
   Page: 5
   Content Length: 928 characters
   Content Preview:
   ------------------------------------------------------------
   5. If the Insured Motor Vehicle is lost , proves to be irreparable , or that costs of 
   repair exceed 50% of the Motor Vehicle value before the accident, the  insured 
   value of the Motor Vehicle agreed upon between the Insurer and the Insured on 
   signing of the Insurance Policy will be the basis of calculation of the 
   compensation of loss and damage insured hereunder after deduction of the 
   Depreciation  Percentage of 20% f rom the insured value, and taking into account 
   the fraction of insurance period  (i.e., the proportion of  the period from the 
   commencement date of the insurance period to the date of the accident  to the 
   total insuran

---

## 5. Advanced Retrieval System

### Retriever with Cross-Encoder Reranking
Implementing advanced retrieval with MMR (Maximal Marginal Relevance) and cross-encoder reranking for improved accuracy.

In [22]:
def create_retriever(top_k=5):
    """Create retriever with reranking"""
    # Base retriever with MMR. "score_threshold" is intended for
    # search_type="similarity_score_threshold", so it is not passed here.
    base_retriever = db.as_retriever(
        search_type="mmr",
        search_kwargs={"k": top_k})
    
    # Add reranker
    cross_encoder = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
    reranker = CrossEncoderReranker(model=cross_encoder, top_n=top_k)
    
    return ContextualCompressionRetriever(
        base_compressor=reranker,
        base_retriever=base_retriever
    )

def search_documents(query, top_k=1):
    """Search and display results"""
    retriever = create_retriever(top_k)
    docs = retriever.invoke(query)
    
    print(f"Found {len(docs)} documents for: '{query}'\n")
    
    for i, doc in enumerate(docs, 1):
        source = Path(doc.metadata.get('source', 'Unknown')).name
        page = doc.metadata.get('page', 'N/A')
        
        print(f"Document {i} - {source} (Page {page})")
        print(f"Content: {doc.page_content[:500]}...")
        print("-" * 50)
    
    return docs
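
MMR, used by the base retriever above, trades off relevance to the query against redundancy with results already chosen. The sketch below is a plain-Python illustration of that selection loop on toy 2-D vectors; it is not the library's implementation, and `mmr_select` is a hypothetical helper. The `lambda_mult` name follows the parameter LangChain exposes for MMR, where 1.0 means pure relevance.

```python
import math


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidate_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    relevance = [_cosine(query_vec, vec) for vec in candidate_vecs]
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (_cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.12], [1.0, -1.0]]  # the first two are near-duplicates
print(mmr_select(query, candidates, k=2, lambda_mult=0.5))  # diversity-aware pick
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With `lambda_mult=0.5` the near-duplicate second candidate is skipped in favour of the more distinct third one; with `lambda_mult=1.0` the two most relevant candidates are returned regardless of overlap.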

In [23]:
docs = search_documents("What are the Depreciation Percentages for Taxi Vehicles, Public Transport Vehicles and Rental Vehicles According to the Date of First Registration and Use")

  0%|          | 0/1 [00:00<?, ?it/s]

Found 1 documents for: 'What are the Depreciation Percentages for Taxi Vehicles, Public Transport Vehicles and Rental Vehicles According to the Date of First Registration and Use'

Document 1 - Motor Vehicle Insurance Policy Against Loss and Damage.pdf (Page 11)
Content: Schedule No. (2)  
Depreciation Percentages for Taxi  Vehicles , Public Transport Vehicles and Rental 
Vehicles According to the Date of First Registration and Use  
Year  Percentage  
Last si x months of the first year  10% 
Second  20% 
Third  25% 
Fourth  30% 
Fifth  35% 
Sixth  and above  40%...
--------------------------------------------------
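
In `create_retriever`, the base retriever's `k` and the reranker's `top_n` are both set to `top_k`, so the reranker can reorder candidates but never narrows them down. A common pattern is to fetch more candidates than needed and let the reranker keep the best few. The sketch below shows that pattern with a toy word-overlap scorer standing in for a cross-encoder; `rerank_top_n` and `word_overlap_score` are hypothetical helpers.

```python
def rerank_top_n(query, candidates, score_fn, top_n=3):
    """Score each (query, candidate) pair and keep the top_n highest-scoring texts."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def word_overlap_score(query, text):
    """Toy stand-in for a cross-encoder: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


# Three "retrieved" candidates, of which only the best two are kept.
candidates = [
    "courts competent to determine disputes",
    "taxi vehicles are excluded",
    "depreciation percentage schedule for taxi vehicles",
]
top = rerank_top_n(
    "depreciation percentage for taxi vehicles",
    candidates,
    word_overlap_score,
    top_n=2,
)
print(top)
```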


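### Build the RAG Chain
Pulling the `rlm/rag-prompt` prompt from the LangChain Hub and defining a helper that joins retrieved chunks into a single context string.

In [30]: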
prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    if not docs:
        return "No relevant documents found."
    return "\n\n".join(doc.page_content for doc in docs if doc.page_content)

Please use the `langsmith sdk` instead:
  pip install langsmith
Use the `pull_prompt` method.
  res_dict = client.pull_repo(owner_repo_commit)


In [31]:
def create_rag_chain(retriever, llm_model=None, temperature=0):
    """Create a RAG chain with configurable components."""
    llm_model = llm_model or ChatOpenAI(temperature=temperature)
    
    return (
        {
            "context": retriever | format_docs,
            "question": RunnablePassthrough()
        }
        | prompt
        | llm_model
        | StrOutputParser()
    )
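
The `|` pipeline above can be read as four stages run in order: retrieve, format the context, fill the prompt, and generate. The sketch below spells out that flow with ordinary function calls and stand-in components. All names in it (`run_rag`, `join_context`, `fake_retrieve`, `fake_prompt`, `echo_model`) are hypothetical and exist only to show the data flow, not how LangChain executes the chain internally.

```python
def join_context(texts):
    """Join non-empty chunk texts, or return a fallback message when there are none."""
    if not texts:
        return "No relevant documents found."
    return "\n\n".join(text for text in texts if text)


def run_rag(question, retrieve, build_prompt, generate):
    """Run the chain stages in order using plain function calls."""
    context = join_context(retrieve(question))     # "context": retriever | format_docs
    prompt_text = build_prompt(context, question)  # | prompt
    return str(generate(prompt_text))              # | llm_model | StrOutputParser()


# Stand-ins for the retriever, prompt template, and model.
def fake_retrieve(question):
    return ["Clause 1 text.", "", "Clause 2 text."]


def fake_prompt(context, question):
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def echo_model(prompt_text):
    return prompt_text  # a real model would return generated text instead


answer = run_rag("Which courts are competent?", fake_retrieve, fake_prompt, echo_model)
print(answer)
```

Echoing the prompt back makes it easy to inspect exactly what text the model would receive for a given question.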

In [32]:
rag_chain = create_rag_chain(create_retriever(top_k=50))

In [33]:
query = "Which courts are competent to determine disputes from the Policy?" 
rag_chain.invoke(query)

  0%|          | 0/1 [00:00<?, ?it/s]

C:\ProgramData\Anaconda3\lib\site-packages\langchain_community\embeddings\openai.py:500: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.10/migration/
  response = response.dict()


'The courts of the United Arab Emirates are competent to determine any disputes arising from this Policy.'

In [34]:
query = "What is the definition of natural disaster as per the policy?" 
rag_chain.invoke(query)


'A natural disaster is defined as any general phenomenon that arises from nature such as floods, tornadoes, hurricanes, volcanoes, earthquakes, and quakes, leading to extensive damage and requiring a decision from the concerned authority in the country. Floods are considered events within the concept of natural disasters. This definition is outlined in the policy regarding insurance coverage for such occurrences.'

In [35]:
query = "Which is a third party liability?" 
rag_chain.invoke(query)


'Third party liability is the liability for injuries and damages arising from the use of the insured motor vehicle to a third party or injured party. It covers bodily injury to a third party, either inside or outside the motor vehicle, and property damages to a third party. This type of liability does not cover accidents outside the borders of the state or those caused by natural disasters, warlike operations, or civil unrest.'

In [36]:
query = "When can the policy be terminated before its expiration?" 
rag_chain.invoke(query)


'The policy can be terminated before its expiration if there are serious grounds for termination during the policy period, with a notice sent to the insured thirty days prior to the fixed date of termination. The company must refund the paid premium after deducting a portion in proportion to the period the policy has remained in effect. Additionally, the policy can be terminated in case of a total loss to the motor vehicle, provided its registration is deleted with a report confirming it is unroadworthy.'