
# RAG Proof of Concept for Contract Documents using BM25 Sparse Vectors

In this notebook, we will build a Retrieval-Augmented Generation (RAG) system using BM25 sparse vectors for document chunk retrieval. The system will work on contract documents, allowing us to extract relevant chunks based on a user's query.

We will implement the following steps:
1. **Document Ingestion**: Extract text from PDF contract documents.
2. **Chunking**: Split the document text into smaller, manageable chunks.
3. **Storage**: Store the chunks in MongoDB so they can be fetched later.
4. **Indexing with BM25**: Build a BM25 index over the stored chunks.
5. **Querying**: Retrieve the top N relevant document chunks for a given query using BM25 scores.
6. **Output**: Display the results.

By the end of this notebook, you will have a working keyword-based retrieval system. It can later be extended, for example with LLM-based embeddings that match passages by meaning rather than by shared words.



## Step 2: Reading PDF Contract Documents

In this step, we will use the `PyMuPDF` library to extract text from PDF contract documents. PyMuPDF provides a simple way to open PDF files and extract text from each page. This allows us to load the contents of a contract document for further processing.

We will define a function `extract_text_from_pdf` that accepts the file path of a PDF and returns the extracted text.

### Code to Extract Text from PDFs:


In [None]:

import fitz  # PyMuPDF

# Function to extract text from a PDF document
def extract_text_from_pdf(pdf_path):
    # The context manager closes the PDF once all pages have been read
    with fitz.open(pdf_path) as doc:
        # Extract the text of each page and concatenate the results
        return "".join(page.get_text() for page in doc)

# Example usage
pdf_path = "path_to_contract_document.pdf"
document_text = extract_text_from_pdf(pdf_path)
print(document_text[:500])  # Print the first 500 characters of the extracted text



## Step 3: Chunking the Document Text

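After extracting the text from the PDF contract document, we need to split it into smaller chunks. Chunking lets retrieval return a focused passage instead of the whole contract, and it keeps each scored unit to a comparable length. We will split the document into chunks of a fixed size (e.g., 500 characters). Note that a fixed character cut ignores word, sentence, and clause boundaries, so a chunk may start or end mid-word.

The function `chunk_document` takes the entire document text and splits it into consecutive chunks of at most `chunk_size` characters (500 by default). Only the last chunk can be shorter.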

### Code to Chunk Document Text:


In [None]:

# Function to chunk document text into fixed-size chunks
def chunk_document(text, chunk_size=500):
    # Split text into chunks of specified size (e.g., 500 characters per chunk)
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

# Example usage to chunk the extracted document text
chunks = chunk_document(document_text)
print(f"Number of chunks: {len(chunks)}")
print(f"First chunk: {chunks[0][:100]}...")  # Print first 100 characters of the first chunk
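
Because a fixed cut can split a clause across two chunks, one common mitigation is to let neighbouring chunks overlap. The following is a minimal sketch of that idea, not part of the pipeline above; the function name `chunk_with_overlap` and the `overlap` parameter are hypothetical names introduced for illustration.

```python
def chunk_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into chunks of up to chunk_size characters, where each
    chunk starts with the last `overlap` characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap  # how far the window advances each time
    # Stop early enough that the final chunk is not just a repeat of the
    # previous chunk's tail
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


sample = "abcdefghijklmnopqrstuvwxyz"
overlapping_chunks = chunk_with_overlap(sample, chunk_size=10, overlap=3)
print(overlapping_chunks)
```

With `overlap=0` this behaves like `chunk_document`. A larger overlap raises the chance that a clause appears whole in at least one chunk, at the cost of storing and scoring more text.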



## Step 4: Storing Document Chunks in MongoDB

Once we have chunked the document text, we need to store these chunks in a database for later retrieval. MongoDB suits this proof of concept because each chunk can be saved as a schemaless document, with no table definitions required up front.

We will store the chunks in a collection called `chunks` within a MongoDB database `contract_documents`. Each chunk will be associated with its document ID and chunk text.

### Code to Store Document Chunks in MongoDB:


In [None]:

import pymongo
from bson import ObjectId

# Establish MongoDB connection
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client['contract_documents']
collection = db['chunks']

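# Function to store document chunks in MongoDB
def store_document_chunks(doc_id, document_text):
    chunks = chunk_document(document_text)
    if not chunks:
        return  # Nothing to store; insert_many raises an error on an empty list
    # One MongoDB document per chunk, tagged with the ID of its source document
    chunk_data = [{'document_id': doc_id, 'text': chunk} for chunk in chunks]
    collection.insert_many(chunk_data)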

# Example usage
doc_id = ObjectId()  # Generate a unique document ID
store_document_chunks(doc_id, document_text)
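
The shape of the stored records can be sketched without a database connection. The example below also adds a `chunk_index` field so that chunks can be put back in document order; that field, the helper name `build_chunk_records`, and the string ID `"doc-001"` (standing in for an `ObjectId`) are assumptions made for illustration and are not used elsewhere in this notebook.

```python
def build_chunk_records(doc_id, chunks):
    """Return one dict per chunk, tagged with its position in the document."""
    return [
        {"document_id": doc_id, "chunk_index": i, "text": chunk}
        for i, chunk in enumerate(chunks)
    ]


# "doc-001" is a placeholder ID used only in this sketch
records = build_chunk_records("doc-001", ["First clause ...", "Second clause ..."])
for record in records:
    print(record)
```

A list built this way could be passed to `collection.insert_many` in place of `chunk_data`. Reading the chunks back with a sort on `chunk_index` would then make their order explicit rather than dependent on how MongoDB happens to return them.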



## Step 5: BM25 Indexing for Document Chunks

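Now that we have stored the document chunks in MongoDB, we need to index them using BM25. BM25 is a widely used ranking function for information retrieval. Like TF-IDF, it rewards chunks that contain the query terms often (term frequency) and weights rare terms more heavily (inverse document frequency). In addition, it dampens the effect of repeated terms and normalizes for chunk length, so a long chunk does not win simply by containing more words.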

We will use the `rank_bm25` library to build the BM25 index. This library computes BM25 scores for each chunk in the document corpus.
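
To make the scoring concrete, here is a minimal pure-Python sketch of a BM25 score. It is for illustration only and is not how the rest of this notebook computes scores: it assumes a non-empty corpus, uses the common parameter values `k1=1.5` and `b=0.75`, and uses an IDF variant that adds 1 inside the logarithm. The `rank_bm25` library may handle IDF differently, so its numbers will generally not match these. The name `bm25_scores` and the toy corpus are made up for this example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every tokenized chunk against the query (corpus must be non-empty)."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs
    # Document frequency: in how many chunks does each term occur?
    doc_freq = Counter(term for doc in tokenized_corpus for term in set(doc))

    scores = []
    for doc in tokenized_corpus:
        term_counts = Counter(doc)
        # Chunks longer than average are penalized, shorter ones boosted
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_counts[term]
            if tf == 0:
                continue  # term absent from this chunk contributes nothing
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            # Term frequency saturates: each repeat adds less than the last
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


toy_corpus = [
    "payment is due within thirty days".split(),
    "the supplier shall deliver the goods".split(),
    "late payment incurs interest on the payment amount".split(),
]
toy_scores = bm25_scores("payment terms".split(), toy_corpus)
print(toy_scores)
```

In this toy corpus the second chunk scores zero because it contains neither query term, and the third chunk outranks the first because it mentions "payment" twice, even after the penalty for being longer.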

### Code to Create BM25 Index:


In [None]:

from rank_bm25 import BM25Okapi
import numpy as np

# Function to build an in-memory BM25 index over the stored document chunks
def create_bm25_index():
    # Fetch chunks from MongoDB
    chunks = list(collection.find())
    corpus = [chunk['text'] for chunk in chunks]
    
    # Tokenize on whitespace (case-sensitive; queries must be tokenized the same way)
    tokenized_corpus = [doc.split() for doc in corpus]
    
    # Create BM25 index
    bm25 = BM25Okapi(tokenized_corpus)
    
    return bm25, chunks

bm25_index, chunks_data = create_bm25_index()



## Step 6: Querying for Relevant Document Chunks

With the BM25 index created, we can now query the document chunks to retrieve the most relevant ones for a given search query. We will define a function `query_bm25` that accepts a search query, computes BM25 scores for all document chunks, and returns the top N most relevant chunks.

The query is tokenized the same way as the corpus (by splitting on whitespace), and BM25 then scores every chunk against the query tokens. Chunks that share no token with the query receive a score of 0.

### Code to Query BM25:


In [None]:

# Function to query the BM25 index
def query_bm25(query, bm25, chunks_data, top_n=5):
    tokenized_query = query.split()  # Tokenize the query
    scores = bm25.get_scores(tokenized_query)
    
    # Get top N relevant chunks based on BM25 score
    top_n_indexes = np.argsort(scores)[-top_n:][::-1]
    
    relevant_chunks = []
    for index in top_n_indexes:
        chunk = chunks_data[index]
        relevant_chunks.append({
            'document_id': chunk['document_id'],
            'text': chunk['text'],
            'score': float(scores[index]),  # plain float instead of a NumPy scalar
        })
    
    return relevant_chunks

# Example query
query = "contract payment terms"
relevant_chunks = query_bm25(query, bm25_index, chunks_data)
for chunk in relevant_chunks:
    print(f"Document ID: {chunk['document_id']}, Score: {chunk['score']}, Text: {chunk['text'][:100]}...")
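
Splitting on whitespace is case-sensitive and leaves punctuation attached, so the query token `payment` would not match `Payment` or `payment,` in a chunk. One possible refinement is a shared tokenizer that lowercases the text and keeps only runs of letters and digits. The sketch below is an assumption for illustration: the name `tokenize` is hypothetical, and the pattern only covers ASCII letters and digits, so accented characters would be dropped.

```python
import re


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


heading = "Payment Terms:"
print(tokenize(heading))   # normalized tokens
print(heading.split())     # whitespace splitting, for comparison
```

To have any effect, the same function would need to replace both `doc.split()` in `create_bm25_index` and `query.split()` in `query_bm25`, since BM25 only matches tokens that are identical on both sides.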



## Step 7: Outputting Results to the User

Finally, we will display the relevant document chunks to the user. We will show the top N relevant chunks, including their BM25 score, document ID, and a snippet of the chunk text.

This step ensures that the user can easily access the relevant parts of the contract document based on their query.

### Code to Display Results:


In [None]:

# Display top N relevant chunks to the user
def display_relevant_chunks(chunks):
    for chunk in chunks:
        print(f"Document ID: {chunk['document_id']}, Score: {chunk['score']}, Text: {chunk['text'][:150]}...")

display_relevant_chunks(relevant_chunks)



## Step 8: Future Extension to LLM Embeddings

In the future, we can extend this system to use embeddings generated by a large language model (LLM) instead of BM25. By using LLMs, we can capture more nuanced semantic meaning in both the query and document chunks.

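We can generate embeddings for the document chunks and the user query, and then compute the cosine similarity between them to retrieve the most relevant chunks. Unlike BM25, this approach can match a chunk that shares no words with the query (for example, a query about "payment" and a clause about "remuneration"), so it may work better for paraphrased or loosely worded queries.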

### LLM-based Retrieval Pseudocode:


In [None]:

# Placeholder for LLM embedding retrieval
# Embed document chunks and user query using LLM
# Compute cosine similarity between query embedding and document chunk embeddings
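
The similarity step in the pseudocode above can be sketched independently of any particular embedding model. The example below assumes the embeddings already exist as plain lists of floats. The three-dimensional vectors are made up for illustration (real embeddings typically have hundreds of dimensions), and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, chunk_vecs, top_n=5):
    """Return (chunk index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up vectors standing in for real chunk and query embeddings
toy_chunk_vecs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
toy_query_vec = [1.0, 0.0, 0.0]
ranking = rank_by_similarity(toy_query_vec, toy_chunk_vecs, top_n=2)
print(ranking)
```

The returned indexes play the same role as `top_n_indexes` in `query_bm25`, so the rest of the pipeline (looking up chunks and displaying them) could stay unchanged.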



## Conclusion

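In this notebook, we built the retrieval component of a Retrieval-Augmented Generation (RAG) system using BM25. The generation step, in which an LLM answers the query from the retrieved chunks, is not covered here. We: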
1. Extracted text from contract PDF documents.
2. Split the documents into chunks and stored them in MongoDB.
3. Indexed the chunks using BM25 for efficient retrieval.
4. Implemented a query function to retrieve relevant chunks based on BM25 scores.

This system can be extended to use LLM-based embeddings for retrieval that accounts for meaning as well as shared keywords. As written, the BM25 index is rebuilt in memory from MongoDB each time the notebook runs, which is adequate for a proof of concept and a reasonable starting point for a larger document retrieval system.
