| S. No. | BITS ID      | Name                   | Contribution |
|--------|------------|----------------------|--------------|
| 1      | 2023aa05972 | ABHISHEK SHUKLA      | 100%         |
| 2      | 2023aa05128 | RAJ KUMAR            | 100%         |
| 3      | 2023ab05157 | RAVI KRISHNA MAYURA  | 100%         |
| 4      | 2023aa05152 | SHRUTI S KUMAR       | 100%         |
| 5      | 2023ab05148 | DEEPAK DAMODAR SHARMA | 100%         |


In [48]:
# Import necessary libraries and modules
import os
import PyPDF2
import torch
import faiss
import spacy
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
import pandas as pd
from rank_bm25 import BM25Okapi
from collections import defaultdict
import re
from transformers import pipeline

In [49]:
# Define a function to extract text from a PDF file
def extract_text_from_pdf(file_path):
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # extract_text() can yield nothing for image-only pages, so fall back to ''
        page_texts = [page.extract_text() or '' for page in reader.pages]
    return ''.join(page_texts)

# List to store the contents of the PDF files
contents = []

# List of file paths for the PDF files to be processed
file_paths = [
    './data/SBI_Annual_Report_2022.pdf',
    './data/SBI_Annual_Report_2023.pdf'
]

# Extract the filenames from the file paths
filenames = [os.path.basename(file_path) for file_path in file_paths]

# Extract text from each PDF file and store it in the 'contents' list
for file_path in file_paths:
    text_data = extract_text_from_pdf(file_path)
    contents.append(text_data)

# Now 'contents' will have the text data from both PDF files

In [6]:
# Create a DataFrame
dataset = pd.DataFrame({
    'filename': filenames,
    'content': contents
})

# Display the dataset
print(dataset.head())

                     filename  \
0  SBI_Annual_Report_2022.pdf   
1  SBI_Annual_Report_2023.pdf   

                                             content  
0  01\nAnnual Report 2021 – 22\nAnnual Report 202...  
1  /gid00019/gid00032/gid00046/gid00043/gid00042/...  


In [50]:
def clean_text(text):
    # Replace Private Use Area glyphs (such as the bullet U+F0FC) and their literal "\uf0fc" escaped form
    text = re.sub(r"[\uf000-\uf8ff]|\\uf[a-fA-F0-9]{3}", " ", text)
    # Collapse runs of underscores (form-field rules) into a space
    text = re.sub(r"[_]+", " ", text)
    # Remove '/gid000XX' patterns
    text = re.sub(r"/gid\d{5}", "", text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    # Convert to lowercase
    text = text.lower()

    return text

In [8]:
dataset['processed_content'] = dataset['content'].apply(clean_text)
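The effect of the cleaning steps is easiest to see on a small string. The sketch below mirrors the `/gid` removal, underscore collapsing, whitespace normalization and lowercasing on a made-up input; `clean_text_demo` and the sample string are illustrative assumptions, not part of the pipeline.

```python
import re

def clean_text_demo(text):
    # Hypothetical stand-in that mirrors the main steps of clean_text
    text = re.sub(r"/gid\d{5}", "", text)      # drop '/gid000XX' glyph ids
    text = re.sub(r"[_]+", " ", text)          # underscores become a space
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace and newlines
    return text.lower()

# Made-up input resembling the raw PDF extraction
raw = "/gid00019/gid00032Annual___Report\n2022-23  Net Profit"
cleaned = clean_text_demo(raw)
print(cleaned)
```

Newlines are treated as ordinary whitespace here, so page and paragraph breaks do not survive cleaning.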

### 2. Basic RAG Implementation

In [None]:
# Load the spaCy model for English language processing
nlp = spacy.load("en_core_web_sm")

# Define a function that groups consecutive sentences into chunks of roughly max_chunk_size characters
def semantic_chunking(text, max_chunk_size=500):
    # Process the text using the spaCy model
    doc = nlp(text)
    
    # Initialize variables to store chunks and the current chunk
    chunks, current_chunk = [], ""
    
    # Iterate through the sentences in the processed text
    for sent in doc.sents:
        # Check if adding the current sentence exceeds the maximum chunk size
        if len(current_chunk) + len(sent.text) > max_chunk_size:
            # If it does, append the current chunk to the list of chunks and start a new chunk
            chunks.append(current_chunk)
            current_chunk = sent.text
        else:
            # Otherwise, add the sentence to the current chunk
            current_chunk += " " + sent.text
    
    # Append any remaining text in the current chunk to the list of chunks
    if current_chunk:
        chunks.append(current_chunk)
    
    # Print the total number of chunks created and a sample of the chunks
    print("Total chunks created:", len(chunks))
    print("Sample chunks:", chunks[:3])
    
    # Return the list of chunks
    return chunks

In [52]:
# Apply the semantic_chunking function to the 'processed_content' column of the dataset
# and store the resulting chunks in the 'text_chunks' column
dataset['text_chunks'] = dataset['processed_content'].apply(semantic_chunking)

Total chunks created: 2048
Sample chunks: [' 01 annual report 2021 – 22 annual report 2021 - 22 setting new standards in banking excellence state bank bhavan, corporate centre, madame cama road, mumbai, maharashtra - 400021, india bank.sbi scan qr code to download02 annual report 2 021 – 22 contents 03 notice 09 awards & recognitions 10 about sbi 11 sbi’s journey through numbers 12 setting new standards in banking excellence.', '14 sbi group structure 16 financial highlights: 10 years at a glance 17 ratings 18 central board of directors 20 committees of the board/ members of central management committee/ members of local boards/ bank’s auditors 24 chairman’s message 30 directors’ report 30 (i) e conomic backdrop and banking environment 32 (ii) f inancial performance 33 (iii) cor e operations 33 1. re tail & digital banking group 33 a. p ersonal banking 36 b.', 'a ny time channels 37 2. s mall & medium enterprises 39 3. r ural banking 40 4. g overnment business 41 5. d igital & transact
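The greedy packing used for chunking can be tried without loading spaCy. The sketch below swaps in a naive regex sentence splitter (an assumption, and much cruder than spaCy's segmentation) and also avoids emitting an empty first chunk; `pack_sentences` and the sample text are hypothetical.

```python
import re

def pack_sentences(text, max_chunk_size=60):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sent in sentences:
        candidate = f"{current} {sent}".strip()
        if current and len(candidate) > max_chunk_size:
            # Adding this sentence would overflow: close the chunk, start a new one
            chunks.append(current)
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = ("Net profit rose in FY22. Total income grew slowly. "
          "Deposits increased across segments. Asset quality improved.")
demo_chunks = pack_sentences(sample)
for chunk in demo_chunks:
    print(len(chunk), repr(chunk))
```

A single sentence longer than `max_chunk_size` still becomes its own oversized chunk; the limit is a target, not a hard cap.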

In [11]:
# Step 4: Generate embeddings using an open-source model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

In [53]:
# Apply the model's encode function to the 'text_chunks' column of the dataset
# to generate embeddings for each chunk, and store the resulting embeddings in the 'embeddings' column
dataset['embeddings'] = dataset['text_chunks'].apply(lambda x: np.array(model.encode(x, convert_to_numpy=True)))

In [54]:
# Iterate through the 'embeddings' column of the dataset and print the shape of each embedding
for i, emb in enumerate(dataset['embeddings']):
    print(f"Embedding {i} shape: {np.array(emb).shape}")

Embedding 0 shape: (2048, 384)
Embedding 1 shape: (2192, 384)


In [55]:
# Display the first few rows of the dataset to inspect its contents
dataset.head()

|   | filename | content | processed_content | text_chunks | embeddings |
|---|----------|---------|-------------------|-------------|------------|
| 0 | SBI_Annual_Report_2022.pdf | 01\nAnnual Report 2021 – 22\nAnnual Report 202... | 01 annual report 2021 – 22 annual report 2021 ... | [ 01 annual report 2021 – 22 annual report 202... | [[-0.010079363, -0.0644554, -0.080679506, -0.0... |
| 1 | SBI_Annual_Report_2023.pdf | /gid00019/gid00032/gid00046/gid00043/gid00042/... | draft contents company overview sbi at a glanc... | [, draft contents company overview sbi at a gl... | [[-0.1188385, 0.04829873, -0.0025481575, -0.01... |


In [56]:
# Flatten all embeddings into a single NumPy array
all_embeddings = np.vstack(dataset["embeddings"].values)

# Get the embedding dimension
embedding_dim = all_embeddings.shape[1]

print(f"Final Embedding Matrix Shape: {all_embeddings.shape}")  # (Total_chunks, 384)

Final Embedding Matrix Shape: (4240, 384)


In [57]:
# Create a FAISS index (L2 distance for similarity search)
index = faiss.IndexFlatL2(embedding_dim)

# Add embeddings to FAISS
index.add(all_embeddings)

# Check if embeddings are stored
print("FAISS index size:", index.ntotal)  # Should match total chunks

FAISS index size: 4240
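`IndexFlatL2` reports squared Euclidean distances. If the embeddings are unit-length (an assumption about the model output that is worth verifying), the squared distance and cosine similarity are tied by d² = 2 − 2·cos, so a threshold on 1 / (1 + d²) can be read as a cosine threshold. A small stdlib sketch with made-up vectors:

```python
import math

def normalize(v):
    # Scale a vector to unit length
    n = math.hypot(*v)
    return [x / n for x in v]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])

d2 = squared_l2(a, b)
cos_ab = cosine(a, b)
similarity = 1 / (1 + d2)  # same distance-to-similarity mapping as the retrieval code

print(f"squared L2: {d2:.4f}, 2 - 2*cos: {2 - 2 * cos_ab:.4f}")
print(f"1 / (1 + d2): {similarity:.4f}")
```

Under the unit-length assumption, a cutoff of 0.5 on 1 / (1 + d²) means d² ≤ 1, which is the same as a cosine similarity of at least 0.5.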


In [59]:
# Define a function to retrieve the most relevant chunk of text based on the query
def retrieve_most_relevant_chunk(query, top_k=5):
    # Step 1: Generate query embedding
    query_embedding = model.encode([query], convert_to_numpy=True).astype(np.float32)

    # Step 2: Perform FAISS search to find the top_k most similar chunks
    distances, indices = index.search(query_embedding, top_k)

    # Cumulative chunk counts mark where each document ends in the stacked index
    doc_boundaries = np.cumsum([len(chunks) for chunks in dataset['text_chunks']])

    # Step 3: Convert distances to similarity scores and collect results
    results = []
    for i, idx in enumerate(indices[0]):
        if idx == -1:  # Ignore invalid indices
            continue

        # Map the global FAISS row to (document, position within that document)
        doc_idx = int(np.searchsorted(doc_boundaries, idx, side='right'))
        if doc_idx >= len(dataset):
            continue
        local_idx = int(idx) - (int(doc_boundaries[doc_idx - 1]) if doc_idx > 0 else 0)

        # Retrieve chunk text and calculate similarity score
        chunk_text = dataset['text_chunks'][doc_idx][local_idx]
        similarity_score = 1 / (1 + distances[0][i])  # IndexFlatL2 returns squared L2 distances

        results.append((dataset['filename'][doc_idx], chunk_text, similarity_score))

    # Step 4: Sort results by similarity score and pick the most relevant chunk
    results = sorted(results, key=lambda x: x[2], reverse=True)
    most_relevant_chunk = results[0] if results else None

    min_similarity = 0.5
    if most_relevant_chunk and most_relevant_chunk[2] >= min_similarity:
        print(f"Document: {most_relevant_chunk[0]}")
        print(f"Similarity Score: {most_relevant_chunk[2]:.4f}")
        print(f"Most Relevant Chunk:\n{most_relevant_chunk[1][:500]}...")  # Show first 500 characters
        return  # The match is printed; nothing is returned
    else:
        return "No relevant chunk found."

In [60]:
# Test query 1: a question the annual reports can answer

query = "What was the net profit in 2022?"
retrieve_most_relevant_chunk(query)

Document: SBI_Annual_Report_2022.pdf
Similarity Score: 0.5816
Most Relevant Chunk:
for and on behalf of the central board of directors chairman date: 13 th may, 2022net profit margin: the net profit has registered yoy growth of 55.19% (from a profit of `20,410 cr in fy21 to net profit of `31,676 cr during fy22) as against yoy growth of only 2.39% in total income (from `3,08,647 cr in fy21 to `3,16,021 cr in fy22)....


In [61]:

# Test query 2: an out-of-domain question
query = "Who won 2023 ODI worldcup in cricket"
retrieve_most_relevant_chunk(query)

'No relevant chunk found.'
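Each FAISS hit is a row number in the stacked embedding matrix, so it has to be translated into a document number plus a position inside that document before `text_chunks` is indexed. A minimal stdlib sketch of that translation, using the chunk counts printed earlier (2048 and 2192); `locate` is a hypothetical helper name.

```python
from bisect import bisect_right
from itertools import accumulate

chunk_counts = [2048, 2192]                  # chunks per document, from the embedding shapes
boundaries = list(accumulate(chunk_counts))  # [2048, 4240]: end row (exclusive) of each document

def locate(global_idx):
    """Map a row of the stacked index to (doc_idx, local_idx)."""
    doc_idx = bisect_right(boundaries, global_idx)
    offset = boundaries[doc_idx - 1] if doc_idx > 0 else 0
    return doc_idx, global_idx - offset

for row in (0, 2047, 2048, 4239):
    print(row, "->", locate(row))
```

`bisect_right` plays the role of `np.searchsorted(..., side='right')`; subtracting the offset is what turns the global row into an index that is valid inside the second document.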

### 3. Advanced RAG Implementation

In [63]:
# Display a concise summary of the dataset, including the data types and non-null counts for each column
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   filename           2 non-null      object
 1   content            2 non-null      object
 2   processed_content  2 non-null      object
 3   text_chunks        2 non-null      object
 4   embeddings         2 non-null      object
dtypes: object(5)
memory usage: 212.0+ bytes


In [64]:
# Initialize the CrossEncoder model for reranking
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Tokenize the text chunks for BM25
tokenized_chunks = [chunk.split() for doc_chunks in dataset['text_chunks'] for chunk in doc_chunks]

# Initialize the BM25 model with the tokenized chunks
bm25 = BM25Okapi(tokenized_chunks)

# Initialize the memory store to cache query results
memory_store = defaultdict(list)
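BM25 scores are unbounded, while a 1 / (1 + distance) score lies in (0, 1], so the two need a common scale before they are mixed. The sketch below divides BM25 by its maximum and blends with a weight `alpha`; `hybrid_scores` and all numbers are made up for illustration.

```python
def hybrid_scores(dense, sparse, alpha=0.5):
    # Scale sparse (BM25-style) scores by their maximum; an all-zero vector stays zero
    top = max(sparse, default=0.0)
    scaled = [s / top if top > 0 else 0.0 for s in sparse]
    # Weighted blend: alpha on the dense score, (1 - alpha) on the sparse score
    return [alpha * d + (1 - alpha) * s for d, s in zip(dense, scaled)]

dense = [0.58, 0.52, 0.40]   # made-up dense similarity scores
sparse = [3.0, 6.0, 0.0]     # made-up BM25 scores for the same three chunks

fused = hybrid_scores(dense, sparse)
ranking = sorted(range(len(fused)), key=fused.__getitem__, reverse=True)
print(fused)
print("ranking:", ranking)
```

Here the second chunk overtakes the first because its keyword score is twice as high, even though its dense score is slightly lower.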

In [66]:
# Function to retrieve and rerank with memory
def retrieve_hybrid_rerank_with_memory(query, top_k=5, alpha=0.5, min_similarity=0.5):
    # Validate and process the query with guardrails
    # (answer_query is defined in the guardrail section and must be run first)
    processed_query = answer_query(query)
    if "Invalid query" in processed_query:
        return processed_query

    # Check Memory
    if query in memory_store:
        print("Fetching results from memory...")
        return memory_store[query]

    # Encode query for FAISS
    query_embedding = model.encode([query], convert_to_numpy=True).astype(np.float32)

    # FAISS search
    distances, indices = index.search(query_embedding, top_k)

    # BM25 search over the same flattened chunk order as the FAISS index;
    # the chunks were lowercased during cleaning, so lowercase the query too
    bm25_scores = bm25.get_scores(query.lower().split())
    max_bm25 = max(bm25_scores)

    # Cumulative chunk counts mark where each document ends in the stacked index
    doc_boundaries = np.cumsum([len(chunks) for chunks in dataset['text_chunks']])

    # Collect results
    results = []
    for i, idx in enumerate(indices[0]):
        if idx == -1:
            continue

        # Map the global FAISS row to (document, position within that document)
        doc_idx = int(np.searchsorted(doc_boundaries, idx, side='right'))
        if doc_idx >= len(dataset):
            continue
        local_idx = int(idx) - (int(doc_boundaries[doc_idx - 1]) if doc_idx > 0 else 0)

        chunk_text = dataset['text_chunks'][doc_idx][local_idx]
        faiss_score = 1 / (1 + distances[0][i])  # Convert L2 distance to similarity
        # Normalize BM25 score; guard against an all-zero score vector
        bm25_score = bm25_scores[idx] / max_bm25 if max_bm25 > 0 else 0.0

        # Hybrid Score
        hybrid_score = alpha * faiss_score + (1 - alpha) * bm25_score

        results.append((dataset['filename'][doc_idx], chunk_text, hybrid_score))

    # Sort by Hybrid Score
    results = sorted(results, key=lambda x: x[2], reverse=True)[:top_k]
    if not results:
        return "No relevant chunk found."

    # Cross-Encoder Reranking: Ensures the most contextually relevant chunks are ranked higher
    rerank_inputs = [(query, chunk) for _, chunk, _ in results]
    rerank_scores = cross_encoder.predict(rerank_inputs)

    # Attach rerank scores & sort
    results = [(doc, chunk, score) for (doc, chunk, _), score in zip(results, rerank_scores)]
    results = sorted(results, key=lambda x: x[2], reverse=True)

    # Check threshold (the rerank score is a raw cross-encoder score, not bounded to [0, 1])
    most_relevant_chunk = results[0]

    if most_relevant_chunk[2] >= min_similarity:
        print(f"Document: {most_relevant_chunk[0]}")
        print(f"Similarity Score: {most_relevant_chunk[2]:.4f}")
        print(f"Most Relevant Chunk:\n{most_relevant_chunk[1][:500]}...")  # Show first 500 characters

        numbers = re.findall(r"\d{1,3}(?:,\d{3})*\.\d{2}", most_relevant_chunk[1])  # Matches numbers with commas and decimals
        print("Extracted Numbers:", numbers)

        chunk_search = re.search(r"net profit\s*#([\d,]+\.\d{2})", most_relevant_chunk[1], re.IGNORECASE)
        if chunk_search:
            net_profit = chunk_search.group(1)
            print("Net Profit in 2022 (in Crores):", net_profit)
        else:
            print("Net Profit not found")

        # Memory-Augmented Retrieval: cache the answer so a repeated query returns the same value
        memory_store[query] = most_relevant_chunk[1]
        return most_relevant_chunk[1]
    else:
        return "No relevant chunk found."
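The score returned by `cross_encoder.predict` is not confined to [0, 1]; the example run further down prints 4.5482. Assuming that value is a raw logit (an assumption about this model's output head), a sigmoid maps it onto a 0 to 1 scale, which makes a cutoff such as `min_similarity` easier to interpret. A stdlib sketch:

```python
import math

def sigmoid(x):
    # Logistic function: maps any real-valued score into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

raw_score = 4.5482  # rerank score taken from the example run in this notebook
prob = sigmoid(raw_score)
print(f"raw score {raw_score} -> {prob:.4f}")

# A cutoff of 0.5 applied to the raw score corresponds to roughly 0.62 after the sigmoid
cutoff_as_prob = sigmoid(0.5)
print(f"raw cutoff 0.5 -> {cutoff_as_prob:.2f}")
```

If the threshold is meant as a probability-like value, applying it after the sigmoid rather than to the raw score keeps the two scales from being mixed.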



### 5. Guardrail Implementation

In [67]:
# Display the dataset
print(dataset.head())

# Function to validate and filter user queries
def validate_query(query):
    """
    Ensures user query is relevant to financial statements and does not contain harmful content.
    """
    if not re.match(r'^[a-zA-Z0-9 ?!.,]+$', query):
        return False, "Invalid characters detected."
    # Allow only finance-related keywords
    finance_keywords = ["revenue", "profit", "expenses", "debt", "financial", "statement"]
    if not any(keyword in query.lower() for keyword in finance_keywords):
        return False, "Query must be related to financial statements."
    return True, "Valid query."

# Function to retrieve relevant context from the dataset
def retrieve_context(query):
    """
    Retrieves relevant financial information based on user query.
    """
    context = "\n".join(dataset['content'].values)  # Simple retrieval
    return context

# Load a text generation model (RAG can use a fine-tuned model)
generator = pipeline("text-generation", model="gpt2")

# Function to generate a response
def generate_response(query):
    """
    Generates a response using a text generation model.
    """
    context = retrieve_context(query)
    prompt = f"Based on the financial data:\n{context}\nAnswer the question: {query}"
    response = generator(prompt, max_length=150, num_return_sequences=1, max_new_tokens=50)[0]['generated_text']
    return response

# Function to filter misleading responses
def filter_response(response):
    """
    Ensures response does not contain hallucinations or misleading information.
    """
    if "prediction" in response.lower() or "guess" in response.lower():
        return "Response removed due to uncertainty."
    return response

# Main function to process user query
def answer_query(user_query):
    """
    Processes user query with input and output guardrails.
    """
    is_valid, message = validate_query(user_query)
    if not is_valid:
        return f"Invalid query: {message}"
    raw_response = generate_response(user_query)
    final_response = filter_response(raw_response)
    return final_response


                     filename  \
0  SBI_Annual_Report_2022.pdf   
1  SBI_Annual_Report_2023.pdf   

                                             content  \
0  01\nAnnual Report 2021 – 22\nAnnual Report 202...   
1  /gid00019/gid00032/gid00046/gid00043/gid00042/...   

                                   processed_content  \
0  01 annual report 2021 – 22 annual report 2021 ...   
1  draft contents company overview sbi at a glanc...   

                                         text_chunks  \
0  [ 01 annual report 2021 – 22 annual report 202...   
1  [, draft contents company overview sbi at a gl...   

                                          embeddings  
0  [[-0.010079363, -0.0644554, -0.080679506, -0.0...  
1  [[-0.1188385, 0.04829873, -0.0025481575, -0.01...  


Device set to use mps:0


In [68]:
# Example usage
query = "What was the profit in 2022?"
response = retrieve_hybrid_rerank_with_memory(query)
print(response)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=50) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Document: SBI_Annual_Report_2022.pdf
Similarity Score: 4.5482
Most Relevant Chunk:
for and on behalf of the central board of directors chairman date: 13 th may, 2022net profit margin: the net profit has registered yoy growth of 55.19% (from a profit of `20,410 cr in fy21 to net profit of `31,676 cr during fy22) as against yoy growth of only 2.39% in total income (from `3,08,647 cr in fy21 to `3,16,021 cr in fy22)....
Extracted Numbers: ['55.19', '2.39']
Net Profit not found
for and on behalf of the central board of directors chairman date: 13 th may, 2022net profit margin: the net profit has registered yoy growth of 55.19% (from a profit of `20,410 cr in fy21 to net profit of `31,676 cr during fy22) as against yoy growth of only 2.39% in total income (from `3,08,647 cr in fy21 to `3,16,021 cr in fy22).
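The run above reports "Net Profit not found" and extracts only the two percentages, because the amounts in the chunk are written as whole numbers followed by "cr" (some with Indian-style grouping such as 3,08,647) and the extraction patterns require two decimal places. A hedged sketch of patterns that fit this particular chunk; `AMOUNT` and `NET_PROFIT` are hypothetical names, and the backtick is assumed to be the extracted stand-in for the rupee sign.

```python
import re

chunk = ("the net profit has registered yoy growth of 55.19% (from a profit of "
         "`20,410 cr in fy21 to net profit of `31,676 cr during fy22)")

# Digits with comma groups of two or three, optional decimals
NUMBER = r"\d{1,3}(?:,\d{2,3})*(?:\.\d+)?"
# An amount is a number followed by the unit "cr", optionally preceded by the backtick
AMOUNT = re.compile(rf"`?\s*({NUMBER})\s*cr\b", re.IGNORECASE)
NET_PROFIT = re.compile(rf"net profit of\s*`?\s*({NUMBER})\s*cr\b", re.IGNORECASE)

amounts = AMOUNT.findall(chunk)
match = NET_PROFIT.search(chunk)

print("Amounts in crore:", amounts)
print("Net profit:", match.group(1) if match else "not found")
```

These patterns are tuned to the phrasing of this one chunk; other sections of the reports may state figures differently.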


In [69]:
# Example usage to test guardrails
query = "What is the capital of France?"
response = retrieve_hybrid_rerank_with_memory(query)
print(response)

Invalid query: Query must be related to financial statements.


In [70]:
# Example usage to test guardrails
query = "What was the profit in 2022?#"
response = retrieve_hybrid_rerank_with_memory(query)
print(response)

Invalid query: Invalid characters detected.
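The keyword check in `validate_query` is a substring test, so a query containing "nonprofit" would pass because it contains "profit". One possible tightening, sketched under the assumption that whole-word matching is acceptable, uses word boundaries; `has_finance_keyword` is a hypothetical helper.

```python
import re

FINANCE_KEYWORDS = ["revenue", "profit", "expenses", "debt", "financial", "statement"]

# \b anchors each keyword to word boundaries so that "nonprofit" does not match "profit"
KEYWORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, FINANCE_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)

def has_finance_keyword(query):
    return KEYWORD_PATTERN.search(query) is not None

for q in ("What was the profit in 2022?",
          "Tell me about a nonprofit bakery",
          "What is the capital of France?"):
    print(q, "->", has_finance_keyword(q))
```

The trade-off is that inflected forms such as "profits" no longer match unless they are added to the keyword list.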
