## Evaluating the baseline RAG with 20 test cases.

### 1. Baseline RAG
**The code of the naive baseline RAG system is copied here, with a few changes.**

**Conclusions from the naive baseline RAG:**

* The baseline RAG system that we coded initially had a chunk size of 200.
* A chunk size of 200 is very small for legal documents. We were only testing how the system performs with a chunk size of 200 and k = 2 for similarity search.
* We are not moving into the **'experiments for the purpose of improvement'** section yet.
* So let's define a new baseline.

**The new baseline will have:**
1. Chunk size = 500
2. Retrieve top 3 documents.
3. Similarity search
4. Recursive character text splitter
5. Overlap size = 50, to keep continuity between consecutive chunks
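
To make the interaction between chunk size and overlap concrete, here is a minimal sketch of a fixed-width sliding window. It is an illustration only: `sliding_chunks` is a hypothetical helper, and `RecursiveCharacterTextSplitter` splits on separators first, so its real chunk boundaries will differ. With a chunk size of 500 and an overlap of 50, each window in this sketch starts 450 characters after the previous one.

```python
def sliding_chunks(text, chunk_size=500, overlap=50):
    """Split text into fixed-width windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each new chunk starts `step` characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "x" * 1200
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```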




In [1]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_ollama import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
import os
import re

In [2]:
db_dir = 'chroma_dbs/baseline_db_v2'
ch_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
embeddings = OllamaEmbeddings(model='nomic-embed-text:v1.5')

In [3]:
def clean_text(text):
    return re.sub(r'\s+', ' ', text).strip()

def create_vector_db(
        splitter=ch_splitter, embed_func=embeddings, db_dir=db_dir,
        raw_data_path='./raw_data/BNS2023.pdf'):
    """Creates a persistent vector database if it doesn't exist."""
    if os.path.exists(db_dir):
        print("Vector database already exists!")
    else:
        print("Vector database doesn't exist.")
        print("-> Initializing vector db creation...")
        loader = PyPDFLoader(file_path=raw_data_path, mode='page')
        pages = loader.load()
        for page in pages:
            page.page_content = clean_text(page.page_content)
            
        chunks = splitter.split_documents(pages)       
        db = Chroma(
            persist_directory=db_dir,
            embedding_function=embed_func
        )
        print("-> Adding chunks to vector db...")
        db.add_documents(chunks)
        print("-> Chunks have been added to vector db")

In [4]:
create_vector_db() # Create vector database

Vector database already exists!


In [5]:
# The retriever
def retrieve_docs(query, db_dir, embed_func=embeddings):
    """Retrieves chunks and combines them into a single string."""

    info_fetched = ""
    db = Chroma(
        persist_directory=db_dir, embedding_function=embed_func
    )
    docs = db.similarity_search(query=query, k=3)
    for doc in docs:
        info_fetched += doc.page_content

    return info_fetched
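
Because `retrieve_docs` appends each `page_content` directly, the last word of one chunk runs into the first word of the next (as in `aUsing` in the sample output). A minimal sketch of one alternative is to join the chunks with a blank line. `join_chunks` and the `FakeDoc` stand-in are hypothetical names used only for this illustration; in the notebook the items would be the documents returned by `similarity_search`. Switching to this would change the context string, and so the lengths and scores recorded in this notebook.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only page_content is used here."""
    page_content: str


def join_chunks(docs, separator="\n\n"):
    """Combine chunk texts, keeping a visible boundary between chunks."""
    return separator.join(doc.page_content for doc in docs)


docs = [FakeDoc("means a"), FakeDoc("Using as genuine")]
print(join_chunks(docs))
```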

In [6]:
res = retrieve_docs(
    query="What is considered counterfeiting of coins, Government stamps, or currency notes, and what is the punishment?",
    db_dir=db_dir,
    embed_func=embeddings
)
len(res)

1492

In [7]:
res

'coin, Government stamps, currency -notes or bank - notes. 178. Whoever counterfeits, or knowingly performs any part of the process of counterfeiting, any coin, stamp issued by Government for the purpose of revenue, currency -note or bank -note, shall be punished with imprisonment for life, or with imprisonment of either description for a term which may extend to ten years, and shall also be liable to fine. Explanation.—For the purposes of this Chapter,— (1) the expression “bank -note” means aUsing as genuine, forged or counte rfeit coin, Government stamp, currency-notes or bank-notes. 179. Whoever imports or exports, or sells or delivers to, or buys or receives from, any other person, or otherwise traffics or uses as genuine, any forged or counterfeit coin, stamp, currency-note or bank-note, knowing or having reason to believe the same to be forged or counterfeit, shall be punished with imprisonment for life, or with imprisonment of either description for a term which may extend towit

In [8]:
model = ChatOllama(model='gemma3:4b', temperature=0.8)

In [9]:
def augment_generate(query, model):
    context = retrieve_docs(query=query, db_dir=db_dir, embed_func=embeddings)
    prompt_t = ChatPromptTemplate.from_messages(
        messages=[
            ("system", "You are a helpful assistant who uses BNS(Bhartiya Nyaay Sanhita). Use the provided context to answer user query. BNS Context: {context}"),
            ("system", "You can reply within 150 words"),
            ("human", "{query}")
        ]
    )
    chain = prompt_t | model | StrOutputParser()
    res = chain.invoke({'query':query, 'context':context})
    return res

In [10]:
print("\nGenerating...\n")
res = augment_generate(
    query='What is considered counterfeiting of coins, Government stamps, or currency notes, and what is the punishment?',
    model=model
)

print(res)


Generating...

According to the Bharatiya Nyaya Sanhita (BNS), counterfeiting encompasses several actions related to coins, Government stamps, and currency notes (including bank notes).

**Counterfeiting Defined:**

*   **Coin:** Counterfeiting a coin means creating a fake coin or performing any part of the process to do so.
*   **Government Stamps/Currency Notes/Bank Notes:** This includes forging, creating fake versions, or knowingly participating in the process of creating a fake version of any Government stamp, currency note, or bank note.

**Punishment:**

The punishment varies depending on the offense:

*   **178 (Counterfeiting):** Imprisonment for life or up to 10 years with a fine.
*   **179 (Trading in Forged Items):** Imprisonment for life or up to 7 years with a fine, or both.
*   **181 (Making/Mending Instruments):** This section deals with creating or repairing tools used for forgery, and the punishment applies if this is done with the intent to forge.

**Important Note:

### 2. Evaluation of Baseline RAG.

To fix deepeval's asynchronous time-limit-exceeded (TLE) errors, we switched to a synchronous configuration. It still required batching because of TLE. These errors are not caused by deepeval; they come from the slow speed of local LLMs.

With this setup, we can now evaluate much more reliably.

**But it's very slow.**

In [11]:
from deepeval.models import OllamaModel
from deepeval.metrics import AnswerRelevancyMetric, ContextualRelevancyMetric
from deepeval.metrics import FaithfulnessMetric, ContextualPrecisionMetric, ContextualRecallMetric
from deepeval.test_case import LLMTestCase
from deepeval import evaluate
from deepeval.evaluate import AsyncConfig
from evaluation_dataset import questions, expected_answers

In [12]:
# Setting AsyncConfig to false
# Running synchronously
async_config = AsyncConfig(run_async=False)

In [13]:
answers = []
contexts = []
for ind, que in enumerate(questions):
    print(f"Gathering context and generating answer for question - {ind + 1}")
    cont = retrieve_docs(query=que, db_dir=db_dir, embed_func=embeddings)
    ans = augment_generate(query=que, model=model)
    contexts.append(cont)
    answers.append(ans)

Gathering context and generating answer for question - 1
Gathering context and generating answer for question - 2
Gathering context and generating answer for question - 3
Gathering context and generating answer for question - 4
Gathering context and generating answer for question - 5
Gathering context and generating answer for question - 6
Gathering context and generating answer for question - 7
Gathering context and generating answer for question - 8
Gathering context and generating answer for question - 9
Gathering context and generating answer for question - 10
Gathering context and generating answer for question - 11
Gathering context and generating answer for question - 12
Gathering context and generating answer for question - 13
Gathering context and generating answer for question - 14
Gathering context and generating answer for question - 15
Gathering context and generating answer for question - 16
Gathering context and generating answer for question - 17
Gathering context and g

In [14]:
test_cases = []
for i in range(len(questions)):
    test_case = LLMTestCase(
        input=questions[i],
        actual_output=answers[i],
        retrieval_context=[contexts[i]],
        expected_output=expected_answers[i]
    )
    test_cases.append(test_case)

print(f"Number of test cases:  {len(test_cases)}")

Number of test cases:  20


In [15]:
judge_model = OllamaModel(
    model='mistral',
    base_url='http://localhost:11434',
    temperature=0
)

In [16]:
answer_relevancy = AnswerRelevancyMetric(threshold=0.5, model=judge_model)
contextual_relevancy = ContextualRelevancyMetric(threshold=0.5, model=judge_model)
contextual_precision = ContextualPrecisionMetric(threshold=0.5, model=judge_model)
contextual_recall = ContextualRecallMetric(threshold=0.5, model=judge_model)
faithfulness = FaithfulnessMetric(threshold=0.5, model=judge_model)

In [24]:
# Batches
batch1 = test_cases[0:4]
batch2 = test_cases[4:8]
batch3 = test_cases[8:12]
batch4 = test_cases[12:16]
batch5 = test_cases[16:20]
print(f"Length of usual batches:  {len(batch5)}")

# Smaller batches in case deepeval times out (in case the LLM is taking too much time)
# Use smaller batches if the timeout error is not resolved with the usual batches
sm_batch1 = test_cases[0:2]
sm_batch2 = test_cases[2:4]
sm_batch3 = test_cases[4:6]
sm_batch4 = test_cases[6:8]
sm_batch5 = test_cases[8:10]
sm_batch6 = test_cases[10:12]
sm_batch7 = test_cases[12:14]
sm_batch8 = test_cases[14:16]
sm_batch9 = test_cases[16:18]
sm_batch10 = test_cases[18:20]
print(f"Length of small batches: {len(sm_batch10)}")

Length of usual batches:  4
Length of small batches: 2
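
The hand-written slices can also be produced by a small helper. This is only a sketch, with `make_batches` as a hypothetical name; it accepts any list, so a list of integers stands in for `test_cases` here.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


cases = list(range(20))                 # stand-in for the 20 test cases
batches = make_batches(cases, 4)        # same grouping as batch1 .. batch5
small_batches = make_batches(cases, 2)  # same grouping as sm_batch1 .. sm_batch10
print(len(batches), len(small_batches))
```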


In [None]:
# Execute these function calls in separate cells
# The outputs of these function calls have been cleared for code readability
evaluate(test_cases=batch1, metrics=[answer_relevancy], async_config=async_config)
evaluate(test_cases=batch2, metrics=[answer_relevancy], async_config=async_config)
evaluate(test_cases=batch3, metrics=[answer_relevancy], async_config=async_config)
evaluate(test_cases=batch4, metrics=[answer_relevancy], async_config=async_config)
evaluate(test_cases=batch5, metrics=[answer_relevancy], async_config=async_config)

evaluate(test_cases=batch1, metrics=[contextual_relevancy], async_config=async_config)
evaluate(test_cases=batch2, metrics=[contextual_relevancy], async_config=async_config)
evaluate(test_cases=batch3, metrics=[contextual_relevancy], async_config=async_config)
evaluate(test_cases=batch4, metrics=[contextual_relevancy], async_config=async_config)
evaluate(test_cases=batch5, metrics=[contextual_relevancy], async_config=async_config)

evaluate(test_cases=batch1, metrics=[contextual_precision], async_config=async_config)
evaluate(test_cases=batch2, metrics=[contextual_precision], async_config=async_config)
evaluate(test_cases=batch3, metrics=[contextual_precision], async_config=async_config)
evaluate(test_cases=batch4, metrics=[contextual_precision], async_config=async_config)
evaluate(test_cases=batch5, metrics=[contextual_precision], async_config=async_config)

evaluate(test_cases=batch1, metrics=[contextual_recall], async_config=async_config)
evaluate(test_cases=batch2, metrics=[contextual_recall], async_config=async_config)
evaluate(test_cases=batch3, metrics=[contextual_recall], async_config=async_config)
evaluate(test_cases=batch4, metrics=[contextual_recall], async_config=async_config)

# Only batch 5 was giving TLE
# So let's batch the batch 5 further
batch_5_b1 = batch5[:2]
batch_5_b2 = batch5[2:]
evaluate(test_cases=batch_5_b1, metrics=[contextual_recall], async_config=async_config)
evaluate(test_cases=batch_5_b2, metrics=[contextual_recall], async_config=async_config)

evaluate(test_cases=batch1, metrics=[faithfulness], async_config=async_config)
evaluate(test_cases=batch2, metrics=[faithfulness], async_config=async_config)
evaluate(test_cases=batch3, metrics=[faithfulness], async_config=async_config)
evaluate(test_cases=batch4, metrics=[faithfulness], async_config=async_config)
evaluate(test_cases=batch5, metrics=[faithfulness], async_config=async_config)
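
The manual split of batch 5 can be generalized: run a batch, and if it times out, halve it and retry each half. The sketch below assumes the timeout surfaces as a Python `TimeoutError`, which may not match the exception deepeval actually raises; `run_with_split` and `run_batch` are hypothetical names. In the notebook, `run_batch` would wrap a call such as `evaluate(test_cases=batch, metrics=[contextual_recall], async_config=async_config)`.

```python
def run_with_split(batch, run_batch):
    """Run `run_batch` on a batch; on timeout, halve the batch and retry.

    Returns the list of sub-batches that completed, in order.
    Assumption: a timeout is raised as TimeoutError.
    """
    try:
        run_batch(batch)
        return [batch]
    except TimeoutError:
        if len(batch) <= 1:
            raise  # a single test case still timed out; nothing left to split
        mid = len(batch) // 2
        return run_with_split(batch[:mid], run_batch) + run_with_split(batch[mid:], run_batch)


def fake_run(batch):
    """Toy evaluator that times out on more than two cases."""
    if len(batch) > 2:
        raise TimeoutError("batch too large for the local judge model")


print(run_with_split([16, 17, 18, 19], fake_run))
```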

### 3. Logging the results and conclusions.

Note: Each score is a percentage: the passing rate of the test cases against the metric's threshold (0.5 here).

**Compared with the earlier baseline (chunk size 200, k = 2), there is some improvement with the bigger chunk size and bigger k value.**
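
As a minimal sketch of how a passing rate like the ones mentioned in the note above can be derived, the function below counts the scores that meet the threshold and converts the count to a percentage. It assumes a case passes when its score is at least the threshold. `pass_rate` is a hypothetical helper, and the scores are made-up illustrative values, not results from this evaluation.

```python
def pass_rate(scores, threshold=0.5):
    """Return the percentage of scores that are at or above the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    passed = sum(1 for score in scores if score >= threshold)
    return 100.0 * passed / len(scores)


# Illustrative values only: 17 of 20 cases meet the 0.5 threshold.
example_scores = [0.9] * 17 + [0.3] * 3
print(pass_rate(example_scores))
```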

In [59]:
from helpers import Utility

In [60]:
Utility.log_experiment(
    id='basline-RAG',
    path='../logs/log.json',
    faithfulness=85.0,
    contextual_relevance=70.0,
    answer_relevance=100.0,
    contextual_precision=90.0,
    contextual_recall=95.0,
    commit_message="Evaluation of baseline RAG with 20 test cases",
    description='chunk-size:500, chunk-overlap:50, splitter:recursive char text, search-type:similarity with k=3, reranker:false, metadata-filtering:false, test-samples:20, rag-llm:Ollama-gemma3:4b, judge-llm:Ollama-mistral-7b, eval_tool:deep-eval'
)

Added succesfully!
