## Baseline RAG

**AIM:** Build a simple baseline RAG system and evaluate it to obtain an initial benchmark for later improvements.

In [1]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_ollama import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
import os
import re

### 1. Vector db

In [2]:
db_dir = 'chroma_dbs/baseline_db'
ch_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
embeddings = OllamaEmbeddings(model='nomic-embed-text:v1.5')

In [3]:
def clean_text(text):
    return re.sub(r'\s+', ' ', text).strip()

def create_vector_db(
        splitter, embed_func, db_dir=db_dir,
        raw_data_path='./raw_data/BNS2023.pdf'):
    """Creates a persistent vector database if it doesn't exist."""
    if os.path.exists(db_dir):
        print("Vector database already exists!")
    else:
        print("Vector database doesn't exist.")
        print("-> Initializing vector db creation...")
        # Load the PDF as one Document per page and normalize whitespace.
        loader = PyPDFLoader(file_path=raw_data_path, mode='page')
        pages = loader.load()
        for page in pages:
            page.page_content = clean_text(page.page_content)

        # Split the pages into chunks and embed them into a persistent Chroma db.
        chunks = splitter.split_documents(pages)
        db = Chroma(
            persist_directory=db_dir,
            embedding_function=embed_func
        )
        print("-> Adding chunks to vector db...")
        db.add_documents(chunks)
        print("-> Chunks have been added to vector db")

In [4]:
create_vector_db(
    splitter=ch_splitter,
    embed_func=embeddings
)

Vector database already exists!


### 2. Retrieval

In [5]:
def retrieve_docs(query, db_dir, embed_func=embeddings):
    """Retrieves the top-2 chunks and combines them into a single string."""
    db = Chroma(
        persist_directory=db_dir, embedding_function=embed_func
    )
    docs = db.similarity_search(query=query, k=2)
    # Concatenate the chunk texts without a separator.
    return "".join(doc.page_content for doc in docs)

In [6]:
res = retrieve_docs(
    query="My neighbor hit me real hard for no strong reason",
    db_dir=db_dir,
    embed_func=embeddings
)
len(res)

393
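
`retrieve_docs` appends each chunk's text directly to the previous one, so the last word of one chunk runs into the first word of the next. A minimal sketch of an alternative, assuming a blank-line separator is acceptable in the prompt; `join_chunks` is a hypothetical helper, not part of this notebook or of LangChain, and the chunk texts are made up:

```python
def join_chunks(chunks, sep="\n\n"):
    """Join chunk texts with a separator, skipping empty chunks."""
    return sep.join(text.strip() for text in chunks if text and text.strip())


# Two made-up chunk texts standing in for `doc.page_content` values.
chunks = ["Whoever voluntarily causes hurt", "shall be punished with imprisonment"]
combined = join_chunks(chunks)
print(combined)
```

With a separator in place, the combined length would no longer match the 393 characters reported above, since the separator adds characters of its own.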

### 3. Augmentation and Generation

In [7]:
model = ChatOllama(model='gemma3:4b', temperature=0.8)

In [8]:
def augment_generate(query, model):
    context = retrieve_docs(query=query, db_dir=db_dir, embed_func=embeddings)
    prompt_t = ChatPromptTemplate.from_messages(
        messages=[
            ("system", "You are a helpful assistant who uses BNS(Bhartiya Nyaay Sanhita). Use the provided context to answer user query. BNS Context: {context}"),
            ("system", "You can reply within 150 words"),
            ("human", "{query}")
        ]
    )
    chain = prompt_t | model | StrOutputParser()
    res = chain.invoke({'query':query, 'context':context})
    return res

In [9]:
print("\nGenerating...\n")
answer = augment_generate(
    query='What is the punishment for voluntarily hurting someone!',
    model=model
)
print(answer)


Generating...

According to the Bharatiya Nyay Sanhita (BNS), the punishment for voluntarily causing hurt to another person (let's call them Z) depends on the circumstances. 

If the hurt is caused by a blow that isn’t part of the act where A initially hurt Z, then A is liable for a single punishment for that act of hurting Z. 

However, if the hurt is grievous (as in the case of B), B is liable for punishment for both the grievous hurt and the initial hurt to Z. 

Furthermore, if A knew B was likely to resist and cause grievous hurt during the act, A will also be held liable for the grievous hurt.


### 4. Evaluating baseline.

**Preparing a small dataset and evaluating the baseline RAG.** The **deepeval** package is used for the evaluation.

In [None]:
from deepeval.models import OllamaModel
from deepeval.metrics import AnswerRelevancyMetric, ContextualRelevancyMetric
from deepeval.metrics import FaithfulnessMetric, ContextualPrecisionMetric, ContextualRecallMetric
from deepeval.test_case import LLMTestCase
from deepeval import evaluate

In [38]:
questions =  [
    'What consequence does a person face for printing or publishing the identity of a victim under Section 72?',
    'What conditions must be present for a woman’s death to be classified as a dowry death?',
    'What offence is committed when a man deceives a woman into believing she is lawfully married to him?',
    'What happens under the law if a person causes a miscarriage without the consent of the woman?',
    'What is the punishment for a person who, already serving a life sentence, commits murder?',
    'What are the punishments for voluntarily causing hurt or grievous hurt using dangerous weapons or substances?',
    'What are the punishments for voluntarily causing grievous hurt by using acid or other harmful means?',
    'What is the difference between wrongful restraint and wrongful confinement, and what is the punishment for wrongful restraint?',
    'What does the law mean by “fabricating false evidence”?',
    'What punishment is prescribed for threatening a person to give false evidence under the law?'
]

expected_answers = [
    'The person may be punished with imprisonment for a term that can extend to two years and may also be required to pay a fine for revealing the victim’s identity.',
    'A woman’s death is treated as a dowry death if it occurs within seven years of marriage under unnatural circumstances and it is shown that she was subjected to cruelty or harassment related to dowry demands by her husband or his relatives shortly before her death.',
    'It is an offence where a man, through deceit, makes a woman believe she is legally married to him and cohabits or has sexual relations with her based on that false belief, which is punishable under the law.',
    'Causing a miscarriage without the woman’s consent is considered a very serious criminal offence. The offender can be punished with imprisonment for life or with a long term of imprisonment that may extend up to ten years, and the court may also impose a fine, regardless of whether the woman was quick with child or not.',
    'If a person who is already serving a sentence of life imprisonment commits murder, the law considers this extremely serious. Such a person may be punished either with death or with imprisonment for the remainder of their natural life, meaning they would remain in prison for the rest of their life. This reflects the severity of committing murder while already convicted of a grave crime.',
    'If a person voluntarily causes hurt using dangerous weapons, fire, poison, explosives, harmful substances, or animals, they can be punished with imprisonment of up to three years, or a fine of up to twenty thousand rupees, or both. If the act causes grievous hurt by the same means, the punishment is more severe: the offender may face life imprisonment or imprisonment ranging from one to ten years, along with a fine. The law treats the use of dangerous means very seriously due to the high risk of death or serious injury.',
    'If a person causes serious injury, burns, maims, disfigures, or disables another using acid or other harmful means, they can be punished with imprisonment of ten years to life and a fine payable to the victim. If the act is attempted but not fully carried out, the punishment is five to seven years imprisonment along with a fine.',
    'Wrongful restraint occurs when a person voluntarily obstructs someone from moving in a direction they have a right to go, whereas wrongful confinement happens when a person is prevented from moving beyond certain limits. The punishment for wrongful restraint is simple imprisonment of up to one month, or a fine of up to five thousand rupees, or both.',
    'Fabricating false evidence occurs when a person creates a false circumstance, makes a false entry in any book or record, or produces a document or electronic record containing false information, with the intention that it may be used as evidence in a judicial proceeding, before a public servant, or an arbitrator. The goal is to cause someone forming an opinion on that evidence to reach a wrong or misleading conclusion that affects the outcome of the proceeding.',
    'If a person threatens another with injury to their person, reputation, or property in order to make them give false evidence, the offence is punishable with imprisonment of up to seven years, or with a fine, or with both. If such false evidence results in the conviction and severe punishment of an innocent person, the offender faces the same punishment that was imposed on the innocent person.'
]

In [40]:
questions

['What consequence does a person face for printing or publishing the identity of a victim under Section 72?',
 'What conditions must be present for a woman’s death to be classified as a dowry death?',
 'What offence is committed when a man deceives a woman into believing she is lawfully married to him?',
 'What happens under the law if a person causes a miscarriage without the consent of the woman?',
 'What is the punishment for a person who, already serving a life sentence, commits murder?',
 'What are the punishments for voluntarily causing hurt or grievous hurt using dangerous weapons or substances?',
 'What are the punishments for voluntarily causing grievous hurt by using acid or other harmful means?',
 'What is the difference between wrongful restraint and wrongful confinement, and what is the punishment for wrongful restraint?',
 'What does the law mean by “fabricating false evidence”?',
 'What punishment is prescribed for threatening a person to give false evidence under the la

In [41]:
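answers = []
contexts = []
for ques in questions:
    # Generate the answer; augment_generate retrieves its own context internally.
    ans = augment_generate(query=ques, model=model)
    # Retrieve again with the same query so the context can be stored for the test case.
    cont = retrieve_docs(query=ques, db_dir=db_dir, embed_func=embeddings)
    answers.append(ans)
    contexts.append(cont)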

In [44]:
judge_model = OllamaModel(
    model='mistral',
    base_url='http://localhost:11434',
    temperature=0
)

In [46]:
answer_relevancy = AnswerRelevancyMetric(threshold=0.5, model=judge_model)
contextual_relevancy = ContextualRelevancyMetric(threshold=0.5, model=judge_model)
contextual_precision = ContextualPrecisionMetric(threshold=0.5, model=judge_model)
contextual_recall = ContextualRecallMetric(threshold=0.5, model=judge_model)
faithfulness = FaithfulnessMetric(threshold=0.5, model=judge_model)

In [47]:
metrics_set1 = [answer_relevancy, contextual_relevancy]
metrics_set2 = [contextual_precision, contextual_recall]
metrics_set3 = [faithfulness]
test_cases = []

for i in range(len(questions)):
    test_case = LLMTestCase(
        input=questions[i],
        actual_output=answers[i],
        retrieval_context=[contexts[i]],
        expected_output=expected_answers[i]
    )
    test_cases.append(test_case)

print(f"Number of test cases:  {len(test_cases)}")

Number of test cases:  10


In [None]:
# Split the test cases into batches of two to avoid timeout errors
batch1 = test_cases[:2]
batch2 = test_cases[2:4]
batch3 = test_cases[4:6]
batch4 = test_cases[6:8]
batch5 = test_cases[8:10]
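
The five slices above can also be produced with a small helper, which avoids editing the slice indices by hand when the dataset grows. A minimal sketch, assuming a fixed batch size of 2 as in the cell above; `make_batches` is a hypothetical name, not part of deepeval:

```python
def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Stand-in for the 10 test cases; any list behaves the same way.
dummy_cases = list(range(10))
batches = make_batches(dummy_cases, batch_size=2)
print(len(batches))
```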

**README:**

The outputs of some of the cells below have been cleared because they are very long and would bloat the notebook.

The evaluation metrics themselves have still been recorded.

**Why evaluate in batches?**  
Evaluating all test cases at once produced TLE (timeout) errors, possibly because of async mode.  
This evaluation was for testing purposes only: with just 10 samples in the test dataset, the results cannot be considered reliable. Even so, it gave some useful insights into how the baseline RAG is performing.

**Important note:**  
All the `evaluate` calls are shown in a single cell for readability. During the actual evaluation they were run in separate cells so that each result could be inspected on its own.

Because the evaluation was batched, the results of the individual `evaluate` calls were aggregated. The aggregated results are logged using the utility functions from the `helpers` package.
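
The notebook does not show how the per-batch results were combined. One plausible way is sketched below, assuming the aggregate for a metric is its pass rate: the percentage of test cases, across all batches, whose score meets the 0.5 threshold. `aggregate_pass_rate` is a hypothetical helper and the scores are made up for illustration:

```python
def aggregate_pass_rate(batch_scores, threshold=0.5):
    """Return the percentage of scores across all batches that meet the threshold."""
    all_scores = [score for batch in batch_scores for score in batch]
    if not all_scores:
        raise ValueError("no scores to aggregate")
    passed = sum(1 for score in all_scores if score >= threshold)
    return 100.0 * passed / len(all_scores)


# Made-up scores for one metric: 5 batches of 2 test cases each.
example_scores = [[0.9, 0.4], [0.7, 0.6], [0.3, 0.8], [1.0, 0.5], [0.2, 0.9]]
print(aggregate_pass_rate(example_scores))
```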

In [None]:
evaluate(test_cases=batch1, metrics=metrics_set1)
evaluate(test_cases=batch2, metrics=metrics_set1)
evaluate(test_cases=batch3, metrics=metrics_set1)
evaluate(test_cases=batch4, metrics=metrics_set1)
evaluate(test_cases=batch5, metrics=metrics_set1)

evaluate(test_cases=batch1, metrics=metrics_set2)
evaluate(test_cases=batch2, metrics=metrics_set2)
evaluate(test_cases=batch3, metrics=metrics_set2)
evaluate(test_cases=batch4, metrics=metrics_set2)
evaluate(test_cases=batch5, metrics=metrics_set2)

evaluate(test_cases=batch1, metrics=metrics_set3)
evaluate(test_cases=batch2, metrics=metrics_set3)
evaluate(test_cases=batch3, metrics=metrics_set3)
evaluate(test_cases=batch4, metrics=metrics_set3)
evaluate(test_cases=batch5, metrics=metrics_set3)

### 5. Logging the test results.

**Next AIM:**  
Create a dataset with 20 samples for a more reliable evaluation of this baseline RAG system. We will create a dataset package so that the same dataset can be reused across experiments.

**We will also try** running the evaluation in synchronous mode to see whether that avoids the TLE errors, using the entire dataset at once without batching. If that doesn't work, we will go back to batch evaluation.
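
The "try everything at once, fall back to batches" plan can be sketched as follows. This is only an illustration: it assumes a timeout surfaces as Python's built-in `TimeoutError` (deepeval may raise a different exception type), and `evaluate_with_fallback` and `fake_eval` are hypothetical names. How synchronous mode is switched on depends on the deepeval version and is not shown here.

```python
def evaluate_with_fallback(test_cases, run_eval, batch_size=2):
    """Try one full run; if it times out, rerun in batches.

    `run_eval` is any callable that takes a list of test cases and returns
    a list of results, e.g. a thin wrapper around an evaluation function.
    """
    try:
        return run_eval(test_cases)
    except TimeoutError:
        results = []
        for start in range(0, len(test_cases), batch_size):
            results.extend(run_eval(test_cases[start:start + batch_size]))
        return results


def fake_eval(cases):
    # Stand-in evaluator that times out on more than 2 cases at once.
    if len(cases) > 2:
        raise TimeoutError("too many cases at once")
    return [f"ok:{case}" for case in cases]


results = evaluate_with_fallback(["q1", "q2", "q3"], fake_eval)
print(results)
```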

In [1]:
from helpers import Utility

In [4]:
Utility.initialize_json_file(path='../logs/log.json')
Utility.log_experiment(
    id='basline-eval-1',
    path='../logs/log.json',
    answer_relevance=100.0,
    contextual_relevance=60.0,
    contextual_precision=100.0,
    contextual_recall=80.0,
    faithfulness=80.0,
    description='chunk-size:200, chunk-overlap:20, splitter:recursive char text, search-type:similarity with k=2, reranker:false, metadata-filtering:false, test-samples:10, rag-llm:Ollama-gemma3:4b, judge-llm:Ollama-mistral-7b',
    commit_message='The first evaluation - to see how it performs with dataset having only 10 samples'
)

Succesfully initialized/emptied the json file at path - ../logs/log.json
Added succesfully!
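
The implementation of `helpers.Utility` is not shown in this notebook. A minimal sketch of how such a logger could work, assuming the log file holds a JSON list of experiment records; `init_log` and `append_record` are hypothetical names, and the real helpers may behave differently:

```python
import json
import os
import tempfile


def init_log(path):
    """Create (or empty) the log file so that it holds an empty JSON list."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump([], f)


def append_record(path, **fields):
    """Append one experiment record to the JSON list stored at `path`."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    records.append(fields)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    return records


# Demo against a temporary file so that no real log is touched.
log_path = os.path.join(tempfile.mkdtemp(), "log.json")
init_log(log_path)
logged = append_record(log_path, id="demo-run", faithfulness=80.0)
print(logged)
```

Note that in this sketch re-initializing the file discards all earlier records, so initialization should run only once per log.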
