<a href="https://colab.research.google.com/github/bks1984/CRAFT-pytorch/blob/master/medical_assistant_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Problem Statement

### Business Context

The healthcare industry is rapidly evolving, with professionals facing increasing challenges in managing vast volumes of medical data while delivering accurate and timely diagnoses. The need for quick access to comprehensive, reliable, and up-to-date medical knowledge is critical for improving patient outcomes and ensuring informed decision-making in a fast-paced environment.

Healthcare professionals often encounter information overload, struggling to sift through extensive research and data to create accurate diagnoses and treatment plans. This challenge is amplified by the need for efficiency, particularly in emergencies, where time-sensitive decisions are vital. Furthermore, access to trusted, current medical information from renowned manuals and research papers is essential for maintaining high standards of care.

To address these challenges, healthcare centers can focus on integrating systems that streamline access to medical knowledge, provide tools to support quick decision-making, and enhance efficiency. Leveraging centralized knowledge platforms and ensuring healthcare providers have continuous access to reliable resources can significantly improve patient care and operational effectiveness.

**Common Questions to Answer**

1. **Critical Care Protocols:** "What is the protocol for managing sepsis in a critical care unit?"

2. **General Surgery:** "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"

3. **Dermatology:** "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"

4. **Neurology:** "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"


### Objective

As an AI specialist, your task is to develop a RAG-based AI solution using renowned medical manuals to address healthcare challenges. The objective is to **understand** issues like information overload, **apply** AI techniques to streamline decision-making, **analyze** its impact on diagnostics and patient outcomes, **evaluate** its potential to standardize care practices, and **create** a functional prototype demonstrating its feasibility and effectiveness.

### Data Description

The **Merck Manuals** are medical references published by the American pharmaceutical company Merck & Co., that cover a wide range of medical topics, including disorders, tests, diagnoses, and drugs. The manuals have been published since 1899, when Merck & Co. was still a subsidiary of the German company Merck.

The manual is provided as a PDF with over 4,000 pages divided into 23 sections.

## Installing and Importing Necessary Libraries and Dependencies

In [3]:
# Install required libraries
!pip install -q langchain_community==0.3.27 \
              langchain==0.3.27 \
              chromadb==1.0.15 \
              pymupdf==1.26.3 \
              tiktoken==0.9.0 \
              datasets==4.0.0 \
              evaluate==0.4.5 \
              langchain_openai==0.3.30

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/449.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━[0m [32m225.3/449.8 kB[0m [31m7.0 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m449.8/449.8 kB[0m [31m7.1 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-chroma 1.0.0 requires chromadb<2.0.0,>=1.0.20, but you have chromadb 1.0.15 which is incompatible.
langchain-chroma 1.0.0 requires langchain-core<2.0.0,>=1.0.0, but you have langchain-core 0.3.79 which is incompatible.[0m[31m
[0m

**Note**:
- After running the above cell, kindly restart the runtime (for Google Colab) or notebook kernel (for Jupyter Notebook), and run all cells sequentially from the next cell.
- On executing the above line of code, you might see a warning regarding package dependencies. This error message can be ignored as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in ***this notebook***.

In [4]:
# Import core libraries
import os                                                                       # Interact with the operating system (e.g., set environment variables)
import json                                                                     # Read/write JSON data
import requests                                                                 # Make HTTP requests (e.g., API calls); ignore type checker

# Import libraries for working with PDFs and OpenAI
from langchain.document_loaders import PyMuPDFLoader                            # Load and extract text from PDF files
# from langchain_community.document_loaders import PyPDFLoader                    # Load and extract text from PDF files
from openai import OpenAI                                                       # Access OpenAI's models and services

# Import libraries for processing dataframes and text
import tiktoken                                                                 # Tokenizer used for counting and splitting text for models
import pandas as pd                                                             # Load, manipulate, and analyze tabular data

# Import LangChain components for data loading, chunking, embedding, and vector DBs
from langchain.text_splitter import RecursiveCharacterTextSplitter              # Break text into overlapping chunks for processing
from langchain.embeddings.openai import OpenAIEmbeddings                        # Create vector embeddings using OpenAI's models  # type: ignore
from langchain.vectorstores import Chroma                                       # Store and search vector embeddings using Chroma DB  # type: ignore

from datasets import Dataset                                                    # Used to structure the input (questions, answers, contexts etc.) in tabular format
from langchain_openai import ChatOpenAI                                         # This is needed since LLM is used in metric computation

## Question Answering using LLM

### OpenAI API Calling and Downloading and Loading the model



In [5]:
# Load the JSON file and extract values
file_name = "config.json"                                                       # Name of the configuration file
with open(file_name, 'r') as file:                                              # Open the config file in read mode
    config = json.load(file)                                                    # Load the JSON content as a dictionary
    OPENAI_API_KEY = config.get("OPENAI_API_KEY")                               # Extract the API key from the config
    OPENAI_API_BASE = config.get("OPENAI_API_BASE")                             # Extract the OpenAI base URL from the config

# Store API credentials in environment variables
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY                                   # Set API key as environment variable
os.environ["OPENAI_BASE_URL"] = OPENAI_API_BASE                                 # Set API base URL as environment variable

# Initialize OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY, base_url=OPENAI_API_BASE)               # Create an instance of the OpenAI client

In [6]:
# Define a function to get a response
def ask_llm(user_prompt, max_tokens=512, temperature=0, top_p=0.95):          # Complete the code to set default paramenters
    # Create a chat completion using the OpenAI client
    completion = client.chat.completions.create(
        model="gpt-4o-mini",                                                    # Complete the code by specifying the model to be used.
        messages=[
            {"role": "user", "content": user_prompt}                            # User prompt is the input/query to respond to
        ],
        max_tokens=max_tokens,                                                  # Max number of tokens to generate in the response
        temperature=temperature,                                                # Controls randomness in output
        top_p=top_p                                                             # Controls diversity via nucleus sampling
    )
    return completion.choices[0].message.content                                # Return the text content from the model's reply

### Question 1: What is the protocol for managing sepsis in a critical care unit?

In [7]:
question_1 = "What is the protocol for managing sepsis in a critical care unit?"
base_prompt_response_1=ask_llm(question_1)
base_prompt_response_1

"Managing sepsis in a critical care unit involves a systematic approach that includes early recognition, prompt intervention, and ongoing monitoring. The following is a general protocol based on current guidelines, such as those from the Surviving Sepsis Campaign:\n\n### 1. **Early Recognition**\n   - **Identify Symptoms**: Look for signs of infection (fever, chills, tachycardia, tachypnea) and organ dysfunction (altered mental status, hypotension, oliguria).\n   - **Use Screening Tools**: Utilize tools like the qSOFA (quick Sequential Organ Failure Assessment) or SIRS (Systemic Inflammatory Response Syndrome) criteria to identify patients at risk.\n\n### 2. **Initial Assessment**\n   - **Obtain Vital Signs**: Monitor blood pressure, heart rate, respiratory rate, and temperature.\n   - **Assess Organ Function**: Evaluate renal function (urine output, creatinine), liver function (bilirubin, liver enzymes), and coagulation status (platelets, INR).\n\n### 3. **Immediate Interventions**\n 

### Question 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [8]:
question_2 = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?" #Complete the code to define the question #2
base_prompt_response_2=ask_llm(question_2) #Complete the code to pass the user input
base_prompt_response_2

"Common symptoms of appendicitis include:\n\n1. **Abdominal Pain**: Typically starts around the navel and then moves to the lower right abdomen.\n2. **Loss of Appetite**: A sudden decrease in appetite is common.\n3. **Nausea and Vomiting**: Often follows the onset of abdominal pain.\n4. **Fever**: A low-grade fever may develop.\n5. **Constipation or Diarrhea**: Changes in bowel habits can occur.\n6. **Abdominal Swelling**: In some cases, the abdomen may become swollen.\n\nAppendicitis cannot be effectively treated with medication alone. The standard treatment is surgical removal of the appendix, known as an **appendectomy**. This can be performed using two main techniques:\n\n1. **Open Appendectomy**: A larger incision is made in the lower right abdomen to remove the appendix.\n2. **Laparoscopic Appendectomy**: This is a minimally invasive procedure where several small incisions are made, and the appendix is removed with the aid of a camera and special instruments.\n\nLaparoscopic appe

### Question 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [9]:
question_3 = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?" #Complete the code to define the question #3
base_prompt_response_3=ask_llm(question_3) #Complete the code to pass the user input
base_prompt_response_3

'Sudden patchy hair loss, often referred to as alopecia areata, can manifest as localized bald spots on the scalp or other areas of the body. Here are some effective treatments and solutions, as well as potential causes behind this condition:\n\n### Possible Causes:\n1. **Autoimmune Disorders**: The immune system mistakenly attacks hair follicles, leading to hair loss.\n2. **Genetics**: A family history of alopecia or other autoimmune conditions can increase the risk.\n3. **Stress**: Physical or emotional stress can trigger hair loss in some individuals.\n4. **Hormonal Changes**: Changes in hormones, such as those occurring during pregnancy or menopause, can contribute.\n5. **Nutritional Deficiencies**: Lack of essential nutrients, such as iron, zinc, or vitamins, can affect hair health.\n6. **Infections**: Fungal infections like tinea capitis can cause patchy hair loss.\n7. **Other Medical Conditions**: Conditions like thyroid disease or vitiligo can also lead to hair loss.\n\n### Eff

### Question 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [10]:
question_4 = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?" #Complete the code to define the question #4
base_prompt_response_4=ask_llm(question_4) #Complete the code to pass the user input
base_prompt_response_4

"The treatment for a person who has sustained a physical injury to brain tissue, such as a traumatic brain injury (TBI), can vary widely depending on the severity of the injury, the specific areas of the brain affected, and the resulting impairments. Here are some common approaches to treatment:\n\n1. **Emergency Care**: \n   - Immediate medical attention is crucial. This may involve stabilizing the patient, monitoring vital signs, and performing imaging studies (like CT or MRI scans) to assess the extent of the injury.\n\n2. **Surgical Interventions**: \n   - In some cases, surgery may be necessary to relieve pressure on the brain, remove blood clots (hematomas), or repair skull fractures.\n\n3. **Medication**: \n   - Medications may be prescribed to manage symptoms such as pain, seizures, or inflammation. Corticosteroids may be used to reduce swelling in the brain.\n\n4. **Rehabilitation**: \n   - **Physical Therapy**: To improve mobility and strength.\n   - **Occupational Therapy**:

### Storing the generated outputs from the base prompt


In [11]:
# Create the DataFrame
result_df = pd.DataFrame({
    "questions": [question_1, question_2, question_3, question_4],
    "base_prompt_responses": [base_prompt_response_1, base_prompt_response_2, base_prompt_response_3, base_prompt_response_4]})

# Display the DataFrame
result_df.head()

Unnamed: 0,questions,base_prompt_responses
0,What is the protocol for managing sepsis in a ...,Managing sepsis in a critical care unit involv...
1,"What are the common symptoms for appendicitis,...",Common symptoms of appendicitis include:\n\n1....
2,What are the effective treatments or solutions...,"Sudden patchy hair loss, often referred to as ..."
3,What treatments are recommended for a person w...,The treatment for a person who has sustained a...


**Observations:**

1.The base LLM responses are clinically reasonable across critical care, surgery,dermatology, and neurology but lack citations, making them unsuitable for medical decision‑making without RAG.

2.Generated answers are concise and structured but omit protocol depth such as dosage, contraindications, and clinical caveats typically found in medical manuals.

3.The outputs demonstrate general medical knowledge rather than manual‑grounded evidence, confirming the need for retrieval from trusted clinical guidelines.

## Question Answering using LLM with Prompt Engineering

In the next step, we will use prompt engineering to check the effect of a more detailed and well-engineered prompt on the output of the model.

In [122]:
system_prompt = """
You are a helpful medical research assistant. Provide concise and accurate answers based on medical knowledge.
""" #system prompt

### Defining the function to Generate a Response From the LLM

In [123]:
# Define a function to get a response from the OpenAI chat model
def response(system_prompt, user_prompt, max_tokens=512, temperature=0, top_p=0.95):  # set default paramenters
    # Create a chat completion using the OpenAI client
    completion = client.chat.completions.create(
        model="gpt-4o-mini",                                                    # specifying the model to be used.
        messages=[
            {"role": "system", "content": system_prompt},                       # System prompt sets the assistant's behavior
            {"role": "user", "content": user_prompt}                            # User prompt is the input/query to respond to
        ],
        max_tokens=max_tokens,                                                  # Max number of tokens to generate in the response
        temperature=temperature,                                                # Controls randomness in output (0 = deterministic)
        top_p=top_p                                                             # Controls diversity via nucleus sampling
    )
    return completion.choices[0].message.content                                # Return the text content from the model's reply

### Question 1: What is the protocol for managing sepsis in a critical care unit?

In [124]:
response_with_prompt_eng_1=response(system_prompt,question_1)
response_with_prompt_eng_1

'The management of sepsis in a critical care unit typically follows the Surviving Sepsis Campaign guidelines. Here’s a concise protocol:\n\n1. **Early Recognition**: Identify sepsis using clinical criteria (e.g., suspected infection plus organ dysfunction).\n\n2. **Immediate Resuscitation**:\n   - **Fluid Resuscitation**: Administer intravenous fluids (30 mL/kg of crystalloids within the first 3 hours).\n   - **Vasopressors**: If hypotension persists after fluid resuscitation, initiate norepinephrine to maintain mean arterial pressure (MAP) ≥ 65 mmHg.\n\n3. **Antibiotic Therapy**:\n   - Administer broad-spectrum antibiotics within 1 hour of sepsis recognition. Adjust based on culture results and local antibiograms.\n\n4. **Source Control**:\n   - Identify and manage the source of infection (e.g., drainage of abscess, removal of infected devices).\n\n5. **Monitoring**:\n   - Continuously monitor vital signs, urine output, and laboratory parameters (e.g., lactate levels, complete blood c

### Question 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [125]:
response_with_prompt_eng_2=response(system_prompt,question_2)
response_with_prompt_eng_2

"Common symptoms of appendicitis include:\n\n1. Abdominal pain, often starting near the belly button and then moving to the lower right abdomen.\n2. Loss of appetite.\n3. Nausea and vomiting.\n4. Fever.\n5. Constipation or diarrhea.\n6. Abdominal swelling.\n\nAppendicitis cannot be effectively treated with medication alone; it typically requires surgical intervention. The standard surgical procedure for treating appendicitis is an appendectomy, which involves the removal of the inflamed appendix. This can be performed as an open surgery or laparoscopically, depending on the case and the surgeon's preference."

### Question 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [126]:
response_with_prompt_eng_3=response(system_prompt,question_3)
response_with_prompt_eng_3

'Sudden patchy hair loss, often referred to as alopecia areata, can manifest as localized bald spots on the scalp. Here are effective treatments and potential causes:\n\n### Treatments:\n1. **Corticosteroids**: Topical or intralesional corticosteroids can reduce inflammation and promote hair regrowth.\n2. **Minoxidil (Rogaine)**: Over-the-counter topical solution that may stimulate hair growth.\n3. **Immunotherapy**: Treatments like diphencyprone (DPCP) can provoke an allergic reaction to stimulate hair regrowth.\n4. **Anthralin**: A topical medication that can help in some cases by irritating the skin.\n5. **JAK Inhibitors**: Oral medications like tofacitinib and ruxolitinib have shown promise in clinical trials for alopecia areata.\n6. **Light Therapy**: Phototherapy can be beneficial for some patients.\n7. **Supportive Care**: Counseling and support groups can help manage the psychological impact of hair loss.\n\n### Possible Causes:\n1. **Autoimmune Response**: The immune system mi

### Question 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [127]:
response_with_prompt_eng_4=response(system_prompt,question_4)
response_with_prompt_eng_4

'Treatment for a person who has sustained a physical injury to brain tissue, such as a traumatic brain injury (TBI), typically involves a multidisciplinary approach and may include the following:\n\n1. **Emergency Care**: Immediate medical attention may involve stabilizing the patient, ensuring adequate oxygenation, and managing intracranial pressure.\n\n2. **Surgery**: In some cases, surgical intervention may be necessary to remove hematomas, repair skull fractures, or relieve pressure on the brain.\n\n3. **Medications**: \n   - **Analgesics** for pain management.\n   - **Anticonvulsants** to prevent seizures.\n   - **Diuretics** to reduce swelling.\n   - **Corticosteroids** may be used to decrease inflammation.\n\n4. **Rehabilitation**: \n   - **Physical therapy** to improve mobility and strength.\n   - **Occupational therapy** to assist with daily living activities.\n   - **Speech therapy** for communication and swallowing difficulties.\n   - **Neuropsychological therapy** to addres

### Storing the generated outputs from the structured prompts

In [128]:
# creating a dataframe
result_df['responses_with_prompt_eng'] = [response_with_prompt_eng_1, response_with_prompt_eng_2, response_with_prompt_eng_3, response_with_prompt_eng_4]

# Display the DataFrame
result_df.head()

Unnamed: 0,questions,base_prompt_responses,responses_with_prompt_eng,responses_with_RAG
0,What is the protocol for managing sepsis in a ...,Managing sepsis in a critical care unit involv...,The management of sepsis in a critical care un...,Answer:\nThe protocol for managing sepsis in a...
1,"What are the common symptoms for appendicitis,...",Common symptoms of appendicitis include:\n\n1....,Common symptoms of appendicitis include:\n\n1....,Answer:\nThe common symptoms of appendicitis i...
2,What are the effective treatments or solutions...,"Sudden patchy hair loss, often referred to as ...","Sudden patchy hair loss, often referred to as ...",Answer:\nThe effective treatment for sudden pa...
3,What treatments are recommended for a person w...,The treatment for a person who has sustained a...,Treatment for a person who has sustained a phy...,Answer:\nInitial treatment for a person who ha...


**Observations**:

1.The base LLM responses are clinically reasonable across critical care, surgery,dermatology, and neurology but lack citations, making them unsuitable for medical decision‑making without RAG.

2.Generated answers are concise and structured but omit protocol depth such as dosage, contraindications, and clinical caveats typically found in medical manuals.

3.The outputs demonstrate general medical knowledge rather than manual‑grounded evidence, confirming the need for retrieval from trusted clinical guidelines.

4.The system consistently returns answers for all four use‑case questions, validating the basic inference pipeline before adding context‑grounding.

5.The notebook setup correctly loads configs and models, but embedding/RAG steps are not yet grounded in medical source documents.


## Data Preparation for RAG

### Loading the Data

In [19]:
manual_pdf_path = "/content/medical_diagnosis_manual.pdf" #Complete the code to define the file name

In [20]:
pdf_loader = PyMuPDFLoader(manual_pdf_path)

In [21]:
manual = pdf_loader.load()

### Data Overview

#### Checking the first 5 pages

In [22]:
for i in range(5):
    print(f"Page Number : {i+1}",end="\n")
    print(manual[i].page_content,end="\n")

Page Number : 1
biplobsinha25@gmail.com
9X5AUD3EIR
ant for personal use by biplobsinha25@g
shing the contents in part or full is liable
Page Number : 2
biplobsinha25@gmail.com
9X5AUD3EIR
This file is meant for personal use by biplobsinha25@gmail.com only.
Sharing or publishing the contents in part or full is liable for legal action.
Page Number : 3
Table of Contents
1
Front    ................................................................................................................................................................................................................
1
Cover    .......................................................................................................................................................................................................
2
Front Matter    ......................................................................................................................................................................................

### Data Chunking

#### Chunk the PDF into Manageable Text Sections Using a Token-Based Splitter

In [39]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base',
    chunk_size= 500, #Complete the code to define the chunk size
    chunk_overlap= 50 #Complete the code to define the chunk overlap
)

#### Split the Loaded PDF into Chunks for Further Processing

In [40]:
document_chunks = pdf_loader.load_and_split(text_splitter)

#### Check the Number of Chunks Created

In [41]:
len(document_chunks)

8875

### Generate Vector Embeddings for Text Chunks Using OpenAI

In [42]:
# Initialize the OpenAI Embeddings model with API credentials
embedding_model = OpenAIEmbeddings(
    openai_api_key=OPENAI_API_KEY,                                                     # Your OpenAI API key for authentication
    openai_api_base=OPENAI_API_BASE                                             # The OpenAI API base URL endpoint
)

# Generate embeddings (vector representations) for the first two document chunks
embedding_1 = embedding_model.embed_query(document_chunks[0].page_content)      # Embedding for chunk 0
embedding_2 = embedding_model.embed_query(document_chunks[1].page_content)      # Embedding for chunk 1

# Check and print the dimension (length) of the embedding vector
print("Dimension of the embedding vector ", len(embedding_1))                   # Typically 1536 or 2048 depending on model

Dimension of the embedding vector  1536


In [43]:
# Verify if both embeddings have the same dimension (should be True)
len(embedding_1) == len(embedding_2)

# Return/display the two embedding vectors for further inspection or use
embedding_1, embedding_2

([-0.013844738714396954,
  0.015309492126107216,
  0.008478669449687004,
  -0.020055856555700302,
  -0.012936308979988098,
  0.016126373782753944,
  -0.027478214353322983,
  -0.03104150854051113,
  -0.020407961681485176,
  -0.02126709558069706,
  0.04222434014081955,
  0.019605163484811783,
  -0.00669702235609293,
  0.019520658999681473,
  0.0016249610343948007,
  0.017971400171518326,
  0.024281106889247894,
  0.014830630272626877,
  0.012971519492566586,
  0.004387218505144119,
  -0.01829533651471138,
  0.03095700405538082,
  -0.02168961986899376,
  0.01354897115379572,
  -0.022633260115981102,
  -0.023943087086081505,
  0.028985220938920975,
  -0.034759730100631714,
  -0.024731801822781563,
  -0.01809815689921379,
  0.01074622105807066,
  -0.010196938179433346,
  -0.011823659762740135,
  -0.019576994702219963,
  -0.008013891987502575,
  0.00029576756060123444,
  0.02042204514145851,
  -0.004249898251146078,
  0.02191496640443802,
  0.01597144827246666,
  0.006816737819463015,
  -0.0

### Vector Database

#### Setup Vector Database Directory

LangChain is used here to help orchestrate the various components of the Retrieval Augmented Generation (RAG) system. It provides tools and abstractions for:

Loading and splitting documents: Making it easy to load the PDF manual and break it into smaller, manageable chunks.
Creating embeddings: Interfacing with embedding models (like OpenAI's) to convert text into numerical vectors.
Vector databases: Simplifying the process of storing these embeddings in a vector database (Chroma) and performing similarity searches to retrieve relevant information.
Essentially, LangChain helps connect these different pieces together to build the RAG pipeline for question answering.



In [44]:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Define vector DB directory
out_dir = "Chroma"
os.makedirs(out_dir, exist_ok=True)

# Create the embedding function
embedding_function = OpenAIEmbeddings(
    model="text-embedding-3-small"
)



#### Create Vector Database from Documents

In [45]:
# Building the vector store and saving it to disk for future use
# Process documents in smaller batches to avoid exceeding the token limit
import time # Import the time module

batch_size = 100  # Adjust batch size as needed
for i in range(0, len(document_chunks), batch_size):
    batch_chunks = document_chunks[i:i + batch_size]
    if i == 0:
      vectorstore = Chroma.from_documents(
          batch_chunks,                                                            # Documents to index
          embedding_model,                                                            # Embedding model for converting text to vectors
          persist_directory=out_dir                                                   # Save vector DB files here
      )
    else:
      vectorstore.add_documents(batch_chunks)

    time.sleep(0.3) # Add a 1-second delay between batches to mitigate rate limiting

print("Vector store created and documents added.")

Vector store created and documents added.


#### Load Vector Database

In [46]:
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)


In [47]:
retriever.invoke("What are the common symptoms of appendicitis?")


[Document(metadata={'moddate': '2025-11-05T06:16:39+00:00', 'trapped': '', 'format': 'PDF 1.7', 'keywords': '', 'page': 173, 'title': 'The Merck Manual of Diagnosis & Therapy, 19th Edition', 'creationdate': '2012-06-15T05:44:40+00:00', 'modDate': 'D:20251105061639Z', 'creationDate': 'D:20120615054440Z', 'creator': 'Atop CHM to PDF Converter', 'total_pages': 4114, 'author': '', 'source': '/content/medical_diagnosis_manual.pdf', 'subject': '', 'file_path': '/content/medical_diagnosis_manual.pdf', 'producer': 'pdf-lib (https://github.com/Hopding/pdf-lib)'}, page_content="Etiology\nAppendicitis is thought to result from obstruction of the appendiceal lumen, typically by lymphoid\nhyperplasia, but occasionally by a fecalith, foreign body, or even worms. The obstruction leads to\ndistention, bacterial overgrowth, ischemia, and inflammation. If untreated, necrosis, gangrene, and\nperforation occur. If the perforation is contained by the omentum, an appendiceal abscess results.\nSymptoms and S

#### Explore Vector Database and Perform Searches

In [48]:
vectorstore.embeddings

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x7b00310bd190>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x7b0032b7a330>, model='text-embedding-ada-002', dimensions=None, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base='https://api.openai.com/v1', openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

**Instructions:**

In this step, the vector database has already been loaded into memory using the Chroma store, and the embedding function is attached. The next logical action is to query the vector database to verify that similarity search is working correctly.

Therefore, in the following cell, we should perform a similarity search test by passing a clinical question to the vector store and retrieving the top-k most relevant document chunks based on the embeddings stored earlier.

This will confirm:

- my embeddings were generated correctly

- my Chroma vector store loaded properly

- my retrieval step is functioning

- The RAG pipeline is ready for the answer-generation step

In [49]:
vectorstore.similarity_search("What is the protocol for managing sepsis in a critical care unit?",k=3) #Complete the code to pass a query and an appropriate k value

[Document(metadata={'file_path': '/content/medical_diagnosis_manual.pdf', 'trapped': '', 'creationdate': '2012-06-15T05:44:40+00:00', 'total_pages': 4114, 'page': 2456, 'keywords': '', 'author': '', 'format': 'PDF 1.7', 'creator': 'Atop CHM to PDF Converter', 'moddate': '2025-11-05T06:16:39+00:00', 'subject': '', 'producer': 'pdf-lib (https://github.com/Hopding/pdf-lib)', 'title': 'The Merck Manual of Diagnosis & Therapy, 19th Edition', 'source': '/content/medical_diagnosis_manual.pdf', 'creationDate': 'D:20120615054440Z', 'modDate': 'D:20251105061639Z'}, page_content="Parenteral antibiotics should be given after specimens of blood, body fluids, and wound sites have been\ntaken for Gram stain and culture. Very prompt empiric therapy, started immediately after suspecting\nsepsis, is essential and may be lifesaving. Antibiotic selection requires an educated guess based on the\nsuspected source, clinical setting, knowledge or suspicion of causative organisms and of sensitivity\npatterns c

### Retrieval and Response Generation using Vector Search

#### Convert Vector Database into a Retriever and Retrieve Relevant Documents

In [50]:
retriever = vectorstore.as_retriever(
    search_type='similarity',
    search_kwargs={'k': 3}
)
retriever

VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x7b0032b79a90>, search_kwargs={'k': 3})

### System and User Prompt Template

Prompts guide the model to generate accurate responses. Here, we define two parts:

    1. The system message describing the assistant's role.
    2. A user message template including context and the question.

In [51]:
#define the system message
qna_system_message = """
You are an AI clinical assistant designed to support healthcare professionals in quickly reviewing authoritative medical literature.
Your task is to provide evidence-based, concise, and context-grounded responses strictly based on the excerpts provided from
medical manuals, guidelines, and research papers.

User input will include the necessary clinical context for you to answer the question. The context begins with the token: ###Context

### When crafting your response:
- Use ONLY the information given in the provided context.
- Provide clear, clinically accurate answers grounded in the supplied medical text.
- Avoid adding assumptions, interpretations, or general medical knowledge not present in the context.
- When relevant, include the name of the medical manual or research paper, as well as section or page numbers, if they appear in the context.
- If the answer cannot be found in the context, respond strictly with:
  "Sorry, this is out of my knowledge base."

### Response Formatting Requirements:
Answer:
[A concise answer using only the information in the context]

Source:
[Cite the specific source(s) mentioned in the context, including page/section if available]

If the context is empty or irrelevant, your full response must be:
"Sorry, this is out of my knowledge base."
"""

In [52]:
#define the user message
qna_user_message_template = """
###Context
Below are relevant excerpts taken from standard medical manuals, clinical guidelines, and research papers that relate to the healthcare question:

{context}

###Question
{question}

Please answer using ONLY the context provided above.
"""

### Response Function

In [56]:
def generate_rag_response(user_input,k=3,max_tokens=500,temperature=0,top_p=0.95):
    global qna_system_message,qna_user_message_template
    # Retrieve relevant document chunks
    relevant_document_chunks = retriever.invoke(input=user_input)
    context_list = [d.page_content for d in relevant_document_chunks]

    # Combine document chunks into a single context
    context_for_query = ". ".join(context_list)

    user_message = qna_user_message_template.replace('{context}', context_for_query)
    user_message = user_message.replace('{question}', user_input)

    # Generate the response
    try:
        response = client.chat.completions.create(
        model="gpt-4o-mini",   # specifying the model to be used.
        messages=[
            {"role": "system", "content": qna_system_message},
            {"role": "user", "content": user_message}
        ],
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p
        )
        # Extract and print the generated text from the response
        response = response.choices[0].message.content.strip()
    except Exception as e:
        response = f'Sorry, I encountered the following error: \n {e}'

    return response

## Question Answering using RAG

### Question 1: What is the protocol for managing sepsis in a critical care unit?

In [57]:
response_with_rag_1 = generate_rag_response(question_1)
response_with_rag_1

'Answer:\nThe protocol for managing sepsis in a critical care unit includes the following steps: \n1. Obtain specimens of blood, body fluids, and wound sites for Gram stain and culture before starting parenteral antibiotics.\n2. Initiate very prompt empiric antibiotic therapy immediately after suspecting sepsis, which may include gentamicin or tobramycin plus a 3rd-generation cephalosporin (e.g., cefotaxime or ceftriaxone), or ceftazidime plus a fluoroquinolone if Pseudomonas is suspected. Vancomycin should be added if resistant staphylococci or enterococci are suspected, and if there is an abdominal source, include a drug effective against anaerobes (e.g., metronidazole).\n3. Change the antibiotic regimen based on culture and sensitivity results when available, continuing antibiotics for at least 5 days after shock resolves and evidence of infection subsides.\n4. Drain abscesses and surgically excise necrotic tissues as necessary.\n5. Monitor and manage blood glucose levels with a con

### Question 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [58]:
response_with_rag_2 = generate_rag_response(question_2)
response_with_rag_2

"Answer:\nThe common symptoms of appendicitis include epigastric or periumbilical pain followed by nausea, vomiting, and anorexia, with pain shifting to the right lower quadrant. Classic signs include right lower quadrant tenderness at McBurney's point, Rovsing sign, psoas sign, and obturator sign. Appendicitis cannot be cured via medicine; the treatment is surgical removal, specifically an open or laparoscopic appendectomy.\n\nSource:\nThe Merck Manual of Diagnosis & Therapy, 19th Edition, Chapter 11. Acute Abdomen & Surgical Gastroenterology, pages 163."

### Question 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [59]:
response_with_rag_3 = generate_rag_response(question_3)
response_with_rag_3

'Answer:\nThe effective treatment for sudden patchy hair loss, known as alopecia areata, is not specified in the provided context. However, it is noted that alopecia areata is thought to be an autoimmune disorder affecting genetically susceptible individuals. Possible causes include systemic illnesses, particularly those that cause high fever, systemic lupus, endocrine disorders, and nutritional deficiencies. \n\nSource:\nThe Merck Manual of Diagnosis & Therapy, 19th Edition, Chapter 86. Hair Disorders, pages 848-849.'

### Question 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [60]:
response_with_rag_4 = generate_rag_response(question_4)
response_with_rag_4

'Answer:\nInitial treatment for a person who has sustained a physical injury to brain tissue includes ensuring a reliable airway and maintaining adequate ventilation, oxygenation, and blood pressure. Surgery may be needed for severe injuries to monitor and treat intracranial pressure, decompress the brain, or remove hematomas. Subsequently, many patients require rehabilitation, which should be planned early and may involve a team approach including physical, occupational, and speech therapy, as well as cognitive therapy for those with severe cognitive dysfunction.\n\nSource:\nThe Merck Manual of Diagnosis & Therapy, 19th Edition, Chapter 324. Traumatic Brain Injury, and Chapter 350. Rehabilitation.'

### Storing the RAG system outputs


In [61]:
# Add the results to a new column in the DataFrame
result_df['responses_with_RAG'] = [response_with_rag_1, response_with_rag_2, response_with_rag_3, response_with_rag_4]

# Display the DataFrame
result_df.head()

Unnamed: 0,questions,base_prompt_responses,responses_with_prompt_eng,responses_with_RAG
0,What is the protocol for managing sepsis in a ...,Managing sepsis in a critical care unit involv...,The management of sepsis in a critical care un...,Answer:\nThe protocol for managing sepsis in a...
1,"What are the common symptoms for appendicitis,...",Common symptoms of appendicitis include:\n\n1....,Common symptoms of appendicitis include:\n\n1....,Answer:\nThe common symptoms of appendicitis i...
2,What are the effective treatments or solutions...,"Sudden patchy hair loss, often referred to as ...","Sudden patchy hair loss, often referred to as ...",Answer:\nThe effective treatment for sudden pa...
3,What treatments are recommended for a person w...,The treatment for a person who has sustained a...,Treatment for a person who has sustained a phy...,Answer:\nInitial treatment for a person who ha...


## Output Evaluation

#### **Defining required System Prompts**

In [93]:
groundedness_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
The answer should be derived only from the information presented in the context

Instructions:
1. First write down the steps that are needed to evaluate the answer as per the metric.
2. Give a step-by-step explanation if the answer adheres to the metric considering the question and context as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the answer using the evaluaton criteria and assign a score.

Return only the Score in last in a dictionary format not json and score should be in the range of 1 to 5.
Example {groundedness_score:4}
"""

In [63]:
relevance_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
Relevance measures how well the answer addresses the main aspects of the question, based on the context.
Consider whether all and only the important aspects are contained in the answer when evaluating relevance.

Instructions:
1. First write down the steps that are needed to evaluate the context as per the metric.
2. Give a step-by-step explanation if the context adheres to the metric considering the question as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the context using the evaluaton criteria and assign a score.
Return only the Score in last in a dictionary format not json and score should be in the range of 1 to 5.
Example {relevance_score:4}
"""

In [64]:
user_message_template = """
###Question
{question}

###Context
{context}

###Answer
{answer}
"""

#### **Definig the LLM-as-a-Judge Evaluation function**

In [80]:
def generate_ground_relevance_response(user_input,response,  max_tokens=500, temperature=0, top_p=0.95):  # Complete the code to set default paramenters
    global qna_user_message_template

    context_for_query = [doc.page_content for doc in retriever.invoke(input=user_input)]

    # Combine user_prompt and system_message to create the prompt
    groundedness_prompt = f"""[INST]{groundedness_rater_system_message}\n
                {'user'}: {user_message_template.format(context=context_for_query, question=user_input, answer=response)}
                [/INST]"""

    # Combine user_prompt and system_message to create the prompt
    relevance_prompt = f"""[INST]{relevance_rater_system_message}\n
                {'user'}: {user_message_template.format(context=context_for_query, question=user_input, answer=response)}
                [/INST]"""

    response_1 = client.chat.completions.create(
            model="gpt-3.5-turbo",   # Complete the code by specifying the model to be used.
            messages=[
                {"role": "user", "content": groundedness_prompt}
                ],
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p
            )

    response_2 = client.chat.completions.create(
            model="gpt-3.5-turbo",   # Complete the code by specifying the model to be used.
            messages=[
                {"role": "user", "content": relevance_prompt}
                ],
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p
            )

    return response_1.choices[0].message.content,response_2.choices[0].message.content

#### **Evaluation 1: Base Prompt Response Evaluation**

In [135]:
# Question 1
ground_1,rel_1 = generate_ground_relevance_response(user_input=result_df.questions[0], response=result_df.base_prompt_responses[0], max_tokens=516)
print(ground_1,end="\n\n")
print(rel_1)

To evaluate the answer as per the metric, the following steps need to be taken:
1. Identify if the information provided in the answer is directly derived from the context given.
2. Check if there are any additional details or information in the answer that is not present in the context.
3. Ensure that the answer does not include any information that is not mentioned in the context.

Explanation:
The AI-generated answer provides a detailed protocol for managing sepsis in a critical care unit. It includes steps for early recognition, initial assessment, immediate interventions, monitoring and support, and ongoing management. The information provided in the answer aligns with the context given, which outlines the protocol for managing sepsis in a critical care unit. The answer does not introduce any new information that is not present in the context.

Therefore, the answer follows the metric of being derived only from the information presented in the context.

Based on the evaluation crit

In [136]:
# Question 2
ground_2,rel_2 = generate_ground_relevance_response(user_input=result_df.questions[1], response=result_df.base_prompt_responses[1], max_tokens=516)  #Complete the code to calculate the groundedness and relevance score
print(ground_2,end="\n\n")
print(rel_2)

To evaluate the answer based on the metric provided, we need to follow these steps:
1. Identify if the information provided in the answer is solely derived from the context given.
2. Check if any additional information not present in the context is included in the answer.
3. Determine if the answer directly addresses the question asked without introducing new information.

Explanation:
The AI-generated answer provides a list of common symptoms of appendicitis and details the standard treatment procedures for appendicitis, including open appendectomy and laparoscopic appendectomy. The information presented in the answer aligns with the details provided in the context regarding the symptoms, diagnosis, and treatment of appendicitis. The answer does not introduce any new information that is not present in the context.

Therefore, the answer follows the metric of being derived only from the information presented in the context.

Based on the evaluation criteria:
The answer demonstrates a c

In [137]:
# Question 3
ground_3,rel_3 = generate_ground_relevance_response(user_input=result_df.questions[2], response=result_df.base_prompt_responses[2], max_tokens=516)  #Complete the code to calculate the groundedness and relevance score
print(ground_3,end="\n\n")
print(rel_3)

To evaluate the answer as per the metric, the following steps need to be followed:
1. Identify if the information provided in the answer is solely derived from the context given.
2. Check if the answer includes any additional information not present in the context.
3. Verify if the treatments and causes mentioned in the answer align with the information provided in the context regarding sudden patchy hair loss and alopecia areata.

Explanation:
The AI-generated answer provides information on the possible causes and effective treatments for sudden patchy hair loss, specifically focusing on alopecia areata. It mentions autoimmune disorders, genetics, stress, hormonal changes, nutritional deficiencies, infections, and other medical conditions as possible causes. For effective treatments, it lists topical corticosteroids, minoxidil, intralesional corticosteroid injections, immunotherapy, oral medications, light therapy, nutritional support, stress management, and cosmetic solutions like wi

In [138]:
# Question 4
ground_4,rel_4 = generate_ground_relevance_response(user_input=result_df.questions[3], response=result_df.base_prompt_responses[3], max_tokens=516)  #Complete the code to calculate the groundedness and relevance score
print(ground_4,end="\n\n")
print(rel_4)

To evaluate the answer based on the metric provided, we need to ensure that the information presented in the context is the sole source for generating the answer. Here are the steps to evaluate the answer:

1. Identify if the information provided in the answer is directly sourced from the context given.
2. Check if the details mentioned in the answer align with the information presented in the context.
3. Verify if there are no additional external sources or information used to formulate the answer.

Explanation:
The AI-generated answer provides a comprehensive list of treatments recommended for a person who has sustained a physical injury to brain tissue, specifically a traumatic brain injury (TBI). The answer includes emergency care, surgical interventions, medication, rehabilitation, psychological support, lifestyle modifications, supportive care, long-term management, assistive devices, and alternative therapies. Each treatment option is explained in detail, covering various aspect

**Evaluation 1: Base Prompt Response Evaluation**

In [139]:
# Create a DataFrame to store the base prompt evaluation results
base_prompt_evaluation_df = pd.DataFrame({
    "question": [question_1, question_2, question_3,question_4],
    "base_prompt_response": [result_df.base_prompt_responses[0], result_df.base_prompt_responses[1], result_df.base_prompt_responses[2],result_df.base_prompt_responses[3]],
    "groundedness_score": [ground_1[-2], ground_2[-2], ground_3[-2],ground_4[-2]],
    "relevance_score": [rel_1[-2], rel_2[-2], rel_3[-2],rel_4[-2]]
})

base_prompt_evaluation_df['groundedness_score'] = pd.to_numeric(base_prompt_evaluation_df['groundedness_score'], errors='coerce')
base_prompt_evaluation_df['relevance_score'] = pd.to_numeric(base_prompt_evaluation_df['relevance_score'], errors='coerce')

# Display the DataFrame
display(base_prompt_evaluation_df)

Unnamed: 0,question,base_prompt_response,groundedness_score,relevance_score
0,What is the protocol for managing sepsis in a ...,Managing sepsis in a critical care unit involv...,5,3.0
1,"What are the common symptoms for appendicitis,...",Common symptoms of appendicitis include:\n\n1....,5,3.0
2,What are the effective treatments or solutions...,"Sudden patchy hair loss, often referred to as ...",5,5.0
3,What treatments are recommended for a person w...,The treatment for a person who has sustained a...,5,


#### **Evaluation 2: Prompt Engineering Response Evaluation**

In [130]:
# Question 1
ground_1,rel_1 = generate_ground_relevance_response(user_input=result_df.questions[0], response=result_df.responses_with_prompt_eng[0], max_tokens=516)
print(ground_1,end="\n\n")
print(rel_1)

To evaluate the answer based on the metric provided, which states that the answer should be derived only from the information presented in the context, the following steps need to be followed:

1. Identify the key points and information provided in the context regarding the protocol for managing sepsis in a critical care unit.
2. Compare the details mentioned in the AI-generated answer with the information presented in the context.
3. Determine if the answer provided by the AI system aligns with the specific details and guidelines outlined in the context.

Explanation:
The AI-generated answer provides a detailed protocol for managing sepsis in a critical care unit, including steps for early recognition, resuscitation, antibiotic therapy, source control, monitoring, supportive care, reassessment, corticosteroid consideration, glucose control, and communication. These steps are not directly mentioned in the context provided.

Therefore, the AI-generated answer does not strictly adhere to

In [131]:
# Question 2
ground_2,rel_2 = generate_ground_relevance_response(user_input=result_df.questions[1], response=result_df.responses_with_prompt_eng[1], max_tokens=516)  #Complete the code to calculate the groundedness and relevance score
print(ground_2,end="\n\n")
print(rel_2)

Steps to evaluate the answer:
1. Identify if the information provided in the answer is solely based on the context.
2. Check if the symptoms of appendicitis, treatment options, and surgical procedures mentioned in the answer are supported by the context.
3. Evaluate if any additional information not present in the context is included in the answer.

Explanation:
The answer provided lists common symptoms of appendicitis and explains that appendicitis cannot be cured with medicine alone, requiring surgical intervention in the form of an appendectomy. It also mentions the two types of surgical procedures for appendicitis treatment - open surgery and laparoscopic surgery. All the information provided in the answer is directly derived from the context given.

The symptoms mentioned in the answer align with the symptoms described in the context, such as abdominal pain, loss of appetite, nausea, vomiting, fever, constipation or diarrhea, and abdominal swelling. The answer also correctly state

In [132]:
# Question 3
ground_3,rel_3 = generate_ground_relevance_response(user_input=result_df.questions[2], response=result_df.responses_with_prompt_eng[2], max_tokens=516)  #Complete the code to calculate the groundedness and relevance score
print(ground_3,end="\n\n")
print(rel_3)

To evaluate the answer as per the metric, the following steps need to be followed:
1. Identify if the information provided in the answer is directly derived from the context.
2. Check if any additional information not present in the context is included in the answer.
3. Ensure that the treatments and possible causes mentioned in the answer are supported by the context provided.

Explanation:
The AI-generated answer provides information on effective treatments and possible causes for sudden patchy hair loss, specifically focusing on alopecia areata. The context mentions various treatments and causes related to hair loss, including corticosteroids, minoxidil, immunotherapy, genetics, stress, hormonal changes, and nutritional deficiencies. The answer aligns with the context by including treatments like corticosteroids, minoxidil, immunotherapy, and JAK inhibitors, which are supported by the context. The possible causes mentioned in the answer, such as autoimmune response, genetics, stress

In [133]:
# Question 4
ground_4,rel_4 = generate_ground_relevance_response(user_input=result_df.questions[2], response=result_df.responses_with_prompt_eng[2], max_tokens=516)  #Complete the code to calculate the groundedness and relevance score
print(ground_4,end="\n\n")
print(rel_4)

To evaluate the answer as per the metric, the following steps need to be followed:
1. Identify if the information provided in the answer is directly derived from the context.
2. Check if the treatments and possible causes mentioned in the answer are supported by the information given in the context.
3. Ensure that no additional information beyond what is provided in the context is included in the answer.

Explanation:
The AI-generated answer provides information on effective treatments and possible causes for sudden patchy hair loss, specifically focusing on alopecia areata. The context mentions various treatments for different types of hair loss, including corticosteroids, minoxidil, immunotherapy, anthralin, JAK inhibitors, and light therapy. It also discusses possible causes such as autoimmune response, genetics, stress, hormonal changes, nutritional deficiencies, and infections. The answer aligns with the information presented in the context and does not introduce any new informati

In [134]:
# Create a DataFrame to store the prompt engineering evaluation results
prompt_engineering_evaluation_df = pd.DataFrame({
    "question": [question_1, question_2, question_3,question_4],
    "prompt_engg_response": [result_df.responses_with_prompt_eng[0], result_df.responses_with_prompt_eng[1], result_df.responses_with_prompt_eng[2],result_df.responses_with_prompt_eng[3]],
    "groundedness_score": [ground_1[-2], ground_2[-2], ground_3[-2],ground_4[-2]],
    "relevance_score": [rel_1[-2], rel_2[-2], rel_3[-2],rel_4[-2]]
})

prompt_engineering_evaluation_df['groundedness_score'] = pd.to_numeric(prompt_engineering_evaluation_df['groundedness_score'], errors='coerce')
prompt_engineering_evaluation_df['relevance_score'] = pd.to_numeric(prompt_engineering_evaluation_df['relevance_score'], errors='coerce')

# Display the DataFrame
display(prompt_engineering_evaluation_df)

Unnamed: 0,question,prompt_engg_response,groundedness_score,relevance_score
0,What is the protocol for managing sepsis in a ...,The management of sepsis in a critical care un...,2,3
1,"What are the common symptoms for appendicitis,...",Common symptoms of appendicitis include:\n\n1....,5,5
2,What are the effective treatments or solutions...,"Sudden patchy hair loss, often referred to as ...",5,5
3,What treatments are recommended for a person w...,Treatment for a person who has sustained a phy...,5,5


#### **Evaluation 3: RAG Response Evaluation**

In [140]:
# Question 1
ground_1,rel_1 = generate_ground_relevance_response(user_input=result_df.questions[0], response=result_df.responses_with_RAG[0], max_tokens=516)  #Complete the code to calculate the groundedness and relevance
print(ground_1,end="\n\n")
print(rel_1)

To evaluate the answer as per the metric, the following steps need to be followed:
1. Identify if the information provided in the answer is directly derived from the context.
2. Check if any additional information not present in the context is included in the answer.
3. Verify if the steps mentioned in the answer align with the protocol for managing sepsis in a critical care unit as described in the context.

Explanation:
The AI-generated answer provides a detailed protocol for managing sepsis in a critical care unit based on the information presented in the context. It includes steps such as obtaining specimens for culture, initiating prompt empiric antibiotic therapy, changing antibiotic regimen based on culture and sensitivity results, draining abscesses, monitoring blood glucose levels, and providing supportive care. All the steps mentioned in the answer are directly derived from the context provided.

Therefore, the answer follows the metric completely as it only uses the informat

In [141]:
# Question 2
ground_2,rel_2 = generate_ground_relevance_response(user_input=result_df.questions[1], response=result_df.responses_with_RAG[1], max_tokens=516)  #Complete the code to calculate the groundedness and relevance
print(ground_2,end="\n\n")
print(rel_2)

Steps to evaluate the answer:
1. Identify if the information provided in the answer is directly from the context.
2. Check if the symptoms of appendicitis and the treatment mentioned in the answer are supported by the context.
3. Ensure that no additional information beyond what is provided in the context is included in the answer.

Explanation:
The answer directly mentions the common symptoms of appendicitis as described in the context, such as epigastric or periumbilical pain, nausea, vomiting, and anorexia, with pain shifting to the right lower quadrant. It also includes classic signs like right lower quadrant tenderness at McBurney's point, Rovsing sign, psoas sign, and obturator sign. The answer correctly states that appendicitis cannot be cured via medicine and requires surgical removal, specifically an open or laparoscopic appendectomy. The source of the information is also mentioned.

Therefore, the answer strictly adheres to the metric by deriving all the information solely fr

In [142]:
# Question 3
ground_3,rel_3 = generate_ground_relevance_response(user_input=result_df.questions[2], response=result_df.responses_with_RAG[2], max_tokens=516)  #Complete the code to calculate the groundedness and relevance
print(ground_3,end="\n\n")
print(rel_3)

To evaluate the answer based on the metric provided, we need to follow these steps:
1. Identify if the information provided in the answer is solely derived from the context given.
2. Check if the answer directly addresses the question asked.
3. Determine if the answer includes any additional information not present in the context.

Explanation:
The AI-generated answer correctly identifies that the effective treatment for sudden patchy hair loss, specifically alopecia areata, is not specified in the provided context. It mentions that alopecia areata is thought to be an autoimmune disorder affecting genetically susceptible individuals and lists possible causes such as systemic illnesses, high fever, systemic lupus, endocrine disorders, and nutritional deficiencies. This information is directly derived from the context provided, specifically from the section on alopecia areata and its possible causes.

Therefore, the answer adheres to the metric by providing information solely based on th

In [143]:
# Question 4
ground_4,rel_4 = generate_ground_relevance_response(user_input=result_df.questions[3], response=result_df.responses_with_RAG[3], max_tokens= 516)  #Complete the code to calculate the groundedness and relevance
print(ground_4,end="\n\n")
print(rel_4)

To evaluate the answer based on the metric provided, we need to follow these steps:
1. Identify if the information provided in the answer is solely derived from the context given.
2. Check if the answer includes any additional information not present in the context.
3. Ensure that the answer does not introduce any new concepts or details that are not mentioned in the context.
4. Verify that the answer directly relates to the question asked about treatments recommended for a person with a physical injury to brain tissue.

Explanation:
The AI-generated answer provides information on the initial treatment for a person with a physical injury to brain tissue, which includes ensuring a reliable airway, maintaining ventilation, oxygenation, and blood pressure. It also mentions the possibility of surgery for severe injuries and the need for subsequent rehabilitation involving a team approach. The answer directly relates to the context provided, which discusses rehabilitation and treatment for 

In [144]:
# Create a DataFrame to store the Rag evaluation results
Rag_evaluation_df = pd.DataFrame({
    "question": [question_1, question_2, question_3,question_4],
    "rag_response": [result_df.responses_with_RAG[0], result_df.responses_with_RAG[1], result_df.responses_with_RAG[2],result_df.responses_with_RAG[3]],
    "groundedness_score": [ground_1[-2], ground_2[-2], ground_3[-2],ground_4[-2]],
    "relevance_score": [rel_1[-2], rel_2[-2], rel_3[-2],rel_4[-2]]
})

Rag_evaluation_df['groundedness_score'] = pd.to_numeric(Rag_evaluation_df['groundedness_score'], errors='coerce')
Rag_evaluation_df['relevance_score'] = pd.to_numeric(Rag_evaluation_df['relevance_score'], errors='coerce')

# Display the DataFrame
display(Rag_evaluation_df)

Unnamed: 0,question,rag_response,groundedness_score,relevance_score
0,What is the protocol for managing sepsis in a ...,Answer:\nThe protocol for managing sepsis in a...,5,
1,"What are the common symptoms for appendicitis,...",Answer:\nThe common symptoms of appendicitis i...,5,5.0
2,What are the effective treatments or solutions...,Answer:\nThe effective treatment for sudden pa...,3,
3,What treatments are recommended for a person w...,Answer:\nInitial treatment for a person who ha...,5,5.0


In [146]:
print("Average scores for Base Prompt Evaluation:")
print(base_prompt_evaluation_df[['groundedness_score', 'relevance_score']].mean(numeric_only=True))
print("Average scores for Prompt engg Evaluation:")
print(prompt_engineering_evaluation_df[['groundedness_score', 'relevance_score']].mean(numeric_only=True))
print("\nAverage scores for RAG Response Evaluation:")
print(Rag_evaluation_df[['groundedness_score', 'relevance_score']].mean(numeric_only=True))

Average scores for Base Prompt Evaluation:
groundedness_score    5.000000
relevance_score       3.666667
dtype: float64
Average scores for Prompt engg Evaluation:
groundedness_score    4.25
relevance_score       4.50
dtype: float64

Average scores for RAG Response Evaluation:
groundedness_score    4.5
relevance_score       5.0
dtype: float64


## Actionable Insights and Business Recommendations

1. Implement Retrieval‑Augmented Generation using authoritative medical manuals (e.g., Merck Manual, Harrison’s) to ensure evidence‑based answers.
2. RAG is your best-performing method — with a perfect relevance score of 5.0, confirming that your retrieval pipeline is correctly matching medical queries to the right chunks.
3. Groundedness still needs improvement (4.5 vs 5.0) — meaning the answers are clinically correct but not always fully supported by retrieved evidence. This is now the main improvement area.
4. Base Prompt’s perfect groundedness is misleading — it’s “correct-looking” medical reasoning, not actually evidence-based. we should not rely on this for clinical use.
5. Chunking issues are visible because RAG groundedness is lower than Base Prompt.If RAG was fully optimized:
Groundedness should be higher than Base Prompt.
Instead:
✅ RAG groundedness = 4.5
✅ Base Prompt = 5.0
This confirms our chunk overlap or chunk size is suboptimal.

5. Chunking strategy needs refinement — increase chunk size to 800–1000 tokens with 150–200 token overlap, however due to limitation of tokens couldn't carried out the higher chunking. But to boost evidence density, we need to use higher chunking strategy with proper overlapping.

6. we need to add a cross-encoder reranker — to guarantee the most evidence-rich chunk is ranked #1, directly improving groundedness from 4.5 → 5.0.
7. Need to standardize a clinical answer template —
Assessment → Management → Red Flags → Evidence
This ensures the model maps retrieved facts into structured, grounded output.








###Here are the business recommendations:

1. Strengthen the RAG pipeline with hybrid retrieval, re-ranking, and strict evidence-only answer generation to ensure fully grounded medical outputs.

2. Expand the evaluation dataset to a comprehensive 30–50 question medical benchmark covering multiple specialties for reliable performance validation.

3. Introduce mandatory citation formatting and full traceability (chunk IDs, page numbers, document metadata) to increase clinical trust and regulatory readiness.

4. Implement safety guardrails, fallback responses, and PHI-safe logging to meet compliance standards and reduce legal or clinical risks.

5. Standardize structured outputs (Assessment → Management → Red Flags → Evidence) to align with clinician workflows and enhance usability.

6. Build monitoring dashboards for groundedness, relevance, retrieval quality, latency, and token cost to support continuous improvement and enterprise scalability.

<font size=6 color='#4682B4'>Power Ahead</font>
___