## Problem Statement

### Business Context

The healthcare industry is rapidly evolving, with professionals facing increasing challenges in managing vast volumes of medical data while delivering accurate and timely diagnoses. The need for quick access to comprehensive, reliable, and up-to-date medical knowledge is critical for improving patient outcomes and ensuring informed decision-making in a fast-paced environment.

Healthcare professionals often encounter information overload, struggling to sift through extensive research and data to create accurate diagnoses and treatment plans. This challenge is amplified by the need for efficiency, particularly in emergencies, where time-sensitive decisions are vital. Furthermore, access to trusted, current medical information from renowned manuals and research papers is essential for maintaining high standards of care.

To address these challenges, healthcare centers can focus on integrating systems that streamline access to medical knowledge, provide tools to support quick decision-making, and enhance efficiency. Leveraging centralized knowledge platforms and ensuring healthcare providers have continuous access to reliable resources can significantly improve patient care and operational effectiveness.

**Common Questions to Answer**

**1. Diagnostic Assistance**: "What are the common symptoms and treatments for pulmonary embolism?"

**2. Drug Information**: "Can you provide the trade names of medications used for treating hypertension?"

**3. Treatment Plans**: "What are the first-line options and alternatives for managing rheumatoid arthritis?"

**4. Specialty Knowledge**: "What are the diagnostic steps for suspected endocrine disorders?"

**5. Critical Care Protocols**: "What is the protocol for managing sepsis in a critical care unit?"

### Objective

As an AI specialist, your task is to develop a RAG-based AI solution using renowned medical manuals to address healthcare challenges. The objective is to **understand** issues like information overload, **apply** AI techniques to streamline decision-making, **analyze** its impact on diagnostics and patient outcomes, **evaluate** its potential to standardize care practices, and **create** a functional prototype demonstrating its feasibility and effectiveness.

### Data Description

The **Merck Manuals** are medical references published by the American pharmaceutical company Merck & Co. They cover a wide range of medical topics, including disorders, tests, diagnoses, and drugs. The manuals have been published since 1899, when Merck & Co. was still a subsidiary of the German company Merck.

The manual is provided as a PDF with over 4,000 pages divided into 23 sections.

**Revised Problem Statement:**

Healthcare professionals struggle to efficiently navigate the vast and complex medical literature, such as the Merck Manual, leading to information overload and delayed decision-making. This project develops a Retrieval-Augmented Generation (RAG)-based AI solution to streamline access to accurate, up-to-date medical knowledge. By leveraging advanced NLP techniques—including large language models, embeddings, and vector databases—the system will provide precise answers to critical medical queries, assess response quality, and deliver actionable insights. The goal is to enhance diagnostic accuracy, standardize care, and improve operational efficiency in healthcare settings.

**Steps to Create and Evaluate RAG:**
- **Install and Import Libraries:** Set up the Google Colab environment with necessary libraries for PDF processing, NLP, embeddings, vector databases, and LLMs.
- **Load PDF Data:** Extract text from the Merck Manual PDF using a robust PDF reader.
- **Exploratory Data Analysis (EDA):** Analyze the PDF structure, page count, and content distribution to understand the data.
- **Data Chunking:** Split the PDF text into manageable chunks for efficient retrieval.
- **Embedding Generation:** Use a SentenceTransformer model to create embeddings for the text chunks.
- **Vector Database Setup:** Store embeddings in a Chroma vector database with cosine similarity search.
- **Retriever Configuration:** Define a retriever to fetch relevant chunks based on query similarity.
- **Load LLM:** Download and configure the Mistral-7B-Instruct model for response generation.
- **Define Prompt Templates:** Create system and user prompts to guide the LLM in generating accurate responses.
- **Response Generation Function:** Build a function to combine retrieved context and LLM for question answering.
- **Question Answering with RAG:** Answer the five provided medical questions using the RAG pipeline.
- **Fine-Tune Parameters:** Experiment with at least five combinations of chunking, retriever, and LLM parameters (e.g., chunk size, k value, temperature, top_p).
- **Output Evaluation:** Use "**LLM-as-a-Judge**" to evaluate responses for groundedness and relevance with defined prompts.
- **Business Insights and Recommendations:** Summarize key takeaways and provide actionable recommendations for healthcare stakeholders.
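
To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in for illustration only: the notebook imports LangChain's `RecursiveCharacterTextSplitter`, and the function name `chunk_text` and the parameter values below are assumptions, not the settings used in the pipeline.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# Hypothetical sample text, repeated to get something long enough to split
sample = "Sepsis is a life-threatening response to infection. " * 10
chunks = chunk_text(sample, chunk_size=120, overlap=20)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the end of one chunk is repeated at the start of the next, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.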

## Installing and Importing Necessary Libraries and Dependencies

In [None]:
# Installation for GPU llama-cpp-python
!pip install huggingface_hub pandas tiktoken pymupdf langchain langchain-community chromadb sentence-transformers "numpy<2" -q
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --no-cache-dir llama-cpp-python==0.2.77 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124

# Installation for CPU llama-cpp-python
# uncomment and run the following code in case GPU is not being used
# !CMAKE_ARGS="-DLLAMA_CUBLAS=off" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q

**To compare "Responses" from LLM and RAG:**
- We'll use the following metrics for comparison:

1. Semantic Similarity (Cosine similarity using Sentence-BERT embeddings)
  * Measures how close the meaning is to a reference.

2. BLEU Score (precision-oriented n-gram overlap)
  * Measures how much of the response's wording also appears in the gold-standard answer.

3. ROUGE Score (Recall-Oriented Understudy for Gisting Evaluation)
  * Measures how much of the gold-standard answer's wording is covered by the response.

4. Readability (Flesch Reading Ease)
  * Checks how easily understandable the response is; higher scores are easier to read.
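
As a rough illustration of the difference between the two overlap metrics, the sketch below computes clipped unigram precision (the building block of BLEU-1, without the brevity penalty) and unigram recall (ROUGE-1 recall) using only the standard library. The function name `unigram_overlap`, the whitespace tokenization, and the example sentences are simplifying assumptions; the notebook itself relies on `nltk` and `rouge_score`.

```python
from collections import Counter

def unigram_overlap(reference: str, candidate: str) -> tuple[float, float]:
    """Return (precision, recall) of candidate unigrams against the reference."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches)
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)  # share of candidate tokens that match
    recall = overlap / max(sum(ref_counts.values()), 1)      # share of reference tokens that are covered
    return precision, recall

# Hypothetical sentences, chosen only to show the two numbers diverging
precision, recall = unigram_overlap(
    "give broad-spectrum antibiotics within one hour",
    "give antibiotics early",
)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

A short response can score well on precision while covering little of the reference, which is why the two metrics are reported together.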

In [None]:
# packages for comparison metrics
!pip install sentence-transformers nltk rouge-score textstat tabulate

# Download NLTK data (first-time only)
import nltk
nltk.download('punkt')
nltk.download('punkt_tab') # Download punkt_tab for sentence tokenization

**Restart the Session before continuing**

In [None]:
# Libraries for processing dataframes and text
import json, os
import textstat # to compute readability of the LLM responses
# For comparison of LLMs and RAG
from typing import Dict, List
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from tabulate import tabulate
# Download NLTK data (first-time only)
nltk.download('punkt')
##
import tiktoken
import pandas as pd

#Libraries for Loading Data, Chunking, Embedding, and Vector Databases
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma # Chroma persists data with SQLite

#Libraries for downloading and loading the llm
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

**Insights:**
* PyMuPDF is chosen for fast and accurate PDF text extraction.
* SentenceTransformers provides state-of-the-art embeddings for text similarity.
* Chroma is a lightweight vector database suitable for in-memory storage.
* LangChain simplifies RAG pipeline integration.
* llama-cpp-python loads the GGUF-quantized Mistral model (Q6_K here) for efficient LLM inference on Colab’s GPU.

## Question Answering using LLM

## **Loading the Large Language Model**

#### Downloading and Loading the model

In [None]:
# # Model configuration
# model_name_or_path = "TheBloke/Llama-2-13B-chat-GGUF"
# model_basename = "llama-2-13b-chat.Q5_K_M.gguf"
# model_path = hf_hub_download(
#     repo_id=model_name_or_path,
#     filename=model_basename
#     )
# or use this
model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
model_basename = "mistral-7b-instruct-v0.2.Q6_K.gguf"
model_path = hf_hub_download(
    repo_id=model_name_or_path,
    filename=model_basename
    )

In [None]:
# Use this configuration when the runtime is connected to a GPU.
llm = Llama(
    model_path=model_path,
    n_ctx=2300,
    n_gpu_layers=38,
    n_batch=512
)

# Uncomment the snippet below (and comment out the one above) if the runtime is CPU-only.
#llm = Llama(
#    model_path=model_path,
#    n_ctx=1024,
#    n_threads=os.cpu_count()  # llama-cpp-python takes n_threads; it has no n_cores argument
#)

#### Response

In [None]:
# function to generate, process, and return the response from the LLM
def generate_llama_response(user_prompt):

    # System message
    system_message = """
    [INST]<<SYS>> Respond to the user question based on the user prompt<</SYS>>[/INST]
    """

    # Combine user_prompt and system_message to create the prompt
    prompt = f"{user_prompt}\n{system_message}"

    # Generate a response from the loaded model (Mistral-7B-Instruct via llama.cpp)
    response = llm(
        prompt=prompt,
        max_tokens=1024,
        temperature=0.01,
        top_p=0.95,
        repeat_penalty=1.2,
        top_k=50,
        stop=['INST'],
        echo=False
    )

    # Extract and return the response text
    response_text = response["choices"][0]["text"]
    return response_text

# def response(query,max_tokens=128,temperature=0,top_p=0.95,top_k=50):
#     model_output = lcpp_llm(
#       prompt=query,
#       max_tokens=max_tokens,
#       temperature=temperature,
#       top_p=top_p,
#       top_k=top_k
#     )

#     return model_output['choices'][0]['text']
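
The system message in `generate_llama_response` uses Llama-2-style `<<SYS>>` tags and is appended after the question, whereas Mistral-7B-Instruct models are documented to expect the instruction wrapped in `[INST] ... [/INST]`. A hedged sketch of a prompt builder in that format follows. `build_mistral_prompt` is a hypothetical helper, placing the system text at the start of the instruction is a common convention rather than an official system role, and the BOS token is left out on the assumption that the tokenizer adds it. Whether this changes answer quality for this notebook has not been tested here.

```python
def build_mistral_prompt(user_prompt: str, system_message: str = "") -> str:
    """Wrap the instruction in [INST] ... [/INST]; system text, if any, goes first."""
    system_message = system_message.strip()
    user_prompt = user_prompt.strip()
    if system_message:
        instruction = f"{system_message}\n\n{user_prompt}"
    else:
        instruction = user_prompt
    return f"[INST] {instruction} [/INST]"

prompt = build_mistral_prompt(
    "What is the protocol for managing sepsis in a critical care unit?",
    system_message="Respond to the user question based on the user prompt",
)
print(prompt)
```

The resulting string could be passed as `prompt=` to the `llm(...)` call in place of the concatenated prompt, keeping the sampling parameters unchanged.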

In [None]:
generate_llama_response("What treatment options are available for managing hypertension?") # testing with a random prompt

### Question Answering using LLM (without Prompt Engineering)

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
user_input_1llm = "What is the protocol for managing sepsis in a critical care unit?"
raw_llm_response_1 = generate_llama_response(user_input_1llm)
raw_llm_response_1

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
user_input_2llm = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
raw_llm_response_2 = generate_llama_response(user_input_2llm)
raw_llm_response_2

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
user_input_3llm = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
raw_llm_response_3 = generate_llama_response(user_input_3llm)
raw_llm_response_3

### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
user_input_4llm = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
raw_llm_response_4 = generate_llama_response(user_input_4llm)
raw_llm_response_4

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
user_input_5llm = "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
raw_llm_response_5 = generate_llama_response(user_input_5llm)
raw_llm_response_5

In [None]:
# Raw LLM responses
raw_llm_responses = {
    "Question-1": """1. Recognize and suspect sepsis early: Look out for signs of infection, such as fever or hypothermia, tachycardia or bradycardia, respiratory distress, altered mental status, and lactic acidosis. Use the Sequential Organ Failure Assessment (SOFA) score to assess organ dysfunction.
    2. Initiate resuscitation: Administer oxygen via high-flow nasal cannas or non-rebreather masks as needed. Start intravenous fluids, aiming for a target mean arterial pressure of 65 mmHg and a central venous pressure (CVP) between 8-12 cm H2O if the patient is mechanically ventilated.
    3. Administer antibiotics: Choose empiric antimicrobials based on local guidelines, suspected infection source, and microbiological data when available. Consider broad-spectrum coverage for gram-positive and gram-negative organisms.
    4. Monitor hemodynamic status closely: Use invasive monitoring techniques like arterial lines, central venous catheters, or pulmonary artery catheters to assess cardiac output, oxygen delivery, and consumption. Adjust fluid resuscitation accordingly.
    5. Provide adequate tissue perfusion: Maintain an adequate mean arterial pressure (MAP) and CVP to ensure sufficient organ blood flow. Consider using vasopressors if necessary.
    6. Optimize oxygenation: Use mechanical ventilation as needed, aiming for a target SpO2 of 94-98% or PaO2 > 60 mmHg. Monitor arterial and venous blood gases regularly to assess adequacy of respiratory support.
    7. Correct metabolic acidosis: Administer sodium bicarbonate if pH is <7.15 or base deficit > 6 mEq/L, but be cautious as it may worsen intracellular acidosis and increase calcium ion binding.
    8. Provide adequate nutrition: Initiate enteral feeding within 24 hours of ICU admission if possible; consider parenteral nutrition for those unable to tolerate enteral feeds. Aim for a caloric intake of 1.5-2 times the basal energy expenditure (BEE).
    9. Support organ function: Provide appropriate care based on specific organ dysfunction, such as renal replacement therapy for kidney failure or vasopressor support for cardiovascular instability.
    10. Monitor and manage complications: Be aware of potential complications like coagulopathy, myocardial depression, and secondary infections; address these issues promptly to minimize morbidity and mortality.""",

    "Question-2": """
 Appendicitis is a medical condition characterized by inflammation of the appendix. The following are common symptoms for appendicitis:

1. Abdominal pain, usually starting around the navel and moving to the lower right side.
2. Loss of appetite.
3. Nausea and vomiting.
4. Fever (often low-grade).
5. Diarrhea or constipation.
6. General discomfort, uneasiness, or malaise.
7. Abdominal swelling and rigidity.
8. Inability to pass gas or have a bowel movement.
9. Pain upon walking, coughing, or taking deep breaths.

It's important to note that these symptoms can also be indicative of other conditions, so it is crucial to consult with a healthcare professional for an accurate diagnosis. Appendicitis cannot typically be cured via medicine alone; surgery (appendectomy) is usually required to remove the inflamed appendix and prevent complications such as rupture or peritonitis.

During an appendectomy, the surgeon makes an incision in the abdomen and removes the appendix. The procedure can be performed using open surgery (a larger incision) or laparoscopically (smaller incisions with the use of a camera). After recovery from the surgery, most people return to their normal activities within 1-2 weeks.""",

    "Question-3": """1. Possible Causes:
   - Alopecia Areata: An autoimmune disorder that results in hair loss from certain areas of the scalp or body. It often presents as round patches of baldness.
   - Trauma or Stress: Physical trauma, emotional stress, or excessive pulling on the hair can cause sudden patchy hair loss.
   - Nutritional Deficiencies: Lack of essential nutrients like iron, zinc, and vitamin B12 may lead to hair thinning and patchiness.
   - Hormonal Imbalance: Sudden hormonal changes due to pregnancy, menopause, or thyroid issues can cause temporary hair loss in patches.
   - Medications: Certain medications like chemotherapy drugs, antidepressants, and blood thinners may lead to sudden patchy hair loss as a side effect.
   2. Treatments:
  - Alopecia Areata: Topical treatments such as minoxidil or corticosteroids can help stimulate hair growth in alopecia areata cases. In severe cases, systemic steroid therapy may be recommended.
  - Trauma or Stress: Avoiding excessive pulling on the hair and managing stress through relaxation techniques like meditation, yoga, or deep breathing exercises can promote healthy hair regrowth.
  - Nutritional Deficiencies: Consuming a balanced diet rich in essential nutrients or taking supplements as recommended by a healthcare professional may help address patchy hair loss caused by deficiencies.
  - Hormonal Imbalance: Treating the underlying hormonal issue, such as thyroid disorders or menopause symptoms, can often lead to improved hair growth and reduced patchiness.
  - Medications: Discuss alternative medications with your healthcare provider if you suspect that a medication is causing sudden patchy hair loss. In some cases, switching to an alternate treatment may be necessary.

It's essential to consult a healthcare professional or dermatologist for proper diagnosis and personalized treatment recommendations when dealing with sudden patchy hair loss.""",

    "Question-4": """1. Medical Evaluation and Stabilization: The first step is to ensure that the person's physical condition is stable, which may involve hospitalization for surgery or other medical interventions if necessary. This could include controlling bleeding, managing swelling or pressure in the brain, addressing any respiratory issues, etc.

2. Rehabilitation Program: Once a patient's condition has stabilized, they will typically begin a rehabilitation program to help them regain as much function and independence as possible. This may include physical therapy for motor skills, occupational therapy for daily living tasks, speech-language therapy for communication issues, cognitive rehab for memory or problem solving abilities, and psychological support for emotional wellbeing.

3. Medications: Depending on the specific injury and symptoms, various medications might be prescribed to help manage conditions like seizures, pain, depression, anxiety, or other complications that can arise after a brain injury.

4. Assistive Devices and Technology: In some cases, individuals with brain injuries may require assistive devices such as wheelchairs, walkers, communication aids, or adaptive equipment for daily living tasks to improve their quality of life and functional abilities.

5. Lifestyle Modifications: Making lifestyle changes can also be beneficial in managing the effects of a brain injury. This might include getting enough rest, maintaining a healthy diet, engaging in regular physical activity, avoiding alcohol and drugs, practicing stress management techniques, etc.

6. Support Groups: Connecting with others who have experienced similar injuries or conditions through support groups can provide valuable emotional and social connections for individuals dealing with the challenges of brain injury recovery.

7. Education and Training: Learning about brain injury and its impact on daily life is essential to help individuals adapt and cope effectively. This may involve attending educational workshops, reading books, watching videos, or working one-on-one with healthcare professionals or therapists.

8. Vocational Rehabilitation: For those who are able to return to work after a brain injury, vocational rehab can help identify suitable employment opportunities and provide training for new skills if necessary. This can be an essential part of regaining independence and financial stability following an injury.""",

    "Question-5": """1. Assessing the Injury: The first step is to assess the severity of the fracture. If you suspect a leg fracture while hiking, try not to move the person excessively and avoid putting weight on the injured leg. Check for signs of open wounds, swelling, deformities, numbness or tingling sensations, and inability to bear weight.
    2. Call for Help: If possible, call emergency services or have someone from your group contact them. Provide accurate location information if you're hiking off-trail.
    3. Immobilize the Fracture: Use a splint, sling, or other available materials to immobilize the fractured leg as best as possible. This will help prevent further damage and discomfort during transport. Be careful not to apply too much pressure on the injury site while securing the makeshift brace.
    4. Provide Comfort: Keep the person warm by covering them with a blanket or insulating material, if necessary. Offer water or other fluids to help maintain hydration and provide reassurance during the wait for medical assistance.
    5. Monitor Vital Signs: Check their pulse rate, breathing rate, and blood pressure regularly while waiting for emergency services. Keep track of any changes in their condition.
    6. Prepare for Evacuation: If you're hiking off-trail or far from a road, consider how to safely evacuate the person with their injury. This may involve carrying them on a stretcher or using other rescue techniques depending on your group size and resources.
    7. Post-Injury Care: After reaching medical help, follow the doctor's instructions for post-fracture care, which could include wearing a cast or brace, taking pain medication, undergoing physical therapy, and avoiding weight-bearing activities until fully healed. Remember that recovery time varies depending on the severity of the fracture and individual healing rates.
    8. Prevention: To minimize the risk of leg injuries while hiking, wear appropriate footwear with good traction, stay on marked trails whenever possible, avoid carrying heavy loads, and be aware of your surroundings to prevent falls or tripping hazards."""
}

**Insights:**
- The responses are generic and appear to be drawn from several unnamed sources, so their accuracy cannot be confirmed without a trusted reference.

### **Using the latest Merck Manual answers, obtained via Google Gemini, as Gold-Standard Answers:**

In [None]:
# Gold-standard answers (from Merck Manual via Gemini)
gold_answers = {
    "What is the protocol for managing sepsis in a critical care unit?": """
    The protocol for managing sepsis in a critical care unit involves immediate and aggressive treatment. This includes:
    - Antibiotics: Broad-spectrum antibiotics within one hour.
    - Intravenous Fluids: Crystalloids for fluid resuscitation.
    - Oxygen Support: Ensure adequate oxygenation.
    - Source Control: Identify and remove infection sources.
    - Vasopressors: Norepinephrine if fluids fail.
    - Blood Glucose Control: Insulin if needed.
    - Corticosteroids: Hydrocortisone for persistent hypotension.
    - ICU Admission: For septic shock/severe sepsis.
    - Antimicrobial Reevaluation: Daily de-escalation checks.
    """,

    "What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?": """
    Common symptoms of appendicitis:
    - Pain starting in upper abdomen/navel, shifting to right lower abdomen.
    - Nausea, vomiting, low-grade fever, loss of appetite.
    Appendicitis cannot be reliably cured with medicine alone. Surgical removal (appendectomy) is the recommended treatment.
    """,

    "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?": """
    Sudden patchy hair loss (alopecia areata) is an autoimmune condition. Treatments include:
    - Corticosteroids (injected/topical/oral).
    - Minoxidil, Anthralin, JAK inhibitors (baricitinib).
    - Methotrexate for severe cases.
    - PUVA therapy (limited success).
    """,

    "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?": """
    Treatments for Traumatic Brain Injury (TBI):
    - Mild: Observation, rest, pain relievers.
    - Moderate/Severe: ICU care, ICP monitoring, surgery for hematomas.
    - Rehabilitation: Physical, occupational, speech therapy.
    - Medications: Antiseizure drugs, stimulants, antidepressants.
    """,

    "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?": """
    Precautions for leg fracture:
    - Immobilize with splint, control bleeding, elevate, ice.
    - Avoid weight-bearing, seek emergency help.
    Treatment:
    - Pain relief, reduction (manual/surgical), casting.
    - Physical therapy, gradual return to activity.
    - Nutrition (calcium/vitamin D).
    """
}

**How to Interpret Results:**
* Semantic Similarity > 0.8: Strong alignment with gold answer.
* BLEU > 0.3: Good lexical overlap.
* ROUGE-L > 0.5: Decent summary-level similarity.
* Readability (Flesch Reading Ease) > 60: Easily understandable (60-70 = standard, 80+ = very easy).
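
The thresholds above can be applied programmatically. The sketch below is one possible way to do it: the metric names match the keys returned by `evaluate_response`, while `THRESHOLDS`, `interpret`, and the example scores are hypothetical and are not results from this notebook.

```python
# Thresholds taken from the interpretation guide; Rouge-1 has no stated threshold
THRESHOLDS = {
    "Semantic Sim": 0.8,
    "BLEU": 0.3,
    "Rouge-L": 0.5,
    "Readability": 60,
}

def interpret(scores: dict) -> dict:
    """Return True/False per metric: does the score exceed its threshold?"""
    return {
        name: scores[name] > limit
        for name, limit in THRESHOLDS.items()
        if name in scores
    }

# Made-up scores for illustration only
example = {"Semantic Sim": 0.84, "BLEU": 0.05, "Rouge-1": 0.4, "Rouge-L": 0.21, "Readability": 41.3}
print(interpret(example))
```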

### **Comparing LLM responses (without Prompt Engineering) with Gold-Standard**

In [None]:
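# Initialize models
from nltk.translate.bleu_score import SmoothingFunction

model = SentenceTransformer('all-MiniLM-L6-v2')
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
smoother = SmoothingFunction().method1  # avoids near-zero BLEU when no higher-order n-grams match

def evaluate_response(gold: str, response: str) -> dict:
    # Semantic Similarity (cosine similarity of sentence embeddings)
    embeddings = model.encode([gold, response])
    semantic_sim = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]

    # BLEU Score (smoothed, since long free-text answers rarely share 4-grams with the reference)
    bleu = sentence_bleu(
        [nltk.word_tokenize(gold)],
        nltk.word_tokenize(response),
        smoothing_function=smoother
    )

    # ROUGE Score (reference first, then the response being scored)
    rouge = scorer.score(gold, response)

    # Readability (Flesch Reading Ease; higher = easier to read)
    readability = textstat.flesch_reading_ease(response)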

    return {
        "Semantic Sim": round(semantic_sim, 3),
        "BLEU": round(bleu, 3),
        "Rouge-1": round(rouge['rouge1'].fmeasure, 3),
        "Rouge-L": round(rouge['rougeL'].fmeasure, 3),
        "Readability": round(readability, 1)
    }

In [None]:
# Evaluate Raw LLM
raw_results = {}
# Create a mapping from "Question-X" keys to the full question text
question_map = {
    "Question-1": "What is the protocol for managing sepsis in a critical care unit?",
    "Question-2": "What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?", # Corrected the key
    "Question-3": "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?",
    "Question-4": "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?",
    "Question-5": "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
}

for q_key in raw_llm_responses:
    # Use the mapping to get the full question text
    full_question = question_map[q_key]
    raw_results[full_question] = evaluate_response(gold_answers[full_question], raw_llm_responses[q_key])

In [None]:
# Print Raw LLM Comparison
print("=== Raw LLM vs Gold Standard ===")
print(tabulate(
    [[q] + list(raw_results[q].values()) for q in raw_results],
    headers=["Question", "Semantic Sim", "BLEU", "Rouge-1", "Rouge-L", "Readability"],
    tablefmt="grid"
))
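
To compare configurations at a glance, the per-question scores can be averaged per metric. The sketch below assumes a results dictionary shaped like `raw_results` (question mapped to a metric dictionary). `average_metrics` is a hypothetical helper and the numbers are placeholders, not outputs of this notebook.

```python
from statistics import mean

def average_metrics(results: dict) -> dict:
    """Average each metric across all questions in a results dictionary."""
    if not results:
        return {}
    metric_names = next(iter(results.values())).keys()
    return {
        name: round(mean(scores[name] for scores in results.values()), 3)
        for name in metric_names
    }

# Placeholder values for illustration only
placeholder_results = {
    "Q1": {"Semantic Sim": 0.80, "BLEU": 0.02},
    "Q2": {"Semantic Sim": 0.70, "BLEU": 0.04},
}
print(average_metrics(placeholder_results))
```

Running the same helper on the results of each approach gives one row per approach, which is easier to compare than five rows each.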

## Question Answering using LLM (with Prompt Engineering)

In [None]:
system_prompt = "You are an expert medical assistant tasked with answering questions based on the content of a medical manual. Use the provided context to give accurate, concise, and relevant answers. If the context does not contain enough information to answer the question fully, clearly state that and provide a general response based on your knowledge. Format your answers clearly and professionally."
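
The system prompt is prepended to each question with a newline before calling the model. As a sketch of how that pattern could be applied to a list of questions in one pass, the helper below takes the generator as an argument so it runs without a model. `answer_all` and `echo_generate` are hypothetical names; in the notebook the generator would be `generate_llama_response`.

```python
from typing import Callable

def answer_all(
    questions: list[str],
    system_prompt: str,
    generate: Callable[[str], str],
) -> dict[str, str]:
    """Prepend the system prompt to each question and collect the responses."""
    return {q: generate(f"{system_prompt}\n{q}") for q in questions}

# Stand-in generator so the sketch runs without loading an LLM
def echo_generate(prompt: str) -> str:
    return f"(response to a prompt of {len(prompt)} characters)"

demo = answer_all(
    ["What is the protocol for managing sepsis in a critical care unit?"],
    "You are an expert medical assistant.",
    echo_generate,
)
print(demo)
```

Keeping the responses in a dictionary keyed by question also avoids copying model output by hand into a separate dictionary for evaluation.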

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
user_input_1llmpmt = system_prompt+"\n"+ "What is the protocol for managing sepsis in a critical care unit?"
prompt_engineered_response_1 = generate_llama_response(user_input_1llmpmt)
prompt_engineered_response_1

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
user_input_2llmpmt = system_prompt+"\n"+ "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
prompt_engineered_response_2 = generate_llama_response(user_input_2llmpmt)
prompt_engineered_response_2

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
user_input_3llmpmt = system_prompt+"\n"+ "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
prompt_engineered_response_3 = generate_llama_response(user_input_3llmpmt)
prompt_engineered_response_3

### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
user_input_4llmpmt = system_prompt+"\n"+ "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
prompt_engineered_response_4 = generate_llama_response(user_input_4llmpmt)
prompt_engineered_response_4

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
user_input_5llmpmt = system_prompt+"\n"+ "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
prompt_engineered_response_5 = generate_llama_response(user_input_5llmpmt)
prompt_engineered_response_5

**Insights:**
- The responses are still drawn from the model's general knowledge rather than a cited source. They are better than the raw LLM responses but still need verification.

### **Comparing LLM responses (with Prompt Engineering) with Gold-Standard**

In [None]:
prompt_engineered_responses = {
    "Question-1": """ In a critical care unit, managing sepsis involves immediate recognition, quick initiation of appropriate therapy, and ongoing assessment and adjustment. Here are general steps for managing sepsis in a critical care setting:

1. Recognition: Suspect sepsis based on clinical signs such as fever or hypothermia, tachycardia or bradycardia, respiratory distress, altered mental status, and lactic acidosis. Obtain blood cultures before starting antibiotics if possible.

2. Initial resuscitation: Administer high-flow oxygen via a nonrebreather mask to maintain adequate oxygenation. Start intravenous (IV) access with large bore catheters for fluid administration. Give 30 mL/kg of crystalloid solution over the first hour, and continue infusion at a rate sufficient to maintain mean arterial pressure (MAP) >65 mmHg or a MAP increase of ≥20% from baseline.

3. Antibiotics: Administer broad-spectrum antibiotics based on suspected infection site and microbial susceptibilities. Consider empiric coverage for common pathogens such as methicillin-resistant Staphylococcus aureus (MRSA), Pseudomonas aeruginosa, and Enterococcus species.

4. Source control: Identify and address the source of infection whenever possible. This may include surgical intervention for abscesses or drainage procedures for infected wounds.

5. Fluid management: Monitor fluid balance closely to avoid over-resuscitation while maintaining adequate tissue perfusion. Consider using colloids, blood products, or vasopressors if needed to maintain MAP and organ function.

6. Inotropic support: Administer inotropes such as dopamine, dobutamine, or milrinone to improve cardiac output when necessary.

7. Electrolyte replacement: Monitor electrolytes closely and replace as needed due to diuretic therapy, insensible losses, or ongoing metabolic demands.

8. Glucose control: Maintain strict glucose control (<150 mg/dL) using continuous insulin infusions if necessary.

9. Renal replacement therapy: Consider initiating renal replacement therapy in cases of severe sepsis or septic shock with persistent organ dysfunction despite adequate fluid resuscitation and vasopressor support.

10. Monitoring: Continuously monitor vital signs, laboratory values (including lactate levels), urine output, and clinical response to treatment. Adjust therapy as needed based on ongoing assessment.""",

    "Question-2": """Appendicitis is a medical condition characterized by inflammation of the appendix. Common symptoms include:

1. Abdominal pain, usually starting around the navel and then shifting to the lower right side.
2. Loss of appetite.
3. Nausea and vomiting.
4. Fever (often low-grade at first).
5. Constipation or diarrhea.
6. Abdominal swelling and rigidity.
7. Inability to pass gas or have a bowel movement.
8. Feeling sick, weak, or generally unwell.

Appendicitis cannot be cured via medicine alone as the inflammation can lead to rupture of the appendix if left untreated. If the appendix bursts, it may cause peritonitis – a serious infection in the abdominal cavity that requires immediate medical attention. The standard treatment for appendicitis is surgical removal of the affected organ called an appendectomy. This procedure can be performed as open surgery or laparoscopically (keyhole surgery). Both methods aim to remove the inflamed appendix while minimizing damage to surrounding tissues and ensuring a quick recovery.""",

    "Question-3": """There are several potential treatments for sudden, patchy hair loss, also known as alopecia areata. This condition is characterized by round or oval bald spots that develop suddenly on various areas of the scalp. The exact cause of alopecia areata is not fully understood, but it's believed to be an autoimmune disorder where the body attacks its own hair follicles.

    1. Topical treatments: Minoxidil (Rogaine) is a medication that can help stimulate hair growth in some people with alopecia areata. It comes as a liquid or foam and is applied directly to the scalp twice daily.

    2. Injections: Corticosteroid injections, such as triamcinolone acetonide (Kenalog), can be effective for treating small bald spots. The medication is injected into the affected area every 4-6 weeks until hair regrowth occurs.

    3. Systemic treatments: Oral corticosteroids like prednisone may be prescribed to treat extensive alopecia areata or when other treatments have not been effective. However, long-term use of oral steroids can lead to side effects and should only be considered under the supervision of a healthcare professional.

    4. Immunotherapy: Diphenylcyclopropenone (DPCP) is an immunotherapeutic agent that can help promote hair regrowth in some people with alopecia areata. It's applied to the scalp once or twice weekly and may cause skin irritation as a side effect.

    5. Alternative therapies: Some alternative treatments, such as acupuncture, herbal remedies (like saw palmetto), and essential oils (such as rosemary oil) have shown promise in promoting hair growth for some individuals with alopecia areata. However, more research is needed to confirm their effectiveness and safety.

    Possible causes of sudden patchy hair loss include:
    - Autoimmune disorders like alopecia areata or lupus erythematosus
    - Hormonal imbalances (such as thyroid issues)
    - Nutritional deficiencies, particularly in iron and protein
    - Stress or emotional trauma
    - Certain medications (like chemotherapy drugs, antidepressants, and beta-blockers)
    - Infections like ringworm of the scalp.

    It's essential to consult a healthcare professional for an accurate diagnosis and appropriate treatment plan if you experience sudden patchy hair loss.""",

    "Question-4": """Based on the context provided, there is no specific information about treatments for a person with a physical injury to brain tissue. However, I can provide some general information based on common medical practices and research findings.

For temporary impairment of brain function due to a traumatic brain injury (TBI), treatment may include:
1. Rest and relaxation: The injured person should get plenty of rest and avoid physical or mental exertion that could worsen their condition.
2. Medications: Depending on the symptoms, medications such as pain relievers, anti-inflammatory drugs, sedatives, or stimulants may be prescribed to manage various symptoms like headache, anxiety, depression, or sleep disturbances.
3. Rehabilitation therapy: Physical, occupational, and speech therapies can help improve motor skills, cognitive function, communication abilities, and overall quality of life.
4. Surgery: In some cases, surgery may be necessary to remove hematomas (blood clots) or repair skull fractures that could worsen the condition or increase the risk of complications.
5. Dietary modifications: A balanced diet rich in nutrients and antioxidants can help support brain health during recovery.
6. Support groups: Joining a support group for TBI survivors and their families can provide emotional and social benefits, as well as valuable information about coping strategies and resources.

For permanent impairment of brain function due to a traumatic injury or other conditions like stroke or degenerative diseases, treatment may focus on managing symptoms and maximizing functional abilities through:
1. Rehabilitation therapy: As mentioned above, rehabilitation therapies can help improve motor skills, cognitive function, communication abilities, and overall quality of life for individuals with permanent brain impairment as well.
2. Assistive devices: Devices like wheelchairs, walkers, or communication aids can make daily activities easier and safer for those with physical limitations due to brain injury.
3. Medications: Depending on the symptoms, medications may be prescribed to manage various conditions such as seizures, pain, depression, anxiety, or sleep disturbances.
4. Lifestyle modifications: Making lifestyle changes like getting regular exercise, practicing stress management techniques, and maintaining a healthy diet can help improve overall health and well-being for individuals with permanent brain impairment.
5. Support services: Accessing support services such as home care, transportation assistance, or financial aid can help make daily life easier and more manageable for those with permanent brain injuries.""",

    "Question-5": """1. Necessary Precautions:
   - Ensure that the person does not bear weight on the affected leg to prevent further damage or displacement of the fracture.
   - Apply a splint, sling, or brace to immobilize the leg and maintain alignment of the bones.
   - Keep the injured area elevated above heart level to reduce swelling and pain.
   - Protect the injury from direct sunlight and extreme temperatures.
   - Administer over-the-counter pain medication as needed for comfort, but avoid giving aspirin due to its blood thinning properties which may increase bleeding risk.

2. Treatment Steps:
   - Seek medical attention promptly if you suspect a fracture or cannot move the leg without significant pain.
   - The healthcare professional will assess the injury and determine the type of fracture, which could involve setting the bone with pins, plates, or screws in some cases.
   - Depending on the severity of the fracture, immobilization may be required using a cast, brace, or splint for several weeks to months.
   - Physical therapy and rehabilitation exercises will likely be recommended after healing to restore strength, flexibility, and mobility to the affected leg.

3. Considerations for Care and Recovery:
   - Provide emotional support and encouragement during recovery as it can be a lengthy process.
   - Ensure proper nutrition and hydration to promote optimal bone healing.
   - Encourage regular follow-up appointments with the healthcare professional to monitor progress and address any concerns or complications that may arise.
   - Make necessary modifications at home, such as installing grab bars in the bathroom or using a shower chair, to make daily activities easier during recovery."""
}

In [None]:
# Evaluate Prompt-Engineered LLM
prompt_results = {}
for q_key in prompt_engineered_responses:
    # Use the mapping to get the full question text
    full_question = question_map[q_key]
    prompt_results[full_question] = evaluate_response(gold_answers[full_question], prompt_engineered_responses[q_key])

In [None]:
# Print Prompt-Engineered LLM Comparison
print("\n=== Prompt-Engineered LLM vs Gold Standard ===")
print(tabulate(
    [[q] + list(prompt_results[q].values()) for q in prompt_results],
    headers=["Question", "Semantic Sim", "BLEU", "Rouge-1", "Rouge-L", "Readability"],
    tablefmt="grid"
))

**Insights:**
- Measured against the gold standard, the prompt-engineered responses score better than the raw LLM responses.

###**Comparing Raw LLM vs LLM with Prompt Engineering**

In [None]:
# Comparative Analysis
print("\n=== Method Comparison (Average Scores) ===")
# Take the metric names from the first result instead of relying on the
# loop variable `full_question` left over from an earlier cell.
metric_names = list(next(iter(raw_results.values())).keys())
avg_raw = {
    k: np.mean([raw_results[q][k] for q in raw_results])
    for k in metric_names
}
avg_prompt = {
    k: np.mean([prompt_results[q][k] for q in prompt_results])
    for k in metric_names
}

print(tabulate(
    [
        ["Raw LLM"] + list(avg_raw.values()),
        ["Prompt-Engineered"] + list(avg_prompt.values())
    ],
    headers=["Method", "Semantic Sim", "BLEU", "Rouge-1", "Rouge-L", "Readability"],
    tablefmt="grid"
))

###**Comparison Result**

In [None]:
# Determine which method performed better overall
better_method = {}
for metric in avg_raw:
    if avg_prompt[metric] > avg_raw[metric]:
        better_method[metric] = "Prompt-Engineered"
    else:
        better_method[metric] = "Raw LLM"

print("\n=== Better Performing Method by Metric ===")
print(tabulate(
    [[k, v] for k, v in better_method.items()],
    headers=["Metric", "Better Method"],
    tablefmt="grid"
))

**Insights:**
* The comparison shows that the LLM with prompt engineering is the better of the two methods.

##**Incorporating RAG on Merck Medical Manual**

## Data Preparation for RAG

### Loading the Data

In [None]:
# run the following lines if using Google Colab (comment them out otherwise)
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# loading the PDF into a list of LangChain documents (one per page)
manual_pdf_path = '/content/drive/MyDrive/Colab Notebooks/UT_Austin_AI_ML/NLP_GenAI/Project_05_AI_Medical_Assistant/medical_diagnosis_manual.pdf'
pdf_loader = PyMuPDFLoader(manual_pdf_path)
manual = pdf_loader.load()

**Insights:**

* PyMuPDF extracts the text reliably and, in many cases, preserves structure better than alternatives such as pdfplumber.
* The extracted text may include headers, footers, or formatting artifacts, which will be addressed in chunking.

### Data Overview

#### Checking the first 5 pages

In [None]:
for i in range(5):
    print(f"Page Number : {i+1}",end="\n")
    print(manual[i].page_content,end="\n")

#### Checking the number of pages and Average words per page

In [None]:
num_pages = len(manual)
print(f"Number of pages in the PDF: {num_pages}")

In [None]:
# Word count analysis
total_words = 0
for doc in manual:
    total_words += len(doc.page_content.split())

avg_words_per_page = total_words / num_pages
print(f"\nAverage words per page: {avg_words_per_page:.2f}")

**Insights:**

* The EDA confirms the PDF has 4,114 pages, aligning with the data dictionary.
* The first and last pages may contain metadata (e.g., title, index), which should be excluded during chunking.
* The average of 484 words per page suggests chunk sizes of at least **350** words for balanced retrieval.

**Why This Range:**
1. Semantic Coherence:
   - Chunks smaller than 350 words may lose context (e.g., breaking mid-paragraph).
   - Chunks larger than 550 words risk information dilution (irrelevant text in retrieval).
2. Retrieval Accuracy:
   - Smaller chunks (300-450 words) work better for precise question-answering.
   - Larger chunks (450-600 words) suit summarization or broad queries.
3. Empirical Evidence:
   - Many RAG implementations use chunks of 256 to 512 tokens (approximately 200 to 400 words).
   - LLaMA/BERT tokenizers average roughly 1.3 tokens per word, so 300 words is approximately 390 tokens.
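
The word-to-token arithmetic above can be reproduced with a small helper. This is only a rough sketch: the 1.3 tokens-per-word ratio is the rule of thumb quoted above rather than a value measured on this corpus, and `estimate_tokens` is a hypothetical helper name.

```python
TOKENS_PER_WORD = 1.3  # rule-of-thumb ratio quoted above (assumption, not measured)

def estimate_tokens(n_words, tokens_per_word=TOKENS_PER_WORD):
    """Return a rough token count for a text of n_words words."""
    return round(n_words * tokens_per_word)

# Word counts discussed in this section: lower bound, suggested size, page average.
for words in (300, 350, 484):
    print(words, "words ->", estimate_tokens(words), "tokens (approx.)")
```

For an exact count, the text would have to be passed through the tokenizer of the model that is actually used.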

### Data Chunking

In [None]:
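# Split the PDF into chunks.
# Note: with the default length function, chunk_size and chunk_overlap are
# measured in characters, not words or tokens.
text_splitter = RecursiveCharacterTextSplitter(
    #encoding_name='cl100k_base',
    chunk_size=512,
    chunk_overlap=20)
chunks = text_splitter.split_documents(manual)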


In [None]:
len(chunks)

In [None]:
chunks[0].page_content

In [None]:
chunks[2].page_content

In [None]:
chunks[3].page_content

In [None]:
chunks[-1000].page_content

In [None]:
chunks[-999].page_content

In [None]:
chunks[-1001].page_content

In [None]:
chunks[-2].page_content

In [None]:
chunks[-1].page_content

**Insights:**
* RecursiveCharacterTextSplitter tries to split on paragraph, line, and word boundaries before falling back to individual characters, which helps keep chunks semantically coherent.
* A chunk size of 512 with an overlap of 20 (both in characters, the splitter's default unit) balances context retention and retrieval efficiency.
* Rough estimate: Total Chunks ≈ Total Pages × (1 + (Characters/Page − Chunk Size) / (Chunk Size − Overlap)), rounded up per page.
* The number of chunks (31192 chunks for 4114 pages) is manageable for Chroma in Colab.

* As expected, there are some overlaps
  - If we increase the `chunk_overlap`, the amount of text shared between consecutive chunks will also increase.
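**Key Notes:**
* Units:
  - `RecursiveCharacterTextSplitter` measures `chunk_size` and `chunk_overlap` in characters by default, not in words or tokens.
* Edge Cases:
  - Pages with ≤512 characters → 1 chunk.
  - Pages with roughly 513 to 1004 characters → 2 chunks.
  - An average page (484 words) is a few thousand characters long, so it is split into several chunks: 31192 chunks / 4114 pages ≈ 7.6 chunks per page.
* Why This Works:
  - The overlap ensures context continuity between chunks.
  - Example with `chunk_size=512` and `chunk_overlap=20`: chunk 1 covers characters 1 to 512 and chunk 2 starts at about character 493, so the two share about 20 characters (approximate, because the splitter prefers to break at separators).
* Total Words Processed:
  - 4114 pages * 484 words/page = 1991176 words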
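
As a rough sanity check, the figures reported in this section (31192 chunks, 4114 pages, about 484 words per page) can be combined to see how many chunks a typical page yields. The numbers are copied from the cell outputs quoted above, and the variable names are illustrative only.

```python
total_chunks = 31192       # reported by len(chunks)
total_pages = 4114         # reported by len(manual)
avg_words_per_page = 484   # rounded average from the word count analysis

chunks_per_page = total_chunks / total_pages
words_per_chunk = avg_words_per_page / chunks_per_page
total_words = total_pages * avg_words_per_page

print(f"chunks per page: {chunks_per_page:.2f}")
print(f"words per chunk (approx.): {words_per_chunk:.1f}")
print(f"total words (approx.): {total_words}")
```

The ratio indicates that each page is split into several chunks of a few dozen words each, which is what a 512-character chunk size would produce.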

### Embedding

In [None]:
embedding_model = SentenceTransformerEmbeddings(model_name='thenlper/gte-large')

In [None]:
embedding_1 = embedding_model.embed_query(chunks[0].page_content)
embedding_2 = embedding_model.embed_query(chunks[1].page_content)

In [None]:
print("Dimension of the embedding vector ",len(embedding_1))
len(embedding_1)==len(embedding_2)

In [None]:
embedding_1,embedding_2

**Insights:**
* **https://huggingface.co/thenlper/gte-large**
* The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and currently offer three different sizes of models, including GTE-large, GTE-base, and GTE-small. The GTE models are trained on a large-scale corpus of relevance text pairs, covering a wide range of domains and scenarios. This enables the GTE models to be applied to various downstream tasks of text embeddings, including information retrieval, semantic textual similarity, text reranking, etc.
* The embedding dimension is fixed at 1024; the vectors can be compared in Chroma using cosine similarity or L2 distance.
* Embeddings capture semantic meaning, enabling accurate retrieval of relevant chunks.
* The embedding model returns a fixed-length (1024-dimensional) vector for every chunk, regardless of the chunk's length.
* This is necessary because vectors must have the same length to be compared for similarity.
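
The comparison mentioned above is usually done with cosine similarity. The following is a minimal, dependency-free sketch; the toy 3-dimensional vectors stand in for the 1024-dimensional embeddings, and `cosine_similarity` is a hypothetical helper rather than part of the embedding library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]    # same direction as v1
v3 = [-3.0, 0.0, 1.0]   # orthogonal to v1
print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

In the notebook, the same function could be applied to `embedding_1` and `embedding_2` to see how close the first two chunks are.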

### Vector Database

In [None]:
out_dir = 'medical_db'

if not os.path.exists(out_dir):
  os.makedirs(out_dir)

In [None]:
vectorstore = Chroma.from_documents(
    chunks,
    embedding_model,
    persist_directory = out_dir
)

In [None]:
# Only reload if the object isn't in memory or after a runtime restart
if 'vectorstore' not in locals():
    vectorstore = Chroma(
        persist_directory=out_dir,
        embedding_function=embedding_model
    )

In [None]:
vectorstore.embeddings

In [None]:
vectorstore.similarity_search("Atrial Fibrillation, Pharynx, Telangiectases",k=3) # Checking if similar content is fetched,

**Insights:**
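* From the retrieved chunks, we observe that all the chunks are related to the key terms [ 'Atrial Fibrillation', 'Pharynx', 'Telangiectases'].
* Chroma works well as a lightweight vector store in Colab; because `persist_directory` is set, the index is also written to disk and can be reloaded after a runtime restart.
* Chroma uses squared L2 distance by default; cosine distance has to be requested explicitly when the collection is created (e.g. `collection_metadata={"hnsw:space": "cosine"}`). Either metric matches query embeddings to chunk embeddings.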
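
Conceptually, `similarity_search` ranks the stored chunk vectors by their closeness to the query vector. The sketch below does this by brute force on toy vectors (Chroma itself uses an approximate nearest-neighbour index). It also shows that, for unit-length vectors, cosine similarity and squared L2 distance give the same ranking. The vectors, document names, and helper functions are made up for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # For unit-length vectors the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query, docs, k, score, reverse):
    """Return the names of the k best documents under the given score function."""
    ranked = sorted(docs, key=lambda name: score(query, docs[name]), reverse=reverse)
    return ranked[:k]

docs = {
    "cardiology": normalize([0.9, 0.1, 0.0]),
    "dermatology": normalize([0.1, 0.9, 0.2]),
    "ent": normalize([0.0, 0.2, 0.9]),
}
query = normalize([0.8, 0.3, 0.1])

by_cosine = top_k(query, docs, k=2, score=cosine, reverse=True)    # higher is better
by_l2 = top_k(query, docs, k=2, score=squared_l2, reverse=False)   # lower is better
print(by_cosine, by_l2)
```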

### Retriever

In [None]:
retriever = vectorstore.as_retriever(
    search_type='similarity',
    search_kwargs={'k':2}
)

In [None]:
rel_docs = retriever.get_relevant_documents("how can polyps be prevented") # Check if related chunks were fetched from the document k=2 will fetch two relevant chunks
rel_docs

**Insights:**
- We can observe that the two relevant chunks contain the answer to the query.  
- If we increase the **`k`** value, there is a chance that we might find the answer in even more chunks.  
- This is a hyperparameter that we need to tune to get the best context.

## Defining the Response Generator

In [None]:
model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
model_basename = "mistral-7b-instruct-v0.2.Q6_K.gguf"

In [None]:
model_path = hf_hub_download(
    repo_id=model_name_or_path,
    filename=model_basename
)

In [None]:
# GPU runtime: offload part of the model to the GPU.
model_output = Llama(
    model_path=model_path,
    n_ctx=2300,
    n_gpu_layers=20, # Reduced the number of GPU layers
    n_batch=512
)

# CPU-only runtime: comment out the block above and uncomment the one below.
# llama-cpp-python takes n_threads (there is no n_cores argument).
# model_output= Llama(
#     model_path=model_path,
#     n_ctx=1024,
#     n_threads=max(1, (os.cpu_count() or 2) - 1)
# )

In [None]:
model_output("What is Atrial Fibrillation and how it can be prevented?")['choices'][0]['text']

- The response is incomplete and generic; it comes from the model's own training data rather than from the manual. Let's provide our own context and align the response with our needs.

**Insights:**

* GGUF files are loaded with llama-cpp-python rather than through the Hugging Face `transformers` pipeline, so we use Mistral-7B-Instruct-v0.2-GGUF with 6-bit (Q6_K) quantization for efficiency.
* `temperature=0.1` and `top_p=0.98` (the values used in the tuned runs later in the notebook) balance creativity and factual accuracy.
* `max_tokens=512` allows detailed responses without excessive length.
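
To illustrate how `temperature` and `top_p` shape the sampling step, here is a simplified sketch on toy logits. It is an illustration of the general idea only, not the sampler that llama.cpp implements, and both function names are hypothetical.

```python
import math

def temperature_softmax(logits, temperature):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 (temperature 0 corresponds to greedy decoding)")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for four candidate tokens
for t in (1.0, 0.1):
    probs = temperature_softmax(logits, t)
    print(t, [round(p, 3) for p in probs], nucleus_filter(probs, top_p=0.98))
```

With a low temperature almost all probability mass sits on the top token, so even a high `top_p` leaves very few candidates to sample from.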

### System and User Prompt Template

Prompts guide the model to generate accurate responses. Here, we define two parts:

1. The system message describing the assistant's role.
2. A user message template including context and the question.

In [None]:
qna_system_message = """
You are a Medical Reference Assistant trained to provide precise, evidence-based answers from the context. Your responses must adhere strictly to the latest clinical guidelines and avoid speculation.

### Input Format:
- Context will follow the token: ###CONTEXT.
- The context contains references to specific portions of a document relevant to the user query.
- Questions will follow the token: ###QUESTION.

### Response Guidelines:
1. **Accuracy**:
   - Base answers ONLY on the provided context. Do not extrapolate or add external knowledge.
   - Cite the specific Merck Manual section (e.g., "Sepsis and Septic Shock, Merck Manual") when possible.

2. **Clarity**:
   - Use professional medical terminology but simplify complex concepts for non-specialists when appropriate.
   - Structure responses as:
     - **Brief Summary** (1-2 sentences).
     - **Key Details** (bulleted/numbered lists for protocols, symptoms, or treatments).
     - **Critical Considerations** (e.g., contraindications, red flags).

3. **Uncertainty Handling**:
   - If the context is insufficient, respond: "This information is not covered in the provided Merck Manual excerpt. Consult the latest edition or a specialist for further guidance."

4. **Safety**:
   - Flag urgent clinical scenarios (e.g., "Immediate ICU admission is required for septic shock").
   - Avoid treatment recommendations beyond the context’s scope.
   - Add a disclaimer (e.g., "This tool supplements but does not replace clinical judgment.")

### Example:
###QUESTION: What is the first-line treatment for uncomplicated hypertension?
###CONTEXT: [Excerpt from Merck Manual on hypertension...]
**Response:**
First-line antihypertensives include:
- Thiazide diuretics (e.g., hydrochlorothiazide)
- ACE inhibitors (e.g., lisinopril)
- Calcium channel blockers (e.g., amlodipine).
*Source: Hypertension, Merck Manual Professional Edition.*
"""

In [None]:
qna_user_message_template = """
###CONTEXT
Here are some documents that are relevant to the question mentioned below.
{context}

###QUESTION
{question}
"""

**Insights:**

* The system prompt establishes the assistant’s role as a reliable medical expert.
* The user prompt template incorporates retrieved context and the question, ensuring grounded responses.
* The prompt emphasizes evidence-based answers, critical for medical applications.
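
The way the two parts are combined with retrieved text can be sketched as follows. The system message here is a shortened stand-in, the chunks are invented, and `build_prompt` is a hypothetical helper; the substitution mirrors the `str.replace` approach used later in the notebook.

```python
SYSTEM_MESSAGE = "You are a Medical Reference Assistant. Answer only from the context."  # shortened stand-in

USER_TEMPLATE = """
###CONTEXT
Here are some documents that are relevant to the question mentioned below.
{context}

###QUESTION
{question}
"""

def build_prompt(system_message, user_template, chunks, question):
    """Join retrieved chunks and substitute them, with the question, into the template."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    user_message = user_template.replace("{context}", context)
    user_message = user_message.replace("{question}", question)
    return system_message + "\n" + user_message

prompt = build_prompt(
    SYSTEM_MESSAGE,
    USER_TEMPLATE,
    chunks=["Chunk A about sepsis.", "Chunk B about fluids."],
    question="What is the protocol for managing sepsis?",
)
print(prompt)
```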

### Response Function

In [None]:
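def generate_rag_response(user_input,k=3,max_tokens=128,temperature=0,top_p=0.95,top_k=50):
    # k is the number of chunks to retrieve; top_k is the sampling cutoff used by the LLM.
    # Retrieve relevant document chunks. The vector store is queried directly so that k is
    # always honoured (depending on the LangChain version, keyword arguments passed to
    # retriever.get_relevant_documents may be ignored in favour of the retriever's search_kwargs).
    relevant_document_chunks = vectorstore.similarity_search(user_input, k=k)
    context_list = [d.page_content for d in relevant_document_chunks]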

    # Combine document chunks into a single context
    context_for_query = ". ".join(context_list)

    user_message = qna_user_message_template.replace('{context}', context_for_query)
    user_message = user_message.replace('{question}', user_input)

    prompt = qna_system_message + '\n' + user_message

    # Generate the response
    try:
        response = model_output(
                  prompt=prompt,
                  max_tokens=max_tokens,
                  temperature=temperature,
                  top_p=top_p,
                  top_k=top_k
                  )

        # Extract the generated text from the model's response
        response = response['choices'][0]['text'].strip()
    except Exception as e:
        response = f'Sorry, I encountered the following error: \n {e}'

    return response

**Insights:**

* The function retrieves relevant chunks, formats the prompt, and generates a response.
* The context is concatenated from the top `k` retrieved chunks (3 by default); `top_k=50` is a sampling parameter of the LLM, not the number of chunks.
* Each question is passed to the function individually in the next step.
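
Because the system message, the retrieved chunks, and the generated answer all share the model's context window (`n_ctx=2300` in the `Llama(...)` call above), it can help to check the budget before choosing `max_tokens`. The sketch below uses a crude characters-per-token estimate; the 4-characters-per-token ratio and the 450-token system message size are assumptions for illustration, not measurements.

```python
def fits_context(prompt_tokens, max_tokens, n_ctx):
    """True if the prompt plus the requested completion fits in the context window."""
    return prompt_tokens + max_tokens <= n_ctx

def rough_token_count(text, chars_per_token=4.0):
    """Very rough token estimate; chars_per_token is an assumed average."""
    return int(len(text) / chars_per_token) + 1

N_CTX = 2300  # context size passed to Llama(...) above

system_tokens = 450                               # hypothetical size of the system message
chunk_tokens = 3 * rough_token_count("x" * 512)   # three 512-character chunks
prompt_tokens = system_tokens + chunk_tokens

for max_tokens in (128, 512, 1000, 2000):
    print(max_tokens, fits_context(prompt_tokens, max_tokens, N_CTX))
```

For real prompts, the token count should come from the model's own tokenizer instead of a character-based estimate.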

## Question Answering using RAG without Fine-Tuning

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
user_input_rag1 = "What is the protocol for managing sepsis in a critical care unit?"
generate_rag_response(user_input_rag1)

- The answer is clear, concise, and focused, without any unnecessary information.  

- For queries like this, we expect a response of this nature.

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
user_input_rag2 = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
generate_rag_response(user_input_rag2)

- The response was cut short by the token limit.  
- Increasing **`max_tokens`** may give more comprehensive answers.

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
user_input_rag3 = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
generate_rag_response(user_input_rag3)

- Again, the response is short because of the token limit.

- As expected, the model has done its job well: the answer stays grounded in the retrieved context, with no obvious hallucination.

### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
user_input_rag4 = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
generate_rag_response(user_input_rag4)

- The responses were short because **`max_tokens`** is only 128; a temperature of 0 makes decoding greedy, so the model always picks the most likely next token.

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
user_input_rag5 = "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
generate_rag_response(user_input_rag5)

- We will address the token limit in the parameter-tuning section below.  

- As expected, the model has done its job well.

**Insights:**

* Each question is processed independently, leveraging the RAG pipeline.
* Responses are expected to be detailed and grounded in the Merck Manual context.
* Because several responses are truncated, the next step tunes the generation parameters to improve quality (no model weights are fine-tuned).

## Fine-tuning Parameters RAG

#### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
user_input_ragft1 = "What is the protocol for managing sepsis in a critical care unit?"
generate_rag_response(user_input_ragft1,  max_tokens=1000, temperature=0)

- With **`max_tokens`** set to 1000, the model has enough room to finish its answer; however, 1000 tokens is more than this query needs, and the extra budget may encourage hallucinations.  

- One contributing factor could be that the temperature is set to 0, which makes the output deterministic and less creative.

#### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
user_input_ragft2 = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
generate_rag_response(user_input_ragft2,  max_tokens=350, temperature=0)

- Compared with the previous case, we still got the full response after decreasing **`max_tokens`**.
- The response is brief; a temperature of 0 makes decoding deterministic, which may contribute to the terse output.

#### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
user_input_ragft3 = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
generate_rag_response(user_input_ragft3,  top_p=0.98, top_k=20, max_tokens=256, temperature=0.5)

- 256 tokens are not enough for a complete response; 512 tokens is the better choice.
- The temperature of 0.5 may be slightly high.

#### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
user_input_ragft4 = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
generate_rag_response(user_input_ragft4,  top_p=0.98, top_k=50, max_tokens=512, temperature=0.1)

- `max_tokens=512` with `temperature=0.1`, `top_p=0.98`, and `top_k=50` seems to work well.  

#### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
user_input_ragft5 = "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
generate_rag_response(user_input_ragft5,  top_p=0.98, top_k=50, max_tokens=350, temperature=0.1)

- A limit of 350 tokens is definitely too small. Again, 512 tokens seems to be the best value.

**Insights:**

* Based on these five queries, the following parameters performed best:
1. top_p = 0.98
2. top_k = 50
3. max_tokens = 512
4. temperature = 0.1
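
The manual trial-and-error above can also be organised as a small parameter sweep. The sketch below is one possible way to do this, not the procedure used in this notebook: `sweep` is a hypothetical helper, and `fake_generate` is a stand-in for `generate_rag_response` so that the example runs without a model.

```python
from itertools import product

def sweep(generate, question, grid):
    """Call generate(question, **params) for every combination in grid and collect the outputs."""
    keys = list(grid)
    results = []
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        results.append((params, generate(question, **params)))
    return results

def fake_generate(question, max_tokens=128, temperature=0.0):
    # Stand-in for generate_rag_response; returns a label instead of a model answer.
    return f"[{max_tokens} tokens @ T={temperature}] answer to: {question}"

grid = {"max_tokens": [128, 512], "temperature": [0.0, 0.1]}
results = sweep(fake_generate, "What is sepsis?", grid)
for params, answer in results:
    print(params, "->", answer)
```

Each output would still need to be judged, manually or with the evaluation metrics used elsewhere in the notebook, before a combination is chosen.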


### Generating responses with the tuned parameters

In [None]:
user_input_ragtuned1 = "What is the protocol for managing sepsis in a critical care unit?"
generate_rag_response(user_input_ragtuned1,  top_p=0.98, top_k=50, max_tokens=512, temperature=0.1)

In [None]:
user_input_ragtuned2 = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
generate_rag_response(user_input_ragtuned2, top_p=0.98, top_k=50, max_tokens=512, temperature=0.1)

In [None]:
user_input_ragtuned3 = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
generate_rag_response(user_input_ragtuned3, top_p=0.98, top_k=50, max_tokens=512, temperature=0.1)

In [None]:
user_input_ragtuned4 = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
generate_rag_response(user_input_ragtuned4, top_p=0.98, top_k=50, max_tokens=512, temperature=0.1)

In [None]:
user_input_ragtuned5 = "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
generate_rag_response(user_input_ragtuned5, top_p=0.98, top_k=50, max_tokens=512, temperature=0.1)

- The responses were generally better than those obtained with the other hyperparameter values.

###**Comparing RAG Fine-Tuned responses with Gold-Standard**

In [None]:
rag_finetuned_responses = {
    "Question-1": """ **Response:**
Brief Summary: The management of septic shock in a critical care unit involves monitoring vital signs, fluid resuscitation, and appropriate antibiotic therapy.

Key Details:
1. Monitoring: Frequent assessment of systemic pressure, CVP or PAOP, pulse oximetry, ABGs, blood glucose, lactate, electrolyte levels, renal function, urine output, and sublingual PCO2 is necessary.
2. Fluid Resuscitation: Administer 0.9% saline until CVP reaches 8 mm Hg (10 cm H2O) or PAOP is high (60 to 65%).
3. Antibiotic Therapy: Use a regimen of gentamicin or tobramycin and a third-generation cephalosporin, such as cefotaxime or ceftriaxone. Alternatively, use ceftazidime plus a fluoroquinolone like ciprofloxacin. Monotherapy with maximal therapeutic doses of ceftazidime or imipenem is not recommended.

Critical Considerations:
- Keep the patient warm and control hemorrhage if necessary.
- Nothing should be given by mouth, and the head should be turned to one side to avoid aspiration if emesis occurs.
- Prognosis depends on the cause, preexisting or complicating illness, time between onset and diagnosis, and promptness and adequacy of therapy.""",

    "Question-2": """**Response:**
Symptoms of appendicitis include:
- Sudden onset of pain in the lower right abdomen.
- Loss of appetite.
- Nausea and vomiting.
- Fever (often low-grade at first).
- Abdominal swelling.
- Pain worsens with movement or pressure on the abdomen.

Appendicitis cannot be cured via medicine alone; surgical removal of the appendix (appendectomy) is required when diagnosed. This procedure can be performed either open or laparoscopically, depending on the severity and location of the inflammation. Antibiotics are administered intravenously before surgery to prevent infection spread. If the appendix has perforated, antibiotics should be continued until the patient's temperature and white blood cell count have normalized.

*Source: Acute Appendicitis, Merck Manual Professional Edition.*""",

    "Question-3": """**Response:**
Sudden patchy hair loss, also known as alopecia areata, is characterized by round, smooth patches of baldness on the scalp or other hair-bearing areas. The exact cause of this condition is unknown but is believed to be an autoimmune disorder that affects genetically susceptible individuals exposed to unclear environmental triggers.

Effective treatments for alopecia areata include:
1. Topical corticosteroids: Prescription-strength creams or ointments applied directly to the affected area can help reduce inflammation and promote hair regrowth.
2. Minoxidil (Rogaine): A topical medication that stimulates hair growth, available over the counter.
3. Anthralin: A prescription topical medication that slows down the immune system's attack on hair follicles.
4. Immunotherapy: Injections of diphencyprone or squaric acid dibutylester can help stimulate an immune response and promote hair regrowth.
5. Psoralen plus ultraviolet A (PUVA): A light therapy treatment that involves taking a medication called psoralen before exposure to UV-A light, which helps slow down the immune system's attack on hair follicles.
6. Salicylic acid suspension: A 10% salicylic acid solution in mineral oil can be applied to the scalp at bedtime and washed off the next morning with a tar shampoo. This treatment may help enhance penetration and promote hair regrowth.

It is important to note that treatments for alopecia areata may not work for everyone, and their effectiveness can vary depending on the severity and location of the bald spots. In severe cases, systemic corticosteroids or other immunosuppressive medications may be prescribed by a healthcare professional.

Hair loss due to other causes should be treated accordingly based on the underlying disorder. If you are concerned about sudden patchy hair loss, it is recommended that you consult with a healthcare professional for an accurate diagnosis and appropriate treatment options.

*Source: Alopecia Areata, Merck Manual Professional Edition.*""",

    "Question-4": """**Response:**
Brief Summary:
For individuals with brain injuries causing temporary or permanent impairments, the primary focus is on optimizing ventilation, oxygenation, and brain perfusion. Depending on the severity of the injury, treatments may include observation for mild cases or urgent surgical intervention for intracranial hematomas to prevent further damage from brain shift, compression, and herniation.

Key Details:
- Mild injuries: Discharge and observation.
- Moderate to severe injuries: Optimization of ventilation, oxygenation, and brain perfusion.

Critical Considerations:
- Intracranial hematomas may require urgent surgical evacuation.
- Small intracerebral hematomas rarely require surgery.
- Early neurosurgical consultation is mandatory for intracranial hematomas.
- Not all hematomas require surgical removal.
- Brain function impairments can lead to disability in social relations and employment due to the fixed size of the cranial vault.
- Posttraumatic anosmia and acute traumatic blindness seldom resolve after 3 to 4 months.
- Hemiparesis and aphasia usually resolve at least partially, except in the elderly.

*Source: Traumatic Brain Injury, Merck Manual Professional Edition.*""",

    "Question-5": """**Response:**
Brief Summary: For a person with a suspected fractured leg during a hiking trip, the following steps should be taken:
1. Apply ice and compression using an elastic bandage to reduce swelling and pain.
2. Elevate the injured leg above heart level for gravity-assisted fluid drainage.
3. Prescribe analgesics and NSAIDs as necessary for pain relief.
4. Consider using crutches for initial mobility support.
5. For suspected open fractures, provide sterile wound dressings, tetanus prophylaxis, and broad-spectrum antibiotics.

Key Details:
1. Ice and compression: Apply ice wrapped in a plastic bag to the injured area and use an elastic bandage for compression. Do not wrap too tightly to avoid causing swelling in the distal extremity.
2. Elevation: Keep the injured leg elevated above heart level to facilitate fluid drainage, which reduces swelling and pain.
3. Pain relief: Prescribe analgesics and non-steroidal anti-inflammatory drugs (NSAIDs) as needed for pain management.
4. Mobility support: Use crutches initially if walking is painful.
5. Open fractures: For suspected open fractures, provide sterile wound dressings, tetanus prophylaxis, and broad-spectrum antibiotics.

Critical Considerations:
1. Immediate medical attention: If there are signs of an unstable fracture or long bone fracture, seek immediate medical care for proper assessment and treatment.
2. Follow-up care: Ensure the person receives appropriate follow-up care to monitor their progress and address any complications.
3. Proper immobilization: Ensure the injured leg is properly immobilized during transportation and until a healthcare professional can assess it.
4. Avoid weight bearing: Encourage the person to avoid putting weight on the injured leg until a healthcare professional advises otherwise.
5. Rehabilitation: Provide resources for rehabilitation exercises and physical therapy to help restore strength and mobility once the fracture has healed."""
}

In [None]:
# Evaluate Fine-Tuned RAG Responses
rag_results = {}
for q_key in rag_finetuned_responses:
    # Use the mapping to get the full question text
    full_question = question_map[q_key]
    rag_results[full_question] = evaluate_response(gold_answers[full_question], rag_finetuned_responses[q_key])


# Print RAG Fine-Tuned Comparison
print("\n=== RAG Fine-Tuned vs Gold Standard ===")
print(tabulate(
    [[q] + list(rag_results[q].values()) for q in rag_results],
    headers=["Question", "Semantic Sim", "BLEU", "Rouge-1", "Rouge-L", "Readability"],
    tablefmt="grid"
))

- RAG Fine-Tuned performed better than the other methods.
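
To make the overlap-based columns in the table more concrete, the following is a minimal sketch of how a unigram-overlap score in the spirit of Rouge-1 can be computed by hand. `rouge1_f1` is a hypothetical helper, not the notebook's `evaluate_response`, and it assumes plain lowercase whitespace tokenization with no stemming, so its values will differ from a library implementation. The two sentences are made-up examples.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference answer and a candidate answer."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a word counts at most as often as it occurs in both texts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

gold = "appendicitis is treated with surgical removal of the appendix"
answer = "surgical removal of the appendix is required"
print(round(rouge1_f1(gold, answer), 3))  # 6 shared words: P = 6/7, R = 6/9, F1 = 0.75
```

Because the score only counts shared words, a correct answer phrased differently from the gold standard can still score low, which is why the table also reports semantic similarity.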

### **Comparing Raw LLM vs LLM with Prompt Engineering vs RAG Fine-Tuned**

In [None]:
# 1. Generate Comparison Table
headers = ["Method", "Semantic Sim", "BLEU", "Rouge-1", "Rouge-L", "Readability"]

# Calculate average scores for each method
avg_raw = {
    k: np.mean([raw_results[q][k] for q in raw_results])
    for k in headers[1:]  # Skip "Method" column
}

avg_prompt = {
    k: np.mean([prompt_results[q][k] for q in prompt_results])
    for k in headers[1:]
}

avg_rag = {
    k: np.mean([rag_results[q][k] for q in rag_results])
    for k in headers[1:]
}

# Create comparison table
comparison_table = [
    ["Raw LLM"] + list(avg_raw.values()),
    ["Prompt-Engineered"] + list(avg_prompt.values()),
    ["RAG Fine-Tuned"] + list(avg_rag.values())
]

print("=== Method Comparison (Average Scores) ===")
print(tabulate(comparison_table, headers=headers, tablefmt="grid", floatfmt=".3f"))

# 2. Determine the Best Performing Method
metrics = headers[1:]  # All metrics except "Method"

# Initialize a dictionary to store the best method for each metric
best_methods = {}

for metric in metrics:
    # Get scores for all methods for this metric
    scores = {
        "Raw LLM": avg_raw[metric],
        "Prompt-Engineered": avg_prompt[metric],
        "RAG Fine-Tuned": avg_rag[metric]
    }

    # Find the method with the highest score for this metric
    best_method = max(scores.items(), key=lambda x: x[1])[0]
    best_methods[metric] = best_method

In [None]:
import matplotlib.pyplot as plt
# Extract method names
methods = [row[0] for row in comparison_table]

# Separate metrics into two groups based on their scale
# Group 1: Smaller scale metrics
small_metric_names = headers[1:5] # Semantic Sim, BLEU, Rouge-1, Rouge-L
raw_llm_small_scores = comparison_table[0][1:5]
prompt_engineered_small_scores = comparison_table[1][1:5]
rag_fine_tuned_small_scores = comparison_table[2][1:5]

# Group 2: Large scale metric (Readability)
large_metric_name = headers[5] # Readability
raw_llm_large_score = comparison_table[0][5]
prompt_engineered_large_score = comparison_table[1][5]
rag_fine_tuned_large_score = comparison_table[2][5]


# Function to add labels on top of bars
def autolabel(rects, ax_obj):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax_obj.annotate(f'{height:.3f}',
                        xy=(rect.get_x() + rect.get_width() / 2, height),
                        xytext=(0, 3),  # 3 points vertical offset
                        textcoords="offset points",
                        ha='center', va='bottom', fontsize=8, color='black')

# --- Plot 1: Smaller Scale Metrics ---
fig1, ax1 = plt.subplots(figsize=(10, 6)) # Adjust figure size for better readability
x1 = np.arange(len(small_metric_names))
width = 0.25 # Width of the bars

rects1_1 = ax1.bar(x1 - width, raw_llm_small_scores, width, label='Raw LLM', color='#1f77b4', edgecolor='black', linewidth=0.7)
rects1_2 = ax1.bar(x1, prompt_engineered_small_scores, width, label='Prompt-Engineered', color='#ff7f0e', edgecolor='black', linewidth=0.7)
rects1_3 = ax1.bar(x1 + width, rag_fine_tuned_small_scores, width, label='RAG Fine-Tuned', color='#2ca02c', edgecolor='black', linewidth=0.7)

autolabel(rects1_1, ax1)
autolabel(rects1_2, ax1)
autolabel(rects1_3, ax1)

ax1.set_xlabel('Evaluation Metrics (Smaller Scale)', fontsize=12, labelpad=10)
ax1.set_ylabel('Average Score', fontsize=12, labelpad=10)
ax1.set_title('Method Comparison: Semantic Similarity, BLEU, Rouge-1, Rouge-L', fontsize=16, pad=20)
ax1.set_xticks(x1)
ax1.set_xticklabels(small_metric_names, rotation=0, ha='center', fontsize=10)
ax1.set_ylim(bottom=0, top=1.0) # Set a more appropriate y-limit for these metrics
ax1.grid(axis='y', linestyle='--', alpha=0.7)
ax1.legend(fontsize=10, frameon=True, shadow=True, fancybox=True, loc='upper left', bbox_to_anchor=(1, 1))
plt.tight_layout(rect=[0, 0, 0.88, 1]) # Adjust rect to make space for legend outside

# --- Plot 2: Readability Metric ---
fig2, ax2 = plt.subplots(figsize=(7, 6)) # Adjust figure size
x2 = np.arange(1) # Only one bar group for Readability
width = 0.25

# Create list of scores for readability for easier plotting with autolabel
readability_scores = [raw_llm_large_score, prompt_engineered_large_score, rag_fine_tuned_large_score]

rects2_1 = ax2.bar(x2 - width, readability_scores[0], width, label='Raw LLM', color='#1f77b4', edgecolor='black', linewidth=0.7)
rects2_2 = ax2.bar(x2, readability_scores[1], width, label='Prompt-Engineered', color='#ff7f0e', edgecolor='black', linewidth=0.7)
rects2_3 = ax2.bar(x2 + width, readability_scores[2], width, label='RAG Fine-Tuned', color='#2ca02c', edgecolor='black', linewidth=0.7)

autolabel(rects2_1, ax2)
autolabel(rects2_2, ax2)
autolabel(rects2_3, ax2)

ax2.set_xlabel('Evaluation Metric', fontsize=12, labelpad=10)
ax2.set_ylabel('Average Score', fontsize=12, labelpad=10)
ax2.set_title('Method Comparison: Readability Score', fontsize=16, pad=20)
ax2.set_xticks(x2)
ax2.set_xticklabels([large_metric_name], rotation=0, ha='center', fontsize=10)
ax2.set_ylim(bottom=0) # Start y-axis from 0, or adjust based on minimum value for better focus
ax2.grid(axis='y', linestyle='--', alpha=0.7)
ax2.legend(fontsize=10, frameon=True, shadow=True, fancybox=True, loc='upper left', bbox_to_anchor=(1, 1))
plt.tight_layout(rect=[0, 0, 0.75, 1]) # Adjust rect to make space for legend outside

plt.show()

### **Comparison Result**

In [None]:
# 3. Print Best Methods Summary
print("\n=== Best Performing Method by Metric ===")
best_methods_table = [[metric, method] for metric, method in best_methods.items()]
print(tabulate(best_methods_table, headers=["Metric", "Best Method"], tablefmt="grid"))

# 4. Overall Winner (Method with most 'wins' across metrics)
from collections import Counter
overall_winner = Counter(best_methods.values()).most_common(1)[0][0]
print(f"\nOverall Best Method: {overall_winner} (wins {Counter(best_methods.values())[overall_winner]} out of {len(metrics)} metrics)")

- The metric comparison also suggests that RAG performed better in every metric category except Semantic Similarity, where its score was still reasonable.
- Overall, RAG Fine-Tuned has clearly surpassed the other methods.
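
One caveat about picking an overall winner by counting per-metric wins: `Counter.most_common(1)` returns only one entry, and when two methods are tied it returns whichever was encountered first. The sketch below shows a tie-aware alternative. `overall_winners` is a hypothetical helper, and the example dictionary is invented for illustration; it is not the notebook's result.

```python
from collections import Counter

def overall_winners(best_by_metric: dict) -> list:
    """Return every method that shares the highest number of per-metric wins."""
    wins = Counter(best_by_metric.values())
    if not wins:
        return []
    top = max(wins.values())
    return sorted(method for method, count in wins.items() if count == top)

# Hypothetical per-metric winners, for illustration only
example = {
    "Semantic Sim": "Prompt-Engineered",
    "BLEU": "RAG Fine-Tuned",
    "Rouge-1": "RAG Fine-Tuned",
    "Rouge-L": "Prompt-Engineered",
}
print(overall_winners(example))  # both methods win two metrics each
```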

## Output Evaluation using LLM-As-A-Judge Method

Let us now use the LLM-as-a-judge method to check the quality of the RAG system on two parameters - retrieval and generation.

- We are using the same Mistral model for evaluation, so here the LLM is effectively rating its own performance on the task.

In [None]:
groundedness_rater_system_message  = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
The answer should be derived only from the information presented in the context

Instructions:
1. First write down the steps that are needed to evaluate the answer as per the metric.
2. Give a step-by-step explanation if the answer adheres to the metric considering the question and context as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the answer using the evaluation criteria and assign a score.
"""

In [None]:
relevance_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
Relevance measures how well the answer addresses the main aspects of the question, based on the context.
Consider whether all and only the important aspects are contained in the answer when evaluating relevance.

Instructions:
1. First write down the steps that are needed to evaluate the context as per the metric.
2. Give a step-by-step explanation if the context adheres to the metric considering the question as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the context using the evaluation criteria and assign a score.
"""

In [None]:
user_message_template = """
###Question
{question}

###Context
{context}

###Answer
{answer}
"""

### Defining the Evaluation Function

In [None]:
def generate_ground_relevance_response(user_input, k=3, max_tokens=128, temperature=0, top_p=0.95, top_k=50):
    """Answer the query with RAG, then ask the LLM to rate that answer for groundedness and relevance."""
    global qna_system_message, qna_user_message_template
    # Retrieve the k most relevant document chunks
    relevant_document_chunks = retriever.get_relevant_documents(query=user_input, k=k)
    context_list = [d.page_content for d in relevant_document_chunks]
    context_for_query = ". ".join(context_list)

    # Combine user_prompt and system_message to create the prompt
    prompt = f"""[INST]{qna_system_message}\n
                {'user'}: {qna_user_message_template.format(context=context_for_query, question=user_input)}
                [/INST]"""

    response = llm(
            prompt=prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            echo=False
            )

    answer =  response["choices"][0]["text"]

    # Build the groundedness-rater prompt from the question, retrieved context, and generated answer
    groundedness_prompt = f"""[INST]{groundedness_rater_system_message}\n
                {'user'}: {user_message_template.format(context=context_for_query, question=user_input, answer=answer)}
                [/INST]"""

    # Build the relevance-rater prompt from the same question, context, and answer
    relevance_prompt = f"""[INST]{relevance_rater_system_message}\n
                {'user'}: {user_message_template.format(context=context_for_query, question=user_input, answer=answer)}
                [/INST]"""

    response_1 = llm(
            prompt=groundedness_prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            echo=False
            )

    response_2 = llm(
            prompt=relevance_prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            echo=False
            )

    return response_1['choices'][0]['text'],response_2['choices'][0]['text']
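
The function above returns the raters' free-text explanations, so the numeric ratings have to be read off by eye. As a minimal sketch, a small regular expression can pull out the final 1 to 5 rating so scores can be tabulated across queries. `extract_score` is a hypothetical helper, and it assumes the rater mentions the word "score" or "rating" shortly before the number; the model's actual wording may differ, in which case the function returns `None`.

```python
import re
from typing import Optional

def extract_score(rater_output: str) -> Optional[int]:
    """Return the last 1-5 rating that follows 'score' or 'rating', or None."""
    matches = re.findall(
        r"(?:score|rating)\D{0,20}?([1-5])\b",
        rater_output,
        flags=re.IGNORECASE,
    )
    return int(matches[-1]) if matches else None

# Made-up rater outputs, for illustration only
print(extract_score("The answer is mostly derived from the context. Score: 4"))
print(extract_score("Overall rating of 3/5"))
print(extract_score("The answer follows the context."))
```

Taking the last match is a deliberate choice: the raters are asked to reason step by step first, so the final number is the most likely verdict.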

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
user_input_relevance1 = "What is the protocol for managing sepsis in a critical care unit?"
ground,rel = generate_ground_relevance_response(user_input_relevance1, max_tokens=512)

print(ground,end="\n\n")
print(rel)

- We got an overall score of 4, i.e., the metric is mostly followed.
- This means that both the retrieval and augmentation parts are good, but the hyperparameters may need further tuning.

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
user_input_relevance2 = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
ground,rel = generate_ground_relevance_response(user_input_relevance2, max_tokens=512)

print(ground,end="\n\n")
print(rel)

- We got an overall score of 3 (the metric is followed to a good extent), which is not great.
- This means that both the retrieval and augmentation parts need further tuning to find better parameters.

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
user_input_relevance3 = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
ground,rel = generate_ground_relevance_response(user_input_relevance3,max_tokens=512)

print(ground,end="\n\n")
print(rel)

- It got a perfect score because the response is both grounded in the context and relevant to the query.  
- This means that both the retrieval and augmentation parts are good.

### Query 4: What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
user_input_relevance4 = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
ground,rel = generate_ground_relevance_response(user_input_relevance4, max_tokens=512)

print(ground,end="\n\n")
print(rel)

- It again got a perfect score because the response is both grounded in the context and relevant to the query.  
- This means that both the retrieval and augmentation parts are good.

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
user_input_relevance5 = "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
ground,rel = generate_ground_relevance_response(user_input_relevance5, max_tokens=512)

print(ground,end="\n\n")
print(rel)

- It got an overall score of 4 (the metric is mostly followed); the answer did not perfectly match the context and query, so a bit more tuning may be required.
- This means that both the retrieval and augmentation parts delivered average performance.

**Insights**:
- **Groundedness** ensures responses align with the Merck Manual, critical for medical accuracy.
- **Relevance** confirms the RAG system addresses the question directly.
- Scores of 3-4 for groundedness on some queries highlight areas for improvement, such as missing context or minor hallucinations.
- Fine-tuned responses with smaller chunks may score higher for groundedness due to precise retrieval.
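
Because the judge and the generator are the same model, a cheap independent signal can help spot the "minor hallucinations" mentioned above. The sketch below is one such heuristic, not a replacement for the LLM rater: it measures what fraction of the distinct longer words in an answer also appear in the retrieved context. `token_support` is a hypothetical helper, the 4-letter cutoff is an arbitrary assumption, and the example strings are invented.

```python
import re

def token_support(answer: str, context: str) -> float:
    """Fraction of distinct answer words (4+ letters) that also occur in the context."""
    def content_words(text: str) -> set:
        return set(re.findall(r"[a-z]{4,}", text.lower()))

    answer_words = content_words(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & content_words(context)) / len(answer_words)

# Made-up context and answer; the last clause of the answer is deliberately unsupported
context = "Appendicitis is treated with surgical removal of the appendix."
answer = "Treatment is surgical removal of the appendix, followed by chemotherapy."
print(token_support(answer, context))  # 3 of 6 content words are supported -> 0.5
```

A low value does not prove a hallucination (paraphrases also lower it), but it flags answers worth a closer manual look.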

## Actionable Insights and Business Recommendations

**Observations:**
* RAG enhances diagnostic accuracy by grounding responses in real medical text.

* Prompt engineering significantly improves clarity and reduces hallucinations.

* Increasing k in retrieval gives more context but may introduce irrelevant data; the ideal k was found to be between 20 and 50.
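
The trade-off in the last observation can be seen in a minimal sketch of top-k retrieval by cosine similarity. Everything here is an assumption for illustration: the 2-D "embeddings", the chunk names, and the helpers `cosine` and `top_k` are invented, and the notebook's actual retriever and vector store are not reproduced.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k):
    """Return (chunk_id, similarity) for the k chunks most similar to the query."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

# Toy 2-D embeddings, invented for illustration
chunks = {
    "sepsis_fluids": [1.0, 0.1],
    "sepsis_antibiotics": [0.9, 0.3],
    "hair_loss": [0.1, 1.0],
}
query = [1.0, 0.0]
for k in (1, 3):
    print(k, [(cid, round(sim, 2)) for cid, sim in top_k(query, chunks, k)])
```

With k=1 only the closest chunk is returned; with k=3 the unrelated `hair_loss` chunk is pulled in as well, with a much lower similarity, which is the kind of irrelevant context a larger k can introduce.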

**Recommendations:**
* Clinical Integration: Deploy RAG as a microservice accessible via internal interfaces.

* Continual Updates: Schedule ingestion of updated manuals quarterly for reliability.

* On-Device RAG: For emergency mobile/remote access, consider lightweight distilled models with quantized vector stores.

* Training Staff: Train clinical staff on AI interpretation boundaries to avoid blind trust.

* Compliance: Ensure HIPAA compliance and audit trails in AI-assisted decision-making.

**Key Takeaways**:
1. **Reduced Information Overload**: The RAG system efficiently retrieves relevant information from a 4,114-page manual, reducing the time clinicians spend searching for answers.
2. **Improved Decision-Making**: Accurate responses to critical questions (e.g., sepsis protocol) support timely interventions, potentially saving lives.
3. **Standardized Care**: By grounding responses in the Merck Manual, the system ensures consistent, evidence-based answers, reducing variability in care practices.
4. **Scalable Solution**: The modular RAG pipeline can integrate additional medical texts or be deployed in clinical settings with minimal reconfiguration.

**Business Recommendations**:
1. **Integrate with Clinical Systems**: Embed the RAG system into electronic health record platforms to provide real-time decision support during patient consultations.
2. **Expand Knowledge Base**: Incorporate additional trusted sources to enhance coverage and robustness.
3. **User Training**: Train healthcare providers on using the system via intuitive interfaces (e.g., mobile apps) to maximize adoption.
4. **Continuous Evaluation**: Implement feedback loops with clinicians to refine prompts and fine-tune parameters based on real-world usage.
5. **Regulatory Compliance**: Ensure HIPAA compliance and FDA validation for clinical deployment to maintain trust and legal adherence.

**Conclusion**:
The RAG-based AI solution effectively addresses healthcare challenges by providing a scalable, accurate, and efficient tool for accessing medical knowledge. By reducing information overload and supporting evidence-based decision-making, it has the potential to improve patient outcomes and standardize care practices, delivering significant value to healthcare organizations.

### Converting to HTML

In [None]:
!pip install nbconvert

In [None]:
!jupyter nbconvert --clear-output --inplace NLP_RAG_Medical_Assistant_AdnanNasir.ipynb

In [None]:
!jupyter nbconvert NLP_RAG_Medical_Assistant_AdnanNasir.ipynb --to html

In [None]:
from IPython.display import Javascript, display

# Clears widget states from notebook metadata.
# display() is required because only a cell's last expression is rendered automatically.
# Note: the `Jupyter.notebook` object is available only in the classic Notebook front end.
display(Javascript('''
    const cells = Jupyter.notebook.get_cells();
    for (let cell of cells) {
        if (cell.output_area?.outputs) {
            for (let output of cell.output_area.outputs) {
                if (output.data?.['application/vnd.jupyter.widget-state+json']) {
                    delete output.data['application/vnd.jupyter.widget-state+json'];
                }
            }
        }
    }
'''))

# Save the notebook after clearing
display(Javascript('Jupyter.notebook.save_checkpoint();'))

<font size=6 color='blue'>Power Ahead</font>
___