# Problem Statement

Business Context

Healthcare professionals face increasing challenges in managing vast volumes of medical data while delivering accurate and timely diagnoses. Quick access to comprehensive, reliable, and up-to-date medical knowledge is critical for improving patient outcomes and ensuring informed decision-making in a fast-paced environment.

Healthcare professionals often encounter information overload, struggling to sift through extensive research and data to create accurate diagnoses and treatment plans. This challenge is amplified by the need for efficiency, particularly in emergencies, where time-sensitive decisions are vital. Furthermore, access to trusted, current medical information from renowned manuals and research papers is essential for maintaining high standards of care.

To address these challenges, healthcare centers can focus on integrating systems that streamline access to medical knowledge, provide tools to support quick decision-making, and enhance efficiency. Leveraging centralized knowledge platforms and ensuring healthcare providers have continuous access to reliable resources can significantly improve patient care and operational effectiveness.



Objective

As an AI specialist, your task is to develop a RAG-based AI solution using renowned medical manuals to address healthcare challenges.

The objective is to understand issues like information overload and apply AI techniques to streamline decision-making.

This includes analyzing the solution's impact on diagnostics and patient outcomes, evaluating its potential to standardize care practices, and creating a functional prototype that demonstrates its feasibility and effectiveness.

Data Description

The Merck Manuals are medical references published by the American pharmaceutical company Merck & Co. that cover a wide range of medical topics, including disorders, tests, diagnoses, and drugs. The manuals have been published since 1899, when Merck & Co. was still a subsidiary of the German company Merck.

The manual is provided as a PDF with over 4,000 pages divided into 23 sections.



# Installing and Importing Necessary Libraries and Dependencies

In [None]:
# Installation for GPU llama-cpp-python
# Run the following line when the runtime has a GPU
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q


# Installation for CPU llama-cpp-python
# Uncomment and run the following line instead when no GPU is being used
#!CMAKE_ARGS="-DLLAMA_CUBLAS=off" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q

In [None]:
# For installing the libraries & downloading models from HF Hub

!pip install -q \
 langchain \
 langchain-community \
 chromadb \
 sentence-transformers \
 pymupdf \
 tiktoken


In [None]:
# Libraries for processing dataframes and text
import json
import os
import tiktoken
import pandas as pd
import numpy as np

#Libraries for Loading Data, Chunking, Embedding, and Vector Databases
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma

#Libraries for downloading and loading the llm
from huggingface_hub import hf_hub_download
from llama_cpp import Llama


# Downloading and Loading the model

In [None]:
# Downloading and Loading the model
model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
model_basename = "mistral-7b-instruct-v0.2.Q6_K.gguf"
model_path = hf_hub_download(
    repo_id=model_name_or_path,
    filename=model_basename
)
# Use this configuration when the runtime is connected to a GPU.
llm = Llama(
    model_path=model_path,
    n_ctx=2048,
    n_gpu_layers=32,
    n_batch=128
)

# Uncomment the snippet below instead if the runtime is CPU-only.
# llm = Llama(
#    model_path=model_path,
#    n_ctx=1024,
#    n_threads=os.cpu_count()  # Llama takes n_threads for CPU threads; n_cores is not one of its parameters
# )



# Question & Answering using LLM

In [None]:
#Response
def response(query,max_tokens=128,temperature=0,top_p=0.95,top_k=50):
    model_output = llm(
      prompt=query,
      max_tokens=max_tokens,
      temperature=temperature,
      top_p=top_p,
      top_k=top_k
    )

    return model_output['choices'][0]['text']



In [None]:
#Query 1: What is the protocol for managing sepsis in a critical care unit?
print(response("What is the protocol for managing sepsis in a critical care unit?"))


In [None]:
#Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?
print(response("What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"))


In [None]:
#Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?
print(response("What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"))


In [None]:
#Query 4: What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?
print(response("What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"))


In [None]:
#Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?
print(response("What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"))

Observations for raw LLM answers:

The model generated clinically grounded answers, but the responses were incomplete: each was cut off partway through because it reached the token limit (max_tokens=128). With the temperature at 0 the answers are focused; at that setting decoding is effectively greedy, so the high top p value has little practical effect. No hallucinations were seen, although some repetition appeared in the answers. The parameters and the prompt can be adjusted to get more refined responses.

Answer 1 briefly explains the symptoms of sepsis and their management. Answer 2 details the symptoms of appendicitis. Answer 3 covers the causes of hair loss. Answer 4 describes the management of a patient with a brain injury. Answer 5 covers assessment and precautions for a patient with a fractured leg.
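
One way to confirm that an answer stopped because of the token limit, rather than finishing naturally, is to inspect the `finish_reason` field that llama-cpp-python returns next to the generated text. The helper below is a minimal sketch: `was_truncated` is a hypothetical name, and the sample dictionary is hand-written to mimic the shape of a completion result, not real model output.

```python
def was_truncated(model_output: dict) -> bool:
    """Return True when generation stopped because max_tokens was reached."""
    choice = model_output["choices"][0]
    # "length" means the token limit ended generation; "stop" means the model
    # finished on its own or hit a stop string.
    return choice.get("finish_reason") == "length"


# Hand-written stand-in for a completion result (not real model output).
sample_output = {
    "choices": [{"text": "Sepsis management begins with", "finish_reason": "length"}]
}
print(was_truncated(sample_output))
```

If the check returns True for a query, raising `max_tokens` in the `response` helper is the first thing to try.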

# Question Answering using LLM with Prompt Engineering

In this section we use 5 different prompting techniques together with different LLM parameters. Temperature controls the randomness of the response: higher values give more creative or random output, and lower values give more focused output.
Top p limits sampling to the smallest set of top tokens whose cumulative probability adds up to p: higher values give more diverse responses, and lower values give more focused ones. The repeat penalty penalizes tokens that have already appeared: values above 1 reduce repetition, a value of 1 applies no penalty, and values below 1 make repeated words more likely. Max tokens caps the length of the answer: higher values allow a more detailed response, and lower values force a more concise one.
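
To make these parameters concrete, the sketch below applies temperature scaling, top p (nucleus) filtering, and a repeat penalty to a toy next-token distribution. It is a simplified illustration in plain Python, not the sampler that llama-cpp-python runs internally; the function names and the four-token vocabulary with its logits are made up for the example. A temperature of exactly 0 is not handled here, since samplers treat it as "always pick the most likely token" rather than dividing by zero.

```python
import math


def apply_repeat_penalty(logits, previous_tokens, penalty):
    """Make already-generated tokens less likely when penalty > 1."""
    adjusted = dict(logits)
    for token in set(previous_tokens):
        if token in adjusted:
            value = adjusted[token]
            # Positive logits are divided and negative ones multiplied, so the
            # token's score always drops when penalty > 1.
            adjusted[token] = value / penalty if value > 0 else value * penalty
    return adjusted


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens them."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: prob / total for tok, prob in kept.items()}


# Made-up logits for four candidate next tokens.
toy_logits = {"fluids": 2.0, "antibiotics": 1.5, "oxygen": 0.5, "banana": -1.0}

sharp = softmax_with_temperature(toy_logits, temperature=0.3)
flat = softmax_with_temperature(toy_logits, temperature=1.5)
nucleus = top_p_filter(flat, top_p=0.8)
penalized = apply_repeat_penalty(toy_logits, previous_tokens=["fluids"], penalty=1.2)

print("temperature 0.3:", {t: round(p, 3) for t, p in sharp.items()})
print("temperature 1.5:", {t: round(p, 3) for t, p in flat.items()})
print("top p 0.8 keeps:", sorted(nucleus))
print("penalized logit for 'fluids':", round(penalized["fluids"], 3))
```

In this toy case the low temperature concentrates most of the probability on the top token, the top p filter drops the unlikely "banana" token altogether, and the repeat penalty lowers the score of the token that was already generated.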

Query 1:

What is the protocol for managing sepsis in a critical care unit?


In [None]:
prompt = """Example 1:
Q: How is diabetic ketoacidosis managed in a hospital setting?
A: Management includes fluid resuscitation with isotonic saline, insulin therapy to reduce blood glucose, and close monitoring of electrolytes, especially potassium. Regular blood gas analysis and continuous cardiac monitoring are essential.

Example 2:
Q: What is the treatment protocol for acute myocardial infarction in an emergency unit?
A: The protocol includes administration of oxygen, aspirin, nitroglycerin, morphine (as needed), and beta-blockers. Rapid ECG, troponin testing, and preparation for reperfusion therapy such as PCI or thrombolytics are critical.

Example 3:
Q: What is the protocol for managing sepsis in a critical care unit?
A: The management of sepsis includes """

# Generate response
output = llm(
    prompt,
    max_tokens=300,
    temperature=0,
    top_p=0,
    repeat_penalty=1,
    stop=['Example 4:']
)

# Print result
print(output['choices'][0]['text'].strip())


This cell uses few-shot prompting: a few worked examples guide the format and content of the model's output. Temperature and top p are both set to 0 so that the answer is focused and concise. The repeat penalty is 1, which applies no penalty, and max tokens is raised to 300 to allow a longer answer. Top p, temperature, and max tokens can be adjusted to get more diverse and detailed answers.
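
The few-shot layout used above can also be assembled programmatically, which makes it easier to reuse the same examples for other questions. The helper below is a small sketch; `build_few_shot_prompt` is a hypothetical name, and the example answers are shortened versions of the ones in the prompt above.

```python
def build_few_shot_prompt(examples, question, answer_stub=""):
    """Format worked Q/A examples followed by the target question."""
    blocks = []
    for index, (example_q, example_a) in enumerate(examples, start=1):
        blocks.append(f"Example {index}:\nQ: {example_q}\nA: {example_a}")
    # The last block leaves the answer open so the model completes it.
    blocks.append(f"Example {len(examples) + 1}:\nQ: {question}\nA: {answer_stub}")
    return "\n\n".join(blocks)


examples = [
    ("How is diabetic ketoacidosis managed in a hospital setting?",
     "Management includes fluid resuscitation, insulin therapy, and electrolyte monitoring."),
    ("What is the treatment protocol for acute myocardial infarction in an emergency unit?",
     "The protocol includes oxygen, aspirin, nitroglycerin, and preparation for reperfusion therapy."),
]

few_shot_prompt = build_few_shot_prompt(
    examples,
    question="What is the protocol for managing sepsis in a critical care unit?",
    answer_stub="The management of sepsis includes ",
)
print(few_shot_prompt)
```

The resulting string can be passed to `llm(...)` with the same `stop=['Example 4:']` argument, so the model does not go on to invent a further example.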

Query 2:

What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
prompt= """Let’s analyze this step by step:
Step 1: Identify the clinical symptoms of appendicitis.
Step 2: Explain the progression and risks of untreated appendicitis.
Step 3: Determine when medical treatment may be used.
Step 4: Describe when surgery is necessary and what procedure is standard.
Step 5: Compare types of surgery and recovery outcomes."""


# Generate response
output = llm(
    prompt,
    max_tokens=600,
    temperature=0.3,
    top_p=0.5,
    repeat_penalty=0.8,

)

# Print result
print(output['choices'][0]['text'].strip())


Here we use chain-of-thought prompting, which encourages the model to reason step by step. Temperature (0.3) and top p (0.5) are raised slightly to allow some variation while keeping the response consistent and contextually appropriate. The repeat penalty (0.8) is below 1, which favors repeated tokens, so a lot of repeated phrasing can be seen.

Query 3:

What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
prompt= """You are a dermatologist providing a structured answer to a medical student.

Please write a clinically accurate and structured response to:

“What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?”

Organize your answer as follows:
1. Clinical Description & diagnosis
2. Differential diagnosis
3. Treatment Plan (based on cause)

Keep the language clear and educational for students."""


# Generate response
output = llm(
    prompt,
    temperature=0.7,
    top_p=0.9,
    repeat_penalty=1,
    max_tokens=600

)

# Print result
print(output['choices'][0]['text'].strip())

Meta prompting focuses on the structural and syntactical aspects of a problem, prioritizing the general format and pattern over specific content details. Temperature (0.7) and top p (0.9) are raised to get a more diverse response. The repeat penalty is 1, which applies no penalty; even so, no reiteration of phrases was seen. This combination has to be checked for the possibility of hallucination.

Query 4:

What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
prompt= """You are a neuro-rehabilitation therapist assessing a patient with a brain injury.
Think: What types of brain injuries can lead to temporary or permanent impairment?
Act: Describe how to assess the patient using imaging and neurological scoring.
Think: What interventions are needed for acute stabilization and prevention of secondary brain injury?
Act: Outline specific treatments such as surgical decompression, ICP monitoring, or pharmacological neuroprotection.
Think: What should be considered for rehabilitation and recovery?
Act: Detail cognitive, physical, and psychosocial therapy approaches.
Conclude by summarizing a multidisciplinary treatment plan."""

# Generate response
output = llm(
    prompt,
    max_tokens=600,
    temperature=0,
    top_p=0,
    repeat_penalty=1.2
)

# Print result
print(output['choices'][0]['text'].strip())


In ReAct prompting, the prompt guides the model to reason through a problem first and then decide which actions are necessary to reach the best solution. Top p and temperature are 0 to give a steady and meaningful response. The repeat penalty (1.2) is above 1, so repeated phrases are discouraged and none are seen.

Query 5:

What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?


In [None]:
prompt= """Using your general knowledge of orthopedics, emergency response, and rehabilitation medicine, explain what a person should do if they fracture their leg during a hiking trip.
Include considerations for:
On-site first aid
Transport and evacuation
Hospital-based treatment options, including management of pain and inflammation
Rehabilitation and recovery milestones
Provide a clear, logical explanation grounded in clinical and survival principles."""

# Generate response
output = llm(
    prompt,
    max_tokens=600,
    temperature=0.9,
    top_p=0.95,
    repeat_penalty=1
)

# Print result
print(output['choices'][0]['text'].strip())


This cell uses generated knowledge prompting, in which the model draws on its general factual knowledge before answering the question. Temperature (0.9) and top p (0.95) are raised to get a more creative response. The repeat penalty is 1, which applies no penalty; even so, no reiteration of phrases was seen. This combination carries a high risk of hallucination, especially with medical information, and should be avoided.



Conclusion:

If temperature, top p, and repeat penalty are all set to 0, no response is produced. Combinations with a very high temperature can deviate from the topic and increase the risk of hallucination. A high top p value allows more varied output. Setting the repeat penalty above 1 reduces the repetition of phrases in the response.

# Question and answering with RAG

# Data Preparation for RAG

In [None]:
#Mount on google drive
from google.colab import drive
drive.mount('/content/drive')


In [None]:
filepath = '/content/drive/MyDrive/Colab Notebooks/data/medical_diagnosis_manual.pdf'

In [None]:
#Loading the data
from langchain_community.document_loaders import PyMuPDFLoader

pdf_loader = PyMuPDFLoader(filepath)

medical_diagnosis_manual = pdf_loader.load()



# Data Overview

In [None]:
#Checking the first 5 pages
for i in range(5):
    print(f"Page Number : {i+1}",end="\n")
    print(medical_diagnosis_manual[i].page_content,end="\n")

In [None]:
#checking the number of pages
len(medical_diagnosis_manual)

There are 4114 pages.

# Data Chunking

In [None]:
#data chunking
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base',
    chunk_size=256,
    chunk_overlap= 50
)

In [None]:
# Split the already loaded pages into chunks and check how many were produced
document_chunks = text_splitter.split_documents(medical_diagnosis_manual)
len(document_chunks)

In [None]:
#Checking the chunking
document_chunks[0].page_content

In [None]:
document_chunks[-2].page_content

In [None]:
document_chunks[-1].page_content
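
To show what `chunk_size=256` and `chunk_overlap=50` mean, the sketch below cuts a toy token sequence into fixed windows that share 50 tokens with their neighbor. This is a deliberately simplified stand-in: `RecursiveCharacterTextSplitter` first splits on separators such as paragraph and line breaks and only measures length in tokens, so its real chunk boundaries will differ. `sliding_window_chunks` is a hypothetical helper, and the integers stand in for token ids.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Cut a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the sequence.
        if start + chunk_size >= len(tokens):
            break
    return chunks


toy_tokens = list(range(600))  # stand-in for 600 token ids
chunks = sliding_window_chunks(toy_tokens, chunk_size=256, chunk_overlap=50)

print("number of chunks:", len(chunks))
print("chunk lengths:", [len(chunk) for chunk in chunks])
print("shared tokens between chunk 1 and 2:", chunks[0][-50:] == chunks[1][:50])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some tokens twice.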

# Embedding Model

In [None]:
#install sentence transformer
!pip install sentence-transformers

In [None]:
embedding_model = SentenceTransformerEmbeddings(model_name='thenlper/gte-large')

In [None]:
#Checking embedding model
embedding_1 = embedding_model.embed_query(document_chunks[0].page_content)
embedding_2 = embedding_model.embed_query(document_chunks[1].page_content)
print("Dimension of the embedding vector ",len(embedding_1))
len(embedding_1)==len(embedding_2)

# Vector database

In [None]:
# Vector Database
import os
out_dir = '/content/drive/My Drive/medical_db'

# Create the persistence directory if it does not already exist
os.makedirs(out_dir, exist_ok=True)

In [None]:
vectorstore = Chroma.from_documents(
    document_chunks,
    embedding_model,
    persist_directory=out_dir)

In [None]:
vectorstore.persist()

In [None]:
# After a runtime restart, run this cell to reload the persisted vector store instead of rebuilding it
vectorstore = Chroma(persist_directory=out_dir,embedding_function=embedding_model)

In [None]:
vectorstore.embeddings

In [None]:
#Checking vector database
vectorstore.similarity_search("Hair loss",k=3)

# Retriever

In [None]:
# Defining the retriever
retriever = vectorstore.as_retriever(search_type='similarity', search_kwargs={'k': 3})


With k set to 3, the retriever returns the three chunks most similar to the query.
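
The idea behind top-k similarity retrieval can be sketched in a few lines of plain Python. The three-dimensional vectors and chunk names below are made up for illustration; real embeddings from the embedding model are much longer, and the distance metric Chroma actually uses depends on how the collection is configured. Cosine similarity is used here only as a common example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, doc_vecs, k):
    """Return the names of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Made-up toy embeddings for four chunks.
toy_store = {
    "alopecia areata": [0.9, 0.1, 0.0],
    "tinea capitis": [0.8, 0.3, 0.1],
    "telogen effluvium": [0.7, 0.2, 0.3],
    "leg fracture": [0.0, 0.2, 0.9],
}
query_embedding = [1.0, 0.1, 0.0]  # stand-in for the embedding of "Hair loss"

print(top_k_similar(query_embedding, toy_store, k=3))
```

With k=3 the unrelated "leg fracture" chunk is left out; raising k to 4 would pull it into the context even though it is barely similar to the query.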

In [None]:
#checking retriever
rel_docs = retriever.get_relevant_documents("Hair loss")
rel_docs

Defining the Response Generator


In [None]:
qna_system_message = """
You are a medical information expert. Answer strictly based on the content of the provided medical manual. Do not use any external sources.
User input will have the context required by you to answer user questions.
This context will begin with the token: ###Context.
The context contains references to specific portions of a document relevant to the user query.

User questions will begin with the token: ###Question.

Just answer the question directly, as if you are the expert providing the answer yourself.
Please answer only using the context provided in the input. Do not mention anything about the context in your final answer.

If the answer is not found in the manual, respond "I don't know".
"""
qna_user_message_template = """
###Context
{context}

###Question
{question}
"""

In [None]:
# Response Function
def generate_rag_response(user_input,k=3,max_tokens=128,temperature=0,top_p=0.95,top_k=50):
    # Build a retriever with the requested k. Depending on the LangChain version,
    # a k argument passed to get_relevant_documents may be silently ignored.
    k_retriever = vectorstore.as_retriever(search_type='similarity', search_kwargs={'k': k})
    relevant_document_chunks = k_retriever.get_relevant_documents(user_input)
    context_list = [d.page_content for d in relevant_document_chunks]

    # Combine document chunks into a single context
    context_for_query = ". ".join(context_list)

    user_message = qna_user_message_template.replace('{context}', context_for_query)
    user_message = user_message.replace('{question}', user_input)

    prompt = qna_system_message + '\n' + user_message

    # Generate the response
    try:
        model_output = llm(
            prompt=prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k
        )

        # Extract the generated text from the model output
        answer = model_output['choices'][0]['text'].strip()
    except Exception as e:
        answer = f'Sorry, I encountered the following error: \n {e}'

    return answer

Question Answering using RAG

In [None]:
#Question Answering using RAG
#Query 1: What is the protocol for managing sepsis in a critical care unit?
user_input="What is the protocol for managing sepsis in a critical care unit?"
print(generate_rag_response(user_input))

In [None]:
#Query 2:What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?
user_input = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
print(generate_rag_response(user_input))

In [None]:
#Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?
user_input = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
print(generate_rag_response(user_input))


In [None]:
#Query 4: What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?
user_input = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
print(generate_rag_response(user_input))

In [None]:
#Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?
user_input = "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
print(generate_rag_response(user_input))

Observations for RAG answers:

All the answers are clinically relevant. With k=3 the retriever returns the top 3 chunks, and the resulting context is concise and relevant; k can be increased to supply more context. max_tokens=128 keeps the answers concise and can be raised to get more detailed responses. temperature=0 gives focused answers and can be increased to get more creative ones, while top p can be reduced to keep the answers focused. No hallucination or repetition of phrases is seen.
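
Raising k or max_tokens is limited by the model's context window (`n_ctx=2048` in the GPU configuration above), because the retrieved chunks, the system message, the question, and the generated answer all share it. The sketch below does the rough arithmetic. It is only an estimate: the chunks were sized with the `cl100k_base` encoding, which will not match the model's own tokenizer exactly, and the 200-token overhead for the system message and question is an assumed round number. `estimate_token_budget` is a hypothetical helper.

```python
def estimate_token_budget(n_ctx, k, chunk_tokens, prompt_overhead, max_tokens):
    """Return (fits, total) for a rough prompt-plus-completion token estimate."""
    total = k * chunk_tokens + prompt_overhead + max_tokens
    return total <= n_ctx, total


# n_ctx and chunk_tokens come from the notebook; prompt_overhead is assumed.
baseline = estimate_token_budget(n_ctx=2048, k=3, chunk_tokens=256, prompt_overhead=200, max_tokens=128)
larger = estimate_token_budget(n_ctx=2048, k=6, chunk_tokens=256, prompt_overhead=200, max_tokens=500)

print("k=3, max_tokens=128:", baseline)
print("k=6, max_tokens=500:", larger)
```

When the estimate exceeds the window, the options are to lower k, lower max_tokens, or load the model with a larger `n_ctx`.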

# Fine-tuning Parameters


In this section we tune the retriever and LLM parameters using 5 combinations.

k is the number of top chunks retrieved. A higher k retrieves more content but may add noise; a lower k gives a more focused context but may miss useful information.

search_type is the method used to retrieve results. Similarity ranks chunks by how close their embeddings are to the query embedding and returns the best-matching chunks. MMR (Maximal Marginal Relevance) balances similarity and diversity, which reduces duplication among the results.

fetch_k is the number of chunks initially fetched before MMR re-ranking. A higher fetch_k enlarges the candidate pool for better ranking; a lower fetch_k is faster but can miss better matches.

lambda_mult is the balancing factor between relevance and diversity (0 to 1). Values closer to 1 prioritize similarity, and values closer to 0 prioritize diversity.
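
The effect of lambda_mult is easiest to see on a toy example. The sketch below is a plain-Python version of greedy MMR selection over a small candidate pool (the role played by the fetch_k results). It illustrates the scoring idea only and is not the vector store's own implementation; the two-dimensional vectors and chunk names are made up, with `chunk_b` built as a near-duplicate of `chunk_a`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def unit(degrees):
    """Unit vector at the given angle, used to build toy embeddings."""
    radians = math.radians(degrees)
    return [math.cos(radians), math.sin(radians)]


def mmr_select(query_vec, candidates, k, lambda_mult):
    """Greedily pick k candidates, trading relevance against redundancy."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best_item, best_score = None, -math.inf
        for item in remaining:
            _, vec = item
            relevance = cosine(query_vec, vec)
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((cosine(vec, chosen) for _, chosen in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_item, best_score = item, score
        selected.append(best_item)
        remaining.remove(best_item)
    return [name for name, _ in selected]


query_vec = unit(0)
candidate_pool = [
    ("chunk_a", unit(20)),   # most relevant chunk
    ("chunk_b", unit(22)),   # near-duplicate of chunk_a
    ("chunk_c", unit(-40)),  # less relevant, but covers a different direction
]

print("lambda_mult=1.0:", mmr_select(query_vec, candidate_pool, k=2, lambda_mult=1.0))
print("lambda_mult=0.5:", mmr_select(query_vec, candidate_pool, k=2, lambda_mult=0.5))
```

With lambda_mult=1.0 the selection is pure similarity and returns the two near-duplicates; with lambda_mult=0.5 the second pick switches to the chunk that adds different information.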

In [None]:
# Combination 1

# System message for the LLM to strictly follow
qna_system_message = """
You are a medical information expert. Answer strictly based on the content of the provided medical manual. Do not use any external sources. If the information is not present in the manual, respond with I don’t know.

User input will have the context required by you to answer user questions.
This context will begin with the token: ###Context.
The context contains references to specific portions of a document relevant to the user query.

User questions will begin with the token: ###Question.

Please answer only using the context provided in the input. Do not mention anything about the context in your final answer.

If the answer is not found in the manual, respond "I don't know".
"""

# Define the 5 medical questions
queries = [
    "Q1) What is the protocol for managing sepsis in a critical care unit?",
    "Q2) What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?",
    "Q3) What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?",
    "Q4) What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?",
    "Q5) What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
]

# Setup retriever
retriever = vectorstore.as_retriever(
    search_type='similarity',
    search_kwargs={'k': 2}
)


# QA prompt format
def format_prompt(context, query):
    return f"""{qna_system_message}

###Context
{context}

###Question
{query}
"""

# Run all queries
for query in queries:
    docs = retriever.get_relevant_documents(query)
    context = "\n".join([doc.page_content for doc in docs])
    prompt = format_prompt(context, query)

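    output = llm(prompt, temperature=0, max_tokens=300, top_p=0)
    # Keep only the generated text instead of printing the whole result dictionary
    answer = output['choices'][0]['text'].strip()

    print(f"\n Query: {query}")
    print(f" Answer: {answer}")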

This is a simple combination: k = 2 retrieves the two most similar chunks. This value is very low; two chunks are often insufficient for detailed answers and can lead to missing important information. Temperature and top p are 0 and max tokens is relatively low (300), so a focused and concise response is seen.

In [None]:
#Combination 2


# System message for the LLM to strictly follow
qna_system_message = """
You are a medical information expert. Answer strictly based on the content of the provided medical manual. Do not use any external sources. If the information is not present in the manual, respond with I don’t know.

User input will have the context required by you to answer user questions.
This context will begin with the token: ###Context.
This context contains references to specific portions of a document relevant to the user query.

User questions will begin with the token: ###Question.

Please answer only using the context provided in the input. Do not mention anything about the context in your final answer.

If the answer is not found in the manual, respond "I don't know".
"""

# Define the 5 medical questions
queries = [
    "Q1) What is the protocol for managing sepsis in a critical care unit?",
    "Q2) What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?",
    "Q3) What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?",
    "Q4) What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?",
    "Q5) What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
]


# Setup retriever
retriever = vectorstore.as_retriever(
    search_type='similarity',
    search_kwargs={'k': 6}
)



# QA prompt format
def format_prompt(context, query):
    return f"""{qna_system_message}

###Context
{context}

###Question
{query}
"""

# Run all queries
for query in queries:
    docs = retriever.get_relevant_documents(query)
    context = "\n".join([doc.page_content for doc in docs])
    prompt = format_prompt(context, query)

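    output = llm(prompt, temperature=0, max_tokens=500, top_p=0)
    # Keep only the generated text instead of printing the whole result dictionary
    answer = output['choices'][0]['text'].strip()

    print(f"\n Query: {query}")
    print(f" Answer:\n{answer}")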

In this combination k is 6, so the top 6 semantically similar chunks are retrieved. This supplies more material for comprehensive clinical questions but may include some irrelevant information. Temperature and top p are 0 to keep the answer focused and concise, and max tokens is raised to 500 to allow a slightly more detailed answer.

In [None]:
#combination 3


# System message for the LLM to strictly follow
qna_system_message = """
You are a medical information expert. Answer strictly based on the content of the provided medical manual. Do not use any external sources. If the information is not present in the manual, respond with I don’t know.

User input will have the context required by you to answer user questions.
This context will begin with the token: ###Context.
The context contains references to specific portions of a document relevant to the user query.

User questions will begin with the token: ###Question.

Please answer only using the context provided in the input. Do not mention anything about the context in your final answer.

If the answer is not found in the manual, respond "I don't know".
"""

# Define the 5 medical questions
queries = [
    "Q1) What is the protocol for managing sepsis in a critical care unit?",
    "Q2) What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?",
    "Q3) What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?",
    "Q4) What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?",
    "Q5) What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
]


# Setup retriever
retriever = vectorstore.as_retriever(
    search_type='mmr',
    search_kwargs={"k": 4, "fetch_k": 10, "lambda_mult": 0.5}
)

# QA prompt format
def format_prompt(context, query):
    return f"""{qna_system_message}

###Context
{context}

###Question
{query}
"""

# Run all queries
for query in queries:
    docs = retriever.get_relevant_documents(query)
    context = "\n".join([doc.page_content for doc in docs])
    prompt = format_prompt(context, query)

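    output = llm(prompt, temperature=0.5, max_tokens=600, top_p=0.6)
    # Keep only the generated text instead of printing the whole result dictionary
    answer = output['choices'][0]['text'].strip()

    print(f"\n Query: {query}")
    print(f" Answer:\n{answer}")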



In this combination k is moderate (k=4), neither too high nor too low, with fetch_k=10 and lambda_mult=0.5. MMR helps retrieve diverse yet relevant information. A fetch_k in the range of 8 to 12 helps avoid missing relevant but rare information. A balanced lambda_mult of 0.5 suits factual answers while avoiding repetition of the same points. Temperature (0.5) and top p (0.6) are also balanced, which gives a controlled yet varied response.

In [None]:
#combination 4

# System message for the LLM to strictly follow
qna_system_message = """
You are a medical information expert. Answer strictly based on the content of the provided medical manual. Do not use any external sources. If the information is not present in the manual, respond with I don’t know.

User input will have the context required by you to answer user questions.
This context will begin with the token: ###Context.
The context contains references to specific portions of a document relevant to the user query.

User questions will begin with the token: ###Question.

Please answer only using the context provided in the input. Do not mention anything about the context in your final answer.

If the answer is not found in the manual, respond "I don't know".
"""

# Define the 5 medical questions
queries = [
    "Q1) What is the protocol for managing sepsis in a critical care unit?",
    "Q2) What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?",
    "Q3) What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?",
    "Q4) What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?",
    "Q5) What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
]

# Setup retriever
retriever = vectorstore.as_retriever(
    search_type='similarity',
    search_kwargs={'k': 3}
)


# QA prompt format
def format_prompt(context, query):
    return f"""{qna_system_message}

###Context
{context}

###Question
{query}
"""

# Run all queries
for query in queries:
    docs = retriever.get_relevant_documents(query)
    context = "\n".join([doc.page_content for doc in docs])
    prompt = format_prompt(context, query)

    # llm returns a completion dict; keep only the generated text
    answer = llm(prompt, temperature=0.2, max_tokens=500, top_p=0.2)["choices"][0]["text"]

    print(f"\n Query: {query}")
    print(f" Answer:\n{answer}")

This combination uses similarity search with a small k (k=3), so fewer documents are retrieved. Such a low k can cause important information to be missed. The temperature and top p are increased only slightly, to 0.2 each, so the combination still focuses on factual answers by keeping randomness low.
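The effect of temperature and top p on randomness can be illustrated with a small sketch of how a sampler typically reshapes next-token probabilities. This is a simplified, assumed model of the sampling step, not the code of the LLM runtime used here; the function names and the toy logits are hypothetical.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalise so the surviving tokens form a distribution again
    return {i: probs[i] / total for i in kept}

toy_logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical scores for four tokens
print(top_p_filter(apply_temperature(toy_logits, 0.2), 0.2))
print(top_p_filter(apply_temperature(toy_logits, 0.8), 0.9))
```

On this toy input, temperature 0.2 with top p 0.2 leaves a single candidate token, whereas temperature 0.8 with top p 0.9 leaves three, which is why the low settings give more deterministic output.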

In [None]:
# Combination 5

# System message for the LLM to strictly follow
qna_system_message = """
You are a medical information expert. Answer strictly based on the content of the provided medical manual. Do not use any external sources. If the information is not present in the manual, respond with I don’t know.

User input will have the context required by you to answer user questions.
This context will begin with the token: ###Context.
The context contains references to specific portions of a document relevant to the user query.

User questions will begin with the token: ###Question.

Please answer only using the context provided in the input. Do not mention anything about the context in your final answer.

If the answer is not found in the manual, respond "I don't know".
"""

# Define the 5 medical questions
queries = [
    "Q1) What is the protocol for managing sepsis in a critical care unit?",
    "Q2) What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?",
    "Q3) What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?",
    "Q4) What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?",
    "Q5) What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
]

# Setup retriever
retriever = vectorstore.as_retriever(
    search_type='mmr',
    search_kwargs={'k': 5, 'fetch_k': 8, 'lambda_mult': 0.8}
)


# QA prompt format
def format_prompt(context, query):
    return f"""{qna_system_message}

###Context
{context}

###Question
{query}
"""

# Run all queries
for query in queries:
    docs = retriever.get_relevant_documents(query)
    context = "\n".join([doc.page_content for doc in docs])
    prompt = format_prompt(context, query)

    # llm returns a completion dict; keep only the generated text
    answer = llm(prompt, temperature=0.8, max_tokens=300, top_p=0.9)["choices"][0]["text"]


    print(f"\n Query: {query}")
    print(f" Answer:\n{answer}")

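Combination 5: MMR with λ=0.8, k=5, fetch k=8, high temperature and top p
MMR still filters out near-duplicate chunks, but with lambda mult at 0.8 the ranking leans toward relevance rather than diversity (in LangChain, a lambda mult of 1 means minimum diversity and 0 means maximum), and a fetch k of 8 leaves few extra candidates to choose from when k=5. The temperature and top p values are high, which increases randomness and can cause answers to deviate from the factual content of the manual. This raises the possibility of hallucination, so this combination is less reliable for a medical use case.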


Conclusion:

Best combination:

MMR with a balanced lambda mult (0.5) effectively retrieves a good mix of relevant and diverse information from the manual, which is crucial for providing comprehensive medical answers.
The k=4 and fetch k=10 settings likely provide enough context to generate detailed responses without overwhelming the model.
The moderate temperature and top_p values strike a balance between generating focused, factual answers and allowing some flexibility in phrasing, while minimizing the risk of hallucination.
We observed that this combination produced clinically relevant answers without significant hallucinations or repetitions.
Combination 4, with its low temperature, suits strictly factual answers, while combination 3 seems to offer the best overall performance for a medical RAG system where both comprehensiveness and accuracy are important.

It is generally better to avoid very high temperatures and low k values to ensure reliability and accuracy.



# Output Evaluation

We use the LLM-as-a-judge method to check the quality of the RAG system on two parameters: retrieval and generation. We illustrate this evaluation with the answers generated for the questions from the previous section.
The same Mistral model is used for evaluation, so the LLM is effectively rating its own performance on the task.

In [None]:
#Defining the Evaluation Prompts
groundedness_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
The answer should be derived only from the information presented in the context

Instructions:
1. First write down the steps that are needed to evaluate the answer as per the metric in 2 lines.
2. Next, evaluate the extent to which the metric is followed.
3. Use the previous information to rate the answer using the evaluation criteria and assign a score.

Example Output:
Groundedness:
Steps
Rating:
Explanation: The answer is directly supported by ......
"""


In [None]:
relevance_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
Relevance measures how well the answer addresses the main aspects of the question, based on the context.
Consider whether all and only the important aspects are contained in the answer when evaluating relevance.

Instructions:
1. First write down the steps that are needed to evaluate the context as per the metric in 2 lines.
2. Next, evaluate the extent to which the metric is followed.
3. Use the previous information to rate the context using the evaluation criteria and assign a score.

Example Output:
Relevance:
Steps
Rating:
Explanation: The answer is directly supported by .......

"""


In [None]:
user_message_template = """
###Question
{question}

###Context
{context}

###Answer
{answer}
"""


In [None]:
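# Response function: generate a RAG answer, then have the LLM rate its groundedness and relevance
def generate_ground_relevance_response(user_input, k=3, max_tokens=128, temperature=0, top_p=0.95, top_k=50):
    # qna_system_message and qna_user_message_template are expected to be defined
    # at notebook level; reading them does not require a `global` statement.
    # Retrieve relevant document chunks (forward the function's k rather than a hard-coded 3)
    relevant_document_chunks = retriever.get_relevant_documents(query=user_input, k=k)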
    context_list = [d.page_content for d in relevant_document_chunks]
    context_for_query = ". ".join(context_list)

    # Combine user_prompt and system_message to create the prompt
    prompt = f"""[INST]{qna_system_message}\n
                {'user'}: {qna_user_message_template.format(context=context_for_query, question=user_input)}
                [/INST]"""

    response = llm(
            prompt=prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            )

    answer =  response["choices"][0]["text"]

    # Build the groundedness-rating prompt from the rater system message and the question, context and answer
    groundedness_prompt = f"""[INST]{groundedness_rater_system_message}\n
                {'user'}: {user_message_template.format(context=context_for_query, question=user_input, answer=answer)}
                [/INST]"""

    # Build the relevance-rating prompt from the rater system message and the question, context and answer
    relevance_prompt = f"""[INST]{relevance_rater_system_message}\n
                {'user'}: {user_message_template.format(context=context_for_query, question=user_input, answer=answer)}
                [/INST]"""

    response_1 = llm(
            prompt=groundedness_prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            )

    response_2 = llm(
            prompt=relevance_prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            )

    return response_1['choices'][0]['text'],response_2['choices'][0]['text']

In [None]:
#Query 1:  What is the protocol for managing sepsis in a critical care unit?
user_input = "What is the protocol for managing sepsis in a critical care unit?"
ground,rel = generate_ground_relevance_response(user_input)
print(ground)
print(rel)



Observation

For answer 1, the LLM gave a groundedness rating of 5, explaining that the answer was directly derived from the context and specifically mentioning the fluid resuscitation details. The LLM gave a relevance rating of 5, stating that the context was highly relevant and provided a detailed description of the protocol.

In [None]:
#Query 2:What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?
user_input = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
ground,rel = generate_ground_relevance_response(user_input)
print(ground)
print(rel)

Observation

For answer 2, the LLM gave a groundedness rating of 5, as the answer identifies the symptoms of appendicitis and states that surgery is the standard treatment. The LLM gave a relevance rating of 4, noting that the answer described the symptoms and treatment options, but it did not explicitly state why one mark was deducted.

In [None]:
#Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?
user_input = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
ground,rel = generate_ground_relevance_response(user_input)
print(ground)
print(rel)

Observation

For answer 3, the LLM gave a groundedness rating of 4 and a relevance rating of 5, as the answer covers detailed information on alopecia areata. It does not say why one mark was deducted; possibly some minor details of the answer were not fully supported by the provided context chunks.

In [None]:
#Query 4: What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?
user_input = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
ground,rel = generate_ground_relevance_response(user_input)
print(ground)
print(rel)

Observation

For answer 4, the LLM gave a groundedness rating of 3 because the answer included information about spinal cord injuries and immobilization, which is not directly related to the question or the context. The relevance rating of 3 suggests that the answer may not have fully addressed all parts of the question based on the available context, or that it included irrelevant information that lowered its relevance score.

In [None]:
#Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?
user_input = "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
ground,rel = generate_ground_relevance_response(user_input)
print(ground)
print(rel)

Observation

For answer 5, the LLM gave a groundedness rating of 4, stating that the answer misses the application of ice and compression in the first 24 to 48 hours. This deduction appears to be incorrect, since the RAG answer does mention applying ice and compression in the first 24 to 48 hours. The LLM gave a relevance rating of 5 because the answer addresses all important aspects of the question.
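The ratings discussed in these observations were read manually from the printed judge output. As a rough sketch, they could also be pulled out programmatically, assuming the judge follows the `Rating:` line from the example output in the rater prompts. The helper name `extract_rating` and the sample text are hypothetical, and the helper returns `None` when no rating is found, for instance if the judge's response is cut off by `max_tokens` before the rating line.

```python
import re

def extract_rating(judge_text):
    """Return the 1-5 score following 'Rating:' in the judge output, or None."""
    match = re.search(r"Rating\s*:\s*([1-5])\b", judge_text)
    return int(match.group(1)) if match else None

# Hypothetical judge output, shaped like the example output in the rater prompt
sample = """Groundedness:
Steps: compare each claim in the answer with the context.
Rating: 4
Explanation: The answer is mostly supported by the context."""

print(extract_rating(sample))
print(extract_rating("The response was truncated before a score was given"))
```

Collecting the two scores per question this way would make it easier to compare retriever and LLM settings side by side.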

# Actionable Insights and Business Recommendations

Observation:

First, we used the questions to get raw LLM answers. Next, we altered the prompt and LLM parameters and monitored the performance, which improved noticeably. We then answered the questions using RAG and tuned the retriever and LLM parameters to compare performance. Finally, we evaluated the RAG responses using the LLM as a judge, and the answers received considerably good ratings.

The best combination is combination 3, MMR with balanced parameters, which is the configuration we would deploy for the medical use case.

MMR (Maximal Marginal Relevance) helps minimize redundancy and ensures retrieval of diverse but relevant chunks of context.
A balanced lambda mult of 0.5 strikes a balance between relevance and diversity.
Higher fetch k and k values help retrieve enough information for detailed questions.
A moderate temperature keeps answers focused and reduces noise.
The top p setting gives more controlled variability, and a sufficient max tokens value allows the detailed responses that medical questions require.


Actionable insights:

The retrieval parameter k is critical because it determines how many documents are retrieved.

Chunk overlap preserves coherence across chunk boundaries.

Higher max tokens values allow more detailed responses, while simple queries still produce concise outputs despite large token limits, owing to the prompt design and zero temperature.

We can tailor prompts to get more nuanced responses.

The temperature and top p settings can be altered to control the randomness and creativity of responses, while max tokens bounds their length.

We can continuously adjust RAG parameters based on specific use cases for optimal performance.

We can further tune the groundedness and relevance prompts used in evaluation to ensure reliable and contextually accurate outputs.

We can further fine-tune parameters to get creative yet relevant responses.


Recommendations:

In medical use cases it is better to avoid:
- High temperatures (such as 0.9), which increase the risk of hallucination
- Low k, which may miss important context
- Low max_tokens, which risks truncating complex medical explanations
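These recommendations could be enforced with a small sanity check before running a configuration. The sketch below is only an illustration: the function name `check_rag_settings` and the threshold defaults (`min_k=4`, `max_temperature=0.5`, `min_max_tokens=300`) are assumptions chosen for the example, not values established by the experiments above.

```python
def check_rag_settings(k, temperature, max_tokens,
                       min_k=4, max_temperature=0.5, min_max_tokens=300):
    """Return a list of warnings for settings that the recommendations advise against.

    The threshold defaults are illustrative assumptions and should be tuned per use case.
    """
    warnings = []
    if temperature > max_temperature:
        warnings.append(f"temperature={temperature} is high and may increase hallucination")
    if k < min_k:
        warnings.append(f"k={k} is low and may miss important context")
    if max_tokens < min_max_tokens:
        warnings.append(f"max_tokens={max_tokens} is low and may truncate explanations")
    return warnings

# Example: settings in the style of a low-k similarity configuration
for message in check_rag_settings(k=3, temperature=0.2, max_tokens=500):
    print(message)
```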

We can involve a domain expert, such as a medical practitioner, to validate responses.

It is better to include a disclaimer stating that the system is not a substitute for professional medical advice.
