### **Problem Statement**

In this course, we are going to build and evaluate an **advanced multi-PDF Retrieval-Augmented Generation (RAG) system** specifically designed for the **healthcare domain**.

We used multiple research papers, including:

- **_“Health Care System in India : An Overview”_**  
- **_“Digital Transformation in Healthcare Industry: A Survey”_**

We applied key RAG techniques such as:

- **Chunking**
- **Embedding**
- **Multi-document retrieval**
- **Context-based response generation**

Now, we are focusing on evaluating our system’s performance. Using **DeepEval**, we assess:

- **Retrieval relevance and precision**
- **Answer accuracy and relevancy**
- **Hallucination detection and faithfulness**

This comprehensive process—from multi-PDF ingestion to evaluation—helps us build **reliable and trustworthy AI systems** in healthcare. These systems must support **clinical decision-making** where **safety, accuracy, and contextual grounding** are critical.

Our goal is to ensure the RAG system works effectively in real-world scenarios involving **MIoT (Medical Internet of Things)** and **ICE platforms** by delivering **accurate, safe, and context-aware responses**.



### DeepEval Usage Disclaimer

Before using **DeepEval**, please be aware of the following:

- **Telemetry**: Basic usage data (e.g., number of tests, metrics used) may be collected.  
   _No personal data is shared._

- **To disable telemetry**:  
  Set the following variable in your shell environment:  
  `export DEEPEVAL_TELEMETRY_OPT_OUT="YES"`  
  or set it from Python:  
  `os.environ["DEEPEVAL_TELEMETRY_OPT_OUT"] = "YES"`

- **Cache Files**:  
  DeepEval creates local cache files like `.deep-eval-cache` in the working directory.  
   



In [6]:
!pip install openai python-dotenv langchain langchain_openai langchain_community deepeval


In [7]:
import os
import openai
import warnings
from dotenv import load_dotenv
from langchain_community.vectorstores import Chroma
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_core.documents import Document
from langchain.chains import RetrievalQA
from langchain_community.retrievers import BM25Retriever
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval import evaluate
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
    GEval,
)



warnings.filterwarnings("ignore")


In [8]:
os.environ["DEEPEVAL_TELEMETRY_OPT_OUT"] = "YES"

### Create Model Client and Set Up Authentication


In [16]:
load_dotenv('UAIS_NEW.env')
print("Environment variables loaded successfully.")

AZURE_OPENAI_ENDPOINT = os.environ["MODEL_ENDPOINT"]
OPENAI_API_VERSION = os.environ["API_VERSION"]
EMBEDDINGS_DEPLOYMENT_NAME = os.environ["EMBEDDINGS_MODEL_NAME"]
CHAT_DEPLOYMENT_NAME = os.environ["CHAT_MODEL_NAME"]
subscription_key = os.environ["AZURE_OPENAI_API_KEY"]


# Raw Azure OpenAI client, used later by generate_answer
chat_client = openai.AzureOpenAI(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_version=OPENAI_API_VERSION,
    azure_deployment=CHAT_DEPLOYMENT_NAME,
    api_key=subscription_key,
)

Environment variables loaded successfully.
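
Indexing `os.environ[...]` directly raises a bare `KeyError` on the first missing variable. The following is a minimal sketch, assuming you would rather see every missing name at once. `require_env` is a hypothetical helper introduced here for illustration; it is not part of `python-dotenv` or of the original notebook, and the values in `demo_env` are placeholders.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for the requested variables.

    Raises RuntimeError listing every variable that is missing or empty.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# Variable names come from the cell above; the values are placeholders only.
demo_env = {
    "MODEL_ENDPOINT": "https://example.invalid",
    "API_VERSION": "placeholder-version",
    "EMBEDDINGS_MODEL_NAME": "placeholder-embeddings",
    "CHAT_MODEL_NAME": "placeholder-chat",
    "AZURE_OPENAI_API_KEY": "placeholder-key",
}
settings = require_env(list(demo_env), environ=demo_env)
print(sorted(settings))
```

In the notebook itself you would call `require_env([...])` without the `environ` argument, right after `load_dotenv(...)`.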


We create the embeddings client with `AzureOpenAIEmbeddings` and the `chat_model` with `AzureChatOpenAI`, both pointing at the Azure OpenAI deployments configured above.

In [18]:
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    azure_deployment=EMBEDDINGS_DEPLOYMENT_NAME,
    openai_api_version=OPENAI_API_VERSION,
    model=EMBEDDINGS_DEPLOYMENT_NAME,
    api_key=subscription_key)

chat_model = AzureChatOpenAI(
    openai_api_version=OPENAI_API_VERSION,
    azure_deployment=CHAT_DEPLOYMENT_NAME,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    # azure_ad_token=token,
    # default_headers={"projectId": PROJECT_ID},
)

### Purpose of Custom Wrapper for AzureChatOpenAI

The `DeepEval` library does not natively support the `AzureChatOpenAI` class from the **LangChain** library.  
To enable compatibility, we created a custom wrapper class called `AzureChatModelWrapper` that conforms to the `DeepEvalBaseLLM` interface expected by DeepEval.

This custom wrapper:

- Passes the **Azure model instance** (`AzureChatOpenAI`) to DeepEval in a compatible format
- Implements required methods like `generate`, `a_generate`, and `get_model_name`
- Allows DeepEval to **invoke and evaluate** responses using the Azure-hosted GPT model

By doing this, we ensure **seamless integration** between **Azure OpenAI services** and the **DeepEval evaluation framework**, enabling reliable testing and metric computation.


In [19]:
# Wrap AzureChatOpenAI in a compatible wrapper
class AzureChatModelWrapper(DeepEvalBaseLLM):
    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        return self.model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        return (await self.model.ainvoke(prompt)).content

    def get_model_name(self):
        return "azure-gpt4o-mini"
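
To see the method contract without calling Azure, here is a small stand-in sketch. `EchoChatModel` and `StubModelWrapper` are hypothetical names introduced for illustration. The stub mirrors the methods of `AzureChatModelWrapper` but deliberately does not inherit from `DeepEvalBaseLLM`, so it runs with the standard library only.

```python
import asyncio
from types import SimpleNamespace


class EchoChatModel:
    """Stand-in for AzureChatOpenAI: invoke/ainvoke return an object with .content."""

    def invoke(self, prompt):
        return SimpleNamespace(content=f"echo: {prompt}")

    async def ainvoke(self, prompt):
        return self.invoke(prompt)


class StubModelWrapper:
    """Same method surface as AzureChatModelWrapper, without the DeepEval base class."""

    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        return self.model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        return (await self.model.ainvoke(prompt)).content

    def get_model_name(self):
        return "echo-stub"


stub = StubModelWrapper(EchoChatModel())
sync_reply = stub.generate("ping")
# asyncio.run works in a plain script; inside a notebook cell,
# use `await stub.a_generate("ping")` instead.
async_reply = asyncio.run(stub.a_generate("ping"))
print(sync_reply, "|", async_reply)
```

The point of the sketch is that both the synchronous and asynchronous paths return a plain string, which is what the wrapper above hands back to DeepEval.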

### Wrapping AzureChatOpenAI for DeepEval


In [20]:
# Wrap it for DeepEval
wrapped_model = AzureChatModelWrapper(chat_model)


### Function: `load_pdfs_only`

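This function loads multiple PDF files with `PyMuPDFLoader` and prints how many documents each file produced. By default the loader returns one `Document` per page, so the "chunks" counted in the log are pages; the actual chunking happens in the next step.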

#### Purpose:
It helps **prepare and organize content** from multiple PDFs so you can use it in downstream tasks like chunking, embedding, or retrieval in a RAG pipeline.


In [21]:
# Function to load and print PDF content info
def load_pdfs_only(pdf_paths):
    all_documents = []
    for path in pdf_paths:
        loader = PyMuPDFLoader(path)
        documents = loader.load()
        print(f"Loaded {len(documents)} chunks from {path}")
        all_documents.extend(documents)
    print(f"Total loaded documents from all PDFs: {len(all_documents)}")
    return all_documents

In [22]:
# Define a function to chunk documents using RecursiveCharacterTextSplitter.
def chunk_documents(documents, chunk_size=600, chunk_overlap=100):
    """Split documents into overlapping chunks; sizes are measured in characters."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    return splitter.split_documents(documents)
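
To build intuition for what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch. It is a fixed-width sliding window, not the algorithm `RecursiveCharacterTextSplitter` uses (the library prefers to split on separators such as paragraph and line breaks). `sliding_window_chunks` is a hypothetical helper introduced for illustration.

```python
def sliding_window_chunks(text, chunk_size=600, chunk_overlap=100):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "x" * 1400
windows = sliding_window_chunks(sample)
print([len(w) for w in windows])  # [600, 600, 400]
```

With the defaults, each window starts 500 characters after the previous one, so consecutive chunks share 100 characters of context.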



### Function: `store_embeddings`

This function manages the storage and reuse of document embeddings using **Chroma** when you work with **multiple PDF files**.

#### Purpose:
- It checks whether a vector store already exists in the specified directory.
- If it does, it **loads and reuses** the existing vector store.
- If it doesn't, it **creates a new vector store** from the provided documents and **persists** it.

#### Why This Matters:
When you work with **multiple PDF files**, it becomes essential to handle storage efficiently. Regenerating embeddings for previously processed files wastes time and resources.

By writing and using this function, you ensure the system:
- **Avoids redundant computation**
- **Maintains embedding consistency**
- **Keeps the vector database optimized for retrieval**

This approach supports scalable and performant RAG systems that operate across large or growing document sets.


In [23]:
# Define a function to create and store embeddings in a local ChromaDB vector store.


def store_embeddings(persist_directory, docs=None):
    """Load the vector store at persist_directory if it exists; otherwise create it from docs."""

    # Check if vector store already exists
    if os.path.isdir(persist_directory):
        print(f"Loading existing vector store from {persist_directory}")
        # Load existing vector store
        vector_store = Chroma(
            persist_directory=persist_directory,
            embedding_function=embeddings
        )
    else:
        if not docs:
            raise ValueError("docs must be provided when creating a new vector store")
        # Create new vector store
        print(f"Creating new vector store in {persist_directory}")
        vector_store = Chroma.from_documents(
            docs,
            embedding=embeddings,
            persist_directory=persist_directory
        )
        # Chroma 0.4+ writes to disk automatically; this call only matters for older versions.
        vector_store.persist()

    return vector_store

### Function: `get_processed_document_name`

This function retrieves the names (paths) of all documents already embedded and stored in a **Chroma vector store**.

#### Purpose:
- Loads the vector store from the specified directory.
- Extracts and inspects metadata from stored documents.
- Gathers a set of unique source file paths that have already been processed.

By doing this, we ensure efficient ingestion by recognizing previously processed PDFs and maintaining a clean, duplication-free embedding workflow.

This approach becomes especially valuable as the system evolves—whether handling multiple files or incremental updates—by preserving consistency and avoiding unnecessary reprocessing.



In [24]:
def get_processed_document_name(persist_directory):
    # Load the vector store to read the metadata of stored documents
    vectorstore = Chroma(
        persist_directory=persist_directory,
        embedding_function=embeddings
    )
        
    # Extract metadata from all documents in the store
    all_metadatas = vectorstore.get()["metadatas"]
    
    # Create a set of source file paths from metadata
    processed_sources = set()
    for metadata in all_metadatas:
        if metadata and "source" in metadata:
            processed_sources.add(metadata["source"])
    
    return processed_sources
    
        

### Function: `filter_new_pdfs`

This function identifies and separates **new PDF files** from those that have already been embedded and stored in the **Chroma vector store**.

#### Purpose:
- Retrieves the list of previously processed document paths using their metadata.
- Compares incoming PDF paths against the stored records.
- Returns only the PDFs that have not yet been embedded.
- Provides a clear message indicating whether new files are detected.

This step ensures the pipeline stays lean and avoids redundant work—particularly beneficial as your document set grows or evolves over time. By automatically distinguishing new content, the system stays responsive and efficient, even as the dataset scales.



In [25]:
def filter_new_pdfs(pdf_paths, persist_directory):
    """Filter out PDFs that have already been processed."""
    processed_sources = get_processed_document_name(persist_directory)
    
    # Find PDFs that haven't been processed yet
    new_pdfs = [path for path in pdf_paths if path not in processed_sources]
    
    if new_pdfs:
        print(f"Found {len(new_pdfs)} new PDFs to process: {new_pdfs}")
    else:
        print("No new PDFs to process.")
        
    return new_pdfs
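
One caveat: `filter_new_pdfs` compares raw path strings, so `./report.pdf` and `report.pdf` would be treated as different files. The sketch below shows one way to normalize paths before comparing, assuming both sides are resolved against the same working directory. `normalize` and `filter_unprocessed` are hypothetical helpers, not part of the original pipeline.

```python
import os


def normalize(path):
    """Resolve a path to an absolute, case-normalized form for comparison."""
    return os.path.normcase(os.path.abspath(path))


def filter_unprocessed(pdf_paths, processed_sources):
    """Return the entries of pdf_paths whose normalized form is not in processed_sources."""
    seen = {normalize(source) for source in processed_sources}
    return [path for path in pdf_paths if normalize(path) not in seen]


processed = {"HealthCareSectorinindia-AnOverview.pdf"}
incoming = ["./HealthCareSectorinindia-AnOverview.pdf", "digital_transformation.pdf"]
new_paths = filter_unprocessed(incoming, processed)
print(new_paths)  # ['digital_transformation.pdf']
```

This only helps if the notebook runs from the same working directory as when the PDFs were first embedded, because the stored `source` values are whatever path strings were passed to the loader.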

### Function: `retrieve_chunks`

This function retrieves the **top-k most relevant document chunks** from a Chroma vector store using semantic similarity search.

#### Purpose:
- Performs a similarity search to find chunks related to the input query.
- Filters out duplicate chunks to ensure uniqueness.
- Returns the top-k most relevant, unique chunks.

This is a key step in the **retrieval** stage of a RAG pipeline. Note that the deduplication only drops chunks whose text is exactly identical, so near-duplicates can still reach the generator.


In [26]:
# Define function to retrieve top_k semantically relevant documents from ChromaDB using vector search.
def retrieve_chunks(query,vectorstore, top_k=5):
    results = vectorstore.similarity_search(query, k=top_k*2)  # fetch more to be safe
    unique_results = []
    seen_contents = set()

    for doc in results:
        if doc.page_content not in seen_contents:
            unique_results.append(doc)
            seen_contents.add(doc.page_content)
        if len(unique_results) >= top_k:
            break

    return unique_results


### Guiding the Model to Use Only Retrieved Context During Evaluation

Earlier, our `generate_answer` function used a **general prompt** that allowed the model to answer using the retrieved context and its own knowledge.

But now, in the **evaluation phase**, our focus shifts to **strict control**: we want to measure how well the model performs when it's **only allowed to use the retrieved context**.

To support this, we revise the prompt to include:

> **"Generate an answer strictly based on the above context; do not use your own knowledge. If the query is not covered in the context, respond with: 'This query is not as per the PDF.'"**

### Why This Matters

- It isolates the model’s behavior based on context alone.  
- It prevents answers from being influenced by pre-trained knowledge.  
- It enables **fair and measurable evaluation** using DeepEval metrics such as **faithfulness**, **hallucination**, and **contextual precision**.

By refining the prompt this way, we ensure the model is tested under realistic and controlled retrieval-based generation conditions.


In [27]:
def generate_answer(query, top_chunks, model_name=CHAT_DEPLOYMENT_NAME):
    context = "\n\n".join([doc.page_content for doc in top_chunks])
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        f"Answer (generate an answer strictly based on the above context; do not use your own knowledge. "
        f"If the query is not covered in the context, respond with: 'This query is not as per the PDF.'):"
    )
    
    response = chat_client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        model=model_name
    )
    
    gpt_output = response.choices[0].message.content
    return gpt_output
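
Because the prompt fixes the exact fallback sentence, you can build the prompt and detect a fallback answer without calling the model. The following is a minimal sketch: `build_grounded_prompt` and `is_refusal` are hypothetical helpers introduced for illustration, and the chunk texts are placeholders.

```python
REFUSAL_MESSAGE = "This query is not as per the PDF."


def build_grounded_prompt(query, chunk_texts):
    """Assemble the same context-only prompt that generate_answer sends to the model."""
    context = "\n\n".join(chunk_texts)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer (generate an answer strictly based on the above context; "
        "do not use your own knowledge. "
        f"If the query is not covered in the context, respond with: '{REFUSAL_MESSAGE}'):"
    )


def is_refusal(answer):
    """True if the answer contains the fallback sentence (case-insensitive, final period optional)."""
    return REFUSAL_MESSAGE.rstrip(".").lower() in answer.lower()


prompt = build_grounded_prompt("theory of relativity", ["chunk one", "chunk two"])
print(prompt)
print(is_refusal("This query is not as per the PDF."))
```

A check like `is_refusal` makes it easy to count how many out-of-scope queries were correctly declined when you evaluate a batch of questions.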

### Function: `pdf_chatbot_pipeline`

This function defines the complete **multi-PDF RAG pipeline**, integrating all key stages:

#### Pipeline Overview:
1. **PDF Handling**  
   - Loads new PDFs only if they haven’t been processed before.
   - Avoids redundant embedding by checking existing vector store.

2. **Chunking & Embedding**  
   - Chunks the document content.
   - Embeds the chunks into a Chroma vector store (either by creating or updating it).

3. **Retrieval**  
   - Performs semantic search using the user’s query to fetch top relevant chunks.

4. **Response Generation**  
   - Passes the retrieved context to the model to generate a grounded answer.

#### Final Output:
Returns a dictionary with:
- `context`: Retrieved document chunks  
- `question`: User's original query  
- `AI_generated_response`: Final answer generated from context

This pipeline ensures scalable, **efficient** interaction with multiple documents using **retrieval-augmented generation**.


In [30]:
def pdf_chatbot_pipeline(pdf_paths, user_query, persist_directory):
    """
    Full pipeline: Load → Chunk → Embed → Retrieve → Generate
    Returns a dictionary with context, question, and AI-generated response.
    """

    
    # Check if vector store already exists
    if os.path.exists(persist_directory) and os.path.isdir(persist_directory):
        print(f"Using existing embeddings from {persist_directory}")
        # Find any new PDFs that haven't been processed yet
        new_pdfs = filter_new_pdfs(pdf_paths, persist_directory)
        
        if new_pdfs:
            # Process only the new PDFs
            print(f"Processing {len(new_pdfs)} new PDFs...")
            raw_docs = load_pdfs_only(new_pdfs)
            chunks = chunk_documents(raw_docs)
            
            # Load existing vector store and add new documents
            vectorstore = Chroma(
                persist_directory=persist_directory,
                embedding_function=embeddings
            )
            
            # Add new documents to the existing vector store
            vectorstore.add_documents(chunks)
            vectorstore.persist()
            print(f"Added {len(chunks)} new chunks to existing vector store")
        else:
            # Just load the existing vector store
            vectorstore = Chroma(
                persist_directory=persist_directory,
                embedding_function=embeddings
            )
    else:
        # Load and process PDFs only if no existing vector store
        print(f"No existing embeddings found. Processing PDFs...")
        raw_docs = load_pdfs_only(pdf_paths)
        chunks = chunk_documents(raw_docs)
        vectorstore = store_embeddings(docs=chunks, persist_directory=persist_directory)

    # Retrieve relevant chunks based on the user query
    retrieved = retrieve_chunks(user_query, vectorstore)

    # Generate the answer using retrieved chunks
    answer = generate_answer(user_query, retrieved)

    # Format and return the response
    return {
        'context': retrieved,
        'question': user_query,
        'AI_generated_response': answer
    }




> **Important Note:**  
>  
> This pipeline checks whether embeddings for the given PDFs are already stored in the specified **Chroma vector store** (`persist_directory`).  
> - If embeddings are found, it **reuses them** to prevent unnecessary recomputation.  
> - If a PDF is new, the pipeline will **process and append** its embeddings to the existing vector store.  

You can also **customize the storage location** by changing the `persist_directory` parameter. Each directory holds its own vector store on disk, so pointing the pipeline at a different directory gives you a separate, independent set of embedded documents. Keep in mind that a directory that does not exist yet triggers a full re-embedding of all PDFs, as the `multi_pdf_eval.db` run later in this notebook shows.


In [35]:
!pip install pymupdf chromadb


In [36]:
# Example usage
pdf_paths = ['digital_transformation.pdf',"HealthCareSectorinindia-AnOverview.pdf"]
response = pdf_chatbot_pipeline(pdf_paths, "What are the main challenges facing the healthcare sector in India today?",persist_directory="tk.db")
print(response)
print("\n AI-Generated Response:\n")
print("-" * 80)
print(response["AI_generated_response"])


No existing embeddings found. Processing PDFs...
Loaded 16 chunks from digital_transformation.pdf
Loaded 20 chunks from HealthCareSectorinindia-AnOverview.pdf
Total loaded documents from all PDFs: 36
Creating new vector store in tk.db
{'context': [Document(metadata={'title': '', 'source': 'HealthCareSectorinindia-AnOverview.pdf', 'producer': 'Adobe PDF Library 10.0', 'page': 2, 'moddate': '2020-03-22T00:36:54+05:30', 'total_pages': 20, 'modDate': "D:20200322003654+05'30'", 'author': 'Admin', 'creationdate': '2020-03-22T00:36:44+05:30', 'subject': '', 'keywords': '', 'creationDate': "D:20200322003644+05'30'", 'format': 'PDF 1.5', 'trapped': '', 'file_path': 'HealthCareSectorinindia-AnOverview.pdf', 'creator': 'Acrobat PDFMaker 10.1 for Word'}, page_content='Large companies and affluent individuals have started five star hospitals which dominate the \nspace for high end market. The private sector has made tremendous progress, but on the flip side it is \nalso responsible for increasing i

### Handling Queries Not Covered in the Context

If the query isn't covered in the context, the model should respond:

> **"This query is not as per the PDF."**

This checks that the prompt keeps the model from answering beyond the retrieved context.



In [37]:
response_option = pdf_chatbot_pipeline(pdf_paths, "theory of relativity",persist_directory="multi_pdf_eval.db")
print(response_option)
print("\n AI-Generated Response:\n")
print("-" * 80)
print(response_option["AI_generated_response"])

No existing embeddings found. Processing PDFs...
Loaded 16 chunks from digital_transformation.pdf
Loaded 20 chunks from HealthCareSectorinindia-AnOverview.pdf
Total loaded documents from all PDFs: 36
Creating new vector store in multi_pdf_eval.db
{'context': [Document(metadata={'creationDate': "D:20200322003644+05'30'", 'subject': '', 'creationdate': '2020-03-22T00:36:44+05:30', 'producer': 'Adobe PDF Library 10.0', 'modDate': "D:20200322003654+05'30'", 'trapped': '', 'creator': 'Acrobat PDFMaker 10.1 for Word', 'title': '', 'source': 'HealthCareSectorinindia-AnOverview.pdf', 'moddate': '2020-03-22T00:36:54+05:30', 'total_pages': 20, 'file_path': 'HealthCareSectorinindia-AnOverview.pdf', 'author': 'Admin', 'page': 3, 'keywords': '', 'format': 'PDF 1.5'}, page_content='patient, pulse and diagnosis and clinical history. \nYoga is a science as well an art of healthy living physically, mentally, morally and spiritually. \nYoga is believed to be founded by saints and sages of India several 

### Purpose of the Metadata and Content Extractor Function

This function is designed to **process a RAG-style context**, which typically includes a list of `Document` objects.

Each `Document` object usually contains:
- `metadata`: Information such as the source file, file path, and page number.
- `page_content`: The actual text content of the chunk.

### What the Function Does

- It **extracts** the `metadata` and `page_content` of each document.
- It **keeps** the `file_path`, `source`, and `page` fields and strips surrounding whitespace from the text.
- It **prints** each chunk and **returns a list** of dictionaries, one per chunk, with its metadata and content.

### Why It's Useful

This is especially helpful when:
- You want to visualize or inspect what the retriever has returned.
- You need to use this data for evaluation or debugging.
- You want to track which parts of the source documents contributed to the final answer.


In [38]:
def extract_rag_metadata(docs):
    """Print and return the file path, source, page, and text of each retrieved chunk."""
    extracted = []
    for doc in docs:
        meta = doc.metadata
        chunk_data = {
            "file_path": meta.get("file_path", "N/A"),
            "source": meta.get("source", "N/A"),
            "page": meta.get("page", "N/A"),
            "chunk": doc.page_content.strip()
        }
        extracted.append(chunk_data)

    # Optional pretty print
    for i, entry in enumerate(extracted, 1):
        print(f"\n Chunk {i}")
        print(f" a-File Path : {entry['file_path']}")
        print(f" b-Source    : {entry['source']}")
        print(f" c-Page No.  : {entry['page']}")
        print(f" d-Chunks   :\n{entry['chunk']}")

    return extracted


In [39]:
extract_rag_metadata(response['context'])


 Chunk 1
 a-File Path : HealthCareSectorinindia-AnOverview.pdf
 b-Source    : HealthCareSectorinindia-AnOverview.pdf
 c-Page No.  : 2
 d-Chunks   :
Large companies and affluent individuals have started five star hospitals which dominate the 
space for high end market. The private sector has made tremendous progress, but on the flip side it is 
also responsible for increasing inequality in healthcare sector. The private should be more socially 
relevant and efforts must be made to make private sector accessible to the weaker section of society. 
Health care system in India 
Traditional Healthcare Systems in India

 Chunk 2
 a-File Path : HealthCareSectorinindia-AnOverview.pdf
 b-Source    : HealthCareSectorinindia-AnOverview.pdf
 c-Page No.  : 17
 d-Chunks   :
Shortage of trained medical personnel : India faces a huge shortage of trained medical personnel, 
including doctors, nurses and especially paramedics, who may be more willing than doctors to live in 
rural areas where access to 

### Accessing Page Content from Retrieved Chunks

We extract relevant **retrieved chunks** based on the query.  
From the response context, we access the `page_content` of each document to get the actual text data used for generating answers.


In [40]:
retrieved_context = [doc.page_content for doc in response['context']]
retrieved_context

['Large companies and affluent individuals have started five star hospitals which dominate the \nspace for high end market. The private sector has made tremendous progress, but on the flip side it is \nalso responsible for increasing inequality in healthcare sector. The private should be more socially \nrelevant and efforts must be made to make private sector accessible to the weaker section of society. \nHealth care system in India \nTraditional Healthcare Systems in India',
 'Shortage of trained medical personnel : India faces a huge shortage of trained medical personnel, \nincluding doctors, nurses and especially paramedics, who may be more willing than doctors to live in \nrural areas where access to care is limited. There is an immediate need for medical education and \ntraining, which could provide additional opportunities for private sector providers or public-private-\npartnerships (PPP). \nSystemic Constraints \n \nDespite the Central Government’s focus on health issues, a maj

### Human-Written Reference Response

We now provide a **human-written reference answer**. It serves as the expected output against which we **evaluate** whether the **AI-generated response** meets different quality standards.

This comparison allows us to measure key evaluation metrics such as:

- **Answer relevance**
- **Faithfulness**
- **Hallucination detection**

By comparing the AI's output against the human response, we can better understand how well the model performs in real-world scenarios.



In [41]:
human_answer="""MIoT (Medical Internet of Things) improves hospital safety by creating a connected environment where medical devices and systems can communicate seamlessly. This connectivity allows real-time monitoring of patients through biometric sensors and smart devices, which helps detect critical changes in a patient’s condition more quickly. As a result, healthcare providers can respond faster and more accurately. Additionally, MIoT reduces human errors by automating data collection and ensuring that medical information is accurate and readily available across different care settings—from hospital wards to home care. Overall, this leads to better coordination, quicker interventions, and enhanced patient safety."""




### Adding Low-Relevance Chunks for Evaluation

Now, we are going to **deliberately create low-relevance chunks** (written by us) and **add them to the retrieved relevant chunks**. 

This setup allows us to analyze how the presence of **irrelevant or partially relevant information** affects key retrieval evaluation metrics such as:

- **Contextual Precision**
- **Contextual Recall**
- **Contextual Relevancy**

By mixing in low-relevance data, we can better observe how the system handles noise and test its ability to maintain high-quality retrieval performance.


In [42]:
low_relevance_chunks= ["""Many respected institutions and well-regarded medical professionals advertise their services as a way to inform the public about available healthcare options. Over time, physicians and hospitals have increasingly adopted marketing and public relations strategies to connect with their communities and raise awareness of the care they offer. These approaches, while often useful for improving visibility and accessibility, have also led to a noticeable shift in how healthcare services are presented."""
                        ,
                       """tools of operations management, for example, supply chain management, are useful only to a limited extent... These observations, however, have had little impact on the vast cost of consulting fees... Patients who are sick, or worried that they may be sick, generally, are neither capable of understanding their physiological status nor inclined to shop around for bargains... The value of life often far outweighs the consideration of cost... The root of the disequilibrium in healthcare is the heavy, often total, dependence of the patient on the medical practitioner."""]

                        


In [43]:
# Prepend the low-relevance chunks so they occupy the top-ranked positions
retrieved_context_with_noise = low_relevance_chunks + retrieved_context

## Contextual Precision

The **contextual precision** metric measures how well your RAG pipeline’s **retriever** ranks **relevant document chunks** higher than irrelevant ones for a given input query.

In simple terms:  
> Are the most relevant chunks appearing at the top of the retrieved list?

`deepeval` uses a **self-explaining LLM-based evaluation** for this metric. That means it not only returns a score but also provides a **reason** for the score using an LLM as a judge.

### Required Inputs for `ContextualPrecisionMetric` in `deepeval`

When creating an `LLMTestCase`, you need to provide:

- `input`: The user’s query
- `actual_output`: The actual response generated by the LLM (not used for this metric)
- `expected_output`: The expected response (used as reference)
- `retrieval_context`: The top-N retrieved chunks (document nodes) from your vector store

> This metric helps evaluate the **quality of retrieval**, not the generated answer.
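
To make the ranking idea concrete, here is a sketch of the rank-weighted score as we understand it from the DeepEval documentation: for each position `k` that holds a relevant chunk, take the fraction of relevant chunks in the top `k`, then average over the relevant chunks. This is our reading of the documented formula, not DeepEval's source code, and `contextual_precision` is a hypothetical helper. In DeepEval the yes/no verdicts come from the LLM judge; here they are supplied by hand.

```python
def contextual_precision(verdicts):
    """Rank-weighted precision.

    verdicts: list of booleans ordered by rank (index 0 is the top-ranked chunk),
    where True means the chunk was judged relevant.
    """
    relevant_so_far = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(verdicts, start=1):
        if is_relevant:
            relevant_so_far += 1
            weighted_sum += relevant_so_far / k  # precision at rank k
    return weighted_sum / relevant_so_far if relevant_so_far else 0.0


# One relevant chunk out of five: ranked last versus ranked first.
bottom_ranked = contextual_precision([False, False, False, False, True])
top_ranked = contextual_precision([True, False, False, False, False])
print(bottom_ranked, top_ranked)  # 0.2 1.0
```

The same single relevant chunk yields 0.2 when it sits in fifth place and 1.0 when it sits first, which is why this metric rewards retrievers that rank relevant chunks early.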




### Evaluating Contextual Precision with DeepEval

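In this step, we define a test case using the original question, the AI-generated response, the human-written reference answer, and the retrieved context. This forms the basis for evaluating how well the model performed on a specific query.

We then apply the **Contextual Precision** metric, which checks whether the most relevant document chunks appear higher in the retrieval list. The `threshold` is the minimum score the metric must reach for the test case to pass (0.6 here); it does not decide which individual chunks count as relevant, since the LLM judge gives each chunk a yes/no verdict. We also enable the judge's reason and verbose output.

Because the judge weighs each chunk against the expected output, the reference answer should address the same question as the input. In the run below, the reference answer is about MIoT while the query asks about challenges in Indian healthcare, so the verdicts mark most chunks as irrelevant for not mentioning MIoT.

Finally, we run the evaluation to see how accurately the retriever prioritized relevant information. This helps us understand the quality of retrieval and how it impacts the final answer.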


In [44]:

test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    expected_output=human_answer,
    retrieval_context=retrieved_context
)

metric = ContextualPrecisionMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The document discusses the inequality caused by private hospitals and the need for social relevance, but does not mention MIoT, connectivity, or technological solutions related to improving healthcare safety."
    },
    {
        "verdict": "no",
        "reason": "This document focuses on the shortage of trained medical personnel and systemic constraints, which, while relevant challenges, do not relate to MIoT or technological improvements mentioned in the expected output."
    },
    {
        "verdict": "no",
        "reason": "The text talks about the need for better health systems and managing non-communicable diseases, but does not address MIoT, real-time monitoring, device connectivity, or automation as described in the expected output."
    },
    {
        "verdict": "no",
       

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:04,  4.24s/test case]



Metrics Summary

  - ❌ Contextual Precision (score: 0.2, threshold: 0.6, strict: False, evaluation model: azure-gpt4o-mini, reason: The score is 0.20 because the first four nodes in retrieval contexts, ranked highest, are irrelevant as they focus on issues like inequality, personnel shortages, and non-technological challenges, as shown by the repeated emphasis on lack of MIoT or connectivity. Only the fifth node, ranked lowest, is relevant since it addresses technical infrastructure and automation challenges fitting the input. This mismatch in ranking lowers the score., error: None)

For test case:

  - input: What are the main challenges facing the healthcare sector in India today?
  - actual output: The main challenges facing the healthcare sector in India today, based on the provided context, include:

1. Increasing inequality in healthcare due to the dominance of private sector five-star hospitals catering to the high-end market, making private healthcare less accessible to weake




In [45]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: False
Score: 0.2
Reason: The score is 0.20 because the first four nodes in retrieval contexts, ranked highest, are irrelevant as they focus on issues like inequality, personnel shortages, and non-technological challenges, as shown by the repeated emphasis on lack of MIoT or connectivity. Only the fifth node, ranked lowest, is relevant since it addresses technical infrastructure and automation challenges fitting the input. This mismatch in ranking lowers the score.


### Observation:

This evaluation shows a failed retrieval ranking. On the **Contextual Precision** metric, the system scored **0.2**, well below the **0.6** threshold. The judge marked the **first four retrieved chunks as irrelevant** to the expected output and only the **fifth, lowest-ranked chunk as relevant**.

---

#### Why It Failed:
- The top-ranked chunks covered **healthcare inequality driven by private hospitals**, the **shortage of trained medical personnel**, and **health-system needs such as managing non-communicable diseases**.
- The judge measures relevance against the **expected output**, which (per the verdict reasons) describes **MIoT, real-time monitoring, device connectivity, and automation**. None of the top four chunks mention these topics.
- Only the fifth chunk, on **technical infrastructure and automation challenges**, was accepted as relevant.
- The test case itself appears mismatched: the `input` asks about the main challenges facing healthcare in India, while the reference answer is about MIoT. The retrieved chunks suit the question but not the reference, so the low score likely reflects this pairing as much as the retriever.

---

#### Insight:
Contextual precision rewards relevant chunks only when they are ranked early. With a single relevant chunk in fifth place out of five, the score works out to 1/5 = 0.2. **Ranking order matters** as much as content relevance.

---

#### Outcome:
- **Score:** 0.2 (Below Threshold)
- **Verdicts:** 1 relevant chunk out of 5, ranked last
- **Impact:** The test failed because every top-ranked position was occupied by content the judge considered off-topic

---

#### Takeaway:
To improve contextual precision:
- Apply **reranking mechanisms** to ensure that the **most relevant chunks appear first**.
- Monitor retrieval output for **topic drift**, especially in the top-ranking positions.
- Aim for a top-k list where **every chunk contributes directly** to the expected answer.
- Check that the `expected_output` answers the same question as the `input` before interpreting the score.

This approach helps maximize grounding quality and ensures **high-relevance, low-noise** generation in RAG pipelines.

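As a rough illustration of the reranking idea, the sketch below reorders chunks by token overlap with the query. This is a toy stand-in, not the notebook's method: production systems typically use a cross-encoder or a similar learned scorer. The helper names and the example strings are made up for illustration.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank_by_overlap(chunks: List[str], query: str) -> List[str]:
    """Order chunks by how many distinct tokens they share with the query."""
    query_tokens = _tokens(query)
    # sorted() is stable, so chunks with equal overlap keep the retriever's order.
    return sorted(
        chunks,
        key=lambda chunk: len(_tokens(chunk) & query_tokens),
        reverse=True,
    )


# Hypothetical chunks in the order a retriever might return them.
chunks = [
    "Hospitals use marketing campaigns to raise awareness.",
    "Connected medical devices enable real-time patient monitoring.",
]
query = "How do connected devices support patient monitoring?"

for chunk in rerank_by_overlap(chunks, query):
    print(chunk)
```

After reranking, the device-monitoring chunk moves ahead of the marketing chunk, which is the kind of reordering that raises contextual precision.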

### Testing Contextual Precision with Injected Noise Using DeepEval

In this step, we inject noisy chunks at the top of the retrieved context to see how they affect contextual precision. DeepEval's LLM judge rates each chunk against the human reference answer, and the metric then checks whether the relevant chunks still rank highest. This shows how sensitive the score is to irrelevant content placed above useful chunks.


In [46]:
# Evaluate with noise
test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    expected_output=human_answer,
    retrieval_context=retrieved_context_with_noise
)

metric = ContextualPrecisionMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "This context discusses marketing and public relations strategies in healthcare, which does not relate to the challenges facing the healthcare sector or Medical Internet of Things (MIoT) solutions described in the expected output."
    },
    {
        "verdict": "no",
        "reason": "This context focuses on patients\u2019 dependence on medical practitioners and the limitations of supply chain management, which is unrelated to MIoT or the challenges discussed in the expected output."
    },
    {
        "verdict": "no",
        "reason": "This passage addresses the rise of private hospitals and healthcare inequality in India, which does not connect to the use of MIoT or the specific challenges like shortage of personnel or connectivity in healthcare."
    },
    {
        "verdict": "yes

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:06,  6.44s/test case]



Metrics Summary

  - ❌ Contextual Precision (score: 0.33730158730158727, threshold: 0.6, strict: False, evaluation model: azure-gpt4o-mini, reason: The score is 0.34 because several nodes ranked in the top positions of retrieval contexts are irrelevant, such as the first node discussing marketing ('This context discusses marketing and public relations strategies in healthcare') at rank 1 and the second node focusing on patient dependence and supply chains at rank 2, which lowers overall precision. However, the score is not lower because relevant nodes appear from rank 4 onwards, addressing key challenges like 'Shortage of trained medical personnel' (rank 4), concerns about 'affordability, accessibility, and low healthcare provider ratio' (rank 6), and technical infrastructure gaps (rank 7), showing some prioritization of pertinent information despite the presence of irrelevant nodes above., error: None)

For test case:

  - input: What are the main challenges facing the healthcare se




In [47]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: False
Score: 0.33730158730158727
Reason: The score is 0.34 because several nodes ranked in the top positions of retrieval contexts are irrelevant, such as the first node discussing marketing ('This context discusses marketing and public relations strategies in healthcare') at rank 1 and the second node focusing on patient dependence and supply chains at rank 2, which lowers overall precision. However, the score is not lower because relevant nodes appear from rank 4 onwards, addressing key challenges like 'Shortage of trained medical personnel' (rank 4), concerns about 'affordability, accessibility, and low healthcare provider ratio' (rank 6), and technical infrastructure gaps (rank 7), showing some prioritization of pertinent information despite the presence of irrelevant nodes above.

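The logged score can be reproduced by hand. The sketch below assumes DeepEval's weighted cumulative precision formula and assumes that the only chunks judged relevant are the ones the reason cites, at ranks 4, 6, and 7 of the 7-chunk list.

```python
from fractions import Fraction

# Ranks judged relevant, as cited in the reason above (assumed: no others).
relevant_ranks = [4, 6, 7]

# Precision at each relevant rank: relevant chunks seen so far / rank.
terms = [Fraction(seen, rank) for seen, rank in enumerate(relevant_ranks, start=1)]

# Average the per-rank precisions over the number of relevant chunks.
score = sum(terms) / len(relevant_ranks)

print(terms)         # [Fraction(1, 4), Fraction(1, 3), Fraction(3, 7)]
print(score)         # 85/252
print(float(score))  # approximately 0.3373
```

The result agrees with the reported 0.3373, which supports reading the score as a rank-weighted average and not a simple ratio of relevant chunks.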

### Observation:

This evaluation highlights a failure in retrieval ranking. On the **Contextual Precision** metric, the system scored **0.34** (0.337), below the 0.6 threshold, so the test failed. Relevant information was present in the context, but it was ranked beneath irrelevant chunks.

---

#### Why It Failed:
- The **top three positions** held chunks the judge marked irrelevant:
  - Marketing and public relations strategies in healthcare (rank 1).
  - Patients' dependence on medical practitioners and supply chain limitations (rank 2).
  - The rise of private hospitals and healthcare inequality in India (rank 3).
- The **relevant chunks** cited in the reason appeared only at:
  - Rank 4: shortage of trained medical personnel.
  - Rank 6: affordability, accessibility, and a low healthcare provider ratio.
  - Rank 7: technical infrastructure gaps.
- The score is higher than the 0.2 from the run without noise because the judge accepted chunks it had rejected before (for example, the personnel-shortage chunk). This reflects a change in the judge's verdicts, not an improvement in retrieval.

---

#### Insight:
Even when **relevant content exists**, **poor chunk ordering** leads to low contextual precision. High-ranking irrelevant nodes reduce the model's ability to ground its response early and accurately, which is critical for trust in RAG systems.

---

#### Outcome:
- **Score:** 0.34 (Below Threshold)
- **Verdicts:** 3 relevant chunks out of 7 according to the reason, all ranked fourth or lower
- **Impact:** The system failed to maintain a high signal-to-noise ratio at the top of the list, undermining response quality and grounding.

---

#### Takeaway:
To avoid this failure in future cases:
- Use **reranking or filtering mechanisms** to push high-relevance chunks to the top of the retrieval list.
- Ensure that **the most relevant nodes appear within the first few positions**, as early grounding is key to high contextual precision.
- Audit and refine chunk selection strategies to reduce the inclusion of off-topic content like marketing or general commentary.

Achieving strong contextual precision requires not just relevant content, but also **correct prioritization** of that content in the retrieval pipeline.


### Limiting Scoring to Top-k Retrieved Context Chunks (Precision@k)

To evaluate only the **top-k** retrieved context chunks—such as the top 3—instead of scoring all retrieved chunks, you can use **Precision@k**.

This method focuses on the highest-ranked chunks, which usually have the greatest impact on the model’s response.

#### Option 1: Manually Pre-trim the `retrieval_context`

Before passing the `retrieval_context` to the `LLMTestCase`, trim the list to include only the top `k` chunks. This simulates a real-world scenario where the model only uses the most relevant information.

In our case, we applied this top-3 cut after deliberately placing noisy chunks at the top of the retrieved context. The top 3 therefore contains the injected noise plus the first originally retrieved chunk, and none of these was judged relevant, so the score drops to 0.0.

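For comparison, the classic unweighted Precision@k is simply the share of relevant chunks among the top `k`. The sketch below is a plain-Python illustration and is not what `ContextualPrecisionMetric` computes internally (that score is rank-weighted). The verdict list is hypothetical, chosen so that the first relevant chunk sits at rank 4.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int) -> float:
    """Fraction of the top-k ranked chunks that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevance[:k]
    return sum(top_k) / len(top_k) if top_k else 0.0


# Hypothetical verdicts for a 7-chunk list; True means "judged relevant".
verdicts = [False, False, False, True, False, True, True]

for k in (3, 5, 7):
    print(k, round(precision_at_k(verdicts, k), 3))
# 3 0.0
# 5 0.2
# 7 0.429
```

A small `k` is unforgiving: if noise occupies the first three positions, nothing ranked below them can rescue the score.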


In [48]:
k = 3

test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    expected_output=human_answer,
    retrieval_context=retrieved_context_with_noise[:k]
)

metric = ContextualPrecisionMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])


Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The context discusses marketing strategies by physicians and hospitals to raise awareness of healthcare services, which is unrelated to the specific challenges or technological improvements like MIoT in healthcare."
    },
    {
        "verdict": "no",
        "reason": "This context talks about operational management, consulting fees, and patient dependence on medical practitioners, but does not mention technological solutions or challenges like those addressed by MIoT."
    },
    {
        "verdict": "no",
        "reason": "While this passage mentions inequality in healthcare and private sector challenges, it does not discuss MIoT, digital connectivity, or improvements in patient safety and monitoring described in the expected output."
    }
]
 
Score: 0
Reason: The score is 0.00 becau

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:03,  3.84s/test case]



Metrics Summary

  - ❌ Contextual Precision (score: 0.0, threshold: 0.6, strict: False, evaluation model: azure-gpt4o-mini, reason: The score is 0.00 because all nodes in retrieval contexts are ranked higher despite being irrelevant; for example, the first node discusses 'marketing strategies by physicians and hospitals' which is unrelated, and the third node mentions 'inequality in healthcare and private sector challenges' without covering the main challenges specifically. Since none of the nodes directly address the input about the main challenges facing the healthcare sector in India, irrelevant nodes dominate all top ranks., error: None)

For test case:

  - input: What are the main challenges facing the healthcare sector in India today?
  - actual output: The main challenges facing the healthcare sector in India today, based on the provided context, include:

1. Increasing inequality in healthcare due to the dominance of private sector five-star hospitals catering to the high-en




In [49]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)


Success: False
Score: 0.0
Reason: The score is 0.00 because all nodes in retrieval contexts are ranked higher despite being irrelevant; for example, the first node discusses 'marketing strategies by physicians and hospitals' which is unrelated, and the third node mentions 'inequality in healthcare and private sector challenges' without covering the main challenges specifically. Since none of the nodes directly address the input about the main challenges facing the healthcare sector in India, irrelevant nodes dominate all top ranks.


### Observation:

In this test, we evaluated only the **top 3 retrieved chunks** (**Precision@3**) to assess how well the system ranks the most relevant content.

#### Key Findings:
- The **Contextual Precision score was 0.0**, far below the **0.6 threshold**: the judge marked **all three** top-ranked chunks as irrelevant.
- The **top two chunks** discussed **healthcare marketing** and **operational management topics** (consulting fees and patient dependence on practitioners).
- The **third chunk** covered **healthcare inequality and private-sector challenges**, which the judge found unrelated to the MIoT-focused expected output.
- In the full noisy list, the chunks judged relevant sat at rank 4 and below, so the top-3 cut removed every one of them.

#### Insight:
This evaluation shows how poor chunk ranking can hurt retrieval quality even if relevant content exists further down the list. Precision@3 is especially sensitive to **early misrankings**, and the result emphasizes the need for better **semantic relevance filtering and ranking mechanisms** in the RAG retriever.


> ## **Important Note**
> 
> **We are evaluating this result using the `deepeval` library, which uses an LLM to act as a judge.**
> 
> **At the backend, we are using `gpt-4o-mini` (Azure-hosted) to perform the evaluation.**
> 
> **Because this involves an LLM's reasoning, there is a high probability that you might get slightly different results when you re-run the code — even with the same inputs.**
> 
> **This variability happens because LLM-based evaluations can be non-deterministic by nature. Small differences in phrasing or internal model behavior can influence how it interprets relevance, alignment, or context.**
> 
> **Therefore, while these evaluations provide valuable insights, treat individual scores as part of a broader trend rather than absolute judgments.**

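One way to act on this note is to repeat an evaluation several times and look at the spread instead of a single score. The helper below is a sketch: `summarize_runs` and `run_once` are hypothetical names, and the scores in the usage example are placeholder values, not results from this notebook. In practice `run_once` might call `metric.measure(test_case)` and return `metric.score`.

```python
import statistics
from typing import Callable, Dict, List, Union


def summarize_runs(
    run_once: Callable[[], float], n_runs: int = 5
) -> Dict[str, Union[float, List[float]]]:
    """Call an evaluation function n_runs times and summarize the scores."""
    if n_runs < 2:
        raise ValueError("n_runs must be at least 2 to compute a spread")
    scores = [run_once() for _ in range(n_runs)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }


# Placeholder scores standing in for repeated LLM-judge runs.
placeholder_scores = iter([0.2, 0.34, 0.2, 0.25, 0.2])
summary = summarize_runs(lambda: next(placeholder_scores), n_runs=5)
print(summary["min"], summary["max"])
```

A wide gap between the minimum and maximum suggests that a single pass or fail verdict should not be over-interpreted.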

## Contextual Recall

The **contextual recall** metric evaluates how well your RAG pipeline’s **retriever** supports the **expected answer**.  
It measures the extent to which the `retrieval_context` aligns with the `expected_output`.

In other words:  
> Did the retriever include the necessary information to answer the question accurately?

`deepeval` uses a **self-explaining LLM-based evaluation** for this metric, where an LLM acts as a judge and explains the score.

### Required Inputs for `ContextualRecallMetric` in `deepeval`

When creating an `LLMTestCase`, provide the following:

- `input`: The original user query (not used in this metric)
- `actual_output`: The AI-generated response (also not used)
- `expected_output`: The reference human-written answer
- `retrieval_context`: The document chunks retrieved from your vector store

This metric helps ensure your retriever is pulling **all the essential context** needed to answer correctly—even if not perfectly ranked.

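The arithmetic behind the score is simple. As a minimal sketch, assuming the definition in the DeepEval documentation (statements from the expected output that can be attributed to the retrieval context, divided by all statements), it can be written as follows. The function name and the verdict lists are illustrative only.

```python
from typing import Sequence


def contextual_recall(supported: Sequence[bool]) -> float:
    """Share of expected-output statements attributable to the retrieval context."""
    if not supported:
        return 0.0
    return sum(supported) / len(supported)


# Five statements, none supported by any retrieved chunk.
print(contextual_recall([False] * 5))  # 0.0
# Five statements, three of them supported.
print(contextual_recall([True, True, False, True, False]))  # 0.6
```

Because only the presence of support counts, the order of the retrieved chunks has no effect on this metric.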



### Evaluating Contextual Recall with DeepEval

In this step, we define a test case that includes the input question, the AI-generated response, the human-written reference answer, and the top retrieved context.

We then apply the **Contextual Recall** metric to evaluate whether the retrieved chunks contain enough information to support the expected (human) answer.

The model compares the **retrieval context** against the **expected output** to see how much relevant content was captured. It also provides a detailed explanation (reason) for the score.

This helps us understand how **complete and helpful** the retriever is in supplying the necessary context to answer the query effectively.


In [50]:
test_case1 = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    expected_output=human_answer,
    retrieval_context=retrieved_context
)



metric = ContextualRecallMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case1], [metric])

Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

**************************************************
Contextual Recall Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "None of the nodes mention MIoT or Medical Internet of Things improving hospital safety or device connectivity."
    },
    {
        "verdict": "no",
        "reason": "No reference to real-time patient monitoring, biometric sensors, or smart devices found in nodes."
    },
    {
        "verdict": "no",
        "reason": "No mention of healthcare providers responding faster or more accurately in the retrieval context."
    },
    {
        "verdict": "no",
        "reason": "No attribution to MIoT reducing human errors or automating data collection and ensuring data availability."
    },
    {
        "verdict": "no",
        "reason": "No statements about better coordination, quicker interventions, or enhanced patient safety linked to MIoT."
    }
]
 
Score: 0.0
Reason: The score is 0.00 bec

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:03,  3.52s/test case]



Metrics Summary

  - ❌ Contextual Recall (score: 0.0, threshold: 0.6, strict: False, evaluation model: azure-gpt4o-mini, reason: The score is 0.00 because none of the nodes in retrieval context mention MIoT, hospital safety improvements, device connectivity, patient monitoring, or healthcare response, resulting in no support for any sentences in the expected output., error: None)

For test case:

  - input: What are the main challenges facing the healthcare sector in India today?
  - actual output: The main challenges facing the healthcare sector in India today, based on the provided context, include:

1. Increasing inequality in healthcare due to the dominance of private sector five-star hospitals catering to the high-end market, making private healthcare less accessible to weaker sections of society.

2. A significant shortage of trained medical personnel, including doctors, nurses, and especially paramedics, with a particular need for personnel willing to serve in rural areas wher




In [51]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: False
Score: 0.0
Reason: The score is 0.00 because none of the nodes in retrieval context mention MIoT, hospital safety improvements, device connectivity, patient monitoring, or healthcare response, resulting in no support for any sentences in the expected output.


### Observation:

This evaluation shows a complete recall failure. On the **Contextual Recall** metric, the system scored **0.0** against a 0.6 threshold: **none of the five statements** the judge extracted from the expected output could be attributed to the retrieved content.

---

#### Why It Failed:
- According to the verdicts, the expected output describes MIoT's contributions to hospital safety, including:
  - Improved hospital safety and device connectivity
  - Real-time patient monitoring with biometric sensors and smart devices
  - Faster, more accurate responses from healthcare providers
  - Reduced human error through automated data collection
  - Better coordination, quicker interventions, and enhanced patient safety

- The retrieved chunks address the question that was asked (the main challenges facing India's healthcare sector) and contain **none of these MIoT claims**.

---

#### Insight:
Contextual recall compares the **expected output** with the **retrieval context**. When the reference answer and the question cover different topics, the retriever can return chunks that suit the question and still score 0.0. The failure here appears to stem from the test-case pairing at least as much as from the retriever.

---

#### Outcome:
- **Score:** 0.0 (Fail)
- **Verdicts:** 0 of 5 expected output statements were grounded in retrieved context.
- **Impact:** The retrieved context provides no traceable support for the reference answer.

---

#### Takeaway:
To achieve high recall:
- Verify that the `expected_output` answers the same question as the `input`.
- Maintain **tight topical alignment** in the document set used for retrieval.
- Use **fine-grained chunking** to preserve semantic completeness of context nodes.
- Ensure **balanced coverage** of all expected answer themes through appropriate indexing and retrieval techniques.

This evaluation highlights how much **accurate, evidence-backed generation** in RAG systems depends on retrieving context that matches the reference answer.


### Testing Contextual Recall with Injected Noise Using DeepEval

In this step, we inject noisy chunks at the top of the retrieved context to see how they affect contextual recall. DeepEval's LLM judge checks whether each statement in the human reference answer can be attributed to at least one retrieved chunk. Unlike contextual precision, this metric does not depend on chunk order, so the test shows whether the supporting information is still present once noise is added.


In [52]:

test_case1 = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    expected_output=human_answer,
    retrieval_context=retrieved_context_with_noise
)



metric = ContextualRecallMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case1], [metric])

Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

**************************************************
Contextual Recall Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "None of the retrieval nodes mention MIoT or Medical Internet of Things or connected medical devices."
    },
    {
        "verdict": "no",
        "reason": "No reference to real-time patient monitoring or biometric sensors in any of the retrieval nodes."
    },
    {
        "verdict": "no",
        "reason": "No content in retrieval nodes about healthcare providers responding faster or more accurately due to device connectivity."
    },
    {
        "verdict": "no",
        "reason": "None of the retrieval nodes discuss automation of data collection or reduction of human errors via MIoT."
    },
    {
        "verdict": "no",
        "reason": "No mention in retrieval nodes on improved coordination, quicker interventions, or enhanced patient safety through MIoT."
    }
]
 
Score: 0.0
Reaso

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:04,  4.44s/test case]



Metrics Summary

  - ❌ Contextual Recall (score: 0.0, threshold: 0.6, strict: False, evaluation model: azure-gpt4o-mini, reason: The score is 0.00 because none of the sentences in the expected output are supported by any information from the retrieval context nodes, which lack any mention of MIoT, connected devices, real-time monitoring, or related healthcare improvements., error: None)

For test case:

  - input: What are the main challenges facing the healthcare sector in India today?
  - actual output: The main challenges facing the healthcare sector in India today, based on the provided context, include:

1. Increasing inequality in healthcare due to the dominance of private sector five-star hospitals catering to the high-end market, making private healthcare less accessible to weaker sections of society.

2. A significant shortage of trained medical personnel, including doctors, nurses, and especially paramedics, with a particular need for personnel willing to serve in rural are




In [53]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: False
Score: 0.0
Reason: The score is 0.00 because none of the sentences in the expected output are supported by any information from the retrieval context nodes, which lack any mention of MIoT, connected devices, real-time monitoring, or related healthcare improvements.


### Observation:

This evaluation repeats the recall test with injected noise. On the **Contextual Recall** metric, the system again scored **0.0**, below the threshold of 0.6. The result is unchanged from the run without noise: none of the statements in the expected output is supported by any retrieved chunk.

---

#### Why It Stayed at Zero:
- The judge found **no mention** in any retrieved node of:
  - MIoT or connected medical devices
  - Real-time patient monitoring or biometric sensors
  - Faster or more accurate responses due to device connectivity
  - Automated data collection or reduced human error
  - Improved coordination, quicker interventions, or enhanced patient safety
- The **injected chunks** (healthcare marketing and patient dependence on practitioners) add no MIoT content, so they cannot raise the score.
- Contextual recall counts a statement as supported if **any** chunk supports it, so adding noise cannot lower a score that is already 0.0.

---

#### Insight:
Contextual recall measures whether the needed information is **present**, not where it is ranked. Noise at the top of the list hurts contextual precision, but it changes recall only if it displaces chunks that carried support. Here no chunk carried support in the first place.

---

#### Outcome:
- **Score:** 0.0 (Fail)
- **Verdicts:** 0 supported, 5 unsupported
- **Impact:** Same result as the run without noise

---

#### Takeaway:
To maintain consistently high recall:
- Avoid including **semantically unrelated or overly general content** in the retrieval set.
- Ensure that the context **covers all facets** of the expected answer.
- Monitor retrieval pipelines for **contextual completeness** as well as relevance.

This test highlights that contextual recall depends on **coverage of the expected answer**, which makes precise, comprehensive document retrieval essential in RAG systems.


### Recall@k (Top-k Context Evaluation)

To focus only on the top-k retrieved chunks with noise (e.g., top 3), we use **Recall@k**.

You can do this by **slicing the retrieval context** before passing it to the test case.

We tested this with **k = 3** to see if the top 3 chunks alone cover the expected answer.

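To see why slicing can only keep or lower recall, consider a sketch where each chunk is tagged with the indices of the expected-output statements it supports. The support map below is hypothetical data for illustration; in the notebook this attribution is done by the LLM judge, and `recall_at_k` is not a DeepEval function.

```python
from typing import List, Set


def recall_at_k(support_by_chunk: List[Set[int]], n_statements: int, k: int) -> float:
    """Share of statements supported by at least one of the top-k chunks."""
    if n_statements <= 0:
        raise ValueError("n_statements must be positive")
    covered: Set[int] = set().union(*support_by_chunk[:k])
    return len(covered) / n_statements


# Hypothetical: 7 ranked chunks, 5 statements; the first three chunks are noise.
support = [set(), set(), set(), {0, 1}, {2}, set(), {3, 4}]

for k in (3, 5, 7):
    print(k, recall_at_k(support, n_statements=5, k=k))
# 3 0.0
# 5 0.6
# 7 1.0
```

Each additional chunk can only add covered statements, so recall is non-decreasing in `k`, and a top-3 cut that contains only noise scores 0.0.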

In [54]:
k = 3

test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    expected_output=human_answer,
    retrieval_context=retrieved_context_with_noise[:k]
)

metric = ContextualRecallMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

**************************************************
Contextual Recall Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "None of the retrieval context nodes mention MIoT, connected environments, or communication among medical devices."
    },
    {
        "verdict": "no",
        "reason": "The retrieval context does not reference real-time monitoring, biometric sensors, or smart devices."
    },
    {
        "verdict": "no",
        "reason": "No attribution in retrieval nodes about healthcare providers responding faster or more accurately due to technology."
    },
    {
        "verdict": "no",
        "reason": "Retrieval context lacks mention of automation, data collection, or reducing human error."
    },
    {
        "verdict": "no",
        "reason": "No reference to improved coordination, quicker interventions, or enhanced patient safety in any retrieval node."
    }
]
 
Score: 0.0
Reason: The score 

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:03,  3.38s/test case]



Metrics Summary

  - ❌ Contextual Recall (score: 0.0, threshold: 0.6, strict: False, evaluation model: azure-gpt4o-mini, reason: The score is 0.00 because none of the sentences in the expected output are supported by any information from the nodes in retrieval context; there is no mention of MIoT, connected devices, real-time monitoring, or improvements in patient safety in the retrieval nodes., error: None)

For test case:

  - input: What are the main challenges facing the healthcare sector in India today?
  - actual output: The main challenges facing the healthcare sector in India today, based on the provided context, include:

1. Increasing inequality in healthcare due to the dominance of private sector five-star hospitals catering to the high-end market, making private healthcare less accessible to weaker sections of society.

2. A significant shortage of trained medical personnel, including doctors, nurses, and especially paramedics, with a particular need for personnel willing




In [55]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: False
Score: 0.0
Reason: The score is 0.00 because none of the sentences in the expected output are supported by any information from the nodes in retrieval context; there is no mention of MIoT, connected devices, real-time monitoring, or improvements in patient safety in the retrieval nodes.


### Observation:

This evaluation demonstrates a complete loss of contextual grounding. On the **Contextual Recall** metric, the system scored **0.0**, well below the threshold of 0.6, resulting in a failed test. None of the sentences in the expected answer could be attributed to the retrieval context.

---

#### Why It Failed:
- The expected answer describes **MIoT**, **connected devices**, **real-time monitoring**, and **patient-safety improvements**, none of which are mentioned in the retrieval nodes.
- Every verdict shown in the verbose logs is **"no"**: the judge could not attribute any expected sentence to a retrieval node.
- Key claims around **faster and more accurate responses**, **automated data collection**, **reduced human error**, and **improved coordination** were completely unsupported.
- The presence of **off-topic chunks**, such as healthcare marketing and operational commentary, diluted the signal further.

---

#### Insight:
This result shows how **missing semantic coverage and injected noise** degrade contextual recall. When the retrieved chunks do not cover the topics of the expected answer, no individual chunk can carry the grounding for a multi-faceted answer.

---

#### Outcome:
- **Score:** 0.0 (Fail)
- **Verdicts:** none of the expected sentences were supported
- **Impact:** All answer dimensions were ungrounded due to missing context and irrelevant nodes.

---

#### Takeaway:
To maintain high contextual recall:
- Ensure retrieval returns **diverse, topically rich chunks** that collectively support the full expected answer.
- Apply **semantic filtering** or **reranking** to demote irrelevant content.
- Avoid over-reliance on singular nodes for complex, multi-point answers.

Even partial noise or missing coverage can cause significant drops in recall score.
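
The arithmetic behind a recall score can be sketched in a few lines. This is a minimal illustration, assuming the score is the fraction of expected-answer sentences that the judge marks as supported (`"yes"`); the function name `contextual_recall` and the verdict lists are hypothetical and are not part of `deepeval`.

```python
from typing import Dict, List


def contextual_recall(verdicts: List[Dict[str, str]]) -> float:
    """Share of expected-answer sentences the judge marked as supported ("yes")."""
    if not verdicts:
        return 0.0
    supported = sum(
        1 for v in verdicts if v.get("verdict", "").strip().lower() == "yes"
    )
    return supported / len(verdicts)


# Hypothetical verdict lists, shaped like the entries in the verbose logs.
partially_grounded = [{"verdict": "yes"}] * 3 + [{"verdict": "no"}] * 4
ungrounded = [{"verdict": "no"}] * 3

print(round(contextual_recall(partially_grounded), 2))  # 3 of 7 supported
print(contextual_recall(ungrounded))                    # nothing supported
```

Because every unsupported sentence counts equally, a single well-aligned chunk cannot lift the score if most expected sentences have no matching context.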

---

### Comparative Chunk Analysis

| Chunk Type                     | MIoT Signal Present | Noise Present | Description |
|-------------------------------|---------------------|---------------|-------------|
| **Fully Relevant**            | Yes                 | No            | Clean, on-topic chunks that directly support major claims like device interoperability and error prevention. |
| **Relevant with Noise**       | Yes                 | Yes           | Partial grounding; includes valid MIoT content but mixed with unrelated details (e.g., healthcare marketing). |
| **Top 3 Chunks (with Noise)** | No or Minimal       | Yes           | Highly ranked but off-topic chunks; introduce irrelevant information such as consulting economics and patient behavior, reducing overall grounding accuracy. |


> ## **Important Note**
> 
> **We are evaluating this result using the `deepeval` library, which uses an LLM to act as a judge.**
> 
> **At the backend, we are using `gpt-4o-mini` (Azure-hosted) to perform the evaluation.**
> 
> **Because this involves an LLM's reasoning, there is a high probability that you might get slightly different results when you re-run the code — even with the same inputs.**
> 
> **This variability happens because LLM-based evaluations can be non-deterministic by nature. Small differences in phrasing or internal model behavior can influence how it interprets relevance, alignment, or context.**
> 
> **Therefore, while these evaluations provide valuable insights, treat individual scores as part of a broader trend rather than absolute judgments.**
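
One way to treat scores as a trend is to repeat the same evaluation several times and summarize the results. The sketch below assumes you have already collected the scores from repeated runs into a list; `summarize_runs` and the example scores are hypothetical and are not results from this notebook.

```python
from statistics import mean, pstdev
from typing import Dict, List


def summarize_runs(scores: List[float], threshold: float = 0.6) -> Dict[str, float]:
    """Aggregate scores from repeated evaluations of the same test case."""
    if not scores:
        raise ValueError("at least one score is required")
    passes = sum(1 for s in scores if s >= threshold)
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),  # population standard deviation across runs
        "min": min(scores),
        "max": max(scores),
        "pass_rate": passes / len(scores),
    }


# Hypothetical scores from five re-runs of one metric.
example_scores = [0.55, 0.60, 0.65, 0.70, 0.50]
summary = summarize_runs(example_scores)
print(summary)
```

A mean close to the threshold combined with a noticeable spread suggests that a single pass or fail verdict should not be over-interpreted.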


## Contextual Relevancy

The **contextual relevancy** metric measures how relevant the information in your `retrieval_context` is for answering the given input query.  
It focuses on the **overall quality and usefulness** of the retrieved content, regardless of specific expected answers.

`deepeval` uses a **self-explaining LLM-based evaluation**, where the model not only scores the result but also provides a reason for the score—making the evaluation more transparent.

### Required Inputs for `ContextualRelevancyMetric` in `deepeval`

When creating an `LLMTestCase`, you need to provide:

- `input`: The user’s query  
- `actual_output`: The AI-generated response (not used for this metric)  
- `retrieval_context`: The top-N document chunks retrieved from the vector store

> This metric is useful for assessing the **general relevance** of the retrieved documents—whether or not the final answer is perfect.




### How ContextualRelevancyMetric Handles Context Chunks

`ContextualRelevancyMetric` does not score a chunk as a single unit. The judge breaks the retrieval context into smaller "statements", which usually correspond to individual sentences.

This means that even if you pass entire context chunks to the evaluator, the metric evaluates **each sentence individually** for its relevance to the input query.

As a result, the scoring is more fine-grained and reflects sentence-level relevancy rather than evaluating full chunks as a whole.
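
The statement-level scoring can be illustrated with a small sketch. It assumes the score is the number of statements judged relevant divided by the total number of statements, and that verdicts arrive grouped per chunk in the same shape as the verbose logs. The function name `contextual_relevancy` and the sample data are hypothetical.

```python
from typing import Dict, List


def contextual_relevancy(groups: List[Dict[str, list]]) -> float:
    """Fraction of statements, across all chunks, judged relevant ("yes")."""
    verdicts = [v for group in groups for v in group.get("verdicts", [])]
    if not verdicts:
        return 0.0
    relevant = sum(1 for v in verdicts if v.get("verdict") == "yes")
    return relevant / len(verdicts)


# Hypothetical verdict groups: one dict per retrieved chunk.
groups = [
    {"verdicts": [
        {"statement": "India faces a shortage of trained personnel.", "verdict": "yes"},
        {"statement": "Fig. 1", "verdict": "no"},
    ]},
    {"verdicts": [
        {"statement": "Systemic constraints remain a challenge.", "verdict": "yes"},
        {"statement": "Affordability is a major factor.", "verdict": "yes"},
    ]},
]

print(contextual_relevancy(groups))  # 3 relevant statements out of 4
```

Under this assumption, one long off-topic chunk can contribute many "no" statements and outweigh several short, relevant chunks.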


### Evaluating Contextual Relevancy

We use the input question and the retrieved chunks to check how relevant the context is overall.

The metric ignores the actual and expected answers and focuses only on how well the context supports the input query.

It returns a score and explanation to show if the retrieved content was generally useful.


In [56]:
test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    expected_output=human_answer,
    retrieval_context=retrieved_context
)

metric = ContextualRelevancyMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

**************************************************
Contextual Relevancy Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdicts": [
            {
                "statement": "Large companies and affluent individuals have started five star hospitals which dominate the space for high end market.",
                "verdict": "yes",
                "reason": null
            },
            {
                "statement": "The private sector has made tremendous progress, but on the flip side it is also responsible for increasing inequality in healthcare sector.",
                "verdict": "yes",
                "reason": null
            },
            {
                "statement": "The private should be more socially relevant and efforts must be made to make private sector accessible to the weaker section of society.",
                "verdict": "yes",
                "reason": null
            },
            {
                "statement": "Tr

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:06,  6.55s/test case]



Metrics Summary

  - ✅ Contextual Relevancy (score: 0.7619047619047619, threshold: 0.6, strict: False, evaluation model: azure-gpt4o-mini, reason: The score is 0.76 because while some statements like 'Systemic Constraints remain a major challenge' and 'India faces a huge shortage of trained medical personnel' directly address main challenges, other parts such as references to 'Traditional Healthcare Systems in India' and general introductions do not specify challenges, reducing overall relevancy., error: None)

For test case:

  - input: What are the main challenges facing the healthcare sector in India today?
  - actual output: The main challenges facing the healthcare sector in India today, based on the provided context, include:

1. Increasing inequality in healthcare due to the dominance of private sector five-star hospitals catering to the high-end market, making private healthcare less accessible to weaker sections of society.

2. A significant shortage of trained medical perso




In [57]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 0.7619047619047619
Reason: The score is 0.76 because while some statements like 'Systemic Constraints remain a major challenge' and 'India faces a huge shortage of trained medical personnel' directly address main challenges, other parts such as references to 'Traditional Healthcare Systems in India' and general introductions do not specify challenges, reducing overall relevancy.


### Observation:

This evaluation resulted in a passed test for contextual relevancy. The system scored **0.76**, above the 0.6 threshold. Most retrieved statements aligned well with the query, but part of the context consisted of **off-topic or overly broad content**, which kept the score from being higher.

---

#### Why It Did Not Score Higher:
- The retrieval did include **strong statements** that directly address the query, such as:
  - "Systemic Constraints remain a major challenge"
  - "India faces a huge shortage of trained medical personnel"
  - The private sector's role in increasing inequality in healthcare

- However, **other statements were tangential**, including:
  - References to "Traditional Healthcare Systems in India"
  - General introductions that do not specify any challenge

- These statements lowered the relevance density across the retrieved chunks, although not enough to push the score below the passing threshold.

---

#### Insight:
This outcome highlights a common pattern: **irrelevant or generic statements lower the score even when retrieval is otherwise good**. Contextual relevancy is sensitive to both content quality and topic focus, so every statement should align directly with the query's intent.

---

#### Outcome:
- **Score:** 0.76 (Pass)
- **Relevant statements:** those naming concrete challenges (personnel shortage, systemic constraints, inequality)
- **Off-topic statements:** references to traditional healthcare systems and general introductions
- **Result:** The test passed, but unrelated content reduced the achievable relevancy.

---

#### Takeaway:
To improve contextual relevancy in RAG systems:
- Focus on **retrieving fine-grained, query-specific content**, not general information or forward-looking statements.
- Avoid inclusion of **visual labels, metadata, or ungrounded tech claims** (e.g., “Fig. X” or broad HIoT scenarios).
- Refine chunking and retrieval logic to increase the **signal-to-noise ratio** across all candidate context elements.

Even a few loosely related or noisy statements can cause the system to fail relevancy metrics—especially in high-stakes, multi-sentence answers.
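
To make the signal-to-noise idea concrete, here is a deliberately simple filter that drops chunks sharing too few keywords with the query. It is a lexical stand-in for the semantic filtering or reranking discussed above, not the retrieval logic used in this notebook; `filter_chunks`, `STOPWORDS`, the `min_overlap` value, and the sample chunks are all assumptions made for illustration.

```python
import re
from typing import List, Set

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS: Set[str] = {"what", "are", "the", "in", "a", "of", "for", "and", "to", "is"}


def tokens(text: str) -> Set[str]:
    """Lowercase alphabetic tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def filter_chunks(query: str, chunks: List[str], min_overlap: float = 0.1) -> List[str]:
    """Keep chunks whose keyword overlap with the query meets min_overlap, best first."""
    query_tokens = tokens(query)
    if not query_tokens:
        return list(chunks)
    scored = []
    for chunk in chunks:
        overlap = len(query_tokens & tokens(chunk)) / len(query_tokens)
        if overlap >= min_overlap:
            scored.append((overlap, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored]


query = "What are the main challenges facing the healthcare sector in India today?"
chunks = [
    "Physicians increasingly adopt marketing and public relations strategies.",
    "India faces a huge shortage of trained medical personnel.",
    "Systemic constraints remain a major challenge for the healthcare sector in India.",
]
kept = filter_chunks(query, chunks)
print(kept)
```

Keyword overlap misses paraphrases (for example "challenge" versus "challenges"), which is why an embedding-based reranker is usually preferred in practice.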


### Extracting Relevant Statements

This code parses the verbose logs from the evaluation and lists all statements where the verdict is `"yes"`.

It helps isolate only the **relevant content** from the retrieval context for further analysis or debugging.


In [58]:
import json

# Extract the JSON array that follows the "Verdicts:" marker in the verbose logs
logs = result.test_results[0].metrics_data[0].verbose_logs
json_text = logs.split("Verdicts:\n", 1)[1].strip()

# Parse the verdict groups and collect the statements judged relevant
verdicts = json.loads(json_text)
relevant_statements = [
    v["statement"]
    for group in verdicts
    for v in group.get("verdicts", [])
    if v.get("verdict") == "yes"
]

# Print the relevant statements with numbering
print("Relevant Statements (verdict: yes):\n")
for count, statement in enumerate(relevant_statements, start=1):
    print(f"{count}. {statement.strip()}")


Relevant Statements (verdict: yes):

1. Large companies and affluent individuals have started five star hospitals which dominate the space for high end market.
2. The private sector has made tremendous progress, but on the flip side it is also responsible for increasing inequality in healthcare sector.
3. The private should be more socially relevant and efforts must be made to make private sector accessible to the weaker section of society.
4. India faces a huge shortage of trained medical personnel, including doctors, nurses and especially paramedics, who may be more willing than doctors to live in rural areas where access to care is limited.
5. There is an immediate need for medical education and training, which could provide additional opportunities for private sector providers or public-private-partnerships (PPP).
6. Systemic Constraints remain a major challenge despite the Central Government’s focus on health issues.
7. The priority will be to develop effective and sustainable hea

### Observation: 

The evaluation isolated the statements that directly address the question about the main challenges facing India's healthcare sector.

These statements highlight key themes such as **inequality driven by high-end private hospitals**, **the shortage of trained medical personnel**, **the need for medical education and training**, and **systemic constraints**, all of which reinforce the relevance of the retrieved content in addressing the query.

This confirms that, despite some noise, the retriever was able to surface a solid amount of useful and aligned information.


### Testing Contextual Relevancy with Injected Noise Using DeepEval

In this step, we intentionally inject noisy chunks at the top of the retrieved context to evaluate their impact on contextual relevancy. 

DeepEval then judges each statement in the noisy retrieval context for relevance to the input query. As in the previous run, the actual and expected answers do not affect this metric.

This process shows how much off-topic content the relevancy score can absorb before the test fails.
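
A noisy context list can be assembled with a small helper like the one below. This is a sketch of the idea only: `inject_noise` and the sample strings are hypothetical, and the notebook's own `retrieved_context_with_noise` may have been built differently.

```python
from typing import List


def inject_noise(chunks: List[str], noise: List[str], at_top: bool = True) -> List[str]:
    """Return a new list with noise chunks placed before (or after) the retrieved chunks."""
    return [*noise, *chunks] if at_top else [*chunks, *noise]


# Hypothetical placeholder chunks.
retrieved = ["chunk about personnel shortage", "chunk about systemic constraints"]
noise = ["chunk about hospital marketing"]

noisy = inject_noise(retrieved, noise)
print(noisy)
```

Returning a new list leaves the clean context untouched, so the same retrieval result can be evaluated with and without noise.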


In [59]:
test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    expected_output=human_answer,
    retrieval_context=retrieved_context_with_noise
)

metric = ContextualRelevancyMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

**************************************************
Contextual Relevancy Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdicts": [
            {
                "statement": "Many respected institutions and well-regarded medical professionals advertise their services as a way to inform the public about available healthcare options.",
                "verdict": "no",
                "reason": "This statement focuses on advertising and informing the public, not on challenges facing the healthcare sector."
            },
            {
                "statement": "Over time, physicians and hospitals have increasingly adopted marketing and public relations strategies to connect with their communities and raise awareness of the care they offer.",
                "verdict": "no",
                "reason": "The statement is about marketing strategies rather than challenges in the healthcare sector."
            },
            {
                "st

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:05,  5.79s/test case]



Metrics Summary

  - ✅ Contextual Relevancy (score: 0.6551724137931034, threshold: 0.6, strict: False, evaluation model: azure-gpt4o-mini, reason: The score is 0.66 because while many statements like 'India faces a huge shortage of trained medical personnel' and 'Affordability is a major factor' directly address key challenges in the healthcare sector, several other parts focus on unrelated topics such as marketing strategies and general industry changes, as noted in the reasons for irrelevancy like 'This statement focuses on advertising and informing the public' and 'The statement is about marketing strategies rather than challenges'. This partial mismatch lowers the overall relevancy., error: None)

For test case:

  - input: What are the main challenges facing the healthcare sector in India today?
  - actual output: The main challenges facing the healthcare sector in India today, based on the provided context, include:

1. Increasing inequality in healthcare due to the dominance o




In [60]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 0.6551724137931034
Reason: The score is 0.66 because while many statements like 'India faces a huge shortage of trained medical personnel' and 'Affordability is a major factor' directly address key challenges in the healthcare sector, several other parts focus on unrelated topics such as marketing strategies and general industry changes, as noted in the reasons for irrelevancy like 'This statement focuses on advertising and informing the public' and 'The statement is about marketing strategies rather than challenges'. This partial mismatch lowers the overall relevancy.



### Observation:

This evaluation demonstrates the impact of injected noise on contextual relevancy. When tested with **clean context**, the system passed comfortably, retrieving mostly relevant statements about the challenges facing India's healthcare sector. After **injecting irrelevant content**, the relevancy score **dropped from 0.76 to 0.66**, moving close to the 0.6 threshold and confirming how sensitive the metric is to the quality of context.

---

#### Clean Context (Before Noise Injection)
- **Score:** 0.76  
- **Pass Status:** Pass  
- **Strengths:**  
  - Covered concrete challenges such as personnel shortages, systemic constraints, and inequality in access  
  - Retrieval was generally focused on the query  
- **Weaknesses:**  
  - Minor drift into broad content such as traditional healthcare systems and general introductions  

---

#### Noisy Context (After Injection)
- **Score:** 0.66  
- **Pass Status:** Pass (close to threshold)  
- **Weaknesses Introduced:**  
  - Included marketing and public relations content unrelated to healthcare challenges  
  - Added noise diluted semantic focus and lowered the alignment between context and query  
- **Impact:**  
  - The share of relevant statements fell because of the off-topic additions  
  - Score dropped by roughly 0.11 and now sits only slightly above the minimum threshold for relevancy

---

#### Insight:

Even when many relevant statements are present, **contextual noise reduces the relevancy score**. Off-topic content lowers the share of retrieved statements that address the core question.

---

#### Outcome Summary:

| Context Type            | Score | Verdict                   |
|-------------------------|-------|---------------------------|
| Clean (No Noise)        | 0.76  | Pass                      |
| Noisy (After Injection) | 0.66  | Pass (Close to Threshold) |

---

#### Takeaway:

- Contextual relevancy depends not just on retrieving correct content, but also on **avoiding irrelevant material**.
- Retrieval quality must ensure:
  - High **topical alignment**
  - Low **semantic noise**
- Even small amounts of **off-topic context** can shift the model away from accurate, grounded generation.

To ensure robustness in RAG systems, prioritize **focused and query-specific chunk selection**.





# Generator Evaluation Metrics

After the retrieval step, the **generation phase** is responsible for producing the final response. This involves:

- Creating a prompt by combining the **user’s input** with the **retrieved context**
- Passing that prompt to the **LLM**, which then generates the answer

To assess the quality of the generated response, we focus on the following key evaluation metrics:

- **Answer Relevancy** – How well does the response align with the user’s query?
- **Faithfulness** – Is the generated content factually grounded in the retrieved context?
- **Hallucination Check** – Does the model introduce unsupported or made-up information?
- **Custom LLM as a Judge (G-Eval)** – Uses an LLM to evaluate responses across custom criteria

These metrics help ensure that the generated output is not only relevant but also reliable and trustworthy.


## LLM-based Answer Relevancy - DeepEval

The **Answer Relevancy** metric evaluates how well the **actual output** from your LLM matches the **intent and content** of the original input query.

This metric focuses on whether the generated response stays **on-topic** and provides meaningful, query-specific information.

`deepeval` uses a **self-explaining LLM-based evaluation**, meaning it not only gives a score but also provides a **reason** for the verdict using an LLM as a judge.

### Required Inputs for `AnswerRelevancyMetric` in `deepeval`:

- `input`: The user’s query  
- `actual_output`: The response generated by your LLM

This metric is useful for identifying off-topic or overly generic answers, helping ensure your generated output is truly relevant to the user's question.


### Evaluating Answer Relevancy

In this step, we evaluate how relevant the LLM's generated response is to the input question.

We use `AnswerRelevancyMetric`, which compares the question and the generated answer to see if the response stays on-topic and addresses the query meaningfully.

The evaluation returns a score and a reason, helping us understand how well the model aligned its response with the user's intent.


In [61]:
test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
)

metric = AnswerRelevancyMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

**************************************************
Answer Relevancy Verbose Logs
**************************************************

Statements:
[
    "There is increasing inequality in healthcare in India.",
    "Private sector five-star hospitals dominate and cater to the high-end market.",
    "Private healthcare is less accessible to weaker sections of society.",
    "There is a significant shortage of trained medical personnel.",
    "Shortage includes doctors, nurses, and especially paramedics.",
    "There is a particular need for medical personnel willing to serve in rural areas.",
    "Access to care is limited in rural areas.",
    "There are systemic constraints despite government focus on health issues.",
    "Much remains to be done to develop effective and sustainable health systems.",
    "Health systems need to address rising non-communicable diseases.",
    "Health systems need to meet the population's demand for better quality healthcare.",
    "Health systems need to

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:07,  7.96s/test case]



Metrics Summary

  - ✅ Answer Relevancy (score: 1.0, threshold: 0.6, strict: False, evaluation model: azure-gpt4o-mini, reason: The score is 1.00 because the response fully addresses the main challenges facing the healthcare sector in India without including any irrelevant information., error: None)

For test case:

  - input: What are the main challenges facing the healthcare sector in India today?
  - actual output: The main challenges facing the healthcare sector in India today, based on the provided context, include:

1. Increasing inequality in healthcare due to the dominance of private sector five-star hospitals catering to the high-end market, making private healthcare less accessible to weaker sections of society.

2. A significant shortage of trained medical personnel, including doctors, nurses, and especially paramedics, with a particular need for personnel willing to serve in rural areas where access to care is limited.

3. Systemic constraints despite government focus on 




In [62]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 1.0
Reason: The score is 1.00 because the response fully addresses the main challenges facing the healthcare sector in India without including any irrelevant information.


### Observation: Answer Relevancy – Perfect Alignment

In this evaluation, the **Answer Relevancy score was a perfect 1.00**, indicating that the AI-generated response was **highly relevant** to the question _“What are the main challenges facing the healthcare sector in India today?”_

#### Key Points:
- The answer listed concrete challenges, starting with inequality driven by private five-star hospitals that cater to the high-end market.
- It emphasized the **shortage of trained medical personnel**, **limited access to care in rural areas**, and **systemic constraints** despite government focus on health issues.
- No irrelevant information was present, and all statements contributed to the topic.

---

### Comparison with Contextual Relevancy

| Metric                   | Score | Verdict | Notes |
|--------------------------|-------|---------|-------|
| **Answer Relevancy**     | 1.00  | Pass    | Answer was focused, coherent, and highly aligned with the input query. |
| **Contextual Relevancy** | 0.76  | Pass    | Context was mostly relevant, but had a few distractors (e.g., references to traditional healthcare systems and general introductions). |

---

### Takeaway:
While the **retrieved context** contained minor distractions, the **LLM response stayed sharply focused** on the topic. This shows the model’s ability to extract and generate **precise, context-aligned answers**, even when the input material is slightly noisy.
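
A comparison table like the one above can be generated from scores rather than typed by hand, which keeps the verdict column consistent with the threshold. This sketch assumes a metric passes when its score is greater than or equal to its threshold; `MetricResult`, `to_markdown`, and the example scores are hypothetical and not taken from this notebook's runs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricResult:
    name: str
    score: float
    threshold: float = 0.6

    @property
    def passed(self) -> bool:
        # Assumption: higher is better and the threshold is a minimum.
        return self.score >= self.threshold


def to_markdown(results: List[MetricResult]) -> str:
    """Render metric results as a markdown table with a derived verdict column."""
    lines = ["| Metric | Score | Verdict |", "|---|---|---|"]
    for r in results:
        verdict = "Pass" if r.passed else "Fail"
        lines.append(f"| {r.name} | {r.score:.2f} | {verdict} |")
    return "\n".join(lines)


# Hypothetical scores for illustration only.
results = [
    MetricResult("Answer Relevancy", 0.90),
    MetricResult("Contextual Relevancy", 0.50),
]
table = to_markdown(results)
print(table)
```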


## Faithfulness

The **faithfulness** metric evaluates whether the LLM's generated response (**actual_output**) is **factually consistent** with the information found in the **retrieved context**.

It helps detect whether the model has introduced **hallucinations**—claims that are not grounded in the source material.

`deepeval` uses a **self-explaining LLM-based evaluation**, meaning it not only provides a score but also includes a rationale for how the score was determined.

### Required Inputs for `FaithfulnessMetric` in `deepeval`:

- `input`: The original user query (not used in this metric)  
- `actual_output`: The response generated by your LLM  
- `retrieval_context`: The top-N document chunks retrieved from your vector store

This metric is essential for ensuring that generated answers stay **fact-based** and **trustworthy**, especially in high-stakes domains like healthcare.
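
The scoring idea can be sketched as follows. This is an assumption about how such a score is typically derived, not a copy of `deepeval` internals: each claim extracted from the answer receives a verdict, and only claims that contradict the context (`"no"`) count against the score, while `"yes"` and `"idk"` do not. The function name `faithfulness` and the verdict lists are hypothetical.

```python
from typing import Dict, List


def faithfulness(claim_verdicts: List[Dict[str, str]]) -> float:
    """Fraction of claims that do not contradict the retrieval context."""
    if not claim_verdicts:
        # Assumption: with no claims to check, nothing is contradicted.
        return 1.0
    not_contradicted = sum(
        1 for v in claim_verdicts if v.get("verdict", "").strip().lower() != "no"
    )
    return not_contradicted / len(claim_verdicts)


# Hypothetical verdicts: supported, undetermined, contradicted.
mixed = [{"verdict": "yes"}, {"verdict": "idk"}, {"verdict": "no"}]
unrelated = [{"verdict": "idk"}, {"verdict": "idk"}]

print(faithfulness(mixed))
print(faithfulness(unrelated))
```

Under this assumption, an answer whose claims are simply unrelated to the context is not penalized, so a high faithfulness score is most informative when the context actually covers the answer's topic.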




In [63]:
human_answer

'MIoT (Medical Internet of Things) improves hospital safety by creating a connected environment where medical devices and systems can communicate seamlessly. This connectivity allows real-time monitoring of patients through biometric sensors and smart devices, which helps detect critical changes in a patient’s condition more quickly. As a result, healthcare providers can respond faster and more accurately. Additionally, MIoT reduces human errors by automating data collection and ensuring that medical information is accurate and readily available across different care settings—from hospital wards to home care. Overall, this leads to better coordination, quicker interventions, and enhanced patient safety.'

In [64]:
test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    retrieval_context=[human_answer]
)

metric = FaithfulnessMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

**************************************************
Faithfulness Verbose Logs
**************************************************

Truths (limit=None):
[
    "MIoT stands for Medical Internet of Things.",
    "MIoT improves hospital safety by creating a connected environment where medical devices and systems can communicate seamlessly.",
    "Connectivity in MIoT allows real-time monitoring of patients through biometric sensors and smart devices.",
    "Real-time monitoring with MIoT helps detect critical changes in a patient's condition more quickly.",
    "Healthcare providers can respond faster and more accurately due to MIoT.",
    "MIoT reduces human errors by automating data collection.",
    "MIoT ensures that medical information is accurate and readily available across different care settings.",
    "Care settings mentioned include hospital wards and home care.",
    "The use of MIoT leads to better coordination, quicker interventions, and enhanced patient safety."
] 
 
Claims:
[

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:06,  6.11s/test case]



Metrics Summary

  - ✅ Faithfulness (score: 1.0, threshold: 0.6, strict: False, evaluation model: azure-gpt4o-mini, reason: The score is 1.00 because there are no contradictions; the actual output aligns perfectly with the retrieval context., error: None)

For test case:

  - input: What are the main challenges facing the healthcare sector in India today?
  - actual output: The main challenges facing the healthcare sector in India today, based on the provided context, include:

1. Increasing inequality in healthcare due to the dominance of private sector five-star hospitals catering to the high-end market, making private healthcare less accessible to weaker sections of society.

2. A significant shortage of trained medical personnel, including doctors, nurses, and especially paramedics, with a particular need for personnel willing to serve in rural areas where access to care is limited.

3. Systemic constraints despite government focus on health issues, with much still to be done to 




In [65]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 1.0
Reason: The score is 1.00 because there are no contradictions; the actual output aligns perfectly with the retrieval context.


### Observation: Faithfulness Score – 1.00 (Perfect Match)

In this test, the **Faithfulness score was 1.00**, meaning the judge found **no claim in the AI-generated answer that contradicts the supplied retrieval context**.

#### What Was Checked:
- The `retrieval_context` passed to the metric was `[human_answer]`, the MIoT-focused reference answer, rather than the chunks returned by the retriever.
- The judge extracted nine truths from that context, all describing how MIoT improves hospital safety.
- The AI-generated answer discusses challenges in India's healthcare sector, so its claims do not contradict those truths.

There were **no contradictions**, which is what the reported reason states. Confirming that each claim is actually present in the source material requires running the metric against the retrieved chunks (`retrieved_context`).

---

### Comparison: Answer Relevancy vs. Faithfulness

| Metric               | Score | Verdict | Comments |
|----------------------|-------|---------|----------|
| **Answer Relevancy** | 1.00  | Pass    | The response was topically focused and well-aligned with the query. |
| **Faithfulness**     | 1.00  | Pass    | No statement contradicted the supplied context. |

---

### Insight:
This result shows that the generated answer is **relevant to the query and free of contradictions** with the supplied context, both key requirements for building **trustworthy RAG applications** in healthcare and beyond. Because a perfect faithfulness score can also occur when the context and the answer address different topics, read it alongside the contextual metrics.


## Hallucination Check

The **hallucination** metric checks whether the LLM generates any **factually incorrect or unsupported information** in its response.

It does this by comparing the `actual_output` to a **human-verified ground truth context**, rather than relying on retrieved documents.

`deepeval` uses a **self-explaining LLM evaluation**, meaning the model provides both a score and a reason for its judgment.

### Required Inputs for `HallucinationMetric` in `deepeval`:

- `input`: The original user query (not used in the scoring)  
- `actual_output`: The response generated by the LLM  
- `context`: Human-verified ground truth chunks used for factual reference

This metric is especially important for identifying **hallucinations**, or fabricated details, which can undermine trust and accuracy in high-stakes applications.
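
To make the scoring concrete: DeepEval's documentation describes the hallucination score as the fraction of context entries that the output contradicts, with lower being better, so the threshold acts as a maximum. The sketch below reproduces that arithmetic in plain Python. `hallucination_score` and `passes` are illustrative helper names, not part of the DeepEval API.

```python
def hallucination_score(verdicts):
    """Fraction of context entries the output does not agree with.

    `verdicts` holds one "yes"/"no" string per context entry, where
    "no" means the output contradicts that entry.
    """
    if not verdicts:
        return 0.0
    contradicted = sum(1 for v in verdicts if v.strip().lower() == "no")
    return contradicted / len(verdicts)


def passes(score, threshold=0.6):
    # Lower is better for this metric, so the threshold is a maximum.
    return score <= threshold


# One context entry, judged "no": the whole context counts as contradicted.
score = hallucination_score(["no"])
print(score, passes(score))
```

With a single context entry the score can only be 0.0 or 1.0, so splitting the ground truth into several smaller chunks gives a finer-grained result.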




In [66]:
print(human_answer)

MIoT (Medical Internet of Things) improves hospital safety by creating a connected environment where medical devices and systems can communicate seamlessly. This connectivity allows real-time monitoring of patients through biometric sensors and smart devices, which helps detect critical changes in a patient’s condition more quickly. As a result, healthcare providers can respond faster and more accurately. Additionally, MIoT reduces human errors by automating data collection and ensuring that medical information is accurate and readily available across different care settings—from hospital wards to home care. Overall, this leads to better coordination, quicker interventions, and enhanced patient safety.


In [67]:
test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    context=[human_answer]
)

metric = HallucinationMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

**************************************************
Hallucination Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The actual output describes challenges in the healthcare sector in India and does not discuss MIoT or its benefits related to hospital safety, connected environments, real-time patient monitoring, reduction of human errors, or improved patient coordination as stated in the context."
    }
]
 
Score: 1.0
Reason: The score is 1.00 because the actual output completely contradicts the context by discussing unrelated healthcare challenges instead of MIoT benefits, resulting in full hallucination.



Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:02,  2.52s/test case]



Metrics Summary

  - ❌ Hallucination (score: 1.0, threshold: 0.6, strict: False, evaluation model: azure-gpt4o-mini, reason: The score is 1.00 because the actual output completely contradicts the context by discussing unrelated healthcare challenges instead of MIoT benefits, resulting in full hallucination., error: None)

For test case:

  - input: What are the main challenges facing the healthcare sector in India today?
  - actual output: The main challenges facing the healthcare sector in India today, based on the provided context, include:

1. Increasing inequality in healthcare due to the dominance of private sector five-star hospitals catering to the high-end market, making private healthcare less accessible to weaker sections of society.

2. A significant shortage of trained medical personnel, including doctors, nurses, and especially paramedics, with a particular need for personnel willing to serve in rural areas where access to care is limited.

3. Systemic constraints despit




In [68]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: False
Score: 1.0
Reason: The score is 1.00 because the actual output completely contradicts the context by discussing unrelated healthcare challenges instead of MIoT benefits, resulting in full hallucination.


###  Observation: 


The **Hallucination Metric** returned a score of **1.00**, the worst possible result, which **fails** the 0.6 threshold (for this metric, lower is better).

**Why did it fail?**  
The test case paired a response about the **challenges facing India's healthcare sector** with a **human-verified context about MIoT and hospital safety**. The judge found no agreement between the two, so the single context entry counted as contradicted. The score therefore reflects a **mismatch between the question and the ground-truth context** in this test case, not necessarily fabricated facts in the answer itself.


---

###  Comparison With Other Metrics

| Metric              | Score | Verdict | Key Insight |
|---------------------|-------|---------|-------------|
|  **Answer Relevancy**   | 1.00  | Pass    | All statements were highly relevant to the input question. |
|  **Faithfulness**       | 1.00  | Pass    | No contradictions with the source context. |
|  **Hallucination**      | 1.00  | Fail    | The output did not agree with the human-verified MIoT context. |

---

### Takeaway

The answer was **relevant** to the question and **faithful** to the retrieved context, yet it failed the hallucination check because the ground-truth context described a different topic.  
A hallucination score is only meaningful when the `context` passed to `HallucinationMetric` corresponds to the question being asked, which is an important check before trusting the metric in sensitive domains like healthcare.


## Hallucination Check with Noisy Retrieval Context

In this case, we deliberately test how well the LLM handles a **hallucination scenario** by feeding it a **retrieval context filled with irrelevant or noisy chunks**.

### What We’re Doing

We keep the **input question** and **actual LLM response** the same, but we replace the valid context with a **noisy context**—one that doesn't support or relate to the expected answer.

### Purpose

This test helps us evaluate whether the model:

- **Hallucinates** information that isn’t grounded in the context
- Tries to “make up” facts when there’s no support available
- Can be **trustworthy** when the retrieval pipeline fails

This is a key scenario for understanding how the system behaves under weak or misleading information conditions.


In [69]:
# now let's inject the noise 

noisy_human_answer="""Theory of Relativity and MIoT: A Conceptual Parallel
Albert Einstein’s theory of relativity—including Special and General Relativity—transformed how we understand time, space, and gravity. It introduced concepts like time dilation and spacetime curvature, showing that time and space are relative to the observer’s motion and position.

Drawing a loose analogy, Medical Internet of Things (MIoT) systems are becoming increasingly context-aware, adapting to patients’ needs in real time. Just as relativity tells us that events unfold differently depending on one’s frame of reference, MIoT networks respond dynamically to a patient’s clinical context—location, vitals, and device interactions.

Some futurists and AI theorists suggest that, like spacetime in physics, data environments in hospitals may evolve into dynamic, responsive systems that “bend” around the patient’s condition. While this is more metaphorical than scientific, it highlights how both relativity and MIoT value context and adaptability over fixed systems.."""

In [70]:
test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    context=[noisy_human_answer]
)

metric = HallucinationMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

**************************************************
Hallucination Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The actual output focuses on healthcare challenges in India, which is unrelated and does not correspond with the context about Einstein\u2019s theory of relativity and the Medical Internet of Things (MIoT). The actual output should discuss concepts related to relativity, MIoT systems, or their conceptual parallels."
    }
]
 
Score: 1.0
Reason: The score is 1.00 because the actual output is entirely unrelated to the given context, focusing on healthcare challenges instead of topics on Einstein’s theory of relativity or Medical Internet of Things, resulting in a complete contradiction.



Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:02,  2.62s/test case]



Metrics Summary

  - ❌ Hallucination (score: 1.0, threshold: 0.6, strict: False, evaluation model: azure-gpt4o-mini, reason: The score is 1.00 because the actual output is entirely unrelated to the given context, focusing on healthcare challenges instead of topics on Einstein’s theory of relativity or Medical Internet of Things, resulting in a complete contradiction., error: None)

For test case:

  - input: What are the main challenges facing the healthcare sector in India today?
  - actual output: The main challenges facing the healthcare sector in India today, based on the provided context, include:

1. Increasing inequality in healthcare due to the dominance of private sector five-star hospitals catering to the high-end market, making private healthcare less accessible to weaker sections of society.

2. A significant shortage of trained medical personnel, including doctors, nurses, and especially paramedics, with a particular need for personnel willing to serve in rural areas whe




In [71]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: False
Score: 1.0
Reason: The score is 1.00 because the actual output is entirely unrelated to the given context, focusing on healthcare challenges instead of topics on Einstein’s theory of relativity or Medical Internet of Things, resulting in a complete contradiction.


### Observation: Hallucination Check with Noisy Retrieval Context

In this case, we intentionally tested the model with a **noisy context**.

 **Score:** 1.00  *(Failed under relaxed threshold)*  


---

### What happened?

- The model's response described the **challenges facing India's healthcare sector**, as in the previous test.
- The noisy context discussed **relativity theory** and a loose **MIoT-relativity analogy**, none of which appeared in the generated answer.
- Because the output did not agree with the supplied context, DeepEval counted the context as contradicted and flagged the response as **hallucination**.

---

### Comparison with Previous Hallucination Test

| Scenario                                     | Score | Verdict               |
|----------------------------------------------|-------|-----------------------|
| Human-verified MIoT context (previous test)  | 1.00  | Hallucination flagged |
| Noisy context (this test)                    | 1.00  | Hallucination flagged |

Both runs failed because, in each case, the context described a different topic from the response.

---

### Takeaway
The metric only judges agreement between the output and whatever context it is given. This highlights the importance of **effective retrieval filtering** in RAG pipelines to ensure reliable, grounded answers.


## Custom LLM as a Judge (G-Eval)

**G-Eval** is a flexible evaluation framework in `deepeval` that uses a language model with **chain-of-thought (CoT)** reasoning to judge LLM responses based on **any custom criteria** you define.

It is the **most versatile** metric in the DeepEval suite and is well-suited for use cases that require **domain-specific rules**, nuanced assessments, or multiple evaluation dimensions.

### How It Works

You define your evaluation logic using **custom prompts** under `evaluation_steps`, allowing you to guide how the LLM should score and explain its decisions.

### Required Inputs for `G-Eval` in `deepeval`:

- `input`: The original user query (optional, depending on use case)  
- `actual_output`: The LLM-generated response  
- `expected_output` (optional): Human-verified answer for comparison  
- `context` (optional): Supporting documents for grounding or factual reference

G-Eval is ideal for building **task-specific benchmarks**, performing **multi-step evaluations**, or tailoring assessments to **real-world application requirements**.




### Custom Evaluation using G-Eval in DeepEval

In this setup, we are using the **G-Eval** framework to evaluate the LLM’s response based on **custom logic** defined through a series of steps.

Here’s what’s happening:

1. We define a `test_case` that includes:
   - The input question
   - The model’s actual output
   - A human-verified expected answer
   - The retrieval context (document chunks the model used)

2. We configure a **custom metric** named `"RAG Fact Checker"` using the `GEval` class.

3. The evaluation uses a step-by-step approach:
   - Extract statements from the generated output
   - Check if they answer the question and penalize irrelevant ones
   - Compare with the expected answer and penalize any missing or inaccurate claims
   - Ensure statements are backed by the retrieval context
   - Penalize any made-up or hallucinated content

4. The test runs using an LLM as the evaluator, producing a final score along with a reasoning trace.

This process gives a **comprehensive, explainable assessment** of how well the generated answer holds up in terms of **relevance, completeness, accuracy, and grounding**.


In [72]:


test_case = LLMTestCase(
    input=response['question'],
    actual_output=response["AI_generated_response"],
    expected_output=human_answer,
    retrieval_context=retrieved_context
)

metric = GEval(
    threshold=0.6,
    model=wrapped_model,
    name="RAG Fact Checker",
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Create a list of statements from 'actual output'",
        "Validate if they are relevant and answers the given question in 'input', penalize if any statements are irrelevant",
        "Also Validate if they exist in 'expected output', penalize if any statements are missing or factually wrong",
        "Also validate if these statements are grounded in the 'retrieval context' and penalize if they are missing or factually wrong",
        "Finally also penalize if any statements seem to be invented or made up and do not make sense factually given the 'input' and 'retrieval context'"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT,
                       LLMTestCaseParams.ACTUAL_OUTPUT,
                       LLMTestCaseParams.EXPECTED_OUTPUT,
                       LLMTestCaseParams.RETRIEVAL_CONTEXT],
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

**************************************************
RAG Fact Checker (GEval) Verbose Logs
**************************************************

Criteria:
None 
 
Evaluation Steps:
[
    "Create a list of statements from 'actual output'",
    "Validate if they are relevant and answers the given question in 'input', penalize if any statements are irrelevant",
    "Also Validate if they exist in 'expected output', penalize if any statements are missing or factually wrong",
    "Also validate if these statements are grounded in the 'retrieval context' and penalize if they are missing or factually wrong",
    "Finally also penalize if any statements seem to be invented or made up and do not make sense factually given the 'input' and 'retrieval context'"
] 
 
Rubric:
None
 
Score: 0.8
Reason: The actual output is fully relevant to the input question and closely matches the retrieval context, covering inequality, personnel shortage, systemic constraints, affordability, infrastructure gaps, and e

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:01,  1.95s/test case]



Metrics Summary

  - ✅ RAG Fact Checker (GEval) (score: 0.8, threshold: 0.6, strict: False, evaluation model: azure-gpt4o-mini, reason: The actual output is fully relevant to the input question and closely matches the retrieval context, covering inequality, personnel shortage, systemic constraints, affordability, infrastructure gaps, and education needs. However, it omits the specific MIoT-related technological solutions described in the expected output, which focus on hospital safety and patient monitoring. No invented facts are present, but the absence of the expected technological angle prevents a perfect score., error: None)

For test case:

  - input: What are the main challenges facing the healthcare sector in India today?
  - actual output: The main challenges facing the healthcare sector in India today, based on the provided context, include:

1. Increasing inequality in healthcare due to the dominance of private sector five-star hospitals catering to the high-end market, mak




In [73]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 0.8
Reason: The actual output is fully relevant to the input question and closely matches the retrieval context, covering inequality, personnel shortage, systemic constraints, affordability, infrastructure gaps, and education needs. However, it omits the specific MIoT-related technological solutions described in the expected output, which focus on hospital safety and patient monitoring. No invented facts are present, but the absence of the expected technological angle prevents a perfect score.


###  RAG Fact Checker (G-Eval) Result – Score: 0.8 

In this test, the AI-generated response **passed** the 0.6 threshold under the **custom multi-criteria evaluation** using DeepEval's G-Eval, though it fell short of a perfect score.

---

###  Key Observations:

-  **Relevant Points Included**:
  - The answer was fully relevant to the question and closely matched the retrieval context.
  - It covered inequality, personnel shortages, systemic constraints, affordability, infrastructure gaps, and education needs.
  - No invented facts were found.

-  **Penalties Incurred For**:
  - **Omitting the MIoT-related technological solutions** (hospital safety and patient monitoring) described in the expected output.

---

### Comparison with Other Metrics

| Metric                | Score | Outcome | Notes                                                                 |
|----------------------|-------|---------|-----------------------------------------------------------------------|
|  **Faithfulness**       | 1.00  | Pass    | Fully aligned with context, no contradictions                         |
|  **Answer Relevancy**   | 1.00  | Pass    | Highly relevant, stayed on topic                                      |
|  **Hallucination**      | 1.00  | Fail    | Output did not agree with the human-verified MIoT context             |
|  **RAG Fact Checker**   | 0.80  | Pass    | Penalized for omitting points from the expected output                |

---

### Insight:

This case shows how **incomplete coverage** of the expected output can lower the score even when the response is on-topic and grounded. Here the expected answer focused on MIoT while the question asked about India's healthcare challenges, so part of the penalty reflects the test setup rather than the answer itself. For **mission-critical applications like healthcare**, the question and the expected output need to be aligned before the score is interpreted.


> ## **Important Note**
> 
> **We are evaluating this result using the `deepeval` library, which uses an LLM to act as a judge.**
> 
> **At the backend, we are using `gpt-4o-mini` (Azure-hosted) to perform the evaluation.**
> 
> **Because this involves an LLM's reasoning, there is a high probability that you might get slightly different results when you re-run the code — even with the same inputs.**
> 
> **This variability happens because LLM-based evaluations can be non-deterministic by nature. Small differences in phrasing or internal model behavior can influence how it interprets relevance, alignment, or context.**
> 
> **Therefore, while these evaluations provide valuable insights, treat individual scores as part of a broader trend rather than absolute judgments.**
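
One way to act on this advice is to repeat each evaluation several times and look at the spread instead of a single number. The sketch below is a minimal illustration in plain Python; `summarize_runs` is a hypothetical helper, and the example scores are made-up placeholders, not results from this notebook.

```python
from statistics import mean, pstdev


def summarize_runs(scores, threshold=0.6, higher_is_better=True):
    """Aggregate scores from repeated evaluations of the same test case."""
    if higher_is_better:
        passed = [s >= threshold for s in scores]
    else:
        # Metrics such as hallucination pass when the score is at or below the threshold.
        passed = [s <= threshold for s in scores]
    return {
        "runs": len(scores),
        "mean": round(mean(scores), 3),
        "stdev": round(pstdev(scores), 3),
        "pass_rate": sum(passed) / len(scores),
    }


# Placeholder scores standing in for five re-runs of one metric.
example_scores = [0.8, 0.7, 0.8, 0.9, 0.8]
print(summarize_runs(example_scores))
```

A high pass rate with a small standard deviation suggests the verdict is stable; a wide spread is a sign that a single run should not be trusted.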


### Comparative Observation: Single PDF vs Multiple PDFs Evaluation

When evaluating the RAG pipeline using **DeepEval metrics** for the query:  
**"How does MIoT improve hospital safety?"**, we observed a clear difference in performance between **single-PDF** and **multi-PDF** retrieval scenarios.

---

###  Key Insight

- The **single PDF ("Digital Transformation of the Healthcare Value Chain...")** was highly relevant and densely packed with content tailored to MIoT and hospital safety—leading to better **retrieval**, **relevance**, and **precision**.
  
- After introducing the second PDF, **"Trends, Prospects, Challenges, and Security in the Healthcare IoT"**, the **retrieved context became noisier**:
  - Some nodes discussed marketing, economics, or general HIoT trends.
  - Others introduced terminology or context **not directly aligned with the question**.

---

###  Conclusion

 **Single-PDF retrieval** worked better for focused queries tied to its content domain.  
 **Multi-PDF retrieval**, while broader, introduced **semantic noise**, especially when PDFs differ in focus.  
This shows the importance of **content-aware retrieval filtering or source attribution** when scaling to multiple documents in a RAG system.

---

### Key Solutions to Improve Accuracy (Based on Our Current Pipeline)

In our experiments, RAG performance dropped slightly when combining multiple PDFs. This is mainly because the added PDF introduced **less relevant content**, which affected **retrieval precision** and overall evaluation scores.

To improve accuracy, we recommend the following **within our existing setup**:

---

#### 1.  Use Only Relevant Chunks for Answer Generation
- Ensure that only the **top semantically relevant chunks** are passed to the generation stage.
- Our current `retrieve_chunks()` function already filters duplicate content. In addition, limit the result to the **top 3 or top 5** chunks ranked by semantic similarity.
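
A minimal sketch of top-k selection by cosine similarity is shown below. It assumes each chunk is available as a `(text, embedding)` pair; `top_k_chunks` is a hypothetical helper and is not the notebook's `retrieve_chunks()` implementation. The two-dimensional vectors are toy values for illustration only.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunks, k=3):
    """Return the texts of the k chunks most similar to the query.

    `chunks` is a list of (text, embedding) pairs.
    """
    scored = [(cosine_similarity(query_vec, emb), text) for text, emb in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy embeddings: the first and third chunks point in the query's direction.
toy_chunks = [
    ("MIoT and hospital safety", [1.0, 0.0]),
    ("Healthcare marketing trends", [0.0, 1.0]),
    ("Context-aware medical devices", [0.9, 0.1]),
]
print(top_k_chunks([1.0, 0.0], toy_chunks, k=2))
```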

---

#### 2. Eliminate Noisy Context Before Response Generation
- In multi-PDF scenarios, some retrieved chunks (e.g., about healthcare marketing) were **not aligned** with the query.
- Before generating an answer, apply a **manual or automated filtering step** to remove irrelevant chunks based on simple keyword or topic matching.
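
The automated variant of this filter could be as simple as the keyword check sketched below. `filter_chunks_by_keywords` is a hypothetical helper, and the keyword list is an assumption that would need tuning per query.

```python
import re


def filter_chunks_by_keywords(chunks, keywords, min_hits=1):
    """Keep chunks that mention at least `min_hits` of the given keywords."""
    patterns = [
        re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        for keyword in keywords
    ]
    kept = []
    for chunk in chunks:
        hits = sum(1 for pattern in patterns if pattern.search(chunk))
        if hits >= min_hits:
            kept.append(chunk)
    return kept


retrieved = [
    "MIoT devices share data securely to improve hospital safety.",
    "Healthcare marketing budgets grew over the last decade.",
]
print(filter_chunks_by_keywords(retrieved, ["MIoT", "safety"]))
```

Keyword matching is crude and can discard relevant chunks that use different wording, so it works best as a coarse pre-filter alongside similarity ranking.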

---

#### 3. Restrict the Model Using Strict Prompting
- Use this instruction before generation:  
  _“Generate an answer strictly based on the above context; do not use your own knowledge. If the query is not covered in the context, respond with: 'This query is not as per the PDF.'”_
- This prevents **hallucination** and ensures **faithfulness to context**.
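
As a sketch of how the instruction might be wired into a prompt, the helper below places the retrieved chunks first so that "the above context" refers to them. `build_strict_prompt` and the chunk labelling are illustrative assumptions; the notebook's actual prompt template may differ.

```python
STRICT_INSTRUCTION = (
    "Generate an answer strictly based on the above context; "
    "do not use your own knowledge. If the query is not covered in the "
    "context, respond with: 'This query is not as per the PDF.'"
)


def build_strict_prompt(question, chunks):
    """Assemble a prompt with context first, then the restriction, then the question."""
    context = "\n\n".join(
        f"[Chunk {i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        f"Context:\n{context}\n\n"
        f"{STRICT_INSTRUCTION}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_strict_prompt(
    "How does MIoT improve hospital safety?",
    ["chunk a", "chunk b"],
)
print(prompt)
```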

---

#### 4. Extract & Use Statement-Level Grounded Evidence (for Evaluation)
- In metrics like Hallucination, Faithfulness, and G-Eval, we observed better scores when the **response used precise, grounded statements** from context chunks.
- Prefer summarizing from **atomic-level claims** rather than full paragraphs.

---

By following these steps with our current tools (retriever, chunker, generator, and DeepEval metrics), we can improve the **trustworthiness and quality** of RAG responses across multiple PDF sources.




## Homework: "RAG Under Pressure" – Evaluating System Robustness with Noisy Inputs

### Scenario:
You built a powerful healthcare assistant using a RAG system that retrieves information from multiple medical PDFs to answer patient queries.  
Now comes the challenge: real-world users don’t always ask clear, well-structured questions. They may introduce typos, irrelevant details, or mix languages. Also, not every document in your retrieval database is guaranteed to be relevant or reliable.

---

### Objective:
Simulate real-world conditions by introducing noise into either the user queries or the retrieved documents. Then evaluate how well your RAG system performs under pressure.

---

### Task Description:

#### 1. Stage 1: Run a Clean Test
- Choose three healthcare-related questions (for example, symptoms of a disease, treatment options, or medication usage).
- Run these questions through your current RAG system.
- Record the following:
  - Retrieved context chunks
  - Final answers
  - Evaluation scores using DeepEval metrics (such as `ContextualRelevancyMetric` or `GEval`)

#### 2. Stage 2: Add Real-World Messiness
Select one of the following methods to simulate noisy input:

- **Option A: Corrupt the Query**  
  - Add typos, abbreviations, informal language, or a mix of English and another language.  
  - Example: "Plz tell symptms of diabts" or "¿Cuales son los síntomas de hypertension?"

- **Option B: Corrupt the Document Base**  
  - Add at least one unrelated PDF (such as a document about sports, cooking, or finance) to your retrieval system.  
  - Make sure your retriever indexes this new content along with the medical PDFs.
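
For Option A, one possible way to produce noisy queries reproducibly is to drop interior letters at random with a fixed seed. The sketch below is only a starting point; `add_typos`, its `drop_rate`, and the seed are illustrative choices, and it does not cover abbreviations or mixed-language queries.

```python
import random


def add_typos(query, drop_rate=0.15, seed=0):
    """Randomly drop interior letters from each word to mimic hurried typing."""
    rng = random.Random(seed)  # fixed seed keeps the corruption reproducible
    noisy_words = []
    for word in query.split():
        last = len(word) - 1
        kept = [
            char
            for i, char in enumerate(word)
            # Always keep the first/last character and any non-letters.
            if i in (0, last) or not char.isalpha() or rng.random() > drop_rate
        ]
        noisy_words.append("".join(kept))
    return " ".join(noisy_words)


clean_query = "What are the symptoms of diabetes?"
print(add_typos(clean_query, drop_rate=0.3))
```

Keeping the seed fixed means the same corrupted query is used on every run, which makes the before-and-after scores comparable.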

#### 3. Stage 3: Rerun and Evaluate
- Rerun the same three queries, using either the noisy versions or the clean versions with the corrupted document base.
- Record the following:
  - Retrieved context chunks
  - Generated answers
  - Evaluation scores (using `ContextualRelevancyMetric`, `GEval`, etc.)

#### 4. Stage 4: Compare and Reflect
- Create a side-by-side comparison of the evaluation scores from before and after introducing noise.
- Analyze the following:
  - How did the answer quality change?
  - Which type of noise had the greatest impact?
  - What does this reveal about your system’s robustness?
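
The side-by-side comparison can be produced with a few lines of Python once the scores are collected into dictionaries. In the sketch below, `compare_scores` and `to_markdown` are hypothetical helpers, and the numbers are placeholders to be replaced with your own results.

```python
def compare_scores(clean, noisy):
    """Pair up metrics present in both runs and compute the change in score."""
    rows = []
    for metric in sorted(set(clean) & set(noisy)):
        delta = noisy[metric] - clean[metric]
        rows.append((metric, clean[metric], noisy[metric], round(delta, 3)))
    return rows


def to_markdown(rows):
    """Render the comparison rows as a markdown table."""
    lines = ["| Metric | Clean | Noisy | Change |", "|---|---|---|---|"]
    for metric, clean_score, noisy_score, delta in rows:
        lines.append(
            f"| {metric} | {clean_score:.2f} | {noisy_score:.2f} | {delta:+.2f} |"
        )
    return "\n".join(lines)


# Placeholder scores; substitute the values recorded in Stage 1 and Stage 3.
clean_scores = {"Contextual Relevancy": 0.9, "RAG Fact Checker": 0.8}
noisy_scores = {"Contextual Relevancy": 0.6, "RAG Fact Checker": 0.8}
print(to_markdown(compare_scores(clean_scores, noisy_scores)))
```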

Then answer briefly:
1. Did your RAG pipeline perform well under noisy conditions?  
2. What improvements would you recommend to make retrieval and generation more resilient to real-world inputs?
