# ❤️‍🩹📖 VitalStory: GenAI Model Evaluation | Feature 1 - Follow up Question Model

**Author:** Tyler Gustafson (Gustani)

This notebook evaluates the model for Feature 1 (Follow-up Question Model) by testing different pipelines, models, and parameter settings.

## 1. 📦 Setup & Installs
First, we install and import the required libraries.


In [None]:
# HuggingFace Login

from huggingface_hub import login
import os

# Read the token from the HUGGINGFACE_TOKEN environment variable
# (if it is unset, login() falls back to an interactive prompt)
hf_token = os.getenv("HUGGINGFACE_TOKEN")
login(hf_token)


In [None]:
%%capture
# Core Installations (the %%capture cell magic must be the first line of the cell)
!pip -q install git+https://github.com/huggingface/transformers
!pip -q install bitsandbytes accelerate  # For 4-bit/8-bit quantized models
!pip -q install sentencepiece einops     # Tokenization & tensor ops
!pip install sentence_transformers

# LangChain & Ecosystem
!pip -q install langchain
!pip -q install langchain_community
!pip -q install langchainhub
!pip -q install -U langchain-huggingface
!pip -q install -U langchain-cohere

# Vector DBs / Search
!pip -q install faiss-gpu               # GPU-accelerated similarity search
!pip -q install --upgrade --quiet chromadb bs4 qdrant-client
!pip -q install --upgrade --quiet wikipedia arxiv pymupdf xmltodict  # Document sources

# Model Tuning / Tools
!pip -q install loralib                # LoRA fine-tuning

# Evaluation Tools
!pip -q install evaluate               # HF evaluation framework
!pip -q install bert_score             # Semantic similarity
!pip -q install ragas                  # RAG evaluation metrics
!pip install -q nltk evaluate
!pip install rouge_score


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
# from langchain.llms import HuggingFacePipeline
from langchain_huggingface import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from pydantic import BaseModel, Field
from typing import List



# Core Python Libraries
import os
import json
import re
import time
import locale
from pprint import pprint

# Data Processing & Scientific Computing
import numpy as np
import pandas as pd
import torch

# NLP Evaluation Libraries
import nltk
from bert_score import score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import evaluate

# Download required NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # Important! Prevents the LookupError

# Web and Document Parsing
import bs4

# Hugging Face Transformers
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    pipeline,
    BitsAndBytesConfig
)

# LangChain Core Components
from langchain import PromptTemplate, LLMChain, hub
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter
)

# LangChain LLM Interfaces
# (HuggingFacePipeline comes from langchain_huggingface only, and BaseModel/Field
# come from pydantic directly, so the deprecated imports no longer shadow them)
from langchain_huggingface import HuggingFacePipeline
from langchain_cohere import ChatCohere
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from pydantic import BaseModel, Field
from typing import List

# LangChain Vector Stores
from langchain_community.vectorstores import (
    FAISS,
    Chroma,
    Qdrant
)

# LangChain Embeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.utils.math import cosine_similarity

# LangChain Document Loaders
from langchain_community.document_loaders import (
    WebBaseLoader,
    TextLoader,
    ArxivLoader,
    WikipediaLoader,
    OnlinePDFLoader,
    PyMuPDFLoader,
    PubMedLoader
)

# Google Colab Utilities
from google.colab import userdata

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.



In [None]:
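# Force UTF-8 as the preferred encoding (a common Colab workaround for locale errors in shell commands)
locale.getpreferredencoding = lambda: "UTF-8"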

In [None]:
# Set up GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU")

Using device: cuda
CUDA available: True
Device: Tesla T4


## 2. 📊 Load Gold Reference Dataset (Feature 1)

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
file_path = '/content/drive/My Drive/00_UC-Berkeley/07_Capstone/gold_standard_feature1.csv'

In [None]:
df = pd.read_csv(file_path)
df.head(2)

Unnamed: 0,health_log,q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11
0,I've been having really bad headaches lately.,When did the headaches start?,How often do you get them?,"Where is the pain located (front, sides, back ...",How severe is the pain on a scale from 1-10?,Have you noticed any specific triggers (stress...,Do the headaches get better or worse at certai...,Are there any other symptoms with the headache...,"Have you taken any medication for it, and did ...",Has anything made the pain worse or better?,Have you experienced headaches like this before?,"What does the headache feel like? (squeezing, ..."
1,I keep getting random bouts of insomnia that j...,When did this sleep pattern change begin?,What time do you typically try to go to bed an...,What kinds of thoughts keep you awake?,Have you noticed any changes in your caffeine ...,Are you using any screens before bedtime?,Have you experienced any recent stressful events?,How is your energy level during the day?,Have you tried any sleep aids or remedies?,Has your exercise routine changed recently?,Do you feel physically tired when you go to bed?,


In [None]:
# Combine q1 through q11 into a list per row and drop the NaNs
question_cols = [f"q{i}" for i in range(1, 12)]
df["gold_questions"] = df[question_cols].apply(lambda row: [q for q in row if pd.notna(q)], axis=1)
df.head(2)


Unnamed: 0,health_log,q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11,gold_questions
0,I've been having really bad headaches lately.,When did the headaches start?,How often do you get them?,"Where is the pain located (front, sides, back ...",How severe is the pain on a scale from 1-10?,Have you noticed any specific triggers (stress...,Do the headaches get better or worse at certai...,Are there any other symptoms with the headache...,"Have you taken any medication for it, and did ...",Has anything made the pain worse or better?,Have you experienced headaches like this before?,"What does the headache feel like? (squeezing, ...","[When did the headaches start?, How often do y..."
1,I keep getting random bouts of insomnia that j...,When did this sleep pattern change begin?,What time do you typically try to go to bed an...,What kinds of thoughts keep you awake?,Have you noticed any changes in your caffeine ...,Are you using any screens before bedtime?,Have you experienced any recent stressful events?,How is your energy level during the day?,Have you tried any sleep aids or remedies?,Has your exercise routine changed recently?,Do you feel physically tired when you go to bed?,,"[When did this sleep pattern change begin?, Wh..."
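
The list-building step can be checked without the CSV. The sketch below is a standard-library illustration only: `collect_questions` and the sample row are hypothetical, and the missing-value check covers just `None` and float NaN, which is narrower than `pd.notna`.

```python
import math

def collect_questions(row, question_cols):
    """Return the non-missing questions from one row, in column order."""
    questions = []
    for col in question_cols:
        value = row.get(col)
        # Treat None and float NaN as missing (a subset of what pd.notna handles)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        questions.append(value)
    return questions

# Hypothetical row with one missing question
question_cols = [f"q{i}" for i in range(1, 4)]
row = {
    "health_log": "example log",
    "q1": "When did it start?",
    "q2": float("nan"),
    "q3": "How often?",
}
gold = collect_questions(row, question_cols)
print(gold)
```

Rows with fewer than 11 questions therefore produce shorter lists, which is why the gold counts differ from row to row later in the notebook.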


## 3. ⚙️ Model & Pipeline Setup
This section loads the language model and wraps it in a text-generation pipeline so it can generate follow-up questions from prompts.

The tokenizer converts input text into tokens for the model and decodes the model's output tokens back into text.
The model predicts new tokens given the input tokens.

### **3a. Load model, tokenizer and initial pipeline functions**

In [None]:
model_names = {
    "mistral": "mistralai/Mistral-7B-Instruct-v0.3",
    "llama3": "meta-llama/Llama-3.2-3B-Instruct",
    "med42_8b": "m42-health/Llama3-Med42-8B",
    "med42_70b": "m42-health/Llama3-Med42-70B"
}

In [None]:
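# Ramakrishna Ramadurgam - loaded model - note the 8-bit quantization

model_name = model_names["med42_8b"]
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Pass the 8-bit setting through BitsAndBytesConfig; the bare load_in_8bit
# argument is deprecated in recent transformers releases.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True)
)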


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
# Ramakrishna Ramadurgam - model pipeline and parser (make sure to use the Med42-specific parser)


# JSON parser for outputs wrapped in a json code fence
def extract_questions_list(text):
    try:
        # Take the text between the last json fence and the fence that closes it
        data = json.loads(text.split("```json")[-1].split("```")[0])
        return data.get("questions", [])
    except Exception:
        return ["Error: Failed to parse JSON"]

# JSON parser that takes the last {...} object found anywhere in the text
def extract_questions_list_med42(text):
    try:
        # Find every flat {...} block; the non-greedy match stops at the first closing brace
        matches = re.findall(r"\{[\s\S]*?\}", text)
        if not matches:
            return ["Error: No JSON object found"]

        # Try parsing the last match
        data = json.loads(matches[-1])
        return data.get("questions", [])
    except Exception as e:
        return [f"Error: Failed to parse JSON ({str(e)})"]


def build_health_chain_with_model(model, tokenizer, prompt_template, temperature=0.2, max_new_tokens=100):
    text_gen_pipeline = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        do_sample=True
    )

    llm = HuggingFacePipeline(pipeline=text_gen_pipeline)

    class Questions(BaseModel):
        questions: List[str] = Field(description="List of medical follow-up questions")

    chain = (
        {"health_log": RunnablePassthrough()}
        | prompt_template
        | llm
        #| extract_questions_list
        | extract_questions_list_med42
    )
    return chain
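
The regex-based parsing used by `extract_questions_list_med42` can be tried on a canned string without loading a model. This is a minimal sketch: `extract_last_json_questions` and `sample_output` are hypothetical names, and it returns an empty list on failure instead of the notebook's error strings.

```python
import json
import re

def extract_last_json_questions(text):
    """Parse the last flat {...} object in `text` and return its "questions" list."""
    # Non-greedy match: each hit runs from an opening brace to the nearest closing brace
    matches = re.findall(r"\{[\s\S]*?\}", text)
    if not matches:
        return []
    try:
        return json.loads(matches[-1]).get("questions", [])
    except json.JSONDecodeError:
        return []

# Hypothetical model output that echoes an earlier example before the real answer
sample_output = (
    'Example: {"questions": ["Old example question?"]}\n'
    "Response:\n"
    '{"questions": ["How often do the headaches occur?", "Any nausea?"]}'
)
parsed = extract_last_json_questions(sample_output)
print(parsed)
```

Because the pattern stops at the first closing brace, it only suits flat objects like this one; a question containing `}` or a nested object would be cut short and fail to parse.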

### **3b. Prompt Engineering**

In [None]:
# ZERO SHOT PROMPT
llama_friendly_prompt = PromptTemplate.from_template(
        """Patient Health Log: {health_log}

Generate 3 medically relevant follow-up questions. Be concise and only ask short questions. Do **NOT** give any advice. Do **NOT** repeat the patient log or instructions in your response. Only return JSON.

Your response must be a valid JSON object with the following structure:
```json
{{
  "questions": [
    "First follow-up question here?",
    "Second follow-up question here?",
    "Third follow-up question here?"
  ]
}}```"""
    )
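
The doubled braces in the template above are escapes: only `{health_log}` is treated as a variable, while `{{` and `}}` render as literal braces. The sketch below shows the same rule with plain `str.format`, on the assumption that `PromptTemplate`'s default f-string format follows it; the shortened template is illustrative only.

```python
# Shortened, hypothetical template using the same escaping as the prompt above
template = 'Patient Health Log: {health_log}\n{{\n  "questions": ["..."]\n}}'

# {health_log} is substituted; {{ and }} collapse to single literal braces
rendered = template.format(health_log="I have a sore throat.")
print(rendered)
```

Writing single braces around the JSON skeleton would instead make the formatter look for a variable named after the brace contents and fail.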




In [None]:
# CHAIN OF THOUGHT PROMPT

llama_cot_prompt = PromptTemplate.from_template(
    """Patient Health Log: {health_log}

Step 1: Reflect on the key medical concerns or symptoms mentioned in the log.

Step 2: Based on this reflection, generate 3 medically relevant follow-up questions. Be concise and only ask short questions. Do **NOT** give any medical advice.

Only return JSON in the following structure:
```json
{{
  "questions": [
    "First follow-up question here?",
    "Second follow-up question here?",
    "Third follow-up question here?"
  ]
}}```"""
)


In [1]:
# Ramakrishna Ramadurgam - final prompt


# FEW SHOT PROMPT (*Best prompt for MED42-8B)

llama_few_shot_prompt = PromptTemplate.from_template(
    """You are a medical question-generation assistant.

Given a patient health log, generate 3 medically relevant follow-up questions. Be concise and only ask short questions. Do **not** give any advice. Do **not** repeat the patient log or the instructions in your response. Only return valid JSON.

The output must follow this format:
```json
{{
  "questions": [
    "First follow-up question here?",
    "Second follow-up question here?",
    "Third follow-up question here?"
  ]
}}```

---

Example 1:

Patient Health Log: "My stomach hurts after I eat anything, and I feel bloated all the time."

Response:
```json
{{
  "questions": [
    "What types of foods trigger your symptoms?",
    "Do you experience nausea or vomiting?",
    "Have you had any recent changes in bowel habits?"
  ]
}}```

---

Example 2:

Patient Health Log: "I keep getting migraines that last all day and don’t respond to painkillers."

Response:
```json
{{
  "questions": [
    "How frequently do the migraines occur?",
    "Do you notice any warning signs before they start?",
    "Have you tried any treatments other than painkillers?"
  ]
}}```

---

Now, generate follow-up questions for the following:

Patient Health Log: {health_log}

Response:
"""
)



### **3c. Set model hyperparameters and build model**

In [None]:
# Ramakrishna Ramadurgam - creating the health chain (note temperature etc.)


# Create Chains - REMEMBER YOU NEED TO SET THE MODEL A COUPLE CELLS UP

# Temperature Tuning
# health_chain = build_health_chain_with_model(model,tokenizer,prompt_template=llama_friendly_prompt, temperature=0.1)
# health_chain = build_health_chain_with_model(model,tokenizer,prompt_template=llama_friendly_prompt, temperature=0.2)
# health_chain = build_health_chain_with_model(model,tokenizer,prompt_template=llama_friendly_prompt, temperature=0.5)
# health_chain = build_health_chain_with_model(model,tokenizer,prompt_template=llama_friendly_prompt, temperature=0.9)


# Prompt Engineering
# health_chain = build_health_chain_with_model(model,tokenizer,prompt_template=llama_friendly_prompt, temperature=0.2) # Zero shot
# health_chain = build_health_chain_with_model(model,tokenizer,prompt_template=llama_cot_prompt, temperature=0.2) # Chain of Thought
health_chain = build_health_chain_with_model(model,tokenizer,prompt_template=llama_few_shot_prompt, temperature=0.2) # Few shot


In [None]:
# ✅ Run test inference
test_log = "I've been having really bad headaches lately."
# Pass the raw string: the chain's first step already wraps its input as {"health_log": ...}
response = health_chain.invoke(test_log)
print(response)

['How often do the headaches occur?', 'Are the headaches accompanied by any other symptoms?', 'Have you noticed any triggers for the headaches?']


## 6. 📈 Evaluation (BERTScore)

### **6a. Generate model responses for test / eval questions**

In [None]:
import warnings
warnings.filterwarnings("ignore", message="MatMul8bitLt: inputs will be cast*")


generated_all = []
number_of_questions = 20

# Loop through health logs and save generated questions
for i in range(number_of_questions):
    row = df.iloc[i]
    health_log = row["health_log"]

    print(f"🔄 Generating for row {i}:\n{health_log}\n")

    # Run the model chain (pass the raw string; the chain wraps it as {"health_log": ...})
    generated_questions = health_chain.invoke(health_log)

    # Save the results
    generated_all.append({
        "index": i,
        "health_log": health_log,
        "generated_questions": generated_questions
    })

    # Print them out for review
    for q in generated_questions:
        print("💬", q)
    print("-" * 60)


🔄 Generating for row 0:
I've been having really bad headaches lately.

💬 How often have you been experiencing these headaches?
💬 Are the headaches localized to one area or are they more general?
💬 Have you noticed any associated symptoms such as sensitivity to light or nausea?
------------------------------------------------------------
🔄 Generating for row 1:
I keep getting random bouts of insomnia that just won't quit. I've been lying in bed for HOURS every night this week just staring at the ceiling. My brain won't shut up about the most random things and I'm exhausted but also weirdly wired? My fitbit sleep score is in the toilet 😩

💬 How long have you been experiencing these bouts of insomnia?
💬 Have you noticed any triggers or patterns to these episodes?
💬 Are you using any sleep aids or supplements?
------------------------------------------------------------
🔄 Generating for row 2:
Pretty sure my new neighbors are gonna give me a heart attack fr... They got this super bright mo

In [None]:
generated_all[0]

{'index': 0,
 'health_log': "I've been having really bad headaches lately.",
 'generated_questions': ['How often have you been experiencing these headaches?',
  'Are the headaches localized to one area or are they more general?',
  'Have you noticed any associated symptoms such as sensitivity to light or nausea?']}

## 7. 📊 Results Visualization

### **7a. Suppress Warnings**

In [None]:
import warnings
import transformers

# Filter out specific warning from transformers
warnings.filterwarnings("ignore", category=UserWarning,
                       module="transformers.modeling_utils",
                       message="Some weights of")
# Or suppress all transformers warnings
transformers.logging.set_verbosity_error()

### **7b. Define Metric Calculation Functions**

In [None]:
# BERTScore, Bleu and Rouge functions

def calculate_bertscore(generated_texts, reference_texts):
    """
    Calculate BERTScore for a list of generated texts against reference texts.

    Args:
        generated_texts: List of generated text strings
        reference_texts: List of reference text strings

    Returns:
        Dict containing lists of precision, recall, f1 scores and their averages
    """
    # Calculate BERTScore
    P, R, F1 = score(generated_texts, reference_texts, lang="en", verbose=False)

    return {
        "precision": P.tolist(),
        "recall": R.tolist(),
        "f1": F1.tolist(),
        "avg_precision": float(P.mean()),
        "avg_recall": float(R.mean()),
        "avg_f1": float(F1.mean())
    }

def calculate_bleu(generated_text, reference_text):
    """
    Calculate BLEU score for a single generated text against a reference.

    Args:
        generated_text: String of generated text
        reference_text: String of reference text

    Returns:
        BLEU score as a float
    """
    smooth = SmoothingFunction().method1
    reference = [nltk.word_tokenize(reference_text)]
    hypothesis = nltk.word_tokenize(generated_text)

    return sentence_bleu(reference, hypothesis, smoothing_function=smooth)

_rouge_metric = None  # loaded on first use and reused, since this function runs inside the matching loop

def calculate_rouge(generated_texts, reference_texts):
    """
    Calculate ROUGE scores for a list of generated texts against reference texts.

    Args:
        generated_texts: List of generated text strings
        reference_texts: List of reference text strings

    Returns:
        Dict containing ROUGE scores
    """
    global _rouge_metric
    if _rouge_metric is None:
        _rouge_metric = evaluate.load("rouge")
    return _rouge_metric.compute(predictions=generated_texts, references=reference_texts)
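
For intuition about the `rougeL` value used later, ROUGE-L is an F-measure over the longest common subsequence (LCS) of tokens. The sketch below is a simplified illustration with lowercase whitespace tokenization; `lcs_length` and `rouge_l_f1` are hypothetical helpers, and the `evaluate` metric tokenizes differently, so its numbers will not match exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 on lowercase whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Shared subsequence is "how often do": 3 of 6 tokens on each side
value = rouge_l_f1("How often do the headaches occur?", "How often do you get them?")
print(value)
```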

### **7c. Define Matching Function**

In [None]:
def match_generated_to_best_gold_unique(generated_list, gold_list, scorer, weights=None):
    """
    Greedily matches each generated question, in order, to the best remaining *unique* gold question
    using a weighted composite of BERTScore F1, ROUGE-L, and BLEU.
    Ensures no gold question is matched more than once.
    Returns all matched pairs, matching statistics, and metric averages.

    Args:
        scorer: A bert_score.BERTScorer instance, reused across comparisons
        weights: Optional dict with "bert", "rouge", and "bleu" weights
    """
    if weights is None:
        weights = {"bert": 0.6, "rouge": 0.3, "bleu": 0.1}

    remaining_gold = gold_list.copy()
    matched_pairs = []
    composite_scores = []
    bert_f1s = []
    rouge_ls = []
    bleus = []

    # Track which generated questions matched and which didn't
    health_log = {
        "total_generated": len(generated_list),
        "total_gold": len(gold_list),
        "matched_count": 0,
        "unmatched_count": 0,
        "unmatched_generated": []
    }

    for i, gen_q in enumerate(generated_list):
        if not remaining_gold:  # No more gold questions available
            health_log["unmatched_count"] += 1
            health_log["unmatched_generated"].append((i, gen_q))
            continue

        best_score = -1
        best_match = None
        best_details = {}

        for gold_q in remaining_gold:
            # BERTScore
            #_, _, f1 = score([gen_q], [gold_q], lang="en", verbose=False)
            _, _, f1 = scorer.score([gen_q], [gold_q])

            bert_f1 = f1[0].item()

            # ROUGE
            rouge_result = calculate_rouge([gen_q], [gold_q])
            rouge_l = rouge_result["rougeL"]

            # BLEU
            bleu = calculate_bleu(gen_q, gold_q)

            # Composite score
            composite = (
                weights["bert"] * bert_f1 +
                weights["rouge"] * rouge_l +
                weights["bleu"] * bleu
            )

            if composite > best_score:
                best_score = composite
                best_match = gold_q
                best_details = {
                    "bert_f1": bert_f1,
                    "rouge_l": rouge_l,
                    "bleu": bleu,
                    "composite": composite
                }

        # Save and remove matched gold
        if best_match:
            matched_pairs.append((gen_q, best_match, i))  # Added index of generated question
            composite_scores.append(best_details["composite"])
            bert_f1s.append(best_details["bert_f1"])
            rouge_ls.append(best_details["rouge_l"])
            bleus.append(best_details["bleu"])
            remaining_gold.remove(best_match)
            health_log["matched_count"] += 1

    # Only compute averages if we have matches
    avg_results = {}
    if matched_pairs:
        avg_results = {
            "avg_composite": sum(composite_scores) / len(composite_scores),
            "avg_bert_f1": sum(bert_f1s) / len(bert_f1s),
            "avg_rouge_l": sum(rouge_ls) / len(rouge_ls),
            "avg_bleu": sum(bleus) / len(bleus)
        }
    else:
        avg_results = {
            "avg_composite": 0,
            "avg_bert_f1": 0,
            "avg_rouge_l": 0,
            "avg_bleu": 0
        }

    return {
        "matched_pairs": matched_pairs,
        "health_log": health_log,
        **avg_results
    }
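
The matching strategy above is greedy: generated questions are processed in order, and each takes the best gold question still available. The sketch below isolates that logic, with a toy token-overlap similarity standing in for the BERTScore/ROUGE/BLEU composite; `token_overlap`, `greedy_unique_match`, and the sample questions are hypothetical.

```python
def token_overlap(a, b):
    """Toy similarity: Jaccard overlap of lowercase whitespace tokens."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def greedy_unique_match(generated, gold, similarity):
    """Pair each generated item with its best remaining gold item, in order."""
    remaining = list(gold)
    pairs = []
    for gen in generated:
        if not remaining:
            break  # more generated items than gold items
        best = max(remaining, key=lambda g: similarity(gen, g))
        pairs.append((gen, best))
        remaining.remove(best)  # each gold item can be used only once
    return pairs

generated = ["when did the pain start", "how severe is the pain"]
gold = [
    "how severe is the pain on a scale",
    "when did the headaches start",
    "any other symptoms",
]
pairs = greedy_unique_match(generated, gold, token_overlap)
for gen, matched in pairs:
    print(gen, "->", matched)
```

A greedy pass depends on the order of the generated questions and is not guaranteed to find the best overall assignment; with three generated questions per row the difference is likely small, but an optimal assignment method would be the alternative to consider.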


### **7d. Apply Matching to Generated and Gold Questions**

In [None]:
from bert_score import BERTScorer
scorer = BERTScorer(lang="en", model_type="roberta-large")


composite_results = []
health_logs = []
skipped_rows = []

for entry in generated_all:
    idx = entry["index"]
    gen = entry["generated_questions"]
    gold = df.loc[idx, "gold_questions"]

    # Validate
    if not isinstance(gen, list) or len(gen) == 0 or "Error" in gen[0] or len(gen) != 3:
        print(f"⚠️ Skipping row {idx}: Invalid generated questions")
        skipped_rows.append({"index": idx, "reason": "Invalid generated questions"})
        continue
    if not isinstance(gold, list) or len(gold) == 0:
        print(f"⚠️ Skipping row {idx}: No gold questions available")
        skipped_rows.append({"index": idx, "reason": "No gold questions available"})
        continue

    # Match and evaluate
    result = match_generated_to_best_gold_unique(gen, gold, scorer=scorer)

    composite_results.append({
        "index": idx,
        "avg_composite": result["avg_composite"],
        "avg_bert_f1": result["avg_bert_f1"],
        "avg_rouge_l": result["avg_rouge_l"],
        "avg_bleu": result["avg_bleu"],
        "matched_pairs": result["matched_pairs"]
    })

    health_logs.append({
        "index": idx,
        "health_log": result["health_log"]
    })


### **7e. Display evaluation summary with overall averages**

In [None]:
# Create a DataFrame for the results
composite_df = pd.DataFrame(composite_results)

# Calculate overall averages
overall_averages = {
    "index": "OVERALL AVG",
    "avg_composite": composite_df["avg_composite"].mean(),
    "avg_bert_f1": composite_df["avg_bert_f1"].mean(),
    "avg_rouge_l": composite_df["avg_rouge_l"].mean(),
    "avg_bleu": composite_df["avg_bleu"].mean()
}

# Add the overall averages row
composite_df_with_totals = pd.concat([
    composite_df,
    pd.DataFrame([overall_averages])
])

# Show the average scores with overall averages
print("📊 Evaluation Summary:")
display(composite_df_with_totals[["index", "avg_composite", "avg_bert_f1", "avg_rouge_l", "avg_bleu"]])

📊 Evaluation Summary:


|   | index | avg_composite | avg_bert_f1 | avg_rouge_l | avg_bleu |
|---|-------|---------------|-------------|-------------|----------|
| 0 | 0 | 0.65734 | 0.91539 | 0.346314 | 0.042121 |
| 1 | 1 | 0.709851 | 0.92757 | 0.445495 | 0.196608 |
| 2 | 2 | 0.686882 | 0.915816 | 0.412008 | 0.137898 |
| 3 | 3 | 0.719099 | 0.947732 | 0.446934 | 0.163793 |
| 4 | 4 | 0.663202 | 0.919915 | 0.355556 | 0.045864 |
| 5 | 5 | 0.676196 | 0.932434 | 0.366758 | 0.067081 |
| 6 | 6 | 0.695849 | 0.922946 | 0.429762 | 0.131526 |
| 7 | 7 | 0.775357 | 0.946453 | 0.599415 | 0.276612 |
| 8 | 8 | 0.784029 | 0.95072 | 0.592398 | 0.358781 |
| 9 | 9 | 0.695404 | 0.912762 | 0.432941 | 0.178648 |
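
As a sanity check on the summary, the composite is a linear combination of the three metrics, so a row's average composite should equal the weighted sum of its average metrics. The sketch below recomputes row 0 from the values shown above, using the default weights from the matching function; the variable names are illustrative.

```python
# Default weights from match_generated_to_best_gold_unique
weights = {"bert": 0.6, "rouge": 0.3, "bleu": 0.1}

# Row 0 averages copied from the evaluation summary
row0 = {"bert": 0.91539, "rouge": 0.346314, "bleu": 0.042121}

# 0.6 * BERTScore F1 + 0.3 * ROUGE-L + 0.1 * BLEU
composite = sum(weights[name] * row0[name] for name in weights)
print(round(composite, 5))  # 0.65734, matching avg_composite for row 0
```

With BERTScore F1 above 0.9 on every row and a weight of 0.6, most of each composite comes from the BERTScore term.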


### **7f. Display Health Logs, Skipped Rows, and Matched Pairs**

In [None]:
# Show the health logs
print("\n🏥 Health Logs:")
for log in health_logs:
    idx = log["index"]
    health = log["health_log"]
    print(f"\nRow {idx}:")
    print(f"  Generated: {health['total_generated']}, Gold: {health['total_gold']}")
    print(f"  Matched: {health['matched_count']}, Unmatched: {health['unmatched_count']}")
    if health['unmatched_count'] > 0:
        print("  Unmatched generated questions:")
        for i, q in health['unmatched_generated']:
            print(f"    - Q{i+1}: {q}")

# Show skipped rows
if skipped_rows:
    print("\n⚠️ Skipped Rows:")
    for row in skipped_rows:
        print(f"  Row {row['index']}: {row['reason']}")

# Show the actual matched pairs
print("\n🧾 Matched Generated ↔️ Gold Pairs:")
for row in composite_results:
    print(f"\nRow {row['index']}:")
    for i, (gen_q, gold_q, gen_idx) in enumerate(row["matched_pairs"]):
        print(f"Q{gen_idx+1} Gen : {gen_q}")
        print(f"     ↪️ Matched Gold: {gold_q}\n")


🏥 Health Logs:

Row 0:
  Generated: 3, Gold: 11
  Matched: 3, Unmatched: 0

Row 1:
  Generated: 3, Gold: 10
  Matched: 3, Unmatched: 0

Row 2:
  Generated: 3, Gold: 9
  Matched: 3, Unmatched: 0

Row 3:
  Generated: 3, Gold: 10
  Matched: 3, Unmatched: 0

Row 4:
  Generated: 3, Gold: 10
  Matched: 3, Unmatched: 0

Row 5:
  Generated: 3, Gold: 7
  Matched: 3, Unmatched: 0

Row 6:
  Generated: 3, Gold: 10
  Matched: 3, Unmatched: 0

Row 7:
  Generated: 3, Gold: 10
  Matched: 3, Unmatched: 0

Row 8:
  Generated: 3, Gold: 10
  Matched: 3, Unmatched: 0

Row 9:
  Generated: 3, Gold: 9
  Matched: 3, Unmatched: 0

Row 10:
  Generated: 3, Gold: 10
  Matched: 3, Unmatched: 0

Row 11:
  Generated: 3, Gold: 7
  Matched: 3, Unmatched: 0

Row 12:
  Generated: 3, Gold: 10
  Matched: 3, Unmatched: 0

Row 13:
  Generated: 3, Gold: 9
  Matched: 3, Unmatched: 0

Row 14:
  Generated: 3, Gold: 9
  Matched: 3, Unmatched: 0

Row 15:
  Generated: 3, Gold: 10
  Matched: 3, Unmatched: 0

Row 16:
  Generated: 3,