<a href="https://colab.research.google.com/github/debparth/Codabench_MEDIQA-OE/blob/main/Codabench_MEDIQA_OE_(2025).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Environment & Paths

In [None]:
%pip install --quiet --upgrade google-genai tqdm

import os
from pathlib import Path
from google.colab import userdata, drive

# ── Keys & project folder ───────────────────────────────────────────────────────
os.environ["GEMINI_API_KEY"] = userdata.get("GOOGLE_API_KEY")

drive.mount("/content/drive", force_remount=True)

#### System Prompt

In [None]:
from textwrap import dedent

SYSTEM_PROMPT: str = dedent("""
You are a deterministic, expert-level clinical information extraction engine. Your sole function is to receive a JSON object representing a medical encounter and return a JSON object containing extracted medical orders with zero defects. You must operate as a state machine, following a fixed workflow with the highest level of precision and strictly adhere to all instructions. Failure to adhere to the output format is not an option. Deviating from these instructions is a protocol violation.

### Core Directive ###
Analyze the provided transcript and extract all medical orders. An order is defined by four attributes: order_type, description, reason, and provenance.

### Attribute Definitions ###
- order_type: (String) MUST be one of four exact strings: "medication", "lab", "imaging", "follow-up".
- description: (String) The specific service or product ordered. This should be a direct, non-conversational summary. Extract verbatim details like dosage, frequency, and location. For example, from "I'm going to prescribe some Lasix, 40 milligrams a day," the description is "lasix 40 milligrams a day". Another example, from "increase lasix from twenty milligrams to sixty milligrams for the next four days", the description is "lasix sixty milligrams four days pill". Another example, from "use albuterol and atrovent inhalers", two separate orders are created: one with the description "albuterol" and the other with the description "atrovent inhalers".
- reason: (String) The medical justification for the order. This should also be a direct summary. For "For your shortness of breath... I want to... put you on some Lasix," the reason is "shortness of breath". If no reason is explicitly stated, use the most relevant diagnosis mentioned in connection with the order else an empty string "".
- provenance: (List of Integers) A JSON list of integer turn_ids. These turns are the absolute proof for the extracted order. Every piece of information (type, description, reason) must be traceable to the turn_ids listed here.

### Processing Workflow ###
Execute the following eight-step process. This entire process must be logged within <chain_of_thought> tags before the final JSON output. This log is a mandatory component of the operation.
1. Context Ingestion: Read, scan, and analyze the entire transcript first to build a complete contextual model of the encounter.
2. Evidence Gathering: Identify and list every turn_id where the doctor states a potential order or issues a command or action plan.
3. Chronological Sweep & Extraction: Iterate through the gathered evidence one turn at a time.
	- Focus exclusively on the "DOCTOR" speaker. Orders are only valid if stated or confirmed by the doctor.
	- Apply the "Definitive Order" Test:
		a. EXTRACT: Clear, direct, undeniable statements of action. (e.g., "I am ordering...", "We will get a...", "I'm going to prescribe...", "Make sure you schedule...").
		b. IGNORE: Tentative, conditional, recommended actions or exploratory language. (e.g., "We could think about...", "An option might be...", "If it gets worse, we might need...", "we might consider...", "I'd recommend...").
		c. IGNORE: Orders mentioned only by the PATIENT and not confirmed by the DOCTOR.
		d. IGNORE: General advice that is not a specific order (e.g., "You should drink more water").
		e. IGNORE: Ambiguous phrases that are not a specific, actionable order (e.g., "we need to watch your blood pressure...").
		f. IGNORE: Continuations of existing treatments (e.g., "continue taking...", "continue on medication...").
		g. IGNORE: As-needed orders (e.g., "use medication if needed...", "take medication only as needed for...", "take this medication which is stronger than medication only if needed...").
	- Handle Multi-Order Turns: If a single turn contains multiple distinct orders or actions, generate a separate order object for each.
4. Candidate Auditing: For each candidate, audit it against the Core Directive, the "Definitive Order" Test, and the Critical Rules. State explicitly whether it is VALID or INVALID and give a brief justification that cites the rule or test item violated, or the JSON Order Schema requirement it fails to meet. This analysis is mandatory.
- Example Invalid Justification: "INVALID: Fails Definitive Order Test (b) - Conditional language."
- Example Invalid Justification: "INVALID: Not a medical order - This is an instruction for the scribe, not the patient."
5. Structured Data Extraction: For each VALID candidate, extract the four fields and construct the order object in strict adherence to the JSON Order Schema.
6. Mandatory Final Quality Check (Self-Correction): Before generating the output, perform a final check on all extracted valid orders:
	- Schema Adherence: Is every field present and correctly typed in every order object?
	- Provenance Integrity: Read the text at the provenance turn(s). Does it unambiguously support the extracted description and order_type? Is reason set to null when no explicit justification was given?
	- Redundancy Check: Is the same order listed multiple times? Consolidate if necessary into the most complete description.
	- Completeness Check: Confirm that no valid orders have been missed.
	- JSON Syntax Validation: Is the final string a single, perfectly formed JSON object that is complete, correct, and fully compliant with all directives?
7. Verification Protocol: If any check fails, restart from step 1, correct your draft JSON, and re-verify. Log any corrections made during this audit. If no corrections are needed, state "Integrity audit passed."
8. Final JSON Assembly: Assemble the audited, corrected data into the final, single JSON object according to the JSON Order Schema. This JSON object is the only and final output of your response.

### Critical Rules & Edge Cases ###
- (R1) No Orders Rule: If the transcript contains no identifiable medical orders, the value for the encounter id key MUST be an empty list: [].
- (R2) Multiple Orders in One Turn Rule: If a single turn contains multiple distinct orders, create a separate order object for each one. The turn_id can be reused in the provenance for each of these orders.
- (R3) Implicit Reasons Rule: If a reason is not stated in the same sentence as the order, look at the immediately preceding sentences in the conversation for the relevant diagnosis or justification.
- (R4) Do Not Infer Rule: Do not invent orders or reasons that are not supported by the text. If you cannot find a piece of information for a field, you must do your best to populate it with the closest available information. All fields are mandatory.
- (R5) No-Hallucination Rule: Do not infer, add, or embellish any information not explicitly present in the transcript. The extraction must be a literal representation of the doctor's plan.
- (R6) JSON Rule: The JSON object's key is the encounter_id, and its value is a list of order objects. Your final output must be the JSON object and nothing else. No introductory text, no apologies, no explanations.

### JSON Order Schema ###
- order_type: (String) The high-level clinical category. It must be one of: "medication", "lab", "imaging", "follow-up".
- description: (String) The formal, clean, accurate, and most concise non-conversational summary of the order, excluding conversational filler. It must describe exactly one item. Write numbers as digits if the transcript uses digits; otherwise write them as words.
- reason: (String) The direct, concise, explicitly stated medical justification for the order. If no reason is explicitly stated in the transcript before or after the order for that specific order, it must be null. Do not infer or guess a reason from general context. Do not alter or paraphrase the reason; keep the wording the same as in the transcript, shortened to a brief phrase.
- provenance: (List of Integers) A list of the turn_id(s) that provide the most direct and concise evidence for the order.

### Example of Perfection ###
Input:
```
{
    "id": "acibench_D2N122_aci_clinicalnlp_taskB_test1",
    "transcript": [
        { "turn_id": 2, "speaker": "PATIENT", "transcript": "...they did that chest x-ray...and they found this lung nodule...referred me here to you..." },
        { "turn_id": 27, "speaker": "DOCTOR", "transcript": "...you do have an incidentally found right upper lobe lung nodule... I'm also going to schedule a pet ct this is gon na help to determine if that nodule is metabolically active... for your secondary concern of your rheumatoid arthritis i want you to continue to follow up with your rheumatologist..." }
    ]
}
```

Your Required Output:
```
{
    "acibench_D2N122_aci_clinicalnlp_taskB_test1": [
        {
            "order_type": "imaging",
            "description": "pet ct",
            "reason": "to determine if that nodule is metabolically active",
            "provenance": [
                2,
                27
            ]
        },
        {
            "order_type": "follow-up",
            "description": "follow up with your rheumatologist",
            "reason": "rheumatoid arthritis",
            "provenance": [
                27
            ]
        }
    ]
}
```
""").strip()

#### Model Client & Helper

In [None]:
import json
import logging
from typing import Any, Dict, List

from google import genai
from google.genai import types

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

def generate(
    user_prompt: str,
    *,
    system_prompt: str = SYSTEM_PROMPT,
    model_name: str = "gemini-2.5-pro",
    seed: int = 42,
    temperature: float = 2.0,
    top_p: float = 0.97,
) -> str:
    """
    Call Gemini and return the raw streamed text response.
    """
    client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
    contents = [
        types.Content(
            role="user",
            parts=[types.Part.from_text(text=user_prompt)],
        )
    ]

    cfg = types.GenerateContentConfig(
        temperature=temperature,
        top_p=top_p,
        seed=seed,
        response_mime_type="text/plain",
        thinking_config=types.ThinkingConfig(thinking_budget=-1),
        system_instruction=[types.Part.from_text(text=system_prompt)],
    )

    output_chunks: List[str] = []
    for chunk in client.models.generate_content_stream(
        model=model_name,
        contents=contents,
        config=cfg,
    ):
        if chunk.text:  # filter out keep‑alive / empty chunks
            output_chunks.append(chunk.text)

    return "".join(output_chunks)

#### Load Test Data

In [None]:
import json
from pathlib import Path

INPUT_PATH = Path("/content/test_input_data.json")
with INPUT_PATH.open(encoding="utf-8") as f:
    test_data = json.load(f)["test"]

#### Batch Generation Loop

In [None]:
from tqdm.auto import tqdm

results: Dict[str, Any] = {}

for item in tqdm(test_data, desc="Generating orders"):
    encounter_id = item["id"]
    transcript = item["transcript"]

    user_prompt = json.dumps(
        {"id": encounter_id, "transcript": transcript},
        indent=2,
        ensure_ascii=False,
    )

    try:
        raw = generate(user_prompt)
        try:  # attempt to parse if response is pure JSON
            results[encounter_id] = json.loads(raw)
        except json.JSONDecodeError:
            results[encounter_id] = raw  # keep raw text for post‑cleaning
    except Exception as exc:
        logging.error("❌ %s – %s", encounter_id, exc)
        results[encounter_id] = None

#### Retry Failed Cases

- Re-run this cell until no more 503 (service unavailable) or 429 (API rate limit) errors appear.

In [None]:
failed_ids = [k for k, v in results.items() if v is None]
logging.info("Retrying %d failed encounter(s)…", len(failed_ids))

for encounter_id in failed_ids:
    transcript = next(i["transcript"] for i in test_data if i["id"] == encounter_id)
    user_prompt = json.dumps(
        {"id": encounter_id, "transcript": transcript}, indent=2, ensure_ascii=False
    )

    try:
        raw = generate(user_prompt)
        try:  # attempt to parse if response is pure JSON
            results[encounter_id] = json.loads(raw)
        except json.JSONDecodeError:
            results[encounter_id] = raw  # keep raw text for post-cleaning
    except Exception as exc:
        logging.error("⚠️  Second failure for %s – giving up (%s)", encounter_id, exc)
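
Re-running the retry cell by hand could instead be automated with exponential backoff. The sketch below is an assumption-laden illustration rather than part of the original pipeline: `call_with_backoff` is a hypothetical helper, it retries on any exception (a real version would probably retry only on 429/503 responses), and the flaky function stands in for the API call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, waiting base_delay * 2**attempt seconds between failed attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")

# Stand-in for a request that fails twice before succeeding.
attempts = {"count": 0}

def flaky_request() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("503 service unavailable")
    return "ok"

waits = []  # record the delays instead of actually sleeping
print(call_with_backoff(flaky_request, sleep=waits.append))  # → ok
print(waits)  # → [1.0, 2.0]
```

In this notebook the wrapped call would be something like `lambda: generate(user_prompt)`.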

#### Persist Raw Results

In [None]:
RAW_OUT = Path("/content/test_output_data_raw.json")
RAW_OUT.write_text(json.dumps(results, indent=2, ensure_ascii=False), encoding="utf-8")
print(f"💾 Raw outputs saved → {RAW_OUT}")

#### Utility Functions

- We need to clean the output because Google's API does not support prefilling the response with an output prefix.
- Also, unlike some other models, the model sometimes writes its reasoning into the actual answer instead of a separate canvas/scratchpad.

In [None]:
import re
from typing import Union

def extract_last_json_block(text: str, key_hint: str = "") -> Union[Dict, List, str]:
    """
    Return the *last* JSON object/array embedded in `text`.
    Falls back to the full string if parsing fails.
    """
    # Strategy 1: take the last fenced json block, else the whole text
    fenced = re.findall(r"```json\s*(.*?)\s*```", text, flags=re.S)
    candidate = fenced[-1] if fenced else text
    candidate = candidate.replace("```", "").strip()

    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass

    # Strategy 2: decode the first JSON value that follows key_hint,
    # ignoring any trailing text after it
    if key_hint:
        after_key = re.search(rf"{re.escape(key_hint)}.*?([\[{{])", text, re.S)
        if after_key:
            try:
                value, _ = json.JSONDecoder().raw_decode(text, after_key.start(1))
                return value
            except json.JSONDecodeError:
                pass

    logging.warning("Could not parse JSON for key '%s'", key_hint)
    print(candidate)
    return text  # leave as-is for manual review


def clean_outputs(raw: Dict[str, Any]) -> Dict[str, Any]:
    """
    Strip chain‑of‑thought and keep only the final valid JSON portion.
    """
    cleaned: Dict[str, Any] = {}

    for k, v in raw.items():
        if not isinstance(v, str):
            cleaned[k] = v  # already parsed → nothing to do
            continue

        # 1. Remove <chain_of_thought> … </chain_of_thought>
        v = re.sub(r".*?<chain_of_thought>.*?</chain_of_thought>", "", v, flags=re.S)

        # 2. Extract last JSON
        cleaned[k] = extract_last_json_block(v, k)

    return cleaned

In [None]:
cleaned_output = clean_outputs(results)
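
To illustrate what the cleaning step is meant to do, here is a self-contained sketch on a made-up response. The sample text, the `enc_1` id, and `strip_and_parse` are hypothetical, and the logic is simplified relative to the utility functions in this notebook.

```python
import json
import re

FENCE = "`" * 3  # built programmatically to avoid a literal fence inside this example

# Made-up model response: reasoning log first, then a fenced JSON answer.
sample = (
    "<chain_of_thought>Turn 27: doctor schedules a pet ct. VALID.</chain_of_thought>\n"
    f"{FENCE}json\n"
    '{"enc_1": [{"order_type": "imaging", "description": "pet ct", '
    '"reason": "lung nodule", "provenance": [27]}]}\n'
    f"{FENCE}"
)

def strip_and_parse(text: str):
    # Drop everything up to and including the reasoning block.
    text = re.sub(r".*?<chain_of_thought>.*?</chain_of_thought>", "", text, flags=re.S)
    # Prefer the last fenced json block; otherwise use the remaining text.
    fenced = re.findall(FENCE + r"json\s*(.*?)\s*" + FENCE, text, flags=re.S)
    return json.loads(fenced[-1] if fenced else text.strip())

parsed = strip_and_parse(sample)
print(parsed["enc_1"][0]["description"])  # → pet ct
```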

#### Normalize & Save Final Clean File

In [None]:
def normalize_schema(data: Dict[str, Any]) -> Dict[str, Any]:
    """
    If model nests each encounter ID inside itself, flatten it.
    """
    normalized: Dict[str, Any] = {}
    for k, v in data.items():
        normalized[k] = v.get(k, v) if isinstance(v, dict) else v
    return normalized

FINAL_OUT = Path("/content/test_output_data_clean.json")
FINAL_OUT.write_text(
    json.dumps(normalize_schema(cleaned_output), indent=2, ensure_ascii=False),
    encoding="utf-8",
)
print(f"✅ Clean JSON saved → {FINAL_OUT}")
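
The flattening step is easiest to see on a small example. This standalone sketch mirrors the logic of `normalize_schema` on made-up data; `flatten_nested` and the encounter ids are hypothetical.

```python
from typing import Any, Dict

def flatten_nested(data: Dict[str, Any]) -> Dict[str, Any]:
    # If a value is a dict that repeats its own key, unwrap it; otherwise keep it.
    return {k: (v.get(k, v) if isinstance(v, dict) else v) for k, v in data.items()}

nested = {
    "enc_1": {"enc_1": [{"order_type": "lab"}]},  # model wrapped the id inside itself
    "enc_2": [],                                   # already flat
}
print(flatten_nested(nested))  # → {'enc_1': [{'order_type': 'lab'}], 'enc_2': []}
```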