<a href="https://colab.research.google.com/github/christiejibaraki/CUREBench/blob/main/notebooks/2-eval_gpt_oss_20b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Run CUREBench eval
- run prediction
- run eval

In [1]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0


In [2]:
import os
import sys

In [3]:
os.environ['BNB_CUDA_VERSION'] = '125'
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

## Setup
- Clone the forked CUREBench repo into Colab's local `/content` folder (not persistent across sessions)
- Create a virtual environment and install packages
  - Package installation takes about **5 min** ⏰
- Add the virtual environment's `site-packages` to the notebook's `sys.path`

In [4]:
!git clone https://github.com/christiejibaraki/CUREBench.git

Cloning into 'CUREBench'...
remote: Enumerating objects: 215, done.
remote: Counting objects: 100% (126/126), done.
remote: Compressing objects: 100% (94/94), done.
remote: Total 215 (delta 73), reused 61 (delta 32), pack-reused 89 (from 2)
Receiving objects: 100% (215/215), 3.00 MiB | 8.45 MiB/s, done.
Resolving deltas: 100% (102/102), done.


In [5]:
%cd CUREBench

/content/CUREBench


In [6]:
!git pull

Already up to date.


In [8]:
!pip install virtualenv

Collecting virtualenv
  Downloading virtualenv-20.35.3-py3-none-any.whl.metadata (4.6 kB)
Collecting distlib<1,>=0.3.7 (from virtualenv)
  Downloading distlib-0.4.0-py2.py3-none-any.whl.metadata (5.2 kB)
Downloading virtualenv-20.35.3-py3-none-any.whl (6.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.0/6.0 MB 81.5 MB/s eta 0:00:00
Downloading distlib-0.4.0-py2.py3-none-any.whl (469 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 469.0/469.0 kB 49.4 MB/s eta 0:00:00
Installing collected packages: distlib, virtualenv
Successfully installed distlib-0.4.0 virtualenv-20.35.3


In [9]:
!virtualenv env

created virtual environment CPython3.12.12.final.0-64 in 246ms
  creator CPython3Posix(dest=/content/CUREBench/env, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==25.2
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator


In [10]:
!./env/bin/pip list

Package Version
------- -------
pip     25.2


In [None]:
!./env/bin/pip install -r requirements.txt

In [12]:
env_path = './env/lib/python3.12/site-packages'
sys.path.append(env_path)
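
The hard-coded `python3.12` segment above matches the interpreter that `virtualenv` reported in this session. A minimal sketch of deriving the path from the running interpreter instead, assuming the virtual environment was created with the same Python version as the notebook kernel (`venv_site_packages` is a hypothetical helper, not part of CUREBench):

```python
import os
import sys


def venv_site_packages(env_dir="./env"):
    """Return the POSIX-style site-packages path inside a virtualenv directory."""
    # Assumption: the env uses the same major.minor version as this interpreter.
    version = f"python{sys.version_info.major}.{sys.version_info.minor}"
    return os.path.join(env_dir, "lib", version, "site-packages")


env_path = venv_site_packages("./env")
if env_path not in sys.path:  # avoid duplicate entries when the cell is re-run
    sys.path.append(env_path)
```

Checking membership before appending keeps `sys.path` from growing each time the cell is executed again.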

### read config

In [13]:
import json

In [14]:
config_path = "metadata_config_val.json"
# Read the config with a context manager so the file handle is closed
with open(config_path, "r") as f:
    config = json.load(f)
# Fall back to an empty dict so the lookups below never hit an undefined name
dataset_config = config.get("dataset", {})
dataset_name = dataset_config.get("dataset_name", "treatment")
print(f"\nconfig file: {config_path}\ncontents:\n{dataset_config}")
dataset_path = dataset_config.get("dataset_path")


config file: metadata_config_val.json
contents:
{'dataset_name': 'cure_bench_phase1_val', 'dataset_path': 'resources/curebench_valset_pharse1.jsonl', 'description': 'CureBench 2025 val questions'}


In [15]:
config

{'metadata': {'model_name': 'gpt-oss-20b',
  'model_type': 'Open weight model',
  'track': 'internal_reasoning',
  'base_model_type': 'Open weight model',
  'base_model_name': 'gpt-oss-20b',
  'dataset': 'cure_bench_phase1_val',
  'additional_info': 'Submission using configuration file'},
 'dataset': {'dataset_name': 'cure_bench_phase1_val',
  'dataset_path': 'resources/curebench_valset_pharse1.jsonl',
  'description': 'CureBench 2025 val questions'},
 'output_dir': 'competition_test_results'}
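
Since everything downstream depends on the `dataset` block shown above, it can help to fail early when that block is missing. The following is a minimal sketch, not part of the CUREBench code: `read_dataset_config` is a hypothetical helper, and the temporary file only copies the two keys printed above so the example runs on its own.

```python
import json
import os
import tempfile


def read_dataset_config(config_path):
    """Return (dataset_name, dataset_path) from a config file, or raise if missing."""
    with open(config_path, "r") as f:
        config = json.load(f)
    dataset_config = config.get("dataset")
    if not dataset_config or "dataset_path" not in dataset_config:
        raise KeyError(f"{config_path} has no dataset.dataset_path entry")
    # 'treatment' mirrors the default name used in the config-reading cell
    name = dataset_config.get("dataset_name", "treatment")
    return name, dataset_config["dataset_path"]


# Usage with a throwaway file that copies the 'dataset' block printed above
sample = {"dataset": {"dataset_name": "cure_bench_phase1_val",
                      "dataset_path": "resources/curebench_valset_pharse1.jsonl"}}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(sample, tmp)
name, path = read_dataset_config(tmp.name)
os.remove(tmp.name)
print(name, path)
```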

### load model with competition kit

In [16]:
from core.eval_framework import CompetitionKit, load_and_merge_config, create_metadata_parser

In [17]:
model_name = "gpt-oss-20b" # config['metadata']['model_name']
print(f"model name: {model_name}")
model_class = 'auto'

model name: gpt-oss-20b


In [18]:
kit = CompetitionKit(config_path=config_path)

In [None]:
print(f"Loading model: {model_name}")
kit.load_model(model_name, model_class)

In [20]:
kit.model.system_identity

'\nYou are an expert medical assistant specializing in answering questions.\n\n**Your communication MUST strictly adhere to the Harmony channels:**\n1.  **analysis:** Use this for all internal Chain-of-Thought (CoT), clinical reasoning, and factual processing. This content is for internal use only.\n2.  **final:** Use this channel for the final output intended for the user.\n\n**Output Rule is Conditional:**\n* **If the input is a Multiple-Choice Question (MCQ):** Your output MUST be a single, valid JSON object containing only the selected answer letter.\n    * **Format:** `{"answer": "<LETTER>"}` (e.g., `{"answer": "A"}`)\n* **If the input is an Open-Ended Question:** Your output MUST be a detailed, coherent narrative response.\n\n**Instruction:** Generate a complete response. The final output must use the \'final\' channel and adhere to the conditional format rule.'

In [21]:
kit.list_datasets()

Available Datasets:
--------------------------------------------------
  cure_bench_phase1_val: CureBench 2025 val questions


### run model with evaluate
- Takes about 23 seconds per example on an L4 GPU (the 3-example run logged below averaged about 46 s per example)
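
Per-example latency matters because the validation set reported in this notebook has 459 examples. A back-of-the-envelope sketch, assuming examples are processed sequentially and latency stays constant (`estimate_runtime_hours` is a hypothetical helper), suggests a full run would take roughly three to six hours depending on which per-example figure holds:

```python
def estimate_runtime_hours(num_examples, seconds_per_example):
    """Estimate sequential wall-clock time in hours."""
    return num_examples * seconds_per_example / 3600


# 459 is the validation-set size reported by CureBenchDataset in this notebook;
# 46.47 s is the average shown by the progress bar for the 3-example run.
for label, secs in [("23 s/example", 23.0), ("46.47 s/example", 46.47)]:
    print(f"{label}: {estimate_runtime_hours(459, secs):.1f} h")
```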

In [22]:
subset_size = 3
print(f"Running evaluation on dataset: {dataset_name} (subset-size={subset_size})")
results = kit.evaluate(dataset_name, subset_size=subset_size)

Running evaluation on dataset: cure_bench_phase1_val (subset-size=3)
dataset_path: resources/curebench_valset_pharse1.jsonl
CureBenchDataset initialized with 459 examples


Evaluating:   0%|          | 0/3 [00:00<?, ?it/s]
stopping criteria: []
Evaluating:  33%|███▎      | 1/3 [00:31<01:02, 31.48s/it]
expected_answer: A
stopping criteria: []
stopping criteria: []
Evaluating:  67%|██████▋   | 2/3 [01:36<00:51, 51.10s/it]
expected_answer: B
stopping criteria: []
stopping criteria: []
Evaluating: 100%|██████████| 3/3 [02:19<00:00, 46.47s/it]
expected_answer: B

In [23]:
results

EvaluationResult(dataset_name='cure_bench_phase1_val', model_name='gpt-oss-20b', accuracy=1.0, correct_predictions=3, total_examples=3, predictions=[{'choice': 'A', 'open_ended_answer': '{"answer": "A"}'}, {'choice': 'B', 'open_ended_answer': 'analysisOpen ended; need explanation of steps: seek immediate care, call emergency services, administer epinephrine, etc. Also inform provider, stop infusion, monitor vitals, possible antihistamines/ steroids. Mention reporting adverse event.AssistantfinalWhen someone develops signs that point toward a serious hypersensitivity reaction—such as sudden swelling around the face or throat, difficulty breathing, wheezing, hives spreading over large areas of skin, chest tightness, rapid heartbeat, dizziness, fainting, or any other symptoms suggesting angioedema or anaphylaxis—they should act immediately:\n\n| Step | Action |\n|------|--------|\n| **Stop the drug** | If you’re still on the IV line, have your healthcare team halt the administration right
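
The second prediction above still contains the reasoning text, with `analysis` and `Assistantfinal` appearing as plain strings in `open_ended_answer`. A minimal sketch of separating the two parts after the fact, assuming the markers always show up as plain text in this form (`split_harmony_text` is a hypothetical helper, not the parser used by `CompetitionKit` or `openai_harmony`):

```python
import re

_FINAL_MARKER = re.compile(r"assistantfinal", re.IGNORECASE)


def split_harmony_text(raw):
    """Split decoded text into (analysis, final) at the last 'assistantfinal' marker."""
    matches = list(_FINAL_MARKER.finditer(raw))
    if not matches:
        return "", raw.strip()  # no marker: treat everything as the final answer
    last = matches[-1]
    analysis = raw[:last.start()]
    # Drop the leading channel name if it leaked into the text
    if analysis.lower().startswith("analysis"):
        analysis = analysis[len("analysis"):]
    return analysis.strip(), raw[last.end():].strip()


analysis, final = split_harmony_text(
    "analysisNeed explanation.AssistantfinalSeek emergency care."
)
print(repr(analysis), repr(final))
```

Splitting on the last marker rather than the first keeps any earlier, abandoned attempts inside the analysis part.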

### run model manually

In [None]:
import pandas as pd
from typing import Dict, List
from torch.utils.data import DataLoader
from dataset_utils import build_dataset
from core.eval_framework import GPTOSS20BModel
from openai_harmony import load_harmony_encoding

In [None]:
def load_dataset_by_config(config_path):
    """Load the dataset named in a config file and return it as a list of dicts."""
    # Load the config file to get dataset info
    with open(config_path, "r") as f:
        config = json.load(f)
    dataset_config = config.get("dataset", {})
    print(f"\nconfig file: {config_path}\ncontents:\n{dataset_config}")

    # Build the dataset; with batch_size=1 each field is read with index [0]
    dataset = build_dataset(dataset_config.get("dataset_path"))
    dataloader = DataLoader(dataset, batch_size=1, shuffle=False)

    known_types = {"multi_choice", "open_ended_multi_choice", "open_ended"}
    dataset_list = []
    for batch in dataloader:
        question_type = batch[0][0]
        if question_type not in known_types:
            continue  # unrecognized question types are skipped
        record = {
            "question_type": question_type,
            "id": batch[1][0],
            "question": batch[2][0],
            "answer": batch[3][0],
        }
        # Only open_ended_multi_choice examples carry a meta question
        if question_type == "open_ended_multi_choice":
            record["meta_question"] = batch[4][0]
        dataset_list.append(record)
    return dataset_list

In [None]:
val_data_config_path= "metadata_config_val.json"
val_data_list = load_dataset_by_config(val_data_config_path)


config file: metadata_config_val.json
contents:
{'dataset_name': 'cure_bench_phase1_val', 'dataset_path': 'resources/curebench_valset_pharse1.jsonl', 'description': 'CureBench 2025 val questions'}
dataset_path: resources/curebench_valset_pharse1.jsonl
CureBenchDataset initialized with 459 examples
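
Because the prompt format depends on `question_type`, it can be useful to tally the types before running inference. A small sketch using stand-in records (the IDs and types are copied from examples printed in this notebook; `count_question_types` is a hypothetical helper that would work the same way on `val_data_list`):

```python
from collections import Counter


def count_question_types(records):
    """Count how many records there are of each question_type."""
    return Counter(record["question_type"] for record in records)


# Stand-in records; in the notebook this would be val_data_list
sample_records = [
    {"question_type": "multi_choice", "id": "U9PHZ83RKYV8"},
    {"question_type": "open_ended_multi_choice", "id": "vIGwm8qguXYi"},
    {"question_type": "open_ended_multi_choice", "id": "WfWiWK0yULaX"},
]
type_counts = count_question_types(sample_records)
for question_type, count in type_counts.most_common():
    print(f"{question_type}: {count}")
```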


In [None]:
dataset = val_data_list[:10]

In [None]:
dataset

[{'question_type': 'multi_choice',
  'id': 'U9PHZ83RKYV8',
  'question': 'Which drug brand name is associated with the treatment of acne?\nA: Salicylic Acid\nB: Minoxidil\nC: Ketoconazole\nD: Fluocinonide',
  'answer': 'A'},
 {'question_type': 'open_ended_multi_choice',
  'id': 'vIGwm8qguXYi',
  'question': 'What should patients do if they experience severe allergic reactions during or after receiving fosaprepitant for injection?',
  'answer': 'B',
  'meta_question': "The following is a multiple choice question about medicine and the agent's open-ended answer to the question. Convert the agent's answer to the final answer format using the corresponding option label, e.g., 'A', 'B', 'C', 'D', 'E' or 'None'. \n\nQuestion: What should patients do if they experience severe allergic reactions during or after receiving fosaprepitant for injection?\nA: Wait for the symptoms to resolve on their own.\nB: Inform their healthcare provider immediately and seek emergency medical care.\nC: Stop chem

In [None]:
example = dataset[3]
print(example)

{'question_type': 'open_ended_multi_choice', 'id': 'WfWiWK0yULaX', 'question': 'Which of the following conditions is a contraindication for the use of Gadavist?', 'answer': 'B', 'meta_question': "The following is a multiple choice question about medicine and the agent's open-ended answer to the question. Convert the agent's answer to the final answer format using the corresponding option label, e.g., 'A', 'B', 'C', 'D', 'E' or 'None'. \n\nQuestion: Which of the following conditions is a contraindication for the use of Gadavist?\nA: Mild hypersensitivity reactions to Gadavist\nB: History of severe hypersensitivity reactions to Gadavist\nC: Renal impairment\nD: Liver dysfunction\n\n"}


In [None]:
question = example["question"]
question_type = example["question_type"]

In [None]:
# Format prompt
if question_type == "multi_choice":
    prompt = f"The following is a multiple choice question about medicine. Answer with a valid json containing the letter corresponding to the correct answer.\n\nQuestion: {question}\n\nAnswer:"
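elif question_type in ("open_ended_multi_choice", "open_ended"):
    prompt = f"The following is an open-ended question about medicine. Provide a comprehensive answer.\n\nQuestion: {question}\n\nAnswer:"
else:
    # Fail loudly instead of leaving `prompt` undefined for an unknown type
    raise ValueError(f"Unknown question type: {question_type}")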

In [None]:
print(f"prompt: {prompt}")

prompt: The following is an open-ended question about medicine. Provide a comprehensive answer.

Question: Which of the following conditions is a contraindication for the use of Gadavist?

Answer:


In [None]:
system_identity = """
You are an expert medical assistant specializing in answering questions.

**Your communication MUST strictly adhere to the Harmony channels:**
1.  **analysis:** Use this for all internal Chain-of-Thought (CoT), clinical reasoning, and factual processing. This content is for internal use only.
2.  **final:** Use this channel for the final output intended for the user.

**Output Rule is Conditional:**
* **If the input is a Multiple-Choice Question (MCQ):** Your output MUST be a single, valid JSON object containing only the selected answer letter.
    * **Format:** `{"answer": "<LETTER>"}` (e.g., `{"answer": "A"}`)
* **If the input is an Open-Ended Question:** Your output MUST be a detailed, coherent narrative response.

**Instruction:** Generate a complete response. The final output must use the 'final' channel and adhere to the conditional format rule."""
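
The system prompt asks for `{"answer": "<LETTER>"}` on multiple-choice questions, so the final-channel text has to be parsed back into a letter. A minimal sketch of one way to do that, assuming the response is either pure JSON or JSON embedded in surrounding text (`parse_mcq_answer` is a hypothetical helper and not necessarily how CUREBench extracts the choice):

```python
import json
import re


def parse_mcq_answer(final_text):
    """Return the answer letter from a final-channel response, or None."""
    try:
        value = json.loads(final_text).get("answer")
        if isinstance(value, str) and len(value.strip()) == 1:
            return value.strip().upper()
    except (json.JSONDecodeError, AttributeError):
        pass  # not pure JSON, or JSON that is not an object
    # Fallback: look for an embedded {"answer": "X"} fragment
    match = re.search(r'\{\s*"answer"\s*:\s*"([A-Za-z])"\s*\}', final_text)
    return match.group(1).upper() if match else None


print(parse_mcq_answer('{"answer": "A"}'))
print(parse_mcq_answer("Renal impairment is the main concern."))
```

Returning `None` for narrative text lets the caller decide whether to fall back to the `meta_question` conversion step used for `open_ended_multi_choice` items.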

In [None]:
# model = kit.model
model = GPTOSS20BModel("openai/gpt-oss-20b", reasoning_lvl="low", quantization="auto",
                       system_identity=system_identity)
model.load()

In [None]:
# Pass this list to the stop_strings argument of model.inference
stop_sequences = [
    # 1. Stops the loop of internal reasoning tags
    "assistantfinal",

    # 2. Stops the common repetitive noise (often necessary)
    "analysisdone."
]

In [None]:
response, reasoning_trace = model.inference(prompt, stop_strings=stop_sequences)
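
How `GPTOSS20BModel.inference` applies `stop_strings` internally is not shown in this notebook. As an illustration of the general idea only, the sketch below cuts already-decoded text at the earliest stop string; `truncate_at_stop_strings` and the sample text are hypothetical.

```python
def truncate_at_stop_strings(text, stop_strings):
    """Cut text at the earliest occurrence of any stop string (case-sensitive)."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


stop_sequences = ["assistantfinal", "analysisdone."]
sample = "Renal impairment is the key concern.assistantfinalanalysisdone."
truncated = truncate_at_stop_strings(sample, stop_sequences)
print(truncated)
```

Note that this matching is case-sensitive, so a stop string of `assistantfinal` would not catch the capitalized `Assistantfinal` seen in the evaluation output earlier in the notebook.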

In [None]:
response

'**Gadavist (gadobutrol)** is a gadolinium‑based contrast agent commonly used for magnetic resonance imaging (MRI) of the brain and spine. The safety of Gadavist, like all gadolinium‑based contrast agents, is contingent upon adequate renal clearance. **Severe or advanced renal impairment is the principal contraindication** because it predisposes patients to the rare but serious condition known as **Nephrogenic Systemic Fibrosis (NSF)** and increases the risk of gadolinium deposition.\n\nKey points:\n\n| Condition | Why it’s a contraindication for Gadavist | Typical guidelines |\n|-----------|------------------------------------------|--------------------|\n| **Severe chronic kidney disease (CKD Stage 4–5)** – eGFR < 30\u202fmL/min/1.73\u202fm² or requiring dialysis | Reduced elimination of gadobutrol → higher risk of NSF and gadolinium‑related toxicity | Avoid unless essential; if needed, use the lowest effective dose and consider alternative imaging modalities |\n| **Acute kidney inju

In [None]:
reasoning_trace

[{'role': <Role.ASSISTANT: 'assistant'>,
  'name': None,
  'content': [{'type': 'text',
    'text': 'Answer is open-ended? They phrased "Which of the following conditions is a contraindication for the use of Gadavist?" But no options given. Likely expecting mention kidney disease (renal impairment). So provide narrative.'}],
  'channel': 'analysis'},
 {'role': <Role.ASSISTANT: 'assistant'>,
  'name': None,
  'content': [{'type': 'text',
    'text': '**Gadavist (gadobutrol)** is a gadolinium‑based contrast agent commonly used for magnetic resonance imaging (MRI) of the brain and spine. The safety of Gadavist, like all gadolinium‑based contrast agents, is contingent upon adequate renal clearance. **Severe or advanced renal impairment is the principal contraindication** because it predisposes patients to the rare but serious condition known as **Nephrogenic Systemic Fibrosis (NSF)** and increases the risk of gadolinium deposition.\n\nKey points:\n\n| Condition | Why it’s a contraindicatio

In [None]:
print(model.system_identity)


You are an expert medical assistant specializing in answering questions.

**Your communication MUST strictly adhere to the Harmony channels:**
1.  **analysis:** Use this for all internal Chain-of-Thought (CoT), clinical reasoning, and factual processing. This content is for internal use only.
2.  **final:** Use this channel for the final output intended for the user.

**Output Rule is Conditional:**
* **If the input is a Multiple-Choice Question (MCQ):** Your output MUST be a single, valid JSON object containing only the selected answer letter.
    * **Format:** `{"answer": "<LETTER>"}` (e.g., `{"answer": "A"}`)
* **If the input is an Open-Ended Question:** Your output MUST be a detailed, coherent narrative response.

**Instruction:** Generate a complete response. The final output must use the 'final' channel and adhere to the conditional format rule.


### write submission to file

In [None]:
# Generate submission with metadata from config/args
print("Generating submission with metadata...")
submission_path = kit.save_submission_with_metadata(
    results=[results],
    filename="submission.csv",
    config_path=config_path,
    args=""
)

print("\n✅ Evaluation completed successfully!")
print(f"📊 Accuracy: {results.accuracy:.2%} ({results.correct_predictions}/{results.total_examples})")
print(f"📄 Submission saved to: {submission_path}")

# # Show metadata summary if verbose
# final_metadata = kit.get_metadata(getattr(args, 'config', None), args)
# print("\n📋 Final metadata:")
# for key, value in final_metadata.items():
#     print(f"  {key}: {value}")


### stopping criteria

In [None]:
from transformers import StoppingCriteria, StoppingCriteriaList

In [None]:
stopping_criteria = StoppingCriteriaList()

In [None]:
StoppingCriteria

In [None]:
stopping_criteria

[]

In [None]:
kit.model.enc

<openai_harmony.HarmonyEncoding at 0x78ef742ad280>