In [1]:
import json


In [19]:
JUDGE_PROMPT_TEMPLATE = """
You are an expert in biomedical education and assessment alignment analysis. Your task is to evaluate a specific Biomedical Visual Question item and assign it a Depth of Knowledge (DOK) level (1–4) based on Norman Webb’s framework.

### The DOK Framework for Biomedical VQA

**Level 1: Recall and Reproduction**
* **Core:** Rote memorization or simple visual recognition.
* **Criteria:** The answer is "seen" directly in the image or "retrieved" directly from memory without logical derivation.
* **Biomedical Examples:** Naming a bone, identifying a stain type (H&E), recognizing a cell structure.

**Level 2: Skills and Concepts**
* **Core:** Application of skills, observation, and organizing data.
* **Criteria:** Requires mental processing to observe features, describe patterns, or classify based on rules. More than just looking; it involves "processing" the image.
* **Biomedical Examples:** Describing tumor margins (smooth vs. irregular), counting cells, comparing tissue A vs. tissue B, reading a simple chart.

**Level 3: Strategic Thinking (Target for Diagnostic Reasoning)**
* **Core:** Reasoning, planning, and using evidence to explain "Why" or "How".
* **Criteria:** This is the standard level for **clinical diagnosis**. It involves drawing conclusions from observations, justifying a specific diagnosis over others (rule-in/rule-out), or explaining a physiological mechanism.
* **Note:** Multiple steps of reasoning do NOT automatically make it Level 4. If the logic is linear (Observation -> Evidence -> Diagnosis), it is Level 3.
* **Biomedical Examples:** Establishing a diagnosis based on IHC markers (e.g., Desmin+/S100-), inferring disease stage from pathology, explaining the cause of an anomaly.

**Level 4: Extended Thinking (Rare for Single VQA Items)**
* **Core:** Design, Synthesize (from conflict), Create.
* **Criteria:** Requires **generating** a new approach or solving a problem with **conflicting** multi-source data.
* **The "Design" Trap:** If the text *describes* an experiment and asks for a conclusion, that is L3. L4 requires the *student* to design the experiment from scratch.
* **The "Synthesis" Trap:** If the text and image agree with each other, there is no "synthesis of alternative perspectives." That is simply "corroboration" (Level 3).

### Input Data
**Question:** {question}
**Answer:** {answer}

### Instructions
1. Analyze the cognitive demand required to link the visual evidence to the correct answer.
2. Determine which DOK level best fits the task.
  - You must explain in your reasoning: Why is a lower level inappropriate?
3. Output the result in the following JSON format.

### Important Notes
It is easy to overestimate DOK levels. Please adhere to these guidelines:
- A difficult question (e.g., memorizing a complex diagram) can still be **DOK 1**.
- A multi-step calculation using a known formula is **DOK 1** or **DOK 2**, never DOK 3.
- **DOK 3** requires justification, distinct decision-making, or non-routine problem solving where there is no single pre-taught path to the answer.
- DOK 4 rarely appears in a single exam question. Do not consider it here.
- First assume the lowest possible DOK level, then give a higher level only if you have a strong justification.

### Expected Output Format (JSON Only)
{{
  "reasoning": "<concise_explanation>",
  "dok_level": <integer_1_to_4>,
  "category_name": "<string_level_name>",
  "confidence_score": <float_0.0_to_1.0>
}}
"""
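
The doubled braces in the output-format section matter because the template is filled with `str.format` later on: `{{` and `}}` render as literal braces, while `{question}` and `{answer}` are substituted. A minimal sketch of that behavior, using a shortened stand-in (`MINI_TEMPLATE` and the sample question are hypothetical, not the prompt above):

```python
# MINI_TEMPLATE is a shortened, hypothetical stand-in for the judge prompt.
MINI_TEMPLATE = """**Question:** {question}
**Answer:** {answer}
{{
  "dok_level": <integer_1_to_4>
}}
"""

# Only the template is parsed for placeholders; braces inside the
# substituted values are inserted verbatim.
filled = MINI_TEMPLATE.format(question="Which stain is shown? {sic}", answer="H&E")
print(filled)
```

Forgetting to double a brace in the template typically surfaces as a `KeyError` or `ValueError` at `format` time rather than as a malformed prompt.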

In [3]:
# JUDGE_PROMPT_TEMPLATE = """
# You are an expert Educational Assessment Specialist with deep expertise in Norman L. Webb's Depth of Knowledge (DOK) theory. Your task is to evaluate a specific test question based on images and determine its DOK level.
# 
# **CRITICAL INSTRUCTION:**
# AI models tend to **overestimate** DOK levels. You must rigorously distinguish between **Difficulty** (how hard it is) and **Complexity** (how much deep thinking is required).
# - A difficult question (e.g., memorizing a complex diagram) can still be **DOK 1**.
# - A multi-step calculation using a known formula is **DOK 1** or **DOK 2**, never DOK 3.
# - **DOK 3** requires justification, distinct decision-making, or non-routine problem solving where there is no single pre-taught path to the answer.
# - DOK 4 rarely appears in a single exam question. Do not consider it here.
# 
# # Analysis Framework (Webb's DOK for Science)
# **Level 1: Recall and Reproduction**
# - **Criteria:** Recalling facts, terms, or performing a simple, one-step procedure. The answer is automatic (rote response) and does not need to be "figured out".
# - **Image Context:** Identifying a part, reading a specific number from a simple instrument, or recognizing a standard representation.
# - **Keywords:** Identify, recall, recognize, measure, calculate (simple formula).
# 
# **Level 2: Skills and Concepts**
# - **Criteria:** Requires mental processing beyond habit. Involves decision-making, comparing, organizing, or interpreting simple data.
# - **Image Context:** Interpreting information from a simple graph, classifying objects in the image, or explaining a relationship shown.
# - **Keywords:** Classify, organize, estimate, compare data, describe/explain (simple relationships).
# 
# **Level 3: Strategic Thinking**
# - **Criteria:** Requires reasoning, planning, using evidence, and explaining the thought process. The task implies complex or abstract cognitive demands.
# - **Image Context:** Drawing conclusions from observations, interpreting complex graphs (requiring aggregation of features), or citing evidence from the image to build an argument.
# - **Keywords:** Draw conclusions, cite evidence, justify, explain thinking, develop a logical argument.
# 
# ### OUTPUT FORMAT
# You must respond strictly in JSON format with the following structure:
# 
# ```json
# {{
#   "dok_level": 1 | 2 | 3,
#   "category": "String (e.g., Recall, Skill/Concept, Strategic Thinking)",
#   "analysis": {{
#     "verb_analysis": "Analyze the cognitive demand of the main verb used (e.g., Identify vs. Analyze).",
#     "connections_required": "Estimate the number of conceptual connections the student must make (Low/Medium/High).",
#     "reasoning_level": "Describe the reasoning required (e.g., Simple observation vs. Abstract inference)."
#   }},
#   "justification": "A brief explanation citing why it fits this specific level based on the criteria.",
#   "improvement_suggestion": "A specific suggestion on how to rewrite this question to increase its DOK level."
# }}
# 
# Question:
# {question}
# 
# Your Answer:
# """

In [4]:
def pack_content(prompt, images):
    image_list = images or [] 
    content = [
        {"type": "image_url", "image_url": img_url}
        for img_url in image_list
    ] + [
        {"type": "text", "text": prompt}
    ]
    return content

def openai_pack_content(prompt, images):
    image_list = images or []
    content = [
        {"type": "image_url", "image_url": {
            "url": img_url,
            "detail": "auto"
        }}
        for img_url in image_list
    ] + [
        {"type": "text", "text": prompt}
    ]
    return content
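
A usage sketch for the OpenAI-style packer, assuming images are passed as base64 data URLs. The helper `to_data_url` is hypothetical (it does not appear in this notebook), and whether a given endpoint accepts data URLs is an assumption to verify against that endpoint.

```python
import base64


def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Hypothetical helper: wrap raw image bytes in a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def openai_pack_content(prompt, images):
    # Same layout as the cell above: image parts first, then the text prompt.
    return [
        {"type": "image_url", "image_url": {"url": url, "detail": "auto"}}
        for url in (images or [])
    ] + [{"type": "text", "text": prompt}]


fake_png = b"\x89PNG\r\n\x1a\n"  # only the PNG signature, not a real image
content = openai_pack_content("Describe the image.", [to_data_url(fake_png)])
print([part["type"] for part in content])
```

Passing `None` or an empty list for `images` yields a text-only message, which is how the text-only judging run further down can reuse the same message shape.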

In [None]:
import asyncio
import base64
import json
import logging
import os
import pickle
import time
from asyncio import as_completed
from datetime import datetime, timezone, timedelta
from io import BytesIO

import numpy as np
from openai import (APIConnectionError, AsyncOpenAI, InternalServerError,
                    OpenAI)
from PIL import Image
from tqdm import tqdm

dashscope_api_key = os.getenv("DASHSCOPE_API_KEY") or "1"
vl_model = "qwen3-vl-plus"
text_model = "qwen-plus"

local_vl_api_key = "xxx"
local_vl_model = "qwen3_vl_235b_instruct"
local_text_api_key = "xxx"
local_text_model = "qwen3_235b_instruct"

dashscope_client = AsyncOpenAI(api_key=dashscope_api_key,
                               base_url="xxx",
                               timeout=120.0)

local_vl_client = AsyncOpenAI(api_key=local_vl_api_key,
                              base_url="xxx",
                              timeout=120.0)

local_text_client = AsyncOpenAI(api_key=local_text_api_key,
                                base_url="xxx",
                                timeout=120.0)

dashscope_sync_client = OpenAI(api_key=dashscope_api_key,
                               base_url="xxx",
                               timeout=120.0)

local_sync_vl_client = OpenAI(api_key=local_vl_api_key,
                              base_url="xxx",
                              timeout=120.0)

local_sync_text_client = OpenAI(api_key=local_text_api_key,
                                base_url="xxx",
                                timeout=120.0)

google_api_key = "xxx"
google_client_sync = OpenAI(api_key=google_api_key,
                            base_url="xxx",
                            timeout=120.0)
google_client_async = AsyncOpenAI(api_key=google_api_key,
                                  base_url="xxx",
                                  timeout=120.0)
google_model = "gemini-3-flash-preview"


async def get_response_async(prev_messages,
                             next_content,
                             model,
                             client,
                             tools=None,
                             max_retries=3,
                             MAX_TOKENS_LIMIT=32768):

    # next_content may be a plain string or a list of content parts (see
    # pack_content); either form is passed to the API unchanged.
    messages = prev_messages + [{"role": "user", "content": next_content}]

    for attempt in range(max_retries):
        try:
            reasoning_content = ""
            answer_content = ""
            tool_info = []
            is_answering = False

            if tools is not None:
                response = await client.chat.completions.create(
                    model=model,
                    messages=messages,
                    tools=tools,
                    parallel_tool_calls=True,
                    stream=True,
                    max_tokens=MAX_TOKENS_LIMIT)
            else:
                response = await client.chat.completions.create(
                    model=model,
                    messages=messages,
                    stream=True,
                    max_tokens=MAX_TOKENS_LIMIT)

            async for chunk in response:
                if chunk.choices:
                    delta = chunk.choices[0].delta
                    if getattr(delta, 'reasoning_content', None) is not None:
                        reasoning_content += delta.reasoning_content
                    else:
                        if not is_answering:
                            is_answering = True
                        if delta.content is not None:
                            answer_content += delta.content
                        if delta.tool_calls is not None:
                            for tool_call in delta.tool_calls:
                                index = tool_call.index
                                while len(tool_info) <= index:
                                    tool_info.append({})
                                if tool_call.id:
                                    tool_info[
                                        index]['id'] = tool_info[index].get(
                                            'id', '') + tool_call.id
                                if tool_call.function and tool_call.function.name:
                                    tool_info[index][
                                        'name'] = tool_info[index].get(
                                            'name',
                                            '') + tool_call.function.name
                                if tool_call.function and tool_call.function.arguments:
                                    tool_info[index][
                                        'arguments'] = tool_info[index].get(
                                            'arguments',
                                            '') + tool_call.function.arguments
                                if tool_call.type:
                                    tool_info[index]['type'] = tool_call.type

            if not reasoning_content:
                if answer_content.startswith("<think>"):
                    end_think_idx = answer_content.find("</think>")
                    if end_think_idx != -1:
                        reasoning_content = answer_content[len("<think>"
                                                               ):end_think_idx]
                        answer_content = answer_content[end_think_idx +
                                                        len("</think>"):]

            new_message = {
                "role": "assistant",
                "content": answer_content,
            }
            if len(tool_info) > 0:
                tool_calls = [{
                    "id": tool_call["id"],
                    "function": {
                        "name": tool_call["name"],
                        "arguments": tool_call["arguments"]
                    },
                    "type": tool_call["type"],
                    "index": i
                } for i, tool_call in enumerate(tool_info)]
                new_message["tool_calls"] = tool_calls
            messages.append(new_message)

            return {
                "content": answer_content,
                "reasoning_content": reasoning_content,
                "usage": None,
                "prev_messages": messages,
                "tool_info": tool_info
            }

        except (APIConnectionError, InternalServerError) as e:
            print(
                f"--- [Retryable Error] (Attempt {attempt + 1}/{max_retries}): {e}"
            )
            if attempt == max_retries - 1: raise e
            await asyncio.sleep(5)

        except Exception as e:
            error_str = str(e).lower()
            if "incomplete chunked read" in error_str or "peer closed connection" in error_str or "connection closed" in error_str:
                print(
                    f"--- [Network/Server Cutoff] (Attempt {attempt + 1}/{max_retries}): {e}"
                )
                if attempt == max_retries - 1:
                    print("--- Max retries reached for cutoff error.")
                    raise e
                print(
                    "--- Server likely overloaded. Sleeping for 10 seconds...")
                await asyncio.sleep(10)
            else:
                print(f"--- [Fatal Error]: {e}")
                raise e
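
The tool-call handling in the streaming loop can be hard to follow inline: each streamed fragment carries an `index`, and the `id`, `name`, and `arguments` strings arrive in pieces that must be concatenated per index. A standalone sketch of that accumulation, using fake fragments built with `SimpleNamespace` (the `lookup` tool and its arguments are made up for illustration):

```python
from types import SimpleNamespace as NS


def merge_tool_call_deltas(deltas):
    """Concatenate streamed tool-call fragments, grouped by their index."""
    tool_info = []
    for tc in deltas:
        # Grow the list until there is a slot for this index.
        while len(tool_info) <= tc.index:
            tool_info.append({})
        slot = tool_info[tc.index]
        if tc.id:
            slot["id"] = slot.get("id", "") + tc.id
        if tc.function and tc.function.name:
            slot["name"] = slot.get("name", "") + tc.function.name
        if tc.function and tc.function.arguments:
            slot["arguments"] = slot.get("arguments", "") + tc.function.arguments
        if tc.type:
            slot["type"] = tc.type
    return tool_info


# Fake fragments standing in for chunk.choices[0].delta.tool_calls entries.
fragments = [
    NS(index=0, id="call_1", type="function",
       function=NS(name="lookup", arguments='{"q": ')),
    NS(index=0, id=None, type=None,
       function=NS(name=None, arguments='"histone"}')),
]
merged = merge_tool_call_deltas(fragments)
print(merged)
```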


def get_response(prev_messages,
                 next_content,
                 model,
                 client,
                 tools=None,
                 max_retries=3,
                 MAX_TOKENS_LIMIT=32768):

    # next_content may be a plain string or a list of content parts (see
    # pack_content); either form is passed to the API unchanged.
    messages = prev_messages + [{"role": "user", "content": next_content}]

    for attempt in range(max_retries):
        try:
            reasoning_content = ""
            answer_content = ""
            tool_info = []
            is_answering = False

            if tools is not None:
                response = client.chat.completions.create(
                    model=model,
                    messages=messages,
                    tools=tools,
                    parallel_tool_calls=True,
                    stream=True,
                    max_tokens=MAX_TOKENS_LIMIT)
            else:
                response = client.chat.completions.create(
                    model=model,
                    messages=messages,
                    stream=True,
                    max_tokens=MAX_TOKENS_LIMIT)

            for chunk in response:
                if chunk.choices:
                    delta = chunk.choices[0].delta
                    if getattr(delta, 'reasoning_content', None) is not None:
                        reasoning_content += delta.reasoning_content
                    else:
                        if not is_answering:
                            is_answering = True
                        if delta.content is not None:
                            answer_content += delta.content
                        if delta.tool_calls is not None:
                            for tool_call in delta.tool_calls:
                                index = tool_call.index
                                while len(tool_info) <= index:
                                    tool_info.append({})
                                if tool_call.id:
                                    tool_info[
                                        index]['id'] = tool_info[index].get(
                                            'id', '') + tool_call.id
                                if tool_call.function and tool_call.function.name:
                                    tool_info[index][
                                        'name'] = tool_info[index].get(
                                            'name',
                                            '') + tool_call.function.name
                                if tool_call.function and tool_call.function.arguments:
                                    tool_info[index][
                                        'arguments'] = tool_info[index].get(
                                            'arguments',
                                            '') + tool_call.function.arguments
                                if tool_call.type:
                                    tool_info[index]['type'] = tool_call.type

            if not reasoning_content:
                if answer_content.startswith("<think>"):
                    end_think_idx = answer_content.find("</think>")
                    if end_think_idx != -1:
                        reasoning_content = answer_content[len("<think>"
                                                               ):end_think_idx]
                        answer_content = answer_content[end_think_idx +
                                                        len("</think>"):]

            new_message = {
                "role": "assistant",
                "content": answer_content,
            }
            if len(tool_info) > 0:
                tool_calls = [{
                    "id": tool_call["id"],
                    "function": {
                        "name": tool_call["name"],
                        "arguments": tool_call["arguments"]
                    },
                    "type": tool_call["type"],
                    "index": i
                } for i, tool_call in enumerate(tool_info)]
                new_message["tool_calls"] = tool_calls
            messages.append(new_message)
            return {
                "content": answer_content,
                "reasoning_content": reasoning_content,
                "usage": None,
                "prev_messages": messages,
                "tool_info": tool_info
            }
        except (APIConnectionError, InternalServerError) as e:
            print(
                f"--- [Retryable Error] (Attempt {attempt + 1}/{max_retries}): {e}"
            )
            if attempt == max_retries - 1: raise e
            time.sleep(5)
        except Exception as e:
            error_str = str(e).lower()
            if "incomplete chunked read" in error_str or "peer closed connection" in error_str or "connection closed" in error_str:
                print(
                    f"--- [Network/Server Cutoff] (Attempt {attempt + 1}/{max_retries}): {e}"
                )
                if attempt == max_retries - 1:
                    print("--- Max retries reached for cutoff error.")
                    raise e
                print(
                    "--- Server likely overloaded. Sleeping for 10 seconds...")
                time.sleep(10)
            else:
                print(f"--- [Fatal Error]: {e}")
                raise e
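
Both functions fall back to peeling a leading `<think>...</think>` block off the answer when the server does not stream a separate `reasoning_content` field. A minimal standalone sketch of that fallback (the function name `split_think` is hypothetical):

```python
def split_think(text):
    """Return (reasoning, answer), peeling off a leading <think>...</think> block."""
    open_tag, close_tag = "<think>", "</think>"
    if text.startswith(open_tag):
        end = text.find(close_tag)
        if end != -1:
            return text[len(open_tag):end], text[end + len(close_tag):]
    # No leading tag, or the block never closes: leave the text untouched.
    return "", text


reasoning, answer = split_think('<think>recall only</think>{"dok_level": 1}')
print(repr(reasoning), repr(answer))
```

An unterminated `<think>` block is left inside the answer, which matches the behavior of the cells above and means the JSON extraction step has to cope with it.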

In [6]:
def process_output(response_content):
    try:
        if "```json" in response_content:
            start_idx = response_content.index("```json") + len("```json")
            res_content = response_content[start_idx:].lstrip()
            # print(f"res_content: {res_content}")
            if "```" in res_content:
                end_idx = res_content.index("```")
                json_str = res_content[:end_idx].strip()
            else:
                if "}" not in res_content:
                    res_content += "}"
                end_idx = res_content.rindex("}")
                json_str = res_content[:end_idx + 1].strip()
            # print(f"Extracted JSON string: {json_str}")
            return json.loads(json_str)
        elif "{" in response_content:
            start_idx = response_content.index("{")
            if "}" not in response_content:
                response_content += "}"
            end_idx = response_content.rindex("}")
            json_str = response_content[start_idx:end_idx + 1].strip()
            # print(f"Extracted JSON string without code block: {json_str}")
            return json.loads(json_str)
        else:
            return json.loads(response_content)
    except json.JSONDecodeError as e:
        print(f"JSON Decode Error: {e}")
        with open("debug_response.txt", "a") as f:
            f.write("-----------------------\n")
            f.write(f"Failed to decode JSON from response:\n{response_content}\n")
        return None
    except Exception as e:
        print(f"Unexpected Error: {e}")
        with open("debug_response.txt", "a") as f:
            f.write("-----------------------\n")
            f.write(f"Unexpected error processing response:\n{response_content}\n")
        return None
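
To make the extraction rules concrete, here is a simplified sketch of the same idea: prefer a fenced JSON block when one is present, otherwise take the span from the first `{` to the last `}`. It is not a drop-in replacement for `process_output` (it skips the brace-repair step and the debug log), and the name `extract_json` is hypothetical.

```python
import json

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def extract_json(text):
    """Simplified sketch: parse the first JSON object found in a model reply."""
    marker = FENCE + "json"
    if marker in text:
        text = text[text.index(marker) + len(marker):]
        if FENCE in text:
            text = text[:text.index(FENCE)]
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None


fenced_reply = "Here you go:\n" + FENCE + 'json\n{"dok_level": 2}\n' + FENCE
bare_reply = 'Result: {"dok_level": 1, "confidence_score": 0.9} done'
print(extract_json(fenced_reply), extract_json(bare_reply), extract_json("no json here"))
```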

In [7]:
from datasets import load_dataset
import os

get_question_fn = {
    "ours": lambda entry: entry["basic_qa"]["question"],
    "microvqa": lambda entry: entry["question"] + "\nOptions:\n" + "\n".join(
        [f"{chr(65+i)}. {opt}" for i, opt in enumerate(entry['choices'])]
    )
}

get_answer_fn = {
    "ours": lambda entry: entry["basic_qa"]["answer"],
    "microvqa": lambda entry: entry['correct_answer']
}

def get_question(dataset_name, entry):
    if dataset_name in get_question_fn:
        return get_question_fn[dataset_name](entry)
    else:
        return get_question_fn["ours"](entry)
    
def get_answer(dataset_name, entry):
    if dataset_name in get_answer_fn:
        return get_answer_fn[dataset_name](entry)
    else:
        return get_answer_fn["ours"](entry)

def load_data_for_ours(file_prefix):
    with open(file_prefix + ".json", "r") as f:
        dataset_data = json.load(f)
    return dataset_data

def default_load_data_for_dataset(dataset_name, split="test", sample=100):
    local_dataset_root = "./other_datasets"
    dataset_path = os.path.join(local_dataset_root, dataset_name.split("/")[-1])
    dataset_data = load_dataset(dataset_path, split=split)
    if sample is not None and sample > 0:
        # Cap at the dataset size so splits smaller than `sample` do not fail.
        dataset_data = dataset_data.select(range(min(sample, len(dataset_data))))
    return dataset_data

load_data_fn = {
    "ours": load_data_for_ours,
    "microvqa": lambda dataset_name: default_load_data_for_dataset(dataset_name, split="test"),
}

def load_data(dataset_name):
    if dataset_name in load_data_fn:
        return load_data_fn[dataset_name](dataset_name)
    else:
        return load_data_fn["ours"](dataset_name)
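
The three dictionaries above act as a small registry: each dataset name maps to a question getter, an answer getter, and a loader, and unknown names fall back to the "ours" layout. A standalone sketch of the getter half, including the `chr(65 + i)` option lettering (the `toy_mcq` name and the sample entry are made up):

```python
def format_multiple_choice(entry):
    # Letters A, B, C, ... come from chr(65 + i), as in the microvqa getter.
    options = "\n".join(
        f"{chr(65 + i)}. {opt}" for i, opt in enumerate(entry["choices"])
    )
    return entry["question"] + "\nOptions:\n" + options


question_getters = {
    "ours": lambda entry: entry["basic_qa"]["question"],
    "toy_mcq": format_multiple_choice,  # hypothetical dataset name
}


def get_question(dataset_name, entry):
    # Unknown names fall back to the "ours" layout.
    getter = question_getters.get(dataset_name, question_getters["ours"])
    return getter(entry)


entry = {"question": "Which organelle is shown?",
         "choices": ["Nucleus", "Mitochondrion"]}
text = get_question("toy_mcq", entry)
print(text)
```

One consequence of the fallback is that a misspelled dataset name is silently treated as a local JSON file prefix, so a typo shows up as a missing-file or `KeyError` on `basic_qa` rather than as an unknown-dataset error.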



In [8]:
# MedXQA
get_question_fn["MedXQA"] = lambda entry: entry["question"]
get_answer_fn["MedXQA"] = lambda entry: entry["answer"]
load_data_fn["MedXQA"] = lambda dataset_name: default_load_data_for_dataset(dataset_name, split="train")

In [9]:
mmmmm = load_data("MedXQA")
print(get_question("MedXQA", mmmmm[0]))
print(get_answer("MedXQA", mmmmm[0]))

As a medical student studying prion diseases, I want to understand: What mutations in the PRNP gene are associated with exceptionally slow progression of prion diseases, and what is their impact on the pathogenesis of these diseases?
Prion diseases are neurodegenerative disorders caused by misfolded prion proteins, resulting in progressive neuronal damage. Mutations in the PRNP gene (which encodes the prion protein, PrP) play a critical role in the pathogenesis of these diseases, including influencing their progression rates.

Mutations associated with exceptionally slow progression of prion diseases include:  
1. Large expansions in the octapeptide repeat region: Normally, the PRNP gene contains five octapeptide repeats. Expansions to over 10 repeats are associated with prion disease, but larger expansions (e.g., more than 12 repeats) have been observed to slow disease progression. This may result from altered protein-protein interactions that delay the amplification of misfolded prio

In [10]:
# path-vqa
get_question_fn["path-vqa"] = lambda entry: entry["question"]
get_answer_fn["path-vqa"] = lambda entry: entry["answer"]
load_data_fn["path-vqa"] = lambda dataset_name: default_load_data_for_dataset(dataset_name, split="test")

In [11]:
mmmmm = load_data("path-vqa")
print(get_question("path-vqa", mmmmm[0]))
print(get_answer("path-vqa", mmmmm[0]))

what are positively charged, thus allowing the compaction of the negatively charged dna?
the histone subunits


In [12]:
mmmmm = load_data("final_data")
print(get_question("final_data", mmmmm[0]))
print(get_answer("final_data", mmmmm[0]))

Klebe et al. [90] investigated the nuclear expression of the TTF-1 SP141 antibody clone in sarcomatoid mesothelioma and compared it with the TTF-1 8G7G3/1 clone to assess specificity in immunohistochemical diagnosis. Immunohistochemical analysis was performed on tissue samples from 19 sarcomatoid mesotheliomas using the TTF-1 SP141 antibody clone. [Image 1] shows the histological features of these tumors, including nuclear morphology and chromogenic signal. In a comparative experiment, immunohistochemical staining using the TTF-1 8G7G3/1 antibody clone was applied to the same or comparable sarcomatoid mesothelioma samples, and no nuclear immunoreactivity was detected. What can be concluded from these findings?
The experimental data demonstrate that the TTF-1 SP141 antibody clone produces nuclear immunoreactivity in 8 out of 19 (42%) cases of sarcomatoid mesothelioma, as evidenced by distinct brown chromogenic labeling localized to the nuclei of tumor cells in [Image 1]. This staining p

In [20]:
from tqdm.asyncio import tqdm_asyncio

def get_judgement_file_name(file_prefix):
    return f"{file_prefix}_dok_results.json"

def get_cache_dir(file_prefix):
    return f"{file_prefix}_dok_cache"
    
def get_cache_file_name(idx):
    return f"cache_dok_{idx}.json"

async def get_single_result(prompt, question, answer, idx, sem, cache_dir=None):
    try:
        if cache_dir is not None:
            cache_file = os.path.join(cache_dir, get_cache_file_name(idx))
            if os.path.exists(cache_file):
                with open(cache_file, "r") as f:
                    cached_result = json.load(f)
                if cached_result is not None:
                    return cached_result
    except Exception as e:
        print(f"Error loading cache for idx {idx}: {e}")

    async with sem:
        max_retries = 10
        while max_retries > 0:
            try:
                response = await get_response_async(
                    prev_messages=[],
                    next_content=prompt,
                    model=local_text_model,
                    client=local_text_client,
                )
                j = process_output(response["content"])
                if j is not None:
                    result = {
                        "question_index": idx,
                        "question": question,
                        "answer": answer,
                        "depth_evaluation": j
                    }
                    if cache_dir is not None:
                        os.makedirs(cache_dir, exist_ok=True)
                        cache_file = os.path.join(cache_dir, get_cache_file_name(idx))
                        with open(cache_file, "w") as f:
                            json.dump(result, f, indent=2, ensure_ascii=False)
                    return result
                # An unparseable reply also uses up a retry; otherwise the loop
                # could keep re-querying the model without bound.
                max_retries -= 1
            except Exception as e:
                print(f"Error processing sample idx {idx}: {e}")
                max_retries -= 1
                await asyncio.sleep(1)
        return {
            "question_index": idx,
            "question": question,
            "answer": answer,
            "depth_evaluation": None
        }

async def run_judging_async(dataset_name, sem, cache_dir=None):
    new_data = load_data(dataset_name)
    # print(new_data)
    if cache_dir is not None:
        cache_dir = os.path.join(cache_dir, get_cache_dir(dataset_name))
    tasks = []
    for idx, entry in enumerate(new_data):
        question = get_question(dataset_name, entry)
        answer = get_answer(dataset_name, entry)
        prompt = JUDGE_PROMPT_TEMPLATE.format(
            question=question,
            answer=answer
        )
            
        tasks.append(get_single_result(prompt, question, answer, idx, sem, cache_dir))
    judge_results = await tqdm_asyncio.gather(*tasks)

    judge_file = get_judgement_file_name(dataset_name)
    with open(judge_file, "w") as f:
        json.dump(judge_results, f, indent=2, ensure_ascii=False)
    return judge_results
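
The concurrency pattern used here is: create one coroutine per sample up front, let a shared `asyncio.Semaphore` cap how many are actually in flight, and collect results with `gather`, which preserves input order. A self-contained sketch with a sleep standing in for the model request (the function name and the numbers are illustrative only):

```python
import asyncio


async def bounded_gather(n_tasks, limit):
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def worker(i):
        nonlocal in_flight, peak
        async with sem:  # at most `limit` workers run this section at once
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for the model request
            in_flight -= 1
            return i

    results = await asyncio.gather(*(worker(i) for i in range(n_tasks)))
    return results, peak


# In a notebook cell, use `await bounded_gather(...)` instead of asyncio.run,
# because the notebook already has a running event loop.
results, peak = asyncio.run(bounded_gather(n_tasks=20, limit=4))
print(results[:5], peak)
```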

In [21]:
# dataset_name = "step7_logic_based_qa_output_processed_qc3_passed_2"
dataset_name = "path-vqa"
sem = asyncio.Semaphore(64)

In [22]:
await run_judging_async(dataset_name, sem, cache_dir="./cache")

100%|██████████| 100/100 [00:20<00:00,  4.91it/s]


[{'question_index': 0,
  'question': 'what are positively charged, thus allowing the compaction of the negatively charged dna?',
  'answer': 'the histone subunits',
  'depth_evaluation': {'reasoning': 'The question asks for the identification of a biological component (histone subunits) that interacts with DNA based on charge properties. The answer relies on recalling a fundamental biochemical principle: histones are positively charged proteins that bind to negatively charged DNA to enable chromatin compaction. This does not require any analysis of visual data, interpretation of patterns, or reasoning about mechanisms beyond rote knowledge. The cognitive demand is limited to retrieval of a fact from memory. No visual processing or inference is needed, even if an image were present. A lower level (DOK 1) is appropriate because the task does not involve organizing information, making inferences, or justifying a conclusion. There is no requirement for strategic thinking or synthesis, ruli

In [23]:
target_dataset_names = [
    "final_data",
    "microvqa",
    "MedXQA",
    "path-vqa"
]
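# One semaphore shared by all datasets: at most 16 requests in flight overall.
sem = asyncio.Semaphore(16)
tasks = []
for dataset_name in target_dataset_names:
    # Creating the coroutine does no I/O, so nothing needs guarding here; the
    # semaphore is acquired inside get_single_result, around each request.
    tasks.append(run_judging_async(dataset_name, sem, cache_dir="./cache"))
await tqdm_asyncio.gather(*tasks)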

  0%|          | 0/4 [00:00<?, ?it/s]
100%|██████████| 100/100 [00:01<00:00, 50.30it/s]
100%|██████████| 3614/3614 [22:22<00:00,  2.69it/s]
 50%|█████     | 2/4 [22:24<26:17, 788.94s/it]
100%|██████████| 100/100 [22:56<00:00, 13.77s/it]
 75%|███████▌  | 3/4 [22:58<07:24, 444.44s/it]


[[{'question_index': 0,
   'question': 'Klebe et al. [90] investigated the nuclear expression of the TTF-1 SP141 antibody clone in sarcomatoid mesothelioma and compared it with the TTF-1 8G7G3/1 clone to assess specificity in immunohistochemical diagnosis. Immunohistochemical analysis was performed on tissue samples from 19 sarcomatoid mesotheliomas using the TTF-1 SP141 antibody clone. [Image 1] shows the histological features of these tumors, including nuclear morphology and chromogenic signal. In a comparative experiment, immunohistochemical staining using the TTF-1 8G7G3/1 antibody clone was applied to the same or comparable sarcomatoid mesothelioma samples, and no nuclear immunoreactivity was detected. What can be concluded from these findings?',
   'answer': 'The experimental data demonstrate that the TTF-1 SP141 antibody clone produces nuclear immunoreactivity in 8 out of 19 (42%) cases of sarcomatoid mesothelioma, as evidenced by distinct brown chromogenic labeling localized to
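
Once the per-dataset result lists exist, a natural next step is to tally the assigned levels. The sketch below assumes the record layout produced by `get_single_result` (`depth_evaluation` is either a dict with a `dok_level` key or `None` after exhausted retries); the function name and the sample records are made up and are not results from this run.

```python
from collections import Counter


def dok_distribution(results):
    """Count DOK levels, tracking failed or incomplete evaluations separately."""
    levels = Counter()
    for item in results:
        evaluation = item.get("depth_evaluation")
        if not isinstance(evaluation, dict):
            levels["failed"] += 1  # judge never returned parseable JSON
            continue
        levels[evaluation.get("dok_level", "missing")] += 1
    return levels


# Made-up records in the same shape as the judge output.
sample_results = [
    {"question_index": 0, "depth_evaluation": {"dok_level": 1}},
    {"question_index": 1, "depth_evaluation": {"dok_level": 3}},
    {"question_index": 2, "depth_evaluation": {"dok_level": 1}},
    {"question_index": 3, "depth_evaluation": None},
]
distribution = dok_distribution(sample_results)
print(dict(distribution))
```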

In [None]:
dt = load_data(dataset_name)

In [None]:
for idx, entry in enumerate(dt):
    question = get_question(dataset_name, entry)
    answer = get_answer(dataset_name, entry)
    print(f"Index: {idx}")
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print("-----")

Index: 0
Question: A cryo-electron tomography (cryo-ET) image showcases primary eukaryotic neurons cultured on a specialized substrate that guides cytoskeletal organization. In the tomographic slice, mitochondria appear in deep blue, and calcium granules are depicted in bright yellow. What potential impact could the accumulation of calcium granules within the mitochondria have on their function?
Options:
A. Regulation of mitochondrial metabolism to meet high ATP demand
B. Enhancement of oxidative phosphorylation efficiency under varying energy demands
C. Facilitation of mitochondrial biogenesis to support increased cell growth
D. Influence on mitochondrial calcium buffering capacity to maintain cellular homeostasis
E. Modification of mitochondrial lipid composition affecting membrane fluidity
Answer: Regulation of mitochondrial metabolism to meet high ATP demand
-----
Index: 1
Question: Cryo-electron tomography of primary Drosophila melanogaster neurons reveals mitochondria in close as