## Resources

**Paper BRIDGE**: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text (https://arxiv.org/abs/2504.19467)

**Github repo**: BRIDGE (https://github.com/YLab-Open/BRIDGE)

**Dataset**: BRIDGE-Open (https://huggingface.co/datasets/YLab-Open/BRIDGE-Open)

**Leaderboard**: BRIDGE-Medical-Leaderboard (https://huggingface.co/spaces/YLab-Open/BRIDGE-Medical-Leaderboard)

## Download and organize files

(Optional) A. Download the dataset from Hugging Face with a Python script

See `dataset_download.py` for details.

(Requirement: `pip install huggingface_hub`)
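As a rough illustration of what such a download script might do, the sketch below fetches the two archives with `huggingface_hub.hf_hub_download` and unpacks them into `dataset_raw`. It is not the contents of `dataset_download.py`; the helper names `extract_archive` and `download_and_extract` are hypothetical, and the sketch assumes both zip files sit at the root of the dataset repository and unpack directly into the layout this notebook expects.

```python
import zipfile
from pathlib import Path

REPO_ID = "YLab-Open/BRIDGE-Open"           # dataset repo named in this notebook
ARCHIVES = ("Dataset.zip", "Examples.zip")  # archive names from the manual steps


def extract_archive(zip_path, target_dir="dataset_raw"):
    """Unpack one zip archive into target_dir (hypothetical helper)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target)
    return target


def download_and_extract(target_dir="dataset_raw"):
    """Download both archives and unpack them (hypothetical helper).

    Assumption: the archives are stored at the root of the dataset repo.
    """
    # Imported here so the extraction helper works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    for name in ARCHIVES:
        local_path = hf_hub_download(
            repo_id=REPO_ID, filename=name, repo_type="dataset"
        )
        extract_archive(local_path, target_dir)


# Uncomment to run the download (requires network access):
# download_and_extract("dataset_raw")
```

If the archives unpack into an extra top-level folder, the files may need to be moved so that they match the directory structure this notebook expects.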

----------------------

(Optional) B. Manually download the dataset from Hugging Face

1. Web: https://huggingface.co/datasets/YLab-Open/BRIDGE-Open/tree/main

2. Manually download the **"Dataset.zip"** and **"Examples.zip"** files

3. Extract them to the **"dataset_raw"** directory

--------------

After either option, the directory structure should look like this:
```
dataset_raw/
├── task_1.SFT.json
├── task_2.SFT.json
└── example/
    ├── task_1.example.json
    └── task_2.example.json
```
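Before loading, it can help to confirm that every `*.SFT.json` file has a matching `*.example.json` file. The sketch below is one way to check this; `find_unpaired_tasks` is a hypothetical helper, and it assumes the file naming shown in the tree above.

```python
from pathlib import Path


def find_unpaired_tasks(root="dataset_raw"):
    """Return (tasks missing an example file, tasks missing a data file)."""
    root = Path(root)
    # Task name is everything before ".SFT" or ".example" in the file name.
    data_tasks = {p.name.split(".SFT", 1)[0] for p in root.glob("*.SFT.json")}
    example_tasks = {
        p.name.split(".example", 1)[0]
        for p in (root / "example").glob("*.example.json")
    }
    missing_example = sorted(data_tasks - example_tasks)
    missing_data = sorted(example_tasks - data_tasks)
    return missing_example, missing_data


missing_example, missing_data = find_unpaired_tasks("dataset_raw")
print("Tasks without an example file:", missing_example)
print("Tasks without a data file:", missing_data)
```

Both lists should be empty when the download and extraction succeeded.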

## Import Libraries

In [None]:
import os
import json
from typing import Any
from pathlib import Path

## Load data

In [None]:
def load_tasks(root: str | Path = "dataset_raw") -> dict[str, dict[str, Any]]:
    """Load the test data and examples of every task found under `root`.

    Returns a dict keyed by task name. Each value holds "example", "test",
    and "corpus" (the `input` text of all example and test records).
    """
    root = Path(root)

    def read_json(path: Path) -> Any:
        with path.open("r", encoding="utf-8") as f:
            return json.load(f)

    # Map task name -> file path, e.g. "task_1.SFT.json" -> "task_1"
    data_files    = {p.name.split(".SFT", 1)[0]: p for p in root.glob("*.SFT.json")}
    example_files = {p.name.split(".example", 1)[0]: p for p in (root / "example").glob("*.example.json")}

    assert set(data_files.keys()) == set(example_files.keys()), \
        "Data and example files must match in task names."

    dict_task_data: dict[str, dict[str, Any]] = {}

    for task in data_files:
        examples = read_json(example_files[task])
        tests = read_json(data_files[task])
        dict_task_data[task] = {
            "example": examples,
            "test":    tests,
            # Input text of every record: examples first, then test cases
            "corpus":  [record["input"] for record in examples + tests],
        }

    return dict_task_data

In [None]:
# Load all tasks from the dataset folder
path_dir_data = "dataset_raw"
dict_task_data = load_tasks(root=path_dir_data)
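To get an overview of everything that was loaded, the record counts of each task can be listed. This is a minimal sketch: `summarize_tasks` is a hypothetical helper, and the toy dictionary below is placeholder data that only mimics the structure returned by `load_tasks`.

```python
from typing import Any


def summarize_tasks(tasks: dict[str, dict[str, Any]]) -> list[tuple[str, int, int]]:
    """Return (task name, number of examples, number of test cases), sorted by name."""
    return [
        (name, len(data["example"]), len(data["test"]))
        for name, data in sorted(tasks.items())
    ]


# Toy placeholder data with the same structure as the output of load_tasks.
toy_tasks = {
    "toy_task_b": {"example": [{"input": "x"}], "test": [{"input": "y"}, {"input": "z"}]},
    "toy_task_a": {"example": [], "test": [{"input": "w"}]},
}

for name, n_example, n_test in summarize_tasks(toy_tasks):
    print(f"{name}: {n_example} examples, {n_test} test cases")
```

With the real data, call `summarize_tasks(dict_task_data)` instead of passing the toy dictionary.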

## Explore data for a specific task

In [None]:
# Example usage: get one specific task
task_name = "BrainMRI-AIS"
task_data = dict_task_data[task_name]
print(f"Task: {task_name}")
print(f"Number of 'examples': {len(task_data['example'])}")
print(f"Number of 'test' cases: {len(task_data['test'])}")
print(f"Number of 'corpus' entries: {len(task_data['corpus'])}")

In [None]:
# Review one corpus entry (the raw input text)
idx = 1
corpus_one = task_data["corpus"][idx]
print(f"Corpus:\n{corpus_one}")

In [None]:
# Print one complete record from the test set
idx = 1
test_one = task_data["test"][idx]
for key, value in test_one.items():
    print(f"{key}: {value.strip() if isinstance(value, str) else value}")
    print()

In [None]:
# Print one complete record from the examples
idx = 1
example_one = task_data["example"][idx]
for key, value in example_one.items():
    print(f"{key}: {value.strip() if isinstance(value, str) else value}")
    print()
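Printing single records shows their fields one at a time. To see which fields occur across a whole split, the keys can be counted over all records. This is a sketch with a hypothetical helper `count_record_keys`; only the `input` key is confirmed by this notebook, and the `output` key in the toy records is an assumption used for illustration.

```python
from collections import Counter


def count_record_keys(records):
    """Count how many records contain each key."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts


# Toy placeholder records; "output" is an assumed field name.
toy_records = [
    {"input": "first text", "output": "label"},
    {"input": "second text"},
]

for key, n in count_record_keys(toy_records).items():
    print(f"{key}: present in {n} of {len(toy_records)} records")
```

With the real data, call `count_record_keys(task_data["test"])` or `count_record_keys(task_data["example"])`. A key that appears in fewer records than the split contains points to an inconsistent schema.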

## End

In [None]:
print('Done.')