MedDec: A Dataset for Extracting Medical Decisions from Discharge Summaries

This is the code and dataset described in MedDec (Elgaar et al., Findings of ACL: ACL 2024).

MedDec is the first dataset specifically developed for extracting and classifying medical decisions from clinical notes. It includes 451 expert-annotated annotated discharge summaries from the MIMIC-III dataset, offering a valuable resource for understanding and facilitating clinical decision-making.

Dataset

Tip

The dataset has been released as of October 16, 2024.

The dataset is available through this link: https://physionet.org/content/meddec/1.0.0/. The user must sign a data usage agreement before accessing the dataset.

Phenotypes Annotations

The phenotype annotations used in the paper are available here: https://physionet.org/content/phenotype-annotations-mimic/1.20.03/.

Prerequisites

Requirements

Install the required packages using the following command:

pip install -r requirements.txt

Extract Notes Text from MIMIC-III

To extract the notes text from the MIMIC-III dataset, run the following command:

python extract_notes.py <data_dir> <notes_path (NOTEEVENTS.csv)>

data_dir: Directory where the dataset is stored.
notes_path: Path to the NOTEEVENTS.csv file. The texts will be written to the data_dir/raw_text directory.

Aggregate Phenotype Annotations

To preprocess the phenotype annotations, run the following command:

python preprocess_phenotypes.py <phenotypes_path (ACTdb102003.csv)>

phenotypes_path: Path to the phenotype annotations. The aggregated annotations will be written to phenos.csv in the same directory as the input file.

Running the Baselines

The code expects data, raw_text, and phenos.csv to be in the data_dir directory.

To train the baselines, run the following command:

python main.py --data_dir <data_dir> --label_encoding multiclass --model_name google/electra-base-discriminator --total_steps 5000 --lr 4e-5

To evaluate the baselines, run the following command:

python main.py --data_dir <data_dir> --eval_only --ckpt ./checkpoints/[datetime]-[model_name]

Arguments

data_dir: The directory where the dataset is stored. The default is ./data/.
pheno_path: The path to the phenotype annotations. The default is ./ACTdb102003.csv.
task: token is the token classification task (decision extraction), and seq is the sequence classification task (phenotype prediction). The default is token.
eval_only: Whether to evaluate the model only. --ckpt should be provided. The default is False.
label_encoding: multiclass, bo (beginning inside outside), or boe (beginning outside end). The default is multiclass.
truncate_train: Truncate the training sequences to a maximum length. Otherwise, the sequences are randomly chunked at training time. The default is False.
truncate_eval: Truncate the evaluation sequences to a maximum length. The default is False.
use_crf: Whether to use a CRF layer. The default is False.
model_name: The name of the model from Hugging Face Transformers
total_steps: The number of training steps
lr: The learning rate
batch_size: The batch size
seed: The random seed

Citation

If you use this dataset or code, please consider citing the following paper:

@inproceedings{elgaar-etal-2024-meddec,
    title = "{M}ed{D}ec: A Dataset for Extracting Medical Decisions from Discharge Summaries",
    author = "Elgaar, Mohamed and Cheng, Jiali and Vakil, Nidhi and Amiri, Hadi and Celi, Leo Anthony",
    editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.975",
    pages = "16442--16455",
}

Additionally, please cite the dataset as follows:

@misc{elgaar2024meddec,
    title = "MedDec: Medical Decisions for Discharge Summaries in the MIMIC-III Database",
    author = "Elgaar, Mohamed and Cheng, Jiali and Vakil, Nidhi and Amiri, Hadi and Celi, Leo Anthony",
    year = "2024",
    version = "1.0.0",
    publisher = "PhysioNet",
    url = "https://doi.org/10.13026/nqnw-7d62"
}

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
assets		assets
.gitattributes		.gitattributes
LICENSE		LICENSE
README.md		README.md
data.py		data.py
eval_gen.py		eval_gen.py
extract_texts.py		extract_texts.py
gen_span_detection.py		gen_span_detection.py
main.py		main.py
model.py		model.py
options.py		options.py
preprocess_phenos.py		preprocess_phenos.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MedDec: A Dataset for Extracting Medical Decisions from Discharge Summaries

Dataset

Phenotypes Annotations

Prerequisites

Requirements

Extract Notes Text from MIMIC-III

Aggregate Phenotype Annotations

Running the Baselines

Arguments

Citation

About

Releases

Packages

Languages

License

CLU-UML/MedDec

Folders and files

Latest commit

History

Repository files navigation

MedDec: A Dataset for Extracting Medical Decisions from Discharge Summaries

Dataset

Phenotypes Annotations

Prerequisites

Requirements

Extract Notes Text from MIMIC-III

Aggregate Phenotype Annotations

Running the Baselines

Arguments

Citation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages