SYMPTOMIFY

Dataset and implementation of the paper SYMPTOMIFY: Transforming Symptom Annotations with Language Model Knowledge Harvesting (EMNLP Findings 2023). SYMPTOMIFY is a large-scale dataset of over 800k annotated reports, reflecting reactions to medications and vaccines. It includes MedDRA symptoms, annotation explanations, and background knowledge about symptoms, designed to facilitate development of systems that can aid human annotators operating in the critically important public health domain.

SYMPTOMIFY is built on the Vaccine Adverse Event Reporting System (VAERS) database, which is managed by the U.S. Centers for Disease Control and Prevention (CDC) and the U.S. Food and Drug Administration (FDA). The dataset spans three years (2019-2021) and includes 839,215 reports, incorporating symptom texts, relevant MedDRA terms, and associated metadata such as age, sex, and vaccine type. The original VAERS data can be found here.

Download

Download: https://drive.google.com/drive/folders/1w4zNKoYRWAnQGZv3hVAAyhttKk1RLKvQ?usp=share_link

Data Format

Each line of the .json files is one VAERS report.

vid: VAERS database entry id; you can map each example to VAERS's database using this id
symptom_text: Symptom text (VAERS reports)
symptom_ids: Symptom ids for classification models
symptoms: ["injectionsiteerythema", "injectionsitepain", "injectionsiteswelling", "tenderness"]
age: 6 {0: 0-9, 1: 10-19, ..., 10} Age can be
sex: 1 {0: Male, 1: Female, 2: Unknown}
vax_type: Vaccine type(s) that the patient received in the report
vax_type_ids: Vaccine type id(s) for classification models
vax_name: Vaccine name(s) that the patient received in the report
vax_name_ids: Vaccine name id(s) for classification models
symptoms_original: Symptom names from MedDRA; same as 'symptoms' but with the original names
n_symps: 4 (the number of symptoms annotated in this example)
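
As a minimal sketch of reading this format, the snippet below parses a JSON-lines stream one report at a time. The helper name `iter_reports` and the sample record are illustrative and not part of the repository; the field names follow the list above, while the concrete values and their types are assumptions.

```python
import io
import json
from typing import Iterator, TextIO


def iter_reports(lines: TextIO) -> Iterator[dict]:
    """Yield one report dict per non-empty line of a JSON-lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical record: every value here is made up for illustration.
sample_record = {
    "vid": "0000000",
    "symptom_text": "Redness, pain and swelling at the injection site.",
    "symptoms": ["injectionsiteerythema", "injectionsitepain",
                 "injectionsiteswelling", "tenderness"],
    "age": 6,
    "sex": 1,
    "n_symps": 4,
}
sample_stream = io.StringIO(json.dumps(sample_record) + "\n\n")
reports = list(iter_reports(sample_stream))
```

With the real data, the same helper can be given an open file handle, for example `open("data/train.json", encoding="utf-8")`.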

Symptom entity list and definitions

data/symptoms.tsv

symptomid: Symptom id
symptom_norm: Normalized name (lower case, without spaces)
symptom_ori: Symptom name
count: Frequency of the symptom occurring in the dataset
cui: mapped CUI (from UMLS)
cui_name: Symptom name in UMLS
cui_source: How this symptom was mapped to the CUI
metamap_score: MetaMap score, when the CUI was mapped via MetaMap
cui_def: Definition of the symptom from the UMLS dictionary
gpt3_def: Definition of the symptom generated by GPT-3
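
The columns above can be read with Python's standard `csv` module. The sketch below assumes the file starts with a header row that uses the column names listed here, which the README does not state explicitly; `load_symptom_table` and `normalize` are hypothetical helpers, and the sample row is made up. `normalize` only implements the rule given for `symptom_norm` (lower case, without spaces) and may not match the dataset's handling of punctuation.

```python
import csv
import io
from typing import Dict, TextIO


def load_symptom_table(handle: TextIO) -> Dict[str, dict]:
    """Map each normalized symptom name to its full TSV row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["symptom_norm"]: row for row in reader}


def normalize(name: str) -> str:
    """Lower-case a symptom name and remove all whitespace."""
    return "".join(name.lower().split())


# Hypothetical two-line TSV (assumed header row plus one made-up entry).
sample_tsv = (
    "symptomid\tsymptom_norm\tsymptom_ori\tcount\n"
    "0\tinjectionsitepain\tInjection site pain\t10\n"
)
table = load_symptom_table(io.StringIO(sample_tsv))
row = table[normalize("Injection site pain")]
```

For the real file, pass `open("data/symptoms.tsv", encoding="utf-8", newline="")` instead of the in-memory sample.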

Annotation Explanations from ChatGPT

data/explanations.json (jsonl format)

testid: Entry id (not related to the test set or any other split)
symptom: Symptom name
explanation: Chain-of-thought explanation generated by ChatGPT (GPT-3.5)
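
One way to work with these explanations is to group them by symptom. The following is a sketch under the assumption that every line is a JSON object holding a single `symptom` string and a single `explanation` string; the helper name `group_explanations` and the sample lines are illustrative only.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def group_explanations(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect explanation strings under the symptom they refer to."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        if line.strip():
            record = json.loads(line)
            grouped[record["symptom"]].append(record["explanation"])
    return dict(grouped)


# Made-up lines in the assumed format.
sample_lines = [
    json.dumps({"testid": 0, "symptom": "tenderness",
                "explanation": "The report mentions soreness on touch."}),
    json.dumps({"testid": 1, "symptom": "tenderness",
                "explanation": "The arm is described as tender."}),
]
by_symptom = group_explanations(sample_lines)
```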

Setup

Install the requirements:

pip install -r requirements.txt

All experiments were conducted on an NVIDIA GTX 1080 (12GB) with CUDA 10.2 and torch 1.12.

Pretrain-Finetune Baselines

Classification Approach

Train the ClinicalBERT+CNN model:

python run_classifier.py \
    --model_name_or_path emilyalsentzer/Bio_ClinicalBERT \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 8 \
    --learning_rate 3e-5 \
    --num_train_epochs 5 \
    --output_dir {out_dir} \
    --train_file {data_dir}/train.json \
    --validation_file {data_dir}/dev_1k.json \
    --test_file {data_dir}/test.json \
    --symptoms_file data/symptoms.tsv \
    --do_train
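
Because a report can carry several symptoms, classification over `symptom_ids` is a multi-label problem. The sketch below shows one common way to turn an id list into a multi-hot target vector. It is not taken from `run_classifier.py`; the helper name `multi_hot`, the value of `num_labels`, and the assumption that ids are zero-based integers are all illustrative.

```python
from typing import List, Sequence


def multi_hot(symptom_ids: Sequence[int], num_labels: int) -> List[float]:
    """Return a vector with 1.0 at each annotated symptom id, 0.0 elsewhere."""
    target = [0.0] * num_labels
    for idx in symptom_ids:
        if not 0 <= idx < num_labels:
            raise ValueError(f"symptom id {idx} is outside [0, {num_labels})")
        target[idx] = 1.0
    return target


# Hypothetical report with two symptoms in a label space of five.
example_target = multi_hot([1, 3], num_labels=5)
```

In practice `num_labels` would be the number of rows in `data/symptoms.tsv`.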

Generative Approach

Train the generative entity retrieval model with BART (Lewis et al., 2020):

python run_generative_model.py \
    --model_name_or_path facebook/bart-base \
    --max_source_length 256 \
    --max_target_length 128 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 2 \
    --learning_rate 2e-5 \
    --num_train_epochs 5 \
    --output_dir {output_dir} \
    --text_column symptom_text \
    --summary_column symptoms \
    --train_file data/train.json \
    --validation_file data/dev.json \
    --do_train

Test the model:

python run_generative_model.py \
    --model_name_or_path {test_model_name_or_path} \
    --max_source_length 256 \
    --max_target_length 128 \
    --per_device_eval_batch_size 16 \
    --text_column symptom_text \
    --summary_column symptoms \
    --test_file data/test.json \
    --do_predict
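
Predicted and gold symptom lists can be compared as sets. The snippet below is a generic sketch of per-report precision, recall, and F1, not the repository's own metric code; `set_prf` is a hypothetical helper, and it assumes the generated output has already been split into normalized symptom names. The name "fever" in the example is made up.

```python
from typing import Iterable, Tuple


def set_prf(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall and F1 for one report."""
    pred_set, gold_set = set(predicted), set(gold)
    overlap = len(pred_set & gold_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Two of three predictions are correct; two of four gold symptoms are found.
p, r, f = set_prf(
    ["injectionsitepain", "tenderness", "fever"],
    ["injectionsiteerythema", "injectionsitepain",
     "injectionsiteswelling", "tenderness"],
)
```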

For the multi-GPU setting:

python -m torch.distributed.launch \
    --nproc_per_node=2 run_generative_model.py \
    --model_name_or_path facebook/bart-base \
    --max_source_length 256 \
    --max_target_length 128 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 2 \
    --learning_rate 2e-5 \
    --num_train_epochs 5 \
    --output_dir {output_dir} \
    --text_column symptom_text \
    --summary_column symptoms \
    --train_file data/train.json \
    --validation_file data/dev.json \
    --do_train
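
Note that launching on more GPUs changes the effective batch size. Assuming `run_generative_model.py` follows the usual Hugging Face Trainer semantics for these flags (an assumption based on the flag names, not confirmed here), the number of examples per optimizer step is the product of the per-device batch size, the accumulation steps, and the number of processes. The helper below is a hypothetical illustration of that arithmetic.

```python
def effective_batch_size(per_device: int, accumulation_steps: int, num_processes: int) -> int:
    """Examples consumed per optimizer step across all processes."""
    return per_device * accumulation_steps * num_processes


single_gpu = effective_batch_size(16, 2, 1)  # single-GPU command
two_gpus = effective_batch_size(16, 2, 2)    # --nproc_per_node=2
```

Under this assumption, setting `--gradient_accumulation_steps 1` in the two-GPU command would keep the effective batch size equal to the single-GPU run.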
