GitHub - abachaa/MTS-Dialog: A new collection of 1.7k doctor-patient conversations and corresponding clinical notes/summaries.

Introduction

This repository contains the data and source code for the following EACL 2023 paper:

- An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters. 
- Asma Ben Abacha, Wen-wai Yim, Yadan Fan and Thomas Lin. 
- EACL, May 3-5, 2023, Dubrovnik, Croatia. 

    @inproceedings{mts-dialog,
        title = "An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters",
        author = "Ben Abacha, Asma  and
          Yim, Wen-wai  and
          Fan, Yadan  and
          Lin, Thomas",
        booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
        month = may,
        year = "2023",
        address = "Dubrovnik, Croatia",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2023.eacl-main.168",
        pages = "2291--2302"
    }

Datasets, Code & Annotations

Main Dataset

The MTS-Dialog dataset is a new collection of 1.7k short doctor-patient conversations and corresponding summaries (section headers and contents).

The training set consists of 1,201 pairs of conversations and associated summaries.
The validation set consists of 100 pairs of conversations and their summaries.
MTS-Dialog includes two test sets, each consisting of 200 conversations and associated section headers and contents:
- MTS-Dialog-TestSet-1-MEDIQA-Chat-2023.csv: Official test set used in the MEDIQA-Chat 2023 challenge (Task A)
- MTS-Dialog-TestSet-2-MEDIQA-Sum-2023.csv: Official test set used in the MEDIQA-Sum 2023 challenge (Task A & Task B)
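
As a quick sanity check after downloading one of the CSV files, the distribution of section headers can be tallied with the standard library. The sketch below is illustrative only: the column name `section_header` and the three sample rows are assumptions made for this example, not taken from the files, so check the actual header row before relying on them.

```python
import csv
import io
from collections import Counter


def count_section_headers(csv_file, header_column="section_header"):
    """Count how many rows carry each section header.

    `header_column` is an assumed column name; adjust it to match
    the header row of the file you are reading.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[header_column].strip().lower() for row in reader)


# Invented in-memory stand-in for a dataset file (not real MTS-Dialog rows).
sample = io.StringIO(
    "ID,section_header,section_text,dialogue\n"
    "0,GENHX,Placeholder summary A,Placeholder dialogue A\n"
    "1,CC,Placeholder summary B,Placeholder dialogue B\n"
    "2,GENHX,Placeholder summary C,Placeholder dialogue C\n"
)
counts = count_section_headers(sample)
for header, n in counts.most_common():
    print(header, n)
```

To run the same function on a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object in place of `sample`.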

The full list of normalized section headers:

    1. fam/sochx [FAMILY HISTORY/SOCIAL HISTORY]
    2. genhx [HISTORY OF PRESENT ILLNESS]
    3. pastmedicalhx [PAST MEDICAL HISTORY]
    4. cc [CHIEF COMPLAINT]
    5. pastsurgical [PAST SURGICAL HISTORY]
    6. allergy
    7. ros [REVIEW OF SYSTEMS]
    8. medications
    9. assessment
    10. exam
    11. diagnosis
    12. disposition
    13. plan
    14. edcourse [EMERGENCY DEPARTMENT COURSE]
    15. immunizations
    16. imaging
    17. gynhx [GYNECOLOGIC HISTORY]
    18. procedures
    19. other_history
    20. labs
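
For scripts that need the expanded names, the list above can be transcribed into a small lookup table. This is a convenience sketch rather than part of the released code; the names `NORMALIZED_HEADERS`, `EXPANSIONS`, and `expand_header` are made up for this example, and headers listed without an expansion are simply upper-cased.

```python
# The 20 normalized section headers, in the order listed above.
NORMALIZED_HEADERS = [
    "fam/sochx", "genhx", "pastmedicalhx", "cc", "pastsurgical",
    "allergy", "ros", "medications", "assessment", "exam",
    "diagnosis", "disposition", "plan", "edcourse", "immunizations",
    "imaging", "gynhx", "procedures", "other_history", "labs",
]

# Expansions given in brackets in the list above.
EXPANSIONS = {
    "fam/sochx": "FAMILY HISTORY/SOCIAL HISTORY",
    "genhx": "HISTORY OF PRESENT ILLNESS",
    "pastmedicalhx": "PAST MEDICAL HISTORY",
    "cc": "CHIEF COMPLAINT",
    "pastsurgical": "PAST SURGICAL HISTORY",
    "ros": "REVIEW OF SYSTEMS",
    "edcourse": "EMERGENCY DEPARTMENT COURSE",
    "gynhx": "GYNECOLOGIC HISTORY",
}


def expand_header(code):
    """Return the expanded name of a normalized header (case-insensitive)."""
    key = code.strip().lower()
    if key not in NORMALIZED_HEADERS:
        raise KeyError(f"unknown section header: {code!r}")
    # Headers without a bracketed expansion fall back to the upper-cased code.
    return EXPANSIONS.get(key, key.upper())


print(expand_header("GENHX"))
print(expand_header("allergy"))
```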

Augmented dataset

The augmented dataset consists of 3.6k pairs of medical conversations and associated summaries, created from the original 1.2k training pairs via back-translation through two pivot languages, French and Spanish, as described in the paper (cf. Section 4.2).

We provide the full augmented training set that we used in the experiments, as well as the separate datasets created using the French and Spanish translation models.
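
The arithmetic behind the 3.6k figure is the original training pairs plus one back-translated copy per pivot language (1,201 × 3 = 3,603). The sketch below illustrates that idea in general terms and is not the authors' pipeline: the translation functions are placeholders supplied by the caller, the field names `dialogue` and `summary` are hypothetical, and back-translating both fields is an assumption.

```python
from typing import Callable, Dict, List, Tuple

Pair = Dict[str, str]
Translator = Callable[[str], str]


def back_translate(text: str, to_pivot: Translator, to_english: Translator) -> str:
    """Translate English text into a pivot language and back again."""
    return to_english(to_pivot(text))


def augment(
    pairs: List[Pair],
    pivots: Dict[str, Tuple[Translator, Translator]],
    fields: Tuple[str, ...] = ("dialogue", "summary"),  # assumed field names
) -> List[Pair]:
    """Return the original pairs plus one back-translated copy per pivot."""
    augmented = [dict(p) for p in pairs]
    for to_pivot, to_english in pivots.values():
        for pair in pairs:
            copy = dict(pair)
            for field in fields:
                copy[field] = back_translate(pair[field], to_pivot, to_english)
            augmented.append(copy)
    return augmented


# Stand-in "translators" so the sketch runs without any MT model.
fake_pivots = {
    "fr": (lambda s: "fr:" + s, lambda s: s[3:]),
    "es": (lambda s: "es:" + s, lambda s: s[3:]),
}
toy_pairs = [{"dialogue": "placeholder dialogue", "summary": "placeholder summary"}]
result = augment(toy_pairs, fake_pivots)
print(len(result))  # three times the input size: original + fr + es
```

With real translation models in place of the stand-ins, the back-translated copies would be paraphrases rather than exact duplicates of the originals.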

Source Code

The repository includes source code for summarizing doctor-patient conversations and automatically generating clinical notes.

Manual Scores for Correlation Study

We provide manual fact-based scores for 400 automatic summaries, generated by four summarization models on the validation set of 100 conversations and notes.
The Factual P/R/F1 Scores, Hallucination and Omission Rates, and Levenshtein Edit Distance are computed from the manual fact-based counts and corrections.
We used the manual scores to evaluate the performance of several evaluation metrics (e.g., ROUGE, BERTScore, and BLEURT) by computing the Pearson's correlation coefficients between the automatic and manual scores, as described in the paper (cf. Section 5.2 and Section 5.3).
We provide all the data needed to perform this correlation study on other evaluation metrics.
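
To try the correlation study with another metric, the core computation is Pearson's correlation coefficient between that metric's scores and the manual scores for the same summaries. The following is a minimal standard-library sketch; the function name and the toy numbers are invented for illustration, and reading the released score files is left out because their layout is not described here.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two score pairs are required")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Invented toy values: one automatic score and one manual score per summary.
automatic_scores = [0.2, 0.4, 0.6, 0.8]
manual_scores = [0.1, 0.5, 0.5, 0.9]
r = pearson(automatic_scores, manual_scores)
print(round(r, 3))
```

If SciPy is available, `scipy.stats.pearsonr` computes the same coefficient and also returns a p-value.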

Challenges & Evaluation Scripts

MEDIQA-Chat 2023: https://github.com/abachaa/MEDIQA-Chat-2023
MEDIQA-Sum 2023: https://github.com/ImageCLEF/2023_ImageCLEFmed_Mediqa

License

This work is published under a Creative Commons Attribution 4.0 International License (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/

Contact

- Asma Ben Abacha (abenabacha at microsoft dot com)
- Wen-wai Yim (yimwenwai at microsoft dot com)
