CARES - A Corpus of Anonymised Radiological Evidences in Spanish

This repository contains the code to reproduce the experiments described in the paper CARES: A Corpus for Classification of Spanish Radiological Reports. CARES is a high-quality text resource manually labeled with ICD-10 codes and reviewed by radiologists. Such resources are essential for developing automatic text classification tools, since they are needed to train and fine-tune computational systems. The dataset is available on the HuggingFace Hub.
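
As a quick start, the corpus can be loaded with the datasets library. The hub identifier and split layout below are assumptions based on the repository name; check the hub page for the exact values.

from datasets import load_dataset

# Assumed hub identifier; verify against the dataset page on the HuggingFace Hub.
cares = load_dataset("chizhikchi/CARES")
print(cares)              # available splits and columns
print(cares["train"][0])  # one annotated report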

Corpus description and statistics

The CARES corpus has been manually annotated using the ICD-10 ontology, the 10th revision of the International Classification of Diseases. Each radiological report was assigned a minimum of one and a maximum of nine codes; the average number of codes per text is 2.15, with a standard deviation of 1.12.

The corpus was additionally preprocessed to make its format consistent with the automatic text classification task. Given the hierarchical structure of the ICD-10 ontology, each sub-code was mapped to its corresponding code and chapter, yielding two new sets of labels for each report. The entire CARES collection contains 6,907 sub-code annotations across the 3,219 radiological reports. There are 223 unique ICD-10 sub-codes within the annotations, which were mapped to 156 unique ICD-10 codes and 16 unique chapters of the ontology.
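
To make the sub-code → code → chapter coarsening concrete, the sketch below shows one possible mapping: a sub-code such as R91.8 collapses to the code R91 and to the chapter whose range covers that code. The chapter table is truncated to a few WHO ICD-10 entries and is illustrative only; it does not claim to match the exact mapping used in the paper.

# Illustrative label coarsening: sub-code -> code -> chapter.
ICD10_CHAPTERS = {
    ("A00", "B99"): 1,    # Certain infectious and parasitic diseases
    ("C00", "D48"): 2,    # Neoplasms
    ("I00", "I99"): 9,    # Diseases of the circulatory system
    ("R00", "R99"): 18,   # Symptoms, signs and abnormal findings
}

def subcode_to_code(subcode: str) -> str:
    """Drop everything after the dot: 'R91.8' -> 'R91'."""
    return subcode.split(".")[0]

def code_to_chapter(code: str):
    """Return the chapter whose code range contains the given code, if any."""
    for (start, end), chapter in ICD10_CHAPTERS.items():
        if start <= code <= end:
            return chapter
    return None

print(subcode_to_code("R91.8"))   # -> 'R91'
print(code_to_chapter("R91"))     # -> 18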

Text classification experiments

The main objective of this repository is to facilitate the reproducibility of the experiments described in the paper. The experimentation focused on developing a system to classify each report with its corresponding ICD-10 chapters. For this purpose, we performed a stratified split of the corpus into train (70%) and test (30%) subsets, where the former was used to fine-tune the models and the latter to evaluate them.
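
Because each report can carry several chapter labels, a plain single-label stratified split does not apply directly. The sketch below shows one way to obtain a 70/30 multi-label stratified split with iterative stratification from scikit-multilearn on toy data; the exact splitting procedure used in the paper may differ.

import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

# Toy stand-ins: 20 reports, 4 possible chapters encoded as a multi-hot matrix.
rng = np.random.default_rng(0)
X = np.arange(20).reshape(-1, 1)        # report indices
y = rng.integers(0, 2, size=(20, 4))    # multi-hot chapter labels

X_train, y_train, X_test, y_test = iterative_train_test_split(X, y, test_size=0.3)
print(len(X_train), len(X_test))        # roughly a 70/30 split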

The experimentation process consisted of fine-tuning four pre-trained transformer models (a minimal fine-tuning sketch follows the list):

  • BETO - a general-domain Spanish BERT model
  • RoBERTa-bne - a Spanish RoBERTa model considered state of the art for general-domain Spanish NLP
  • BioBERT-Spanish - a general BERT model extended with domain-specific knowledge by fine-tuning on a Chilean clinical corpus
  • RoBERTa-biomedical-clinical - a RoBERTa-based model pretrained on a combination of biomedical and clinical corpora
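
The sketch below illustrates how one of these checkpoints could be fine-tuned for multi-label chapter classification with the transformers Trainer. The checkpoint (BETO), hub dataset identifier, split names, column names ("full_text", "chapters"), 1-based chapter numbering and hyperparameters are assumptions made for illustration; the actual configuration is handled by main.py.

import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

NUM_CHAPTERS = 16                                       # unique chapters in CARES
checkpoint = "dccuchile/bert-base-spanish-wwm-cased"    # BETO, as one example

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
cares = load_dataset("chizhikchi/CARES")                # assumed hub ID and splits

def preprocess(example):
    enc = tokenizer(example["full_text"], truncation=True,
                    padding="max_length", max_length=512)
    labels = np.zeros(NUM_CHAPTERS, dtype=np.float32)   # multi-hot chapter vector
    labels[np.array(example["chapters"]) - 1] = 1.0     # assumes 1-based chapter ids
    enc["labels"] = labels
    return enc

encoded = cares.map(preprocess)

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=NUM_CHAPTERS,
    problem_type="multi_label_classification")          # BCE loss over 16 chapters

args = TrainingArguments(output_dir="beto-cares-chapters",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["test"])
trainer.train()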

The table below summarises the results obtained by each developed system.

Model                 Micro-avg                        Macro-avg
                      Precision  Recall   F1-score     Precision  Recall   F1-score
BETO                  0.8687     0.8464   0.8574       0.8673     0.7975   0.8250
RoBERTa-BNE           0.9032     0.7937   0.8449       0.9303     0.6845   0.7673
BioBERT               0.7813     0.7396   0.7599       0.8296     0.6817   0.7365
RoBERTa Biomedical    0.8562     0.8794   0.8676       0.8740     0.8213   0.8328
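
For reference, micro- and macro-averaged scores of this kind can be computed from multi-label predictions as sketched below; y_true and y_pred here are toy multi-hot matrices, not the paper's actual model outputs.

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])   # toy gold chapter labels
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])   # toy predicted labels

for average in ("micro", "macro"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=average, zero_division=0)
    print(f"{average}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")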

You can reproduce the whole fine-tuning process with hyperparameter optimisation by running:

cd CARES
pip install -r requirements.txt
python main.py --model <model-name> --do optimise

train models with the hyperparameters we selected in our experimentation by running:

cd CARES
pip install -r requirements.txt
python main.py --model <model-name> --do train

or just run testing on the fine-tuned models available on HuggingFace by running:

cd CARES
pip install -r requirements.txt
python main.py --model <model-name> --do test

Note that the --model argument must be one of the following options: bio-bert-spanish, roberta-biomedical-clinical, bert-base-spanish, roberta-bne.
