Project: Named Entity Recognition using the NVIDIA BioMegatron model

Objective: Fine-tune BioMegatron, a BERT-like Megatron-LM model pre-trained on a large biomedical text corpus (PubMed abstracts and the full-text commercial use collection), on the NCBI Disease Dataset for Named Entity Recognition.

1. Install NeMo toolkit

In [None]:
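# install NeMo
# BRANCH selects the NeMo branch or tag to install; 'main' is an assumption,
# pin a release that still ships the NLP collection if this fails
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]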

In [None]:
from nemo.collections import nlp as nemo_nlp
from nemo.utils.exp_manager import exp_manager

import os
import wget
import torch
import lightning.pytorch as pl
from omegaconf import OmegaConf

Task Description:

Named entity recognition (NER), also referred to as entity chunking, identification, or extraction, is the task of detecting and classifying key information (entities) in text.

Task: Given sentences from medical abstracts, what diseases are mentioned?

Input: sentences from the abstracts
Output: the disease entities named in those sentences


Dataset:


The NCBI-disease corpus is a set of 793 PubMed abstracts, annotated by 14 annotators. The annotations take the form of HTML-style tags inserted into the abstract text according to clearly defined rules. The annotations identify named diseases and can be used to fine-tune a language model to identify disease mentions in future abstracts, whether or not those diseases were part of the original training set.
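
To see what these HTML-style tags look like in practice, here is a minimal sketch that pulls the tagged mentions out of one annotated string. The tag syntax is taken from the sample sentence shown in the IOB Tagging section of this tutorial; the function names `extract_mentions` and `strip_tags` are hypothetical helpers, not part of NeMo or the corpus tooling.

```python
import re

# Matches the annotation style shown in this tutorial:
# <category="SomeType">mention text</category>
TAG_PATTERN = re.compile(r'<category="([^"]+)">(.*?)</category>')


def extract_mentions(annotated_text):
    """Return (category, mention) pairs found in one annotated string."""
    return [(m.group(1), m.group(2)) for m in TAG_PATTERN.finditer(annotated_text)]


def strip_tags(annotated_text):
    """Remove the annotation tags and keep only the plain text."""
    return TAG_PATTERN.sub(lambda m: m.group(2), annotated_text)


sample = ('Identification of APC2, a homologue of the '
          '<category="Modifier">adenomatous polyposis coli tumour</category> suppressor .')

print(extract_mentions(sample))
print(strip_tags(sample))
```

The regular expression assumes tags are not nested, which holds for the sample sentence but has not been checked against the whole corpus.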

In [None]:
# Download the dataset

DATA_DIR = "DATA_DIR"
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(os.path.join(DATA_DIR, 'NER'), exist_ok=True)

In [None]:
print('Downloading NCBI data...')
wget.download('https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/NCBI_corpus.zip', DATA_DIR)
! unzip -o {DATA_DIR}/NCBI_corpus.zip -d {DATA_DIR}

2. Pre-process the dataset

In [None]:
NER_DATA_DIR = f'{DATA_DIR}/NER'
wget.download('https://raw.githubusercontent.com/spyysalo/ncbi-disease/master/conll/train.tsv', NER_DATA_DIR)
wget.download('https://raw.githubusercontent.com/spyysalo/ncbi-disease/master/conll/devel.tsv', NER_DATA_DIR)
wget.download('https://raw.githubusercontent.com/spyysalo/ncbi-disease/master/conll/test.tsv', NER_DATA_DIR)

In [None]:
!ls -lh $NER_DATA_DIR

In [None]:
# Rename devel.tsv to dev.tsv before converting the files to a format compatible with the NeMo Token Classification module.

! mv $NER_DATA_DIR/devel.tsv $NER_DATA_DIR/dev.tsv

In [None]:
wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/token_classification/data/import_from_iob_format.py')
! python import_from_iob_format.py --data_file=$NER_DATA_DIR/train.tsv
! python import_from_iob_format.py --data_file=$NER_DATA_DIR/dev.tsv
! python import_from_iob_format.py --data_file=$NER_DATA_DIR/test.tsv

The NER task requires two files: the text sentences, and the labels. Run the next two cells to see a sample of the two files.
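
The sketch below shows the general idea behind this kind of conversion. It is not the NeMo script itself. It assumes the input has one token and its label per line, separated by whitespace, with a blank line between sentences. The helpers `split_conll` and `find_misaligned` are hypothetical names used only for illustration.

```python
def split_conll(lines):
    """Turn 'token<TAB>label' lines into parallel sentence and label strings.

    A blank line marks the end of a sentence.
    """
    texts, labels = [], []
    tokens, tags = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # End of sentence: flush whatever has been collected so far.
            if tokens:
                texts.append(' '.join(tokens))
                labels.append(' '.join(tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])
        tags.append(parts[-1])
    if tokens:  # the last sentence may not be followed by a blank line
        texts.append(' '.join(tokens))
        labels.append(' '.join(tags))
    return texts, labels


def find_misaligned(texts, labels):
    """Return indices of sentences whose token and label counts differ."""
    return [i for i, (text, label) in enumerate(zip(texts, labels))
            if len(text.split()) != len(label.split())]


sample_lines = [
    "adenomatous\tB", "polyposis\tI", "coli\tI", "tumour\tI",
    "suppressor\tO", ".\tO",
    "",
    "Identification\tO", "of\tO", "APC2\tO",
]

texts, labels = split_conll(sample_lines)
print(texts)
print(labels)
print(find_misaligned(texts, labels))
```

A check like `find_misaligned` is a cheap way to confirm that every line of the text file has exactly one label per word before training starts.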


In [None]:
!head $NER_DATA_DIR/text_train.txt

In [None]:
!head $NER_DATA_DIR/labels_train.txt

IOB Tagging

Each abstract has been split into sentences. Each sentence is then split into words, with labels that correspond to the original HTML-style tags in the corpus.

The sentences and labels in the NER dataset map to each other with inside, outside, beginning (IOB) tagging. Anything separated by white space is a word, including punctuation. For the first sentence we have the following mapping:

Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor .
O              O  O    O O O         O  O   B           I         I    I      O          O

Recall the original corpus tags:

Identification of APC2, a homologue of the <category="Modifier">adenomatous polyposis coli tumour</category> suppressor .

The first word of the tagged text, "adenomatous", is now IOB-tagged with a B (beginning) tag; the remaining words of the mention, "polyposis coli tumour", are tagged with I (inside) tags; and everything else is tagged O (outside).
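
Going the other way, from IOB labels back to entity mentions, is what you need when reading model predictions. The following is a minimal sketch that assumes the plain `B` / `I` / `O` label set shown above; `iob_to_entities` is a hypothetical helper name.

```python
def iob_to_entities(tokens, labels):
    """Group tokens into entity mentions according to their IOB labels."""
    entities = []
    current = []
    for token, label in zip(tokens, labels):
        if label == 'B':
            # A new entity starts; close the previous one if it is still open.
            if current:
                entities.append(' '.join(current))
            current = [token]
        elif label == 'I' and current:
            current.append(token)
        else:
            # 'O' (or a stray 'I' with no open entity) closes any open entity.
            if current:
                entities.append(' '.join(current))
            current = []
    if current:
        entities.append(' '.join(current))
    return entities


sentence = ('Identification of APC2 , a homologue of the '
            'adenomatous polyposis coli tumour suppressor .')
tags = 'O O O O O O O O B I I I O O'

print(iob_to_entities(sentence.split(), tags.split()))
# ['adenomatous polyposis coli tumour']
```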

Model configuration
Our Named Entity Recognition model consists of the pretrained BERT-like language model (BioMegatron in this tutorial) followed by a token classification layer.
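
To make the "encoder followed by a token classification layer" idea concrete, here is a toy sketch in plain Python. It is not NeMo's implementation: the label order, the 2-dimensional vectors, and the weights are all made-up assumptions. In the real model the per-token vectors come from BioMegatron and the weights are learned during fine-tuning.

```python
LABELS = ['O', 'B', 'I']  # hypothetical label order for this sketch


def linear(vector, weights, bias):
    """One dense layer: `weights` has one row per output label."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def classify_tokens(hidden_states, weights, bias):
    """Pick the highest-scoring label for every token vector."""
    predictions = []
    for vector in hidden_states:
        logits = linear(vector, weights, bias)
        best = max(range(len(logits)), key=logits.__getitem__)
        predictions.append(LABELS[best])
    return predictions


# Toy "encoder outputs" for three tokens (made-up numbers).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
weights = [[-1.0, -1.0],  # row scoring the 'O' label
           [1.0, 0.0],    # row scoring the 'B' label
           [0.0, 1.0]]    # row scoring the 'I' label
bias = [0.0, 0.0, 0.0]

print(classify_tokens(hidden_states, weights, bias))
# ['B', 'I', 'O']
```

The key point is that the same small layer is applied independently to every token position, so the output has one label per input token.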

The model is defined in a config file that declares several important sections, including:

model: All arguments that are related to the Model - language model, token classifier, optimizer and schedulers, datasets and any other related information

trainer: Any argument to be passed to PyTorch Lightning

In [None]:
MODEL_CONFIG = "token_classification_config.yaml"
WORK_DIR = "WORK_DIR"
os.makedirs(WORK_DIR, exist_ok=True)
# download the model's configuration file
config_dir = WORK_DIR + '/configs/'
os.makedirs(config_dir, exist_ok=True)
if not os.path.exists(config_dir + MODEL_CONFIG):
    print('Downloading config file...')
    wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/token_classification/conf/' + MODEL_CONFIG, config_dir)
else:
    print('Config file already exists')
# load the config; the full model config is printed at the end of this cell
config_path = f'{WORK_DIR}/configs/{MODEL_CONFIG}'
print(config_path)
config = OmegaConf.load(config_path)
# Note: these are small batch-sizes - increase as appropriate to available GPU capacity
config.model.train_ds.batch_size=8
config.model.validation_ds.batch_size=8
# in this tutorial train and dev datasets are located in the same folder, so it is enough to add the path of the data directory to the config
config.model.dataset.data_dir = os.path.join(DATA_DIR, 'NER')

# if you want to decrease the size of your datasets, uncomment the lines below:
# NUM_SAMPLES = 1000
# config.model.train_ds.num_samples = NUM_SAMPLES
# config.model.validation_ds.num_samples = NUM_SAMPLES
print(OmegaConf.to_yaml(config))

3. Model Training

Setting up Data within the config

The config file contains dictionaries called dataset, train_ds and validation_ds. These hold the settings used to set up the Dataset and DataLoader objects for training and validation.


In [None]:
config.model.dataset.data_dir = os.path.join(DATA_DIR, 'NER')


Building the PyTorch Lightning Trainer

NeMo models are primarily PyTorch Lightning modules and are therefore fully compatible with the PyTorch Lightning ecosystem.

In [None]:
print("Trainer config - \n")
print(OmegaConf.to_yaml(config.trainer))

In [None]:

accelerator = 'gpu' if torch.cuda.is_available() else 'cpu'
config.trainer.devices = 1
config.trainer.accelerator = accelerator

# use 16-bit precision (PyTorch native AMP) on GPU, 32-bit precision on CPU
config.trainer.precision = 16 if torch.cuda.is_available() else 32

# remove distributed training flags
config.trainer.strategy = 'auto'

trainer = pl.Trainer(**config.trainer)

Setting up the NeMo experiment

In [None]:
exp_dir = exp_manager(trainer, config.get("exp_manager", None))
os.makedirs(WORK_DIR, exist_ok=True)

# the exp_dir provides a path to the current experiment for easy access
exp_dir = str(exp_dir)
exp_dir

In [None]:
from nemo.collections.nlp.models.language_modeling.megatron_bert_model import MegatronBertModel
print([model.pretrained_model_name for model in MegatronBertModel.list_available_models()])


config.model.language_model.lm_checkpoint = None
config.model.language_model.pretrained_model_name = 'biomegatron345m_biovocab_30k_cased'
config.model.tokenizer.tokenizer_name = None

In [None]:
model_ner = nemo_nlp.models.TokenClassificationModel(cfg=config.model, trainer=trainer)

Monitoring training progress

In [None]:
try:
    from google import colab
    COLAB_ENV = True
except (ImportError, ModuleNotFoundError):
    COLAB_ENV = False

# Load the TensorBoard notebook extension
if COLAB_ENV:
    %load_ext tensorboard
    %tensorboard --logdir {exp_dir}
else:
    print("To use tensorboard, please use this notebook in a Google Colab environment.")

In [None]:
# start model training
trainer.fit(model_ner)

4. Inference

In [None]:
! head -n 100 $NER_DATA_DIR/text_dev.txt > $NER_DATA_DIR/sample_text_dev.txt
! head -n 100 $NER_DATA_DIR/labels_dev.txt > $NER_DATA_DIR/sample_labels_dev.txt

In [None]:
model_ner.half().evaluate_from_file(
    text_file=os.path.join(NER_DATA_DIR, 'sample_text_dev.txt'),
    labels_file=os.path.join(NER_DATA_DIR, 'sample_labels_dev.txt'),
    output_dir=exp_dir,
    add_confusion_matrix=False,
    normalize_confusion_matrix=True,
    batch_size=1
)
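
Token-level precision, recall and F1 per label are a common way to summarize an evaluation like this. The sketch below computes them by hand from two label sequences so the numbers are easy to interpret. It is an illustration under stated assumptions, not NeMo's own metric code: `per_label_scores` is a hypothetical helper and the two label sequences are made-up examples.

```python
from collections import Counter


def per_label_scores(true_labels, predicted_labels):
    """Return {label: (precision, recall, f1)} computed token by token."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong for this token
            fn[truth] += 1  # true label was missed for this token

    scores = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


truth = 'O O B I I O'.split()
pred = 'O O B I O O'.split()

for label, (p, r, f1) in per_label_scores(truth, pred).items():
    print(f'{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}')
```

In this made-up example one `I` token is predicted as `O`, which lowers recall for `I` and precision for `O` while leaving `B` untouched.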