# Medical Entity Recognition with Pretrained Transformers

In this notebook we explore how we can use pretrained transformer models, such as BERT, to identify medical entities in text.

## Data

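We're going to work with `surrey-nlp/PLOD-CW`, a dataset in which abbreviations (`AC`) and their long forms (`LF`) have been annotated token by token. The [NCBI Disease](https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/) corpus, a corpus of 793 PubMed abstracts with 6,892 annotated disease mentions, is left in the code as a commented-out alternative. Both can be downloaded with the `datasets` package from Hugging Face, which gives us easy access to hundreds of interesting datasets.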

In [1]:
from datasets import load_dataset

#dataset = load_dataset('ncbi_disease')
dataset = load_dataset("surrey-nlp/PLOD-CW")

Let's see what this dataset looks like. The texts have already been tokenized for us, and every item comes with its tokens, their part-of-speech tags and their entity tags. There are 1,072 items in the training data, 126 in the validation data and another 153 in the test data.

In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'pos_tags', 'ner_tags'],
        num_rows: 1072
    })
    validation: Dataset({
        features: ['tokens', 'pos_tags', 'ner_tags'],
        num_rows: 126
    })
    test: Dataset({
        features: ['tokens', 'pos_tags', 'ner_tags'],
        num_rows: 153
    })
})

In [3]:
# change dataset ner_tags from str to int

TEXT2ID = {
    "B-O": 0,
    "B-AC": 1,
    "B-LF": 2,
    "I-LF": 3,
}

# map the ner_tags to integers
dataset = dataset.map(lambda x: {"ner_tags": [TEXT2ID[tag] for tag in x["ner_tags"]]})

The first training example is the sentence 'For this purpose the Gothenburg Young Persons Empowerment Scale (GYPES) was developed.' The phrase 'Gothenburg Young Persons Empowerment Scale' has been labeled as a long form, and 'GYPES' as its abbreviation. The first token of the long form has a different label (`B-LF`, 2) than the other four (`I-LF`, 3), because it has been labeled as the start of the mention, following the BIO labeling scheme.

In [4]:
dataset["train"][0]

{'tokens': ['For',
  'this',
  'purpose',
  'the',
  'Gothenburg',
  'Young',
  'Persons',
  'Empowerment',
  'Scale',
  '(',
  'GYPES',
  ')',
  'was',
  'developed',
  '.'],
 'pos_tags': ['ADP',
  'DET',
  'NOUN',
  'DET',
  'PROPN',
  'PROPN',
  'PROPN',
  'PROPN',
  'PROPN',
  'PUNCT',
  'PROPN',
  'PUNCT',
  'AUX',
  'VERB',
  'PUNCT'],
 'ner_tags': [0, 0, 0, 0, 2, 3, 3, 3, 3, 0, 1, 0, 0, 0, 0]}
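To make the integer tags easier to read, they can be mapped back to their names and grouped into mentions. The following is a minimal sketch on the example above; `ID2TEXT` and `extract_spans` are hypothetical names introduced for illustration and are not part of the dataset or of any library.

```python
# Inverse of the TEXT2ID mapping defined above
ID2TEXT = {0: "B-O", 1: "B-AC", 2: "B-LF", 3: "I-LF"}

tokens = ["For", "this", "purpose", "the", "Gothenburg", "Young", "Persons",
          "Empowerment", "Scale", "(", "GYPES", ")", "was", "developed", "."]
ner_tags = [0, 0, 0, 0, 2, 3, 3, 3, 3, 0, 1, 0, 0, 0, 0]


def extract_spans(tokens, tag_ids):
    """Group tokens into (type, text) mentions; illustrative helper."""
    spans, current = [], None
    for token, tag_id in zip(tokens, tag_ids):
        prefix, kind = ID2TEXT[tag_id].split("-", 1)
        if kind == "O":
            current = None                      # outside any mention
        elif prefix == "B" or current is None or current[0] != kind:
            current = [kind, [token]]           # a new mention starts here
            spans.append(current)
        else:
            current[1].append(token)            # continue the open mention
    return [(kind, " ".join(words)) for kind, words in spans]


spans = extract_spans(tokens, ner_tags)
print(spans)
```

For this sentence the sketch yields one long form and one abbreviation, which matches the description of the example.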

## Preprocessing the texts

Now that we have our texts, we need to preprocess them. As our model, we choose one of the available PubMedBERTs: BERT models that have been pretrained on abstracts (and in this case, also full texts) from PubMed and therefore look like a good fit for the type of texts we're working with. We start by getting the tokenizer that was used for pretraining this model, because our texts need to be tokenized in exactly the same manner.

In [5]:
from transformers import AutoTokenizer

MODEL = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

# Unused alternative: a general-domain NER checkpoint
# model = "dslim/bert-base-NER"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

Let's now use this tokenizer to tokenize our texts. Note that every sentence in our corpus is a list of words, so we need to tell the tokenizer the text has already been split into words. In addition, we'll also ask the tokenizer to pad and/or truncate the texts. Sentences that are longer than 256 tokens will be truncated, and all sentences will be padded to the length of the (resulting) longest one.

In [6]:
train_texts = [item["tokens"] for item in dataset["train"]]
dev_texts = [item["tokens"] for item in dataset["validation"]]
test_texts = [item["tokens"] for item in dataset["test"]]

train_texts_encoded = tokenizer(train_texts, padding=True, truncation=True, max_length=256, is_split_into_words=True)
dev_texts_encoded = tokenizer(dev_texts, padding=True, truncation=True, max_length=256, is_split_into_words=True)
test_texts_encoded = tokenizer(test_texts, padding=True, truncation=True, max_length=256, is_split_into_words=True)

We now have three batches of `Encoding`s, one `Encoding` per sentence, which contain all the information that our model needs, in particular the ids of the tokens, their type ids, and their attention mask. The mask is used to make sure that the model ignores padding tokens. The type id of the tokens is always `0`, because our input consists of single sentences, and not sentence pairs.

In [7]:
train_texts_encoded[0]

Encoding(num_tokens=256, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
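To make the roles of truncation, padding and the attention mask concrete, here is a minimal sketch in plain Python. `pad_and_truncate` is a hypothetical helper, the token ids are made up, and `pad_id=0` is an assumption. Unlike the real tokenizer, the sketch simply cuts off long sequences and does not put the closing special token back at the end.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate to max_length, then pad to the longest remaining sequence."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token ids; max_length=4 stands in for the 256 used above
batch = [[101, 7, 8, 102], [101, 7, 102], [101, 7, 8, 9, 10, 102]]
encoded = pad_and_truncate(batch, max_length=4)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The second sequence is the only one that receives a padding id, and its attention mask ends in `0` at that position.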

If we look at the actual tokens, we see the tokenizer has applied a type of tokenization different from traditional tokenization. To keep the size of the vocabulary manageable, unknown words have been split up into known subword parts, such as `gothenburg`, which has been split up into `got`, `##hen` and `##burg`, where the `##` indicates a continuation of the preceding token. Because we use an uncased model, all tokens have also been lowercased.

At the same time, using a BERT model that was pretrained on data from PubMed has a clear benefit for biomedical text. While generic BERT would split up complex words such as `adenomatous` or `polyposis`, they occur frequently enough in PubMed data for PubMedBERT to treat them as one single token. The sentence below contains no such terms, and names like `gothenburg` and `gypes` are still split.

In [8]:
train_texts_encoded[0].tokens[:20]

['[CLS]',
 'for',
 'this',
 'purpose',
 'the',
 'got',
 '##hen',
 '##burg',
 'young',
 'persons',
 'empowerment',
 'scale',
 '(',
 'gy',
 '##pes',
 ')',
 'was',
 'developed',
 '.',
 '[SEP]']
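The subword splits above can be illustrated with a small sketch of greedy longest-match-first splitting, the strategy usually described for WordPiece-style tokenizers. This is not the tokenizer's actual implementation: `wordpiece` is a hypothetical function and the mini-vocabulary is made up so that it reproduces the splits we just saw.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter piece
        if piece is None:
            return [unk]                       # no known piece fits
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical mini-vocabulary, chosen to reproduce the splits shown above
vocab = {"for", "this", "got", "##hen", "##burg", "gy", "##pes", "scale"}

print(wordpiece("gothenburg", vocab))
print(wordpiece("gypes", vocab))
print(wordpiece("scale", vocab))
```

Words that are in the vocabulary as a whole, such as `scale` here, stay in one piece, which is the effect a domain-specific vocabulary has on frequent medical terms.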

## Preprocessing the labels

There's one remaining challenge. Because our new tokens are different from the original tokens in the corpus, we can't just train the model on the original labels: we need to align the labels with the new tokens. Luckily the tokenizer also provides us with a list of offsets for every new token, where we can easily identify tokens that do not correspond to the original words. 

For example, the offsets of the first training sentence tell us that `gothenburg` has been split up into three tokens: one for the first three characters of the word (indices 0 up to, but not including, 3), one for the next three characters (indices 3 to 6) and one for the last four (indices 6 to 10). Because the input was already split into words, the offsets are relative to the start of each word, so every word-initial token has 0 as its first offset.

Additionally, we can also identify special tokens, such as `[CLS]`, `[SEP]` and `[PAD]`, by the offset pair `(0, 0)`.

In [9]:
train_texts_encoded[0].offsets[:20]

[(0, 0),
 (0, 3),
 (0, 4),
 (0, 7),
 (0, 3),
 (0, 3),
 (3, 6),
 (6, 10),
 (0, 5),
 (0, 7),
 (0, 11),
 (0, 5),
 (0, 1),
 (0, 2),
 (2, 5),
 (0, 1),
 (0, 3),
 (0, 9),
 (0, 1),
 (0, 0)]
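As a quick check of how to read these offsets, the following sketch glues the subword tokens back into words, using only the tokens and offsets printed above. It is an illustration in plain Python and does not call the tokenizer.

```python
tokens = ["[CLS]", "for", "this", "purpose", "the", "got", "##hen", "##burg",
          "young", "persons", "empowerment", "scale", "(", "gy", "##pes", ")",
          "was", "developed", ".", "[SEP]"]
offsets = [(0, 0), (0, 3), (0, 4), (0, 7), (0, 3), (0, 3), (3, 6), (6, 10),
           (0, 5), (0, 7), (0, 11), (0, 5), (0, 1), (0, 2), (2, 5), (0, 1),
           (0, 3), (0, 9), (0, 1), (0, 0)]

words = []
for token, (start, end) in zip(tokens, offsets):
    if (start, end) == (0, 0):
        continue                       # special token such as [CLS] or [SEP]
    if start == 0:
        words.append(token)            # first piece of a new word
    else:
        piece = token[2:] if token.startswith("##") else token
        words[-1] += piece             # continuation of the previous word

print(words)
print(len(words))
```

The sketch recovers 15 lowercased words, one for each token of the original sentence.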

There are four labels in the corpus (`B-O`, `B-AC`, `B-LF` and `I-LF`), which we mapped to the integers 0 to 3 above.

In [10]:
# collect the distinct label ids; sorting makes the order deterministic
all_labels = sorted({label for item in dataset["train"] for label in item["ner_tags"]})
all_labels

[0, 1, 2, 3]

Now we have sufficient information to align the entity labels with the new tokens. For each sentence, we first create a numpy array filled with the label `-100`, a special label in the `transformers` library that will be ignored during training. Then we copy the original labels to the tokens at the start of every word. These have zero as their first offset position, and another number as their second position. This means the remaining tokens of the word will still have the label `-100`. This comes in handy during evaluation, as the subword tokenization will not lead to a higher number of entity labels. 
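The alignment rule can be sketched in plain Python on a small piece of the first training sentence. `align_labels` is a hypothetical helper; it returns `None` when the number of word-initial tokens does not match the number of labels, which can happen when a sentence is truncated. How such sentences should then be treated is a separate choice that the sketch leaves open.

```python
IGNORE = -100  # label id that is skipped when the loss is computed


def align_labels(word_labels, offsets):
    """Return one label per token, or None if the counts do not match."""
    is_word_start = [start == 0 and end != 0 for start, end in offsets]
    if sum(is_word_start) != len(word_labels):
        return None                      # e.g. the sentence was truncated
    labels = iter(word_labels)
    return [next(labels) if first else IGNORE for first in is_word_start]


# "( GYPES )" becomes [CLS] "(" "gy" "##pes" ")" [SEP]
offsets = [(0, 0), (0, 1), (0, 2), (2, 5), (0, 1), (0, 0)]
aligned = align_labels([0, 1, 0], offsets)
print(aligned)

# The same words after a hypothetical truncation that cut off ")"
truncated = [(0, 0), (0, 1), (0, 2), (2, 5), (0, 0)]
print(align_labels([0, 1, 0], truncated))
```

Only `gy` carries the abbreviation label, while `##pes` and the special tokens get `-100`.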

In [11]:
import numpy as np

def map_entities_to_tokens(items, encodings):

    labels = [item["ner_tags"] for item in items]
    offsets = [encoding.offsets for encoding in encodings]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, offsets):
        # start with an array filled with -100, which is ignored during training
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)

        # word-initial tokens: first offset position is 0 and the second is not 0
        word_starts = (arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)
        if len(doc_labels) != word_starts.sum():
            # mismatch (e.g. the sentence was truncated): report it and keep
            # the all -100 row so that labels stay aligned with the encodings
            print(len(doc_labels), word_starts.sum())
        else:
            doc_enc_labels[word_starts] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())

    return encoded_labels

train_labels = map_entities_to_tokens(dataset["train"], train_texts_encoded.encodings)
dev_labels = map_entities_to_tokens(dataset["validation"], dev_texts_encoded.encodings)
test_labels = map_entities_to_tokens(dataset["test"], test_texts_encoded.encodings)

323 204
139 134
216 207
192 188


Each line printed above is a sentence for which the number of labels (first number) differs from the number of word-initial tokens (second number), for example because the sentence was truncated. These sentences keep the label `-100` on every token, so they are ignored during training and evaluation while the labels stay aligned with the encodings.

This is the result for our first training example:

In [12]:
list(zip(train_texts_encoded[0].tokens[:20], train_labels[0][:20]))

[('[CLS]', -100),
 ('for', 0),
 ('this', 0),
 ('purpose', 0),
 ('the', 0),
 ('got', 2),
 ('##hen', -100),
 ('##burg', -100),
 ('young', 3),
 ('persons', 3),
 ('empowerment', 3),
 ('scale', 3),
 ('(', 0),
 ('gy', 1),
 ('##pes', -100),
 (')', 0),
 ('was', 0),
 ('developed', 0),
 ('.', 0),
 ('[SEP]', -100)]

## Setting up the dataset

We bring the encoded texts and their labels together in an NERDataset. This dataset returns for every item all the information in the encodings as a dictionary, and adds an additional key with the labels. All lists are converted to PyTorch tensors.

In [13]:
import torch

class NERDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # item idx of the encodings must belong to item idx of the labels
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


train_dataset = NERDataset(train_texts_encoded, train_labels)
dev_dataset = NERDataset(dev_texts_encoded, dev_labels)
test_dataset = NERDataset(test_texts_encoded, test_labels)

print(f"Train items: {len(train_dataset)}")
print(f"Dev items: {len(dev_dataset)}")
print(f"Test items: {len(test_dataset)}")

Train items: 1072
Dev items: 126
Test items: 153


Next, we set up the evaluation of the results. We compute an accuracy score on all labels, excluding `-100`. In named entity recognition, accuracy tends to be very high, because most tokens are not part of an entity mention. Even models that do not recognize a single entity will achieve a high accuracy on most datasets. Therefore we also compute precision, recall and F-score on only those tokens where the gold label or the prediction is an entity label (anything other than `0`). This is a much better measure of the model's success at identifying entities. Note that these scores are micro-averaged over all classes on that subset, so precision, recall and F-score always come out equal to each other.

In [14]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    flat_labels, flat_preds = [], []
    flat_ent_labels, flat_ent_preds = [], []
    for label_row, pred_row in zip(labels, preds):
        for label, pred_label in zip(label_row, pred_row):
            if label != -100:
                flat_labels.append(label)
                flat_preds.append(pred_label)
                if label != 0 or pred_label != 0:
                    flat_ent_labels.append(label)
                    flat_ent_preds.append(pred_label)
                    
        
    precision, recall, f1, _ = precision_recall_fscore_support(flat_ent_labels, flat_ent_preds, average='micro')
    acc = accuracy_score(flat_labels, flat_preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
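The gap between accuracy and entity-level scores is easy to see on a toy case. The sketch below is a plain-Python alternative to the scikit-learn call: it micro-averages over the entity labels only, so precision and recall can differ, which is not the same computation as in `compute_metrics` above. The helper names and the predictions are made up; the gold labels are those of the first training sentence.

```python
def entity_micro_prf(gold, pred, outside=0):
    """Micro-averaged precision, recall and F1 over all labels except `outside`."""
    tp = sum(g == p and g != outside for g, p in zip(gold, pred))
    n_pred = sum(p != outside for p in pred)     # predicted entity tokens
    n_gold = sum(g != outside for g in gold)     # gold entity tokens
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


gold = [0, 0, 0, 0, 2, 3, 3, 3, 3, 0, 1, 0, 0, 0, 0]     # first training sentence
all_outside = [0] * len(gold)                              # a model that finds nothing
partial = [0, 0, 0, 0, 2, 3, 3, 0, 0, 0, 1, 1, 0, 0, 0]    # made-up predictions

print(accuracy(gold, all_outside), entity_micro_prf(gold, all_outside))
print(accuracy(gold, partial), entity_micro_prf(gold, partial))
```

A model that predicts `0` everywhere still labels 9 of the 15 tokens correctly, yet its entity-level precision, recall and F-score are all zero.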

## Training the model

Now we're ready to train the model. This is easy to do with the `Trainer` class of the `transformers` package, which we feed with the model, the training and development dataset, the evaluation metrics, along with all training arguments. 

We'll train the model for 3 epochs, with a batch size of 8, and evaluate and save a checkpoint every 200 training steps.
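These settings determine how many optimization steps the run takes and at which steps checkpoints are written. The sketch below works this out under a few assumptions: a single device, no gradient accumulation, a final partial batch that is kept, and a training set of roughly 1,070 items (any size from 1,065 to 1,072 gives the same result). `training_schedule` is a hypothetical helper, not part of `transformers`.

```python
import math


def training_schedule(n_items, batch_size, epochs, save_steps):
    """Steps per epoch, total steps, and the steps at which checkpoints are saved."""
    steps_per_epoch = math.ceil(n_items / batch_size)
    total_steps = steps_per_epoch * epochs
    checkpoints = list(range(save_steps, total_steps + 1, save_steps))
    return steps_per_epoch, total_steps, checkpoints


# n_items is an assumed size, close to the training set used here
steps_per_epoch, total_steps, checkpoints = training_schedule(
    n_items=1070, batch_size=8, epochs=3, save_steps=200
)
print(steps_per_epoch, total_steps, checkpoints)
```

Under these assumptions the run takes 402 steps, which should match the total in the training progress bar, and checkpoints are expected at steps 200 and 400 only.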

In [15]:
from transformers import Trainer, TrainingArguments, AutoModelForTokenClassification, BertForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(MODEL, num_labels=len(all_labels))


#device = torch.device("cpu")
#model.to(device)

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total # of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=8,   # batch size for evaluation
    warmup_steps=int(len(train_dataset)/8),  # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    evaluation_strategy="steps",
    eval_steps=200,
    save_steps=200,
    save_total_limit=10,
    load_best_model_at_end=True,
    no_cuda=False
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,         # training dataset
    eval_dataset=dev_dataset,            # evaluation dataset
)

trainer.train()


Some weights of BertForTokenClassification were not initialized from the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


  0%|          | 0/402 [00:00<?, ?it/s]

KeyboardInterrupt: 

## Evaluating the results

Finally, we evaluate the model on the test dataset.

In [None]:
trainer.evaluate(test_dataset)

  0%|          | 0/20 [00:00<?, ?it/s]

{'eval_loss': 1.3932572603225708,
 'eval_accuracy': 0.24,
 'eval_f1': 0.024390243902439025,
 'eval_precision': 0.024390243902439025,
 'eval_recall': 0.024390243902439025,
 'eval_runtime': 3.5713,
 'eval_samples_per_second': 42.842,
 'eval_steps_per_second': 5.6}

For inference, we can load the model and combine it with the tokenizer in an `ner` pipeline. Now we can easily label new texts and inspect the results.

In [None]:
from transformers import pipeline

# With 402 training steps and save_steps=200, checkpoints are written at steps 200 and 400.
# Adjust the path if your run saved different checkpoints in ./results.
model = AutoModelForTokenClassification.from_pretrained("results/checkpoint-400")
nlp = pipeline("ner", tokenizer=tokenizer, model=model)

In [None]:
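print(dataset["test"][1])

# Passing a list of tokens makes the pipeline treat every token as a separate text,
# so the result contains one list of predictions per token.
nlp(dataset["test"][1]["tokens"])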

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'id': '1', 'tokens': ['Ataxia', '-', 'telangiectasia', '(', 'A', '-', 'T', ')', 'is', 'a', 'recessive', 'multi', '-', 'system', 'disorder', 'caused', 'by', 'mutations', 'in', 'the', 'ATM', 'gene', 'at', '11q22', '-', 'q23', '(', 'ref', '.', '3', ')', '.'], 'ner_tags': [1, 2, 2, 0, 1, 2, 2, 0, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}


[[{'entity': 'LABEL_1',
   'score': 0.9875536,
   'index': 1,
   'word': 'ataxia',
   'start': 0,
   'end': 6}],
 [{'entity': 'LABEL_0',
   'score': 0.99985576,
   'index': 1,
   'word': '-',
   'start': 0,
   'end': 1}],
 [{'entity': 'LABEL_1',
   'score': 0.9942285,
   'index': 1,
   'word': 'telangiect',
   'start': 0,
   'end': 10},
  {'entity': 'LABEL_2',
   'score': 0.99018914,
   'index': 2,
   'word': '##asia',
   'start': 10,
   'end': 14}],
 [{'entity': 'LABEL_0',
   'score': 0.9999013,
   'index': 1,
   'word': '(',
   'start': 0,
   'end': 1}],
 [{'entity': 'LABEL_0',
   'score': 0.9998692,
   'index': 1,
   'word': 'a',
   'start': 0,
   'end': 1}],
 [{'entity': 'LABEL_0',
   'score': 0.99985576,
   'index': 1,
   'word': '-',
   'start': 0,
   'end': 1}],
 [{'entity': 'LABEL_0',
   'score': 0.9998398,
   'index': 1,
   'word': 't',
   'start': 0,
   'end': 1}],
 [{'entity': 'LABEL_0',
   'score': 0.99984837,
   'index': 1,
   'word': ')',
   'start': 0,
   'end': 1}],
 [{
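The pipeline reports generic names such as `LABEL_1` when the model configuration carries no label names. One way to make the output readable is to translate these names afterwards, as in the sketch below. It assumes a model trained with the `TEXT2ID` mapping from earlier; `readable` is a hypothetical helper and the prediction is made up in the format shown above. Alternatively, `from_pretrained` accepts `id2label` and `label2id` arguments that store the names in the model configuration.

```python
TEXT2ID = {"B-O": 0, "B-AC": 1, "B-LF": 2, "I-LF": 3}
ID2TEXT = {idx: tag for tag, idx in TEXT2ID.items()}


def readable(prediction):
    """Replace a generic 'LABEL_<n>' entity with the corresponding tag name."""
    idx = int(prediction["entity"].rsplit("_", 1)[1])
    return {**prediction, "entity": ID2TEXT[idx]}


# Made-up prediction in the format returned by the pipeline
example = {"entity": "LABEL_1", "score": 0.99, "index": 1,
           "word": "gy", "start": 0, "end": 2}
converted = readable(example)
print(converted["entity"], converted["word"])
```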

In [None]:
# Sanity check of the scikit-learn reporting functions on a toy example
from sklearn.metrics import classification_report, confusion_matrix

true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
pred = [0, 1, 2, 0, 1, 2, 0, 1, 2]

print(confusion_matrix(true, pred))
print(classification_report(true, pred))