## Exercise: Biomedical named entity recognition

We get the helper Python file and the files containing the training and validation data.

In [1]:
!rm -f *py *tsv
!wget http://www.cse.chalmers.se/~richajo/dat450/assignments/ner_util.py
!wget http://www.cse.chalmers.se/~richajo/dat450/assignments/a4/train.tsv
!wget http://www.cse.chalmers.se/~richajo/dat450/assignments/a4/devel.tsv



The infrastructure code (vocabulary management, training loop, word embedding loading, evaluation, etc.) has been moved into a separate file, <code>ner_util.py</code>, which you can get [here](http://www.cse.chalmers.se/~richajo/dat450/assignments/ner_util.py). Download this file and put it in the directory where you are executing your notebook. Then import it and you are ready to go.

In [2]:
import ner_util

## Reading the data in a tabular format

The following function reads a file represented in a tabular format. In this format, each row corresponds to one token. For each token, there is a word and the BIO-coded named entity label, separated by whitespace. The sentences are separated by empty lines. Here is an example of a sentence.
```
In              O
conclusion      O
,               O
hyperammonemic  B-Disease
encephalopathy  I-Disease
can             O
occur           O
in              O
patients        O
receiving       O
continuous      O
infusion        O
of              O
5               B-Chemical
-               I-Chemical
FU              I-Chemical
.               O
```
The function `ner_util.read_data` reads the file in this format and returns sentences and their corresponding BIO labels.
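
The source of `ner_util.read_data` is not shown here, so the following is only a minimal sketch of how a reader for this format could look. The function name `read_bio_file` and the in-memory sample are made up for illustration; the sketch assumes exactly two whitespace-separated fields per row.

```python
import io

def read_bio_file(lines):
    """Parse token/label rows into parallel lists of sentences and label sequences."""
    sentences, labels = [], []
    words, tags = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # An empty line closes the current sentence.
            if words:
                sentences.append(words)
                labels.append(tags)
                words, tags = [], []
            continue
        # Assumption: each row holds exactly one word and one label.
        word, tag = line.split()
        words.append(word)
        tags.append(tag)
    # The last sentence may not be followed by an empty line.
    if words:
        sentences.append(words)
        labels.append(tags)
    return sentences, labels

# A tiny in-memory "file" with two (partial) sentences.
sample = io.StringIO(
    "infusion  O\n"
    "of        O\n"
    "5         B-Chemical\n"
    "-         I-Chemical\n"
    "FU        I-Chemical\n"
    ".         O\n"
    "\n"
    "High      O\n"
    "doses     O\n"
)
X, Y = read_bio_file(sample)
print(X[0])
print(Y[0])
```

The two returned lists are parallel: `X[i][j]` is the j-th token of sentence i and `Y[i][j]` is its label.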

In [3]:
Xtrain, Ytrain = ner_util.read_data('train.tsv')
Xval, Yval = ner_util.read_data('devel.tsv')

An example:

In [4]:
Xtrain[100]

['High',
 'doses',
 'of',
 'vitamin',
 'D',
 'are',
 'known',
 'to',
 'cause',
 'calcification',
 'of',
 'the',
 'artery',
 'media',
 'in',
 'as',
 'little',
 'as',
 '3',
 'to',
 '4',
 'days',
 '.']

In [5]:
Ytrain[100]

['O',
 'O',
 'O',
 'B-Chemical',
 'I-Chemical',
 'O',
 'O',
 'O',
 'O',
 'B-Disease',
 'I-Disease',
 'I-Disease',
 'I-Disease',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O']
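
To make the BIO coding concrete, here is a small sketch that groups a label sequence into entity spans. The helper `bio_to_spans` is hypothetical (it is not part of `ner_util` as far as this notebook shows), and ignoring an `I-` tag that does not continue an open entity is just one possible design choice.

```python
def bio_to_spans(labels):
    """Return (entity_type, start, end) triples, with end exclusive."""
    spans = []
    start, current = None, None
    for i, label in enumerate(labels):
        # Close the open entity unless this label continues it.
        continues = label.startswith('I-') and current == label[2:]
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        if label.startswith('B-'):
            start, current = i, label[2:]
        # An I- tag with no matching open entity is ignored in this sketch.
    if current is not None:
        spans.append((current, start, len(labels)))
    return spans

# The first 14 tokens of the training example shown above.
words = ['High', 'doses', 'of', 'vitamin', 'D', 'are', 'known', 'to', 'cause',
         'calcification', 'of', 'the', 'artery', 'media']
labels = ['O', 'O', 'O', 'B-Chemical', 'I-Chemical', 'O', 'O', 'O', 'O',
          'B-Disease', 'I-Disease', 'I-Disease', 'I-Disease', 'O']

spans = bio_to_spans(labels)
for entity_type, start, end in spans:
    print(entity_type, ' '.join(words[start:end]))
```

For this sentence the sketch yields one `Chemical` span ("vitamin D") and one `Disease` span ("calcification of the artery").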

## Loading pre-trained word embeddings

In [6]:
!wget https://www.cse.chalmers.se/~richajo/dat450/assignments/a4/PubMed-and-PMC-w2v-200000.bin


In [7]:
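# Option 1: general-domain GloVe vectors from gensim's built-in collection.
emb_model = ner_util.load_gensim_vectors('glove-wiki-gigaword-100', builtin=True)
# Option 2: the biomedical word2vec vectors downloaded in the previous cell.
#emb_model = ner_util.load_gensim_vectors('PubMed-and-PMC-w2v-200000.bin', builtin=False)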

 done!


## Implementing a sequence labeling model

Below is a skeleton of a solution. Add the missing pieces.

In [14]:
import torch
from torch import nn

class SequenceLabelingModel(nn.Module):

    def __init__(self, seq_labeler):
        super().__init__()

        p = seq_labeler.params

        # The model consists of a word embedding layer, an LSTM, and a
        # linear output unit. We use the vocabulary to create the embedding layer.
        self.word_embedding = seq_labeler.word_voc.make_embedding_layer(finetune=p.finetune_word_emb,
                                                                        emb_dim=p.default_word_emb_dim)

        # The dimensionality of the word embedding model.
        word_emb_dim = self.word_embedding.weight.shape[1]
        output_dim = seq_labeler.n_labels

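        # Output unit: an LSTM over the word embeddings, then dropout and a
        # linear layer that maps each hidden state to label scores.
        # batch_first=True because forward() passes tensors of shape
        # (batch_size, max_sen_length, emb_dim).
        lstm_dim = 126
        self.lstm = nn.LSTM(word_emb_dim, lstm_dim, batch_first=True)
        self.dropout = nn.Dropout(p=0.1)
        self.linear = nn.Linear(lstm_dim, output_dim)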


    def forward(self, words):
        # words is a tensor of integer-encoded words, with shape (batch_size, max_sen_length)

        # After embedding the words, the shape is (batch_size, max_sen_length, emb_dim).
        word_repr = self.word_embedding(words)

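        # Run the LSTM (its second return value, the final states, is not needed),
        # then apply dropout and the linear layer at every position.
        lstm_out, _ = self.lstm(word_repr)
        return self.linear(self.dropout(lstm_out))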
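
The comments in `forward` state that the embedded input has shape `(batch_size, max_sen_length, emb_dim)`. By default, `nn.LSTM` expects `(seq_len, batch, input_size)`; with `batch_first=True` it accepts batch-major input. The following standalone shape check, with made-up dimensions and random data in place of real embeddings, sketches what happens in that case.

```python
import torch
from torch import nn

# Made-up sizes for illustration only.
batch_size, max_len, emb_dim, hidden_dim, n_labels = 4, 7, 100, 126, 5

torch.manual_seed(0)
lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
linear = nn.Linear(hidden_dim, n_labels)

# Random stand-in for the output of the embedding layer.
word_repr = torch.randn(batch_size, max_len, emb_dim)

with torch.no_grad():
    lstm_out, (h_n, c_n) = lstm(word_repr)
    scores = linear(lstm_out)

    # With batch_first=True, each sentence is processed independently:
    # changing sentence 1 leaves the output for sentence 0 untouched.
    perturbed = word_repr.clone()
    perturbed[1] += 1.0
    lstm_out_perturbed, _ = lstm(perturbed)
    independent = torch.allclose(lstm_out[0], lstm_out_perturbed[0])

print(lstm_out.shape)  # one hidden state per token
print(scores.shape)    # one score per label per token
print(independent)
```

The final hidden state `h_n` keeps the layout `(num_layers, batch_size, hidden_dim)` regardless of `batch_first`; only the input and the per-token output are batch-major.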


## Training the system

As usual, we create an object that contains all hyperparameters.

In [15]:
class NERParameters:

    # Random seed, for reproducibility.
    random_seed = 0

    # cuda or cpu
    device = 'cuda'

    # NB: this hyperparameter is only used if we are training the embedding
    # model from scratch.
    default_word_emb_dim = 128

    # Whether or not to fine-tune the word embedding model.
    finetune_word_emb = False

    # Training parameters
    n_epochs = 20
    batch_size = 32
    learning_rate = 0.005
    weight_decay = 0

    # Word dropout rate.
    word_dropout_prob = 0.0

    # Set the following to True to enable character tensors.
    use_characters = False
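
Because the hyperparameters are plain class attributes, one convenient way to try a variant is to subclass the container and override only what changes. The sketch below uses stand-in class names and hypothetical override values; it assumes the helper code only reads attributes from the instance, as the model's use of `seq_labeler.params` suggests.

```python
class BaseParameters:
    # A reduced copy of the container above, for illustration.
    n_epochs = 20
    batch_size = 32
    learning_rate = 0.005
    finetune_word_emb = False

class FinetuneParameters(BaseParameters):
    # Only the overridden attributes need to be listed (values are hypothetical).
    finetune_word_emb = True
    learning_rate = 0.001

p = FinetuneParameters()
print(p.n_epochs, p.learning_rate, p.finetune_word_emb)
```

Attributes that are not overridden, such as `n_epochs`, are inherited from the base class.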

Now, we are ready to train. When creating the NER system, the first argument should be the hyperparameter container. The second should be a function that builds the neural network; we just use the constructor of the class we defined above.

In [16]:

ner_system = ner_util.SequenceLabeler(NERParameters(), SequenceLabelingModel,
                                      pretrained_word_emb=emb_model)
ner_system.fit(Xtrain, Ytrain, Xval, Yval)

Epoch 1: train loss = 0.3478, val f1: 0.4968, time = 1.8255
Epoch 2: train loss = 0.2441, val f1: 0.5417, time = 1.4899
Epoch 3: train loss = 0.2224, val f1: 0.5768, time = 1.3298
Epoch 4: train loss = 0.2096, val f1: 0.5750, time = 0.9233
Epoch 5: train loss = 0.1985, val f1: 0.5827, time = 0.8630
Epoch 6: train loss = 0.1909, val f1: 0.6016, time = 0.8631
Epoch 7: train loss = 0.1844, val f1: 0.6039, time = 0.8719
Epoch 8: train loss = 0.1792, val f1: 0.5879, time = 0.8638
Epoch 9: train loss = 0.1752, val f1: 0.6004, time = 0.8547
Epoch 10: train loss = 0.1715, val f1: 0.6145, time = 0.8449
Epoch 11: train loss = 0.1686, val f1: 0.5980, time = 0.8732
Epoch 12: train loss = 0.1649, val f1: 0.6114, time = 0.8734
Epoch 13: train loss = 0.1631, val f1: 0.6012, time = 0.8775
Epoch 14: train loss = 0.1610, val f1: 0.6025, time = 0.8555
Epoch 15: train loss = 0.1582, val f1: 0.6156, time = 1.3041
Epoch 16: train loss = 0.1556, val f1: 0.5986, time = 1.4776
Epoch 17: train loss = 0.1550, va

0.618962610334226

To illustrate the output of the system, the following cell predicts the BIO tags for a single tokenized sentence:

In [17]:
ner_system.predict(['The patient suffered from malaria .'.split()])

[['O', 'O', 'O', 'O', 'B-Disease', 'O']]

Alternatively, a visual format:

In [18]:
ner_util.show_entities(ner_system, ['The patient suffered from malaria .'.split()])

## Using pre-trained BERT models

In [19]:
import transformers
from transformers import AutoTokenizer, AutoModel

Set `BERT_MODEL_NAME` to a Hugging Face model identifier.

In [38]:
# BERT_MODEL_NAME = 'bert-base-uncased'
BERT_MODEL_NAME = 'dmis-lab/biobert-v1.1'

Implement the missing pieces of code here.

In [39]:
class BERTSequenceModel(nn.Module):

    def __init__(self, seq_labeler):
        super().__init__()
        p = seq_labeler.params
        self.bert = AutoModel.from_pretrained(BERT_MODEL_NAME)

        bert_output_size = self.bert.config.hidden_size
        output_dim = seq_labeler.n_labels

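        # Output unit: dropout followed by a linear layer that maps each
        # token representation to label scores.
        self.dropout = nn.Dropout(p=0.1)
        self.linear = nn.Linear(bert_output_size, output_dim)

    def forward(self, words):
        # last_hidden_state has shape (batch_size, max_len, bert_output_size).
        bert_out = self.bert(words).last_hidden_state
        return self.linear(self.dropout(bert_out))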

Now, let's train and evaluate the BERT-based NER model.

In [40]:
class SeqLabelerParameters:
    device = 'cuda'
    random_seed = 0
    n_epochs = 3
    batch_size = 32
    learning_rate = 5e-5
    weight_decay = 0
    word_dropout_prob = 0.0
    use_characters = False
    bert_max_len = 128


tokenizer = AutoTokenizer.from_pretrained(BERT_MODEL_NAME)
ner_system = ner_util.SequenceLabeler(SeqLabelerParameters(), BERTSequenceModel,
                                      bert_tokenizer=tokenizer)
ner_system.fit(Xtrain, Ytrain, Xval, Yval)

Epoch 1: train loss = 0.4121, val f1: 0.8399, time = 90.0496
Epoch 2: train loss = 0.0620, val f1: 0.8639, time = 89.9893

0.874538366844481
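
BERT tokenizers split words into subword pieces, while the labels in the data are given per word, so the two have to be aligned somewhere. In this notebook that is handled inside `ner_util`, whose code is not shown. The sketch below only illustrates one common convention (label the first piece of each word, ignore the rest); the splitter `toy_subword_split` is a hypothetical stand-in and not a real WordPiece tokenizer.

```python
def toy_subword_split(word, max_piece_len=4):
    """Hypothetical stand-in for a subword tokenizer: fixed-size pieces."""
    pieces = [word[i:i + max_piece_len] for i in range(0, len(word), max_piece_len)]
    # Mark continuation pieces with '##', as WordPiece vocabularies do.
    return [pieces[0]] + ['##' + piece for piece in pieces[1:]]

IGNORE = None  # marker for positions that carry no label of their own

def align_labels(words, labels):
    """Give each word's label to its first piece; mark the other pieces as ignored."""
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        sub = toy_subword_split(word)
        pieces.extend(sub)
        piece_labels.extend([label] + [IGNORE] * (len(sub) - 1))
    return pieces, piece_labels

words = ['suffered', 'from', 'malaria', '.']
labels = ['O', 'O', 'B-Disease', 'O']
pieces, piece_labels = align_labels(words, labels)
for piece, label in zip(pieces, piece_labels):
    print(piece, label)
```

At prediction time, the same convention would read off one tag per word from the position of its first piece, so the output has one label per input token again.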