# Train BioSemantics NER model

In this notebook we fine-tune a pre-trained DistilBERT model on the
BioSemantics NER corpus (USPTO patents). We also measure precision, recall, F1-score
and accuracy on its test partition. Concretely:

* We download the BioSemantics corpus and DistilBERT from HuggingFace
* We pre-process the corpus, breaking each sentence into WordPiece (subword) tokens and propagating IOB labels
* We fine-tune the model for 3 epochs, keeping logs on Weights & Biases
* We evaluate it on the test set

## Setting up the notebook

We start a Python 3.10.x kernel, install dependencies, set up environment
variables and log in to HuggingFace.

**Note:** this notebook assumes you are running Jupyter on a CUDA-enabled host.

In [None]:
%%capture
!pip install -r ./requirements.txt

In [None]:
import os
os.environ['CUDA_DEVICE_ORDER']='PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES']='0'
os.environ['CUDA_LAUNCH_BLOCKING']='1'
os.environ['TORCH_USE_CUDA_DSA']='1'

In [None]:
# Will launch a prompt to enter your HuggingFace access token
from huggingface_hub import notebook_login
notebook_login()


## Library imports

We import the main libraries and functions.

In [None]:
from transformers import AutoModelForTokenClassification, AutoTokenizer
from torch.utils.data import DataLoader
from transformers import get_scheduler
from accelerate import Accelerator
from torch.optim import AdamW
from huggingface_hub import Repository, get_full_repo_name
from tqdm.auto import tqdm
import torch
import evaluate
from transformers import DataCollatorForTokenClassification
from datasets import load_dataset
import transformers
from transformers import TrainingArguments, Trainer
import numpy as np

We will fine-tune a DistilBERT checkpoint and save the result on HuggingFace. As our dataset, we use the BioSemantics sample.

In [None]:
# Root DistilBERT model
model_checkpoint = "distilbert/distilbert-base-cased"

In [None]:
# Root dataset
data_checkpoint = "camilothorne/biosemantics_uspto"

## Download tokenizer and pre-process data

In [None]:
# Raw training data
raw_train = load_dataset(data_checkpoint, field='data', split='train')

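# We split the training data into a training and validation set.
# The split is computed once so that both parts are disjoint; the seed
# value is arbitrary and only there to make the split reproducible.
split = raw_train.train_test_split(test_size=0.1, seed=42)
train_data = split['train']
val_data = split['test']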

# Test data
test_data = load_dataset(data_checkpoint, field='data', split='test')

# Labels
labs_train = load_dataset(data_checkpoint, field='maps', split='train')
labs_val   = labs_train
labs_test  = load_dataset(data_checkpoint, field='maps', split='test')
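
A validation set is only useful if it shares no rows with the training set. The sketch below uses plain Python lists as a stand-in for the dataset (the helper `split_once` and the seeds are hypothetical, not part of the `datasets` API) to illustrate why both halves should be read from a single shuffled split rather than from two independent ones.

```python
import random

def split_once(items, test_size, seed):
    """Shuffle a copy of `items` and cut it into (train, held_out)."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_test = round(len(items) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(1000))

# One split: the two halves cannot overlap.
train, val = split_once(rows, 0.1, seed=0)
clean_overlap = len(set(train) & set(val))

# Two independent splits: validation rows leak into the training half.
train_a, _ = split_once(rows, 0.1, seed=1)
_, val_b = split_once(rows, 0.1, seed=2)
leak = len(set(train_a) & set(val_b))

print(f"single split overlap: {clean_overlap}, independent splits overlap: {leak}")
```

With two independent shuffles, roughly 90% of the validation rows are expected to also sit in the training half, which inflates validation scores.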


In [None]:
train_data

Dataset({
    features: ['text', 'ner_tags', 'labels'],
    num_rows: 219691
})

In [None]:
val_data

Dataset({
    features: ['text', 'ner_tags', 'labels'],
    num_rows: 24411
})

In [None]:
test_data

Dataset({
    features: ['text', 'ner_tags', 'labels'],
    num_rows: 25220
})

In [None]:
"""
Load/download tokenizer
"""

# DistilBERT
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


Once the data and tokenizer are downloaded, we pre-process (tokenize) the data using the tokenizer's WordPiece subword segmentation.
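
DistilBERT's tokenizer segments words with WordPiece, which at tokenization time amounts to a greedy longest-match-first lookup in the vocabulary. The following is a minimal sketch of that idea for a single word; the vocabulary `toy_vocab` is made up for illustration and the function is not the library's implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of one word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary, for illustration only.
toy_vocab = {"thi", "##ol", "stat", "##in", "a"}
thiol_pieces = wordpiece("thiol", toy_vocab)
statin_pieces = wordpiece("statin", toy_vocab)
unknown_pieces = wordpiece("xyz", toy_vocab)
print(thiol_pieces, statin_pieces, unknown_pieces)
```

Because one word can become several pieces, the word-level labels have to be re-aligned to the pieces before training.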

In [None]:
"""
Process data functions
"""

def align_labels_with_tokens(labels, word_ids):
    '''
    Aligns word-level BIO labels with the WordPiece tokens
    of a sentence. The label of the original word is
    propagated to all of its sub-word tokens, with B-XXX
    turned into I-XXX on continuation tokens.
    '''
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX. Label ids in
            # this corpus are not laid out as (B, I) pairs, so we look the
            # name up in id2label/label2id (defined below, before this
            # function is first called) instead of relying on id parity.
            name = id2label[label]
            if name.startswith('B-'):
                label = label2id.get('I-' + name[2:], label)
            new_labels.append(label)

    return new_labels

def tokenize_and_align_labels_d(examples):
    '''
    Runs the prior function across a complete
    dataset object.
    '''
    tokenized_inputs = tokenizer(
        examples["text"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["labels"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))
    tokenized_inputs["labels"] = new_labels

    return tokenized_inputs
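
To make the alignment concrete, here is a self-contained toy run. The label map, the sentence and the `word_ids` list are hypothetical stand-ins for what the tokenizer would return, and the B-to-I conversion is done by label name, which is one way to keep it independent of how the ids happen to be ordered.

```python
# Hypothetical label map, for illustration only.
toy_id2label = {0: "O", 1: "B-G", 2: "I-G"}
toy_label2id = {name: i for i, name in toy_id2label.items()}

def align(labels, word_ids):
    """Spread word-level labels over sub-word tokens."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)             # special token: ignored by the loss
        elif word_id != previous:
            aligned.append(labels[word_id])  # first piece keeps the word label
        else:
            name = toy_id2label[labels[word_id]]
            if name.startswith("B-"):
                # Continuation piece of a B- word becomes I- (if it exists).
                aligned.append(toy_label2id.get("I-" + name[2:], labels[word_id]))
            else:
                aligned.append(labels[word_id])
        previous = word_id
    return aligned

# Words ["a", "statin", "works"] with labels O, B-G, O;
# "statin" is assumed to be split into two pieces.
word_ids = [None, 0, 1, 1, 2, None]
aligned = align([0, 1, 0], word_ids)
print(aligned)  # [-100, 0, 1, 2, 0, -100]
```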

In [None]:
'''
Get labels
'''

label_names    = labs_train['tag']
id2label       = {i: label for i, label in enumerate(label_names)}
label2id       = {v: k for k, v in id2label.items()}

In [None]:
print(label_names)
print(id2label)
print(label2id)
print()
print(f'Number of labels: {len(label_names)}')
print(f'Number of labels identical across dicts? {len(id2label)==len(label_names)==len(label2id)}')

['I-C', 'I-M', 'B-OCRERRORSPELL', 'B-F', 'I-OCRERRORSPELL', 'B-I', 'I-D', 'I-B', 'B-C', 'B-B', 'I-G', 'I-R', 'I-T', 'B-M', 'B-MOA', 'B-D', 'B-Disease', 'B-T', 'B-R', 'O', 'B-G', 'I-Disease', 'I-MOA', 'I-F']
{0: 'I-C', 1: 'I-M', 2: 'B-OCRERRORSPELL', 3: 'B-F', 4: 'I-OCRERRORSPELL', 5: 'B-I', 6: 'I-D', 7: 'I-B', 8: 'B-C', 9: 'B-B', 10: 'I-G', 11: 'I-R', 12: 'I-T', 13: 'B-M', 14: 'B-MOA', 15: 'B-D', 16: 'B-Disease', 17: 'B-T', 18: 'B-R', 19: 'O', 20: 'B-G', 21: 'I-Disease', 22: 'I-MOA', 23: 'I-F'}
{'I-C': 0, 'I-M': 1, 'B-OCRERRORSPELL': 2, 'B-F': 3, 'I-OCRERRORSPELL': 4, 'B-I': 5, 'I-D': 6, 'I-B': 7, 'B-C': 8, 'B-B': 9, 'I-G': 10, 'I-R': 11, 'I-T': 12, 'B-M': 13, 'B-MOA': 14, 'B-D': 15, 'B-Disease': 16, 'B-T': 17, 'B-R': 18, 'O': 19, 'B-G': 20, 'I-Disease': 21, 'I-MOA': 22, 'I-F': 23}

Number of labels: 24
Number of labels identical across dicts? True


In [None]:
def print_example(datapoint):
    '''
    Pretty prints a sample datapoint in a dataset object.
    '''
    words = datapoint["text"]
    labels = datapoint["labels"]
    line1 = ""
    line2 = ""
    for word, label in zip(words, labels):
        full_label = id2label[label]
        max_length = max(len(word), len(full_label))
        line1 += word + " " * (max_length - len(word) + 1)
        line2 += full_label + " " * (max_length - len(full_label) + 1)
    print(line1)
    print(line2)

In [None]:
print_example(train_data[11])
print_example(test_data[10])
print()
print_example(train_data[1013])
print_example(val_data[210])
print()
print_example(train_data[333])
print_example(val_data[999])

Liposomes may also be used 
O         O   O    O  O    
The term “ thiol ” or “ sulfhydryl ” , alone or in combination , means a — SH group 
O   O    O B-G   O O  O B-G        O O O     O  O  O           O O     O O O  O     

, a statin , a synthetic statin , or an analog thereof 
O O B-G    O O O         O      O O  O  O      O       
The filtrate was concentrated to obtain a crude product 
O   O        O   O            O  O      O O     O       

26 M ) was cooled to − 78 ° C 
O  O O O   O      O  O O  O O 
If the confirmatory PSA value is less than the screening PSA value , then an additional test for rising PSA will be required to document progression ; Antiandrogen Withdrawal Patients who are receiving an antiandrogen as part of primary androgen ablation must demonstrate disease progression following discontinuation of antiandrogen 
O  O   O            O   O     O  O    O    O   O         O   O     O O    O  O          O    O   O      O   O    O  O        O  O        O           

In [None]:
'''
Preprocess data (training set)
'''

train_datasets = train_data.map(
    tokenize_and_align_labels_d,
    batched=True,
    remove_columns=train_data.column_names,
)


In [None]:
'''
Preprocess data (validation set)
'''

val_datasets = val_data.map(
    tokenize_and_align_labels_d,
    batched=True,
    remove_columns=val_data.column_names,
)


In [None]:
'''
Preprocess data (test set)
'''

test_datasets = test_data.map(
    tokenize_and_align_labels_d,
    batched=True,
    remove_columns=test_data.column_names,
)


In [None]:
'''
Preprocess data (data collator)
'''

data_collator    = DataCollatorForTokenClassification(tokenizer=tokenizer)
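
The collator pads every sequence in a batch to the length of the longest one, and pads the labels with `-100` so that padding positions are ignored by the loss. The sketch below mimics that behaviour with plain lists; `pad_batch` and the token ids are hypothetical, and this is not the library's code.

```python
def pad_batch(batch, pad_token_id=0, label_pad_id=-100):
    """Right-pad input ids, attention masks and labels to a common length."""
    max_len = max(len(example["input_ids"]) for example in batch)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for example in batch:
        length = len(example["input_ids"])
        gap = max_len - length
        padded["input_ids"].append(example["input_ids"] + [pad_token_id] * gap)
        padded["attention_mask"].append([1] * length + [0] * gap)
        padded["labels"].append(example["labels"] + [label_pad_id] * gap)
    return padded

# Two toy examples of different lengths (token ids are made up).
toy_batch = [
    {"input_ids": [101, 7, 8, 9, 102], "labels": [-100, 0, 1, 2, -100]},
    {"input_ids": [101, 7, 102], "labels": [-100, 0, -100]},
]
padded = pad_batch(toy_batch)
print(padded["labels"])
```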

## Download model, train and evaluate

In [None]:
"""
Load/download model
"""

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
    num_labels=len(label_names),
    ignore_mismatched_sizes=True
)


Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
metric = evaluate.load("seqeval") # Metrics for NER tasks.

def compute_metrics_x(eval_preds):
    '''
    Function to measure precision, recall,
    F1-score and accuracy on NER predictions (at training, validation
    and test time).
    '''
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions,
                                 references=true_labels,
                                 zero_division=0)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
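
The two steps that happen before the metric is computed, taking the argmax over the logits and dropping every position labelled `-100`, can be traced on a toy batch. The names `toy_names`, `logits` and `labels` below are made-up values; plain lists stand in for the NumPy arrays used in the notebook.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

toy_names = ["O", "B-G", "I-G"]  # hypothetical label list

# One sentence, four token positions, three scores per position.
logits = [[[2.0, 0.1, 0.1], [0.2, 1.5, 0.3], [0.9, 0.2, 0.1], [3.0, 0.0, 0.0]]]
labels = [[-100, 1, 2, -100]]    # first and last positions are special tokens

pred_ids = [[argmax(scores) for scores in sentence] for sentence in logits]

# Keep only positions whose gold label is not -100.
gold_tags = [[toy_names[l] for l in sent if l != -100] for sent in labels]
pred_tags = [
    [toy_names[p] for p, l in zip(pred, sent) if l != -100]
    for pred, sent in zip(pred_ids, labels)
]
print(gold_tags, pred_tags)
```

The resulting lists of tag strings are the form that `seqeval` expects for its `references` and `predictions` arguments.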


In [None]:
# DistilBERT
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-ner-biosem",
    eval_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    do_train=True,
    do_eval=True,
    fp16=True,
    save_strategy="epoch",
    load_best_model_at_end=True,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

In [None]:
'''
Instantiate trainer
'''

trainer = Trainer(
    model,
    args,
    train_dataset=train_datasets,
    eval_dataset=val_datasets,
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics_x
)

Note that the HuggingFace `Trainer` logs the learning rate and the other training statistics
to Weights & Biases, and will prompt you to enter your W&B API key.

In [None]:
'''
Train
'''

trainer.train()





| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.1368 | 0.116285 | 0.757578 | 0.815597 | 0.785517 | 0.95697 |
| 2 | 0.1136 | 0.089479 | 0.807656 | 0.853109 | 0.829761 | 0.967235 |
| 3 | 0.0948 | 0.081546 | 0.82305 | 0.87182 | 0.846734 | 0.970635 |


TrainOutput(global_step=10299, training_loss=0.13440527093214877, metrics={'train_runtime': 4240.1253, 'train_samples_per_second': 155.437, 'train_steps_per_second': 2.429, 'total_flos': 4.37832164322887e+16, 'train_loss': 0.13440527093214877, 'epoch': 3.0})
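
The figures in `TrainOutput` can be checked by hand from the training set size (219691 rows), the batch size (64) and the reported runtime. The sketch assumes a single device and no gradient accumulation.

```python
import math

num_rows = 219691         # training rows after the train/validation split
batch_size = 64           # per_device_train_batch_size
epochs = 3                # 'epoch': 3.0 in the reported output
runtime_s = 4240.1253     # 'train_runtime'

steps_per_epoch = math.ceil(num_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = round(num_rows * epochs / runtime_s, 3)
steps_per_second = round(total_steps / runtime_s, 3)

print(steps_per_epoch, total_steps, samples_per_second, steps_per_second)
```

The results agree with `global_step=10299`, `train_samples_per_second` of 155.437 and `train_steps_per_second` of 2.429.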

In [None]:
trainer.push_to_hub() # Pushes the trained checkpoint to HuggingFace

CommitInfo(commit_url='https://huggingface.co/camilothorne/distilbert-base-cased-finetuned-ner-biosem/commit/43dfccaf766220207ea886d57dfb5d4c57b31eb7', commit_message='End of training', commit_description='', oid='43dfccaf766220207ea886d57dfb5d4c57b31eb7', pr_url=None, repo_url=RepoUrl('https://huggingface.co/camilothorne/distilbert-base-cased-finetuned-ner-biosem', endpoint='https://huggingface.co', repo_type='model', repo_id='camilothorne/distilbert-base-cased-finetuned-ner-biosem'), pr_revision=None, pr_num=None)

In [None]:
'''
Evaluate on validation set
'''

trainer.evaluate()

{'eval_loss': 0.08154609799385071,
 'eval_precision': 0.8230502159369972,
 'eval_recall': 0.8718201715044311,
 'eval_f1': 0.846733515119308,
 'eval_accuracy': 0.9706347089121914,
 'eval_runtime': 65.0222,
 'eval_samples_per_second': 375.426,
 'eval_steps_per_second': 5.875,
 'epoch': 3.0}
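
As a sanity check on the numbers above, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the other two values. The helper `f1_score` is a hypothetical name used only in this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the validation output above.
eval_precision = 0.8230502159369972
eval_recall = 0.8718201715044311
recomputed_f1 = f1_score(eval_precision, eval_recall)
print(round(recomputed_f1, 6))
```

The recomputed value agrees with the reported `eval_f1` of about 0.8467.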

In [None]:
'''
Evaluate on test set
'''


predictions, labels, _ = trainer.predict(test_datasets)
predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
true_predictions = [
    [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_names[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results = metric.compute(predictions=true_predictions, references=true_labels, zero_division=0)
results

{'B': {'precision': 0.825381679389313,
  'recall': 0.408983451536643,
  'f1': 0.5469490989566866,
  'number': 2115},
 'C': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 6},
 'D': {'precision': 0.76,
  'recall': 0.5092783505154639,
  'f1': 0.6098765432098765,
  'number': 485},
 'Disease': {'precision': 0.7526501766784452,
  'recall': 0.9128571428571428,
  'f1': 0.8250484183344093,
  'number': 700},
 'F': {'precision': 0.8708739684921231,
  'recall': 0.8910102657584189,
  'f1': 0.8808270498411344,
  'number': 10423},
 'G': {'precision': 0.72670944899314,
  'recall': 0.6727440335962307,
  'f1': 0.698686240093612,
  'number': 19526},
 'M': {'precision': 0.674208869203266,
  'recall': 0.8120059129764123,
  'f1': 0.7367193422356989,
  'number': 15559},
 'MOA': {'precision': 0.5492957746478874,
  'recall': 0.5735294117647058,
  'f1': 0.5611510791366907,
  'number': 136},
 'R': {'precision': 0.9,
  'recall': 0.5806451612903226,
  'f1': 0.7058823529411764,
  'number': 31},
 'T': {'prec