# Train an Adapter for NER

This notebook shows how to train an adapter and a prediction head for a tagging task. We use the CoNLL 2003 dataset to train the model on Named Entity Recognition (NER). We also set and save the id2label dictionary so that others can use the model easily. First, we install the `adapters` and `datasets` packages, along with `scikit-learn` (used for the evaluation) and `accelerate`.

In [1]:
!pip install -Uq adapters
!pip install -q datasets
!pip install -q scikit-learn
!pip install -Uq accelerate


Next, we instantiate the model, add a tagging head with the correct id2label mapping, and add an adapter that will be trained on the NER task.

In [2]:
from adapters import AutoAdapterModel
from transformers import AutoTokenizer, AutoConfig
from datasets import load_dataset
from torch.utils.data import Dataset
import torch
import torch.nn.functional as F
from tqdm.notebook import tqdm
from torch import nn
# The labels for the NER task and the dictionaries that map them to ids
# and back again
labels = ["O", 'B-LOC', "I-LOC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-MISC", "I-MISC"]
id2label = {id_: label for id_, label in enumerate(labels)}
label2id = {label: id_ for id_, label in enumerate(labels)}

model_name = "bert-base-uncased"
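# Note the keyword is `num_labels`; the config is passed on so the model actually uses it.
config = AutoConfig.from_pretrained(model_name, num_labels=len(labels), id2label=id2label, label2id=label2id)
model = AutoAdapterModel.from_pretrained(model_name, config=config)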
model.add_adapter("ner")

model.add_tagging_head("ner_head", num_labels=len(labels), id2label=id2label)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(model.get_labels())


['O', 'B-LOC', 'I-LOC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-MISC', 'I-MISC']
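
To illustrate why shipping the id2label mapping with the model is useful, here is a minimal sketch that turns a sequence of predicted tag ids back into entity spans. The helper `decode_spans` and the example ids are hypothetical and not part of the notebook or the `adapters` library; the label list is the one defined above.

```python
# Label list copied from the notebook cell above.
labels = ["O", "B-LOC", "I-LOC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-MISC", "I-MISC"]
id2label = dict(enumerate(labels))


def decode_spans(tag_ids):
    """Group BIO tag ids into (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for pos, tag_id in enumerate(tag_ids):
        tag = id2label[tag_id]
        # An I- tag continues the open span only if the entity type matches.
        if tag.startswith("I-") and current == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if current is not None:
            spans.append((current, start, pos))
            start, current = None, None
        # A B- tag opens a new span; "O" and stray I- tags open nothing.
        if tag.startswith("B-"):
            start, current = pos, tag[2:]
    if current is not None:
        spans.append((current, start, len(tag_ids)))
    return spans


# Hypothetical prediction for a six-token sentence.
example_ids = [3, 4, 0, 1, 2, 0]
print(decode_spans(example_ids))
```

Without the mapping, a user of the saved head would only see integer class ids and would have to guess the label order.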


BERT expects WordPiece-tokenized text, but the tokens provided by the dataset are split differently. The `encode_labels` function maps the CoNLL 2003 labels onto the word piece tokens. The `encode_data` function encodes the tokens as ids and adds the special tokens so that BERT can handle the input.

In [3]:
def encode_data(data):
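    # Join the pre-split words so the tokenizer can re-split them into word pieces,
    # then pad or truncate every example to 512 positions.
    encoded = tokenizer([" ".join(doc) for doc in data["tokens"]], padding="max_length",
                        max_length=512, truncation=True, add_special_tokens=True)
    return encoded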


def encode_labels(example):
    r_tags = []
    count = 0
    token2word = []
    for index, token in enumerate(tokenizer.tokenize(" ".join(example["tokens"]))):
        if token.startswith("##") or (token in example["tokens"][index - count - 1].lower() and index - count - 1 >= 0):
            # If the token is part of a larger token and not the first we need to differentiate.
            # If it is a B (beginning) label the next one needs to be assigned an I (intermediate) label.
            # Otherwise they can be labeled the same.
            if r_tags[-1] % 2 == 1:
                r_tags.append(r_tags[-1] + 1)
            else:
                r_tags.append(r_tags[-1])
            count += 1
        else:
            r_tags.append(example["ner_tags"][index - count])

        token2word.append(index - count)
    r_tags = torch.tensor(r_tags)
    labels = {}
    # Reserve one position in front for [CLS] and pad with 0 ("O") up to 512 positions
    # so that examples can be batched
    labels["labels"] = F.pad(r_tags, pad=(1, 511 - r_tags.shape[0]), mode='constant', value=0)
    # Truncate if the document is too long
    labels["labels"] = labels["labels"][:512]
    return labels
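
The label alignment can be sketched without a tokenizer. The following simplified version only treats pieces starting with `##` as continuations and leaves out the additional substring heuristic used in `encode_labels`; the helper names `align_tags` and `pad_labels` and the example word pieces are hypothetical. It relies on the label order defined earlier, where every odd id is a `B-` label and the following even id is the matching `I-` label.

```python
def align_tags(word_pieces, word_tags):
    """Spread one tag id per word over that word's pieces."""
    aligned = []
    word_index = -1
    for piece in word_pieces:
        if piece.startswith("##"):
            previous = aligned[-1]
            # After a B- label (odd id) the continuation gets the matching
            # I- label (id + 1); otherwise the previous label is repeated.
            aligned.append(previous + 1 if previous % 2 == 1 else previous)
        else:
            word_index += 1
            aligned.append(word_tags[word_index])
    return aligned


def pad_labels(tags, max_length=512):
    """One slot for [CLS] in front, zeros behind, cut to max_length."""
    padded = [0] + list(tags) + [0] * max(0, max_length - 1 - len(tags))
    return padded[:max_length]


# Hypothetical word pieces for the words ["Washington", "is", "nice"],
# tagged B-LOC (1), O (0), O (0).
pieces = ["wash", "##ington", "is", "nice"]
aligned = align_tags(pieces, [1, 0, 0])
print(aligned)
print(pad_labels(aligned, max_length=8))
```

Note that the padded positions receive the id 0, which is also the id of the `O` label.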

Next, we load the dataset and use the previously defined functions to prepare it for training. We then define two dataloaders: one for training and one for evaluation. Both use the default batch size of 1.

In [4]:
dataset = load_dataset("conll2003")
dataset = dataset.map(encode_labels)
dataset = dataset.map(encode_data, batched=True, batch_size=10)

dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])

dataloader = torch.utils.data.DataLoader(dataset["train"])
evaluate_dataloader = torch.utils.data.DataLoader(dataset["test"])


Before we can start training, we need to define some training parameters. We check whether a GPU is available and set the device accordingly. Then we tell the model which adapter to train with `model.train_adapter("<adapter_name>")`. As the loss function, we use cross-entropy loss. Finally, we define an optimizer with its parameter groups and learning rate.

In [5]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)
model.set_active_adapters("ner")
model.train_adapter("ner")

loss_function = nn.CrossEntropyLoss()
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
                {
                    "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
                    "weight_decay": 1e-5,
                },
                {
                    "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                    "weight_decay": 0.0,
                },
            ]
optimizer = torch.optim.AdamW(params=optimizer_grouped_parameters, lr=1e-4)
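
The two parameter groups above are selected purely by substring matching on parameter names. A minimal sketch of that filter, using hypothetical parameter names in place of `model.named_parameters()` (the helper `split_by_decay` is not part of the notebook):

```python
def split_by_decay(named_parameters, no_decay=("bias", "LayerNorm.weight")):
    """Return (names that get weight decay, names that are exempt)."""
    decay, exempt = [], []
    for name, _param in named_parameters:
        # A parameter is exempt if any of the no_decay substrings occurs in its name.
        if any(nd in name for nd in no_decay):
            exempt.append(name)
        else:
            decay.append(name)
    return decay, exempt


# Hypothetical names; the values would be tensors in the real model.
fake_parameters = [
    ("encoder.layer.0.output.dense.weight", None),
    ("encoder.layer.0.output.dense.bias", None),
    ("encoder.layer.0.output.LayerNorm.weight", None),
    ("encoder.layer.0.output.LayerNorm.bias", None),
]
decay, exempt = split_by_decay(fake_parameters)
print("weight decay:", decay)
print("no weight decay:", exempt)
```

Only the dense weight ends up in the group with weight decay; biases and layer normalization weights are exempt.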


Now we can start training. Here we train the model for 2 epochs. Feel free to experiment with hyperparameters such as the number of epochs or the learning rate, but keep in mind that adapters often need a few more training epochs than full fine-tuning.

In [6]:
for epoch in range(2):
    for i, batch in enumerate(tqdm(dataloader)):

        batch = {k: v.to(device) for k, v in batch.items()}

        outputs = model(batch["input_ids"])
        # We need to reformat the tensors for the loss function.
        # They need to have the shape (N, C) and (N,) where N is the number
        # of tokens and C the number of classes.
        predictions = torch.flatten(outputs[0], 0, 1)
        expected = torch.flatten(batch["labels"].long(), 0, 1)

        loss = loss_function(predictions, expected)
        loss.backward()

        optimizer.step()
        optimizer.zero_grad()
        if i % 10000 == 0:
            print(f"loss: {loss}")


We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


loss: 3.087712049484253
loss: 0.0004019373736809939



loss: 0.002754304325208068
loss: 0.00028402122552506626
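
In the loop above, every one of the 512 positions contributes to the loss, including the padded positions, which carry the label 0 (`O`). A common alternative, which this notebook does not use, is to mark such positions with an ignore index so they are skipped; `torch.nn.CrossEntropyLoss` offers an `ignore_index` argument (default `-100`) for this. The following pure-Python sketch illustrates the effect on flattened `(N, C)` scores and `(N,)` targets. The function and the numbers are hypothetical, and returning `0.0` when everything is ignored is a choice made for this sketch only.

```python
import math

IGNORE_INDEX = -100  # same value as the default ignore_index in PyTorch


def cross_entropy(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions that are not ignored."""
    total, counted = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
        counted += 1
    return total / counted if counted else 0.0


# Three positions, two classes; the last position stands for padding.
logits = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
print(cross_entropy(logits, [0, 1, 0]))             # padding counted as class 0
print(cross_entropy(logits, [0, 1, IGNORE_INDEX]))  # padding skipped
```

With the padding skipped, the loss reflects only the real tokens, which makes the printed values easier to interpret.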


Then we can save the adapter and head we trained with `model.save_adapter` and `model.save_head` for future use.

In [7]:
model.save_adapter('adapter/', 'ner')
model.save_head("head/", "ner_head")

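To evaluate the trained adapter, we use a confusion matrix that shows how often a token with label x was classified as label y. Rows correspond to the expected labels and columns to the predicted labels, in the order of the `labels` list. Most predictions are correct, and the off-diagonal entries show which labels were confused with each other. Because padded positions carry the label 0 (`O`), the top-left cell is dominated by padding.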

In [8]:
from sklearn.metrics import confusion_matrix
model.to(device)
model.eval()
predictions_list = []
expected_list = []
# Gradients are not needed for evaluation.
with torch.no_grad():
    for i, batch in enumerate(tqdm(evaluate_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(batch["input_ids"], adapter_names=['ner'])
        # Pick the highest-scoring class for every token position.
        predictions = torch.argmax(outputs[0], 2)
        expected = batch["labels"].long()

        predictions_list.append(predictions)
        expected_list.append(expected)

# Rows are expected labels, columns are predicted labels.
print(confusion_matrix(torch.flatten(torch.cat(expected_list)).cpu().numpy(),
                 torch.flatten(torch.cat(predictions_list)).cpu().numpy()))


[[1754582     114      96     118     184     127      71     126     140]
 [     55    1449      77      25       2       6       0       1       0]
 [    111       3    3394       2      39       0       4       0       5]
 [     90      18       0    1389      39      91       0      33       0]
 [    117       1      26       8    1659       1     105       0      28]
 [    100       5       0      67       8    1435      25      23       1]
 [     63       0       5       0      45       2     610       0       1]
 [    109       7       0      33       2      19       0     504      26]
 [     85       1      16       6      47       2      39      20     294]]
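
Per-class precision and recall can be read off the matrix directly. The sketch below copies the matrix printed above and assumes the usual scikit-learn layout (rows are expected labels, columns are predicted labels); the helper `per_class_scores` is hypothetical and not part of the notebook.

```python
labels = ["O", "B-LOC", "I-LOC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-MISC", "I-MISC"]

# Confusion matrix copied from the output above.
matrix = [
    [1754582, 114, 96, 118, 184, 127, 71, 126, 140],
    [55, 1449, 77, 25, 2, 6, 0, 1, 0],
    [111, 3, 3394, 2, 39, 0, 4, 0, 5],
    [90, 18, 0, 1389, 39, 91, 0, 33, 0],
    [117, 1, 26, 8, 1659, 1, 105, 0, 28],
    [100, 5, 0, 67, 8, 1435, 25, 23, 1],
    [63, 0, 5, 0, 45, 2, 610, 0, 1],
    [109, 7, 0, 33, 2, 19, 0, 504, 26],
    [85, 1, 16, 6, 47, 2, 39, 20, 294],
]


def per_class_scores(matrix, labels):
    """Return {label: (precision, recall)} from a confusion matrix."""
    scores = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        actual = sum(matrix[i])                    # row sum: tokens expected to be i
        predicted = sum(row[i] for row in matrix)  # column sum: tokens predicted as i
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores


for label, (precision, recall) in per_class_scores(matrix, labels).items():
    if label != "O":  # the O class is dominated by padded positions
        print(f"{label:7s} precision={precision:.3f} recall={recall:.3f}")
```

Looking at the entity classes separately avoids the distortion that the very large `O` count would introduce into an overall accuracy figure.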
