# Train an Adapter for NER

This notebook shows how to train an adapter and a prediction head for a tagging task. We use the CoNLL 2003 dataset to train the model on Named Entity Recognition (NER). We also set and save the `id2label` dictionary so that others can easily use the model. First we need to install the `adapter-transformers` and `datasets` packages.

First we instantiate the model, add a tagging head, and set the label dictionaries (`id2label` and `label2id`). We also add an adapter that will be trained on the NER task.

In [3]:
from transformers import AutoModelWithHeads, AutoTokenizer, AutoConfig
from datasets import load_dataset
import torch
import torch.nn.functional as F
from tqdm.notebook import tqdm
from torch import nn

# The labels for the NER task and the dictionaries that map them to ids and back.
# The order follows the tag ids of the Hugging Face `conll2003` dataset
# (dataset["train"].features["ner_tags"].feature.names), so that id2label
# decodes the dataset's ner_tags correctly.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
id2label = {id_: label for id_, label in enumerate(labels)}
label2id = {label: id_ for id_, label in enumerate(labels)}

model_name = "bert-base-uncased"
# Store the label mapping in the config and hand the config to the model
config = AutoConfig.from_pretrained(model_name, num_labels=len(labels), id2label=id2label, label2id=label2id)
model = AutoModelWithHeads.from_pretrained(model_name, config=config)
model.add_adapter("ner")

model.add_tagging_head("ner_head", num_labels=len(labels), id2label=id2label)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(model.get_labels())


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModelWithHeads: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModelWithHeads from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModelWithHeads from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
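The label handling in this notebook relies on one property of the label list: every `B-` tag sits at an odd id and is directly followed by its `I-` tag. The following is a minimal sketch of that layout in plain Python. The helper names `build_labels`, `continuation_id`, and `decode` are hypothetical, and the order of the entity types passed to `build_labels` is an assumption that should match the `labels` list used in the notebook.

```python
def build_labels(entity_types):
    """Build a BIO label list: "O" first, then a B-/I- pair per entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels += [f"B-{entity}", f"I-{entity}"]
    return labels


# Assumed entity order; adjust it to the label list you actually train with.
labels = build_labels(("PER", "ORG", "LOC", "MISC"))
id2label = dict(enumerate(labels))
label2id = {label: id_ for id_, label in id2label.items()}


def continuation_id(tag_id):
    """Id for a follow-up word piece of a word tagged with tag_id."""
    # Odd ids are B- tags; the matching I- tag is the next id.
    return tag_id + 1 if tag_id % 2 == 1 else tag_id


def decode(tag_ids):
    """Turn predicted ids back into readable tags."""
    return [id2label[i] for i in tag_ids]


print(decode([label2id["B-PER"], continuation_id(label2id["B-PER"]), 0]))
```

Because `"O"` occupies id 0, the odd/even check is enough to tell `B-` tags from `I-` tags without looking at the tag names.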


BERT expects WordPiece-tokenized text, but the tokens provided by the dataset are whole words. The `encode_labels` function maps the word-level labels of the CoNLL 2003 dataset onto the word pieces. The `encode_data` function encodes the tokens as ids and adds the special tokens so that the BERT model can handle the input.
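To make the label mapping concrete, here is a small, self-contained sketch that does not need the real tokenizer. `toy_wordpieces` is a hypothetical stand-in for WordPiece (it simply cuts words into fixed-size chunks), and `align_tags` works word by word, which is a simpler variant than the notebook's `encode_labels` and not a copy of it.

```python
def toy_wordpieces(word, max_piece_len=4):
    """Hypothetical stand-in for a WordPiece tokenizer: lowercases the word,
    cuts it into fixed-size chunks, and prefixes every chunk after the first
    with "##"."""
    word = word.lower()
    if not word:
        return []
    chunks = [word[i:i + max_piece_len] for i in range(0, len(word), max_piece_len)]
    return [chunks[0]] + ["##" + chunk for chunk in chunks[1:]]


def align_tags(words, tags):
    """Spread word-level BIO tags over the word pieces of each word."""
    pieces, piece_tags = [], []
    for word, tag in zip(words, tags):
        for position, piece in enumerate(toy_wordpieces(word)):
            pieces.append(piece)
            if position == 0:
                # The first piece keeps the tag of the word.
                piece_tags.append(tag)
            elif tag.startswith("B-"):
                # Later pieces of a B- word continue the entity with I-.
                piece_tags.append("I-" + tag[2:])
            else:
                # Later pieces of an I- or O word repeat the tag.
                piece_tags.append(tag)
    return pieces, piece_tags


words = ["Angela", "Merkel", "visited", "Paris"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
pieces, piece_tags = align_tags(words, tags)
for piece, tag in zip(pieces, piece_tags):
    print(f"{piece:8} {tag}")
```

The rule is the same one the notebook applies to label ids: a continuation piece of a `B-` word becomes `I-`, everything else repeats the previous tag.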




In [4]:
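def encode_data(data):
    # Join the words of each document into one string; the tokenizer splits it into
    # word pieces, adds [CLS]/[SEP], and pads or truncates to 512 ids.
    encoded = tokenizer([" ".join(doc) for doc in data["tokens"]], padding="max_length",
                        max_length=512, truncation=True, add_special_tokens=True)
    return encoded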


def encode_labels(example):
    r_tags = []
    count = 0  # number of word pieces seen so far that continue a previous word
    for index, token in enumerate(tokenizer.tokenize(" ".join(example["tokens"]))):
        prev_word = index - count - 1
        if token.startswith("##") or (prev_word >= 0 and token in example["tokens"][prev_word].lower()):
            # The token continues the previous word (heuristic: it starts with "##" or
            # occurs inside that word). If the previous tag is a B (beginning) tag, the
            # continuation is assigned the matching I (inside) tag;
            # otherwise it is labeled the same as the previous token.
            if r_tags[-1] % 2 == 1:
                r_tags.append(r_tags[-1] + 1)
            else:
                r_tags.append(r_tags[-1])
            count += 1
        else:
            r_tags.append(example["ner_tags"][index - count])

    r_tags = torch.tensor(r_tags)
    labels = {}
    # Pad the labels to the maximum length so they can be batched: one position in
    # front for [CLS], the rest behind; padding positions get label 0 ("O")
    labels["labels"] = F.pad(r_tags, pad=(1, 511 - r_tags.shape[0]), mode='constant', value=0)
    # Truncate if the document is too long
    labels["labels"] = labels["labels"][:512]
    return labels
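The padding step at the end of `encode_labels` can be mirrored in plain Python to see what the label vector looks like. This is only a sketch of the same arithmetic on lists: `pad_labels` is a hypothetical helper, and the short `max_length` values in the example are chosen for readability.

```python
def pad_labels(tag_ids, max_length=512, pad_id=0):
    """Shift the tags right by one slot for [CLS], then pad or cut to max_length."""
    padded = [pad_id] + list(tag_ids)  # slot 0 lines up with the [CLS] token
    padded = padded[:max_length]       # cut documents that are too long
    padded += [pad_id] * (max_length - len(padded))  # fill the rest
    return padded


print(pad_labels([3, 4], max_length=6))         # short document: padded
print(pad_labels(range(1, 10), max_length=4))   # long document: truncated
print(len(pad_labels([1, 2, 0])))               # default length
```

Note that the fill value is label id 0, so `[CLS]`, `[SEP]`, and all padding positions are treated as `O` tokens by the loss and by the evaluation later on.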

Next we can load the dataset and use the previously defined functions to prepare the dataset for training. We then define two dataloaders: one for training and one for evaluation.

In [5]:
dataset = load_dataset("conll2003")
dataset = dataset.map(encode_labels)
dataset = dataset.map(encode_data, batched=True, batch_size=10)

dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])

dataloader = torch.utils.data.DataLoader(dataset["train"])
evaluate_dataloader = torch.utils.data.DataLoader(dataset["test"])

Reusing dataset conll2003 (/home/eason/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/40e7cb6bcc374f7c349c83acd1e9352a4f09474eb691f64f364ee62eb65d0ca6)


  0%|          | 0/14041 [00:00<?, ?ex/s]

  0%|          | 0/3250 [00:00<?, ?ex/s]

  0%|          | 0/3453 [00:00<?, ?ex/s]

  0%|          | 0/1405 [00:00<?, ?ba/s]

  0%|          | 0/325 [00:00<?, ?ba/s]

  0%|          | 0/346 [00:00<?, ?ba/s]

Before we can start training the model, we need to define some training parameters. We check whether a GPU is available and set our device accordingly. Then we tell the model which adapter to train with `model.train_adapter(["<adapter_name>"])`. As the loss function we use cross-entropy loss. Finally, we define an optimizer with the parameter groups and the learning rate.

In [6]:
model.device

device(type='cpu')

In [7]:
device = torch.device("cuda", 0) if torch.cuda.is_available() else 'cpu'
model.to(device)
model.set_active_adapters([["ner"]])
model.train_adapter(["ner"])

loss_function = nn.CrossEntropyLoss()
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
                {
                    "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
                    "weight_decay": 1e-5,
                },
                {
                    "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                    "weight_decay": 0.0,
                },
            ]
optimizer = torch.optim.AdamW(params=optimizer_grouped_parameters, lr=1e-4)
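The two parameter groups above are selected purely by substring matching on parameter names. The sketch below shows that selection on a handful of names in plain Python. The names are illustrative examples rather than a dump of the real model, and `group_parameter_names` is a hypothetical helper.

```python
def group_parameter_names(names, no_decay=("bias", "LayerNorm.weight")):
    """Split parameter names into (with weight decay, without weight decay)."""
    with_decay, without_decay = [], []
    for name in names:
        # A parameter is excluded from weight decay if any marker occurs in its name.
        if any(marker in name for marker in no_decay):
            without_decay.append(name)
        else:
            with_decay.append(name)
    return with_decay, without_decay


# Illustrative names only; real names come from model.named_parameters().
names = [
    "encoder.layer.0.output.dense.weight",
    "encoder.layer.0.output.dense.bias",
    "encoder.layer.0.output.LayerNorm.weight",
    "encoder.layer.0.output.LayerNorm.bias",
]
with_decay, without_decay = group_parameter_names(names)
print("decay:   ", with_decay)
print("no decay:", without_decay)
```

Only the dense weight matrix ends up in the group with weight decay; biases and layer-norm parameters are left unregularized.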


Then we can start the training. In this case we train the model for 2 epochs. Feel free to play around with hyperparameters such as the number of epochs or the learning rate, but keep in mind that adapters often need a few more training epochs than full fine-tuning.

In [9]:
batch["input_ids"]

tensor([[  101,  1011,  1011,  2414,  2739,  9954,  1009,  4008, 18225,  5139,
          2475,  6146, 27814,   102,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,  

In [8]:
for epoch in range(2):
    for i, batch in enumerate(tqdm(dataloader)):
        
        batch = {k: v.to(device) for k, v in batch.items()}

        outputs = model(batch["input_ids"] )
        # we need to reformat the tensors for the loss function
        # they need to have the shape (N, C) and (N,) where N is the number
        # of tokens and C the number of classes
        predictions = torch.flatten(outputs[0], 0, 1)
        expected = torch.flatten(batch["labels"].long(), 0, 1)
        
        loss = loss_function(predictions, expected)
        loss.backward()
        
        optimizer.step()
        optimizer.zero_grad()
        if i % 10000 == 0:
            print(f"loss: {loss}")

  0%|          | 0/14041 [00:00<?, ?it/s]

loss: 2.1730833053588867


KeyboardInterrupt: 
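The reshaping in the training loop turns per-sequence outputs into one long list of tokens before the loss is computed. The following plain-Python sketch reproduces that idea on nested lists, assuming the default mean reduction of cross-entropy; `flatten_batch` and `cross_entropy` are hypothetical helpers and not the PyTorch implementation.

```python
import math


def flatten_batch(nested):
    """Merge the batch and sequence dimensions: (B, T, ...) -> (B*T, ...)."""
    return [item for sequence in nested for item in sequence]


def cross_entropy(logits, targets):
    """Mean cross-entropy over N tokens; logits is (N, C), targets is (N,)."""
    total = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp of the class scores.
        highest = max(row)
        log_sum_exp = highest + math.log(sum(math.exp(x - highest) for x in row))
        total += log_sum_exp - row[target]
    return total / len(targets)


# Batch of 2 sequences, 2 tokens each, 3 classes, all scores equal.
batch_logits = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
                [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
batch_labels = [[0, 1], [2, 0]]

predictions = flatten_batch(batch_logits)  # shape (4, 3)
expected = flatten_batch(batch_labels)     # shape (4,)
loss = cross_entropy(predictions, expected)
print(loss, math.log(3))
```

With equal scores for every class the loss equals the logarithm of the number of classes. For the 9 labels used here that would be ln 9 ≈ 2.197, which is in the same range as the first loss value printed above.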

Then we can save the adapter and head we trained with `model.save_adapter` and `model.save_head` for future use.

In [None]:
model.save_adapter('adapter/', 'ner')
model.save_head("head/", "ner_head")

To evaluate our trained adapter, we use a confusion matrix that shows how often a token with label x was classified as label y. We can see that the predictions are correct in most cases. From the confusion matrix we can also see which labels were predicted wrongly.

In [None]:
from sklearn.metrics import confusion_matrix
model.to(device)
model.eval()
predictions_list = []
expected_list = []
# no gradients are needed for evaluation
with torch.no_grad():
    for i, batch in enumerate(tqdm(evaluate_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(batch["input_ids"], adapter_names=['ner'])
        predictions = torch.argmax(outputs[0], 2)
        # keep the labels as integers so they match the predicted class ids
        expected = batch["labels"].long()

        predictions_list.append(predictions)
        expected_list.append(expected)

# rows are the expected labels, columns the predicted labels
print(confusion_matrix(torch.flatten(torch.cat(expected_list)).cpu().numpy(),
                 torch.flatten(torch.cat(predictions_list)).cpu().numpy()))
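To read the printed matrix, it helps to see how such a table is counted. The sketch below builds a confusion matrix by hand in plain Python, with rows for expected labels and columns for predicted labels, and derives a per-label recall from it. `confusion_counts` and `per_label_recall` are hypothetical helpers, and the label ids in the example are made up.

```python
def confusion_counts(expected, predicted, num_classes):
    """matrix[e][p] counts tokens with expected label e that were predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for e, p in zip(expected, predicted):
        matrix[e][p] += 1
    return matrix


def per_label_recall(matrix):
    """Share of each expected label that was predicted correctly (None if unseen)."""
    recalls = []
    for label, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[label] / total if total else None)
    return recalls


# Made-up ids for six tokens and three classes.
expected = [0, 0, 1, 2, 2, 2]
predicted = [0, 1, 1, 2, 2, 0]
matrix = confusion_counts(expected, predicted, num_classes=3)
for row in matrix:
    print(row)
print(per_label_recall(matrix))
```

Keep in mind that the notebook pads every label sequence to 512 positions with label id 0, so the row for id 0 in the real matrix also counts all padding positions and will be much larger than the other rows.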