# 1️⃣ Training an Adapter for a Transformer model

In this notebook, we train an adapter for a **RoBERTa** ([Liu et al., 2019](https://arxiv.org/pdf/1907.11692.pdf)) model for sequence classification on a **sentiment analysis** task using [adapter-transformers](https://github.com/Adapter-Hub/adapter-transformers), the _AdapterHub_ adaptation of HuggingFace's _transformers_ library.

If you're unfamiliar with the theoretical parts of adapters or the AdapterHub framework, check out our [introductory blog post](https://adapterhub.ml/blog/2020/11/adapting-transformers-with-adapterhub/) first.

We train a **Task Adapter** for a pre-trained model here. Most of the code is identical to a full finetuning setup using HuggingFace's transformers. For comparison, have a look at the [same guide using full finetuning](https://colab.research.google.com/drive/1brXJg5Mokm8h3shxqPRnoIsRwHQoncus?usp=sharing).

For training, we use the [movie review dataset by Pang and Lee (2005)](http://www.cs.cornell.edu/people/pabo/movie-review-data/). It contains movie reviews  from Rotten Tomatoes which are either classified as positive or negative. We download the dataset via HuggingFace's [datasets](https://github.com/huggingface/datasets) library.

## Installation

First, let's install the required libraries:

## Dataset Preprocessing

Before we start to train our adapter, we first prepare the training data. Our training dataset can be loaded via HuggingFace `datasets` using one line of code:

In [263]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")
dataset.num_rows

Using custom data configuration default
Reusing dataset rotten_tomatoes_movie_review (/home/eason/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/e06abb624abab47e1a64608fdfe65a913f5a68c66118408032644a3285208fb5)


{'train': 8530, 'validation': 1066, 'test': 1066}

Every dataset sample has an input text and a binary label:

In [264]:
dataset['train'][0]

{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'label': 1}

Now, we need to encode all dataset samples to valid inputs for our Transformer model. Since we want to train on `roberta-base`, we load the corresponding `RobertaTokenizer`. Using `dataset.map()`, we can pass the full dataset through the tokenizer in batches:

In [265]:
from transformers import RobertaTokenizer
import numpy as np
import torch

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")


In [266]:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

def encode_batch(batch):
  """Encodes a batch of input data using the model tokenizer."""
  return tokenizer(batch["text"], max_length=80, truncation=True, padding="max_length")

# Encode the input data
dataset = dataset.map(encode_batch, batched=True)
# The transformers model expects the target class column to be named "labels"
dataset.rename_column_("label", "labels")
# Transform to pytorch tensors and only output the required columns
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

Loading cached processed dataset at /home/eason/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/e06abb624abab47e1a64608fdfe65a913f5a68c66118408032644a3285208fb5/cache-8cb4f92aacf7946a.arrow
Loading cached processed dataset at /home/eason/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/e06abb624abab47e1a64608fdfe65a913f5a68c66118408032644a3285208fb5/cache-8b0df9401559bf06.arrow
Loading cached processed dataset at /home/eason/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/e06abb624abab47e1a64608fdfe65a913f5a68c66118408032644a3285208fb5/cache-1ced22106a303f34.arrow


In [267]:
def transform(a):
    a = np.array(a).reshape(-1,1)

    a = torch.tensor(np.hstack(((a,a))))

In [268]:
#dataset['train']["labels"] = transform(dataset['train']["labels"])

Now we're ready to train our model...

## Training

We use a pre-trained RoBERTa model from HuggingFace. We use `RobertaModelWithHeads`, a class unique to `adapter-transformers`, which allows us to add and configure prediction heads in a flexibler way.

In [269]:
from transformers import RobertaConfig, RobertaModelWithHeads

config = RobertaConfig.from_pretrained(
    "roberta-base",
    num_labels=2,
)
model = RobertaModelWithHeads.from_pretrained(
    "roberta-base",
    config=config,
)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModelWithHeads: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModelWithHeads from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModelWithHeads from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModelWithHeads were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and infere

**Here comes the important part!**

We add a new adapter to our model by calling `add_adapter()`. We pass a name (`"rotten_tomatoes"`) and [the type of adapter](https://docs.adapterhub.ml/adapters.html#adapter-types) (task adapter). Next, we add a binary classification head. It's convenient to give the prediction head the same name as the adapter. This allows us to activate both together in the next step. The `train_adapter()` method does two things:

1. It freezes all weights of the pre-trained model so only the adapter weights are updated during training.
2. It activates the adapter and the prediction head such that both are used in every forward pass.

In [270]:
# Add a new adapter
model.add_adapter("rotten_tomatoes")
# Add a matching classification head
model.add_classification_head(
    "rotten_tomatoes",
    num_labels=2,
    id2label={ 0: "👎", 1: "👍"}
  )

# Add a new adapter
model.add_adapter("rotten_tomatoes2")
# Add a matching classification head
model.add_classification_head(
    "rotten_tomatoes2",
    num_labels=2,
    id2label={ 0: "👎", 1: "👍"}
  )
# Activate the adapter
model.train_adapter(["rotten_tomatoes", "rotten_tomatoes"])

In [271]:
model.heads

ModuleDict(
  (rotten_tomatoes): ClassificationHead(
    (0): Dropout(p=0.1, inplace=False)
    (1): Linear(in_features=768, out_features=768, bias=True)
    (2): Activation_Function_Class()
    (3): Dropout(p=0.1, inplace=False)
    (4): Linear(in_features=768, out_features=2, bias=True)
  )
  (rotten_tomatoes2): ClassificationHead(
    (0): Dropout(p=0.1, inplace=False)
    (1): Linear(in_features=768, out_features=768, bias=True)
    (2): Activation_Function_Class()
    (3): Dropout(p=0.1, inplace=False)
    (4): Linear(in_features=768, out_features=2, bias=True)
  )
)

In [272]:
model

RobertaModelWithHeads(
  (roberta): RobertaModel(
    (invertible_adapters): ModuleDict()
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bia

In [273]:
from transformers.adapters.composition import Parallel

In [274]:
model.active_adapters

Stack[rotten_tomatoes, rotten_tomatoes]

In [275]:
model.set_active_adapters(Parallel('rotten_tomatoes', 'rotten_tomatoes2'))

In [276]:
model.active_head

['rotten_tomatoes', 'rotten_tomatoes2']

In [277]:
model = model.to("cpu")

In [278]:
input_ids = dataset["train"]["input_ids"][0:10].to(model.device)

In [279]:
attention_mask = dataset["train"]["attention_mask"][0:10].to(model.device)
labels = dataset["train"]["labels"][0:10].to(model.device)

In [310]:
out = model(input_ids = input_ids, attention_mask = attention_mask, labels = labels)

In [311]:
out

[SequenceClassifierOutput(loss=tensor(0.7670, grad_fn=<NllLossBackward>), logits=tensor([[ 0.0735, -0.0645],
         [ 0.0560, -0.0668],
         [ 0.0657, -0.0535],
         [ 0.1288, -0.1413],
         [ 0.0660, -0.0587],
         [ 0.0662, -0.0674],
         [ 0.0646, -0.0674],
         [ 0.0387, -0.0693],
         [ 0.0657, -0.0689],
         [ 0.0660, -0.0732]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None),
 SequenceClassifierOutput(loss=tensor(0.7413, grad_fn=<NllLossBackward>), logits=tensor([[ 0.0943,  0.0206],
         [ 0.2703, -0.0213],
         [ 0.1081,  0.0251],
         [ 0.0870,  0.0234],
         [ 0.0842,  0.0332],
         [ 0.0906,  0.0424],
         [ 0.1336,  0.0082],
         [ 0.0994,  0.0404],
         [ 0.0843,  0.0402],
         [ 0.1096,  0.0191]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)]

In [318]:
a = np.array(out)[np.array(model.active_head) != "rotten_tomatoes2"][0]

In [319]:
t = a.loss.grad_fn.next_functions[0][0]
while True:
    t = t.next_functions[0][0]
    print(t)

<ViewBackward object at 0x7fc4065bafd0>
<AddmmBackward object at 0x7fc4060ed810>
<AccumulateGrad object at 0x7fc406b61a90>


IndexError: tuple index out of range

In [320]:
a.loss.backward()

RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling .backward() or autograd.grad() the first time.

In [221]:
loss = out[0][0]

In [232]:
loss.backward()

RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling .backward() or autograd.grad() the first time.

In [231]:
out

[SequenceClassifierOutput(loss=tensor(0.7999, grad_fn=<NllLossBackward>), logits=tensor([[ 0.0121, -0.0696],
         [ 0.0518, -0.0869],
         [ 0.0969, -0.1307],
         [ 0.1042, -0.1531],
         [ 0.0677, -0.1537],
         [ 0.0528, -0.0936],
         [ 0.0600, -0.0780],
         [ 0.0906, -0.3492],
         [ 0.0544, -0.1172],
         [ 0.0708, -0.1189]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None),
 SequenceClassifierOutput(loss=tensor(0.7089, grad_fn=<NllLossBackward>), logits=tensor([[ 0.0941,  0.0745],
         [ 0.0075,  0.0395],
         [ 0.1267,  0.1500],
         [ 0.0908,  0.0015],
         [ 0.0713, -0.0091],
         [ 0.0844,  0.0230],
         [ 0.0265,  0.0084],
         [ 0.1152,  0.0897],
         [ 0.0856,  0.0165],
         [ 0.0431,  0.0426]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)]

In [230]:
model.active_adapters[0]

'rotten_tomatoes'

In [224]:
labels

tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [287]:
loss.backward()

RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling .backward() or autograd.grad() the first time.

For training, we make use of the `Trainer` class built-in into `transformers`. We configure the training process using a `TrainingArguments` object and define a method that will calculate the evaluation accuracy in the end. We pass both, together with the training and validation split of our dataset, to the trainer instance.

**Note the differences in hyperparameters compared to full finetuning.** Adapter training usually required a few more training epochs than full finetuning.

In [121]:
import transformers

In [122]:
transformers.__version__

'2.0.1'

In [123]:
import numpy as np
from transformers import TrainingArguments, Trainer, EvalPrediction

training_args = TrainingArguments(
    learning_rate=1e-4,
    num_train_epochs=6,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_steps=200,
    output_dir="./training_output",
    overwrite_output_dir=True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=False,
)


def compute_accuracy(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    return {"acc": (preds == p.label_ids).mean()}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_accuracy,
)

Start the training 🚀

In [133]:
type(dataset["test"])

datasets.arrow_dataset.Dataset

In [124]:
import numpy as np
from transformers import TrainingArguments, Trainer, EvalPrediction

training_args = TrainingArguments(
    learning_rate=1e-4,
    num_train_epochs=6,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_steps=200,
    output_dir="./training_output",
    overwrite_output_dir=True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=False,
)

In [125]:
trainer.train()

ValueError: Caught ValueError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/eason/data/anaconda3/envs/adapter/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/eason/data/anaconda3/envs/adapter/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/eason/data/anaconda3/envs/adapter/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 941, in forward
    head_inputs, head_name=head, attention_mask=attention_mask, return_dict=return_dict, **kwargs
  File "/home/eason/data/anaconda3/envs/adapter/lib/python3.7/site-packages/transformers/adapters/heads.py", line 534, in forward_head
    head_output = head_module(head_inputs, head_cls_input, attention_mask, return_dict, **kwargs)
  File "/home/eason/data/anaconda3/envs/adapter/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/eason/data/anaconda3/envs/adapter/lib/python3.7/site-packages/transformers/adapters/heads.py", line 90, in forward
    loss = loss_fct(logits.view(-1, self.config["num_labels"]), labels.view(-1))
  File "/home/eason/data/anaconda3/envs/adapter/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/eason/data/anaconda3/envs/adapter/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 1048, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/eason/data/anaconda3/envs/adapter/lib/python3.7/site-packages/torch/nn/functional.py", line 2693, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/eason/data/anaconda3/envs/adapter/lib/python3.7/site-packages/torch/nn/functional.py", line 2385, in nll_loss
    "Expected input batch_size ({}) to match target batch_size ({}).".format(input.size(0), target.size(0))
ValueError: Expected input batch_size (16) to match target batch_size (32).


Looks good! Let's evaluate our adapter on the validation split of the dataset to see how well it learned:

In [18]:
trainer.evaluate()

{'eval_loss': 0.2824031710624695,
 'eval_acc': 0.8846153846153846,
 'eval_runtime': 1.7728,
 'eval_samples_per_second': 601.319,
 'epoch': 6.0}

We can put our trained model into a `transformers` pipeline to be able to make new predictions conveniently:

In [19]:
 from transformers import TextClassificationPipeline

classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=training_args.device.index)

classifier("This is awesome!")

[{'label': '👍', 'score': 0.9862988591194153}]

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'text'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'text'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'text'],
        num_rows: 1066
    })
})

device(type='cuda', index=1)

At last, we can also extract the adapter from our model and separately save it for later reuse. Note the size difference compared to a full model!

In [14]:
model.save_adapter("./final_adapter", "rotten_tomatoes")

!ls -lh final_adapter

total 5.8M
-rw-r--r-- 1 eason student  581 Jul  7 23:32 adapter_config.json
-rw-r--r-- 1 eason student  354 Jul  7 23:32 head_config.json
-rw-r--r-- 1 eason student 3.5M Jul  7 23:32 pytorch_adapter.bin
-rw-r--r-- 1 eason student 2.3M Jul  7 23:32 pytorch_model_head.bin


In [15]:
model

RobertaModelWithHeads(
  (roberta): RobertaModel(
    (invertible_adapters): ModuleDict()
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bia

**Share your work!**

The next step after training is to share our adapter with the world via _AdapterHub_. [Read our guide](https://docs.adapterhub.ml/contributing.html) on how to prepare the adapter module we just saved and contribute it to the Hub!

➡️ Also continue with [the next Colab notebook](https://colab.research.google.com/github/Adapter-Hub/adapter-transformers/blob/master/notebooks/02_Adapter_Inference.ipynb) to learn how to use adapters from the Hub.