# **AdapterFusion for Sequence Classification**

In this example we will fine-tune the *BERT-base-uncased* model to classify a sequence of tokens. For this purpose, we will use a PEFT method called **AdapterFusion**, which learns a mixture of multiple pretrained adapters. We will use **transformers** to download tokenizers, **datasets** to download the data, **adapters** to create the adapter model and train it, **evaluate** to load evaluation metrics, and **wandb** (Weights & Biases) to log the results.

You can also open this example in Google Colab:

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Wicwik/peft_tutorial/blob/main/examples/adapter_fusion.ipynb)

### **0. Install and import required modules**

In [2]:
# 4.36.0 for compatibility with adapters
%pip install -q --user transformers==4.36.0
%pip install -q --user datasets
%pip install -q --user adapters
%pip install -q --user evaluate
%pip install -q --user wandb

Note: you may need to restart the kernel to use updated packages.


In [1]:
import torch
import evaluate
import wandb
import logging

from datasets import load_dataset
from transformers import (
    BertTokenizer,
    BertConfig,
    TrainingArguments,
    default_data_collator
)

from adapters import BertAdapterModel, AdapterTrainer
from adapters.composition import Fuse

### **1. Set variables**

We will be fine-tuning the pre-trained [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) model, which has **110M** parameters. We will set the max **input length to 128** tokens and train for **3 epochs** with a **batch size of 32** and a **learning rate of 1e-3**.

In [2]:
device = "cuda"
model_name_or_path = "bert-base-uncased"
tokenizer_name_or_path = "bert-base-uncased"

max_length = 128
lr = 1e-3
num_epochs = 3
batch_size = 32 # in case of "unable to allocate" errors, decrease batch size to some lower number (e.g. 8 or 16)

logging.disable(logging.WARNING)

### **2. Dataset and preprocessing**

The dataset that we will be using is called [Commitment Bank](https://huggingface.co/datasets/super_glue/viewer/cb) (CB) and comes from the SuperGLUE benchmark. It contains premise-hypothesis pairs, where the premise is a short passage and the hypothesis is a clause taken from that passage. The target is 0 if the premise entails the hypothesis, 1 if it contradicts it, and 2 if the relation is neutral.

The dataset contains **250 training samples and 56 validation samples**. We will also split the validation part of the dataset in half to create a test part for evaluation after training.

In [3]:
dataset = load_dataset("super_glue", "cb")

# test set is not labeled so we need to do custom splits
validtest = dataset["validation"].train_test_split(test_size=0.5)

dataset["validation"] = validtest["train"]
dataset["test"] = validtest["test"]

dataset["train"][0]

{'premise': 'It was a complex language. Not written down but handed down. One might say it was peeled down.',
 'hypothesis': 'the language was peeled down',
 'idx': 0,
 'label': 0}
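
Note that `train_test_split` is called without a seed above, so the validation/test membership can change between runs. Below is a minimal pure-Python sketch of a reproducible 50/50 split over the 56 validation indices. The helper name `split_in_half` and the seed value are illustrative assumptions and are not part of the notebook; with the `datasets` library, passing a `seed` argument to `train_test_split` should give a similar effect.

```python
import random


def split_in_half(num_examples, seed=42):
    """Return two disjoint, shuffled index lists of (almost) equal size."""
    indices = list(range(num_examples))
    # A local RNG keeps the split reproducible without touching global state.
    random.Random(seed).shuffle(indices)
    half = num_examples // 2
    return indices[:half], indices[half:]


# 56 is the size of the CB validation split mentioned above.
valid_idx, test_idx = split_in_half(56)
print(len(valid_idx), len(test_idx))
```

Fixing the seed makes the validation and test metrics comparable across runs, which matters with splits as small as 28 examples.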

Now we will tokenize the dataset. We don't need to tokenize the labels this time, because we will train a classification head that returns numbers and not strings.

In [4]:
tokenizer = BertTokenizer.from_pretrained(tokenizer_name_or_path)

def preprocess_function(examples):
    # Encode each premise/hypothesis pair as one padded, truncated sequence.
    return tokenizer(
        examples["premise"],
        examples["hypothesis"],
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )


dataset = dataset.map(preprocess_function, batched=True)
processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=1,
    remove_columns=["premise", "hypothesis", "idx"],
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)

processed_datasets = processed_datasets.rename_column("label", "labels")

train_dataset = processed_datasets["train"].shuffle()
eval_dataset = processed_datasets["validation"]
test_dataset = processed_datasets["test"]
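
To make the tokenizer output more concrete, here is a simplified sketch of how a premise/hypothesis pair is laid out for BERT: `[CLS] premise [SEP] hypothesis [SEP]`, followed by padding. The function `encode_pair` is a hypothetical helper that works on already-tokenized ids and assumes the pair fits into `max_length`; the real tokenizer also handles WordPiece splitting and truncation.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special token ids of bert-base-uncased


def encode_pair(premise_ids, hypothesis_ids, max_length):
    """Build BERT-style inputs for a sentence pair from token id lists."""
    input_ids = [CLS_ID] + premise_ids + [SEP_ID] + hypothesis_ids + [SEP_ID]
    if len(input_ids) > max_length:
        raise ValueError("pair does not fit; truncation is not sketched here")

    # Segment 0 covers [CLS] + premise + [SEP], segment 1 the hypothesis + [SEP].
    token_type_ids = [0] * (len(premise_ids) + 2) + [1] * (len(hypothesis_ids) + 1)
    attention_mask = [1] * len(input_ids)

    # Pad all three lists to max_length; padded positions are masked out.
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }


encoded = encode_pair([1037, 12063], [2307], max_length=8)
print(encoded)
```

Padding every example to the same `max_length` is what allows `default_data_collator` to stack the examples into a batch without any further processing.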



### **3. Create Adapters model**

Next, we will create the adapter model and the AdapterFusion layer. First we load *BertAdapterModel* from the Adapters library, and then we load **3 pre-trained adapters** into the model. These adapters were pretrained on **MNLI** (cross-genre NLI), **QQP** (question paraphrase) and **QNLI** (QA NLI). As we don't need their prediction heads, we pass **with_head=False** to the loading method.

After that we will add an AdapterFusion layer that combines all 3 adapters, activate it, and set it as trainable. The *train_adapter_fusion()* method does two things: it freezes all weights of the model (including the adapters!) except for the fusion layer and the classification head, and it activates the given adapter setup so that it is used in every forward pass.

Here is what the AdapterFusion layer looks like in the model:
<p align="center">
<img src="../img/af.png" alt="adapter_fusion_arch" width="auto" height="350">
<img src="../img/af_arch.png" alt="adapter_fusion" width="auto" height="350">
</p>
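
The core idea of the fusion layer can be sketched in a few lines: the layer's hidden state acts as the query, each adapter output provides a key and a value, and the softmax over the query-key scores decides how much each adapter contributes. The snippet below is a minimal pure-Python illustration under simplifying assumptions (a single token, tiny vectors, identity projection matrices, and none of the extras that a full implementation may add); the function names are hypothetical and this is not the Adapters library code.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(hidden, adapter_outputs, w_query, w_key, w_value):
    """Attention over adapter outputs, queried by the layer's hidden state."""
    query = matvec(w_query, hidden)
    keys = [matvec(w_key, z) for z in adapter_outputs]
    values = [matvec(w_value, z) for z in adapter_outputs]
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)  # one weight per adapter
    fused = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(hidden))
    ]
    return fused, weights


identity = [[1.0, 0.0], [0.0, 1.0]]  # toy projections, purely illustrative
hidden = [1.0, 0.0]
adapter_outputs = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]  # 3 toy adapters

fused, weights = fuse(hidden, adapter_outputs, identity, identity, identity)
print(weights)
print(fused)
```

In this toy setup the first adapter output points in the same direction as the hidden state, so it receives the largest weight. During fusion training only the projection matrices are learned, while the adapters themselves stay frozen.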

In [5]:
id2label = {id: label for (id, label) in enumerate(processed_datasets["train"].features["labels"].names)}

config = BertConfig.from_pretrained(model_name_or_path, id2label=id2label)
model = BertAdapterModel.from_pretrained(model_name_or_path, config=config)

model.load_adapter("nli/multinli@ukp", load_as="multinli", with_head=False)
model.load_adapter("sts/qqp@ukp", with_head=False)
model.load_adapter("nli/qnli@ukp", with_head=False)

model.add_adapter_fusion(Fuse("multinli", "qqp", "qnli"))
model.set_active_adapters(Fuse("multinli", "qqp", "qnli"))

model.add_classification_head("cb", num_labels=len(id2label))

adapter_setup = Fuse("multinli", "qqp", "qnli")
model.train_adapter_fusion(adapter_setup)

print(model.adapter_summary())

model

Name                     Architecture         #Param      %Param  Active   Train
--------------------------------------------------------------------------------
multinli                 bottleneck          894,528       0.684       1       0
qqp                      bottleneck          894,528       0.684       1       0
qnli                     bottleneck          894,528       0.684       1       0
--------------------------------------------------------------------------------
Full model                               130,734,336     100.000               0


BertAdapterModel(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttentionWithAdapters(
              (query): LoRALinear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (key): LoRALinear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (value): LoRALinear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (dr

The Adapters library doesn't provide a function to get the number of trainable parameters like the Hugging Face PEFT module does, but we can reuse [their implementation](https://github.com/huggingface/peft/blob/main/src/peft/peft_model.py#L492) for this model.

In [6]:
trainable_params = 0
all_param = 0
for n, param in model.named_parameters():
    num_params = param.numel()
    if num_params == 0 and hasattr(param, "ds_numel"):
        num_params = param.ds_numel

    if param.__class__.__name__ == "Params4bit":
        num_params = num_params * 2

    all_param += num_params
    if param.requires_grad:
        # print(n)
        trainable_params += num_params


print(f"trainable params: {trainable_params:,d} || all params: {all_param:,d} || trainable%: {100 * trainable_params / all_param}")

trainable params: 22,467,645 || all params: 134,633,469 || trainable%: 16.688008685269782
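
As a sanity check, the per-adapter figure in the adapter summary above (894,528 parameters, 0.684%) can be reproduced by hand. The sketch below assumes that each adapter places one bottleneck module in each of the 12 layers and uses a reduction factor of 16 (bottleneck size 48). These two assumptions are not stated in this notebook, but they reproduce the reported number.

```python
hidden_size = 768       # BERT-base hidden size (see the printed model)
num_layers = 12         # BERT-base encoder layers
reduction_factor = 16   # assumption: bottleneck = hidden_size / 16
bottleneck = hidden_size // reduction_factor  # 48

# Each bottleneck module is a down-projection and an up-projection with biases.
down_projection = hidden_size * bottleneck + bottleneck
up_projection = bottleneck * hidden_size + hidden_size
per_layer = down_projection + up_projection
per_adapter = per_layer * num_layers

full_model = 130_734_336  # "Full model" row of the adapter summary
percentage = round(100 * per_adapter / full_model, 3)
print(per_adapter, percentage)  # 894528 0.684
```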


### **4. Training and evaluation**

We will be using Hugging Face [TrainingArguments](https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/trainer#transformers.TrainingArguments) and [AdapterTrainer](https://docs.adapterhub.ml/training.html#adaptertrainer) from Adapter Hub (this trainer is based on the Transformers *Trainer*). The BERT model does not generate tokens, so we don't need to do any postprocessing and can use the metric from *evaluate.load()* directly. The trainer takes a *compute_metrics* function that is used to compute metrics during evaluation.

For the SuperGLUE CB dataset, *evaluate* computes F1 and accuracy.

In [7]:
metric = evaluate.load("super_glue", "cb")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    preds = preds.argmax(axis=1)

    return metric.compute(predictions=preds, references=labels)

training_args = TrainingArguments(
    "out",
    per_device_train_batch_size=batch_size,
    learning_rate=lr,
    num_train_epochs=num_epochs,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="no",
)
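
To illustrate what `compute_metrics` does with the model output, here is a small self-contained sketch that takes the argmax over the logits and then computes accuracy and F1 by hand. It assumes that the F1 reported for CB is macro-averaged over the classes; the helper names and the toy logits are illustrative only.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def macro_f1(preds, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(preds) | set(labels)):
        tp = sum(p == cls and y == cls for p, y in zip(preds, labels))
        fp = sum(p == cls and y != cls for p, y in zip(preds, labels))
        fn = sum(p != cls and y == cls for p, y in zip(preds, labels))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy logits for 5 examples and 3 classes (hypothetical values).
logits = [
    [2.0, 0.1, 0.3],
    [1.5, 0.2, 0.1],
    [0.1, 1.9, 0.2],
    [0.3, 2.2, 0.4],
    [1.1, 0.2, 0.9],  # true class is 2, but class 0 wins
]
labels = [0, 0, 1, 1, 2]

preds = [argmax(row) for row in logits]
acc = accuracy(preds, labels)
f1 = macro_f1(preds, labels)
print(preds, round(acc, 3), round(f1, 3))
```

A class that is never predicted correctly contributes an F1 of 0 to the average, which is one way macro F1 can end up well below accuracy on a small, imbalanced dataset.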

Now we will do the training and evaluation, similar to the [LoRA notebook](https://github.com/Wicwik/peft_tutorial/blob/main/examples/lora_seq2seq.ipynb). We can again have a look at the memory usage.

Since the *AdapterTrainer* class inherits from the Transformers *Trainer* class, the trainer also uploads the results to *wandb*.

In [9]:
trainer = AdapterTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=default_data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()

trainer.evaluate(eval_dataset=test_dataset, metric_key_prefix="test")

if wandb.run is not None:
    wandb.finish()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.9957 | 0.804874 | 0.75 | 0.518315 |
| 2 | 0.5654 | 0.662531 | 0.821429 | 0.569195 |
| 3 | 0.3623 | 0.696662 | 0.821429 | 0.569425 |




| Metric | Value |
|---|---|
| eval/accuracy | 0.82143 |
| eval/f1 | 0.56942 |
| eval/loss | 0.69666 |
| eval/runtime | 0.1194 |
| eval/samples_per_second | 234.483 |
| eval/steps_per_second | 33.498 |
| test/accuracy | 0.75 |
| test/f1 | 0.52199 |
| test/loss | 1.12445 |
| test/runtime | 0.1157 |
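
The number of optimizer and evaluation steps behind these figures follows from the dataset sizes and batch sizes. The arithmetic below is a rough sketch that assumes a single device, no gradient accumulation, and the default evaluation batch size of 8 in `TrainingArguments` (no evaluation batch size is set in this notebook).

```python
import math

train_samples = 250   # CB training split
eval_samples = 28     # half of the 56 validation samples
train_batch_size = 32
eval_batch_size = 8   # assumption: default per_device_eval_batch_size
num_epochs = 3

steps_per_epoch = math.ceil(train_samples / train_batch_size)
total_train_steps = steps_per_epoch * num_epochs
eval_steps = math.ceil(eval_samples / eval_batch_size)

print(steps_per_epoch, total_train_steps, eval_steps)  # 8 24 4
```

With only 24 optimizer steps in total, each epoch changes the fusion weights noticeably, so some run-to-run variation in the validation and test metrics is to be expected.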


### **5. Save and load**

Now we can save the model. In addition to *save_pretrained*, we use the *save_adapter_fusion* and *save_all_adapters* methods to save the AdapterFusion layer and all adapters.

In [10]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

adapter_model_id = f"{model_name_or_path}_adapterfusion_seqcls"

model.save_pretrained(adapter_model_id)
model.save_adapter_fusion(adapter_model_id, "multinli,qqp,qnli")
model.save_all_adapters(adapter_model_id)


Now we can load the model and give it a custom example. It is important to **set the active adapters** and to **specify the head** that we want to use.

In [11]:
model = BertAdapterModel.from_pretrained(adapter_model_id)
model.set_active_adapters(Fuse("multinli", "qqp", "qnli"))

print(model.active_adapters)

inputs = tokenizer("A pity. For myself, a great pity. But no one can say Bishop Malduin has not received latitude.", 
                   "Bishop Malduin has not received latitude", 
                   return_tensors="pt"
                   )
print(inputs)
with torch.no_grad():
    logits = model(**inputs, head="cb")[0]
    class_id = torch.argmax(logits).item()
    pred_class = id2label[class_id]
    print(pred_class, class_id)

Fuse[multinli, qqp, qnli]
{'input_ids': tensor([[  101,  1037, 12063,  1012,  2005,  2870,  1010,  1037,  2307, 12063,
          1012,  2021,  2053,  2028,  2064,  2360,  3387, 15451,  8566,  2378,
          2038,  2025,  2363, 15250,  1012,   102,  3387, 15451,  8566,  2378,
          2038,  2025,  2363, 15250,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
entailment 0
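
The prediction above only reports the argmax class. If you also want a rough confidence estimate, the logits can be turned into probabilities with a softmax. The sketch below uses plain Python and hypothetical logit values; the label mapping follows the CB label order described in the dataset section.

```python
import math

id2label = {0: "entailment", 1: "contradiction", 2: "neutral"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.1, -0.3, 0.4]  # hypothetical values, not real model output
probs = softmax(logits)
class_id = max(range(len(probs)), key=probs.__getitem__)

for idx, prob in enumerate(probs):
    print(f"{id2label[idx]}: {prob:.3f}")
print("prediction:", id2label[class_id])
```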
