# 7️⃣ Training Complex Adapter Combinations

In this notebook, we explore how to easily configure complex combinations of different adapter methods with `ConfigUnion`. We show how to re-build the adapter setup used in Mao et al., 2022](https://arxiv.org/pdf/2110.07577.pdf) (UniPELT).
For a basic introduction into the training setup with _Adapters_, please first refer to [the introductory training notebook](https://colab.research.google.com/github/Adapter-Hub/adapters/blob/main/notebooks/01_Adapter_Training.ipynb).

As training task, we chose abstractive summarization on the **XSum** dataset ([Narayan et al., 2018](https://arxiv.org/pdf/1808.08745.pdf)). As base model, we select **T5**.

## Installation

First, let's install the required libraries:

In [11]:
!pip install -qq adapters datasets evaluate nltk accelerate rouge_score

  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone


## Dataset Preprocessing

Before we start to train our adapter, we first prepare the training data. The XSum dataset can be loaded via HuggingFace `datasets`.

**Note:** To keep training time short in this notebook, we only load a small subset of the full dataset. For good results, make sure to train on the full dataset.

In [None]:
from datasets import load_dataset

train_dataset = load_dataset("xsum", split="train[:5000]")
val_dataset = load_dataset("xsum", split="validation[:500]")
train_dataset.num_rows

Every dataset sample has an input document and a short summary:

In [3]:
train_dataset[0]

 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.',
 'id': '35232142'}

In this example, we use a `t5-small` model for faster training. Feel free to use a larger checkpoint for better results.

In [4]:
model_id = "t5-small"

Now, we need to encode all dataset samples to valid inputs for our Transformer model. We load the tokenizer matching the model we want to train. Using `dataset.map()`, we can pass the full dataset through the tokenizer in batches:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)

prefix = "summarize: "
max_input_length = 1024
max_target_length = 128

def encode_batch(examples):
  """Encodes a batch of input data using the model tokenizer."""
  inputs = [prefix + doc for doc in examples["document"]]
  model_inputs = tokenizer(inputs, max_length=None, padding="max_length", truncation=True)

  # Setup the tokenizer for targets
  labels = tokenizer(text_target=examples["summary"], max_length=max_target_length, padding="max_length", truncation=True)

  labels["input_ids"] = [
      [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
  ]

  model_inputs["labels"] = labels["input_ids"]
  return model_inputs

# Encode the input data
train_dataset = train_dataset.map(encode_batch, batched=True)
val_dataset = val_dataset.map(encode_batch, batched=True)
# Transform to pytorch tensors and only output the required columns
train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
val_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

Now we're ready to train our model...

## Training

We use a pre-trained T5 model checkpoint from the Hugging Face Hub. We load it with [`AutoAdapterModel`](https://docs.adapterhub.ml/classes/models/auto.html), which comes built-in with all adapter functionality. [Learn more](https://docs.adapterhub.ml/prediction_heads.html#adaptermodel-classes).

In [286]:
from adapters import AutoAdapterModel

model = AutoAdapterModel.from_pretrained(model_id)

**Now we're ready to build our adapter setup!**

The UniPELT framework ([Mao et al., 2022](https://arxiv.org/pdf/2110.07577.pdf)) presented one approach of combining multiple types of adapter blocks in a single, combined setup.
Visualized, UniPELT looks roughly like this:

<div>
<img src="https://docs.adapterhub.ml/_images/unipelt.png" width="250"/>
</div>

We can see that UniPELT is built from three well-known single adapter methods: 1) LoRA, 2) Prefix Tuning and 3) Sequential Bottleneck.

_Adapters_ provides an easy way to flexibly build these composed configurations: [`ConfigUnion`](https://docs.adapterhub.ml/classes/adapter_config.html#adapters.ConfigUnion). `ConfigUnion` basically acts as a container holding multiple child adapter configs. [Learn more](https://docs.adapterhub.ml/method_combinations.html).

With `ConfigUnion`, we can define UniPELT as follows:

In [None]:
from adapters import ConfigUnion, PrefixTuningConfig, SeqBnConfig, LoRAConfig

config = ConfigUnion(
    LoRAConfig(r=8, use_gating=True),
    PrefixTuningConfig(prefix_length=10, use_gating=True),
    SeqBnConfig(reduction_factor=16, use_gating=True),
)


We now add a new adapter to our model by calling `add_adapter()`. We pass the name (`"xsum"`) and the adapter configuration we defined using `ConfigUnion`.

Next, we add a seq2seq head. It's convenient to give the prediction head the same name as the adapter. This allows us to activate both together in the next step. The `train_adapter()` method does two things:

1. It freezes all weights of the pre-trained model, so only the adapter weights are updated during training.
2. It activates the adapter and the prediction head such that both are used in every forward pass.

In [288]:
model.add_adapter("xsum", config=config)

# Add a matching classification head
model.add_seq2seq_lm_head("xsum")

# Activate the adapter
model.train_adapter("xsum")

For training an adapter, we make use of the `Seq2SeqAdapterTrainer` class built-in into _Adapters_. This class is largely identical to _Transformer_'s `Seq2SeqTrainer`, with some helpful tweaks e.g. for checkpointing only adapter weights.

We configure the training process using a `Seq2SeqTrainingArguments` object and define a method that will calculate the evaluation accuracy in the end. We pass both, together with the training and validation split of our dataset, to the trainer instance.

**Note the differences in hyperparameters compared to full fine-tuning.** Adapter training usually requires a few more training epochs than full fine-tuning.

In [289]:
import numpy as np
from transformers import Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer
from adapters import Seq2SeqAdapterTrainer

training_args = Seq2SeqTrainingArguments(
    learning_rate=5e-5,
    num_train_epochs=6,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy = "epoch",
    logging_steps=200,
    output_dir="./training_output",
    overwrite_output_dir=True,
    predict_with_generate=True,
    fp16=True,
    remove_unused_columns=False,
    label_names=["labels"]
)

Some additional logic for computing metrics:

In [290]:
# This is copied & adapted from https://github.com/huggingface/notebooks/blob/main/examples/summarization.ipynb
from evaluate import load
import nltk

nltk.download("punkt")

metric = load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    # Note that other metrics may not have a `use_aggregator` parameter
    # and thus will return a list, computing a metric for each sentence.
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True, use_aggregator=True)
    # Extract a few results
    result = {key: value * 100 for key, value in result.items()}

    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

In [291]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqAdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Start the training 🚀

In [293]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,4.7499,3.782296,22.1087,3.8261,17.7176,17.7006,18.83
2,4.0141,3.535523,22.9132,4.2045,17.9206,17.9027,18.976
3,3.9066,3.440636,23.4915,4.4941,18.4859,18.5154,18.964
4,3.8267,3.392781,23.6444,4.4616,18.4768,18.5023,18.994
5,3.8025,3.366829,23.5161,4.3976,18.2771,18.2917,18.998
6,3.7694,3.359185,23.3824,4.4385,18.2622,18.2751,18.99




TrainOutput(global_step=1878, training_loss=3.978136690279927, metrics={'train_runtime': 985.7264, 'train_samples_per_second': 30.434, 'train_steps_per_second': 1.905, 'total_flos': 5071442595840000.0, 'train_loss': 3.978136690279927, 'epoch': 6.0})

Looks good! Let's evaluate our adapter on the validation split of the dataset to see how well it learned:

In [294]:
trainer.evaluate()

{'eval_loss': 3.359184503555298,
 'eval_rouge1': 23.3824,
 'eval_rouge2': 4.4385,
 'eval_rougeL': 18.2622,
 'eval_rougeLsum': 18.2751,
 'eval_gen_len': 18.99,
 'eval_runtime': 38.6004,
 'eval_samples_per_second': 12.953,
 'eval_steps_per_second': 0.829,
 'epoch': 6.0}

We can put our trained model into a _Transformers_ pipeline to be able to make new predictions conveniently:

In [295]:
from transformers import SummarizationPipeline

summarizer = SummarizationPipeline(model=model, tokenizer=tokenizer, device=training_args.device.index)

summarizer("""The film about a princess's mythical journey in ancient Polynesia took an estimated $81.1m (£65.3m) on its debut. That makes it the second-highest Thanksgiving debut of all time, behind Disney's Frozen, which took $93.6m (£75.3m) on its release in 2013. Some observers have said that Moana and its merchandise are appropriating Pacific Island culture. Disney withdrew a children's costume promoting the film after activists branded it "brownface", or mocking of their culture by stereotyping. The costume, a full-body suit with brown skin, traditional tattoos, grass skirt and bone necklace, represented the character Maui, considered a demi-god and ancestor by many Polynesians. Disney said it regretted any offence. JK Rowling's Fantastic Beasts and Where to Find Them fell to second on the US chart, taking $65.8m (£53m). Gossip surrounding Brad Pitt's marriage break-up failed to spark a huge amount of interest in his World War Two romance Allied, which also stars Marion Cotillard. It took $18m (£14.4m) over the long weekend, having cost $85m (£68.5m) to make, landing in fourth spot behind Doctor Strange. Kyle Davies, Paramount's head of domestic distribution, said the film appealed to "older audiences" but noted those "don't storm the theatres [on] weekend one". "I think they're going to take their time," he added. Warren Beatty fared worse - his first film in 15 years, the 1950s Hollywood comedy Rules Don't Apply, took just $2.2m (£1.7m). The film is Beatty's first directed feature since 1998's Bulworth. Bad Santa 2, released 13 years after the original and again starring Billy Bob Thornton, did a little better, taking $9m (£7.3m). Follow us on Facebook, on Twitter @BBCNewsEnts, or on Instagram at bbcnewsents. If you have a story suggestion email entertainment.news@bbc.co.uk.""")

The model 'T5AdapterModel' is not supported for . Supported models are ['BartForConditionalGeneration', 'BigBirdPegasusForConditionalGeneration', 'BlenderbotForConditionalGeneration', 'BlenderbotSmallForConditionalGeneration', 'EncoderDecoderModel', 'FSMTForConditionalGeneration', 'GPTSanJapaneseForConditionalGeneration', 'LEDForConditionalGeneration', 'LongT5ForConditionalGeneration', 'M2M100ForConditionalGeneration', 'MarianMTModel', 'MBartForConditionalGeneration', 'MT5ForConditionalGeneration', 'MvpForConditionalGeneration', 'NllbMoeForConditionalGeneration', 'PegasusForConditionalGeneration', 'PegasusXForConditionalGeneration', 'PLBartForConditionalGeneration', 'ProphetNetForConditionalGeneration', 'SeamlessM4TForTextToText', 'SwitchTransformersForConditionalGeneration', 'T5ForConditionalGeneration', 'UMT5ForConditionalGeneration', 'XLMProphetNetForConditionalGeneration'].


[{'summary_text': 'The Disney Princess has a new film that starred in the film "The Darkness of'}]

At last, we can also extract the adapter from our model and separately save it for later reuse. Note the size difference compared to a full model!

In [285]:
model.save_adapter("./final_adapter", "xsum")

**Share your work!**

The final step after successful training is to share our adapter with the world!
_Adapters_ seamlessly integrates with the [Hugging Face Model Hub](https://huggingface.co/models), so you can publish your trained adapter with a single method call:

**Important:** Make sure you're properly authenticated with your Hugging Face account before running this method. You can log in by running `huggingface-cli login` on your terminal.

In [None]:
model.push_adapter_to_hub(
    "my-awesome-adapter",
    "xsum",
)

This will create a repository _my-awesome-adapter_ under your username, generate a default adapter card as README.md and upload the adapter named `xsum` together with the adapter card to the new repository. [Learn more](https://docs.adapterhub.ml/huggingface_hub.html).