# 7️⃣ Training Complex Adapter Combinations

In this notebook, we explore how to easily configure complex combinations of different adapter methods with `ConfigUnion`. We show how to re-build the adapter setups used in [He et al., 2022](https://arxiv.org/pdf/2110.04366.pdf) (Mix-and-Match adapter) and [Mao et al., 2022](https://arxiv.org/pdf/2110.07577.pdf) (UniPELT).
For a basic introduction to the training setup with _Adapters_, please first refer to [the introductory training notebook](https://colab.research.google.com/github/Adapter-Hub/adapters/blob/main/notebooks/01_Adapter_Training.ipynb).

As the training task, we choose abstractive summarization on the **XSum** dataset ([Narayan et al., 2018](https://arxiv.org/pdf/1808.08745.pdf)). As the base model, we select **T5**.

## Installation

First, let's install the required libraries:

In [1]:
!pip install -qq adapters datasets evaluate nltk

## Dataset Preprocessing

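Before we start to train our adapter, we first prepare the training data. Our training dataset can be loaded via Hugging Face `datasets` using one line of code. To keep training short, we only load the first 5,000 training samples and the first 500 validation samples: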

In [9]:
from datasets import load_dataset

train_dataset = load_dataset("xsum", split="train[:5000]")
val_dataset = load_dataset("xsum", split="validation[:500]")
train_dataset.num_rows

Found cached dataset xsum (D:/Clifton/.torch/huggingface/datasets/xsum/default/1.2.0/082863bf4754ee058a5b6f6525d0cb2b18eadb62c7b370b095d1364050a52b71)

5000

Every dataset sample has an input `document`, a reference `summary`, and an `id`:

In [10]:
train_dataset[0]

 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.',
 'id': '35232142'}

Now, we need to encode all dataset samples to valid inputs for our Transformer model. Since we want to train on `t5-small`, we load the corresponding tokenizer via `AutoTokenizer`. Each document is prepended with the task prefix `summarize: `, and the reference summaries are tokenized as labels. Using `dataset.map()`, we can pass the full dataset through the tokenizer in batches:

In [11]:
model_id = "t5-small"

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)

prefix = "summarize: "
max_input_length = 1024
max_target_length = 128

def encode_batch(examples):
    """Encodes a batch of input data using the model tokenizer."""
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Set up the tokenizer for targets
    labels = tokenizer(text_target=examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Encode the input data
train_dataset = train_dataset.map(encode_batch, batched=True)
val_dataset = val_dataset.map(encode_batch, batched=True)
# Transform to pytorch tensors and only output the required columns
train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
val_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

Now we're ready to train our model...

## Training

We use a pre-trained T5 model checkpoint (`t5-small`) from the Hugging Face Hub. We load it with [`AutoAdapterModel`](https://docs.adapterhub.ml/classes/models/auto.html), a class unique to `adapters`. In addition to regular _Transformers_ classes, this class comes with all sorts of adapter-specific functionality, allowing flexible management and configuration of multiple adapters and prediction heads. [Learn more](https://docs.adapterhub.ml/prediction_heads.html#adaptermodel-classes).

In [13]:
from adapters import AutoAdapterModel

model = AutoAdapterModel.from_pretrained(model_id)

**Here comes the important part!**

We add a new adapter to our model by calling `add_adapter()`. We pass a name (`"xsum"`) and an adapter configuration. `"seq_bn"` denotes a [sequential bottleneck adapter](https://docs.adapterhub.ml/methods.html#bottleneck-adapters) configuration.
_Adapters_ supports a diverse range of adapter configurations. For example, `config="lora"` can be passed for training a [LoRA](https://docs.adapterhub.ml/methods.html#lora) adapter or `config="prefix_tuning"` for [prefix tuning](https://docs.adapterhub.ml/methods.html#prefix-tuning). You can find all currently supported configs [here](https://docs.adapterhub.ml/methods.html#prefix-tuning).

Next, we add a sequence-to-sequence language modeling head. It's convenient to give the prediction head the same name as the adapter. This allows us to activate both together in the next step. The `train_adapter()` method does two things:

1. It freezes all weights of the pre-trained model, so only the adapter weights are updated during training.
2. It activates the adapter and the prediction head such that both are used in every forward pass.

In [14]:
# Add a new adapter
model.add_adapter("xsum", config="seq_bn")
# Alternatively, e.g.:
# model.add_adapter("xsum", config="lora")

# Add a matching seq2seq language modeling head
model.add_seq2seq_lm_head("xsum")

# Freeze the base model weights and activate the adapter for training
model.train_adapter("xsum")
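The cell above trains a single bottleneck adapter. As a hedged sketch of how a combined setup could be plugged in instead, the helper below builds a Mix-and-Match style configuration (prefix tuning plus a parallel bottleneck adapter) with `ConfigUnion`. The helper name `add_mix_and_match_adapter` is hypothetical and not part of _Adapters_, and `bottleneck_size=800` is an assumed value rather than a setting taken from this notebook.

```python
def add_mix_and_match_adapter(model, name):
    """Add a prefix tuning + parallel bottleneck combination to an adapters model.

    Hypothetical helper for illustration; `model` is expected to be a model
    loaded with `AutoAdapterModel`.
    """
    # Imported inside the function so that defining the helper
    # does not require the library to be importable yet.
    from adapters import ConfigUnion, ParBnConfig, PrefixTuningConfig

    config = ConfigUnion(
        PrefixTuningConfig(bottleneck_size=800),  # assumed prefix MLP size
        ParBnConfig(),                            # parallel bottleneck adapter
    )
    model.add_adapter(name, config=config)
    return config
```

On a freshly loaded model, `add_mix_and_match_adapter(model, "xsum")` would take the place of the `model.add_adapter("xsum", config="seq_bn")` call; adding the head and calling `train_adapter("xsum")` stay the same.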

For training an adapter, we make use of the `Seq2SeqAdapterTrainer` class built into _Adapters_. This class is largely identical to the `Seq2SeqTrainer` of _Transformers_, with some helpful tweaks, e.g. for checkpointing only adapter weights.

We configure the training process using a `Seq2SeqTrainingArguments` object and define a method that will calculate the ROUGE scores on the validation set. We pass both, together with the training and validation split of our dataset, to the trainer instance.

**Note the differences in hyperparameters compared to full fine-tuning.** Adapter training usually requires a few more training epochs than full fine-tuning.

In [25]:
import numpy as np
from transformers import Seq2SeqTrainingArguments, DataCollatorForSeq2Seq
from adapters import Seq2SeqAdapterTrainer

training_args = Seq2SeqTrainingArguments(
    learning_rate=5e-5,
    num_train_epochs=6,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
    logging_steps=200,
    output_dir="./training_output",
    overwrite_output_dir=True,
    predict_with_generate=True,
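    # T5 checkpoints are known to be prone to overflow in fp16. If the loss is logged
    # as 0.0 or nan, try bf16=True (on supported hardware) or drop mixed precision.
    fp16=True,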
)
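As a quick sanity check on these settings, the number of optimizer steps can be worked out by hand. The sketch below assumes a single device and no gradient accumulation (neither is set in the arguments above); under these assumptions the result matches the `global_step=1878` that `trainer.train()` reports.

```python
import math

num_train_samples = 5000          # size of the training subset loaded earlier
per_device_train_batch_size = 16  # from the training arguments
num_train_epochs = 6              # from the training arguments

# The last batch of each epoch is smaller, so we round up.
steps_per_epoch = math.ceil(num_train_samples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)  # 313 1878
```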

In [26]:
# This is copied & adapted from https://github.com/huggingface/notebooks/blob/main/examples/summarization.ipynb
from evaluate import load
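import nltk

# sent_tokenize needs the Punkt sentence tokenizer data
# (recent NLTK releases may ask for "punkt_tab" instead).
nltk.download("punkt", quiet=True)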

metric = load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    # Note that other metrics may not have a `use_aggregator` parameter
    # and thus will return a list, computing a metric for each sentence.
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True, use_aggregator=True)
    # Extract a few results
    result = {key: value * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
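To give an intuition for what the ROUGE scores measure, here is a deliberately simplified ROUGE-1 F1 computation in plain Python. It is only an illustration: it splits on whitespace and skips the stemming that the `rouge` metric applies with `use_stemmer=True`, so its numbers will not match the metric exactly. The function name `rouge1_f1` is made up for this sketch.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified unigram-overlap F1 (no stemming, whitespace tokenization)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared unigram at most as often
    # as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat", "the cat sat on the mat")
print(round(score, 4))  # 0.6667
```

Note that an empty prediction yields a score of 0.0, which is also what the real metric reports when the model generates nothing.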

In [27]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqAdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
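The data collator pads every batch to a common length, and by default it fills the label positions with `-100` so that they are ignored by the loss. This is why `compute_metrics` swaps `-100` back to the pad token before decoding. A minimal plain-Python sketch of both steps follows; the token ids in the example batch are made up, and the helper names are hypothetical.

```python
LABEL_PAD_ID = -100  # default label padding value of DataCollatorForSeq2Seq
PAD_TOKEN_ID = 0     # pad token id used by the T5 tokenizer


def pad_labels(batch, pad_value=LABEL_PAD_ID):
    """Right-pad all label sequences to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in batch]


def restore_pad_tokens(batch, pad_token_id=PAD_TOKEN_ID):
    """Replace the loss-masking value with a decodable pad token id."""
    return [[pad_token_id if tok == LABEL_PAD_ID else tok for tok in seq] for seq in batch]


labels = [[71, 1712, 1], [71, 1]]  # made-up token ids, 1 = end of sequence
padded = pad_labels(labels)
print(padded)                      # [[71, 1712, 1], [71, 1, -100]]
print(restore_pad_tokens(padded))  # [[71, 1712, 1], [71, 1, 0]]
```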

Start the training 🚀

In [28]:
trainer.train()

{'loss': 0.0, 'learning_rate': 4.9068157614483495e-05, 'epoch': 0.64}




{'eval_loss': nan, 'eval_rouge1': 0.0, 'eval_rouge2': 0.0, 'eval_rougeL': 0.0, 'eval_rougeLsum': 0.0, 'eval_gen_len': 0.0, 'eval_runtime': 42.0924, 'eval_samples_per_second': 11.879, 'eval_steps_per_second': 0.76, 'epoch': 1.0}
{'loss': 0.0, 'learning_rate': 4.37433439829606e-05, 'epoch': 1.28}
{'loss': 0.0, 'learning_rate': 3.8418530351437704e-05, 'epoch': 1.92}




{'eval_loss': nan, 'eval_rouge1': 0.0, 'eval_rouge2': 0.0, 'eval_rougeL': 0.0, 'eval_rougeLsum': 0.0, 'eval_gen_len': 0.0, 'eval_runtime': 41.2063, 'eval_samples_per_second': 12.134, 'eval_steps_per_second': 0.777, 'epoch': 2.0}
{'loss': 0.0, 'learning_rate': 3.3093716719914805e-05, 'epoch': 2.56}


{'eval_loss': nan, 'eval_rouge1': 0.0, 'eval_rouge2': 0.0, 'eval_rougeL': 0.0, 'eval_rougeLsum': 0.0, 'eval_gen_len': 0.0, 'eval_runtime': 42.6673, 'eval_samples_per_second': 11.719, 'eval_steps_per_second': 0.75, 'epoch': 3.0}
{'loss': 0.0, 'learning_rate': 2.7768903088391906e-05, 'epoch': 3.19}
{'loss': 0.0, 'learning_rate': 2.244408945686901e-05, 'epoch': 3.83}




{'eval_loss': nan, 'eval_rouge1': 0.0, 'eval_rouge2': 0.0, 'eval_rougeL': 0.0, 'eval_rougeLsum': 0.0, 'eval_gen_len': 0.0, 'eval_runtime': 42.753, 'eval_samples_per_second': 11.695, 'eval_steps_per_second': 0.748, 'epoch': 4.0}
{'loss': 0.0, 'learning_rate': 1.7119275825346114e-05, 'epoch': 4.47}




{'eval_loss': nan, 'eval_rouge1': 0.0, 'eval_rouge2': 0.0, 'eval_rougeL': 0.0, 'eval_rougeLsum': 0.0, 'eval_gen_len': 0.0, 'eval_runtime': 40.9707, 'eval_samples_per_second': 12.204, 'eval_steps_per_second': 0.781, 'epoch': 5.0}
{'loss': 0.0, 'learning_rate': 1.1794462193823217e-05, 'epoch': 5.11}
{'loss': 0.0, 'learning_rate': 6.469648562300319e-06, 'epoch': 5.75}


{'eval_loss': nan, 'eval_rouge1': 0.0, 'eval_rouge2': 0.0, 'eval_rougeL': 0.0, 'eval_rougeLsum': 0.0, 'eval_gen_len': 0.0, 'eval_runtime': 41.6873, 'eval_samples_per_second': 11.994, 'eval_steps_per_second': 0.768, 'epoch': 6.0}
{'train_runtime': 1227.0532, 'train_samples_per_second': 24.449, 'train_steps_per_second': 1.53, 'train_loss': 0.0, 'epoch': 6.0}


TrainOutput(global_step=1878, training_loss=0.0, metrics={'train_runtime': 1227.0532, 'train_samples_per_second': 24.449, 'train_steps_per_second': 1.53, 'train_loss': 0.0, 'epoch': 6.0})

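Training has finished, but the log above does not look healthy: the training loss is reported as 0.0, the evaluation loss as `nan`, and all ROUGE scores as 0.0. This pattern is typical of numerical overflow, which T5 checkpoints are known to be prone to under `fp16=True`. Let's evaluate our adapter on the validation split of the dataset to confirm how well it learned: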

In [20]:
trainer.evaluate()

{'eval_loss': nan,
 'eval_rouge1': 0.0,
 'eval_rouge2': 0.0,
 'eval_rougeL': 0.0,
 'eval_rougeLsum': 0.0,
 'eval_gen_len': 0.0,
 'eval_runtime': 35.1868,
 'eval_samples_per_second': 14.21,
 'eval_steps_per_second': 0.909,
 'epoch': 6.0}

We can put our trained model into a _Transformers_ pipeline to be able to make new predictions conveniently:

In [24]:
from transformers import SummarizationPipeline

summarizer = SummarizationPipeline(model=model, tokenizer=tokenizer, device=training_args.device.index)

summarizer("""The film about a princess's mythical journey in ancient Polynesia took an estimated $81.1m (£65.3m) on its debut. That makes it the second-highest Thanksgiving debut of all time, behind Disney's Frozen, which took $93.6m (£75.3m) on its release in 2013. Some observers have said that Moana and its merchandise are appropriating Pacific Island culture. Disney withdrew a children's costume promoting the film after activists branded it "brownface", or mocking of their culture by stereotyping. The costume, a full-body suit with brown skin, traditional tattoos, grass skirt and bone necklace, represented the character Maui, considered a demi-god and ancestor by many Polynesians. Disney said it regretted any offence. JK Rowling's Fantastic Beasts and Where to Find Them fell to second on the US chart, taking $65.8m (£53m). Gossip surrounding Brad Pitt's marriage break-up failed to spark a huge amount of interest in his World War Two romance Allied, which also stars Marion Cotillard. It took $18m (£14.4m) over the long weekend, having cost $85m (£68.5m) to make, landing in fourth spot behind Doctor Strange. Kyle Davies, Paramount's head of domestic distribution, said the film appealed to "older audiences" but noted those "don't storm the theatres [on] weekend one". "I think they're going to take their time," he added. Warren Beatty fared worse - his first film in 15 years, the 1950s Hollywood comedy Rules Don't Apply, took just $2.2m (£1.7m). The film is Beatty's first directed feature since 1998's Bulworth. Bad Santa 2, released 13 years after the original and again starring Billy Bob Thornton, did a little better, taking $9m (£7.3m). Follow us on Facebook, on Twitter @BBCNewsEnts, or on Instagram at bbcnewsents. If you have a story suggestion email entertainment.news@bbc.co.uk.""")

The model 'T5AdapterModel' is not supported for . Supported models are ['BartForConditionalGeneration', 'BigBirdPegasusForConditionalGeneration', 'BlenderbotForConditionalGeneration', 'BlenderbotSmallForConditionalGeneration', 'EncoderDecoderModel', 'FSMTForConditionalGeneration', 'GPTSanJapaneseForConditionalGeneration', 'LEDForConditionalGeneration', 'LongT5ForConditionalGeneration', 'M2M100ForConditionalGeneration', 'MarianMTModel', 'MBartForConditionalGeneration', 'MT5ForConditionalGeneration', 'MvpForConditionalGeneration', 'NllbMoeForConditionalGeneration', 'PegasusForConditionalGeneration', 'PegasusXForConditionalGeneration', 'PLBartForConditionalGeneration', 'ProphetNetForConditionalGeneration', 'SeamlessM4TForTextToText', 'SwitchTransformersForConditionalGeneration', 'T5ForConditionalGeneration', 'UMT5ForConditionalGeneration', 'XLMProphetNetForConditionalGeneration'].


[{'summary_text': ''}]

Finally, we can also extract the adapter from our model and save it separately for later reuse. Note the size difference compared to a full model!

In [None]:
model.save_adapter("./final_adapter", "xsum")
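To get a feeling for the size difference, the adapter's parameter count can be estimated with a little arithmetic. The sketch below rests on several assumptions: the `"seq_bn"` configuration is used with its default reduction factor of 16, one bottleneck module (a down- and an up-projection, each with a bias) is inserted per Transformer block, and `t5-small` has a hidden size of 512 with 6 encoder and 6 decoder blocks. The weights of the prediction head, which may be saved alongside the adapter, are not counted.

```python
d_model = 512          # hidden size of t5-small
reduction_factor = 16  # assumed default of the "seq_bn" configuration
num_blocks = 6 + 6     # encoder + decoder blocks of t5-small

bottleneck = d_model // reduction_factor

# Down-projection (weights + bias) plus up-projection (weights + bias)
params_per_adapter = (d_model * bottleneck + bottleneck) + (bottleneck * d_model + d_model)
total_adapter_params = params_per_adapter * num_blocks

print(bottleneck, params_per_adapter, total_adapter_params)  # 32 33312 399744
```

Under these assumptions the adapter holds roughly 0.4 million parameters, well under 1% of the roughly 60 million parameters of `t5-small`.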

**Share your work!**

The final step after successful training is to share our adapter with the world!
_Adapters_ seamlessly integrates with the [Hugging Face Model Hub](https://huggingface.co/models), so you can publish your trained adapter with a single method call:

**Important:** Make sure you're properly authenticated with your Hugging Face account before running this method. You can log in by running `huggingface-cli login` on your terminal.

In [None]:
model.push_adapter_to_hub(
    "my-awesome-adapter",
    "xsum",
    adapterhub_tag="sum/xsum",
)

This will create a repository _my-awesome-adapter_ under your username, generate a default adapter card as README.md and upload the adapter named `xsum` together with the adapter card to the new repository. Passing `adapterhub_tag` is required to make sure your adapter is featured on [adapterhub.ml/explore](https://adapterhub.ml/explore), our Hub page. [Learn more](https://docs.adapterhub.ml/huggingface_hub.html).

➡️ Continue with [the next Colab notebook](https://colab.research.google.com/github/Adapter-Hub/adapters/blob/main/notebooks/02_Adapter_Inference.ipynb) to learn how to use adapters from the Hub.