# 4️⃣ Zero-Shot Cross-Lingual Transfer using Adapters

Beyond AdapterFusion, which we trained in [the previous notebook](https://github.com/Adapter-Hub/adapters/blob/master/notebooks/04_Cross_Lingual_Transfer.ipynb), adapters can also be composed for zero-shot cross-lingual transfer, where a task learned in one language is applied to another language. We will use the stacked adapter setup presented in **MAD-X** ([Pfeiffer et al., 2020](https://arxiv.org/pdf/2005.00052.pdf)) for this purpose.

In this example, the base model is a pre-trained multilingual **XLM-R** (`xlm-roberta-base`) ([Conneau et al., 2019](https://arxiv.org/pdf/1911.02116.pdf)) model. Additionally, two types of adapters, language adapters and task adapters, are used. Here's how the MAD-X process works in detail:

1. Train language adapters for the source and target language on a language modeling task. In this notebook, we won't train them ourselves but use [pre-trained language adapters from the Hub](https://adapterhub.ml/explore/text_lang/).
2. Train a task adapter on the target task dataset in the source language. This task adapter is **stacked** on top of the source language adapter from step 1. During this step, only the weights of the task adapter are updated.
3. Perform zero-shot cross-lingual transfer. In this last step, we simply replace the source language adapter with the target language adapter while keeping the stacked task adapter.
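
The three steps above can be pictured with a toy sketch. It does not use the `adapters` library: each "adapter" is a plain function that records its name, and `make_adapter` and `stack` are hypothetical helpers invented for this illustration. The point is only the order of application and the fact that the transfer step swaps the language adapter while reusing the same task adapter.

```python
# Toy illustration of the MAD-X stack. The "adapters" are plain functions,
# not neural modules, and all helper names here are hypothetical.

def make_adapter(name):
    """Return a stand-in adapter that appends its name to a trace."""
    def adapter(trace):
        return trace + [name]
    return adapter

def stack(*adapters):
    """Compose adapters so that the first one listed is applied first."""
    def forward(trace):
        for adapter in adapters:
            trace = adapter(trace)
        return trace
    return forward

# Step 1: language adapters for the source and target language (pre-trained).
lang_en = make_adapter("en")
lang_zh = make_adapter("zh")

# Step 2: the task adapter is trained while stacked on the source language adapter.
task_copa = make_adapter("copa")
train_setup = stack(lang_en, task_copa)

# Step 3: swap only the language adapter for zero-shot transfer.
transfer_setup = stack(lang_zh, task_copa)

print(train_setup(["base"]))     # ['base', 'en', 'copa']
print(transfer_setup(["base"]))  # ['base', 'zh', 'copa']
```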

Now to our concrete example: we select **XCOPA** ([Ponti et al., 2020](https://ducdauge.github.io/files/xcopa.pdf)), a multilingual extension of the **COPA** commonsense reasoning dataset ([Roemmele et al., 2011](https://people.ict.usc.edu/~gordon/publications/AAAI-SPRING11A.PDF)), as our target task. The task adapter is trained on the original **English** dataset and then transferred to **Chinese**.

## Installation

Besides `adapters`, we use HuggingFace's `datasets` library for loading the data; the cell below also installs `accelerate`. So let's install all three first:

In [1]:
!pip install -Uq adapters
!pip install -Uq datasets
!pip install -Uq accelerate

## Dataset Preprocessing

We need the English COPA dataset for training our task adapter. It is part of the SuperGLUE benchmark and can be loaded via `datasets` using one line of code:

In [2]:
from datasets import load_dataset
from adapters.composition import Stack

dataset_en = load_dataset("super_glue", "copa")
dataset_en.num_rows

{'train': 400, 'validation': 100, 'test': 500}

Every dataset sample has a premise, a question and two possible answer choices:

In [3]:
dataset_en['train'].features

{'premise': Value(dtype='string', id=None),
 'choice1': Value(dtype='string', id=None),
 'choice2': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(names=['choice1', 'choice2'], id=None)}

In this example, we model COPA as a multiple-choice task with two choices. Thus, we encode the premise and question as well as both choices as one input to our `xlm-roberta-base` model. Using `dataset.map()`, we can pass the full dataset through the tokenizer in batches:

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def encode_batch(examples):
  """Encodes a batch of input data using the model tokenizer."""
  all_encoded = {"input_ids": [], "attention_mask": []}
  # Iterate through all examples in this batch
  for premise, question, choice1, choice2 in zip(examples["premise"], examples["question"], examples["choice1"], examples["choice2"]):
    sentences_a = [premise + " " + question for _ in range(2)]
    # Both answer choices are passed in an array according to the format needed for the multiple-choice prediction head
    sentences_b = [choice1, choice2]
    encoded = tokenizer(
        sentences_a,
        sentences_b,
        max_length=60,
        truncation=True,
        padding="max_length",
    )
    all_encoded["input_ids"].append(encoded["input_ids"])
    all_encoded["attention_mask"].append(encoded["attention_mask"])
  return all_encoded

def preprocess_dataset(dataset):
  # Encode the input data
  dataset = dataset.map(encode_batch, batched=True)
  # The transformers model expects the target class column to be named "labels"
  dataset = dataset.rename_column("label", "labels")
  # Only output the columns the model needs
  dataset.set_format(columns=["input_ids", "attention_mask", "labels"])
  return dataset

dataset_en = preprocess_dataset(dataset_en)
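
To make the input format concrete, here is a minimal sketch of the pairing logic in `encode_batch`, without the tokenizer. The example texts are made up, and `build_pairs` is a hypothetical helper that only mirrors how `sentences_a` and `sentences_b` are built above.

```python
# One COPA-style example; the texts are invented for illustration.
example = {
    "premise": "The ground was wet.",
    "question": "cause",
    "choice1": "It had rained.",
    "choice2": "The sun was shining.",
}

def build_pairs(example):
    """Return one (sentence_a, sentence_b) pair per answer choice."""
    # The same premise + question string is repeated for every choice.
    sentence_a = example["premise"] + " " + example["question"]
    return [(sentence_a, example[key]) for key in ("choice1", "choice2")]

pairs = build_pairs(example)
for sentence_a, sentence_b in pairs:
    print(repr(sentence_a), "->", repr(sentence_b))
```

After tokenization, each example therefore holds two sequences of `max_length` token ids, one per answer choice.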

## Task Adapter Training

In this section, we will train the task adapter on the English COPA dataset. We use a pre-trained XLM-R model from HuggingFace and instantiate our model using `AutoAdapterModel`.

In [5]:
from adapters import AutoAdapterModel
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "xlm-roberta-base",
)
model = AutoAdapterModel.from_pretrained(
    "xlm-roberta-base",
    config=config,
)

Now we only need to set up the adapters. As described, we need two language adapters (which are assumed to be pre-trained in this example) and a task adapter (which will be trained in a few moments).

First, we load both the language adapters for our source language English (`"en"`) and our target language Chinese (`"zh"`) from the Hub. Then we add a new task adapter (`"copa"`) for our target task.

Finally, we add a multiple-choice head with the same name as our task adapter on top.

In [6]:
from adapters import AdapterConfig

# Load the language adapters
lang_adapter_config = AdapterConfig.load("seq_bn", reduction_factor=2)
model.load_adapter("en/wiki@ukp", config=lang_adapter_config)
model.load_adapter("zh/wiki@ukp", config=lang_adapter_config)

# Add a new task adapter
model.add_adapter("copa")

# Add a classification head for our target task
model.add_multiple_choice_head("copa", num_choices=2)
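
As a rough guide to what `reduction_factor=2` implies, the sketch below counts the parameters of a bottleneck whose width is the hidden size divided by the reduction factor. It assumes one down-projection and one up-projection per transformer layer and uses the hidden size (768) and layer count (12) of `xlm-roberta-base`. Any further components an adapter may contain are ignored, so the real count can differ. `bottleneck_params` is a hypothetical helper, not part of `adapters`.

```python
def bottleneck_params(hidden_size, reduction_factor):
    """Weights and biases of one down-projection plus one up-projection."""
    bottleneck = hidden_size // reduction_factor
    down = hidden_size * bottleneck + bottleneck  # weight + bias
    up = bottleneck * hidden_size + hidden_size   # weight + bias
    return down + up

hidden_size = 768  # hidden size of xlm-roberta-base
num_layers = 12    # number of transformer layers in xlm-roberta-base

per_layer = bottleneck_params(hidden_size, reduction_factor=2)
print(per_layer)               # 590976
print(per_layer * num_layers)  # 7091712

# For comparison only: a larger reduction factor gives a much smaller bottleneck.
print(bottleneck_params(hidden_size, reduction_factor=16))  # 74544
```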

Using `train_adapter()`, we tell our model to train only the task adapter from here on. This call freezes the weights of the pre-trained model and of the language adapters so that they are not fine-tuned any further.

In [7]:
model.train_adapter("copa")

We want the task adapter to be stacked on top of the language adapter, so we have to tell our model to use this setup via the `active_adapters` property.

A stack of adapters is represented by the `Stack` class, which takes the names of the adapters to be stacked as arguments.
Of course, there are various other possibilities to compose adapters beyond stacking. Learn more about those [in our documentation](https://docs.adapterhub.ml/adapter_composition.html).

In [8]:
# Activate the stacked setup: language adapter first, task adapter second
model.active_adapters = Stack("en", "copa")

Great! Now, the input will be passed through the English language adapter first and the COPA task adapter second in every forward pass.

For training, we use the `AdapterTrainer` class built into `adapters`. We configure the training process using a `TrainingArguments` object.

Because the splits of English COPA in SuperGLUE differ slightly from those of XCOPA, we train on both the train and validation splits of the English dataset. Later, we will evaluate on the test split of XCOPA.

In [9]:
from adapters import AdapterTrainer
from transformers import TrainingArguments
from datasets import concatenate_datasets

training_args = TrainingArguments(
    learning_rate=1e-4,
    num_train_epochs=8,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_steps=100,
    output_dir="./training_output",
    overwrite_output_dir=True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=False,
)

train_dataset = concatenate_datasets([dataset_en["train"], dataset_en["validation"]])

trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
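
The number of optimizer steps implied by these settings can be worked out by hand. The sketch below assumes a single device, no gradient accumulation, and that the last, smaller batch of each epoch is kept.

```python
import math

num_examples = 400 + 100  # train + validation split of English COPA
batch_size = 32           # per_device_train_batch_size
num_epochs = 8            # num_train_epochs

# Assumes a single device and no gradient accumulation.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 16 128
```

With `logging_steps=100`, a run of 128 steps logs the training loss only once, at step 100.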

Start the training 🚀 (this will take a while)

In [10]:
trainer.train()



| Step | Training Loss |
|------|---------------|
| 100  | 0.6971        |


TrainOutput(global_step=128, training_loss=0.6968793570995331, metrics={'train_runtime': 55.5306, 'train_samples_per_second': 72.032, 'train_steps_per_second': 2.305, 'total_flos': 293495135040000.0, 'train_loss': 0.6968793570995331, 'epoch': 8.0})
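
As a reading aid for the loss above: a model that gives both answer choices equal probability has a cross-entropy loss of ln 2. The sketch below compares that value with the `train_loss` printed in the `TrainOutput`. A loss this close to ln 2 suggests that this particular run had not moved far from chance level.

```python
import math

# Cross-entropy of a model that assigns probability 0.5 to each of two choices.
chance_loss = -math.log(0.5)
reported_loss = 0.6968793570995331  # 'train_loss' from the TrainOutput above

print(round(chance_loss, 4))                  # 0.6931
print(round(reported_loss - chance_loss, 4))  # 0.0037
```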

## Cross-lingual transfer

With the model and all adapters trained and ready, we can move on to the cross-lingual transfer step. We will evaluate our setup on the Chinese split of the XCOPA dataset.
To do so, we first download the data and preprocess it in the same way as the English dataset:

In [11]:
# Newer `datasets` releases use `verification_mode` in place of `ignore_verifications`
dataset_zh = load_dataset("xcopa", "zh", verification_mode="no_checks")
dataset_zh = preprocess_dataset(dataset_zh)
print(dataset_zh["test"][0])



{'labels': 0, 'input_ids': [[0, 6, 4941, 55359, 1173, 7825, 30638, 90132, 44507, 5702, 1562, 30, 22304, 2, 2, 6, 3800, 2165, 15068, 69175, 30, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 6, 4941, 55359, 1173, 7825, 30638, 90132, 44507, 5702, 1562, 30, 22304, 2, 2, 6, 3800, 2165, 1128, 30, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
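
The printed sample shows the effect of `padding="max_length"`: each sequence is filled up to 60 positions with the padding id `1`, and the attention mask is `0` at exactly those positions. A minimal sketch of that relationship follows. `pad` is a hypothetical helper, and the token ids are copied from the first choice in the output above.

```python
PAD_ID = 1       # padding id visible in the printed input_ids above
MAX_LENGTH = 60  # same max_length as in encode_batch

def pad(ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Right-pad a list of token ids and build the matching attention mask."""
    ids = ids[:max_length]  # crude truncation, simpler than the tokenizer's
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, [1] * len(ids) + [0] * n_pad

# Unpadded ids of the first choice, copied from the output above.
ids = [0, 6, 4941, 55359, 1173, 7825, 30638, 90132, 44507, 5702, 1562,
       30, 22304, 2, 2, 6, 3800, 2165, 15068, 69175, 30, 2]
input_ids, attention_mask = pad(ids)
print(len(input_ids), sum(attention_mask))  # 60 22
```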


Next, let's adapt our setup to the new language. We simply replace the English language adapter with the Chinese language adapter we already loaded previously. The task adapter we just trained is kept. Again, we set this architecture using `active_adapters`:

In [12]:
model.active_adapters = Stack("zh", "copa")

Finally, let's see how well our adapter setup performs on the new language. We measure the zero-shot accuracy on the test split of the target language dataset. Evaluation is also performed using the `AdapterTrainer` class.

In [13]:
import numpy as np
from transformers import EvalPrediction

def compute_accuracy(p: EvalPrediction):
  preds = np.argmax(p.predictions, axis=1)
  return {"acc": (preds == p.label_ids).mean()}

eval_trainer = AdapterTrainer(
    model=model,
    args=TrainingArguments(output_dir="./eval_output", remove_unused_columns=False,),
    eval_dataset=dataset_zh["test"],
    compute_metrics=compute_accuracy,
)
eval_trainer.evaluate()

{'eval_loss': 0.6925755739212036,
 'eval_acc': 0.518,
 'eval_runtime': 4.9675,
 'eval_samples_per_second': 100.654,
 'eval_steps_per_second': 12.682}
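
To see what `compute_accuracy` does, here is a dependency-free sketch of the same computation on made-up logits. The `argmax` and `accuracy` helpers are hypothetical stand-ins for the NumPy calls used in the cell above.

```python
def argmax(row):
    """Index of the largest value in a list (first one on ties)."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of examples whose highest-scoring choice equals the label."""
    preds = [argmax(row) for row in logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Made-up logits for four examples with two choices each.
logits = [[0.2, 0.9], [1.5, -0.3], [0.1, 0.4], [0.7, 0.6]]
labels = [1, 0, 0, 0]
print(accuracy(logits, labels))  # 0.75
```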

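You should get an overall accuracy of about 56%, which is on par with full fine-tuning on COPA only but below the state of the art, which is fine-tuned sequentially on an additional dataset before being fine-tuned on COPA. The run recorded above reached 51.8% (`eval_acc` of 0.518), so expect some variation between runs.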

For results on different languages and a sequential finetuning setup which yields better results, make sure to check out [the MAD-X paper](https://arxiv.org/pdf/2005.00052.pdf).