# 4️⃣ Zero-Shot Cross-Lingual Transfer using Adapters

Beyond AdapterFusion, which we trained in [the previous notebook](https://github.com/Adapter-Hub/adapter-transformers/blob/master/notebooks/04_Cross_Lingual_Transfer.ipynb), we can also compose adapters for zero-shot cross-lingual transfer, that is, applying a task learned in one language to another language. We will use the stacked adapter setup presented in **MAD-X** ([Pfeiffer et al., 2020](https://arxiv.org/pdf/2005.00052.pdf)) for this purpose.

In this example, the base model is a pre-trained multilingual **XLM-R** (`xlm-roberta-base`) ([Conneau et al., 2019](https://arxiv.org/pdf/1911.02116.pdf)) model. Additionally, two types of adapters, language adapters and task adapters, are used. Here's how the MAD-X process works in detail:

1. Train language adapters for the source and target language on a language modeling task. In this notebook, we won't train them ourselves but use [pre-trained language adapters from the Hub](https://adapterhub.ml/explore/text_lang/).
2. Train a task adapter on the target task dataset in the source language. This task adapter is **stacked** on the previously trained source language adapter. During this step, only the weights of the task adapter are updated.
3. Perform zero-shot cross-lingual transfer. In this last step, we simply replace the source language adapter with the target language adapter while keeping the stacked task adapter.
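
The three steps can be pictured with a toy model in plain Python. This is only a sketch under simplifying assumptions: hidden states are two-element lists, each adapter is a residual bottleneck with hand-picked weights, and the adapters are applied once rather than inside every transformer layer. The names `matvec`, `make_adapter` and `forward` are illustrative and not part of `adapter-transformers`.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_adapter(down, up):
    """Return a residual bottleneck adapter: h + up(relu(down(h)))."""
    def adapter(hidden):
        bottleneck = [max(0.0, value) for value in matvec(down, hidden)]
        return [h + u for h, u in zip(hidden, matvec(up, bottleneck))]
    return adapter


# Step 1: one language adapter per language (hypothetical, hand-picked weights).
language_adapters = {
    "source": make_adapter(down=[[1.0, 0.0]], up=[[0.5], [0.0]]),
    "target": make_adapter(down=[[0.0, 1.0]], up=[[0.0], [0.5]]),
}
# Step 2: a single task adapter, trained while stacked on the source language adapter.
task_adapter = make_adapter(down=[[1.0, 1.0]], up=[[1.0], [-1.0]])


def forward(hidden, language):
    """Pass the hidden state through the language adapter, then the task adapter."""
    return task_adapter(language_adapters[language](hidden))


hidden_state = [2.0, 4.0]
# Step 3: zero-shot transfer swaps the language adapter and keeps the task adapter.
source_output = forward(hidden_state, "source")
target_output = forward(hidden_state, "target")
```

The task adapter is the same object in both calls; only the language adapter underneath it changes, which is the core idea of the transfer step.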

Now to our concrete example: we select **XCOPA** ([Ponti et al., 2020](https://ducdauge.github.io/files/xcopa.pdf)), a multilingual extension of the **COPA** commonsense reasoning dataset ([Roemmele et al., 2011](https://people.ict.usc.edu/~gordon/publications/AAAI-SPRING11A.PDF)), as our target task. The setup is trained on the original **English** dataset and then transferred to **Chinese**.

## Installation

Besides `adapter-transformers`, we use HuggingFace's `datasets` library for loading the data, so both need to be installed first. The cells below then select the GPU and define a few constants:

In [1]:
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="3"
# os.environ["CUDA_VISIBLE_DEVICES"]="-1"

import torch
print(torch.cuda.device_count())

1


In [2]:
SEED = 42
MODEL_DIR = './training_output/deu_disrpt_vanilla_bert'
BERT_MODEL = 'bert-base-german-cased'
BATCH_SIZE = 8

## Dataset Preprocessing

We need the English COPA dataset for training our task adapter. It is part of the SuperGLUE benchmark and can be loaded via `datasets` using one line of code:

In [3]:
import pandas as pd
from datasets import Dataset

def read_df_custom(file):
    header = 'doc     unit1_toks      unit2_toks      unit1_txt       unit2_txt       s1_toks s2_toks unit1_sent      unit2_sent      dir     nuc_children    sat_children    genre   u1_discontinuous        u2_discontinuous       u1_issent        u2_issent       u1_length       u2_length       length_ratio    u1_speaker      u2_speaker      same_speaker    u1_func u1_pos  u1_depdir       u2_func u2_pos  u2_depdir       doclen  u1_position      u2_position     percent_distance        distance        lex_overlap_words       lex_overlap_length      unit1_case      unit2_case      label'
    extracted_columns = ['unit1_txt', 'unit1_sent', 'unit2_txt', 'unit2_sent', 'dir', 'label', 'distance', 'u1_depdir', 'u2_depdir', 'u2_func', 'u1_position', 'u2_position', 'sat_children', 'nuc_children', 'genre', 'unit1_case', 'unit2_case',
                            'u1_discontinuous', 'u2_discontinuous', 'same_speaker', 'lex_overlap_length', 'u1_func']
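    header = header.split()
    # Resolve the position of each extracted column once, not once per row.
    column_indices = {column: header.index(column) for column in extracted_columns}

    rows = []
    # The context manager closes the file handle when reading is done.
    with open(file, 'r') as f:
        next(f, None)  # skip the header line
        for line in f:
            # Strip only the newline so a final line without one keeps its last character.
            fields = line.rstrip('\n').split('\t')
            rows.append({column: fields[index] for column, index in column_indices.items()})

    return pd.DataFrame.from_records(rows, columns=extracted_columns)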

de_train_dataset_df = Dataset.from_pandas(read_df_custom('../../processed/deu.rst.pcc_train_enriched.rels'))
de_test_dataset_df = Dataset.from_pandas(read_df_custom('../../processed/deu.rst.pcc_test_enriched.rels'))
de_valid_dataset_df = Dataset.from_pandas(read_df_custom('../../processed/deu.rst.pcc_dev_enriched.rels'))

len(de_train_dataset_df), len(de_test_dataset_df), len(de_valid_dataset_df)

(2164, 260, 241)
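
The reader above boils down to three moves: split the header, look up the positions of the wanted columns, and slice every row at those positions. A minimal sketch on an in-memory sample follows; the four-column header, the two rows and the label values are made up for illustration, while real `.rels` files carry the full header listed in `read_df_custom`.

```python
import io

# Hypothetical miniature .rels content: a header line followed by tab-separated rows.
sample = io.StringIO(
    "doc\tunit1_txt\tunit2_txt\tlabel\n"
    "doc1\tEs regnet .\tWir bleiben zu Hause .\tcause\n"
    "doc1\tWir bleiben zu Hause .\tDas ist schade .\tevaluation\n"
)

wanted_columns = ["unit1_txt", "unit2_txt", "label"]

# Map each wanted column name to its position in the header.
header = next(sample).rstrip("\n").split("\t")
positions = {column: header.index(column) for column in wanted_columns}

rows = []
for line in sample:
    fields = line.rstrip("\n").split("\t")
    rows.append({column: fields[index] for column, index in positions.items()})
```

Each entry of `rows` is a dictionary restricted to the wanted columns, which is the shape `pd.DataFrame.from_records` expects.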

In [4]:
de_train_dataset_df.features

{'unit1_txt': Value(dtype='string', id=None),
 'unit1_sent': Value(dtype='string', id=None),
 'unit2_txt': Value(dtype='string', id=None),
 'unit2_sent': Value(dtype='string', id=None),
 'dir': Value(dtype='string', id=None),
 'label': Value(dtype='string', id=None),
 'distance': Value(dtype='string', id=None),
 'u1_depdir': Value(dtype='string', id=None),
 'u2_depdir': Value(dtype='string', id=None),
 'u2_func': Value(dtype='string', id=None),
 'u1_position': Value(dtype='string', id=None),
 'u2_position': Value(dtype='string', id=None),
 'sat_children': Value(dtype='string', id=None),
 'nuc_children': Value(dtype='string', id=None),
 'genre': Value(dtype='string', id=None),
 'unit1_case': Value(dtype='string', id=None),
 'unit2_case': Value(dtype='string', id=None),
 'u1_discontinuous': Value(dtype='string', id=None),
 'u2_discontinuous': Value(dtype='string', id=None),
 'same_speaker': Value(dtype='string', id=None),
 'lex_overlap_length': Value(dtype='string', id=None),
 'u1_func':

In [5]:
from transformers.adapters.composition import Stack

Every dataset sample has a premise, a question and two possible answer choices:

In this example, we model COPA as a multiple-choice task with two choices. Thus, we encode the premise and question as well as both choices as one input to our `xlm-roberta-base` model. Using `dataset.map()`, we can pass the full dataset through the tokenizer in batches:

In [6]:
from transformers import AutoTokenizer, BertTokenizer
from datasets import ClassLabel


# tokenizer = AutoTokenizer.from_pretrained(BERT_MODEL)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")


class SNLIDataset(torch.utils.data.Dataset):
    """Tokenized (unit1_txt, unit2_txt) pairs with integer relation labels."""
    def __init__(self, dataset, labels, raw_text=False):
        self.text = []
        self.raw_text = []
        self.raw_label = []
        self.raw_text_flag = raw_text
        self.num_rows = len(dataset)
        for premise, hypothesis in zip(dataset['unit1_txt'], dataset['unit2_txt']):
            self.text.append(tokenizer.encode_plus(premise, hypothesis, padding="max_length", truncation=True, max_length=512, return_token_type_ids=True))
            if raw_text: self.raw_text.append([premise, hypothesis])
        # self.labels = torch.tensor(labels.str2int(dataset['label'])).to(device)
        self.labels = labels.str2int(dataset['label'])
        if raw_text: self.raw_label = dataset['label']
        print('read ' + str(len(self.text)) + ' examples')

    def __getitem__(self, idx):
        if self.raw_text_flag:  
            return {'input_ids':self.text[idx]['input_ids'], 
                'token_type_ids':self.text[idx]['token_type_ids'], 
                'attention_mask':self.text[idx]['attention_mask'], 
                'raw_text': self.raw_text[idx],
                'label':self.labels[idx],
                'raw_label': self.raw_label[idx]}

        return {'input_ids':self.text[idx]['input_ids'], 
                'token_type_ids':self.text[idx]['token_type_ids'], 
                'attention_mask':self.text[idx]['attention_mask'], 
                'label':self.labels[idx]}

    def __len__(self):
        return len(self.text)

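    def combine_with_another_dataset(self, dataset2):
        # Both datasets must agree on whether raw text is stored.
        assert self.raw_text_flag == dataset2.raw_text_flag
        print('Adding datasets of sizes ', self.num_rows, ', ', dataset2.num_rows)
        self.text = self.text + dataset2.text
        # Extend the labels together with the encodings so indices stay aligned
        # (assumes both are Python lists, as ClassLabel.str2int returns for list input).
        self.labels = self.labels + dataset2.labels
        self.raw_text = self.raw_text + dataset2.raw_text
        self.raw_label = self.raw_label + dataset2.raw_label
        self.num_rows = len(self.text)
        print('To new dataset size: ', self.num_rows)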


def load_data_snli(batch_size, train_dataset_df, valid_dataset_df, test_dataset_df, labels):
    """Wrap the train, validation and test splits in SNLIDataset objects (batch_size is currently unused)."""
    train_data = train_dataset_df
    valid_data = valid_dataset_df
    test_data = test_dataset_df
    train_set = SNLIDataset(train_data, labels, raw_text=False)
    valid_set = SNLIDataset(valid_data, labels, raw_text=False)
    test_set = SNLIDataset(test_data, labels, raw_text=False)
    return train_set, valid_set, test_set #TODO: MAKE INTO DICT

# Sort the label names so that every label gets the same index on every run.
de_labels = ClassLabel(names=sorted(set(de_train_dataset_df['label'])|set(de_test_dataset_df['label'])|set(de_valid_dataset_df['label'])))
de_train_dataset, de_valid_dataset, de_test_dataset = load_data_snli(BATCH_SIZE, de_train_dataset_df, de_valid_dataset_df, de_test_dataset_df, de_labels)

read 2164 examples
read 241 examples
read 260 examples
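
To make the pair encoding in the cell above more concrete, here is a toy version. It is a sketch under stated assumptions: whitespace splitting stands in for the real SentencePiece tokenizer, tokens are kept as strings instead of ids, and the hypothetical helper `encode_pair` refuses to truncate. The `<s> A </s></s> B </s>` layout is the one used by RoBERTa-style models such as XLM-R.

```python
def encode_pair(unit1, unit2, max_length=16):
    """Toy pair encoding with padding and an attention mask."""
    tokens = ["<s>", *unit1.split(), "</s>", "</s>", *unit2.split(), "</s>"]
    if len(tokens) > max_length:
        raise ValueError("this toy encoder does not truncate")
    padding = max_length - len(tokens)
    return {
        # Real tokens first, then padding up to max_length.
        "tokens": tokens + ["<pad>"] * padding,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(tokens) + [0] * padding,
    }


encoded_pair = encode_pair("Es regnet", "Wir bleiben zu Hause", max_length=12)
```

With `padding="max_length"` and `max_length=512`, as in the cell above, every example is padded to 512 positions regardless of its actual length.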


## Task Adapter Training

In this section, we will train the task adapter on the English COPA dataset. We use a pre-trained XLM-R model from HuggingFace and instantiate our model using `AutoAdapterModel`.

In [7]:
from transformers import AutoConfig, AutoAdapterModel

config = AutoConfig.from_pretrained(
    "xlm-roberta-base",
)
model = AutoAdapterModel.from_pretrained(
    "xlm-roberta-base",
    config=config,
)

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaAdapterModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing XLMRobertaAdapterModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaAdapterModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaAdapterModel were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['roberta.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for prediction

This is the base model. If you print `model` before adding the adapters, you can see the embeddings and the 12 transformer layers that all language adapters will share.

Now we only need to set up the adapters. As described, we need two language adapters (which are assumed to be pre-trained in this example) and a task adapter (which will be trained in a few moments).

First, we load the pre-trained language adapters from the Hub; the cell below loads English (`"en"`) and German (`"de"`). Then we add a new task adapter (`"disrpt"`) for our target task.

Finally, we add a classification head with the same name as our task adapter on top, with one output per label.

In [8]:
from transformers import AdapterConfig

# Load the language adapters
lang_adapter_config = AdapterConfig.load("pfeiffer", reduction_factor=2)
model.load_adapter("en/wiki@ukp", config=lang_adapter_config)
model.load_adapter("de/wiki@ukp", config=lang_adapter_config)

# Add a new task adapter
model.add_adapter("disrpt")

# Add a classification head for our target task
num_labels=de_labels.num_classes
print([num_labels])
# model.add_multiple_choice_head("disrpt", num_choices=num_labels)
model.add_classification_head("disrpt", num_labels=num_labels)

[26]
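
The head size printed above equals the number of distinct relation labels across the three splits. The mapping from label strings to class indices can be sketched in plain Python as follows. The label names are made up, and sorting the union is an assumption of this sketch: it keeps the indices stable between runs, since the iteration order of a set of strings is not fixed across Python processes.

```python
# Hypothetical label sets per split; the real ones come from the 'label' column.
train_labels = {"cause", "contrast", "elaboration"}
valid_labels = {"cause", "elaboration"}
test_labels = {"contrast", "evaluation"}

# Sorting the union gives the same index for each label on every run.
label_names = sorted(train_labels | valid_labels | test_labels)
str2int = {name: index for index, name in enumerate(label_names)}
int2str = {index: name for name, index in str2int.items()}

num_labels = len(label_names)  # size of the classification head
encoded = [str2int[label] for label in ["contrast", "cause", "evaluation"]]
```

Taking the union over all splits ensures that a label occurring only in the test or validation data still receives an index.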


Using `train_adapter()`, we tell our model to train only the task adapter from here on. This call freezes the weights of the pre-trained model and of the language adapters so that they are not fine-tuned further.

In [9]:
model.train_adapter(["disrpt"])
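
To get a feeling for how few weights remain trainable, the size of a bottleneck adapter can be estimated by hand. The sketch below counts one down-projection and one up-projection per transformer layer. It assumes a hidden size of 768 and 12 layers for `xlm-roberta-base` and a reduction factor of 16 for the task adapter (an assumption about the default configuration); it ignores the classification head and any further parameters the library may add. `bottleneck_parameters` is a hypothetical helper.

```python
def bottleneck_parameters(hidden_size, reduction_factor, num_layers):
    """Count weights and biases of one down- and one up-projection per layer."""
    bottleneck = hidden_size // reduction_factor
    down = hidden_size * bottleneck + bottleneck  # weight matrix + bias
    up = bottleneck * hidden_size + hidden_size   # weight matrix + bias
    return num_layers * (down + up)


# Assumed reduction factor of 16 for the task adapter.
task_adapter_params = bottleneck_parameters(768, 16, 12)
# reduction_factor=2 matches the language adapter config used when loading.
language_adapter_params = bottleneck_parameters(768, 2, 12)
```

Under these assumptions the task adapter has fewer than one million parameters, a small fraction of the frozen base model.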

We want the task adapter to be stacked on top of the language adapter, so we have to tell our model to use this setup via the `active_adapters` property.

A stack of adapters is represented by the `Stack` class, which takes the names of the adapters to be stacked as arguments.
Of course, there are various other ways to compose adapters beyond stacking. Learn more about those [in our documentation](https://docs.adapterhub.ml/adapter_composition.html).

In [10]:
# Unfreeze and activate stack setup
model.active_adapters = Stack("de", "disrpt")

Great! Now, in every forward pass, the input is passed through the language adapter first and the task adapter second.

For training, we use the `Trainer` class built into `transformers`. We configure the training process using a `TrainingArguments` object.

As the dataset splits of English COPA in SuperGLUE are slightly different, we train on both the train and validation split of the dataset. Later, we will evaluate on the test split of XCOPA.

In [11]:
# TODO: SHUFFLE print samples: https://discuss.huggingface.co/t/how-to-print-a-few-examples-at-the-beginning-of-training-when-using-trainer/7597/2
# https://github.com/huggingface/datasets/blob/cd3169f3f35afcf73a36a8276113e1881d92e5e0/src/datasets/iterable_dataset.py#L223

In [12]:
from transformers import TrainingArguments, AdapterTrainer
from itertools import chain


training_args = TrainingArguments(
    learning_rate=1e-5,
    num_train_epochs=30,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_steps=100,
    output_dir=MODEL_DIR,
    overwrite_output_dir=True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=False,
)

trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=de_train_dataset,
    eval_dataset=de_valid_dataset
)


## Train

Start the training 🚀 (this will take a while)

In [13]:
print(model)

XLMRobertaAdapterModel(
  (shared_parameters): ModuleDict()
  (roberta): RobertaModel(
    (shared_parameters): ModuleDict()
    (invertible_adapters): ModuleDict(
      (en): NICECouplingBlock(
        (F): Sequential(
          (0): Linear(in_features=384, out_features=192, bias=True)
          (1): Activation_Function_Class(
            (f): ReLU()
          )
          (2): Linear(in_features=192, out_features=384, bias=True)
        )
        (G): Sequential(
          (0): Linear(in_features=384, out_features=192, bias=True)
          (1): Activation_Function_Class(
            (f): ReLU()
          )
          (2): Linear(in_features=192, out_features=384, bias=True)
        )
      )
      (de): NICECouplingBlock(
        (F): Sequential(
          (0): Linear(in_features=384, out_features=192, bias=True)
          (1): Activation_Function_Class(
            (f): ReLU()
          )
          (2): Linear(in_features=192, out_features=384, bias=True)
        )
        (G): Sequen

In [14]:
trainer.train()

***** Running training *****
  Num examples = 2164
  Num Epochs = 30
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 4080
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: erzaliator. Use `wandb login --relogin` to force relogin


| Step | Training Loss |
|------|---------------|
| 100  | 3.2113 |
| 200  | 3.0774 |
| 300  | 3.014  |
| 400  | 3.0016 |
| 500  | 2.9916 |
| 600  | 2.9732 |
| 700  | 2.969  |
| 800  | 2.9746 |
| 900  | 2.9666 |
| 1000 | 2.964  |


Saving model checkpoint to ./training_output/deu_disrpt_vanilla_bert/checkpoint-500
Configuration saved in ./training_output/deu_disrpt_vanilla_bert/checkpoint-500/en/adapter_config.json
Module weights saved in ./training_output/deu_disrpt_vanilla_bert/checkpoint-500/en/pytorch_adapter.bin
Configuration saved in ./training_output/deu_disrpt_vanilla_bert/checkpoint-500/de/adapter_config.json
Module weights saved in ./training_output/deu_disrpt_vanilla_bert/checkpoint-500/de/pytorch_adapter.bin
Configuration saved in ./training_output/deu_disrpt_vanilla_bert/checkpoint-500/disrpt/adapter_config.json
Module weights saved in ./training_output/deu_disrpt_vanilla_bert/checkpoint-500/disrpt/pytorch_adapter.bin
Configuration saved in ./training_output/deu_disrpt_vanilla_bert/checkpoint-500/disrpt/head_config.json
Module weights saved in ./training_output/deu_disrpt_vanilla_bert/checkpoint-500/disrpt/pytorch_model_head.bin
Configuration saved in ./training_output/deu_disrpt_vanilla_bert/checkpo

TrainOutput(global_step=4080, training_loss=2.961725178886862, metrics={'train_runtime': 1991.8496, 'train_samples_per_second': 32.593, 'train_steps_per_second': 2.048, 'total_flos': 2.032778523451392e+16, 'train_loss': 2.961725178886862, 'epoch': 30.0})
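
The step count and throughput figures in the log can be reproduced with a little arithmetic. The sketch below takes its numbers from the output above and assumes a single device with no gradient accumulation, as the log reports.

```python
import math

num_examples = 2164
batch_size = 16
num_epochs = 30
train_runtime_seconds = 1991.8496

# The last batch of each epoch is only partially filled, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

steps_per_second = total_steps / train_runtime_seconds
samples_per_second = num_examples * num_epochs / train_runtime_seconds
```

Here `total_steps` comes out as 136 × 30 = 4080, matching `global_step` in the `TrainOutput`.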

In [15]:
import numpy as np
from transformers import EvalPrediction

def compute_accuracy(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    return {"acc": (preds == p.label_ids).mean()}

eval_trainer = AdapterTrainer(
    model=model,
    args=TrainingArguments(output_dir="./eval_output_de", remove_unused_columns=False,),
    eval_dataset=de_test_dataset,
    compute_metrics=compute_accuracy,
)
eval_trainer.evaluate()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running Evaluation *****
  Num examples = 260
  Batch size = 8


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


{'eval_loss': 2.9827046394348145,
 'eval_acc': 0.13076923076923078,
 'eval_runtime': 3.6404,
 'eval_samples_per_second': 71.421,
 'eval_steps_per_second': 9.065}
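
To put these numbers in context, they can be compared with a model that spreads its probability uniformly over all 26 labels. This is a rough reference point only: the label distribution is not shown here, so a majority-class baseline may well be higher than the uniform one. The values are copied from the outputs above.

```python
import math

num_labels = 26            # size of the classification head
num_test_examples = 260
eval_accuracy = 0.13076923076923078
eval_loss = 2.9827046394348145

uniform_accuracy = 1 / num_labels      # expected accuracy of uniform guessing
uniform_loss = math.log(num_labels)    # cross-entropy of a uniform prediction
correct_predictions = round(eval_accuracy * num_test_examples)
```

The evaluation loss sits only slightly below the uniform cross-entropy of ln(26), which suggests that the task adapter has learned little in this run.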