# 4️⃣ Zero-Shot Cross-Lingual Transfer using Adapters

Beyond AdapterFusion, which we trained in [the previous notebook](https://github.com/Adapter-Hub/adapter-transformers/blob/master/notebooks/04_Cross_Lingual_Transfer.ipynb), we can compose adapters for zero-shot cross-lingual transfer between tasks. We will use the stacked adapter setup presented in **MAD-X** ([Pfeiffer et al., 2020](https://arxiv.org/pdf/2005.00052.pdf)) for this purpose.

In this example, the base model is a pre-trained multilingual **XLM-R** (`xlm-roberta-base`) ([Conneau et al., 2019](https://arxiv.org/pdf/1911.02116.pdf)) model. Additionally, two types of adapters, language adapters and task adapters, are used. Here's how the MAD-X process works in detail:

1. Train language adapters for the source and target language on a language modeling task. In this notebook, we won't train them ourselves but use [pre-trained language adapters from the Hub](https://adapterhub.ml/explore/text_lang/).
2. Train a task adapter on the target task dataset. This task adapter is **stacked** upon the previously trained language adapter. During this step, only the weights of the task adapter are updated.
3. Perform zero-shot cross-lingual transfer. In this last step, we simply replace the source language adapter with the target language adapter while keeping the stacked task adapter.

Now to our concrete example: we select **XCOPA** ([Ponti et al., 2020](https://ducdauge.github.io/files/xcopa.pdf)), a multilingual extension of the **COPA** commonsence reasoning dataset ([Roemmele et al., 2011](https://people.ict.usc.edu/~gordon/publications/AAAI-SPRING11A.PDF)) as our target task. The setup is trained on the original **English** dataset and then transferred to **Chinese**.

## Installation

Besides `adapter-transformers`, we use HuggingFace's `datasets` library for loading the data. So let's install both first:

In [1]:
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="5"
# os.environ["CUDA_VISIBLE_DEVICES"]="-1"

import torch
import wandb

In [2]:
SEED = 42
BERT_MODEL = 'bert-base-multilingual-cased'#'bert-base-german-cased'
BATCH_SIZE = 8

In [3]:
experiment_name = 'AdapterAlphaDisco'
lr = 'lr=AdafactorDefault'
model_name = 'plm='+BERT_MODEL
additional_info = 'Info=v3_lrtest1'
name = '_'.join([experiment_name, lr, model_name, additional_info])

MODEL_DIR = name

In [4]:
#WANDB
wandb.init(reinit=True)
wandb.run.name = name

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33merzaliator[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [5]:
# LOGGING
from utils.logger import Logger
logger = Logger(MODEL_DIR, wandb, wandb_flag=True)
print('Running with params: BERT_MODEL='+ BERT_MODEL+ ' lr='+ str(lr))

Running with params: BERT_MODEL=bert-base-multilingual-cased lr=lr=AdafactorDefault


## Dataset Preprocessing

We need the English COPA dataset for training our task adapter. It is part of the SuperGLUE benchmark and can be loaded via `datasets` using one line of code:

In [6]:
import pandas as pd
from datasets import Dataset

def read_df_custom(file):
    header = 'doc     unit1_toks      unit2_toks      unit1_txt       unit2_txt       s1_toks s2_toks unit1_sent      unit2_sent      dir     nuc_children    sat_children    genre   u1_discontinuous        u2_discontinuous       u1_issent        u2_issent       u1_length       u2_length       length_ratio    u1_speaker      u2_speaker      same_speaker    u1_func u1_pos  u1_depdir       u2_func u2_pos  u2_depdir       doclen  u1_position      u2_position     percent_distance        distance        lex_overlap_words       lex_overlap_length      unit1_case      unit2_case      label'
    extracted_columns = ['unit1_txt', 'unit1_sent', 'unit2_txt', 'unit2_sent', 'dir', 'label', 'distance', 'u1_depdir', 'u2_depdir', 'u2_func', 'u1_position', 'u2_position', 'sat_children', 'nuc_children', 'genre', 'unit1_case', 'unit2_case',
                            'u1_discontinuous', 'u2_discontinuous', 'same_speaker', 'lex_overlap_length', 'u1_func']
    header = header.split()
    df = pd.DataFrame(columns=extracted_columns)
    file = open(file, 'r')

    rows = []
    count = 0 
    for line in file:
        line = line[:-1].split('\t')
        count+=1
        if count ==1: continue
        row = {}
        for column in extracted_columns:
            index = header.index(column)
            row[column] = line[index]
        rows.append(row)

    df = pd.concat([df, pd.DataFrame.from_records(rows)])
    return df

en_train_dataset_df = Dataset.from_pandas(read_df_custom('../../processed/eng.rst.rstdt_train_enriched.rels')[:100])
en_test_dataset_df = Dataset.from_pandas(read_df_custom('../../processed/eng.rst.rstdt_test_enriched.rels'))
en_valid_dataset_df = Dataset.from_pandas(read_df_custom('../../processed/eng.rst.rstdt_dev_enriched.rels'))

de_train_dataset_df = Dataset.from_pandas(read_df_custom('../../processed/deu.rst.pcc_train_enriched.rels')[:100])
de_test_dataset_df = Dataset.from_pandas(read_df_custom('../../processed/deu.rst.pcc_test_enriched.rels'))
de_valid_dataset_df = Dataset.from_pandas(read_df_custom('../../processed/deu.rst.pcc_dev_enriched.rels'))

len(en_train_dataset_df), len(en_test_dataset_df), len(en_valid_dataset_df)

(100, 2155, 1621)

In [7]:
en_train_dataset_df.features

{'unit1_txt': Value(dtype='string', id=None),
 'unit1_sent': Value(dtype='string', id=None),
 'unit2_txt': Value(dtype='string', id=None),
 'unit2_sent': Value(dtype='string', id=None),
 'dir': Value(dtype='string', id=None),
 'label': Value(dtype='string', id=None),
 'distance': Value(dtype='string', id=None),
 'u1_depdir': Value(dtype='string', id=None),
 'u2_depdir': Value(dtype='string', id=None),
 'u2_func': Value(dtype='string', id=None),
 'u1_position': Value(dtype='string', id=None),
 'u2_position': Value(dtype='string', id=None),
 'sat_children': Value(dtype='string', id=None),
 'nuc_children': Value(dtype='string', id=None),
 'genre': Value(dtype='string', id=None),
 'unit1_case': Value(dtype='string', id=None),
 'unit2_case': Value(dtype='string', id=None),
 'u1_discontinuous': Value(dtype='string', id=None),
 'u2_discontinuous': Value(dtype='string', id=None),
 'same_speaker': Value(dtype='string', id=None),
 'lex_overlap_length': Value(dtype='string', id=None),
 'u1_func':

In this example, we model COPA as a multiple-choice task with two choices. Thus, we encode the premise and question as well as both choices as one input to our `xlm-roberta-base` model. Using `dataset.map()`, we can pass the full dataset through the tokenizer in batches:

In [8]:
from transformers import AutoTokenizer, BertTokenizer
from datasets import ClassLabel

tokenizer = AutoTokenizer.from_pretrained(BERT_MODEL)

class SNLIDataset(torch.utils.data.Dataset):
    """A customized dataset to load the SNLI dataset."""
    def __init__(self, dataset, labels, raw_text=False):
        self.text = []
        self.raw_text = []
        self.raw_label = []
        self.raw_text_flag = raw_text
        self.num_rows = len(dataset)
        for premise, hypothesis in zip(dataset['unit1_txt'], dataset['unit2_txt']):
            self.text.append(tokenizer.encode_plus(premise, hypothesis, padding="max_length", truncation=True, max_length=512, return_token_type_ids=True))
            if raw_text: self.raw_text.append([premise, hypothesis])
        # self.labels = torch.tensor(labels.str2int(dataset['label'])).to(device)
        self.labels = labels.str2int(dataset['label'])
        if raw_text: self.raw_label = dataset['label']
        print('read ' + str(len(self.text)) + ' examples')

    def __getitem__(self, idx):
        if self.raw_text_flag:  
            return {'input_ids':self.text[idx]['input_ids'], 
                'token_type_ids':self.text[idx]['token_type_ids'], 
                'attention_mask':self.text[idx]['attention_mask'], 
                'raw_text': self.raw_text[idx],
                'label':self.labels[idx],
                'raw_label': self.raw_label[idx]}

        return {'input_ids':self.text[idx]['input_ids'], 
                'token_type_ids':self.text[idx]['token_type_ids'], 
                'attention_mask':self.text[idx]['attention_mask'], 
                'label':self.labels[idx]}

    def __len__(self):
        return len(self.text)

    def combine_with_another_dataset(self, dataset2):
        print('Adding datasets of sizes ', self.num_rows, ', ', dataset2.num_rows)
        self.text = self.text + dataset2.text
        self.raw_text = self.raw_text + dataset2.raw_text
        self.raw_label = self.raw_label + dataset2.raw_label
        self.num_rows = len(self.text) #update texts is now longer
        assert self.raw_text_flag == dataset2.raw_text_flag
        print('To new dataset size: ', self.num_rows)


def load_data_snli(batch_size, train_dataset_df, valid_dataset_df, test_dataset_df, labels):
    """Download the SNLI dataset and return data iterators and vocabulary."""
    train_data = train_dataset_df
    valid_data = valid_dataset_df
    test_data = test_dataset_df
    train_set = SNLIDataset(train_data, labels, raw_text=False)
    valid_set = SNLIDataset(valid_data, labels, raw_text=False)
    test_set = SNLIDataset(test_data, labels, raw_text=False)
    return train_set, valid_set, test_set #TODO: MAKE INTO DICT

en_labels = ClassLabel(names=list(set(en_train_dataset_df['label'])|set(en_test_dataset_df['label'])|set(en_valid_dataset_df['label'])))
en_train_dataset, en_valid_dataset, en_test_dataset = load_data_snli(BATCH_SIZE, en_train_dataset_df, en_valid_dataset_df, en_test_dataset_df, en_labels)
de_labels = ClassLabel(names=list(set(de_train_dataset_df['label'])|set(de_test_dataset_df['label'])|set(de_valid_dataset_df['label'])))
de_train_dataset, de_valid_dataset, de_test_dataset = load_data_snli(BATCH_SIZE, de_train_dataset_df, de_valid_dataset_df, de_test_dataset_df, de_labels)

read 100 examples
read 1621 examples
read 2155 examples
read 100 examples
read 241 examples
read 260 examples


## Task Adapter Training

In this section, we will train the task adapter on the English COPA dataset. We use a pre-trained XLM-R model from HuggingFace and instantiate our model using `AutoAdapterModel`.

In [9]:
# import copy
# from typing import Any, Dict, Optional
# from transformers import BertModel, AutoTokenizer, BertConfig
# from transformers.models.bert.modeling_bert import BertPooler
# import torch.nn as nn
# from tensorflow.keras.layers import TimeDistributed
# from featurefulbertembedder_custom2 import FeaturefulBertEmbedder
# from featureful_bert_custom2 import get_combined_feature_tensor_2 as get_combined_feature_tensor_forward
# from featureful_bert_custom2 import get_feature_modules

# class CustomPooler2(nn.Module):
#     def __init__(self, *,
#                         requires_grad: bool = True,
#                         dropout: float = 0.0,
#                         transformer_kwargs: Optional[Dict[str, Any]] = None, ) -> None:
#         super().__init__()
#         bert = BertModel.from_pretrained(BERT_MODEL) #only used to pass config. BertAttentionClass used in FeatureFulBert
#         self._dropout = torch.nn.Dropout(p=dropout)
#         self.pooler = copy.deepcopy(bert.pooler)
#         for param in self.pooler.parameters():
#             param.requires_grad = requires_grad
#         self._embedding_dim = bert.config.hidden_size

#     def get_input_dim(self) -> int:
#         return self._embedding_dim

#     def get_output_dim(self) -> int:
#         return self._embedding_dim

#     def forward(self, tokens: torch.Tensor, mask: torch.BoolTensor = None, num_wrapping_dims: int = 0):
#         pooler = self.pooler
        
#         for _ in range(num_wrapping_dims):
#             pooler = TimeDistributed(pooler)
#         pooled = pooler(tokens)
#         pooled = self._dropout(pooled)
#         return pooled

# class MyModule(nn.Module):    
#     def __init__(self, feature_list, vocab):
#         super(MyModule, self).__init__()
#         self.feature_list = feature_list
#         self.feature_modules, self._feature_module_size = get_feature_modules(feature_list, vocab)

#     def forward(self, features):
#         return get_combined_feature_tensor_forward(features, self.feature_list, self.feature_modules)

# class CustomBERTModel(nn.Module):
#     def __init__(self, num_labels, vocab):
#           super(CustomBERTModel, self).__init__()
#           self.num_classes = num_labels
#           self.feature_list = mnli_dataset.feature_list
#           print('ASSIGN:', self.num_classes)

#           self.embedder = self.create_featureful_bert()
#           self.encoder = CustomPooler2()
#           self.module1 = MyModule(self.feature_list, vocab)
#           self.dropout1 = nn.Dropout(p=0.0)
#         #   self.dropout_decoder = nn.Dropout(p=0.5)
#           self._decoder_input_size = self.encoder._embedding_dim + self.module1._feature_module_size
#           self.relation_decoder = nn.Linear(self._decoder_input_size, self.num_classes)

#     def forward(self, pair_token_ids, token_type_ids, attention_mask, feat):
#         direction_tensor = feat['dir'].to(device)
#         embedded_sentence = self.embedder(token_ids=pair_token_ids, #featurefulmebedder
#                         mask=attention_mask, 
#                         type_ids=token_type_ids,
#                         segment_concat_mask = None,
#                         direction_tensor = direction_tensor,
#                         feature_list = self.feature_list,
#                         features = feat)
#         mask = token_type_ids
#         bertpooler_output = self.encoder(tokens=embedded_sentence, mask=mask)
#         feat = self.convert_to_feature_list(feat)
#         feat = self.dropout1(feat)
#         feat = self.module1(feat)
#         # print(bertpooler_output.shape, self.module1._feature_module_size, feat.shape)
#         try:
#             feat_concat = torch.concat((bertpooler_output, feat),-1)
#         except:
#             print(bertpooler_output.shape, feat.shape)
#             raise ValueError()
#         assert feat_concat.shape[-1] == self._decoder_input_size
#         feat_concat = self.dropout1(feat_concat)
#         # feat_concat = self.dropout_decoder(feat_concat)
#         linear1_output = self.relation_decoder(feat_concat)
#         return linear1_output


#     def create_bert_without_activations(self):
#         config = BertConfig.from_pretrained(BERT_MODEL, hidden_act='gelu')
#         bert = BertModel.from_pretrained(BERT_MODEL, config=config)
#         return bert

#     def create_featureful_bert(self):
#         featureful_bert = FeaturefulBertEmbedder(model_name = BERT_MODEL,
#                                 hidden_activation_allen = 'gelu',
#                                 feature_list = self.feature_list, 
#                                 vocab=mnli_dataset.vocab)
#         return featureful_bert

#     def convert_to_feature_list(self, feat):
#         feature_linear = [feat[feature_name] for feature_name in self.feature_list]
#         feature_linear = torch.stack(feature_linear, dim=-1)
#         return feature_linear
        

# model = CustomBERTModel(mnli_dataset.num_labels, mnli_dataset.vocab)
# model.to(device)
# optimizer = AdamW(model.parameters(), lr=5e-6, correct_bias=False) # original 2e-5
# scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.8, mode='max', patience=35, min_lr=5e-7, verbose=True) #original factor=0.6, min_lr=5e-7

In [10]:
from transformers import AutoConfig, AutoAdapterModel

config = AutoConfig.from_pretrained(
    BERT_MODEL,
)
model = AutoAdapterModel.from_pretrained(
    BERT_MODEL,
    config=config,
)

RuntimeError: Failed to import transformers.adapters.models.bert.adapter_model because of the following error (look up to see its traceback):
name 'BERT_START_DOCSTRING' is not defined

wandb: ERROR Error while calling W&B API: internal database error (<Response [500]>)


In [None]:
print(model)

BertAdapterModel(
  (shared_parameters): ModuleDict()
  (bert): BertModel(
    (shared_parameters): ModuleDict()
    (invertible_adapters): ModuleDict()
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (key): Linear(
                in_features=768, out_features=768, bias=True
                (loras): ModuleDict()
              )
              (value): Linear(
                in_features=768, out_features=768

In [None]:
raise ValueError()

ValueError: 

This is the base AutoModel. If you print model before adding the adapters, you can see that the embeddings+12 transformer layers will be shared by En and Zh adapters.

Now we only need to set up the adapters. As described, we need two language adapters (which are assumed to be pre-trained in this example) and a task adapter (which will be trained in a few moments).

First, we load both the language adapters for our source language English (`"en"`) and our target language Chinese (`"zh"`) from the Hub. Then we add a new task adapter (`"copa"`) for our target task.

Finally, we add a multiple-choice head with the same name as our task adapter on top.

In [None]:
from transformers import AdapterConfig

# Load the language adapters
lang_adapter_config = AdapterConfig.load("pfeiffer", reduction_factor=2)
model.load_adapter("en/wiki@ukp", config=lang_adapter_config)
model.load_adapter("de/wiki@ukp", config=lang_adapter_config)

# Add a new task adapter
# model.add_adapter_task("disrpt", labels=[1,0,0,0,0,0,0])
model.add_adapter("disrpt")

# Add a classification head for our target task
num_labels=de_labels.num_classes
print([num_labels])
# model.add_multiple_choice_head("disrpt", num_choices=num_labels)
model.add_classification_head("disrpt", num_labels=num_labels)

[26]


Using `train_adapter()`, we tell our model to only train the task adapter in the following. This call will freeze the weights of the pre-trained model and the weights of the language adapters to prevent them from further finetuning.

In [None]:
model.train_adapter(["disrpt"])

We want the task adapter to be stacked on top of the language adapter, so we have to tell our model to use this setup via the `active_adapters` property.

A stack of adapters is represented by the `Stack` class, which takes the names of the adapters to be stacked as arguments.
Of course, there are various other possibilities to compose adapters beyonde stacking. Learn more about those [in our documentation](https://docs.adapterhub.ml/adapter_composition.html).

In [None]:
# Unfreeze and activate stack setup
from transformers.adapters.composition import Stack

lang = 'en'
model.active_adapters = Stack(lang, "disrpt")

Great! Now, the input will be passed through the English language adapter first and the COPA task adapter second in every forward pass.

For training, we make use of the `Trainer` class built-in into `transformers`. We configure the training process using a `TrainingArguments` object.

As the dataset splits of English COPA in the SuperGLUE are slightly different, we train on both the train and validation split of the dataset. Later, we will evaluate on the test split of XCOPA.

In [None]:
# TODO: SHUFFLE print smples: https://discuss.huggingface.co/t/how-to-print-a-few-examples-at-the-beginning-of-training-when-using-trainer/7597/2
# https://github.com/huggingface/datasets/blob/cd3169f3f35afcf73a36a8276113e1881d92e5e0/src/datasets/iterable_dataset.py#L223

In [None]:
from copy import deepcopy
from transformers import TrainerCallback

class CustomCallback(TrainerCallback):
    
    def __init__(self, trainer) -> None:
        super().__init__()
        self._trainer = trainer
    
    def on_epoch_end(self, args, state, control, **kwargs):
        if control.should_evaluate:
            control_copy = deepcopy(control)
            self._trainer.evaluate(eval_dataset=self._trainer.train_dataset, metric_key_prefix="train@"+lang)
            return control_copy

2023-01-05 14:35:34.770274: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-05 14:35:35.069295: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda-10.2/targets/x86_64-linux/lib/:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-01-05 14:35:35.069340: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-01-05

In [None]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score, log_loss
from torch.nn import CrossEntropyLoss
import numpy as np

def compute_metrics(pred):
    global num_labels
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    loss_fct = CrossEntropyLoss()
    logits = torch.tensor(pred.predictions)
    labels = torch.tensor(labels)
    loss = loss_fct(logits.view(-1, num_labels), labels.view(-1))
    return {
        'accuracy@'+lang: acc,
        'f1@'+lang: f1,
        'precision@'+lang: precision,
        'recall@'+lang: recall,
        'loss@'+lang: loss,
    }

In [None]:
from transformers import TrainingArguments, AdapterTrainer
# from datasets import concatenate_datasets
from itertools import chain


def concatenate_datasets(dataset1, dataset2, shuffle=True):
    y_iter = chain(dataset1, dataset2)
    return y_iter
    # tmp = list(yielding(y_iter))
    # random.shuffle(tmp)
    # for i in tmp:
    #     print i

training_args = TrainingArguments(
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=1e-4,
    num_train_epochs=8,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    output_dir=MODEL_DIR+'_EN',
    overwrite_output_dir=True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=False,
    save_total_limit=1,
)

# train_dataset.combine_with_another_dataset(valid_dataset)
# train_dataset = concatenate_datasets([dataset_en["train"], dataset_en["validation"]])

trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=en_train_dataset,
    eval_dataset=en_valid_dataset,
    compute_metrics=compute_metrics
)

trainer.add_callback(CustomCallback(trainer)) 

# Train

Start the training 🚀 (this will take a while)

In [None]:
train_result = trainer.train()

metrics = train_result.metrics
trainer.log_metrics("train_log", metrics)
trainer.save_metrics("train_log", metrics)

***** Running training *****
  Num examples = 100
  Num Epochs = 8
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 56
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Epoch,Training Loss,Validation Loss,Accuracy@en,F1@en,Precision@en,Recall@en,Loss@en
1,No log,2.964214,0.41,0.066789,0.051959,0.097917,2.964214
1,3.159000,2.994197,0.354719,0.041642,0.031404,0.062048,2.994197
2,3.159000,2.678128,0.34,0.05,0.069737,0.075,2.678128
2,2.839300,2.735241,0.394818,0.039306,0.034829,0.06083,2.73524
3,2.839300,2.424406,0.34,0.05,0.069737,0.075,2.424406
3,2.604200,2.500849,0.395435,0.038474,0.034664,0.060375,2.500849
4,2.604200,2.294416,0.37,0.059576,0.052827,0.085417,2.294416
4,2.466900,2.398073,0.389883,0.040335,0.03401,0.061198,2.398073
5,2.466900,2.237481,0.38,0.059766,0.047286,0.088542,2.237482
5,2.311300,2.386429,0.368908,0.04255,0.032873,0.062763,2.386429


***** Running Evaluation *****
  Num examples = 100
  Batch size = 16
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 1621
  Batch size = 16
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 100
  Batch size = 16
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 1621
  Batch size = 16
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 100
  Batch size = 16
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 1621
  Batch size = 16
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 100
  Batch size = 16
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 1621
  Batch size = 16
  _warn_prf(average, modifier, msg_start, len(resul

***** train_log metrics *****
  epoch                    =        8.0
  total_flos               =   233292GF
  train_loss               =     2.5238
  train_runtime            = 0:03:11.30
  train_samples_per_second =      4.182
  train_steps_per_second   =      0.293


In [None]:
# eval_trainer = AdapterTrainer(
#     model=model,
#     args=TrainingArguments(output_dir="./eval_output_en", remove_unused_columns=False,),
#     eval_dataset=en_test_dataset,
#     compute_metrics=compute_metrics,
# )
trainer.evaluate(metric_key_prefix='test_en',
                eval_dataset=en_test_dataset)

***** Running Evaluation *****
  Num examples = 2155
  Batch size = 16


  _warn_prf(average, modifier, msg_start, len(result))


{'test_en_loss': 2.4515175819396973,
 'test_en_accuracy@en': 0.33178654292343385,
 'test_en_f1@en': 0.04071513997224092,
 'test_en_precision@en': 0.030522834978779965,
 'test_en_recall@en': 0.06247606318035189,
 'test_en_loss@en': 2.451517343521118,
 'test_en_runtime': 26.5909,
 'test_en_samples_per_second': 81.043,
 'test_en_steps_per_second': 5.077,
 'epoch': 8.0}

## Cross-lingual transfer

With the model and all adapters trained and ready, we can come to the cross-lingual transfer step here. We will evaluate our setup on the Chinese split of the XCOPA dataset.
Therefore, we'll first download the data and preprocess it using the same method as the English dataset:

Next, let's adapt our setup to the new language. We simply replace the English language adapter with the Chinese language adapter we already loaded previously. The task adapter we just trained is kept. Again, we set this architecture using `active_adapters`:

In [None]:
lang = 'de'
model.active_adapters = Stack(lang, "disrpt")

Finally, let's see how well our adapter setup performs on the new language. We measure the zero-shot accuracy on the test split of the target language dataset. Evaluation is also performed using the built-in `Trainer` class.

### Set training args for de

In [None]:
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=1e-4,
    num_train_epochs=8,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    output_dir=MODEL_DIR+'_DE',
    overwrite_output_dir=False,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=False,
    save_total_limit=1,
    # resume_from_checkpoint=MODEL_DIR+'/last-checkpoint',
)

trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=de_train_dataset,
    eval_dataset=de_valid_dataset,
    compute_metrics=compute_metrics
)

trainer.add_callback(CustomCallback(trainer)) 

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
trainer.train()

metrics = train_result.metrics
trainer.log_metrics("train_logger", metrics)
trainer.save_metrics("train_logger", metrics)

***** Running training *****
  Num examples = 100
  Num Epochs = 8
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 104
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Epoch,Training Loss,Validation Loss,Accuracy@de,F1@de,Precision@de,Recall@de,Loss@de
1,No log,2.982893,0.16,0.011994,0.006957,0.043478,2.982893
1,3.257700,3.243593,0.053942,0.004265,0.002248,0.041667,3.243592
2,3.257700,2.869488,0.26,0.03684,0.026087,0.070652,2.869488
2,2.937400,3.129823,0.107884,0.011176,0.006491,0.040408,3.129823
3,2.937400,2.75795,0.24,0.031817,0.021413,0.065217,2.757949
3,2.879800,3.132065,0.112033,0.017372,0.010923,0.050481,3.132065
4,2.879800,2.691318,0.27,0.037037,0.025135,0.07337,2.691318
4,2.745600,3.114051,0.103734,0.011256,0.00658,0.038919,3.114051
5,2.745600,2.631517,0.27,0.037826,0.026087,0.07337,2.631517
5,2.687200,3.108267,0.107884,0.011433,0.00666,0.040408,3.108267


***** Running Evaluation *****
  Num examples = 100
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 241
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 100
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 241
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 100
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 241
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 100
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
***** Running Evaluation *****
  Num examples = 241
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
***** Ru

***** train_logger metrics *****
  epoch                    =        8.0
  total_flos               =   233292GF
  train_loss               =     2.5238
  train_runtime            = 0:03:11.30
  train_samples_per_second =      4.182
  train_steps_per_second   =      0.293


In [None]:
trainer.evaluate(metric_key_prefix='test_de',
                eval_dataset=de_test_dataset)

***** Running Evaluation *****
  Num examples = 260
  Batch size = 8


  _warn_prf(average, modifier, msg_start, len(result))


{'test_de_loss': 3.1481635570526123,
 'test_de_accuracy@de': 0.1076923076923077,
 'test_de_f1@de': 0.01305642759397891,
 'test_de_precision@de': 0.0077206123924042315,
 'test_de_recall@de': 0.04330065359477125,
 'test_de_loss@de': 3.1481635570526123,
 'test_de_runtime': 3.4364,
 'test_de_samples_per_second': 75.66,
 'test_de_steps_per_second': 9.603,
 'epoch': 8.0}

In [None]:
lang = 'en(de)'

trainer.evaluate(metric_key_prefix='test_en(de)',
                eval_dataset=en_test_dataset)

***** Running Evaluation *****
  Num examples = 2155
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


{'test_en(de)_loss': 2.8030734062194824,
 'test_en(de)_accuracy@en(de)': 0.362877030162413,
 'test_en(de)_f1@en(de)': 0.03016975308641975,
 'test_en(de)_precision@en(de)': 0.020846662401364895,
 'test_en(de)_recall@en(de)': 0.054578447794528195,
 'test_en(de)_loss@en(de)': 2.8030736446380615,
 'test_en(de)_runtime': 28.3676,
 'test_en(de)_samples_per_second': 75.967,
 'test_en(de)_steps_per_second': 9.518,
 'epoch': 8.0}

You should get an overall accuracy of about 56 which is on-par with full finetuning on COPA only but below the state-of-the-art which is sequentially finetuned on an additional dataset before finetuning on COPA.

For results on different languages and a sequential finetuning setup which yields better results, make sure to check out [the MAD-X paper](https://arxiv.org/pdf/2005.00052.pdf).

So you mean to say that even the SOTAs struggle to match randomized baseline for Zh.