# 4️⃣ Zero-Shot Cross-Lingual Transfer using Adapters

Beyond AdapterFusion, which we trained in [the previous notebook](https://github.com/Adapter-Hub/adapter-transformers/blob/master/notebooks/04_Cross_Lingual_Transfer.ipynb), we can compose adapters for zero-shot cross-lingual transfer between tasks. We will use the stacked adapter setup presented in **MAD-X** ([Pfeiffer et al., 2020](https://arxiv.org/pdf/2005.00052.pdf)) for this purpose.

In this example, the base model is a pre-trained multilingual **XLM-R** (`xlm-roberta-base`) ([Conneau et al., 2019](https://arxiv.org/pdf/1911.02116.pdf)) model. Additionally, two types of adapters, language adapters and task adapters, are used. Here's how the MAD-X process works in detail:

1. Train language adapters for the source and target language on a language modeling task. In this notebook, we won't train them ourselves but use [pre-trained language adapters from the Hub](https://adapterhub.ml/explore/text_lang/).
2. Train a task adapter on the target task dataset. This task adapter is **stacked** upon the previously trained language adapter. During this step, only the weights of the task adapter are updated.
3. Perform zero-shot cross-lingual transfer. In this last step, we simply replace the source language adapter with the target language adapter while keeping the stacked task adapter.

Now to our concrete example: we select **XCOPA** ([Ponti et al., 2020](https://ducdauge.github.io/files/xcopa.pdf)), a multilingual extension of the **COPA** commonsence reasoning dataset ([Roemmele et al., 2011](https://people.ict.usc.edu/~gordon/publications/AAAI-SPRING11A.PDF)) as our target task. The setup is trained on the original **English** dataset and then transferred to **Chinese**.

## Installation

Besides `adapter-transformers`, we use HuggingFace's `datasets` library for loading the data. So let's install both first:

In [1]:
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="6"

import torch
import wandb

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
SEED = 42
BERT_MODEL = 'bert-base-multilingual-cased'
BATCH_SIZE = 8

In [3]:
experiment_name = 'Mad'
lr = 'lr=AdafactorDefault'
model_name = 'plm='+BERT_MODEL
additional_info = 'Info=v1_test1'
lang1 = 'en'
lang2 = 'de'
name = '_'.join([experiment_name, lr, model_name, additional_info, lang1, lang2])

MODEL_DIR = 'runs/'+name

In [4]:
#WANDB
wandb.init(reinit=True)
wandb.run.name = name

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33merzaliator[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [5]:
# LOGGING
from utils.logger import Logger
logger = Logger(MODEL_DIR, wandb, wandb_flag=True)
print('Running with params: BERT_MODEL='+ BERT_MODEL+ ' lr='+ str(lr))

Running with params: BERT_MODEL=bert-base-multilingual-cased lr=lr=AdafactorDefault


## Dataset Preprocessing

We need the English COPA dataset for training our task adapter. It is part of the SuperGLUE benchmark and can be loaded via `datasets` using one line of code:

In [6]:
from datasets import Dataset

In [7]:
import pandas as pd
from datasets import Dataset

def read_df_custom(file):
    header = 'doc     unit1_toks      unit2_toks      unit1_txt       unit2_txt       s1_toks s2_toks unit1_sent      unit2_sent      dir     nuc_children    sat_children    genre   u1_discontinuous        u2_discontinuous       u1_issent        u2_issent       u1_length       u2_length       length_ratio    u1_speaker      u2_speaker      same_speaker    u1_func u1_pos  u1_depdir       u2_func u2_pos  u2_depdir       doclen  u1_position      u2_position     percent_distance        distance        lex_overlap_words       lex_overlap_length      unit1_case      unit2_case      label'
    extracted_columns = ['unit1_txt', 'unit1_sent', 'unit2_txt', 'unit2_sent', 'dir', 'label', 'distance', 'u1_depdir', 'u2_depdir', 'u2_func', 'u1_position', 'u2_position', 'sat_children', 'nuc_children', 'genre', 'unit1_case', 'unit2_case',
                            'u1_discontinuous', 'u2_discontinuous', 'same_speaker', 'lex_overlap_length', 'u1_func']
    header = header.split()
    df = pd.DataFrame(columns=extracted_columns)
    file = open(file, 'r')

    rows = []
    count = 0 
    for line in file:
        line = line[:-1].split('\t')
        count+=1
        if count ==1: continue
        row = {}
        for column in extracted_columns:
            index = header.index(column)
            row[column] = line[index]
        rows.append(row)

    df = pd.concat([df, pd.DataFrame.from_records(rows)])
    return df

en_train_dataset_df = Dataset.from_pandas(read_df_custom('../../processed/eng.rst.rstdt_train_enriched.rels'))
en_test_dataset_df = Dataset.from_pandas(read_df_custom('../../processed/eng.rst.rstdt_test_enriched.rels'))
en_valid_dataset_df = Dataset.from_pandas(read_df_custom('../../processed/eng.rst.rstdt_dev_enriched.rels'))

de_train_dataset_df = Dataset.from_pandas(read_df_custom('../../processed/deu.rst.pcc_train_enriched.rels'))
de_test_dataset_df = Dataset.from_pandas(read_df_custom('../../processed/deu.rst.pcc_test_enriched.rels'))
de_valid_dataset_df = Dataset.from_pandas(read_df_custom('../../processed/deu.rst.pcc_dev_enriched.rels'))

len(en_train_dataset_df), len(en_test_dataset_df), len(en_valid_dataset_df)

(16002, 2155, 1621)

In [8]:
en_train_dataset_df.features

{'unit1_txt': Value(dtype='string', id=None),
 'unit1_sent': Value(dtype='string', id=None),
 'unit2_txt': Value(dtype='string', id=None),
 'unit2_sent': Value(dtype='string', id=None),
 'dir': Value(dtype='string', id=None),
 'label': Value(dtype='string', id=None),
 'distance': Value(dtype='string', id=None),
 'u1_depdir': Value(dtype='string', id=None),
 'u2_depdir': Value(dtype='string', id=None),
 'u2_func': Value(dtype='string', id=None),
 'u1_position': Value(dtype='string', id=None),
 'u2_position': Value(dtype='string', id=None),
 'sat_children': Value(dtype='string', id=None),
 'nuc_children': Value(dtype='string', id=None),
 'genre': Value(dtype='string', id=None),
 'unit1_case': Value(dtype='string', id=None),
 'unit2_case': Value(dtype='string', id=None),
 'u1_discontinuous': Value(dtype='string', id=None),
 'u2_discontinuous': Value(dtype='string', id=None),
 'same_speaker': Value(dtype='string', id=None),
 'lex_overlap_length': Value(dtype='string', id=None),
 'u1_func':

In this example, we model COPA as a multiple-choice task with two choices. Thus, we encode the premise and question as well as both choices as one input to our `xlm-roberta-base` model. Using `dataset.map()`, we can pass the full dataset through the tokenizer in batches:

In [9]:
from transformers import AutoTokenizer, BertTokenizer
from datasets import ClassLabel

tokenizer = AutoTokenizer.from_pretrained(BERT_MODEL)

class SNLIDataset(torch.utils.data.Dataset):
    """A customized dataset to load the SNLI dataset."""
    def __init__(self, dataset, labels, raw_text=False):
        self.text = []
        self.raw_text = []
        self.raw_label = []
        self.raw_text_flag = raw_text
        self.num_rows = len(dataset)
        for premise, hypothesis in zip(dataset['unit1_txt'], dataset['unit2_txt']):
            self.text.append(tokenizer.encode_plus(premise, hypothesis, padding="max_length", truncation=True, max_length=512, return_token_type_ids=True))
            if raw_text: self.raw_text.append([premise, hypothesis])
        # self.labels = torch.tensor(labels.str2int(dataset['label'])).to(device)
        self.labels = labels.str2int(dataset['label'])
        if raw_text: self.raw_label = dataset['label']
        print('read ' + str(len(self.text)) + ' examples')

    def __getitem__(self, idx):
        if self.raw_text_flag:  
            return {'input_ids':self.text[idx]['input_ids'], 
                'token_type_ids':self.text[idx]['token_type_ids'], 
                'attention_mask':self.text[idx]['attention_mask'], 
                'raw_text': self.raw_text[idx],
                'label':self.labels[idx],
                'raw_label': self.raw_label[idx]}

        return {'input_ids':self.text[idx]['input_ids'], 
                'token_type_ids':self.text[idx]['token_type_ids'], 
                'attention_mask':self.text[idx]['attention_mask'], 
                'label':self.labels[idx]}

    def __len__(self):
        return len(self.text)

    def combine_with_another_dataset(self, dataset2):
        print('Adding datasets of sizes ', self.num_rows, ', ', dataset2.num_rows)
        self.text = self.text + dataset2.text
        self.raw_text = self.raw_text + dataset2.raw_text
        self.raw_label = self.raw_label + dataset2.raw_label
        self.num_rows = len(self.text) #update texts is now longer
        assert self.raw_text_flag == dataset2.raw_text_flag
        print('To new dataset size: ', self.num_rows)


def load_data_snli(batch_size, train_dataset_df, valid_dataset_df, test_dataset_df, labels):
    """Download the SNLI dataset and return data iterators and vocabulary."""
    train_data = train_dataset_df
    valid_data = valid_dataset_df
    test_data = test_dataset_df
    train_set = SNLIDataset(train_data, labels, raw_text=False)
    valid_set = SNLIDataset(valid_data, labels, raw_text=False)
    test_set = SNLIDataset(test_data, labels, raw_text=False)
    return train_set, valid_set, test_set

en_labels = ClassLabel(names=list(set(en_train_dataset_df['label'])|set(en_test_dataset_df['label'])|set(en_valid_dataset_df['label'])))
en_train_dataset, en_valid_dataset, en_test_dataset = load_data_snli(BATCH_SIZE, en_train_dataset_df, en_valid_dataset_df, en_test_dataset_df, en_labels)
de_labels = ClassLabel(names=list(set(de_train_dataset_df['label'])|set(de_test_dataset_df['label'])|set(de_valid_dataset_df['label'])))
de_train_dataset, de_valid_dataset, de_test_dataset = load_data_snli(BATCH_SIZE, de_train_dataset_df, de_valid_dataset_df, de_test_dataset_df, de_labels)

read 16002 examples
read 1621 examples
read 2155 examples
read 2164 examples
read 241 examples
read 260 examples


## Task Adapter Training

In this section, we will train the task adapter on the English COPA dataset. We use a pre-trained XLM-R model from HuggingFace and instantiate our model using `AutoAdapterModel`.

In [10]:
from transformers import AutoConfig, AutoAdapterModel

config = AutoConfig.from_pretrained(
    BERT_MODEL,
)
model = AutoAdapterModel.from_pretrained(
    BERT_MODEL,
    config=config,
)

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertAdapterModel: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertAdapterModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertAdapterModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [11]:
from transformers import AdapterConfig

# Load the language adapters
lang_adapter_config = AdapterConfig.load("pfeiffer", reduction_factor=2)
model.load_adapter("en/wiki@ukp", config=lang_adapter_config)
model.load_adapter("de/wiki@ukp", config=lang_adapter_config)

# Add a new task adapter
model.add_adapter("disrpt")

# Add a classification head for our target task
num_labels=de_labels.num_classes
print([num_labels])
model.add_classification_head("disrpt", num_labels=num_labels)

[26]


In [12]:
model.train_adapter(["disrpt"])

In [13]:
# Unfreeze and activate stack setup
from transformers.adapters.composition import Stack

lang = 'en'
model.active_adapters = Stack(lang, "disrpt")

In [14]:
print(model)

BertAdapterModel(
  (shared_parameters): ModuleDict()
  (bert): BertModel(
    (shared_parameters): ModuleDict()
    (invertible_adapters): ModuleDict(
      (en): NICECouplingBlock(
        (F): Sequential(
          (0): Linear(in_features=384, out_features=192, bias=True)
          (1): Activation_Function_Class(
            (f): ReLU()
          )
          (2): Linear(in_features=192, out_features=384, bias=True)
        )
        (G): Sequential(
          (0): Linear(in_features=384, out_features=192, bias=True)
          (1): Activation_Function_Class(
            (f): ReLU()
          )
          (2): Linear(in_features=192, out_features=384, bias=True)
        )
      )
      (de): NICECouplingBlock(
        (F): Sequential(
          (0): Linear(in_features=384, out_features=192, bias=True)
          (1): Activation_Function_Class(
            (f): NewGELUActivation()
          )
          (2): Linear(in_features=192, out_features=384, bias=True)
        )
        (G): Seque

In [15]:
from copy import deepcopy
from transformers import TrainerCallback

class CustomCallback(TrainerCallback):
    
    def __init__(self, trainer) -> None:
        super().__init__()
        self._trainer = trainer
    
    def on_epoch_end(self, args, state, control, **kwargs):
        if control.should_evaluate:
            control_copy = deepcopy(control)
            self._trainer.evaluate(eval_dataset=self._trainer.train_dataset, metric_key_prefix="train@"+lang)
            return control_copy

# CM

In [16]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score, log_loss
from torch.nn import CrossEntropyLoss
import numpy as np
import os
    
def compute_metrics(pred):
    global num_labels
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    # confusion matrix
    class_names = en_labels if lang=='en' else de_labels
    wandb.log({"conf_mat" : wandb.plot.confusion_matrix(probs=None, y_true=labels, preds=preds, class_names=class_names._int2str)})
    labels2 = [class_names._int2str[x.item()] for x in labels]
    preds2 = [class_names._int2str[x.item()] for x in preds]
    # log predictions
    output_predict_folder = MODEL_DIR+'_'+lang
    if not os.path.exists(output_predict_folder): os.makedirs(output_predict_folder)
    output_predict_file = os.path.join(output_predict_folder, "predictions.txt")
    with open(output_predict_file, "w") as writer:
        writer.write(str({'prefix': lang, 'labels': labels2, 'preds': preds2}))
    wandb.save(os.path.join(output_predict_file))

    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    loss_fct = CrossEntropyLoss()
    logits = torch.tensor(pred.predictions)
    labels = torch.tensor(labels)
    loss = loss_fct(logits.view(-1, num_labels), labels.view(-1))

    return {
        'accuracy@'+lang: acc,
        'f1@'+lang: f1,
        'precision@'+lang: precision,
        'recall@'+lang: recall,
        'loss@'+lang: loss,
    }

In [17]:
from transformers import TrainingArguments, AdapterTrainer
from itertools import chain


training_args = TrainingArguments(
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=1e-4,
    num_train_epochs=8,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    output_dir=MODEL_DIR+'_'+lang,
    overwrite_output_dir=True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=False,
    save_total_limit=1,
)

trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=en_train_dataset, #CHANGE
    eval_dataset=en_valid_dataset,
    compute_metrics=compute_metrics
)

trainer.add_callback(CustomCallback(trainer)) 

# Train

In [18]:
train_result = trainer.train()

metrics = train_result.metrics
trainer.log_metrics("train_log", metrics)
trainer.save_metrics("train_log", metrics)

***** Running training *****
  Num examples = 16002
  Num Epochs = 8
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 8008
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Epoch,Training Loss,Validation Loss


Saving model checkpoint to runs/Mad_lr=AdafactorDefault_plm=bert-base-multilingual-cased_Info=v1_test1_en_de_en/checkpoint-500
Configuration saved in runs/Mad_lr=AdafactorDefault_plm=bert-base-multilingual-cased_Info=v1_test1_en_de_en/checkpoint-500/en/adapter_config.json
Configuration saved in runs/Mad_lr=AdafactorDefault_plm=bert-base-multilingual-cased_Info=v1_test1_en_de_en/checkpoint-500/en/adapter_config.json
Module weights saved in runs/Mad_lr=AdafactorDefault_plm=bert-base-multilingual-cased_Info=v1_test1_en_de_en/checkpoint-500/en/pytorch_adapter.bin
Configuration saved in runs/Mad_lr=AdafactorDefault_plm=bert-base-multilingual-cased_Info=v1_test1_en_de_en/checkpoint-500/de/adapter_config.json
Module weights saved in runs/Mad_lr=AdafactorDefault_plm=bert-base-multilingual-cased_Info=v1_test1_en_de_en/checkpoint-500/de/pytorch_adapter.bin
Configuration saved in runs/Mad_lr=AdafactorDefault_plm=bert-base-multilingual-cased_Info=v1_test1_en_de_en/checkpoint-500/disrpt/adapter_con

***** train_log metrics *****
  epoch                    =        8.0
  total_flos               = 37331542GF
  train_loss               =     1.0452
  train_runtime            = 1:24:02.53
  train_samples_per_second =     25.387
  train_steps_per_second   =      1.588


In [19]:
trainer.evaluate(metric_key_prefix='test_en',
                eval_dataset=en_test_dataset)

***** Running Evaluation *****
  Num examples = 2155
  Batch size = 16


{'test_en_loss': 1.2087045907974243,
 'test_en_accuracy@en': 0.6519721577726219,
 'test_en_f1@en': 0.4828209714935802,
 'test_en_precision@en': 0.5587316549447741,
 'test_en_recall@en': 0.45135070136556504,
 'test_en_loss@en': 1.2087047100067139,
 'test_en_runtime': 26.2237,
 'test_en_samples_per_second': 82.178,
 'test_en_steps_per_second': 5.148,
 'epoch': 8.0}

In [20]:
lang = 'de(en)'

trainer.evaluate(metric_key_prefix='test_de(en)',
                eval_dataset=de_test_dataset)

***** Running Evaluation *****
  Num examples = 260
  Batch size = 16
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


{'test_de(en)_loss': 6.265538215637207,
 'test_de(en)_accuracy@de(en)': 0.05,
 'test_de(en)_f1@de(en)': 0.019683919944789508,
 'test_de(en)_precision@de(en)': 0.017743589743589742,
 'test_de(en)_recall@de(en)': 0.0317526395173454,
 'test_de(en)_loss@de(en)': 6.265538692474365,
 'test_de(en)_runtime': 3.4177,
 'test_de(en)_samples_per_second': 76.075,
 'test_de(en)_steps_per_second': 4.974,
 'epoch': 8.0}

## Cross-lingual transfer

With the model and all adapters trained and ready, we can come to the cross-lingual transfer step here. We will evaluate our setup on the Chinese split of the XCOPA dataset.
Therefore, we'll first download the data and preprocess it using the same method as the English dataset:

In [21]:
lang = 'de'
model.active_adapters = Stack(lang, "disrpt")

### Set training args for de

In [22]:
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=1e-4,
    num_train_epochs=8,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    output_dir=MODEL_DIR+'_'+lang,
    overwrite_output_dir=False,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=False,
    save_total_limit=1,
    # resume_from_checkpoint=MODEL_DIR+'/last-checkpoint',
)

trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=de_train_dataset,
    eval_dataset=de_valid_dataset,
    compute_metrics=compute_metrics
)

trainer.add_callback(CustomCallback(trainer)) 

Exception in thread SystemMonitor:
Traceback (most recent call last):
  File "/home/VD/kaveri/anaconda3/envs/adapters/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/VD/kaveri/anaconda3/envs/adapters/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/VD/kaveri/anaconda3/envs/adapters/lib/python3.10/site-packages/wandb/sdk/internal/system/system_monitor.py", line 118, in _start
    asset.start()
  File "/home/VD/kaveri/anaconda3/envs/adapters/lib/python3.10/site-packages/wandb/sdk/internal/system/assets/cpu.py", line 166, in start
    self.metrics_monitor.start()
  File "/home/VD/kaveri/anaconda3/envs/adapters/lib/python3.10/site-packages/wandb/sdk/internal/system/assets/interfaces.py", line 168, in start
    logger.info(f"Started {self._process.name}")
AttributeError: 'NoneType' object has no attribute 'name'


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [23]:
trainer.train()

metrics = train_result.metrics
trainer.log_metrics("train_logger", metrics)
trainer.save_metrics("train_logger", metrics)

***** Running training *****
  Num examples = 2164
  Num Epochs = 8
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 2168
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Epoch,Training Loss,Validation Loss


***** Running Evaluation *****
  Num examples = 2164
  Batch size = 8
  Num examples = 2164
  Batch size = 8
***** Running Evaluation *****
  Num examples = 241
  Batch size = 8
***** Running Evaluation *****
  Num examples = 241
  Batch size = 8
Saving model checkpoint to runs/Mad_lr=AdafactorDefault_plm=bert-base-multilingual-cased_Info=v1_test1_en_de_de/checkpoint-500
Configuration saved in runs/Mad_lr=AdafactorDefault_plm=bert-base-multilingual-cased_Info=v1_test1_en_de_de/checkpoint-500/en/adapter_config.json
Module weights saved in runs/Mad_lr=AdafactorDefault_plm=bert-base-multilingual-cased_Info=v1_test1_en_de_de/checkpoint-500/en/pytorch_adapter.bin
Configuration saved in runs/Mad_lr=AdafactorDefault_plm=bert-base-multilingual-cased_Info=v1_test1_en_de_de/checkpoint-500/de/adapter_config.json
Module weights saved in runs/Mad_lr=AdafactorDefault_plm=bert-base-multilingual-cased_Info=v1_test1_en_de_de/checkpoint-500/de/pytorch_adapter.bin
Configuration saved in runs/Mad_lr=Adafa

***** train_logger metrics *****
  epoch                    =        8.0
  total_flos               = 37331542GF
  train_loss               =     1.0452
  train_runtime            = 1:24:02.53
  train_samples_per_second =     25.387
  train_steps_per_second   =      1.588


In [24]:
trainer.evaluate(metric_key_prefix='test_de',
                eval_dataset=de_test_dataset)

***** Running Evaluation *****
  Num examples = 260
  Batch size = 8
  Num examples = 260
  Batch size = 8


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


{'test_de_loss': 2.5744662284851074,
 'test_de_accuracy@de': 0.2923076923076923,
 'test_de_f1@de': 0.23475593004467768,
 'test_de_precision@de': 0.2478756341697518,
 'test_de_recall@de': 0.2699608662691827,
 'test_de_loss@de': 2.5744662284851074,
 'test_de_runtime': 3.6001,
 'test_de_samples_per_second': 72.22,
 'test_de_steps_per_second': 9.166,
 'epoch': 8.0}

In [25]:
lang = 'en(de)'

trainer.evaluate(metric_key_prefix='test_en(de)',
                eval_dataset=en_test_dataset)

***** Running Evaluation *****
  Num examples = 2155
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


{'test_en(de)_loss': 3.009084463119507,
 'test_en(de)_accuracy@en(de)': 0.17540603248259862,
 'test_en(de)_f1@en(de)': 0.06975860652048695,
 'test_en(de)_precision@en(de)': 0.08361451184580698,
 'test_en(de)_recall@en(de)': 0.0916845074726499,
 'test_en(de)_loss@en(de)': 3.0090839862823486,
 'test_en(de)_runtime': 27.5999,
 'test_en(de)_samples_per_second': 78.08,
 'test_en(de)_steps_per_second': 9.783,
 'epoch': 8.0}