# Training on Wikiann

In this notebook, we use [MAD-X 2.0](https://arxiv.org/pdf/2012.15562.pdf) with a stacked language and task adapter setup to perform zero-shot cross-lingual transfer for NER.
We use a NER adapter from [AdapterHub.ml](https://adapterhub.ml/explore) pre-trained on the **English** portion of the [WikiAnn](https://www.aclweb.org/anthology/P17-1178.pdf) dataset and transfer to **Guarani** with a pre-trained language adapter.
This notebook is similar to the 'run_ner.py' example script in 'examples/pytorch/token-classification/'.

First, make sure 'adapter-transformers' and the other required packages are installed. The cell below then restricts the notebook to a single GPU.

In [1]:
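import os

# Enumerate GPUs in PCI bus order so that device indices match those shown by nvidia-smi.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# Expose only the GPU with index 1 to this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"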

Next, we initialize the tokenizer and the model with the correct labels.

In [2]:
# The adapter classes used in later cells (AdapterConfig, AdapterTrainer) come from adapter-transformers.
from transformers import AutoConfig, AutoModelForTokenClassification, AutoTokenizer

# NER tags, in the order used by the WikiAnn `ner_tags` feature
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
id_2_label = dict(enumerate(labels))
label_2_id = {label: idx for idx, label in id_2_label.items()}

model_name = "bert-base-multilingual-cased"
config = AutoConfig.from_pretrained(model_name, num_labels=len(labels), id2label=id_2_label, label2id=label_2_id)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, config=config)

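Now, we load the task adapter and the language adapter. Following MAD-X 2.0, we drop both adapters from the last transformer layer ('leave_out=[11]', since the 12 layers are indexed from 0). We then set both adapters as active so that they are stacked, with the language adapter applied before the task adapter.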

In [None]:
from transformers import AdapterConfig
target_language = "gn"  # any language for which a bert-base-multilingual-cased language adapter is available
source_language = "en"  # supported source languages: "en", "ja", "zh", and "ar"

adapter_config = AdapterConfig.load(
    None,
    leave_out=[11]
)

model.load_adapter(
    "wikiann/" + source_language + "@ukp",
    config=adapter_config,
    load_as="wikiann",
)
    
lang_adapter_name = model.load_adapter(
    target_language + "/wiki@ukp",
    load_as=target_language,
    leave_out=[11],
)
# Set the adapters to be used in every forward pass
model.set_active_adapters([lang_adapter_name, "wikiann"])

Downloading bert-base-multilingual-cased.json:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading bert-base-multilingual-cased_wikiann_ner_en_pfeiffer.zip:   0%|          | 0.00/3.16M [00:00<?, ?B…

Downloading gn_relu_2.zip:   0%|          | 0.00/28.2M [00:00<?, ?B/s]
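To make the stacked setup more concrete, here is a toy sketch in plain Python. It is not the adapter-transformers implementation: it assumes a simplified bottleneck adapter (down-projection, ReLU, up-projection, residual connection) and leaves out the frozen transformer layers, layer normalization, and the invertible adapters of MAD-X. The names `bottleneck_adapter` and `run_stack` and all weights are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def bottleneck_adapter(down: Matrix, up: Matrix) -> Callable[[Vector], Vector]:
    """Build f(h) = h + up @ relu(down @ h), a simplified bottleneck adapter."""
    def apply(hidden: Vector) -> Vector:
        bottleneck = [max(0.0, v) for v in matvec(down, hidden)]  # ReLU
        delta = matvec(up, bottleneck)
        return [h + d for h, d in zip(hidden, delta)]  # residual connection
    return apply


def run_stack(
    hidden: Vector,
    num_layers: int,
    lang_adapter: Callable[[Vector], Vector],
    task_adapter: Callable[[Vector], Vector],
    leave_out: Sequence[int] = (),
) -> Tuple[Vector, List[int]]:
    """Apply language adapter, then task adapter, in every layer not in leave_out."""
    adapted_layers = []
    for layer in range(num_layers):
        # The frozen transformer layer would run here; it is omitted in this toy sketch.
        if layer in leave_out:
            continue
        hidden = task_adapter(lang_adapter(hidden))  # stack: language first, task second
        adapted_layers.append(layer)
    return hidden, adapted_layers


# Hypothetical weights: hidden size 2, bottleneck size 1.
lang = bottleneck_adapter(down=[[1.0, 0.0]], up=[[1.0], [0.0]])
task = bottleneck_adapter(down=[[0.0, 1.0]], up=[[0.0], [0.0]])  # zero up-projection = identity

hidden, adapted_layers = run_stack([1.0, 2.0], num_layers=12,
                                   lang_adapter=lang, task_adapter=task, leave_out=[11])
print(adapted_layers)  # layers 0..10 carry adapters, layer 11 is left out
print(hidden)
```

With `leave_out=[11]`, the last of the 12 layers passes its hidden state through unchanged by any adapter, which mirrors the configuration used in the cell above.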

Next, we can download the dataset and initialize the training arguments.

In [None]:
from datasets import load_dataset
from transformers import TrainingArguments

datasets = load_dataset('wikiann', target_language)

training_args = TrainingArguments(
    per_device_eval_batch_size=64,
    do_predict=True,
    output_dir="ner_models/madx/",
)

Downloading builder script:   0%|          | 0.00/3.94k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/12.6k [00:00<?, ?B/s]

Downloading and preparing dataset wikiann/gn (download: 223.17 MiB, generated: 79.70 KiB, post-processed: Unknown size, total: 223.25 MiB) to /root/.cache/huggingface/datasets/wikiann/gn/1.1.0/4bfd4fe4468ab78bb6e096968f61fab7a888f44f9d3371c2f3fea7e74a5a354e...


Downloading data:   0%|          | 0.00/234M [00:00<?, ?B/s]

Generating validation split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/100 [00:00<?, ? examples/s]

Dataset wikiann downloaded and prepared to /root/.cache/huggingface/datasets/wikiann/gn/1.1.0/4bfd4fe4468ab78bb6e096968f61fab7a888f44f9d3371c2f3fea7e74a5a354e. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

This function is adapted from the example script 'run_ner.py'. It tokenizes the pre-split words with the model's tokenizer and aligns the word-level labels with the resulting sub-word tokens.

In [None]:
# This method is adapted from the huggingface transformers run_ner.py example script
# Tokenize all texts and align the labels with them.
def tokenize_and_align_labels(examples):
    text_column_name = "tokens"
    label_column_name = "ner_tags"
    tokenized_inputs = tokenizer(
        examples[text_column_name],
        padding=False,
        truncation=True,
        # We use this argument because the texts in our dataset are lists of words (with a label for each word).
        is_split_into_words=True,
    )
    labels = []
    for i, label in enumerate(examples[label_column_name]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the remaining sub-word tokens of a word, we also set the label to -100, so that only the
            # first token of each word contributes to the loss and the metrics.
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx

        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
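The alignment rule inside the function can be isolated into a few lines that run without a tokenizer. The sketch below assumes the `word_ids` list has already been produced by a fast tokenizer; the helper name `align_word_labels` and the example values are hypothetical.

```python
from typing import List, Optional, Sequence

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def align_word_labels(word_ids: Sequence[Optional[int]], word_labels: Sequence[int]) -> List[int]:
    """Give each word's label to its first sub-word token and IGNORE_INDEX to all other tokens."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] belong to no word.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word_id:
            # First sub-word token of a new word.
            aligned.append(word_labels[word_id])
        else:
            # Continuation of the previous word.
            aligned.append(IGNORE_INDEX)
        previous_word_id = word_id
    return aligned


# Hypothetical sentence of three words; the second word is split into three sub-word tokens.
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [1, 0, 5]  # B-PER, O, B-LOC in the WikiAnn tag order
print(align_word_labels(word_ids, word_labels))
# -> [-100, 1, 0, -100, -100, 5, -100]
```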

We apply the previous method to the test dataset to prepare it for prediction. 

In [None]:
from transformers import DataCollatorForTokenClassification
test_dataset = datasets["test"]
test_dataset = test_dataset.map(
    tokenize_and_align_labels,
    batched=True,
)

# Pads inputs and labels to the longest sequence in each batch
data_collator = DataCollatorForTokenClassification(tokenizer)



  0%|          | 0/1 [00:00<?, ?ba/s]
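Because the tokenizer is called with `padding=False`, the sequences are only padded when a batch is assembled. The following sketch shows the idea in plain Python lists. It is a simplification, not the `DataCollatorForTokenClassification` code: it assumes right-side padding, a pad token id of 0, and a label padding value of -100, and the helper name `pad_batch` and the token ids are hypothetical.

```python
from typing import Dict, List


def pad_batch(
    features: List[Dict[str, List[int]]],
    pad_token_id: int = 0,     # assumed id of the padding token
    label_pad_id: int = -100,  # padded positions are ignored by the loss
) -> Dict[str, List[List[int]]]:
    """Pad every example to the length of the longest example in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        length = len(f["input_ids"])
        n_pad = max_len - length
        batch["input_ids"].append(list(f["input_ids"]) + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * length + [0] * n_pad)
        batch["labels"].append(list(f["labels"]) + [label_pad_id] * n_pad)
    return batch


# Two hypothetical tokenized examples of different lengths.
features = [
    {"input_ids": [101, 7, 8, 9, 102], "labels": [-100, 1, 2, 0, -100]},
    {"input_ids": [101, 5, 102], "labels": [-100, 5, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"][1])       # -> [101, 5, 102, 0, 0]
print(batch["attention_mask"][1])  # -> [1, 1, 1, 0, 0]
print(batch["labels"][1])          # -> [-100, 5, -100, -100, -100]
```

Padding the labels with -100 means the padded positions are dropped by the same `l != -100` filter that removes special tokens when the metrics are computed.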

We use the `AdapterTrainer` class, which extends Hugging Face's `Trainer`, to evaluate zero-shot transfer on the WikiAnn test dataset.

In [None]:
from transformers import TrainingArguments, AdapterTrainer, EvalPrediction
from datasets import load_metric
import numpy as np


# Metrics
metric = load_metric("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    label_list = id_2_label

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Downloading builder script:   0%|          | 0.00/2.47k [00:00<?, ?B/s]
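The precision, recall, and F1 values returned by `seqeval` are computed over whole entities rather than individual tokens, while the accuracy is token-level. The sketch below illustrates entity-level scoring on BIO tags in plain Python. It is a simplified stand-in, not the `seqeval` implementation; in particular, it assumes that an `I-` tag without a matching preceding tag opens a new entity. The function names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Entity = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def extract_entities(tags: Sequence[str]) -> Set[Entity]:
    """Collect (type, start, end) spans from one BIO-tagged sentence."""
    entities: Set[Entity] = set()
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing entity
        prefix, _, label = tag.partition("-")
        continues = start is not None and prefix == "I" and label == ent_type
        if start is not None and not continues:
            entities.add((ent_type, start, i))
            start, ent_type = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, ent_type = i, label
    return entities


def entity_prf(true_seqs: List[List[str]], pred_seqs: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged entity-level precision, recall, and F1 (exact span and type match)."""
    n_correct = n_pred = n_true = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        gold = extract_entities(true_tags)
        pred = extract_entities(pred_tags)
        n_correct += len(gold & pred)
        n_pred += len(pred)
        n_true += len(gold)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical sentence: the person is found, the location is mislabeled as an organization.
true_tags = [["B-PER", "I-PER", "O", "B-LOC"]]
pred_tags = [["B-PER", "I-PER", "O", "B-ORG"]]
print(entity_prf(true_tags, pred_tags))  # -> (0.5, 0.5, 0.5)
```

In this example three of four tokens are tagged correctly, yet only one of two entities matches, which is why entity-level scores can sit well below token accuracy.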

Finally, we can predict the labels for the test set and evaluate the predictions.

In [None]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: langs, tokens, ner_tags, spans. If langs, tokens, ner_tags, spans are not expected by `BertForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 100
  Batch size = 64


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: erzaliator. Use `wandb login --relogin` to force relogin


{'eval_loss': 1.0064687728881836,
 'eval_precision': 0.437125748502994,
 'eval_recall': 0.6952380952380952,
 'eval_f1': 0.5367647058823529,
 'eval_accuracy': 0.784037558685446,
 'eval_runtime': 1.7892,
 'eval_samples_per_second': 55.89,
 'eval_steps_per_second': 1.118}
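As a sanity check on these numbers, F1 is the harmonic mean of precision and recall, so `eval_f1` can be recomputed from `eval_precision` and `eval_recall`. The snippet below also recovers small-integer ratios for both values; the denominator limit of 1000 is an arbitrary assumption, and the resulting entity counts are an inference, since the log does not report them.

```python
from fractions import Fraction

# Values copied from the evaluation output above.
precision = 0.437125748502994
recall = 0.6952380952380952
reported_f1 = 0.5367647058823529

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))  # -> 0.536765

# Closest fractions with a denominator of at most 1000 (an assumed, arbitrary bound).
precision_ratio = Fraction(precision).limit_denominator(1000)
recall_ratio = Fraction(recall).limit_denominator(1000)
print(precision_ratio, recall_ratio)
```

If the two ratios share a numerator, that number is a plausible count of correctly predicted entities, with the denominators giving the predicted and gold entity counts. Recall is clearly higher than precision here, so the transferred model proposes more entities than the gold annotation contains.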