# CANINE & Question Answering

This Colab notebook shows how to run the Question Answering code developed in [this GitHub repository](https://github.com/chloeskt/nlp_ensae/tree/main). 

This project was carried out by Chloé SEKKAT (ENSAE & ENS Paris-Saclay) and Jocelyn BEAUMANOIR (ENSAE & ESSEC).  

## Description of experiments done

In this section, we compare CANINE with BERT-like models (BERT, mBERT and XLM-RoBERTa) on Question Answering tasks. 
CANINE is a pre-trained, tokenization-free and vocabulary-free encoder that operates directly on character sequences, 
without explicit tokenization. It seeks to generalize beyond the orthographic forms encountered during pre-training.
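
To make "operates directly on character sequences" concrete, the sketch below maps a string to its Unicode code points, which is the kind of input a character-level encoder such as CANINE consumes. The helper `to_codepoints` is hypothetical and only uses the standard library; the real `CanineTokenizer` also handles special tokens and padding, which are left out here.

```python
from typing import List


def to_codepoints(text: str) -> List[int]:
    """Map each character to its Unicode code point (no vocabulary lookup)."""
    return [ord(ch) for ch in text]


clean = to_codepoints("question")
noisy = to_codepoints("qusetion")  # two adjacent letters swapped

# Count the positions where the two sequences carry the same id.
shared = sum(a == b for a, b in zip(clean, noisy))
print(clean)
print(shared, "of", len(clean), "positions unchanged")
```

A typo only changes the ids at the affected positions, and no character can fall outside the "vocabulary", which is one intuition behind testing CANINE on noisy inputs.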

We first evaluate it on extractive question answering (selecting the minimal answer span within a context) on the SQuAD 
dataset. SQuAD is a monolingual (English) dataset available on Hugging Face (as simple as `load_dataset("squad_v2")`). 
The F1-scores obtained are compared to those of BERT-like models (BERT, mBERT and XLM-RoBERTa).
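
As a reminder of what the reported scores measure, here is a minimal sketch of SQuAD-style Exact Match and token-level F1. It follows the usual normalization (lowercasing, removing punctuation and English articles); the function names are ours, and the notebook itself relies on `load_metric` rather than on this code.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, squeeze whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Unanswerable case: full credit only if both sides are empty.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the Eiffel Tower", "Eiffel Tower"))
print(f1_score("in Paris France", "Paris"))
```

F1 gives partial credit when the predicted span overlaps the reference, which is why it is usually reported alongside Exact Match.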

A second step is to assess its generalization abilities in a zero-shot transfer setting: the model is finetuned on an 
English dataset and then directly evaluated on XQuAD, a multilingual dataset with 11 languages of various morphologies.

A third experiment tests the ability of CANINE to handle noisy inputs, especially noisy questions, since in real-life 
settings questions are often noisy (misspellings, wrong grammar, etc.; think of ASR systems or keyboard errors while 
typing).
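
To give an idea of what a "noisy question" can look like, the sketch below swaps two adjacent inner letters in a fraction of the words. This is an illustration only: `add_typos` and its `noise_level` parameter are hypothetical, the code uses only the standard library, and it is not the procedure used to build the noisy datasets downloaded later in this notebook.

```python
import random


def add_typos(question: str, noise_level: float, seed: int = 0) -> str:
    """Swap two adjacent inner letters in roughly `noise_level` of the words."""
    rng = random.Random(seed)
    noisy_words = []
    for word in question.split():
        # Only touch words long enough to keep their first and last characters.
        if len(word) > 3 and rng.random() < noise_level:
            i = rng.randrange(1, len(word) - 2)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        noisy_words.append(word)
    return " ".join(noisy_words)


question = "When was the tower built?"
print(add_typos(question, noise_level=1.0))
```

Keeping the first and last characters of each word mimics common keyboard slips while leaving the question readable for a human.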

Our fourth experiment measures the ability of CANINE to adapt to a new target domain with few-shot learning only. 
We take a CANINE model finetuned on SQuADv2 (a general Wikipedia-based dataset) and measure its performance on another, 
domain-specific dataset (for instance medical or legal datasets, two domains with very specific wording and concepts) 
after training it for a small number of epochs (3 or less) on a very small number of labeled examples (fewer than 250 
for instance). These performances are compared to those of the other models considered in this study. 

Last, we stay in the few-shot learning setting but test the ability of CANINE to resist adversarial attacks, knowing 
that it has not been trained for that and that it will only be trained for a few epochs on a small number of 
adversarial examples. 

## Setup

In [None]:
# Clone the repository containing all the code

!rm -rf nlp_ensae
!git clone https://github.com/chloeskt/nlp_ensae.git

Cloning into 'nlp_ensae'...
remote: Enumerating objects: 341, done.
remote: Counting objects: 100% (341/341), done.
remote: Compressing objects: 100% (240/240), done.
remote: Total 341 (delta 188), reused 242 (delta 93), pack-reused 0
Receiving objects: 100% (341/341), 351.77 KiB | 10.66 MiB/s, done.
Resolving deltas: 100% (188/188), done.


In [None]:
# check GPU
!nvidia-smi

Fri Apr 22 12:25:38 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    26W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Install some dependencies
! pip install --quiet pandas datasets transformers nlpaug

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the fo

In [None]:
import transformers

transformers.logging.set_verbosity_error()

In [None]:
# Mount your google drive to save results

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Download pretrained models and custom datasets
# please note that sometimes the download fails for unknown reasons
# But you can access all files here: https://drive.google.com/drive/folders/1L9Su25qatgdmoz-rZbeY_tA2bXq9T9EG?usp=sharing

!gdown --folder https://drive.google.com/drive/folders/1L9Su25qatgdmoz-rZbeY_tA2bXq9T9EG?usp=sharing -O /content/drive/MyDrive/

Retrieving folder list
Retrieving folder 1lvB8pXlcdAj64xan6bED-9c_xsWQ-e8y bert-finetuned
Processing file 12aGuB0DiwptxNY3xAUt759-BeX_UiL9V best_model.pt
Retrieving folder 12S395w5ztN_d2fmVkwYvbJnQcZJWU43O canine-c-finetuned
Processing file 1l7EMi2n3ibRcMYQ5mVOaGRF0XOEfncR4 canine-c_best_model.pt
Retrieving folder 1jNDtNF69uK7jZ-hFVPV2Ocou0SKZPQx0 canine-s-finetuned
Processing file 1Wduh6YE2nOm6LwLMy13nTyDRkrUS3ZXQ canine-s_best_model.pt
Retrieving folder 1Hjr-Im_cUAcGFtKLroNkYiN38tcxUJ2e cuad_original
Retrieving folder 1-juZUfjyz4xkdg05EpEWvPOvHOU9rL-6 train
Processing file 1-rKE2DTLvzJ-Vj5gBaX-fqFPJXtfLjWS dataset_info.json
Processing file 1-sUH55NaIsm_5qnt9xkoCsSQ4dfpRoIY dataset.arrow
Processing file 1-sFjT0Uy8SJgCAaUDkEdxGuUMk3A0za4 state.json
Retrieving folder 1-jKeWvKJJRBZ3dlrnQSFPPkKleV__viK validation
Processing file 1-kJyA-NsPJEmt_RE9fOvLfQcEjT22gfn dataset_info.json
Processing file 1-lTM1zK9bERj0JM8rotLicvHWa5HieDH dataset.arrow
Processing file 1-kyZZstvaubuo5jqvjZiUO0QpEBr1

In [None]:
# Noisy version of SQuADv2
# please note that sometimes the download fails for unknown reasons
# (it says the link is not accessible by everyone, but it is, I suspect this is
# because the folders are too large)
# But you can access all files here: https://drive.google.com/drive/folders/1ff5GBaY9Bp3y3J6MLXKoMTMmxdd0ERBl?usp=sharing

!gdown --folder https://drive.google.com/drive/folders/1ff5GBaY9Bp3y3J6MLXKoMTMmxdd0ERBl?usp=sharing -O /content/drive/MyDrive/

Retrieving folder list
Retrieving folder 1CgmNDwuuYDJD0ye7tZ1ptajn73r1XFcU noisy_data_10
Retrieving folder 1KDvKe6fSHO-1mdKWkbJNqLLjJQBmBLL0 train
Processing file 1iKRVuohjWx--VNzMKXCT8nJI5u1pd6pC cache-0fdcb22fae4850c8.arrow
Processing file 1QYxEynfnCtrMfmKr7QU5RMrWU_GvALJU cache-1f2266c96ddb6db8.arrow
Processing file 15MhwgAyTRn1YWmTCnsN0eSES2q3koJP5 cache-3f9a21a01b6905c7.arrow
Processing file 13rORI61kBKSkJ4KuGCkAYH5H_SD6wA9U cache-45821d544f49e60d.arrow
Processing file 1Sm76OutpyPDs_u1VRP_j0RuGxji-8CFv cache-a21ece08dc5456e5.arrow
Processing file 1uft4EsijtR2C6kKToeYNQ5HdpSl1X-eS cache-ca406433db74306a.arrow
Processing file 1HB_0DrVrrnxnRvEjJgPanuKGreAOZ_1G cache-cbe2ea965cc57889.arrow
Processing file 1kw5RG03OA7ewnh6L2KQFjIpI4bP0B7yq cache-f5167625bdc65a22.arrow
Processing file 1cqnmwXetraD7jppR_ICfCEA3hFutBoOw dataset_info.json
Processing file 1xM3dUPXeu1nXRot7-epZqE3IkuuxLYc0 dataset.arrow
Processing file 1W3bcWYbXf9EE8HLSYNVCQYwTOUPSy2e4 state.json
Retrieving folder 15BAc9ayQk

In [None]:
# Cd into the question_answering folder to access the Python package

%cd nlp_ensae/source/question_answering/

/content/nlp_ensae/source/question_answering


## Imports

In [None]:
import argparse

import logging
import os
from abc import ABC, abstractmethod
from typing import Any, Dict, List, OrderedDict, Callable, Tuple, Union, Optional
from tqdm import tqdm
import random
from dataclasses import dataclass, field

tqdm.pandas()

import datasets
from datasets import load_dataset, load_metric, load_from_disk, DatasetDict, Dataset

import torch
import torch.nn as nn

from transformers.data.data_collator import InputDataClass
from transformers.modeling_outputs import QuestionAnsweringModelOutput
from transformers.trainer_utils import PredictionOutput

from nlpaug import Augmenter

from transformers import (
    CanineForQuestionAnswering,
    CanineTokenizer,
    RobertaTokenizerFast,
    default_data_collator,
    BertTokenizerFast,
    BertForQuestionAnswering,
    RobertaForQuestionAnswering,
    DistilBertTokenizerFast,
    DistilBertForQuestionAnswering,
    XLMRobertaTokenizerFast,
    CanineConfig,
    CanineModel,
    PretrainedConfig,
    BatchEncoding,
    IntervalStrategy,
    SchedulerType,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
    PreTrainedTokenizer,
    HfArgumentParser,
)

from question_answering import (
    DataArguments,
    DatasetCharacterBasedTokenizer,
    DatasetTokenBasedTokenizer,
    Preprocessor,
    TrainerArguments,
    CharacterBasedModelTrainer,
    TokenBasedModelTrainer,
    remove_examples_longer_than_threshold,
    to_pandas,
    remove_answer_end,
    set_seed,
    Model,
)


## Caveats

The code developed for the Question Answering task was mainly designed to run on a remote server rather than in a Jupyter notebook, which is why this notebook mostly consists of bash commands. For more details on the code itself, we strongly advise you to look at the package and the corresponding `README.md`.

However, we show some core classes in the following cells so that you can get a better feeling of what is going on.

## Parts of the Python package

#### main.py script

In [None]:
SEED = 0
set_seed(SEED)

CANINE_S_MODEL = "canine-s"
CANINE_C_MODEL = "canine-c"
BERT_MODEL = "bert"
MBERT_MODEL = "mbert"
XLM_ROBERTA_MODEL = "xlm_roberta"
ROBERTA_MODEL = "roberta"
DISTILBERT_MODEL = "distilbert"

SQUAD_V2_DATASET_NAME = "squad_v2"
SQUAD_DATASET_NAME = "squad"
XQUAD_DATASET_NAME = "xquad"
NOISY_DATASET_NAME = "noisy"
DYNABENCH_DATASET_NAME = "dynabench/qa"

logger = logging.getLogger(__name__)


def train_model(
        model_name: str,
        learning_rate: float,
        weight_decay: float,
        type_lr_scheduler: SchedulerType,
        warmup_ratio: float,
        save_strategy: IntervalStrategy,
        save_steps: int,
        num_epochs: int,
        early_stopping_patience: int,
        output_dir: str,
        device: str,
        dataset_name: str,
        batch_size: int,
        max_length: int,
        doc_stride: int,
        n_best_size: int,
        max_answer_length: int,
        squad_v2: bool,
        eval_only: bool,
        path_to_finetuned_model: str,
        dir_data_noisy: str,
        xquad_subdataset_name: str,
        few_shot_learning: bool,
) -> None:
    logger.info(f"Loading dataset {dataset_name}")
    if dataset_name == "xquad":
        datasets = load_dataset(dataset_name, xquad_subdataset_name)
    elif dataset_name == "noisy":
        datasets = load_from_disk(dir_data_noisy)
    elif dataset_name == "dynabench/qa":
        datasets = load_dataset(dataset_name, "dynabench.qa.r1.dbert")
    else:
        datasets = load_dataset(dataset_name)

    if few_shot_learning:
        logger.info("Keeping 500 random examples out of the first 1000 training examples")
        random_indices = random.sample(range(1000), 500)
        datasets["train"] = datasets["train"].select(random_indices)

    logger.info("Adding end_answers to each question")
    preprocessor = Preprocessor(datasets)
    datasets = preprocessor.preprocess()

    logger.info(f"Preparing for model {model_name}")
    if model_name in [CANINE_C_MODEL, CANINE_S_MODEL]:
        df_train = to_pandas(datasets["train"])
        df_validation = to_pandas(datasets["validation"])

        logger.info(f"Removing examples longer than threshold for model {model_name}")
        df_train = remove_examples_longer_than_threshold(
            df_train, max_length=max_length * 2, doc_stride=doc_stride
        )
        df_validation = remove_examples_longer_than_threshold(
            df_validation, max_length=max_length * 2, doc_stride=doc_stride
        )
        logger.info("Done removing examples longer than threshold")

        datasets["train"] = Dataset.from_pandas(df_train)
        datasets["validation"] = Dataset.from_pandas(df_validation)

        del df_train, df_validation

        pretrained_model_name = f"google/{model_name}"
        tokenizer = CanineTokenizer.from_pretrained(pretrained_model_name)
        model = CanineForQuestionAnswering.from_pretrained(pretrained_model_name)

        tokenizer_dataset_train = DatasetCharacterBasedTokenizer(
            tokenizer,
            max_length,
            doc_stride,
            train=True,
            squad_v2=squad_v2,
            language="en",
        )
        tokenizer_dataset_val = DatasetCharacterBasedTokenizer(
            tokenizer,
            max_length,
            doc_stride,
            train=False,
            squad_v2=squad_v2,
            language="en",
        )
    else:
        if model_name == BERT_MODEL:
            pretrained_model_name = "bert-base-uncased"
            tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name)
            model = BertForQuestionAnswering.from_pretrained(pretrained_model_name)

        elif model_name == MBERT_MODEL:
            pretrained_model_name = "bert-base-multilingual-cased"
            tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name)
            model = BertForQuestionAnswering.from_pretrained(pretrained_model_name)

        elif model_name == XLM_ROBERTA_MODEL:
            pretrained_model_name = "xlm-roberta-base"
            tokenizer = XLMRobertaTokenizerFast.from_pretrained(pretrained_model_name)
            model = RobertaForQuestionAnswering.from_pretrained(pretrained_model_name)

        elif model_name == ROBERTA_MODEL:
            pretrained_model_name = "roberta-base"
            tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_model_name)
            model = RobertaForQuestionAnswering.from_pretrained(pretrained_model_name)

        elif model_name == DISTILBERT_MODEL:
            pretrained_model_name = "distilbert-base-uncased"
            tokenizer = DistilBertTokenizerFast.from_pretrained(pretrained_model_name)
            model = DistilBertForQuestionAnswering.from_pretrained(
                pretrained_model_name
            )

        else:
            raise NotImplementedError

        tokenizer_dataset_train = DatasetTokenBasedTokenizer(
            tokenizer, max_length, doc_stride, train=True
        )
        tokenizer_dataset_val = DatasetTokenBasedTokenizer(
            tokenizer, max_length, doc_stride, train=False
        )

    tokenized_datasets = datasets.map(
        tokenizer_dataset_train.tokenize,
        batched=True,
        remove_columns=datasets["validation"].column_names,
    )

    validation_features = datasets["validation"].map(
        tokenizer_dataset_val.tokenize,
        batched=True,
        remove_columns=datasets["validation"].column_names,
    )

    data_collator = default_data_collator
    metric = load_metric("squad_v2" if squad_v2 else "squad")

    if eval_only or few_shot_learning:
        logger.info("Loading own finetuned model")
        model.load_state_dict(torch.load(path_to_finetuned_model, map_location=device))

    trainer_args = TrainerArguments(
        model=model,
        learning_rate=learning_rate,
        lr_scheduler=type_lr_scheduler,
        warmup_ratio=warmup_ratio,
        save_strategy=save_strategy,
        save_steps=save_steps,
        epochs=num_epochs,
        output_dir=output_dir,
        metric=metric,
        evaluation_strategy=save_strategy,
        weight_decay=weight_decay,
        data_collator=data_collator,
        model_save_path=os.path.join(
            output_dir, f"{model_name}-finetuned", "best_model.pt"
        ),
        device=device,
        early_stopping_patience=early_stopping_patience,
        few_shot_learning=few_shot_learning,
    )
    data_args = DataArguments(
        datasets=datasets,
        dataset_name=dataset_name,
        validation_features=validation_features,
        batch_size=batch_size,
        tokenizer=tokenizer,
        n_best_size=n_best_size,
        max_answer_length=max_answer_length,
        tokenized_datasets=tokenized_datasets,
        squad_v2=squad_v2,
    )

    if model_name in [CANINE_C_MODEL, CANINE_S_MODEL]:
        trainer = CharacterBasedModelTrainer(trainer_args, data_args, model_name)
    elif model_name in [
        BERT_MODEL,
        MBERT_MODEL,
        XLM_ROBERTA_MODEL,
        ROBERTA_MODEL,
        DISTILBERT_MODEL,
    ]:
        trainer = TokenBasedModelTrainer(trainer_args, data_args, model_name)
    else:
        raise NotImplementedError

    # check if we are in eval mode only or not
    if not eval_only:
        logger.info("START TRAINING")
        trainer.train()

    logger.info("START FINAL EVALUATION")
    f1, exact_match = trainer.evaluate(mode="val")
    logger.info(f"Obtained F1-score: {f1}, Obtained Exact Match: {exact_match}")

    if not eval_only:
        # Save best model
        trainer.save_model()


In [None]:
%%script false --no-raise-error

if __name__ == "__main__":
    debug = False
    logging.basicConfig(level=logging.INFO)
    logging.getLogger("datasets.arrow_dataset").setLevel(logging.ERROR)
    if debug:
        logger.getChild("question_answering.DatasetCharacterBasedTokenizer").setLevel(
            logging.DEBUG
        )

    parser = argparse.ArgumentParser(
        description="Parser for training and data arguments"
    )

    parser.add_argument(
        "--model_name",
        type=str,
        help="Name of the model",
        choices=[
            MBERT_MODEL,
            BERT_MODEL,
            CANINE_S_MODEL,
            CANINE_C_MODEL,
            ROBERTA_MODEL,
            XLM_ROBERTA_MODEL,
            DISTILBERT_MODEL,
        ],
        required=True,
    )
    parser.add_argument(
        "--learning_rate",
        type=float,
        required=True,
        help="Chosen learning rate for AdamW optimizer",
    )
    parser.add_argument(
        "--weight_decay",
        type=float,
        required=True,
        help="Chosen weight decay for AdamW optimizer",
    )
    parser.add_argument(
        "--type_lr_scheduler", type=str, required=True, help="Type of LR scheduler"
    )
    parser.add_argument(
        "--warmup_ratio", type=float, required=True, help="Warmup ratio"
    )
    parser.add_argument(
        "--save_strategy",
        type=str,
        required=True,
        help="Save strategy",
        choices=["steps", "epoch"],
    )
    parser.add_argument(
        "--save_steps",
        type=int,
        required=True,
        help="Number of steps to perform before saving model",
    )
    parser.add_argument(
        "--num_epochs", type=int, required=True, help="Number of epochs to train for"
    )
    parser.add_argument(
        "--early_stopping_patience",
        type=int,
        required=True,
        help="Patience for early stopping, validation loss is monitored",
    )
    parser.add_argument(
        "--output_dir", type=str, required=True, help="Directory to store the model"
    )
    parser.add_argument(
        "--device",
        type=str,
        required=True,
        help="Device to run the code on, either cpu or cuda",
    )
    parser.add_argument(
        "--dataset_name",
        type=str,
        default="squad_v2",
        choices=[
            SQUAD_V2_DATASET_NAME,
            SQUAD_DATASET_NAME,
            XQUAD_DATASET_NAME,
            NOISY_DATASET_NAME,
            DYNABENCH_DATASET_NAME,
        ],
        required=True,
        help="Name of the dataset to train/evaluate on",
    )
    parser.add_argument(
        "--xquad_subdataset_name",
        type=str,
        default="xquad.en",
        help="Config name for XQuAD dataset",
    )
    parser.add_argument(
        "--batch_size",
        type=int,
        required=True,
        help="Batch size for training and evaluation",
    )
    parser.add_argument(
        "--max_length",
        type=int,
        required=True,
        help="The maximum length of a feature (question and context)",
    )
    parser.add_argument(
        "--doc_stride",
        type=int,
        required=True,
        help="The authorized overlap between two parts of the context when splitting is needed.",
    )
    parser.add_argument(
        "--n_best_size",
        type=int,
        required=True,
        help="Number of best logits scores to consider",
    )
    parser.add_argument(
        "--max_answer_length",
        type=int,
        required=True,
        help="Maximum length of an answer",
    )
    parser.add_argument("--squad_v2", type=bool, default=False)
    parser.add_argument("--eval_only", type=bool, default=False)
    parser.add_argument(
        "--path_to_finetuned_model",
        type=str,
        default=None,
        help="Path towards a previously finetuned model",
    )
    parser.add_argument(
        "--dir_data_noisy",
        type=str,
        default=None,
        help="Path towards noisy data, used only if `dataset_name` is set to noisy",
    )
    parser.add_argument(
        "--few_shot_learning",
        type=bool,
        default=False,
        help="Set to True to do few-shot learning",
    )

    args = parser.parse_args()

    train_model(
        model_name=args.model_name,
        learning_rate=args.learning_rate,
        weight_decay=args.weight_decay,
        type_lr_scheduler=args.type_lr_scheduler,
        warmup_ratio=args.warmup_ratio,
        save_strategy=args.save_strategy,
        save_steps=args.save_steps,
        num_epochs=args.num_epochs,
        early_stopping_patience=args.early_stopping_patience,
        output_dir=args.output_dir,
        device=args.device,
        dataset_name=args.dataset_name,
        batch_size=args.batch_size,
        max_length=args.max_length,
        doc_stride=args.doc_stride,
        n_best_size=args.n_best_size,
        max_answer_length=args.max_answer_length,
        squad_v2=args.squad_v2,
        eval_only=args.eval_only,
        path_to_finetuned_model=args.path_to_finetuned_model,
        dir_data_noisy=args.dir_data_noisy,
        xquad_subdataset_name=args.xquad_subdataset_name,
        few_shot_learning=args.few_shot_learning,
    )
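
One caveat about the `type=bool` flags in the parser above (`--squad_v2`, `--eval_only`, `--few_shot_learning`): `argparse` calls `bool()` on the raw string, so any non-empty value, including `"False"`, is parsed as `True`. The sketch below reproduces the pitfall and shows one possible workaround; `str2bool` is a hypothetical helper that is not part of the package.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse common textual booleans (hypothetical helper, not in the package)."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes", "y"}:
        return True
    if lowered in {"false", "0", "no", "n"}:
        return False
    raise argparse.ArgumentTypeError(f"Expected a boolean, got {value!r}")


parser = argparse.ArgumentParser()
parser.add_argument("--naive", type=bool, default=False)      # bool("False") is True
parser.add_argument("--strict", type=str2bool, default=False)

args = parser.parse_args(["--naive", "False", "--strict", "False"])
print(args.naive, args.strict)  # True False
```

With the parser as written, the safe way to disable one of these flags is simply to omit it, so that the `False` default applies.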


#### CANINE model

In [None]:
HuggingFaceModelT = Any
CANINE_C = "google/canine-c"
CANINE_S = "google/canine-s"


class Model(nn.Module):
    """Generic model for Question Answering Tasks"""

    def __init__(self, model: HuggingFaceModelT, config: PretrainedConfig):
        nn.Module.__init__(self)
        self.model = model
        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        return self.qa_outputs(outputs[0])

class CanineQA(Model):
    """CANINE model for Question Answering Tasks"""

    def __init__(self, pretrained_model_name: str) -> None:
        # `pretrained_model_name` is expected to be either CANINE_C or CANINE_S
        config = CanineConfig()
        canine = CanineModel.from_pretrained(
            pretrained_model_name, add_pooling_layer=False
        )
        Model.__init__(self, canine, config)


#### Token-based Dataset Tokenizer

Class that tokenizes the dataset for models relying on token-based (subword) tokenization and builds the inputs expected by the model.

In [None]:
class DatasetTokenBasedTokenizer:
    def __init__(
            self,
            tokenizer: PreTrainedTokenizer,
            max_length: int,
            doc_stride: int,
            train: bool,
    ):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.doc_stride = doc_stride
        self.pad_on_right = tokenizer.padding_side == "right"
        self.train = train

    def _tokenize_train_data(self, data: Dataset) -> BatchEncoding:
        # INSPIRED BY:
        # https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/run_qa.py

        # Some of the questions have lots of whitespace on the left, which is not useful and will make the
        # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
        # left whitespace.
        question_column_name = "question"
        context_column_name = "context"
        answer_column_name = "answers"

        data[question_column_name] = [q.lstrip() for q in data[question_column_name]]

        # Tokenize our data with truncation and maybe padding, but keep the overflows using a stride. This results
        # in one example possibly giving several features when a context is long, each of those features having a
        # context that overlaps a bit with the context of the previous feature.
        tokenized_examples = self.tokenizer(
            data[question_column_name if self.pad_on_right else context_column_name],
            data[context_column_name if self.pad_on_right else question_column_name],
            truncation="only_second" if self.pad_on_right else "only_first",
            max_length=self.max_length,
            stride=self.doc_stride,
            return_overflowing_tokens=True,
            return_offsets_mapping=True,
            padding="max_length",
        )

        # Since one example might give us several features if it has a long context, we need a map from a feature to
        # its corresponding example. This key gives us just that.
        sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
        # The offset mappings will give us a map from token to character position in the original context. This will
        # help us compute the start_positions and end_positions.
        offset_mapping = tokenized_examples.pop("offset_mapping")

        tokenized_examples["start_positions"] = []
        tokenized_examples["end_positions"] = []

        for i, offsets in enumerate(offset_mapping):
            # We will label impossible answers with the index of the CLS token.
            input_ids = tokenized_examples["input_ids"][i]
            cls_index = input_ids.index(self.tokenizer.cls_token_id)

            # Grab the sequence corresponding to that example (to know what is the context and what is the question).
            sequence_ids = tokenized_examples.sequence_ids(i)

            # One example can give several spans, this is the index of the example containing this span of text.
            sample_index = sample_mapping[i]
            answers = data[answer_column_name][sample_index]

            # If no answers are given, set the cls_index as answer.
            if len(answers["answer_start"]) == 0:
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Start/end character index of the answer in the text.
                start_char = answers["answer_start"][0]
                end_char = start_char + len(answers["text"][0])

                # Start token index of the current span in the text.
                token_start_index = 0
                while sequence_ids[token_start_index] != (
                        1 if self.pad_on_right else 0
                ):
                    token_start_index += 1

                # End token index of the current span in the text.
                token_end_index = len(input_ids) - 1
                while sequence_ids[token_end_index] != (1 if self.pad_on_right else 0):
                    token_end_index -= 1

                # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
                if not (
                        offsets[token_start_index][0] <= start_char
                        and offsets[token_end_index][1] >= end_char
                ):
                    tokenized_examples["start_positions"].append(cls_index)
                    tokenized_examples["end_positions"].append(cls_index)
                else:
                    # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                    # Note: we could go after the last offset if the answer is the last word (edge case).
                    while (
                            token_start_index < len(offsets)
                            and offsets[token_start_index][0] <= start_char
                    ):
                        token_start_index += 1
                    tokenized_examples["start_positions"].append(token_start_index - 1)
                    while offsets[token_end_index][1] >= end_char:
                        token_end_index -= 1
                    tokenized_examples["end_positions"].append(token_end_index + 1)

        return tokenized_examples

    def _tokenize_val_data(self, data: Dataset) -> BatchEncoding:
        # INSPIRED BY:
        # https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/run_qa.py

        # Some of the questions have lots of whitespace on the left, which is not useful and will make the
        # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
        # left whitespace.
        question_column_name = "question"
        context_column_name = "context"

        data[question_column_name] = [q.lstrip() for q in data[question_column_name]]

        # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
        # in one example possibly giving several features when a context is long, each of those features having a
        # context that overlaps a bit with the context of the previous feature.
        tokenized_examples = self.tokenizer(
            data[question_column_name if self.pad_on_right else context_column_name],
            data[context_column_name if self.pad_on_right else question_column_name],
            truncation="only_second" if self.pad_on_right else "only_first",
            max_length=self.max_length,
            stride=self.doc_stride,
            return_overflowing_tokens=True,
            return_offsets_mapping=True,
            padding="max_length",
        )

        # Since one example might give us several features if it has a long context, we need a map from a feature to
        # its corresponding example. This key gives us just that.
        sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

        # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the
        # corresponding example_id and we will store the offset mappings.
        tokenized_examples["example_id"] = []

        for i in range(len(tokenized_examples["input_ids"])):
            # Grab the sequence corresponding to that example (to know what is the context and what is the question).
            sequence_ids = tokenized_examples.sequence_ids(i)
            context_index = 1 if self.pad_on_right else 0

            # One example can give several spans, this is the index of the example containing this span of text.
            sample_index = sample_mapping[i]
            tokenized_examples["example_id"].append(data["id"][sample_index])

            # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
            # position is part of the context or not.
            tokenized_examples["offset_mapping"][i] = [
                (o if sequence_ids[k] == context_index else None)
                for k, o in enumerate(tokenized_examples["offset_mapping"][i])
            ]

        return tokenized_examples

    def tokenize(self, data: Dataset) -> BatchEncoding:
        if self.train:
            return self._tokenize_train_data(data)
        else:
            return self._tokenize_val_data(data)
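
The tokenization above relies on `return_overflowing_tokens=True` together with `stride=self.doc_stride`, so a long context is cut into several overlapping features. The following is a minimal sketch of that windowing idea in plain Python. It is a simplification under stated assumptions: it ignores the question and the special tokens, and `sliding_windows` is a hypothetical helper, not part of the repository or of the tokenizer API.

```python
from typing import List, Tuple


def sliding_windows(n_tokens: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) spans covering n_tokens, consecutive spans sharing `overlap` tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        # Move forward, keeping `overlap` tokens in common with the previous span
        start += window - overlap
    return spans


# A 10-token context, windows of 4 tokens, 2 tokens of overlap
print(sliding_windows(10, window=4, overlap=2))
```

Each span plays the role of one feature; `overflow_to_sample_mapping` is what links such features back to the example they come from.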


#### Base Custom Trainer

In [None]:
QA_METRICS = Tuple[float, float]


@dataclass
class TrainerArguments:
    """
    Arguments needed to initiate a Trainer
    """

    model: nn.Module
    learning_rate: float
    lr_scheduler: SchedulerType
    warmup_ratio: float
    save_strategy: IntervalStrategy
    save_steps: int
    epochs: int
    output_dir: str
    metric: Any
    evaluation_strategy: IntervalStrategy
    weight_decay: float
    data_collator: Callable[[List[InputDataClass]], Dict[str, Any]]
    model_save_path: str
    device: str
    early_stopping_patience: int
    few_shot_learning: bool


@dataclass
class DataArguments:
    """
    Data arguments needed to initiate a Trainer
    """

    datasets: DatasetDict
    dataset_name: str
    validation_features: Dataset
    batch_size: int
    tokenizer: PreTrainedTokenizer
    n_best_size: int
    max_answer_length: int
    tokenized_datasets: DatasetDict
    squad_v2: bool


class CustomTrainer(ABC):
    """General Trainer signature"""

    logger = logging.getLogger(__name__)

    def __init__(
        self, trainer_args: TrainerArguments, data_args: DataArguments, model_name: str
    ) -> None:
        self.trainer_args = trainer_args
        self.data_args = data_args
        self.model_name = model_name

        # Define training arguments
        args = TrainingArguments(
            output_dir=os.path.join(
                self.trainer_args.output_dir, self.model_name + "-finetuned"
            ),
            evaluation_strategy=self.trainer_args.evaluation_strategy,
            learning_rate=self.trainer_args.learning_rate,
            weight_decay=self.trainer_args.weight_decay,
            num_train_epochs=self.trainer_args.epochs,
            lr_scheduler_type=self.trainer_args.lr_scheduler,
            warmup_ratio=self.trainer_args.warmup_ratio,
            per_device_train_batch_size=self.data_args.batch_size,
            per_device_eval_batch_size=self.data_args.batch_size,
            save_strategy=self.trainer_args.save_strategy,
            save_steps=self.trainer_args.save_steps,
            push_to_hub=False,
            metric_for_best_model="eval_loss",
            load_best_model_at_end=True,
            logging_steps=self.trainer_args.save_steps,
            no_cuda=self.trainer_args.device != "cuda",
        )

        # Initiate Hugging Face Trainer
        if self.data_args.dataset_name == "xquad":
            train_dataset = self.data_args.tokenized_datasets["validation"]
        else:
            train_dataset = self.data_args.tokenized_datasets["train"]

        if self.trainer_args.few_shot_learning:
            callbacks = None
        else:
            callbacks = [
                EarlyStoppingCallback(
                    early_stopping_patience=self.trainer_args.early_stopping_patience
                )
            ]

        self.trainer = Trainer(
            self.trainer_args.model,
            args,
            train_dataset=train_dataset,
            eval_dataset=self.data_args.tokenized_datasets["validation"],
            data_collator=self.trainer_args.data_collator,
            tokenizer=self.data_args.tokenizer,
            callbacks=callbacks,
        )

    @abstractmethod
    def _postprocess_qa_predictions(
        self,
        data: Dataset,
        features: Dataset,
        raw_predictions: Union[PredictionOutput, QuestionAnsweringModelOutput],
    ) -> OrderedDict:
        raise NotImplementedError

    def train(self) -> None:
        self.logger.info("Start training")
        self.trainer.train()
        self.logger.info("Training done")

    def evaluate(self, mode: str, features: Optional[Dataset] = None) -> QA_METRICS:
        if mode == "val":
            _features = self.data_args.validation_features
        elif mode == "test":
            _features = features
        else:
            raise ValueError(
                "Mode should either be val or test. If val, the model will be evaluated on validation features "
                "defined in the DataArguments. If test, one must provide a Dataset of features in the correct "
                "format."
            )
        self.logger.info("Predicting on eval/test dataset")
        raw_predictions = self.trainer.predict(_features)
        self.data_args.validation_features.set_format(
            type=self.data_args.validation_features.format["type"],
            columns=list(self.data_args.validation_features.features.keys()),
        )
        self.logger.info("Postprocessing QA predictions")
        final_predictions = self._postprocess_qa_predictions(
            self.data_args.datasets["validation"],
            self.data_args.validation_features,
            raw_predictions.predictions,
        )
        self.logger.info("Computing metrics")
        results = self._compute_metrics(
            self.trainer_args.metric,
            self.data_args.datasets["validation"],
            final_predictions,
            self.data_args.squad_v2,
        )
        if self.data_args.squad_v2:
            return results["f1"], results["exact"]
        else:
            return results["f1"], results["exact_match"]

    def save_model(self) -> None:
        self.logger.info(
            f"Saving best trained model at {self.trainer_args.model_save_path}"
        )
        torch.save(self.trainer.model.state_dict(), self.trainer_args.model_save_path)

    @staticmethod
    def _compute_metrics(
        metric: Any,
        data: Dataset,
        predictions: OrderedDict,
        squad_v2: bool = True,
    ) -> Dict[str, float]:
        data = data.map(remove_answer_end, batched=True)
        if squad_v2:
            formatted_predictions = [
                {"id": k, "prediction_text": v, "no_answer_probability": 0.0}
                for k, v in predictions.items()
            ]
        else:
            formatted_predictions = [
                {"id": k, "prediction_text": v} for k, v in predictions.items()
            ]

        references = [{"id": ex["id"], "answers": ex["answers"]} for ex in data]

        return metric.compute(predictions=formatted_predictions, references=references)
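
`evaluate` returns an F1-score and an exact-match score, both computed by `metric.compute`. To make these two numbers more concrete, here is a simplified sketch of SQuAD-style scoring for a single prediction/reference pair. It is an illustration only: the helper names are hypothetical, and the actual metric handles details omitted here (several reference answers, unanswerable questions in SQuADv2).

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # If either side is empty, the score is 1 only when both are empty
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts the tokens shared by both answers
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Four", "four."))
print(f1_score("over half", "over half of the planet"))
```

F1 rewards partial overlap between the predicted span and the reference, while exact match only rewards identical normalized strings.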


#### CLI commands to finetune models on SQuADv2 

The following examples are given for the RoBERTa and CANINE-C models, but the same commands also apply to:

- BERT
- mBERT
- DistilBERT
- XLM-RoBERTa
- CANINE-S

In [None]:
%%script false --no-raise-error

!python3 main.py \
    --model_name roberta \
    --learning_rate 2e-5 \
    --weight_decay 0.0001 \
    --type_lr_scheduler cosine \
    --warmup_ratio 0.1 \
    --save_strategy steps \
    --save_steps 5000 \
    --num_epochs 5 \
    --early_stopping_patience 3 \
    --output_dir /content/drive/MyDrive/models/ \
    --device cuda \
    --batch_size 12 \
    --max_length 348 \
    --doc_stride 128 \
    --n_best_size 20 \
    --max_answer_length 30 \
    --squad_v2 True \
    --dataset_name squad_v2

In [None]:
%%script false --no-raise-error

!python3 main.py \
    --model_name canine-c \
    --learning_rate 5e-5 \
    --weight_decay 0.001 \
    --type_lr_scheduler linear \
    --warmup_ratio 0.1 \
    --save_strategy steps \
    --save_steps 5000 \
    --num_epochs 5 \
    --early_stopping_patience 3 \
    --output_dir /content/drive/MyDrive/models/ \
    --device cuda \
    --batch_size 6 \
    --max_length 2048 \
    --doc_stride 512 \
    --n_best_size 20 \
    --max_answer_length 256 \
    --squad_v2 True \
    --dataset_name squad_v2
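
The `--type_lr_scheduler` and `--warmup_ratio` flags end up in `lr_scheduler_type` and `warmup_ratio` of the `TrainingArguments` built in `CustomTrainer`. As a rough illustration of what a linear schedule with warmup does to the learning rate, here is a sketch in plain Python. It is an approximation written for this notebook, not the library implementation, and `linear_schedule_with_warmup` is a hypothetical name.

```python
import math


def linear_schedule_with_warmup(step: int, total_steps: int, warmup_ratio: float) -> float:
    """Multiplier applied to the base learning rate at a given optimization step."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear increase from 0 to 1 during warmup
        return step / max(1, warmup_steps)
    # Linear decrease from 1 to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 5e-5
for step in (0, 50, 100, 500, 1000):
    print(step, base_lr * linear_schedule_with_warmup(step, total_steps=1000, warmup_ratio=0.1))
```

With `--warmup_ratio 0`, the warmup phase disappears and the learning rate decays linearly from its initial value.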


#### CLI commands to do zero-shot transfer learning on multilingual XQuAD dataset

Change the `--xquad_subdataset_name` parameter to select the evaluation language (for instance `xquad.ar` for Arabic).
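
To run the zero-shot evaluation on every language, one option is to generate one command per sub-dataset. The sketch below only builds the command strings. The language list is an assumption (the 11 languages of the original XQuAD release) and should be checked against the configurations actually available; `xquad_eval_commands` and the model path are hypothetical, and the remaining flags are left out for brevity.

```python
from typing import List

# Assumption: language codes of the original XQuAD release
XQUAD_LANGUAGES = ["ar", "de", "el", "en", "es", "hi", "ru", "th", "tr", "vi", "zh"]


def xquad_eval_commands(model_name: str, finetuned_model_path: str) -> List[str]:
    """Build one evaluation command per XQuAD language (subset of the flags only)."""
    commands = []
    for lang in XQUAD_LANGUAGES:
        commands.append(
            "python3 main.py"
            f" --model_name {model_name}"
            " --dataset_name xquad"
            f" --xquad_subdataset_name xquad.{lang}"
            " --squad_v2 False"
            " --eval_only True"
            f" --path_to_finetuned_model {finetuned_model_path}"
        )
    return commands


# Placeholder path, to be replaced by the location of the finetuned weights
for command in xquad_eval_commands("canine-c", "path/to/best_model.pt"):
    print(command)
```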

In [None]:
#%%script false --no-raise-error

!python3 main.py \
    --model_name canine-c \
    --learning_rate 2e-5 \
    --weight_decay 0.01 \
    --type_lr_scheduler linear \
    --warmup_ratio 0 \
    --save_strategy steps \
    --save_steps 5000 \
    --num_epochs 5 \
    --early_stopping_patience 3 \
    --output_dir /content/drive/MyDrive/models/ \
    --device cuda \
    --batch_size 6 \
    --max_length 2048 \
    --doc_stride 512 \
    --n_best_size 20 \
    --max_answer_length 256 \
    --squad_v2 False \
    --dataset_name xquad \
    --xquad_subdataset_name xquad.ar \
    --eval_only True \
    --path_to_finetuned_model /content/drive/MyDrive/models/canine-s-finetuned/canine-c_best_model.pt

INFO:__main__:Loading dataset xquad
100% 1/1 [00:00<00:00, 561.49it/s]
INFO:__main__:Adding end_answers to each question
INFO:__main__:Preparing for model canine-c
INFO:__main__:Removing examples longer than threshold for model canine-c
100% 1190/1190 [00:01<00:00, 651.18it/s]
INFO:__main__:Done removing examples longer than threshold
Using unk_token, but it is not set yet.
Some weights of CanineForQuestionAnswering were not initialized from the model checkpoint at google/canine-c and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100% 2/2 [00:01<00:00,  1.11ba/s]
100% 2/2 [00:01<00:00

#### Code used to generate noisy datasets

The noise can be generated on any SQuAD-like dataset hosted on the Hugging Face Hub.

In [None]:
SEED = 0
set_seed(SEED)


@dataclass
class NoisifierArguments:
    """
    Arguments needed to noisify SQuADv2-like datasets.
    """

    dataset_name: str = field(
        default=None, metadata={"help": "Name of the SQuAD-like dataset"}
    )
    output_dir: str = field(
        default=None,
        metadata={
            "help": "Output directory, will be used to store noisy dataset in csv format"
        },
    )
    noise_level: float = field(
        default=None, metadata={"help": "Level of noise to apply on the dataset"}
    )
    augmenter_type: str = field(
        default=None,
        metadata={
            "help": "Type of Augmenter to use. Either: KeyboardAug, RandomCharAug, SpellingAug, BackTranslationAug "
                    "(de/en) or OcrAug"
        },
    )
    action: str = field(
        default=None,
        metadata={
            "help": "Type of action to apply if RandomCharAug was chosen. Either: swap, substitute, insert or delete."
        },
    )

    def __post_init__(self) -> None:
        if (self.augmenter_type == "RandomCharAug" and self.action is None) or (
                self.action is not None and self.augmenter_type != "RandomCharAug"
        ):
            raise ValueError(
                "If you set `augmenter_type` to RandomCharAug, please choose an `action`. "
                "If you've chosen an `action`, you must choose `augmenter_type`==RandomCharAug for it "
                "to work."
            )


class Noisifier:
    def __init__(
            self,
            datasets: datasets.dataset_dict.DatasetDict,
            level: float,
            type: str,
            action: Optional[str],
    ) -> None:
        self.datasets = datasets
        self.level = level
        self.type = type
        self.action = action

    def _get_augmenter(self) -> Augmenter:
        if self.type == "KeyboardAug":
            return nac.KeyboardAug()

        elif self.type == "RandomCharAug":
            return nac.RandomCharAug(action=self.action)

        elif self.type == "SpellingAug":
            return naw.SpellingAug()

        elif self.type == "OcrAug":
            return nac.OcrAug()

        elif self.type == "BackTranslationAug":
            return naw.BackTranslationAug(
                from_model_name="facebook/wmt19-en-de",
                to_model_name="facebook/wmt19-de-en",
            )

        else:
            raise NotImplementedError

    def _augment_question(
            self, row: datasets.arrow_dataset.Example
    ) -> datasets.arrow_dataset.Example:
        # Noise can only be applied to the question
        augmenter = self._get_augmenter()
        if random.random() < self.level:
            row["question"] = augmenter.augment(row["question"])
        return row

    def augment(self):
        if (
                "validation" in self.datasets.column_names
                and "train" not in self.datasets.column_names
        ):
            self.datasets["validation"] = self.datasets["validation"].map(
                self._augment_question
            )
        elif (
                "train" in self.datasets.column_names
                and "validation" in self.datasets.column_names
        ):
            self.datasets["train"] = self.datasets["train"].map(self._augment_question)
            self.datasets["validation"] = self.datasets["validation"].map(
                self._augment_question
            )
        elif all(
                column in self.datasets.column_names
                for column in ("id", "context", "question", "answers")
        ):
            self.datasets = self.datasets.map(self._augment_question)
        else:
            raise NotImplementedError
        return self.datasets
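
The `Noisifier` above perturbs each question with probability `level`, delegating the perturbation itself to an augmenter. The sketch below reproduces that logic with a hand-written character swap, so that it runs without any external dependency. It is a stand-in for illustration: `swap_adjacent_chars` and `noisify_questions` are hypothetical helpers and do not reproduce the behaviour of the augmenters used in the experiments.

```python
import random
from typing import List


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Swap one randomly chosen pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def noisify_questions(questions: List[str], level: float, seed: int = 0) -> List[str]:
    """Perturb each question with probability `level`; contexts and answers are never touched."""
    rng = random.Random(seed)
    noisy = []
    for question in questions:
        if rng.random() < level:
            question = swap_adjacent_chars(question, rng)
        noisy.append(question)
    return noisy


questions = ["How many states have 'Amazonas' in their name?"] * 5
print(noisify_questions(questions, level=0.4))
```

Only the question is perturbed, so the gold answer spans in the context stay valid.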


In [None]:
%%script false --no-raise-error

if __name__ == "__main__":
    parser = HfArgumentParser(NoisifierArguments)
    args = parser.parse_args_into_dataclasses()[0]
    # Use a distinct variable name so that the `datasets` module is not shadowed
    raw_datasets = datasets.load_dataset(args.dataset_name)

    noisifier = Noisifier(
        datasets=raw_datasets,
        level=args.noise_level,
        type=args.augmenter_type,
        action=args.action,
    )

    new_datasets = noisifier.augment()

    # Save the noisy dataset dict to disk
    print("Saving noisy dataset dict")
    new_datasets.save_to_disk(args.output_dir)
    print(new_datasets)

    # Reload it to check that the saved data can be read back
    print("Loading noisy dataset dict")
    reloaded_datasets = datasets.load_from_disk(args.output_dir)
    print(reloaded_datasets)


#### CLI commands to evaluate CANINE-C on noisy data, when $p=10$\%

In [None]:
#%%script false --no-raise-error

!python3 main.py \
    --model_name canine-c \
    --learning_rate 5e-5 \
    --weight_decay 0.001 \
    --type_lr_scheduler linear \
    --warmup_ratio 0.1 \
    --save_strategy steps \
    --save_steps 5000 \
    --num_epochs 5 \
    --early_stopping_patience 3 \
    --output_dir /content/drive/MyDrive/models/ \
    --device cuda \
    --batch_size 6 \
    --max_length 2048 \
    --doc_stride 128 \
    --n_best_size 20 \
    --max_answer_length 30 \
    --dataset_name noisy \
    --squad_v2 True \
    --eval_only True \
    --path_to_finetuned_model /content/drive/MyDrive/wip_models/canine-c-finetuned/canine-c_best_model.pt \
    --dir_data_noisy /content/drive/MyDrive/models/noisy_data_10

The output stream was truncated and only contains the last 5000 lines.

#### CLI commands to do few-shot learning with legal domain dataset CUAD

In [None]:
%%script false --no-raise-error

!python3 main.py \
    --model_name canine-c \
    --learning_rate 5e-5 \
    --weight_decay 0.001 \
    --type_lr_scheduler linear \
    --warmup_ratio 0.1 \
    --save_strategy steps \
    --save_steps 5000 \
    --num_epochs 3 \
    --output_dir /content/drive/MyDrive/models/ \
    --device cuda \
    --batch_size 6 \
    --max_length 2048 \
    --doc_stride 512 \
    --n_best_size 20 \
    --max_answer_length 256 \
    --dataset_name cuad \
    --path_to_finetuned_model /content/drive/MyDrive/models/canine-c-finetuned/best_model.pt \
    --few_shot_learning True \
    --squad_v2 True

#### CLI commands to do few-shot learning with adversarial QA data from dynabench/QA dataset

In [None]:
%%script false --no-raise-error

!python3 main.py \
    --model_name canine-c \
    --learning_rate 5e-5 \
    --weight_decay 0.001 \
    --type_lr_scheduler linear \
    --warmup_ratio 0.1 \
    --save_strategy steps \
    --save_steps 5000 \
    --num_epochs 3 \
    --output_dir /content/drive/MyDrive/models/ \
    --device cuda \
    --batch_size 6 \
    --max_length 2048 \
    --doc_stride 512 \
    --n_best_size 20 \
    --max_answer_length 256 \
    --dataset_name dynabench/qa \
    --path_to_finetuned_model /content/drive/MyDrive/models/canine-c-finetuned/best_model.pt \
    --few_shot_learning True \
    --squad_v2 True

## Showcase of the models' capabilities

In [None]:
context = (
    "The majority of the forest is contained within Brazil, with 60% of the rainforest, "
    "followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, "
    "Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four "
    "nations contain 'Amazonas' in their names. The Amazon represents over half of the planet's "
    "remaining rainforests, and comprises the largest and most biodiverse tract of tropical "
    "rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species."
)


def test_model(
    model: Model,
    path_to_finetuned_model: str,
    tokenizer: PreTrainedTokenizer,
    question: str,
    context: str,
    device: str,
) -> None:
    print()
    print("Question: ", question)
    print("Using initial model, not finetuned for prediction:")
    _helper_test(
        model,
        tokenizer,
        question,
        context,
    )

    print()
    print("Now using own finetuned model on SQuADv2")
    print("Loading finetuned model")
    model.load_state_dict(torch.load(path_to_finetuned_model, map_location=device))
    _helper_test(
        model,
        tokenizer,
        question,
        context,
    )


def _helper_test(
    model: Model,
    tokenizer: PreTrainedTokenizer,
    question: str,
    context: str,
) -> None:
    inputs = tokenizer(
        question,
        context,
        add_special_tokens=True,
        return_token_type_ids=True,
        return_tensors="pt",
    )

    with torch.no_grad():
        outputs = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask)

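    try:
        # Fast tokenizers expose sequence ids: None for special tokens, 0 for the question, 1 for the context.
        # Despite its name, this mask is True for context tokens only, i.e. the positions allowed in an answer.
        non_answer_tokens = [x if x in [0, 1] else 0 for x in inputs.sequence_ids()]
        non_answer_tokens = torch.tensor(non_answer_tokens, dtype=torch.bool)
    except ValueError:
        # Slow tokenizers (e.g. CanineTokenizer) have no sequence ids: fall back on token type ids,
        # which are 1 for the second segment, i.e. the context.
        non_answer_tokens = inputs.token_type_ids.bool()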

    potential_start = torch.where(
        non_answer_tokens,
        outputs.start_logits,
        torch.tensor(float("-inf"), dtype=torch.float),
    )
    potential_end = torch.where(
        non_answer_tokens,
        outputs.end_logits,
        torch.tensor(float("-inf"), dtype=torch.float),
    )

    potential_start = torch.softmax(potential_start, dim=1)
    potential_end = torch.softmax(potential_end, dim=1)

    answer_start = torch.argmax(potential_start)
    answer_end = torch.argmax(potential_end)

    score = (
        potential_start.squeeze()[answer_start] * potential_end.squeeze()[answer_end]
    )

    predicted_answer_tokens = inputs.input_ids[0, answer_start : answer_end + 1]
    predicted_answer = tokenizer.decode(predicted_answer_tokens)

    print(f"Model answered: {predicted_answer}")
    print(f"Confidence score is: {score}")


### CANINE

In [None]:
tokenizer = CanineTokenizer.from_pretrained("google/canine-s")
model = CanineForQuestionAnswering.from_pretrained("google/canine-s")
path_to_finetuned_model = "/content/drive/MyDrive/models/canine-s-finetuned/canine-s_best_model.pt"
device="cpu"

print("---------------------------------------------------------------------")
# First question 
question = "What proportion of rainforests does the Amazon represents ? "
test_model(
    model=model,
    path_to_finetuned_model=path_to_finetuned_model,
    tokenizer=tokenizer,
    question=question,
    context=context,
    device=device,
)

print("---------------------------------------------------------------------")
# Other type of question
question = "How many states have 'Amazonas' in their name ?"
model = CanineForQuestionAnswering.from_pretrained("google/canine-s")
test_model(
    model=model,
    path_to_finetuned_model=path_to_finetuned_model,
    tokenizer=tokenizer,
    question=question,
    context=context,
    device=device,
)

Using unk_token, but it is not set yet.


---------------------------------------------------------------------

Question:  What proportion of rainforests does the Amazon represents ? 
Using initial model, not finetuned for prediction:




Model answered: 
Confidence score is: 2.0890156520181336e-05

Now using own finetuned model on SQuADv2
Loading finetuned model
Model answered: over half 
Confidence score is: 0.38604098558425903
---------------------------------------------------------------------

Question:  How many states have 'Amazonas' in their name ?
Using initial model, not finetuned for prediction:
Model answered: largest and most biodiverse tract of tropical rainforest in the 
Confidence score is: 1.6213047274504788e-05

Now using own finetuned model on SQuADv2
Loading finetuned model
Model answered: four 
Confidence score is: 0.28279149532318115


The finetuned CANINE model answers both questions correctly, but with limited confidence (about 39\% and 28\%).

### RoBERTa

In [None]:
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForQuestionAnswering.from_pretrained("roberta-base")
path_to_finetuned_model = "/content/drive/MyDrive/models/roberta-finetuned/best_model.pt"
device="cpu"

print("---------------------------------------------------------------------")
# First question 
question = "What proportion of rainforests does the Amazon represents ? "
test_model(
    model=model,
    path_to_finetuned_model=path_to_finetuned_model,
    tokenizer=tokenizer,
    question=question,
    context=context,
    device=device,
)

print("---------------------------------------------------------------------")
# Other type of question
question = "How many states have 'Amazonas' in their name ?"
model = RobertaForQuestionAnswering.from_pretrained("roberta-base")
test_model(
    model=model,
    path_to_finetuned_model=path_to_finetuned_model,
    tokenizer=tokenizer,
    question=question,
    context=context,
    device=device,
)

---------------------------------------------------------------------

Question:  What proportion of rainforests does the Amazon represents ? 
Using initial model, not finetuned for prediction:
Model answered: %, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain 'Amazonas' in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest
Confidence score is: 0.00015832424105610698

Now using own finetuned model on SQuADv2
Loading finetuned model
Model answered:  over half of the planet's remaining rainforests
Confidence score is: 0.3993825614452362
---------------------------------------------------------------------

Question:  How many states have 'Amazonas' in their name ?
Using initial model, not finetuned for prediction:
Model answered: 
Confidence score is: 0.00018197613826487213

Now using own finetuned model on SQuADv2
Loading finetuned model
Mod

The finetuned RoBERTa model is right, but not very confident (< 40\%).

### BERT

In [None]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
path_to_finetuned_model = "/content/drive/MyDrive/models/bert-finetuned/best_model.pt"
device="cpu"

print("---------------------------------------------------------------------")
# First question 
question = "What proportion of rainforests does the Amazon represents ? "
test_model(
    model=model,
    path_to_finetuned_model=path_to_finetuned_model,
    tokenizer=tokenizer,
    question=question,
    context=context,
    device=device,
)

print("---------------------------------------------------------------------")
# Other type of question
question = "How many states have 'Amazonas' in their name ?"
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
test_model(
    model=model,
    path_to_finetuned_model=path_to_finetuned_model,
    tokenizer=tokenizer,
    question=question,
    context=context,
    device=device,
)

---------------------------------------------------------------------

Question:  What proportion of rainforests does the Amazon represents ? 
Using initial model, not finetuned for prediction:
Model answered: 
Confidence score is: 0.0003075493441428989

Now using own finetuned model on SQuADv2
Loading finetuned model
Model answered: 60 %
Confidence score is: 0.047700148075819016
---------------------------------------------------------------------

Question:  How many states have 'Amazonas' in their name ?
Using initial model, not finetuned for prediction:
Model answered: 
Confidence score is: 0.0002801779191941023

Now using own finetuned model on SQuADv2
Loading finetuned model
Model answered: four
Confidence score is: 0.19941359758377075


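The finetuned BERT is fooled by the first question: it answers "60 %", which is the share of the rainforest located in Brazil. It answers the second question correctly, but with low confidence (< 20\%).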

### DistilBERT

In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")
path_to_finetuned_model = "/content/drive/MyDrive/models/distilbert-finetuned/best_model.pt"
device="cpu"

print("---------------------------------------------------------------------")
# First question 
question = "What proportion of rainforests does the Amazon represents ? "
test_model(
    model=model,
    path_to_finetuned_model=path_to_finetuned_model,
    tokenizer=tokenizer,
    question=question,
    context=context,
    device=device,
)

print("---------------------------------------------------------------------")
# Other type of question
question = "How many states have 'Amazonas' in their name ?"
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")
test_model(
    model=model,
    path_to_finetuned_model=path_to_finetuned_model,
    tokenizer=tokenizer,
    question=question,
    context=context,
    device=device,
)

---------------------------------------------------------------------

Question:  What proportion of rainforests does the Amazon represents ? 
Using initial model, not finetuned for prediction:
Model answered: by peru with 13 %, colombia with 10 %, and with minor amounts in venezuela, ecuador, bolivia, guyana, suriname and french guiana. states or departments in four nations contain'amazonas'in their names. the amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the
Confidence score is: 0.00016032361600082368

Now using own finetuned model on SQuADv2
Loading finetuned model
Model answered: over half
Confidence score is: 0.6295425891876221
---------------------------------------------------------------------

Question:  How many states have 'Amazonas' in their name ?
Using initial model, not finetuned for prediction:
Model answered: , bolivia, guyana, suriname and french guiana. states or
Confid

DistilBERT is much more confident than the previous models (about 63\% on the first question). Its answer to the first question is also shorter than RoBERTa's, which is good since we look for the minimal answer span.

### mBERT

In [None]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForQuestionAnswering.from_pretrained("bert-base-multilingual-cased")
path_to_finetuned_model = "/content/drive/MyDrive/models/mbert-finetuned/best_model.pt"
device="cpu"

print("---------------------------------------------------------------------")
# First question 
question = "What proportion of rainforests does the Amazon represents ? "
test_model(
    model=model,
    path_to_finetuned_model=path_to_finetuned_model,
    tokenizer=tokenizer,
    question=question,
    context=context,
    device=device,
)

print("---------------------------------------------------------------------")
# Other type of question
question = "How many states have 'Amazonas' in their name ?"
model = BertForQuestionAnswering.from_pretrained("bert-base-multilingual-cased")
test_model(
    model=model,
    path_to_finetuned_model=path_to_finetuned_model,
    tokenizer=tokenizer,
    question=question,
    context=context,
    device=device,
)

---------------------------------------------------------------------

Question:  What proportion of rainforests does the Amazon represents ? 
Using initial model, not finetuned for prediction:
Model answered: 
Confidence score is: 0.00034199663787148893

Now using own finetuned model on SQuADv2
Loading finetuned model
Model answered: over half
Confidence score is: 0.22362005710601807
---------------------------------------------------------------------

Question:  How many states have 'Amazonas' in their name ?
Using initial model, not finetuned for prediction:
Model answered: in four nations contain'Amazonas'in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and
Confidence score is: 0.00024570024106651545

Now using own finetuned model on SQuADv2
Loading finetuned model
Model answered: four
Confidence score is: 0.13397370278835297


Both answers are correct, but the confidence is low.

### XLM-RoBERTa

In [None]:
# Pretrained (not yet finetuned) XLM-RoBERTa tokenizer and QA model
tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
model = RobertaForQuestionAnswering.from_pretrained("xlm-roberta-base")
# Checkpoint saved after finetuning on SQuADv2
path_to_finetuned_model = "/content/drive/MyDrive/models/xlm-roberta-finetuned/best_model.pt"
device = "cpu"

print("---------------------------------------------------------------------")
# First question 
question = "What proportion of rainforests does the Amazon represents ? "
test_model(
    model=model,
    path_to_finetuned_model=path_to_finetuned_model,
    tokenizer=tokenizer,
    question=question,
    context=context,
    device=device,
)

print("---------------------------------------------------------------------")
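# Other type of question
question = "How many states have 'Amazonas' in their name ?"
# Re-instantiate the pretrained model so that the "initial model" prediction starts from non-finetuned weights
model = RobertaForQuestionAnswering.from_pretrained("xlm-roberta-base")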
test_model(
    model=model,
    path_to_finetuned_model=path_to_finetuned_model,
    tokenizer=tokenizer,
    question=question,
    context=context,
    device=device,
)

---------------------------------------------------------------------

Question:  What proportion of rainforests does the Amazon represents ? 
Using initial model, not finetuned for prediction:
Model answered: s over half of the planet's remaining rainforests, and compris
Confidence score is: 9.027367195812985e-05

Now using own finetuned model on SQuADv2
Loading finetuned model
Model answered: over half
Confidence score is: 0.29767441749572754
---------------------------------------------------------------------

Question:  How many states have 'Amazonas' in their name ?
Using initial model, not finetuned for prediction:
Model answered: names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 1
Confidence score is: 8.510206680512056e-05

Now using own finetuned model on SQuADv2
Loading finetuned model
Model answered: 

Correct answers: low confidence for the first one, but high confidence for the second one.

## Results

### Extractive Question Answering on SQuADv2

Models were trained with the following parameters:

|             	| Batch size 	| Learning rate 	| Weight decay 	| Nb of epochs 	| Number of training examples 	| Number of validation examples 	| Max sequence length 	| Doc stride 	| Max answer length 	| LR scheduler 	| Warmup ratio 	|
|:-----------:	|:----------:	|:-------------:	|:-----------:	|:------------:	|:---------------------------:	|:-----------------------------:	|:-------------------:	|:----------:	|:-----------------:	|:------------:	|:------------:	|
|   RoBERTa   	|     12     	|      2e-5     	|     1e-4    	|       3      	|            131823           	|             12165             	|         348         	|     128    	|         30        	|    cosine    	|      0.1     	|
|     BERT    	|      8     	|      3e-5     	|      0      	|              	|            131754           	|             12134             	|         348         	|     128    	|         30        	|    linear    	|       0      	|
|  DistilBERT 	|      8     	|      3e-5     	|     1e-2    	|       2      	|            131754           	|             12134             	|         348         	|     128    	|         30        	|    linear    	|      0.1     	|
|    mBERT    	|      8     	|      2e-5     	|      0      	|       2      	|            132335           	|             12245             	|         348         	|     128    	|         30        	|    linear    	|       0      	|
| XLM-RoBERTa 	|      8     	|      3e-5     	|      0      	|       2      	|            133317           	|             12360             	|         348         	|     128    	|         30        	|    linear    	|       0      	|
|   CANINE-C  	|      4     	|      5e-5     	|     0.01    	|       3      	|            130303           	|             11861             	|         2048        	|     512    	|        256        	|    linear    	|      0.1     	|
|   CANINE-S  	|      4     	|      5e-5     	|    0.001    	|      2.5     	|            130303           	|             11861             	|         2048        	|     512    	|        256        	|    linear    	|      0.1     	|
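
The "Max sequence length" and "Doc stride" columns control how a context that does not fit in a single input is cut into several overlapping windows. The snippet below is only a minimal sketch of that idea, not the code used in this project. It assumes that the doc stride denotes the number of items shared by two consecutive windows (as the `stride` argument of Hugging Face tokenizers does), and it ignores the question and special tokens that also take up room in a real input. The function name `sliding_windows` is hypothetical.

```python
def sliding_windows(tokens, max_length, overlap):
    """Cut `tokens` into windows of at most `max_length` items.

    Two consecutive windows share `overlap` items (assumed meaning of "doc stride").
    """
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    step = max_length - overlap
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + max_length])
        # Stop once the current window reaches the end of the sequence.
        if start + max_length >= len(tokens):
            break
        start += step
    return windows


# Toy "context" of 1000 token ids, cut with the BERT-like settings of the table.
context_tokens = list(range(1000))
windows = sliding_windows(context_tokens, max_length=348, overlap=128)
print([len(window) for window in windows])
```

With such a scheme, one long question/context pair yields several training examples, one per window.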


Obtained results:


|                 	| **F1-score** 	| **EM score** 	|
|:---------------:	|:------------:	|:------------:	|
|     **BERT**    	|     76.74    	|     73.59    	|
|   **RoBERTa**   	|     82.02    	|     78.54    	|
|  **DistilBERT** 	|     67.81    	|     64.71    	|
|   **CANINE-C**  	|     74.1     	|     69.2     	|
|   **CANINE-S**  	|     72.5     	|     69.6     	|
|    **mBERT**    	|     77.51    	|     74.1     	|
| **XLM-RoBERTa** 	|     78.3     	|     75.12    	|

In this setting, CANINE performs decently well (especially CANINE-C, i.e. CANINE trained with Autoregressive Character Loss), even though all BERT-like models except DistilBERT perform better.
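
For readers unfamiliar with the two metrics, here is a minimal sketch of how a SQuAD-style Exact Match and token-level F1 can be computed for one prediction and one reference answer. It follows the usual normalization (lowercasing, removing punctuation and English articles) but is a simplified illustration, not the evaluation code used for the table above. Dataset-level scores are typically obtained by taking the best score over all reference answers of a question and averaging over questions.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if both answers are identical after normalization, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Unanswerable questions: the score is 1.0 only if both answers are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Over half", "over half"))
print(f1_score("over half of the planet", "over half"))
```

The second call illustrates why F1 is more lenient than EM: an answer span that is too long still gets partial credit.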


### Zero-shot transfer on multilingual data (XQuAD)


In this setting, CANINE does not perform very well. On average, CANINE-C scores about 20 F1 points lower than XLM-RoBERTa 
and about 10 F1 points lower than mBERT, even though we expected CANINE to perform better since it operates on characters 
and hence is free of the constraints of manually engineered tokenizers (which often do not work well for some languages, 
e.g. languages that do not use whitespace such as Thai or Chinese) and of a fixed vocabulary. The gap between XLM-RoBERTa 
and CANINE-C increases when evaluated on languages such as Vietnamese, Thai or Chinese. These languages are mostly 
isolating ones, i.e. languages with a morpheme-per-word ratio close to one and almost no inflectional morphology.

#### F1 scores:
|            | **CANINE-C** | **CANINE-S** | **mBERT-base** | **BERT-base** | **XLM-RoBERTa** |
|:----------:|:------------:|:------------:|:--------------:|:-------------:|:---------------:|
| English    | 78.77        | 79.03        | 83.59          | 82.3          | 82.8            |
| Arabic     | 43.78        | 29.74        | 54.09          | 11.76         | 62.48           |
| German     | 59.57        | 55.35        | 68.4           | 19.41         | 72.47           |
| Greek      | 46.93        | 30.82        | 56.47          | 10.21         | 70.93           |
| Spanish    | 60.47        | 59.48        | 72.84          | 19.72         | 75.18           |
| Hindi      | 35.21        | 30.93        | 51.06          | 11.07         | 62.1            |
| Russian    | 60.49        | 55.09        | 68.33          | 9.47          | 73.12           |
| Thai       | 37.28        | 31.2         | 27.63          | 10.04         | 65.21           |
| Turkish    | 31.09        | 23.83        | 44.62          | 16.76         | 65.34           |
| Vietnamese | 43.14        | 35.52        | 64.49          | 24.63         | 73.44           |
| Chinese    | 34.86        | 28.68        | 52.71          | 8.15          | 65.68           |
| Romanian   | 56.62        | 43.69        | 69.31          | 20.03         | 74.78           |
| Average    | 49.02        | 41.95        | 59.46          | 20.30         | 69.16           |
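
The per-language gaps between XLM-RoBERTa and CANINE-C discussed above can be recomputed directly from the F1 table. The snippet below is only an illustrative sketch: the two dictionaries copy the CANINE-C and XLM-RoBERTa columns, and the variable names are ours.

```python
# F1 scores copied from the table above (CANINE-C and XLM-RoBERTa columns).
canine_c = {
    "English": 78.77, "Arabic": 43.78, "German": 59.57, "Greek": 46.93,
    "Spanish": 60.47, "Hindi": 35.21, "Russian": 60.49, "Thai": 37.28,
    "Turkish": 31.09, "Vietnamese": 43.14, "Chinese": 34.86, "Romanian": 56.62,
}
xlm_roberta = {
    "English": 82.8, "Arabic": 62.48, "German": 72.47, "Greek": 70.93,
    "Spanish": 75.18, "Hindi": 62.1, "Russian": 73.12, "Thai": 65.21,
    "Turkish": 65.34, "Vietnamese": 73.44, "Chinese": 65.68, "Romanian": 74.78,
}

# Gap in F1 points for each language, printed from the largest to the smallest.
gaps = {lang: round(xlm_roberta[lang] - canine_c[lang], 2) for lang in canine_c}
for lang, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang:<10} {gap:5.2f}")
```

Sorting the gaps this way makes it easy to see which languages are hardest for CANINE-C relative to XLM-RoBERTa.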

#### Exact Match:
|            | **CANINE-C** | **CANINE-S** | **mBERT-base** | **BERT-base** | **XLM-RoBERTa** |
|:----------:|:------------:|:------------:|:--------------:|:-------------:|:---------------:|
| English    | 67.38        | 66.34        | 79.51          | 69.57         | 72.18           |
| Arabic     | 26.25        | 13.75        | 37.22          | 4             | 45.79           |
| German     | 43.16        | 38.27        | 50.84          | 4.9           | 55.21           |
| Greek      | 29.14        | 13.42        | 40.16          | 5.37          | 53.19           |
| Spanish    | 42.74        | 39.57        | 54.45          | 4.7           | 56.3            |
| Hindi      | 18.93        | 16.54        | 36.97          | 4.8           | 45.042          |
| Russian    | 43.48        | 35.65        | 52.1           | 4.62          | 55.54           |
| Thai       | 20.5         | 17.91        | 21.26          | 2.6           | 54.28           |
| Turkish    | 14.8         | 10.11        | 29.41          | 4.87          | 48.85           |
| Vietnamese | 25.17        | 19.65        | 45.21          | 7.64          | 54.02           |
| Chinese    | 21.36        | 20.2         | 42.26          | 3.1           | 55.63           |
| Romanian   | 39.98        | 26.5         | 54.62          | 6.21          | 61.26           |
| Average    | 32.74        | 26.49        | 45.33          | 10.20         | 53.19           |


### Noisy questions on SQuADv2

In this experiment, the goal is to evaluate the models' robustness to noise. To do so, we created 3 noisy versions of
the SQuADv2 dataset where the questions have been artificially corrupted with noise (in our case we chose ``RandomCharAug``
from the ``nlpaug`` library with action `substitute`, but 4 other types of noise have been developed in our package - refer 
to `noisifier/noisifier.py`).

Three levels of noise were chosen: 10\%, 20\% and 40\%. The noise level is the probability $p$ with which each word gets 
transformed into a misspelled version of it (see [nlpaug documentation](https://github.com/makcedward/nlpaug/blob/master/nlpaug/augmenter/char/random.py)
for more information).
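
To make the noise model concrete, here is a minimal pure-Python sketch of word-level character substitution. It is a simplified stand-in written for illustration, not the `nlpaug` implementation nor the code of `noisifier/noisifier.py`: each word is selected with probability $p$, and one of its characters is then replaced by a random lowercase letter. The function name `noisify_question` and the seed are hypothetical.

```python
import random
import string


def noisify_question(question, p, rng):
    """Substitute one random character in each word with probability `p`."""
    noisy_words = []
    for word in question.split():
        if rng.random() < p:
            idx = rng.randrange(len(word))
            # Pick a replacement letter that differs from the original character.
            candidates = [c for c in string.ascii_lowercase if c != word[idx].lower()]
            word = word[:idx] + rng.choice(candidates) + word[idx + 1:]
        noisy_words.append(word)
    return " ".join(noisy_words)


rng = random.Random(0)  # fixed seed (arbitrary) so that the noisy version is reproducible
question = "How many states have 'Amazonas' in their name ?"
for p in (0.1, 0.2, 0.4):
    print(p, noisify_question(question, p, rng))
```

Only the question is modified; the context and the reference answers stay untouched.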

The noise is **only** applied to the SQuADv2 test set, made of 1187 examples. We compared the 7 models we finetuned 
on the clean version of SQuADv2 (first experiment) on these 3 noisy datasets (one for each level of $p$). The following
table gathers the results (averaged over 3 runs):

|                 	| **F1 (noise 10%)** 	| **EM (noise 10%)** 	| **F1 (noise 20%)** 	| **EM (noise 20%)** 	| **F1 (noise 40%)** 	| **EM (noise 40%)** 	|
|:---------------:	|:------------------:	|:------------------:	|:------------------:	|:------------------:	|:------------------:	|:------------------:	|
|     **BERT**    	|        73.68       	|        70.79       	|        71.22       	|        68.55       	|        66.42       	|        63.74       	|
|   **RoBERTa**   	|        79.06       	|        75.87       	|        76.57       	|        73.56       	|        70.7        	|        68.18       	|
|  **DistilBERT** 	|        65.85       	|        63.05       	|        64.42       	|        61.92       	|        60.77       	|        58.78       	|
|    **mBERT**    	|        74          	|        70.75       	|        71.66       	|        68.46       	|        67.08       	|        64.74       	|
| **XLM-RoBERTa** 	|        74.54       	|        71.61       	|        72.68       	|        69.81       	|        67.12       	|        64.43       	|
|   **CANINE-C**  	|        69.64       	|        66.89       	|        67.88       	|        65.43       	|        66.03       	|        63.9        	|
|   **CANINE-S**  	|        72.25       	|        69.65       	|        70.3        	|        68.03       	|        67.18       	|        64.6        	|

Overall, RoBERTa is a very powerful model: it is the best at every noise level we tested. However, it is worth 
highlighting that once the noise level is high (i.e. 40\%), both CANINE-C and CANINE-S perform similarly to BERT-like 
models. CANINE-S even reaches a slightly better F1 score than mBERT and BERT. CANINE-S does seem to be fairly robust to 
high levels of artificial noise. 

Further experiments should be run with other types of noise to confirm these results.


### Few-shot learning and domain adaptation

The goal of this experiment is to measure the ability of CANINE (and other models) to transfer to unseen data in 
another domain. This could be done in either zero-shot or few-shot settings. Here we decided to go with the latter as it 
is more realistic: in real life, a company might already have a small custom database of manually labeled documents and 
associated questions, but would want to deploy a Question Answering system on its whole unlabeled database. The 
CUAD dataset is well suited for this task as it is highly specialized (legal domain, legal contract review). The training set 
is made of 22450 question/context pairs and the test set of 4182. We randomly selected 1\% of the training set (224 examples) 
and trained the models previously finetuned on SQuADv2 on it for 3 epochs. Then each model was evaluated on 656 test examples. 
Results are reported in the following table; to ensure a fair comparison, all models were trained and tested on the 
exact same examples. 
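
One simple way to guarantee that every model sees the exact same few-shot examples is to draw the subset once with a fixed random seed and reuse the resulting indices for all models. The snippet below is a minimal sketch of this idea under our own assumptions: the function name `sample_few_shot_indices` and the seed value are hypothetical, not taken from the project code.

```python
import random


def sample_few_shot_indices(n_examples, fraction, seed):
    """Return a reproducible, sorted list of distinct example indices."""
    n_selected = int(n_examples * fraction)
    rng = random.Random(seed)  # local generator: does not touch the global random state
    return sorted(rng.sample(range(n_examples), n_selected))


# 1% of the 22450 CUAD training pairs; the seed is an arbitrary choice.
train_indices = sample_few_shot_indices(22450, 0.01, seed=0)
print(len(train_indices))  # 224
```

The same list of indices can then be used to build the training subset for each model.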

|                 	| **F1 score** 	| **EM score** 	|
|:---------------:	|:------------:	|:------------:	|
|     **BERT**    	|     74.18    	|     72.72    	|
|   **RoBERTa**   	|     73.83    	|     72.24    	|
|  **DistilBERT** 	|     72.86    	|     71.37    	|
|    **mBERT**    	|     74.50    	|     73.12    	|
| **XLM-RoBERTa** 	|     76.64    	|     73.44    	|
|   **CANINE-C**  	|     72.51    	|     71.39    	|
|   **CANINE-S**  	|     72.27    	|     71.27    	|

### Few-shot learning and adversarial attacks

This last Question Answering-related experiment tests CANINE's ability to resist being fooled in adversarial settings. 
We decided to use the dynabench/QA dataset (BERT version). The latter is an adversarially collected Reading Comprehension 
dataset spanning multiple rounds of data collection. It has been designed to be challenging for SOTA NLP models. 

We took the models finetuned on SQuADv2, trained each of them for 3 epochs on 200 examples (2\%) extracted from the 
dynabench/qa training set, and then evaluated them on 600 test examples (60\% of the full test set). Our results are 
displayed in the following table. Again, to ensure a fair comparison, all models are trained on the exact same examples 
and evaluated on the same ones.

|                 	| **F1 score** 	| **EM score** 	|
|:---------------:	|:------------:	|:------------:	|
|     **BERT**    	|     38.13    	|     25.6     	|
|   **RoBERTa**   	|   **47.47**  	|     35.8     	|
|  **DistilBERT** 	|     32.64    	|     22.5     	|
|    **mBERT**    	|     38.43    	|     28.6     	|
| **XLM-RoBERTa** 	|     36.51    	|     27.6     	|
|   **CANINE-C**  	|     28.25    	|     18.6     	|
|   **CANINE-S**  	|     27.40    	|     17.2     	|

Finally, we observed that CANINE models are much more vulnerable to adversarial attacks (about 10 F1 points lower than 
data2vec and BERT). It is not yet clear to us why this is the case. It is likely related to the fact that CANINE is 
tokenization-free, but we still need to build intuition on why this has such a large impact on adversarial samples.

## Discussion

In our zero-shot transfer QA experiments, CANINE does not perform as well as token-based transformers such as 
mBERT. It might be because it was finetuned on English (an analytic language) and hence cannot adapt well in zero-shot 
transfer, especially to isolating languages (Thai, Chinese) and to synthetic ones with agglutinative (Turkish) or 
non-concatenative (Arabic) morphology. CANINE works decently well for languages close enough to English, e.g. Spanish or German. 
While mBERT and CANINE have both been pretrained on the 104 languages with the largest Wikipedias using an MLM objective, 
XLM-RoBERTa was pretrained on 2.5TB of filtered CommonCrawl data containing 100 languages. This might be a confounding 
variable. One might also note that multilingual models do, overall, have better generalization capacities and better 
scores on these Question Answering tasks. Finally, when artificial noise levels are high, CANINE-S is fairly robust and 
even slightly better than BERT and mBERT, which makes it preferable to BERT in this setting.