# CANINE & Sentiment Analysis

This Colab notebook showcases how to run the code for the Sentiment Analysis/Classification task developed in [this GitHub repository](https://github.com/chloeskt/nlp_ensae/tree/main). 

This project was carried out by Chloé SEKKAT (ENSAE & ENS Paris-Saclay) and Jocelyn BEAUMANOIR (ENSAE & ESSEC).  

## Description of experiments done

In this section, we compare the capabilities of CANINE with those of BERT-like models such as BERT, mBERT and XLM-RoBERTa 
on Sentiment Classification tasks. CANINE is a pre-trained, tokenization-free and vocabulary-free encoder that operates directly 
on character sequences without explicit tokenization. It seeks to generalize beyond the orthographic forms encountered 
during pre-training.
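
To make "operates directly on character sequences" concrete, here is a minimal sketch of the idea. It assumes only that each character is mapped to its Unicode code point; the helper name `to_code_points` is ours for illustration and is not part of any library or of the repository.

```python
from typing import List


def to_code_points(text: str) -> List[int]:
    """Map each character to its Unicode code point (no vocabulary needed)."""
    return [ord(char) for char in text]


# A misspelling still yields a valid input sequence: there is no
# out-of-vocabulary case, only a slightly different list of integers.
clean = to_code_points("great")
noisy = to_code_points("graet")
print(clean)  # [103, 114, 101, 97, 116]
print(noisy)  # [103, 114, 97, 101, 116]
```

A subword tokenizer, by contrast, may split the misspelled word into pieces that differ from those of the clean word.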

We evaluate its capabilities on sentence classification with binary labels (positive/negative) on the SST2 dataset. We 
chose this dataset because it is part of the GLUE benchmark and as such is a standard way of evaluating models on 
sentiment classification tasks. We monitor the accuracy obtained by our CANINE model and compare it to BERT, DistilBERT,
mBERT, RoBERTa and XLM-RoBERTa. Note that only mBERT, XLM-RoBERTa and CANINE are pretrained on multilingual data. mBERT and 
CANINE are pretrained on the same data, while XLM-RoBERTa was pretrained on 2.5TB of filtered CommonCrawl data covering 
100 languages.

A second experiment tests the ability of CANINE to handle noisy inputs such as keyboard errors, misspellings and 
grammar errors, which are very likely to occur in real-life settings. 
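
As a rough illustration of what a "keyboard error" perturbation can look like, the sketch below swaps neighbouring letters at random. It is a hypothetical stand-in written for this explanation (the function name, the `rate` parameter and the example sentence are ours); it is not the noise procedure used in the project, which relies on the `nlpaug` package installed in the Setup section.

```python
import random


def swap_adjacent_chars(sentence: str, rate: float, seed: int = 0) -> str:
    """Randomly swap neighbouring letters to mimic keyboard slips."""
    rng = random.Random(seed)  # local generator, so results are reproducible
    chars = list(sentence)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair we just swapped
        else:
            i += 1
    return "".join(chars)


print(swap_adjacent_chars("a charming and funny movie", rate=0.3))
```

The perturbed sentence keeps the same characters and length, so any drop in accuracy can be attributed to the changed spelling alone.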

Our third experiment confronts CANINE with a more complex and noisy dataset: Sentiment140. It is made up of 1.6
million tweets, so the language used is more informal and prone to abbreviations and colloquialisms. According to the 
CANINE paper, CANINE is expected to do better than regular token-based models, which are limited by out-of-vocabulary
words. 

Following the previous experiment, we decided to test how CANINE would perform on Sentiment140 without having been trained
on it. This lets us see how CANINE and the other models perform when faced with "natural" noise (the language of tweets)
and when the domain is different (in the sense that the topic and the way of writing/speaking are different). Additionally,
we can quantify the gain in accuracy from training directly on Sentiment140 compared to doing zero-shot transfer.

As CANINE has been pre-trained on multilingual data, it is worth analyzing its abilities on languages other
than English, especially since it is tokenization-free and hence, theoretically, should adapt more easily to
languages with richer morphology. To test that, we did zero-shot transfer learning on multilingual data (MARC dataset).

To go further, we decided to compare the abilities of CANINE and other BERT-like models when actually finetuned on this
multilingual data. To do so, we again work with the MARC dataset, using data in German, Japanese and
Chinese. We would like to see how CANINE compares and whether it does better on languages which are more challenging for
token-based models (Chinese for instance). Compared to the previous experiment, we are not doing transfer learning but
actually finetuning the models for 2 epochs on a train set.

Finally, we provide a look into the prediction errors of all models on the SST2 test set.

## Setup

In [None]:
# Clone the repository containing all the code

!rm -rf nlp_ensae
!git clone https://github.com/chloeskt/nlp_ensae.git

Cloning into 'nlp_ensae'...
remote: Enumerating objects: 341, done.
remote: Counting objects: 100% (341/341), done.
remote: Compressing objects: 100% (240/240), done.
remote: Total 341 (delta 188), reused 242 (delta 93), pack-reused 0
Receiving objects: 100% (341/341), 351.77 KiB | 10.35 MiB/s, done.
Resolving deltas: 100% (188/188), done.


In [None]:
# check GPU
!nvidia-smi

Fri Apr 22 13:09:22 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    25W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Install some dependencies
! pip install --quiet pandas datasets transformers nlpaug

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the fo

In [None]:
import transformers

transformers.logging.set_verbosity_error()

In [None]:
# Mount your google drive to save results

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Download pretrained models and custom datasets
# please note that sometimes the download fails for unknown reasons
# But all files are accessible with the link https://drive.google.com/drive/folders/1HjKQ_C_EoBDncjA3nJ-IlgjwhQe4bKTO?usp=sharing

!gdown --folder https://drive.google.com/drive/folders/1HjKQ_C_EoBDncjA3nJ-IlgjwhQe4bKTO?usp=sharing -O /content/drive/MyDrive/

Retrieving folder list
Retrieving folder 1-LLJWl5Zm3TP8-_J_S2S0Ak3f6GoioVU bert-finetuned
Processing file 14lNatLs-jOUo8wPuYb3-4CXC2V-9URE2 bert_best_model.pt
Retrieving folder 1C-b64k0gxGDIK1aSHzvtiVdjP_IUcjBs canine-c-finetuned
Processing file 1HKShAnDd2iCMYDoCrTXW-ppfmZg_VKk1 canine-c_best_model.pt
Retrieving folder 10GSpIxLSFTXHjd86gS0hgYqOWk3BW87m canine-s-finetuned
Processing file 1485SbxkMNdcX8qV42_7qX0TPkfFU0JZ7 canine-s_best_model.pt
Retrieving folder 1Qd2jzOGMa8ViDjAOU-xQp3aMNyWv9HcC distilbert-finetuned
Processing file 1g-Dz_5GnXoC-rpsntmSunYnMcIGKRTfD distilbert_best_model.pt
Retrieving folder 13SbRlaGyJSdSGqf_esB_QrKHaoYfzvJs mbert-finetuned
Processing file 11AWI2N9QGhpSjRSLdRh7fyf7B0Hc8ifA mbert_best_model.pt
Retrieving folder 1HLMfj2ok75zyj7tJeRv6Y3BmXFHXooJ3 roberta-finetuned
Processing file 1YkKuD1wW4vVHeiaLy8I-6l0ropHRAS-x roberta_best_model.pt
Retrieving folder 1LK7y9ekKlQ92sxDLlkEfh6agBw3dtX92 xlm_roberta-finetuned
Processing file 1XBGWrvbzlfE0-mAELwxJVDIKAs5n5Ars x

In [None]:
# Download pretrained models and custom datasets
# please note that sometimes the download fails for unknown reasons
# But all files are accessible with the link https://drive.google.com/drive/folders/1A9SrIW0hyXSyj0joKBC9re-U4ri9zewt?usp=sharing

!gdown --folder https://drive.google.com/drive/folders/1A9SrIW0hyXSyj0joKBC9re-U4ri9zewt?usp=sharing -O /content/drive/MyDrive/

Retrieving folder list
Retrieving folder 1o9TBDIWpMY4xZdw6o8aeEjPHxQwDBXdj amazon_multilingual
Retrieving folder 1FuMgJ6zIPtHrzMkM3EbD7QqmGgfQERje de
Retrieving folder 1-8wnPWniOLNSY7_ExYM4mVWmRgqGH_j9 canine-s-finetuned
Processing file 14ptaG9y6X6uwYf4GqRZkiT5XxyV4uoId canine-s_best_model.pt
Retrieving folder 1vNfagnU5OgcsUiTOyqqHg2mkQE9J_21i test
Processing file 1mDbXDvyW1YbeiA0NROqwRdgOOF6UfEp9 cache-0cf405bf6d7327dc.arrow
Processing file 1LLSS0fRy6u4TRY-Fmxl-rUIHZIBF6Lm2 cache-0ff9c1dea035819f.arrow
Processing file 1_PkU68qOcd9vxyrSQWVPPbu1VVTZbgjI cache-30e7cd049319cfd2.arrow
Processing file 1qVp6kMeoZsrbgntO5vBYPIhGwqXBTvMh cache-799e77ac4d257bac.arrow
Processing file 139dxZmgiWy0LkHm7422Myw4aMQkrd2N7 cache-ab14ae5ad6b924d0.arrow
Processing file 1-6kVANED6LPTnLwWTbS5npvES36C4z-7 cache-ffca98777b678b04.arrow
Processing file 1D4teKKM1bF4imSVnIg3Cb7fkwEozgdCy dataset_info.json
Processing file 1Mu-GetN15oWscZosSXNbR20NcDBbSS9E dataset.arrow
Processing file 1AzJd6PboyfyOAIO68cqtGzXBgi

In [None]:
# Cd into the sentiment_analysis folder to access the Python package

%cd nlp_ensae/source/sentiment_analysis/

/content/nlp_ensae/source/sentiment_analysis


## Imports

In [None]:
import argparse

import logging
import os
from abc import ABC, abstractmethod
from typing import Any, Dict, List, OrderedDict, Callable, Tuple, Union, Optional
from tqdm import tqdm
import random
from dataclasses import dataclass, field

tqdm.pandas()

import datasets
from datasets import load_dataset, load_metric, load_from_disk, DatasetDict, Dataset

import torch
import torch.nn as nn

from transformers.data.data_collator import InputDataClass
from transformers.trainer_utils import PredictionOutput, EvalPrediction

from nlpaug import Augmenter

import numpy as np
from transformers import (
    CanineTokenizer,
    IntervalStrategy,
    RobertaTokenizerFast,
    SchedulerType,
    BertTokenizerFast,
    DistilBertTokenizerFast,
    XLMRobertaTokenizerFast,
    DataCollatorWithPadding,
    CanineForSequenceClassification,
    BertForSequenceClassification,
    RobertaForSequenceClassification,
    DistilBertForSequenceClassification,
    PretrainedConfig,
    PreTrainedTokenizer,
    BatchEncoding,
    # Needed by the classes shown later in this notebook
    CanineConfig,
    CanineModel,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

from sentiment_analysis import (
    DatasetTokenizer,
    set_seed,
    TrainerArguments,
    DataArguments,
    CustomTrainer,
    save_predictions_to_pandas_dataframe,
    Model,
)


## Caveats

The code developed for the Sentiment Classification task was mainly designed to run on a remote server rather than in a Jupyter notebook, which is why this notebook consists mostly of bash commands. For more detail on the code itself, we strongly advise you to look at the package and the corresponding `README.md`.

However, we show some core classes in the following cells so that you can get a better feeling for what is going on.

## Parts of the Python package

#### main.py script

In [None]:
NUM_LABELS = 2

SEED = 0
set_seed(SEED)

CANINE_S_MODEL = "canine-s"
CANINE_C_MODEL = "canine-c"
BERT_MODEL = "bert"
MBERT_MODEL = "mbert"
XLM_ROBERTA_MODEL = "xlm_roberta"
ROBERTA_MODEL = "roberta"
DISTILBERT_MODEL = "distilbert"

SST2_DATASET_CONFIG = "sst2"
GLUE_DATASET_NAME = "glue"
SENT140_DATASET_NAME = "sentiment140"
AMAZON_MULTI_DATASET_NAME = "amazon_reviews_multi"

logger = logging.getLogger(__name__)


def train_model(
    model_name: str,
    learning_rate: float,
    weight_decay: float,
    type_lr_scheduler: SchedulerType,
    warmup_ratio: float,
    save_strategy: IntervalStrategy,
    save_steps: int,
    num_epochs: int,
    early_stopping_patience: int,
    output_dir: str,
    device: str,
    dataset_name: str,
    batch_size: int,
    truncation: bool,
    eval_only: bool,
    path_to_finetuned_model: str,
    dataset_config: str,
    padding: str,
    path_to_custom_dataset: str,
    mode: str,
    save_predictions: bool,
) -> None:
    logger.info(f"Loading dataset {dataset_name}")
    if dataset_name == GLUE_DATASET_NAME:
        logger.info(f"Chosen configuration is {dataset_config}")
        datasets = load_from_disk(path_to_custom_dataset)
    elif dataset_name == SENT140_DATASET_NAME:
        datasets = load_from_disk(path_to_custom_dataset)
    elif dataset_name == AMAZON_MULTI_DATASET_NAME:
        datasets = load_from_disk(path_to_custom_dataset)
    else:
        raise NotImplementedError

    print(datasets)
    logger.info(f"Preparing for model {model_name}")
    if model_name in [CANINE_C_MODEL, CANINE_S_MODEL]:
        pretrained_model_name = f"google/{model_name}"
        tokenizer = CanineTokenizer.from_pretrained(pretrained_model_name)
        model = CanineForSequenceClassification.from_pretrained(
            pretrained_model_name, num_labels=NUM_LABELS
        )
    else:
        if model_name == BERT_MODEL:
            pretrained_model_name = "bert-base-uncased"
            tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name)
            model = BertForSequenceClassification.from_pretrained(
                pretrained_model_name, num_labels=NUM_LABELS
            )

        elif model_name == MBERT_MODEL:
            pretrained_model_name = "bert-base-multilingual-cased"
            tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name)
            model = BertForSequenceClassification.from_pretrained(
                pretrained_model_name, num_labels=NUM_LABELS
            )

        elif model_name == XLM_ROBERTA_MODEL:
            pretrained_model_name = "xlm-roberta-base"
            tokenizer = XLMRobertaTokenizerFast.from_pretrained(pretrained_model_name)
            model = RobertaForSequenceClassification.from_pretrained(
                pretrained_model_name, num_labels=NUM_LABELS
            )

        elif model_name == ROBERTA_MODEL:
            pretrained_model_name = "roberta-base"
            tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_model_name)
            model = RobertaForSequenceClassification.from_pretrained(
                pretrained_model_name, num_labels=NUM_LABELS
            )

        elif model_name == DISTILBERT_MODEL:
            pretrained_model_name = "distilbert-base-uncased"
            tokenizer = DistilBertTokenizerFast.from_pretrained(pretrained_model_name)
            model = DistilBertForSequenceClassification.from_pretrained(
                pretrained_model_name, num_labels=NUM_LABELS
            )

        else:
            raise NotImplementedError

    dataset_tokenizer = DatasetTokenizer(
        tokenizer=tokenizer,
        padding=padding,
        truncation=truncation,
    )

    logger.info("Tokenizing dataset")
    tokenized_datasets = datasets.map(
        dataset_tokenizer.tokenize,
        batched=True,
    )

    data_collator = DataCollatorWithPadding(tokenizer, padding=padding)
    metric = load_metric("glue", "sst2")

    if eval_only:
        logger.info("Loading own finetuned model")
        model.load_state_dict(torch.load(path_to_finetuned_model, map_location=device))

    trainer_args = TrainerArguments(
        model=model,
        learning_rate=learning_rate,
        lr_scheduler=type_lr_scheduler,
        warmup_ratio=warmup_ratio,
        save_strategy=save_strategy,
        save_steps=save_steps,
        epochs=num_epochs,
        output_dir=output_dir,
        metric=metric,
        evaluation_strategy=save_strategy,
        weight_decay=weight_decay,
        data_collator=data_collator,
        model_save_path=os.path.join(
            output_dir, f"{model_name}-finetuned", f"{model_name}_best_model.pt"
        ),
        device=device,
        early_stopping_patience=early_stopping_patience,
        metric_for_best_model="accuracy",
    )

    data_args = DataArguments(
        datasets=datasets,
        dataset_name=dataset_name,
        dataset_config=dataset_config,
        batch_size=batch_size,
        tokenizer=tokenizer,
        tokenized_datasets=tokenized_datasets,
    )

    if model_name in [
        CANINE_C_MODEL,
        CANINE_S_MODEL,
        BERT_MODEL,
        MBERT_MODEL,
        XLM_ROBERTA_MODEL,
        ROBERTA_MODEL,
        DISTILBERT_MODEL,
    ]:
        trainer = CustomTrainer(trainer_args, data_args, model_name)
    else:
        raise NotImplementedError

    # check if we are in eval mode only or not
    if not eval_only:
        logger.info("START TRAINING")
        trainer.train()

    logger.info("START FINAL EVALUATION")
    trainer.evaluate()
    logger.info("Final evaluation done")

    logger.info("GET PREDICTIONS")
    test_predictions = trainer.predict(mode=mode)
    logger.info("Predictions done")
    results = trainer.evaluate_predictions(test_predictions)
    logger.info(f"Predictions accuracy {results['accuracy']}")
    if save_predictions:
        save_predictions_to_pandas_dataframe(
            test_predictions,
            datasets,
            output_dir,
            model_name,
            mode,
            logger,
        )

    if not eval_only:
        # Save best model
        trainer.save_model()


In [None]:
%%script false --no-raise-error

if __name__ == "__main__":
    debug = False
    logging.basicConfig(level=logging.INFO)
    logging.getLogger("datasets").setLevel(logging.ERROR)

    parser = argparse.ArgumentParser(
        description="Parser for training and data arguments"
    )

    parser.add_argument(
        "--model_name",
        type=str,
        help="Name of the model",
        choices=[
            MBERT_MODEL,
            BERT_MODEL,
            CANINE_S_MODEL,
            CANINE_C_MODEL,
            ROBERTA_MODEL,
            XLM_ROBERTA_MODEL,
            DISTILBERT_MODEL,
        ],
        required=True,
    )
    parser.add_argument(
        "--learning_rate",
        type=float,
        required=True,
        help="Chosen learning rate for AdamW optimizer",
    )
    parser.add_argument(
        "--weight_decay",
        type=float,
        required=True,
        help="Chosen weight decay for AdamW optimizer",
    )
    parser.add_argument(
        "--type_lr_scheduler", type=str, required=True, help="Type of LR scheduler"
    )
    parser.add_argument(
        "--warmup_ratio", type=float, required=True, help="Warmup ratio"
    )
    parser.add_argument(
        "--save_strategy",
        type=str,
        required=True,
        help="Save strategy",
        choices=["steps", "epoch"],
    )
    parser.add_argument(
        "--save_steps",
        type=int,
        required=True,
        help="Number of steps to perform before saving model",
    )
    parser.add_argument(
        "--num_epochs", type=int, required=True, help="Number of epochs to train for"
    )
    parser.add_argument(
        "--early_stopping_patience",
        type=int,
        required=True,
        help="Patience for early stopping, the metric for best model (accuracy) is monitored",
    )
    parser.add_argument(
        "--output_dir", type=str, required=True, help="Directory to store the model"
    )
    parser.add_argument(
        "--device",
        type=str,
        required=True,
        help="Device to run the code on, either cpu or cuda",
    )
    parser.add_argument(
        "--dataset_name",
        type=str,
        default="glue",
        choices=[GLUE_DATASET_NAME, SENT140_DATASET_NAME, AMAZON_MULTI_DATASET_NAME],
        required=True,
        help="Name of the dataset to train/evaluate on",
    )
    parser.add_argument(
        "--dataset_config",
        type=str,
        default="sst2",
        help="Config name for GLUE dataset",
    )
    parser.add_argument(
        "--batch_size",
        type=int,
        required=True,
        help="Batch size for training and evaluation",
    )
    parser.add_argument("--eval_only", type=bool, default=False)
    parser.add_argument(
        "--path_to_finetuned_model",
        type=str,
        default=None,
        help="Path towards a previously finetuned model",
    )
    parser.add_argument(
        "--truncation",
        type=bool,
        required=True,
        help="Whether or not tokenizer should truncate the inputs",
    )
    parser.add_argument("--padding", type=str, help="Padding strategy")
    parser.add_argument(
        "--path_to_custom_dataset",
        type=str,
        help="Path to custom dataset",
        required=True,
    )
    parser.add_argument(
        "--mode",
        type=str,
        help="Evaluation mode",
        choices=["val", "test"],
        required=True,
    )
    parser.add_argument(
        "--save_predictions",
        type=bool,
        help="Whether or not to save the predictions into a csv file",
        default=False,
    )

    args = parser.parse_args()

    train_model(
        model_name=args.model_name,
        learning_rate=args.learning_rate,
        weight_decay=args.weight_decay,
        type_lr_scheduler=args.type_lr_scheduler,
        warmup_ratio=args.warmup_ratio,
        save_strategy=args.save_strategy,
        save_steps=args.save_steps,
        num_epochs=args.num_epochs,
        early_stopping_patience=args.early_stopping_patience,
        output_dir=args.output_dir,
        device=args.device,
        dataset_name=args.dataset_name,
        batch_size=args.batch_size,
        truncation=args.truncation,
        eval_only=args.eval_only,
        path_to_finetuned_model=args.path_to_finetuned_model,
        dataset_config=args.dataset_config,
        padding=args.padding,
        path_to_custom_dataset=args.path_to_custom_dataset,
        mode=args.mode,
        save_predictions=args.save_predictions,
    )
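
One caveat about the parser above: `argparse` applies `type=bool` by calling `bool()` on the raw string, and any non-empty string is truthy, so `--eval_only False` would still evaluate to `True`. The commands in this notebook only ever pass `True`, so they are unaffected. The sketch below shows the behaviour and one possible workaround; `str2bool` and the two flag names are hypothetical helpers written for this example, not part of the repository.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse common textual booleans instead of relying on bool(str)."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"Expected a boolean, got {value!r}")


parser = argparse.ArgumentParser()
parser.add_argument("--naive", type=bool, default=False)
parser.add_argument("--strict", type=str2bool, default=False)

args = parser.parse_args(["--naive", "False", "--strict", "False"])
print(args.naive)   # True, because bool("False") is True
print(args.strict)  # False
```

An alternative for flags that default to `False` is `action="store_true"`, at the cost of changing how the flag is written on the command line.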


#### CANINE model

In [None]:
HuggingFaceModelT = Any
CANINE_C = "google/canine-c"
CANINE_S = "google/canine-s"

class Model(nn.Module):
    """Generic model for Sentiment Classification Tasks"""

    def __init__(self, model: HuggingFaceModelT, config: PretrainedConfig):
        nn.Module.__init__(self)
        self.num_labels = config.num_labels

        self.model = model
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

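        # The wrapped encoder is already initialized by `from_pretrained`;
        # the classifier keeps PyTorch's default `nn.Linear` initialization.
        # (`post_init()` exists on `PreTrainedModel`, not on `nn.Module`.)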

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ):
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        pooled_output = outputs[1]

        pooled_output = self.dropout(pooled_output)
        return self.classifier(pooled_output)


class CanineSA(Model):
    """CANINE model for Sentiment Classification Tasks"""

    def __init__(self, pretrained_model_name: str) -> None:
        # `pretrained_model_name` is expected to be CANINE_C or CANINE_S
        config = CanineConfig(num_labels=2)
        canine = CanineModel.from_pretrained(pretrained_model_name)
        Model.__init__(self, canine, config)


#### Dataset Tokenizer

Class that tokenizes the dataset and provides the right inputs for the model.

In [None]:
class DatasetTokenizer:
    def __init__(
        self,
        tokenizer: PreTrainedTokenizer,
        padding: str,
        truncation: bool,
    ) -> None:
        self.tokenizer = tokenizer
        self.padding = padding
        self.truncation = truncation

    def tokenize(self, data: Dataset) -> BatchEncoding:
        return self.tokenizer(
            data["sentence"],
            padding=self.padding,
            truncation=self.truncation,
        )
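
The `padding` and `truncation` arguments passed through `DatasetTokenizer` control the length of each encoded sentence. The sketch below illustrates, in plain Python, what padding to a fixed length with truncation amounts to. It is a simplification under stated assumptions: `pad_or_truncate`, the pad value `0` and the chosen `max_length` values are ours, and the special tokens a real tokenizer adds are ignored.

```python
from typing import List, Tuple


def pad_or_truncate(
    ids: List[int], max_length: int, pad_id: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    kept = ids[:max_length]                         # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad             # padding
    attention_mask = [1] * len(kept) + [0] * n_pad  # 0 marks padded positions
    return input_ids, attention_mask


ids = [ord(c) for c in "good"]
print(pad_or_truncate(ids, max_length=6))  # ([103, 111, 111, 100, 0, 0], [1, 1, 1, 1, 0, 0])
print(pad_or_truncate(ids, max_length=2))  # ([103, 111], [1, 1])
```

Because character-level sequences are much longer than subword sequences, the fixed length matters more for CANINE than for the token-based models.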

#### Base Custom Trainer

In [None]:
@dataclass
class TrainerArguments:
    """
    Arguments needed to initiate a Trainer
    """

    model: nn.Module
    learning_rate: float
    lr_scheduler: SchedulerType
    warmup_ratio: float
    save_strategy: IntervalStrategy
    save_steps: int
    epochs: int
    output_dir: str
    metric: Any
    evaluation_strategy: IntervalStrategy
    weight_decay: float
    data_collator: Callable[[List[InputDataClass]], Dict[str, Any]]
    model_save_path: str
    device: str
    early_stopping_patience: int
    metric_for_best_model: str


@dataclass
class DataArguments:
    """
    Data arguments needed to initiate a Trainer
    """

    datasets: DatasetDict
    dataset_name: str
    dataset_config: str
    batch_size: int
    tokenizer: PreTrainedTokenizer
    tokenized_datasets: DatasetDict


class CustomTrainer(ABC):
    """General Trainer signature"""

    logger = logging.getLogger(__name__)

    def __init__(
        self, trainer_args: TrainerArguments, data_args: DataArguments, model_name: str
    ) -> None:
        self.trainer_args = trainer_args
        self.data_args = data_args
        self.model_name = model_name

        # Define training arguments
        args = TrainingArguments(
            output_dir=os.path.join(
                self.trainer_args.output_dir, self.model_name + "-finetuned"
            ),
            evaluation_strategy=self.trainer_args.evaluation_strategy,
            learning_rate=self.trainer_args.learning_rate,
            weight_decay=self.trainer_args.weight_decay,
            num_train_epochs=self.trainer_args.epochs,
            lr_scheduler_type=self.trainer_args.lr_scheduler,
            warmup_ratio=self.trainer_args.warmup_ratio,
            per_device_train_batch_size=self.data_args.batch_size,
            per_device_eval_batch_size=self.data_args.batch_size,
            save_strategy=self.trainer_args.save_strategy,
            save_steps=self.trainer_args.save_steps,
            push_to_hub=False,
            metric_for_best_model=trainer_args.metric_for_best_model,
            load_best_model_at_end=True,
            logging_steps=self.trainer_args.save_steps,
            no_cuda=self.trainer_args.device != "cuda",
        )

        callbacks = [
            EarlyStoppingCallback(
                early_stopping_patience=self.trainer_args.early_stopping_patience
            )
        ]

        self.trainer = Trainer(
            self.trainer_args.model,
            args,
            train_dataset=self.data_args.tokenized_datasets["train"],
            eval_dataset=self.data_args.tokenized_datasets["validation"],
            data_collator=self.trainer_args.data_collator,
            tokenizer=self.data_args.tokenizer,
            callbacks=callbacks,
            compute_metrics=self._compute_metrics,
        )

    def train(self) -> None:
        self.logger.info("Start training")
        self.trainer.train()
        self.logger.info("Training done")

    def evaluate(self) -> None:
        self.logger.info("Start evaluation")
        results = self.trainer.evaluate()
        self.logger.info(
            f"Evaluation done: Eval loss {results['eval_loss']}, Eval accuracy {results['eval_accuracy']}"
        )

    def predict(self, mode: str) -> PredictionOutput:
        self.logger.info(f"Start predicting on {mode} set")
        if mode == "val":
            data = self.data_args.tokenized_datasets["validation"]
        elif mode == "test":
            data = self.data_args.tokenized_datasets["test"]
        else:
            raise NotImplementedError
        predictions = self.trainer.predict(data)
        self.logger.info("Prediction done")
        return predictions

    def evaluate_predictions(
        self, eval_predictions: PredictionOutput
    ) -> Dict[str, float]:
        return self._compute_metrics(eval_predictions)

    def save_model(self) -> None:
        self.logger.info(
            f"Saving best trained model at {self.trainer_args.model_save_path}"
        )
        torch.save(self.trainer.model.state_dict(), self.trainer_args.model_save_path)

    def _compute_metrics(
        self, eval_predictions: Union[EvalPrediction, PredictionOutput]
    ) -> Dict[str, float]:
        try:
            predictions, labels = eval_predictions
        except (TypeError, ValueError):
            # `PredictionOutput` has three fields, so two-way unpacking fails
            predictions, labels = (
                eval_predictions.predictions,
                eval_predictions.label_ids,
            )
        predictions = np.argmax(predictions, axis=1)
        return self.trainer_args.metric.compute(
            predictions=predictions, references=labels
        )
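
The metric computation in `_compute_metrics` reduces to two steps: take the index of the largest logit for each example, then compare with the reference labels. Here is a dependency-free sketch of those two steps; the logits and labels are made-up values for illustration only, and the helper names are ours.

```python
from typing import Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Share of examples whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four sentences, two classes (negative, positive)
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]
labels = [0, 1, 0, 0]
print(accuracy(logits, labels))  # 0.75
```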

#### CLI commands to finetune models on SST2 

The following examples are given for the CANINE-S and XLM-RoBERTa models but apply equally to:

- BERT
- mBERT
- DistilBERT
- RoBERTa
- CANINE-C

In [None]:
%%script false --no-raise-error

!python3 main.py \
    --model_name canine-s \
    --learning_rate 2e-5 \
    --weight_decay 1e-2 \
    --type_lr_scheduler linear \
    --warmup_ratio 0.1 \
    --save_strategy steps \
    --save_steps 2500 \
    --num_epochs 3 \
    --early_stopping_patience 1 \
    --output_dir /content/drive/MyDrive/sentiment_analysis \
    --dataset_name glue \
    --dataset_config sst2 \
    --batch_size 12 \
    --truncation True \
    --padding max_length \
    --path_to_custom_dataset /content/drive/MyDrive/sentiment_analysis_data/sst2 \
    --mode test \
    --device cuda


In [None]:
%%script false --no-raise-error

!python3 main.py \
    --model_name xlm_roberta \
    --learning_rate 2e-5 \
    --weight_decay 1e-2 \
    --type_lr_scheduler linear \
    --warmup_ratio 0.1 \
    --save_strategy steps \
    --save_steps 2500 \
    --num_epochs 3 \
    --early_stopping_patience 1 \
    --output_dir /content/drive/MyDrive/sentiment_analysis \
    --dataset_name glue \
    --dataset_config sst2 \
    --batch_size 12 \
    --truncation True \
    --padding max_length \
    --path_to_custom_dataset /content/drive/MyDrive/sentiment_analysis_data/sst2 \
    --mode test \
    --device cuda
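
To give a feeling for what `--type_lr_scheduler linear` combined with `--warmup_ratio 0.1` does to the learning rate, here is a small sketch of a linear schedule with warmup. It mirrors the general shape of such a schedule rather than reproducing the library's implementation: the function name, the rounding of the warmup length and the total number of steps are assumptions made for this example.

```python
def linear_schedule_lr(
    step: int, total_steps: int, base_lr: float, warmup_ratio: float
) -> float:
    """Learning rate at `step`: linear ramp-up, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)  # rounding is an assumption
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


TOTAL_STEPS = 1000  # hypothetical; depends on dataset size, batch size and epochs
for step in (0, 50, 100, 550, 1000):
    lr = linear_schedule_lr(step, TOTAL_STEPS, base_lr=2e-5, warmup_ratio=0.1)
    print(step, lr)
```

With these settings the rate climbs from zero to `2e-5` over the first 10% of steps and then decays linearly back to zero.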


#### CLI commands to do zero-shot transfer learning and domain adaptation from SST2 to Sentiment140


In [None]:
#%%script false --no-raise-error

!python3 main.py \
    --model_name canine-s \
    --learning_rate 2e-5 \
    --weight_decay 1e-2 \
    --type_lr_scheduler linear \
    --warmup_ratio 0.1 \
    --save_strategy steps \
    --save_steps 2500 \
    --num_epochs 3 \
    --early_stopping_patience 1 \
    --output_dir /content/drive/MyDrive/sentiment_analysis \
    --dataset_name sentiment140 \
    --batch_size 6 \
    --truncation True \
    --padding max_length \
    --path_to_custom_dataset /content/drive/MyDrive/sentiment_analysis_data/sentiment_analysis_140/sentiment140 \
    --mode test \
    --eval_only True \
    --device cuda \
    --path_to_finetuned_model /content/drive/MyDrive/sentiment_analysis/canine-s-finetuned/canine-s_best_model.pt

INFO:__main__:Loading dataset sentiment140
DatasetDict({
    train: Dataset({
        features: ['sentence', 'labels'],
        num_rows: 63360
    })
    test: Dataset({
        features: ['sentence', 'labels'],
        num_rows: 359
    })
    validation: Dataset({
        features: ['sentence', 'labels'],
        num_rows: 16000
    })
})
INFO:__main__:Preparing for model canine-s
Downloading: 100% 657/657 [00:00<00:00, 628kB/s]
Downloading: 100% 854/854 [00:00<00:00, 835kB/s]
Downloading: 100% 670/670 [00:00<00:00, 579kB/s]
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Downloading: 100% 504M/504M [00:08<00:00, 63.0MB/s]
Some weights of CanineForSequenceClassification were not initialized from the model checkpoint

In [None]:
#%%script false --no-raise-error

!python3 main.py \
    --model_name bert \
    --learning_rate 2e-5 \
    --weight_decay 1e-2 \
    --type_lr_scheduler linear \
    --warmup_ratio 0.1 \
    --save_strategy steps \
    --save_steps 2500 \
    --num_epochs 3 \
    --early_stopping_patience 1 \
    --output_dir /content/drive/MyDrive/sentiment_analysis \
    --dataset_name sentiment140 \
    --batch_size 6 \
    --truncation True \
    --padding max_length \
    --path_to_custom_dataset /content/drive/MyDrive/sentiment_analysis_data/sentiment_analysis_140/sentiment140 \
    --mode test \
    --eval_only True \
    --device cuda \
    --path_to_finetuned_model /content/drive/MyDrive/sentiment_analysis/bert-finetuned/bert_best_model.pt

INFO:__main__:Loading dataset sentiment140
DatasetDict({
    train: Dataset({
        features: ['sentence', 'labels'],
        num_rows: 63360
    })
    test: Dataset({
        features: ['sentence', 'labels'],
        num_rows: 359
    })
    validation: Dataset({
        features: ['sentence', 'labels'],
        num_rows: 16000
    })
})
INFO:__main__:Preparing for model bert
Downloading: 100% 28.0/28.0 [00:00<00:00, 28.5kB/s]
Downloading: 100% 226k/226k [00:00<00:00, 321kB/s]
Downloading: 100% 455k/455k [00:00<00:00, 517kB/s]
Downloading: 100% 570/570 [00:00<00:00, 543kB/s]
Downloading: 100% 420M/420M [00:06<00:00, 72.3MB/s]
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.t

#### Code used to generate noisy datasets

Noise generated on SST2 dataset.

In [None]:
import random
from dataclasses import dataclass, field
from typing import Optional

import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
from datasets import load_from_disk, DatasetDict, Dataset
from nlpaug import Augmenter
from transformers import HfArgumentParser


@dataclass
class NoisifierArguments:
    """
    Arguments needed to noisify SST2-like datasets.
    """

    path_to_custom_dataset: str = field(
        default=None, metadata={"help": "Path towards custom dataset"}
    )
    dataset_name: str = field(
        default=None,
        metadata={"help": "Name of the dataset. Either sst2 or sentiment140"},
    )
    output_dir: str = field(
        default=None,
        metadata={
            "help": "Output directory, will be used to store noisy dataset in csv format"
        },
    )
    noise_level: float = field(
        default=None, metadata={"help": "Level of noise to apply on the dataset"}
    )
    augmenter_type: str = field(
        default=None,
        metadata={
            "help": "Type of Augmenter to use. Either: KeyboardAug, RandomCharAug, SpellingAug, BackTranslationAug "
            "(de/en) or OcrAug"
        },
    )
    action: str = field(
        default=None,
        metadata={
            "help": "Type of action to apply if RandomCharAug was chosen. Either: swap, substitute, insert or delete."
        },
    )

    def __post_init__(self) -> None:
        if (self.augmenter_type == "RandomCharAug" and self.action is None) or (
            self.action is not None and self.augmenter_type != "RandomCharAug"
        ):
            raise ValueError(
                "If you set `augmenter_type` to RandomCharAug, please choose an `action`. "
                "If you've chosen an `action`, you must choose `augmenter_type`==RandomCharAug for it "
                "to work."
            )


class Noisifier:
    def __init__(
        self,
        datasets: DatasetDict,
        dataset_name: str,
        level: float,
        type: str,
        action: Optional[str],
    ) -> None:
        self.datasets = datasets
        self.dataset_name = dataset_name
        self.level = level
        self.type = type
        self.action = action

    def _get_augmenter(self) -> Augmenter:
        if self.type == "KeyboardAug":
            return nac.KeyboardAug()

        elif self.type == "RandomCharAug":
            return nac.RandomCharAug(action=self.action)

        elif self.type == "SpellingAug":
            return naw.SpellingAug()

        elif self.type == "OcrAug":
            return nac.OcrAug()

        elif self.type == "BackTranslationAug":
            return naw.BackTranslationAug(
                from_model_name="facebook/wmt19-en-de",
                to_model_name="facebook/wmt19-de-en",
            )

        else:
            raise NotImplementedError

    def _augment_text(self, row: Dataset) -> Dataset:
        if self.dataset_name == "sst2":
            text = "sentence"
        elif self.dataset_name == "sentiment140":
            text = "text"
        else:
            raise NotImplementedError
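        if random.random() < self.level:
            # Only build the augmenter for rows that are actually noisified
            augmenter = self._get_augmenter()
            augmented = augmenter.augment(row[text])
            # Depending on the nlpaug version, `augment` returns a str or a list of str
            row[text] = augmented[0] if isinstance(augmented, list) else augmented
        return row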

    def augment(self):
        if (
            "train" in self.datasets.column_names
            and "validation" in self.datasets.column_names
            and "test" in self.datasets.column_names
        ):
            self.datasets["train"] = self.datasets["train"].map(self._augment_text)
            self.datasets["validation"] = self.datasets["validation"].map(
                self._augment_text
            )
            self.datasets["test"] = self.datasets["test"].map(self._augment_text)
        else:
            raise NotImplementedError
        return self.datasets


In [None]:
%%script false --no-raise-error

if __name__ == "__main__":
    parser = HfArgumentParser(NoisifierArguments)
    args = parser.parse_args_into_dataclasses()[0]
    datasets = load_from_disk(args.path_to_custom_dataset)

    noisifier = Noisifier(
        datasets=datasets,
        dataset_name=args.dataset_name,
        level=args.noise_level,
        type=args.augmenter_type,
        action=args.action,
    )

    new_datasets = noisifier.augment()

    # saving
    print(f"saving noisy dataset dict at {args.output_dir}")
    new_datasets.save_to_disk(args.output_dir)
    print(new_datasets)

    print("Loading noisy dataset dict")
    # Reload with the module-level `load_from_disk` imported above
    datasets = load_from_disk(args.output_dir)
    print(datasets)

#### CLI commands to evaluate CANINE-S on noisy data, when $p=40$\%

In [None]:
#%%script false --no-raise-error

!python3 main.py \
    --model_name canine-s \
    --learning_rate 2e-5 \
    --weight_decay 1e-2 \
    --type_lr_scheduler linear \
    --warmup_ratio 0.1 \
    --save_strategy steps \
    --save_steps 2500 \
    --num_epochs 3 \
    --early_stopping_patience 1 \
    --output_dir /content/drive/MyDrive/sentiment_analysis \
    --dataset_name glue \
    --dataset_config sst2 \
    --batch_size 6 \
    --truncation True \
    --padding max_length \
    --path_to_custom_dataset /content/drive/MyDrive/sentiment_analysis_data/noisy_data/noisy_data_40 \
    --mode test \
    --eval_only True \
    --device cuda \
    --path_to_finetuned_model /content/drive/MyDrive/sentiment_analysis/canine-s-finetuned/canine-s_best_model.pt

INFO:__main__:Loading dataset glue
INFO:__main__:Chosen configuration is sst2
DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 63982
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 3367
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
})
INFO:__main__:Preparing for model canine-s
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Some weights of CanineForSequenceClassification were not initialized from the model checkpoint at google/canine-s and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task 

### CLI commands to finetune models on Sentiment140

In [None]:
%%script false --no-raise-error

!python3 main.py \
    --model_name canine-s \
    --learning_rate 2e-5 \
    --weight_decay 1e-2 \
    --type_lr_scheduler linear \
    --warmup_ratio 0.1 \
    --save_strategy steps \
    --save_steps 5000 \
    --num_epochs 3 \
    --early_stopping_patience 3 \
    --output_dir /content/drive/MyDrive/sentiment_analysis/sentiment_analysis_140 \
    --dataset_name sentiment140 \
    --batch_size 6 \
    --truncation True \
    --padding max_length \
    --path_to_custom_dataset /content/drive/MyDrive/sentiment_analysis_data/sentiment_analysis_140/sentiment140 \
    --mode test \
    --device cuda

### CLI commands to do zero-shot transfer learning on multilingual data

In [None]:
#%%script false --no-raise-error

!python3 main.py \
    --model_name canine-c \
    --learning_rate 2e-5 \
    --weight_decay 1e-2 \
    --type_lr_scheduler linear \
    --warmup_ratio 0.1 \
    --save_strategy steps \
    --save_steps 2500 \
    --num_epochs 3 \
    --early_stopping_patience 1 \
    --output_dir /content/drive/MyDrive/sentiment_analysis/sentiment_analysis \
    --dataset_name amazon_reviews_multi \
    --batch_size 6 \
    --truncation True \
    --padding max_length \
    --path_to_custom_dataset /content/drive/MyDrive/sentiment_analysis_data/amazon_multilingual/zh \
    --mode test \
    --eval_only True \
    --device cuda \
    --path_to_finetuned_model /content/drive/MyDrive/sentiment_analysis/canine-c-finetuned/canine-c_best_model.pt

INFO:__main__:Loading dataset amazon_reviews_multi
DatasetDict({
    train: Dataset({
        features: ['sentence', 'labels'],
        num_rows: 160000
    })
    validation: Dataset({
        features: ['sentence', 'labels'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['sentence', 'labels'],
        num_rows: 4000
    })
})
INFO:__main__:Preparing for model canine-c
Downloading: 100% 657/657 [00:00<00:00, 660kB/s]
Downloading: 100% 892/892 [00:00<00:00, 730kB/s]
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Downloading: 100% 698/698 [00:00<00:00, 638kB/s]
Downloading: 100% 504M/504M [00:28<00:00, 18.8MB/s]
Some weights of CanineForSequenceClassification were not initialized from the model c

### CLI commands to do finetuning on multilingual data

In [None]:
%%script false --no-raise-error

!python3 main.py \
    --model_name canine-s \
    --learning_rate 2e-5 \
    --weight_decay 1e-2 \
    --type_lr_scheduler linear \
    --warmup_ratio 0.1 \
    --save_strategy steps \
    --save_steps 2500 \
    --num_epochs 3 \
    --early_stopping_patience 1 \
    --output_dir /mnt/hdd/sentiment_analysis \
    --dataset_name amazon_reviews_multi \
    --batch_size 6 \
    --truncation True \
    --padding max_length \
    --path_to_custom_dataset /mnt/hdd/sentiment_analysis_data/amazon_multilingual/zh \
    --mode test \
    --device cuda 

## Showcase of the models' capabilities

In [None]:
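from typing import Tuple

import torch
from transformers import (
    BertForSequenceClassification,
    BertTokenizerFast,
    CanineForSequenceClassification,
    CanineTokenizer,
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    PreTrainedModel,
    PreTrainedTokenizer,
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    XLMRobertaTokenizerFast,
)

# Assumption: `Model` is only used as a type hint for any Hugging Face model
Model = PreTrainedModel

CANINE_S_FINETUNED_PATH = "/content/drive/MyDrive/sentiment_analysis/canine-s-finetuned/canine-s_best_model.pt"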
CANINE_C_FINETUNED_PATH = "/content/drive/MyDrive/sentiment_analysis/canine-c-finetuned/canine-c_best_model.pt"
XLM_ROBERTA_FINETUNED_PATH = "/content/drive/MyDrive/sentiment_analysis/xlm_roberta-finetuned/xlm_roberta_best_model.pt"
MBERT_FINETUNED_PATH = (
    "/content/drive/MyDrive/sentiment_analysis/mbert-finetuned/mbert_best_model.pt"
)
BERT_FINETUNED_PATH = (
    "/content/drive/MyDrive/sentiment_analysis/bert-finetuned/bert_best_model.pt"
)
DISTILBERT_FINETUNED_PATH = "/content/drive/MyDrive/sentiment_analysis/distilbert-finetuned/distilbert_best_model.pt"
ROBERTA_FINETUNED_PATH = (
    "/content/drive/MyDrive/sentiment_analysis/roberta-finetuned/roberta_best_model.pt"
)

CANINE_S_MODEL = "canine-s"
CANINE_C_MODEL = "canine-c"
BERT_MODEL = "bert"
MBERT_MODEL = "mbert"
XLM_ROBERTA_MODEL = "xlm_roberta"
ROBERTA_MODEL = "roberta"
DISTILBERT_MODEL = "distilbert"

NUM_LABELS = 2


def test_model(
    model_name: str,
    sentence: str,
    use_finetuned_model: str,
    true_label: int,
    device: str,
) -> None:
    model, tokenizer = _loading_model_and_tokenizer(model_name)
    path_to_finetuned_model = _get_finetuned_model_path(model_name)
    print()
    print("Chosen sentence: ", sentence)

    if use_finetuned_model == "Yes":
        use_finetuned_model = True
    else:
        use_finetuned_model = False

    if use_finetuned_model:
        print()
        print("Using own finetuned model on SST2")
        print("Loading finetuned model")
        model.load_state_dict(torch.load(path_to_finetuned_model, map_location=device))
        _helper_test(
            model,
            tokenizer,
            sentence,
            true_label,
        )
    else:
        print("Using initial model, not finetuned for prediction:")
        _helper_test(
            model,
            tokenizer,
            sentence,
            true_label,
        )


def _helper_test(
    model: Model,
    tokenizer: PreTrainedTokenizer,
    sentence: str,
    label: int,
) -> None:
    inputs = tokenizer(
        sentence,
        return_tensors="pt",
    )

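    # Batch of one label, matching the single input sentence
    labels = torch.tensor([label])
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs, labels=labels)
    predicted_class_id = outputs.logits.argmax().item()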

    print(f"Model predicted class: {predicted_class_id}")
    print(f"Loss is: {round(outputs.loss.item(), 2)}")


def _loading_model_and_tokenizer(model_name: str) -> Tuple[Model, PreTrainedTokenizer]:
    if model_name in [CANINE_C_MODEL, CANINE_S_MODEL]:
        pretrained_model_name = f"google/{model_name}"
        tokenizer = CanineTokenizer.from_pretrained(pretrained_model_name)
        model = CanineForSequenceClassification.from_pretrained(
            pretrained_model_name, num_labels=NUM_LABELS
        )
    else:
        if model_name == BERT_MODEL:
            pretrained_model_name = "bert-base-uncased"
            tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name)
            model = BertForSequenceClassification.from_pretrained(
                pretrained_model_name, num_labels=NUM_LABELS
            )

        elif model_name == MBERT_MODEL:
            pretrained_model_name = "bert-base-multilingual-cased"
            tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name)
            model = BertForSequenceClassification.from_pretrained(
                pretrained_model_name, num_labels=NUM_LABELS
            )

        elif model_name == XLM_ROBERTA_MODEL:
            pretrained_model_name = "xlm-roberta-base"
            tokenizer = XLMRobertaTokenizerFast.from_pretrained(pretrained_model_name)
            model = RobertaForSequenceClassification.from_pretrained(
                pretrained_model_name, num_labels=NUM_LABELS
            )

        elif model_name == ROBERTA_MODEL:
            pretrained_model_name = "roberta-base"
            tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_model_name)
            model = RobertaForSequenceClassification.from_pretrained(
                pretrained_model_name, num_labels=NUM_LABELS
            )

        elif model_name == DISTILBERT_MODEL:
            pretrained_model_name = "distilbert-base-uncased"
            tokenizer = DistilBertTokenizerFast.from_pretrained(pretrained_model_name)
            model = DistilBertForSequenceClassification.from_pretrained(
                pretrained_model_name, num_labels=NUM_LABELS
            )

        else:
            raise NotImplementedError

    return model, tokenizer


def _get_finetuned_model_path(model_name: str) -> str:
    if model_name == CANINE_C_MODEL:
        path_to_finetuned_model = CANINE_C_FINETUNED_PATH
    elif model_name == CANINE_S_MODEL:
        path_to_finetuned_model = CANINE_S_FINETUNED_PATH
    elif model_name == XLM_ROBERTA_MODEL:
        path_to_finetuned_model = XLM_ROBERTA_FINETUNED_PATH
    elif model_name == MBERT_MODEL:
        path_to_finetuned_model = MBERT_FINETUNED_PATH
    elif model_name == BERT_MODEL:
        path_to_finetuned_model = BERT_FINETUNED_PATH
    elif model_name == DISTILBERT_MODEL:
        path_to_finetuned_model = DISTILBERT_FINETUNED_PATH
    elif model_name == ROBERTA_MODEL:
        path_to_finetuned_model = ROBERTA_FINETUNED_PATH
    else:
        raise NotImplementedError

    return path_to_finetuned_model


In [None]:
user_model_name = input("Choose a model; names are in the list above:")
user_use_finetuned_model = input("Do you want to use our finetuned model in STT2 ? Yes/No ")
user_sentence = input("Type a sentence: ")
user_label = input(
    "Is this sentence negative (0) or positive (1) - put corresponding int: "
)
while user_model_name != "quit":
    test_model(
        model_name=user_model_name,
        sentence=user_sentence,
        use_finetuned_model=user_use_finetuned_model,
        true_label=int(user_label),
        device="cpu",
    )
    print()
    user_model_name = input("Choose a model; names are in the list above:")
    user_use_finetuned_model = input(
        "Do you want to use our finetuned model in STT2 ? Yes/No "
    )
    user_sentence = input("Type a sentence: ")
    user_label = input(
        "Is this sentence negative (0) or positive (1) - put corresponding int: "
    )

Choose a model; names are in the list above:canine-s
Do you want to use our finetuned model in STT2 ? Yes/NoYes
Type a sentence: This is cleary bad ! What are you doing ?? Stop it !
Is this sentence negative (0) or positive (1) - put corresponding int: 0


Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.



Chosen sentence:  This is cleary bad ! What are you doing ?? Stop it !

Using own finetuned model on SST2
Loading finetuned model
Model predicted class: 0
Loss is: 0.0

Choose a model; names are in the list above:canine-s
Do you want to use our finetuned model in STT2 ? Yes/No Yes
Type a sentence: Yes I think your right, this is great ! 
Is this sentence negative (0) or positive (1) - put corresponding int: 1


Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.



Chosen sentence:  Yes I think your right, this is great ! 

Using own finetuned model on SST2
Loading finetuned model
Model predicted class: 1
Loss is: 0.0

Choose a model; names are in the list above:canine-s
Do you want to use our finetuned model in STT2 ? Yes/No No
Type a sentence: Arrête, tu racontes n'importe quoi, ça m'énerve !
Is this sentence negative (0) or positive (1) - put corresponding int: 0


Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.



Chosen sentence:  Arrête, tu racontes n'importe quoi, ça m'énerve !
Using initial model, not finetuned for prediction:
Model predicted class: 0
Loss is: 0.55

Choose a model; names are in the list above:canine-s
Do you want to use our finetuned model in STT2 ? Yes/No Yes
Type a sentence: Arrête, tu racontes n'importe quoi, ça m'énerve !
Is this sentence negative (0) or positive (1) - put corresponding int: 0


Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.
Using unk_token, but it is not set yet.



Chosen sentence:  Arrête, tu racontes n'importe quoi, ça m'énerve !

Using own finetuned model on SST2
Loading finetuned model
Model predicted class: 0
Loss is: 0.0

Choose a model; names are in the list above:quit
Do you want to use our finetuned model in STT2 ? Yes/No quit
Type a sentence: quit
Is this sentence negative (0) or positive (1) - put corresponding int: quit


## Results

### Binary Sentiment Classification on SST2

Models were trained with the following parameters:

|             	| Batch size 	| Learning rate 	| Weight decay 	| Nb of epochs 	 | Number of training examples 	| Number of validation examples 	| LR scheduler 	| Warmup ratio 	|
|-------------	|------------	|---------------	|-------------	|----------------|-----------------------------	|-------------------------------	|--------------	|--------------	|
| RoBERTa     	| 12         	| 2e-5          	| 1e-2        	| 3            	 | 63981                       	| 872                           	| linear       	| 0.1          	|
| BERT        	| 12         	| 2e-5          	| 1e-2        	| 3            	 | 63981                       	| 872                           	| linear       	| 0.1          	|
| DistilBERT  	| 12         	| 2e-5          	| 1e-2        	| 3            	 | 63981                       	| 872                           	| linear       	| 0.1          	|
| mBERT       	| 12         	| 2e-5          	| 1e-2        	| 3            	 | 63981                       	| 872                           	| linear       	| 0.1          	|
| XLM-RoBERTa 	| 12         	| 2e-5          	| 1e-2        	| 3            	 | 63981                       	| 872                           	| linear       	| 0.1          	|
| CANINE-C    	| 6          	| 2e-5          	| 1e-2        	| 3            	 | 63981                       	| 872                           	| linear       	| 0.1          	|
| CANINE-S    	| 6          	| 2e-5          	| 1e-2        	| 3            	 | 63981                       	| 872                           	| linear       	| 0.1          	|
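
To make the scheduler settings in the table concrete, here is a minimal sketch of a linear schedule with warmup. It assumes that `linear` with a warmup ratio of 0.1 means a linear ramp from 0 to the base learning rate over the first 10% of optimizer steps, followed by a linear decay to 0, and that there is one optimizer step per batch (no gradient accumulation). The function name `linear_warmup_decay_lr` is hypothetical and is not part of the project code.

```python
import math


def linear_warmup_decay_lr(
    step: int, total_steps: int, base_lr: float, warmup_ratio: float
) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: go linearly from base_lr down to 0 at the last step
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


# Values from the table (BERT-like models): 63981 examples, batch size 12, 3 epochs.
# Assumption: one optimizer step per batch.
steps_per_epoch = math.ceil(63981 / 12)
total_steps = steps_per_epoch * 3

for step in (0, total_steps // 10, total_steps // 2, total_steps):
    lr = linear_warmup_decay_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

With a batch size of 6 (CANINE models), the same function applies with twice as many steps per epoch.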


Obtained results:

|   Accuracy  	| Val set 	| Test set 	|
|:-----------:	|:-------:	|:--------:	|
|     BERT    	|   0.94  	|   0.93   	|
|   RoBERTa   	|   0.94  	|   0.94   	|
|  DistilBERT 	|   0.94  	|   0.91   	|
|    mBERT    	|   0.93  	|   0.88   	|
| XLM-RoBERTa 	|   0.92  	|   0.92   	|
|   CANINE-C  	|   0.93  	|   0.86   	|
|   CANINE-S  	|   0.92  	|   0.85   	|

In this setting, both CANINE-S and CANINE-C perform decently on the validation set but noticeably worse on the test set.
For instance, CANINE-C trails RoBERTa by 8 percentage points on the test set (0.86 vs. 0.94). mBERT behaves similarly to
the two CANINE models.


### Robustness to noise

In this experiment, the goal is to evaluate the models' robustness to noise. To do so, we created 3 noisy versions of the
SST2 dataset in which the sentences have been artificially corrupted with noise. We chose ``RandomCharAug`` from the
``nlpaug`` library with action `substitute`; 4 other types of noise are implemented in our package (refer
to `noisifier/noisifier.py`).

Three levels of noise were chosen: 10\%, 20\% and 40\%. Each word is transformed with probability $p$ into a misspelled
version of itself (see [nlpaug documentation](https://github.com/makcedward/nlpaug/blob/master/nlpaug/augmenter/char/random.py)
for more information).
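
As an illustration of this kind of noise, the sketch below corrupts each word with probability $p$ by substituting one of its characters. It is a simplified stand-in written with the standard library only, not the `nlpaug` implementation: the function name `substitute_random_chars`, the single substitution per word and the example sentence are all assumptions made for the illustration.

```python
import random
import string


def substitute_random_chars(sentence: str, p: float, rng: random.Random) -> str:
    """Corrupt each word with probability `p` by substituting one of its characters."""
    noisy_words = []
    for word in sentence.split():
        if rng.random() < p:
            position = rng.randrange(len(word))
            # The replacement is drawn at random and may equal the original character
            replacement = rng.choice(string.ascii_lowercase)
            word = word[:position] + replacement + word[position + 1:]
        noisy_words.append(word)
    return " ".join(noisy_words)


rng = random.Random(0)  # fixed seed so the corruption is reproducible
clean = "a gripping and surprisingly funny movie"  # hypothetical example sentence
for p in (0.1, 0.2, 0.4):
    print(p, "->", substitute_random_chars(clean, p, rng))
```

Higher values of `p` corrupt more words on average, which is what the three noise levels control.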

The noise is **only** applied to the SST2 validation and test sets, made of 3368 and 872 examples respectively.
We compared the 7 models finetuned on the clean version of SST2 (first experiment) on these 3 noisy datasets (one for
each level of $p$). The following table gathers the results (averaged over 3 runs):

|             	| Noise level 10% 	|          	| Noise level 20% 	|          	| Noise level 40% 	|          	|
|:-----------:	|:---------------:	|:--------:	|:---------------:	|:--------:	|:---------------:	|:--------:	|
|             	|     Val set     	| Test set 	|     Val set     	| Test set 	|     Val set     	| Test set 	|
|     BERT    	|       0.88      	|   0.87   	|       0.85      	|   0.82   	|       0.80      	|   0.80   	|
|   RoBERTa   	|       0.88      	|   0.89   	|       0.87      	|   0.85   	|       0.83      	|   0.82   	|
|  DistilBERT 	|       0.85      	|   0.82   	|       0.82      	|   0.79   	|       0.76      	|   0.76   	|
|    mBERT    	|       0.88      	|   0.82   	|       0.85      	|   0.80   	|       0.80      	|   0.76   	|
| XLM-RoBERTa 	|       0.89      	|   0.85   	|       0.86      	|   0.83   	|       0.81      	|   0.81   	|
|   CANINE-C  	|       0.86      	|   0.80   	|       0.83      	|   0.76   	|       0.79      	|   0.74   	|
|   CANINE-S  	|       0.85      	|   0.80   	|       0.83      	|   0.77   	|       0.78      	|   0.74   	|

At the highest noise level (40\%), both CANINE models outperform DistilBERT on the validation set (0.79 and 0.78 vs. 0.76)
but not on the test set (0.74 vs. 0.76). All other models handle this type of artificial noise better, RoBERTa being the
most robust overall.
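
One way to read these numbers is to compute how many percentage points each model loses between the clean SST2 test set and the test set with 40\% noise. The short sketch below does this arithmetic using only the test accuracies reported in the two tables. It assumes that both test sets contain the same 872 examples, so that the accuracies are directly comparable.

```python
# Test accuracies copied from the tables: clean SST2 and SST2 with 40% noise.
clean_test = {
    "BERT": 0.93, "RoBERTa": 0.94, "DistilBERT": 0.91, "mBERT": 0.88,
    "XLM-RoBERTa": 0.92, "CANINE-C": 0.86, "CANINE-S": 0.85,
}
noisy_40_test = {
    "BERT": 0.80, "RoBERTa": 0.82, "DistilBERT": 0.76, "mBERT": 0.76,
    "XLM-RoBERTa": 0.81, "CANINE-C": 0.74, "CANINE-S": 0.74,
}

# Drop in percentage points, rounded to avoid floating-point artifacts
drops = {
    model: round((clean_test[model] - noisy_40_test[model]) * 100)
    for model in clean_test
}

for model, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{model:<12} -{drop} points")
```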

### Sentiment Classification on more challenging Sentiment140 dataset (tweets)

This experiment is meant to evaluate the performance of the various models on a more challenging dataset:
Sentiment140. This dataset is made of 1.6 million tweets, all in English. The language used is very different from the
one in SST2 as it contains more abbreviations, colloquialisms, slang, etc. Such text, which is "naturally" noisy, is
therefore expected to be hard for the models to handle. CANINE has a theoretical advantage on such a dataset because it is
tokenizer-free and operates at the character level.
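
The toy sketch below illustrates this theoretical advantage. It contrasts a vocabulary lookup, where an unseen word collapses to a single unknown id, with a character-level encoding based on Unicode code points, where every character always gets an id. The whitespace vocabulary is a deliberate simplification (real subword tokenizers would split an unseen word into smaller pieces rather than a single unknown id), and the function names, the vocabulary and the tweet are hypothetical.

```python
from typing import Dict, List

UNK_ID = 0  # id reserved for unknown words in the toy vocabulary


def encode_with_vocab(text: str, vocab: Dict[str, int]) -> List[int]:
    """Toy whitespace tokenizer: any word missing from `vocab` collapses to UNK_ID."""
    return [vocab.get(word, UNK_ID) for word in text.lower().split()]


def encode_as_codepoints(text: str) -> List[int]:
    """Character-level encoding: each character becomes its Unicode code point."""
    return [ord(char) for char in text]


toy_vocab = {"this": 1, "movie": 2, "is": 3, "great": 4}
tweet = "this movie is gr8"

print(encode_with_vocab(tweet, toy_vocab))  # "gr8" is out of vocabulary
print(encode_as_codepoints("gr8"))  # every character still gets an id
```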

The following table reports the results we obtained when finetuning all models on the (smaller) training set of 63360 examples.

|                 	| **Val set** 	 | **Test set** 	 |
|:---------------:	|:-------------:|:--------------:|
|     **BERT**    	|   0.84    	   |   0.86     	   |
|   **RoBERTa**   	|   0.87    	   |   0.86     	   |
|  **DistilBERT** 	|   0.83    	   |   0.85     	   |
|    **mBERT**    	|   0.79    	   |   0.78     	   |
| **XLM-RoBERTa** 	|   0.81    	   |   0.80     	   |
|   **CANINE-C**  	|   0.79    	   |   0.78     	   |
|   **CANINE-S**  	|   0.80    	   |   0.79     	   |


### Few-shot learning and domain adaptation

The goal of this experiment is to measure the ability of CANINE (and other models) to transfer to unseen data, in 
another domain. This could either be done in zero-shot or few-shot settings. Here we decided to go with the latter as it 
is more realistic. In real life, a company might already have a custom small database of labeled documents and questions 
associated (manually created) but would want to deploy a Question Answering system on the whole unlabeled database. The 
CUAD dataset is perfect for this task as it is highly specialized (legal domain, legal contract review). The training set 
is made of 22450 question/context pairs and the test set of 4182. We randomly selected 1\% of the training set (224 examples) 
to train on for 3 epochs, using the previously finetuned models on SQuADv2. Then each model was evaluated on 656 test examples. 
Results are reported in the following table. To ensure a fair comparison, all models were trained and tested on the
exact same examples.

|                 	| **F1 score** 	| **EM score** 	|
|:---------------:	|:------------:	|:------------:	|
|     **BERT**    	|     74.18    	|     72.72    	|
|   **RoBERTa**   	|     73.83    	|     72.24    	|
|  **DistilBERT** 	|     72.86    	|     71.37    	|
|    **mBERT**    	|     74.50    	|     73.12    	|
| **XLM-RoBERTa** 	|     76.64    	|     73.44    	|
|   **CANINE-C**  	|     72.51    	|     71.39    	|
|   **CANINE-S**  	|     72.27    	|     71.27    	|

### Zero-shot transfer learning and domain adaptation from SST2 to Sentiment140

In this experiment, we would like to see how CANINE models perform when they are faced with "natural" noise that they were
**not** trained on. Unlike the Sentiment140 experiment above, where models were finetuned on Sentiment140, here models are
finetuned on SST2 but evaluated on the validation and test sets of Sentiment140.

In the Sentiment140 finetuning task, CANINE models were not the best performing ones: together with mBERT, they ranked
last. Here we are evaluating something different: the ability of a model to adapt to another domain (in the sense that the
topic and the way of writing/speaking are different) in a zero-shot transfer setting. It might be that, in real life
settings, one has access to a clean benchmark-type dataset (such as SST2) but wants to do inference on a dataset whose
subject is quite different and which is full of misspellings and grammar errors.

Results are reported in the following table:

|                 	| **Val set** 	| **Test set** 	|
|:---------------:	|:-----------:	|:------------:	|
|     **BERT**    	|     0.72    	|     0.84     	|
|   **RoBERTa**   	|     0.73    	|     0.88     	|
|  **DistilBERT** 	|     0.71    	|     0.82     	|
|    **mBERT**    	|     0.68    	|     0.76     	|
| **XLM-RoBERTa** 	|     0.72    	|     0.83     	|
|   **CANINE-C**  	|     0.64    	|     0.77     	|
|   **CANINE-S**  	|     0.64    	|     0.73     	|

CANINE models do not perform well on this task. On the validation set, they trail RoBERTa (the best performing model on
this task) by 9 percentage points of accuracy. We noticed that mBERT has more difficulties than other BERT-like models on
the Sentiment140 dataset overall. Again, CANINE and mBERT behave similarly.

### Zero-shot transfer learning on multilingual data

This experiment builds on the idea that CANINE is expected to perform better on languages whose morphology differs from
that of English, for instance non-concatenative morphology (such as Arabic and Hebrew), compounding (such as German and
Japanese), vowel harmony (Finnish), etc. Moreover, it is known that splitting on whitespace (which most tokenizers
do, although SentencePiece has an option to skip whitespace splitting) is not adapted to languages such as Thai
or Chinese.

In this experiment, models have been finetuned on the English dataset SST2 and are only evaluated on the validation and
test sets of 4 languages from the Multilingual Amazon Reviews Corpus ([MARC](https://arxiv.org/abs/2010.02573)). We
considered the following four languages for their morphological properties: German, French, Japanese and Chinese.

For each review, this dataset contains the number of stars given by the reviewer. To derive positive/negative
sentiment from this, we considered a review negative if it received 1 or 2 stars and positive if it received 4 or 5
stars. Neutral reviews, with 3 stars, were not considered. For each language, this gives us 160000 training samples,
4000 validation samples and 4000 test samples.
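
The labeling rule above can be sketched as follows. This is a minimal illustration rather than the project's preprocessing code: the function name `stars_to_label` and the example reviews are hypothetical. The output keys `sentence` and `labels` follow the dataset features printed earlier in this notebook, and the convention 0 = negative, 1 = positive follows the showcase section.

```python
from typing import Optional


def stars_to_label(stars: int) -> Optional[int]:
    """Map a 1-5 star rating to 0 (negative), 1 (positive) or None (neutral, dropped)."""
    if stars in (1, 2):
        return 0
    if stars in (4, 5):
        return 1
    if stars == 3:
        return None
    raise ValueError(f"Unexpected star rating: {stars}")


# Hypothetical reviews, only used to illustrate the filtering.
reviews = [("Sehr gut", 5), ("Ganz okay", 3), ("Schlecht", 1)]
labeled = [
    {"sentence": text, "labels": stars_to_label(stars)}
    for text, stars in reviews
    if stars_to_label(stars) is not None  # drop neutral 3-star reviews
]
print(labeled)
```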

Results are given in the following table:

|                 	|  **French** 	|              	|  **German** 	|              	| **Japanese** 	|              	| **Chinese** 	|              	|
|:---------------:	|:-----------:	|:------------:	|:-----------:	|:------------:	|:------------:	|:------------:	|:-----------:	|:------------:	|
|                 	| **Val set** 	| **Test set** 	| **Val set** 	| **Test set** 	|  **Val set** 	| **Test set** 	| **Val set** 	| **Test set** 	|
|    **mBERT**    	|     0.71    	|     0.70     	|     0.66    	|     0.66     	|     0.56     	|     0.55     	|     0.58    	|     0.59     	|
| **XLM-RoBERTa** 	|     0.87    	|     0.86     	|     0.86    	|     0.87     	|     0.87     	|     0.85     	|     0.80    	|     0.79     	|
|   **CANINE-C**  	|     0.70    	|     0.69     	|     0.59    	|     0.58     	|     0.50     	|     0.50     	|     0.57    	|     0.55     	|
|   **CANINE-S**  	|     0.71    	|     0.70     	|     0.61    	|     0.61     	|     0.52     	|     0.52     	|     0.57    	|     0.57     	|

CANINE-S is similar to mBERT on French and Chinese data. Overall, XLM-RoBERTa is substantially better than the other 
models: it is 16 to 31 percentage points ahead of the next-best model, depending on the language. Note that its 
pre-training setup differs from that of mBERT and CANINE. Indeed, while mBERT and CANINE were both pretrained on the 
104 languages with the largest Wikipedias using an MLM objective, XLM-RoBERTa was pretrained on 2.5TB of filtered 
CommonCrawl data covering 100 languages. This might be a confounding variable.
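
As a rough summary of the table, one can average the test-set accuracies over the four languages. The sketch below only reuses the numbers reported above; the unweighted mean is our own choice of summary statistic, not a metric used in the experiments.

```python
from statistics import mean

# Zero-shot test-set accuracies copied from the table above
# (order: French, German, Japanese, Chinese).
test_accuracy = {
    "mBERT": [0.70, 0.66, 0.55, 0.59],
    "XLM-RoBERTa": [0.86, 0.87, 0.85, 0.79],
    "CANINE-C": [0.69, 0.58, 0.50, 0.55],
    "CANINE-S": [0.70, 0.61, 0.52, 0.57],
}

# Unweighted mean over languages, rounded to 4 decimals.
mean_accuracy = {
    model: round(mean(scores), 4) for model, scores in test_accuracy.items()
}

# Print models from best to worst mean accuracy.
for model, value in sorted(mean_accuracy.items(), key=lambda item: -item[1]):
    print(f"{model:12s} {value:.4f}")
```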

### Finetuning on multilingual data
 
In this last experiment, we compare CANINE to other BERT-like models on multilingual data, this time finetuning them 
directly on that data rather than on SST2. This is the difference with the previous experiment. To do so, we again 
work with the MARC dataset, using data in German, Japanese and Chinese. We would like to see how CANINE compares and 
whether it does better on languages which are more challenging for token-based models (Chinese for instance). 

Please note that due to time and compute constraints, we considered only one CANINE model, CANINE-S. 

The results are given below:

|             	|  German 	|          	| Japanese 	|          	| Chinese 	|          	|
|:-----------:	|:-------:	|:--------:	|:--------:	|:--------:	|:-------:	|:--------:	|
|             	| Val set 	| Test set 	|  Val set 	| Test set 	| Val set 	| Test set 	|
|    mBERT    	|   0.93  	|   0.93   	|   0.92   	|   0.92   	|   0.87  	|   0.88   	|
| XLM-RoBERTa 	|   0.92  	|   0.92   	|   0.93   	|   0.93   	|   0.88  	|   0.88   	|
|   CANINE-S  	|   0.93  	|   0.93   	|   0.90   	|   0.89   	|   0.85  	|   0.85   	|

Quite surprisingly, on German, CANINE-S is slightly better than XLM-RoBERTa and performs on par with mBERT. 
However, this is not the case on Japanese and Chinese, where mBERT and especially XLM-RoBERTa should be preferred, as 
they provide better accuracy on both validation and test sets. 
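
To make the comparison concrete, the gap between CANINE-S and the best token-based model can be computed per language from the test-set column of the table. This is a small sketch that only reuses the reported numbers; the helper name `gap_to_best_baseline` is ours.

```python
# Test-set accuracies copied from the fine-tuning table above.
test_accuracy = {
    "German": {"mBERT": 0.93, "XLM-RoBERTa": 0.92, "CANINE-S": 0.93},
    "Japanese": {"mBERT": 0.92, "XLM-RoBERTa": 0.93, "CANINE-S": 0.89},
    "Chinese": {"mBERT": 0.88, "XLM-RoBERTa": 0.88, "CANINE-S": 0.85},
}


def gap_to_best_baseline(scores, model="CANINE-S"):
    """Difference, in percentage points, between `model` and the best other model."""
    best_other = max(value for name, value in scores.items() if name != model)
    return round(100 * (scores[model] - best_other))


gaps = {
    language: gap_to_best_baseline(scores)
    for language, scores in test_accuracy.items()
}
print(gaps)
```

A negative value means CANINE-S trails the best token-based model on that language.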

### Analysis of prediction errors on SST2 dataset


For this section, please take a look at the following [Colab notebook]().

## Discussion

From our sentiment classification (SC) experiments, overall, other BERT-like models were better than CANINE. Note however that on most tasks CANINE performs similarly to mBERT and is sometimes slightly better. For general tasks in English, or more generally in languages for which a lot of resources are available, XLM-R and/or RoBERTa are better. We were not able to show in our experiments that CANINE is better than tokenizer-based BERT-like models, even on more challenging and complex languages such as Thai, Chinese, Japanese or Arabic. When finetuning for binary classification on German, Japanese and Chinese, CANINE-S was slightly better than XLM-R on German (a West Germanic language, closely related to English). This was not the case on Japanese and Chinese, where mBERT and XLM-R should be preferred (about +3pp in accuracy). 