<a href="https://colab.research.google.com/github/WilderGitHub/datascience/blob/main/Text_classification_on_IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Amazon Reviews Classification with dynamic padding for faster training**

In this notebook, we fine-tune an `xlm-roberta-base` model on the multilingual Amazon Reviews dataset. The model attains accuracy comparable to the state of the art. We also implement dynamic padding to speed up model training.


If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following install cells to do so.

In [None]:
! pip install transformers


## Loading the dataset

In [None]:
!pip install datasets


We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [None]:
!pip install accelerate



In [None]:
from datasets import load_dataset, load_metric

In [None]:
dataset = load_dataset('amazon_reviews_multi')


The `dataset` object is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation and test sets.

As you can see below, the dataset has 1.2M training examples and 30K examples each for validation and test.

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 1200000
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 30000
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 30000
    })
})

To access an actual element, you need to select a split first, then give an index:

In [None]:
dataset["train"][0]

{'review_id': 'de_0203609',
 'product_id': 'product_de_0865382',
 'reviewer_id': 'reviewer_de_0267719',
 'stars': 1,
 'review_body': 'Armband ist leider nach 1 Jahr kaputt gegangen',
 'review_title': 'Leider nach 1 Jahr kaputt',
 'language': 'de',
 'product_category': 'sports'}

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(dataset["train"])

Unnamed: 0,review_id,product_id,reviewer_id,stars,review_body,review_title,language,product_category
0,fr_0350607,product_fr_0607824,reviewer_fr_0441937,2,ATTENTION ! Bien lire la composition ! Il y a de l huile de castor ..... pour un produit bio ça va à contre courant . Donc l éthique n'y est pas,ATTENTION,fr,beauty
1,fr_0935377,product_fr_0326248,reviewer_fr_0044910,2,Usage personnel. Je ne suis pas pleinement satisfait de ce produit parce qu il ne fonctionne pas du tout de la même façon que la vrai coque samsung...,Pale imitation de la vrai coque samsung,fr,wireless
2,en_0543479,product_en_0152985,reviewer_en_0753379,1,It's yellow and super bendy. No way will this protect the screen if the phone is dropped.,Crap,en,wireless
3,zh_0018893,product_zh_0534709,reviewer_zh_0703807,5,产品收到了，很满意，包装结实，牢靠，一看就是专业的卖家，服务态度也好，产品的质量不错，物美价廉，性价比高！,满意,zh,pc
4,fr_0751691,product_fr_0503677,reviewer_fr_0295875,3,"Je commande très souvent sur votre site, mais cette fois, je suis déçue par le fait que la boite de thé n'était pas dans une boite en carton amazon, personne n'a besoin de connaitre mes achats",Manque d'emballage,fr,grocery
5,es_0750952,product_es_0199459,reviewer_es_0184807,3,"El tamaño de los ojetes no es real con la foto, ni el embalse concuerda con lo anunciado. Pero su utilización, es cómoda y el resultado bueno.",Fácil utilización,es,home_improvement
6,fr_0773376,product_fr_0277870,reviewer_fr_0284997,1,Je me demande si les wiko view de cette série ont été vérifiés avant mise sur marché. C'est quand même curieux qu'il y ait trop de défauts sur ce téléphone ! J'aime bien Wiko mais la je suis déçue.,Problèmes avec entrée et sortie audio,fr,wireless
7,ja_0287622,product_ja_0277924,reviewer_ja_0214626,1,Amazonさんへ 手続きが間違えたとおもって配達されていないので❗️書店で購入しました こんな間違えがあるのですね クレジットカード払いが引き落とされていたなら❗️本は届かなくて料金だけ払ったのでしょうか？ 本は届かないです。,本は届かない,ja,digital_ebook_purchase
8,zh_0833203,product_zh_0767787,reviewer_zh_0968668,2,字迹算清晰，但纸张和印刷水准来说。跟淘宝盗版如出一撤。慎入！,大概率非正版书,zh,book
9,en_0042364,product_en_0532556,reviewer_en_0542477,5,First time that I've actually received what I was looking for! Thank you for your product and fast delivery!,Fluorescent Glow in The Dark Paint,en,home


As can be seen, the dataset has reviews in several languages. The `review_body` column holds the review text and the `stars` column holds the rating. Ratings range from 1 to 5, with an equal share of examples in each class.
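The class balance can be checked by counting the ratings. The snippet below is a minimal sketch that uses a short hypothetical list in place of `dataset["train"]["stars"]`, so it runs without downloading anything.

```python
from collections import Counter

# Hypothetical stand-in for dataset["train"]["stars"]; the real column has 1.2M entries.
stars = [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]

counts = Counter(stars)
total = len(stars)

# Share of each rating, keyed by star value in ascending order
shares = {star: counts[star] / total for star in sorted(counts)}
print(shares)
```

On the real column, five roughly equal shares would confirm that plain accuracy is a reasonable metric for this task.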

The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [None]:
metric = load_metric('accuracy')
metric


Metric(name: "accuracy", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights Defaults to None.

Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`.. A higher score means higher accuracy.

Examples:

    Example 1-A simple example
        >>> accuracy_metric = datasets.load_metric("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
        >>> print(results)
        {'accuracy': 0.5}

   

In [None]:
f1_metric = load_metric('f1')
f1_metric


Metric(name: "f1", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    labels (`list` of `int`): The set of labels to include when `average` is not set to `'binary'`, and the order of the labels if `average` is `None`. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class. Labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in `predictions` and `references` are used in sorted order. Defaults to None.
    pos_label (`int`): The class to be considered the positive class, in the case where `average` is set to `binary`. Defaults to 1.
    average (`string`): This parameter is required for multiclass/multilabel targets. If set to `None`, the sco

You can call its `compute` method with your predictions and labels directly and it will return a dictionary with the metric(s) value:

In [None]:
import numpy as np

fake_preds = np.random.randint(1, 6, size=(64,))
fake_labels = np.random.randint(1, 6, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)

{'accuracy': 0.125}

In [None]:
f1_metric.compute(predictions=fake_preds, references=fake_labels, average='weighted')

{'f1': 0.1281752786801044}
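To make the accuracy numbers easier to interpret, here is a minimal pure-Python sketch of what the metric computes, checked against the example from the metric's own docstring. The helper name `accuracy` is ours, not part of 🤗 Datasets. With five balanced classes, random guessing is expected to score around 20%, so the value obtained above on 64 random samples is in the range one would expect from chance.

```python
def accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Example taken from the docstring of the accuracy metric
example = accuracy([0, 1, 1, 2, 1, 0], [0, 1, 2, 0, 1, 2])

# Expected accuracy of uniform random guessing over 5 balanced classes
chance_level = 1 / 5
print(example, chance_level)
```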

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer` which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

Since we are dealing with multilingual data, we will use the `xlm-roberta-base` checkpoint of XLM-RoBERTa, a model pretrained on text in 100 languages. For more model options, see the [multilingual models documentation](https://huggingface.co/transformers/multilingual.html).

To speed up the training process, we train on a fraction of the data: one tenth of the training split and one fifth of the validation split.

In [None]:
do_shard = True
if do_shard:
    dataset = dataset.shuffle(seed=123)
    train_dataset = dataset["train"].shard(index=1, num_shards=10)
    val_dataset = dataset['validation'].shard(index=1, num_shards=5)
else:
    train_dataset = dataset['train']
    val_dataset = dataset['validation']
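As a rough mental model of `shard(index=1, num_shards=10)`, the sketch below takes every tenth row starting at row 1. This assumes the non-contiguous behavior (`contiguous=False`), which we believe is the default in the Datasets 2.x release installed above; other releases may default to contiguous shards, so treat this as an illustration rather than a specification. The helper `strided_shard` is hypothetical.

```python
def strided_shard(items, index, num_shards):
    """Keep every `num_shards`-th item, starting at position `index`."""
    return items[index::num_shards]


rows = list(range(20))
shard = strided_shard(rows, index=1, num_shards=10)
print(shard)  # [1, 11]

# Sizes for this notebook: 1,200,000 / 10 training rows and 30,000 / 5 validation rows
train_size = len(strided_shard(range(1_200_000), 1, 10))
val_size = len(strided_shard(range(30_000), 1, 5))
print(train_size, val_size)
```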

In [None]:
train_dataset[0:5]

{'review_id': ['fr_0929862',
  'en_0807143',
  'ja_0303398',
  'es_0712203',
  'es_0203937'],
 'product_id': ['product_fr_0404142',
  'product_en_0971943',
  'product_ja_0119647',
  'product_es_0814248',
  'product_es_0013336'],
 'reviewer_id': ['reviewer_fr_0266352',
  'reviewer_en_0699711',
  'reviewer_ja_0129893',
  'reviewer_es_0512166',
  'reviewer_es_0521528'],
 'stars': [5, 2, 5, 1, 5],
 'review_body': ["Magnifique sac à bandoulière pour y loger mon MAC Book air 13.3 pouces, qui de ce fait, se retrouve très bien protégé. Une tablette de 10 pouces trouve aussi sa place à côté du Mac...Poche avant profonde pouvant être utilisée pour ranger divers câbles. Pour ma part, j'ai pu ranger en plus des câbles, mes 2 petits disques durs externes...souris... Rangement aussi possible pour carnets, notes diverses sur format A4. Ensemble très classe. Suis très heureuse de mon achat, que je recommande vivement.",
  'I’ve had good luck with this product when I’ve bought it in stores, but this pa

In [None]:
val_dataset

Dataset({
    features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
    num_rows: 6000
})

In [None]:
from transformers import AutoTokenizer
model_checkpoint = 'xlm-roberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you get an error with the previous call, remove that argument.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [None]:
tokenizer("Hello, this one sentence!", "为什么一个620卖1988，一个卖4299？都一样的吗？")

{'input_ids': [0, 35378, 4, 903, 1632, 149357, 38, 2, 2, 6, 23543, 1860, 910, 1549, 21633, 109332, 4, 1860, 21633, 13023, 5046, 32, 1198, 13326, 43, 9131, 32, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
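In the output above, ID `0` is the `<s>` token and ID `2` is the `</s>` token of the XLM-RoBERTa vocabulary, so a sentence pair is laid out as `<s> A </s></s> B </s>`. The following sketch splits the IDs printed above back into the two sentences; `split_pair` is a hypothetical helper written for this illustration.

```python
BOS_ID, EOS_ID = 0, 2  # <s> and </s> in the XLM-RoBERTa vocabulary


def split_pair(input_ids):
    """Split `<s> A </s></s> B </s>` into the token IDs of A and of B."""
    for i in range(len(input_ids) - 1):
        if input_ids[i] == EOS_ID and input_ids[i + 1] == EOS_ID:
            first = input_ids[1:i]          # drop the leading <s>
            second = input_ids[i + 2:-1]    # drop the trailing </s>
            return first, second
    raise ValueError("no </s></s> separator found")


# IDs copied from the tokenizer output above
ids = [0, 35378, 4, 903, 1632, 149357, 38, 2, 2, 6, 23543, 1860, 910, 1549,
       21633, 109332, 4, 1860, 21633, 13023, 5046, 32, 1198, 13326, 43, 9131, 32, 2]
first, second = split_pair(ids)
print(len(first), len(second))
```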

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model.

We concatenate the `review_body`, `review_title` and `product_category` in a string and pass that to the tokenizer. Concatenating title and product category along with body results in a significant increase in accuracy.

In [None]:
import torch
max_len = 512
pad_to_max = False
def tokenize_data(example):
    # Concatenate the review body, title and product category into one string
    text_ = example['review_body'] + " " + example['review_title'] + " " + example['product_category']
    # `padding` and `truncation` replace the deprecated `pad_to_max_length` argument.
    # With pad_to_max=False, padding is left to the data collator (dynamic padding).
    encodings = tokenizer(text_,
                          padding='max_length' if pad_to_max else False,
                          truncation=True,
                          max_length=max_len,
                          add_special_tokens=True,
                          return_token_type_ids=False,
                          return_attention_mask=True,
                          return_overflowing_tokens=False,
                          return_special_tokens_mask=False,
                          )

    # Subtract 1 from labels to have them in range 0-4
    targets = torch.tensor(example['stars']-1,dtype=torch.long)

    encodings.update({'labels': targets})
    return encodings



In [None]:
tokenize_data(dataset['train'][0]).keys()

dict_keys(['input_ids', 'attention_mask', 'labels'])
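Leaving the tokenizer aside, the two data transformations in `tokenize_data` are simple: join three text fields with spaces, and shift the star rating from 1-5 to a 0-4 class index. The sketch below applies them to the first training row shown earlier; `build_example` is a hypothetical helper used only for illustration.

```python
def build_example(row):
    """Return the concatenated input text and the zero-based label for one review."""
    text = " ".join([row["review_body"], row["review_title"], row["product_category"]])
    label = row["stars"] - 1  # classification heads expect labels starting at 0
    return text, label


# Fields copied from dataset["train"][0] shown earlier
row = {
    "stars": 1,
    "review_body": "Armband ist leider nach 1 Jahr kaputt gegangen",
    "review_title": "Leider nach 1 Jahr kaputt",
    "product_category": "sports",
}
text, label = build_example(row)
print(text)
print(label)
```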

To apply this function to every example, we use the `map` method of the `Dataset` objects we created earlier. We call it once on the training shard and once on the validation shard.

In [None]:
encoded_train_dataset = train_dataset.map(tokenize_data)
encoded_val_dataset = val_dataset.map(tokenize_data)


In [None]:
encoded_train_dataset.column_names

['review_id',
 'product_id',
 'reviewer_id',
 'stars',
 'review_body',
 'review_title',
 'language',
 'product_category',
 'input_ids',
 'attention_mask',
 'labels']

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data should not be reused). For instance, it will detect if you edit the body of `tokenize_data` and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

We can also pass `batched=True` to encode the texts in batches. This leverages the full benefit of the fast tokenizer we loaded earlier, which uses multi-threading to process the texts in a batch concurrently. Note that `tokenize_data` as written handles one example at a time, so it would have to be adapted to accept lists of values.
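A batched variant could look like the sketch below. With `batched=True`, `map` passes a dictionary of lists rather than a single row, so the function builds a list of texts and a list of labels. `tokenize_batch` is a hypothetical name, and `toy_tokenizer` is a whitespace stand-in that only mimics the shape of the real tokenizer's output so the example runs without downloading a model.

```python
def tokenize_batch(batch, tokenizer, max_len=512):
    """Tokenize a batch given as a dict of lists, the shape `map(batched=True)` passes in."""
    texts = [
        f"{body} {title} {category}"
        for body, title, category in zip(
            batch["review_body"], batch["review_title"], batch["product_category"]
        )
    ]
    encodings = dict(tokenizer(texts, truncation=True, max_length=max_len))
    # Shift star ratings from 1-5 to class indices 0-4
    encodings["labels"] = [stars - 1 for stars in batch["stars"]]
    return encodings


def toy_tokenizer(texts, truncation=True, max_length=512):
    """Hypothetical whitespace tokenizer: each word's 'ID' is just its length."""
    ids = [[len(word) for word in text.split()][:max_length] for text in texts]
    return {"input_ids": ids, "attention_mask": [[1] * len(seq) for seq in ids]}


batch = {
    "review_body": ["Great bag", "Broke after a year"],
    "review_title": ["Love it", "Disappointed"],
    "product_category": ["luggage", "sports"],
    "stars": [5, 1],
}
encoded = tokenize_batch(batch, toy_tokenizer)
print(encoded["labels"])
```

With the real tokenizer, the call would be along the lines of `train_dataset.map(lambda b: tokenize_batch(b, tokenizer), batched=True)`.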

## Fine-tuning the model

In [None]:
def pad_seq(seq, max_batch_len, pad_value):
    return seq + (max_batch_len - len(seq)) * [pad_value]

In [None]:
from dataclasses import dataclass

@dataclass
class SmartCollator:
    """Pads each batch only to the length of its longest sequence (dynamic padding)."""
    pad_token_id: int

    def __call__(self, batch):
        batch_inputs = []
        batch_attention_masks = []
        labels = []
        # Longest sequence in this batch; shorter ones are padded up to it
        max_size = max(len(ex['input_ids']) for ex in batch)
        for item in batch:
            batch_inputs.append(pad_seq(item['input_ids'], max_size, self.pad_token_id))
            # Padded positions get attention mask 0 so the model ignores them
            batch_attention_masks.append(pad_seq(item['attention_mask'], max_size, 0))
            labels.append(item['labels'])

        return {"input_ids": torch.tensor(batch_inputs, dtype=torch.long),
                "attention_mask": torch.tensor(batch_attention_masks, dtype=torch.long),
                "labels": torch.tensor(labels, dtype=torch.long)
                }
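To see why padding per batch is cheaper than padding everything to `max_len`, the sketch below counts how many token positions the model would have to process under each strategy. The sequence lengths are made up for illustration, and `padded_token_count` is a hypothetical helper, not part of the notebook's training code.

```python
def pad_seq(seq, target_len, pad_value):
    """Right-pad `seq` with `pad_value` up to `target_len`."""
    return seq + [pad_value] * (target_len - len(seq))


def padded_token_count(lengths, batch_size, fixed_len=None):
    """Total token positions after padding, per batch or to a fixed length."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = fixed_len if fixed_len is not None else max(batch)
        total += target * len(batch)
    return total


# Hypothetical tokenized review lengths
lengths = [12, 40, 25, 18, 300, 33, 21, 15]

dynamic = padded_token_count(lengths, batch_size=4)
fixed = padded_token_count(lengths, batch_size=4, fixed_len=512)
print(dynamic, fixed)

# Padding one short sequence with XLM-RoBERTa's pad ID (1)
print(pad_seq([0, 5, 2], 5, 1))
```

In this toy case, dynamic padding processes about a third of the positions that fixed-length padding would. The actual saving depends on the length distribution of the reviews.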

In [None]:
# # a very simple accuracy function, nothing fancy
# def compute_metrics(p: EvalPrediction) -> Dict:
#     preds = np.argmax(p.predictions, axis=1)
#     return {"acc": (preds == p.label_ids).mean()}

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is sequence classification, we use the `AutoModelForSequenceClassification` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. We set `num_labels` to 5 and use a batch size of 8.

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
batch_size = 8
num_labels = 5

resume_training = False
if resume_training:
    model_checkpoint = 'test-results/checkpoint-20000'
else:
    model_checkpoint = 'xlm-roberta-base'
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)


Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The warning is telling us that some weights (the `classifier.dense` and `classifier.out_proj` layers) are newly and randomly initialized. This is absolutely normal in this case, because we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new classification head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a `Trainer`, we will need to define two more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
metric_name = "accuracy"

args = TrainingArguments(
    output_dir = "test-results-concat",
    seed = 123,
    evaluation_strategy = "steps",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    eval_steps = 5000,
    save_steps = 5000,
    fp16 = True

)

Here we set the evaluation to run every 5,000 steps (`evaluation_strategy="steps"` with `eval_steps=5000`) and save a checkpoint at the same interval. We also tweak the learning rate, use the `batch_size` defined above, customize the number of epochs and the weight decay, and enable mixed-precision training with `fp16=True`. Since the best model might not be the one at the end of training, we ask the `Trainer` to load the best model it saved (according to `metric_name`) at the end of training.
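It helps to work out how many optimizer steps and evaluations these settings imply. The arithmetic below is a sketch that assumes a single GPU and no gradient accumulation, with the 120,000-example training shard created earlier.

```python
import math

train_examples = 120_000   # one tenth of the 1.2M training reviews
batch_size = 8
num_epochs = 3
eval_every = 5_000         # eval_steps and save_steps

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
num_evals = total_steps // eval_every

print(steps_per_epoch, total_steps, num_evals)
```

So each epoch contains three evaluations, and a full three-epoch run evaluates and checkpoints nine times.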

The last thing to define for our `Trainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier. The only preprocessing we have to do is to take the argmax of our predicted logits:

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    predictions = np.argmax(predictions, axis=1)

    return metric.compute(predictions=predictions, references=labels)
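For readers unfamiliar with the argmax step, the sketch below shows it on two made-up rows of logits in plain Python, and maps the resulting class indices back to star ratings by adding 1 (the inverse of the shift applied during preprocessing). `logits_to_stars` is a hypothetical helper.

```python
def logits_to_stars(logits_rows):
    """Pick the highest-scoring class per row and convert it to a 1-5 star rating."""
    stars = []
    for row in logits_rows:
        class_id = max(range(len(row)), key=row.__getitem__)  # argmax over 5 classes
        stars.append(class_id + 1)
    return stars


# Hypothetical logits for two reviews, one score per class
logits = [
    [0.1, 0.3, 2.0, 0.2, -1.0],
    [1.5, 0.2, 0.1, 0.0, -0.3],
]
print(logits_to_stars(logits))  # [3, 1]
```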

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
validation_key = "validation"
trainer = Trainer(
    model,
    args,
    train_dataset= encoded_train_dataset,
    eval_dataset=encoded_val_dataset,
    data_collator=SmartCollator(pad_token_id=tokenizer.pad_token_id),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

You might wonder why we pass along the `tokenizer` when we already preprocessed our data. By default, the `Trainer` uses the tokenizer's `pad` method to bring all the samples in a batch to the same length, which requires knowing the model's preferences regarding padding (to the left or right? with which token?). Here we customize that step by passing our own `data_collator`, the `SmartCollator` defined above, which receives the samples as dictionaries like the ones seen above and returns a dictionary of tensors padded to the longest sequence in the batch.

We can now finetune our model by just calling the `train` method:

In [None]:
!nvidia-smi

Tue Sep  5 14:52:37 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    26W /  70W |   1715MiB / 15360MiB |     31%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
trainer.train()

Step,Training Loss,Validation Loss,Accuracy
5000,1.027,1.005685,0.562


We can check with the `evaluate` method that our `Trainer` did reload the best model properly (if it was not the last one):

In [None]:
trainer.evaluate()

Our model gets an accuracy score of **59.9%**, which is comparable to the accuracy score of 59.2% reported in the [paper](https://arxiv.org/abs/2010.02573).

## Hyperparameter search

The `Trainer` supports hyperparameter search using [optuna](https://optuna.org/) or [Ray Tune](https://docs.ray.io/en/latest/tune/). For this last section you will need either of those libraries installed. Keep the line for the library you want in the next cell and run it.

In [None]:
! pip install optuna
! pip install ray[tune]



During hyperparameter search, the `Trainer` will run several trainings, so it needs to have the model defined via a function (so it can be reinitialized at each new run) instead of just having it passed. We just wrap the same `from_pretrained` call as before in a `model_init` function:

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
batch_size = 8
num_labels = 5

In [None]:
metric_name = "accuracy"

args = TrainingArguments(
    output_dir = "test-results-concat",
    seed = 123,
    evaluation_strategy = "steps",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    eval_steps = 5000,
    save_steps = 5000,
    fp16 = True

)

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    predictions = np.argmax(predictions, axis=1)

    return metric.compute(predictions=predictions, references=labels)

In [None]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

And we can instantiate our `Trainer` like before:

In [None]:
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset= encoded_train_dataset,
    eval_dataset=encoded_val_dataset,
    data_collator=SmartCollator(pad_token_id=tokenizer.pad_token_id),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense

The method we call this time is `hyperparameter_search`. Note that it can take a long time to run on a large dataset. You can try to find good hyperparameters on a smaller portion of the training data by replacing the `train_dataset` line above with:
```python
train_dataset=encoded_train_dataset.shard(index=1, num_shards=10),
```
for 1/10th of the already sharded training set. Then you can run a full training with the best hyperparameters picked by the search.

In [None]:
best_run = trainer.hyperparameter_search(n_trials=5, direction="maximize")

[I 2021-01-27 21:24:48,534] A new study created in memory with name: no-name-14608e42-17f5-42c4-95b9-d546cf4ca8a9

Step,Training Loss,Validation Loss,Accuracy,Runtime,Samples Per Second
5000,1.0049,1.000911,0.575,15.9744,375.601
10000,0.9319,0.959235,0.591333,15.4717,387.806
15000,0.9378,0.935104,0.593,15.4812,387.567
20000,0.8733,0.936651,0.597667,15.627,383.952
25000,0.8601,0.935514,0.598167,16.186,370.69
30000,0.861,0.929796,0.601333,15.6816,382.613


[I 2021-01-27 22:27:06,422] Trial 0 finished with value: 398.89593333333335 and parameters: {'learning_rate': 8.898353327747936e-06, 'num_train_epochs': 2, 'seed': 33, 'per_device_train_batch_size': 8}. Best is trial 0 with value: 398.89593333333335.

Step,Training Loss,Validation Loss


[W 2021-01-27 22:29:07,305] Trial 1 failed because of the following error: RuntimeError('CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 15.75 GiB total capacity; 14.14 GiB already allocated; 58.88 MiB free; 14.48 GiB reserved in total by PyTorch)',)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/optuna/_optimize.py", line 198, in _run_trial
    value_or_values = func(trial)
  File "/usr/local/lib/python3.6/dist-packages/transformers/integrations.py", line 134, in _objective
    trainer.train(model_path=model_path, trial=trial)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 888, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1259, in training_step
    self.scaler.scale(loss).backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, re

RuntimeError: ignored

The `hyperparameter_search` method returns a `BestRun` object, which contains the value of the objective maximized (by default the sum of all metrics) and the hyperparameters used for that run.

In [None]:
best_run

You can customize the objective to maximize by passing along a `compute_objective` function to the `hyperparameter_search` method, and you can customize the search space by passing a `hp_space` argument to `hyperparameter_search`. See this [forum post](https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10) for some examples.
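The trial log above illustrates why a custom objective can matter: the value reported for trial 0 equals the sum of the accuracy, runtime and samples-per-second columns at step 30000, so the search was largely rewarding evaluation speed. Newer library versions may exclude speed metrics from the default objective, so check the behavior of your installed release. The sketch below shows an objective that returns accuracy only, plus an optuna search space; both function names and the ranges are hypothetical, and the metric keys assume the `eval_` prefix the `Trainer` adds to evaluation metrics.

```python
def accuracy_objective(metrics):
    """Return only the validation accuracy as the value to maximize."""
    return metrics["eval_accuracy"]


def optuna_hp_space(trial):
    """Hypothetical search space; the ranges are illustrative, not tuned."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 3),
    }


# Metrics logged at step 30000 of trial 0 in the run above
trial0_metrics = {
    "eval_accuracy": 0.601333,
    "eval_runtime": 15.6816,
    "eval_samples_per_second": 382.613,
}
summed = sum(trial0_metrics.values())
print(summed)                              # close to the logged trial value 398.8959...
print(accuracy_objective(trial0_metrics))  # the quantity we actually care about
```

These would be passed as `trainer.hyperparameter_search(hp_space=optuna_hp_space, compute_objective=accuracy_objective, n_trials=5, direction="maximize")`.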

To reproduce the best training, just set the hyperparameters in your `TrainingArguments` before creating a `Trainer`:

In [None]:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()

Don't forget to [upload your model](https://huggingface.co/transformers/model_sharing.html) to the [🤗 Model Hub](https://huggingface.co/models). You can then use it to generate predictions like the ones evaluated in this notebook.