# Fine-tuning a model with the Trainer API or Keras

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
%%capture
!pip install datasets evaluate transformers[sentencepiece]

Also, log in to Hugging Face.

In [None]:
from huggingface_hub import notebook_login

notebook_login()


🤗 Transformers provides a <font color='blue'>Trainer class</font> to help you <font color='blue'>fine-tune</font> any of the <font color='blue'>pretrained models</font> it provides on your dataset. Once you've done all the <font color='blue'>data preprocessing work</font> in the last section, you have just a few steps left to define the <font color='blue'>`Trainer`</font>. The <font color='blue'>hardest part</font> is likely to be <font color='blue'>preparing the environment</font> to run <font color='blue'>`Trainer.train()`</font>, as it will run very slowly on a CPU. If you don't have a GPU set up, you can get access to <font color='blue'>free GPUs or TPUs</font> on Google <font color='blue'>Colab</font>.

The code examples below assume you have already executed the examples in the previous section. Here is a short summary recapping what you need:

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)




**Training**

The <font color='blue'>first step</font> before we can define our `Trainer` is to define a <font color='blue'>`TrainingArguments` class</font> that will <font color='blue'>contain</font> all the <font color='blue'>hyperparameters</font> the Trainer will use for <font color='blue'>training and evaluation</font>. The <font color='blue'>only argument</font> you have to provide is a <font color='blue'>directory</font> where the <font color='blue'>trained model</font> will be <font color='blue'>saved</font>, as well as the checkpoints along the way. For all the rest, you can leave the defaults, which should work pretty well for a basic fine-tuning.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

**Note:** 💡 If you want to <font color='blue'>automatically upload</font> your <font color='blue'>model</font> to the <font color='blue'>Hub</font> during training, pass along <font color='blue'>`push_to_hub=True`</font> in the <font color='blue'>`TrainingArguments`</font>. We will learn more about this in [Chapter 4](https://huggingface.co/course/chapter4/3).

---



The second step is to define our model. As in the [previous chapter](https://huggingface.co/course/chapter2), we will use the <font color='blue'>`AutoModelForSequenceClassification` class</font>, with two labels:

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


You will notice that unlike in [Chapter 2](https://huggingface.co/course/chapter2), you get a <font color='blue'>warning</font> after <font color='blue'>instantiating</font> this <font color='blue'>pretrained model</font>. This is because <font color='blue'>BERT has not been pretrained</font> on <font color='blue'>classifying pairs of sentences</font>, so the <font color='blue'>head</font> of the <font color='blue'>pretrained model</font> has been <font color='blue'>discarded</font> and a <font color='blue'>new head</font> suitable for <font color='blue'>sequence classification</font> has been <font color='blue'>added instead</font>. The <font color='blue'>warnings indicate</font> that some <font color='blue'>weights were not used</font> (the ones corresponding to the dropped pretraining head) and that some <font color='blue'>others were randomly initialized</font> (the ones for the new head). It concludes by <font color='blue'>encouraging you</font> to <font color='blue'>train the model</font>, which is exactly what we are going to do now.

Once we have our model, we can define a <font color='blue'>`Trainer`</font> by passing it all the <font color='blue'>objects constructed</font> up to now — the `model`, the `training_args`, the `training` and `validation` datasets, our `data_collator`, and our `tokenizer`:

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

Note that when you pass the tokenizer as we did here, the <font color='blue'>default `data_collator`</font> used by the <font color='blue'>`Trainer`</font> will be a <font color='blue'>`DataCollatorWithPadding`</font> as defined previously, so you can <font color='blue'>skip</font> the line <font color='blue'>`data_collator=data_collator`</font> in this call. It was still important to show you this part of the processing in section 2!

To <font color='blue'>fine-tune</font> the model on our dataset, we just have to <font color='blue'>call</font> the <font color='blue'>`train()` method</font> of our Trainer:

In [None]:
trainer.train()


| Step | Training Loss |
| --- | --- |
| 500 | 0.6167 |
| 1000 | 0.5393 |


TrainOutput(global_step=1377, training_loss=0.5322803505625341, metrics={'train_runtime': 307.2718, 'train_samples_per_second': 35.812, 'train_steps_per_second': 4.481, 'total_flos': 405114969714960.0, 'train_loss': 0.5322803505625341, 'epoch': 3.0})

This will start the fine-tuning (which should take a couple of minutes on a GPU) and report the training loss every 500 steps. It won't, however, tell you how well (or <font color='blue'>badly</font>) your <font color='blue'>model is performing</font>. This is because:

1. We <font color='blue'>didn't tell</font> the <font color='blue'>`Trainer`</font> to <font color='blue'>evaluate during training</font> by setting <font color='blue'>`evaluation_strategy`</font> to either <font color='blue'>steps</font> (evaluate every `eval_steps`) or <font color='blue'>epoch</font> (evaluate at the end of each epoch).
2. We <font color='blue'>didn't provide</font> the <font color='blue'>`Trainer`</font> with a <font color='blue'>`compute_metrics()` function</font> to calculate a metric during said evaluation (otherwise the evaluation would just have printed the loss, which is not a very intuitive number).
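
The step counts in the training output above can be reproduced by hand. The sketch below is only a back-of-the-envelope check and relies on assumptions that are not stated in this section: 3,668 examples in the MRPC training split, the default batch size of 8, three epochs, a single device, and no gradient accumulation. The helper `total_optimizer_steps` is a hypothetical name used for illustration.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Count optimizer steps, assuming one device and no gradient accumulation."""
    # The last, smaller batch of each epoch still counts as one step.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Assumed values: 3,668 MRPC training examples, batch size 8, 3 epochs.
total_steps = total_optimizer_steps(3668, 8, 3)

# The training loss is reported every 500 steps.
logged_steps = list(range(500, total_steps + 1, 500))

print(total_steps)   # 1377
print(logged_steps)  # [500, 1000]
```

Under these assumptions the total matches `global_step=1377`, and only two loss values are logged because the third multiple of 500 is never reached.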

**Evaluation**

Let's see how we can <font color='blue'>build</font> a <font color='blue'>useful `compute_metrics()` function</font> and use it the next time we train. The function must take an <font color='blue'>`EvalPrediction` object</font> (which is a <font color='blue'>named tuple</font> with a <font color='blue'>`predictions` field</font> and a <font color='blue'>`label_ids` field</font>) and will <font color='blue'>return</font> a <font color='blue'>dictionary mapping strings to floats</font> (the strings being the names of the metrics returned, and the floats their values). To get some <font color='blue'>predictions</font> from our <font color='blue'>model</font>, we can use the <font color='blue'>`Trainer.predict()`</font> command:

In [None]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)


The <font color='blue'>output</font> of the `predict()` method is another <font color='blue'>named tuple</font> with <font color='blue'>three fields</font>: `predictions`, `label_ids`, and `metrics`. The <font color='blue'>metrics</font> field will just <font color='blue'>contain the loss</font> on the dataset passed, as well as some <font color='blue'>time metrics</font> (how long it took to predict, in total and on average). Once we complete our `compute_metrics()` function and pass it to the Trainer, that field will also contain the metrics returned by `compute_metrics()`.

As you can see, <font color='blue'>predictions</font> is a <font color='blue'>two-dimensional array</font> with shape 408 x 2 (408 being the number of elements in the dataset we used). Those are the <font color='blue'>logits</font> for <font color='blue'>each element</font> of the <font color='blue'>dataset</font> we <font color='blue'>passed</font> to <font color='blue'>`predict()`</font> (as you saw in the [previous chapter](https://huggingface.co/course/chapter2), all <font color='blue'>Transformer models</font> return <font color='blue'>logits</font>). To <font color='blue'>transform</font> them into <font color='blue'>predictions</font> that we can compare to our labels, we need to take the <font color='blue'>index</font> with the <font color='blue'>maximum value</font> on the <font color='blue'>second axis</font>:

In [None]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)
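
To make the `argmax` step concrete, here is a small sketch on made-up logits (the values are invented for illustration and are not model output). It also shows that applying a softmax first would not change the predicted class, since softmax preserves the ordering within each row.

```python
import numpy as np

# Made-up logits for three examples and two classes (not real model output).
toy_logits = np.array([
    [2.0, -1.0],
    [0.3, 0.9],
    [-0.5, 1.5],
])

# Index of the largest logit along the last (class) axis.
toy_preds = np.argmax(toy_logits, axis=-1)

# Numerically stable softmax: each row becomes a probability distribution.
shifted = toy_logits - toy_logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(toy_preds)                  # [0 1 1]
print(np.argmax(probs, axis=-1))  # [0 1 1]
```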

We can now <font color='blue'>compare</font> those <font color='blue'>preds</font> to the <font color='blue'>labels</font>. To build our `compute_metrics()` function, we will rely on the metrics from the 🤗 [Evaluate library](https://github.com/huggingface/evaluate/). We can load the <font color='blue'>metrics</font> associated with the <font color='blue'>MRPC dataset</font> as easily as we loaded the dataset, this time with the <font color='blue'>`evaluate.load()` function</font>. The object returned has a `compute()` method we can use to do the metric calculation:

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)


{'accuracy': 0.8406862745098039, 'f1': 0.8888888888888888}

The <font color='blue'>exact results</font> you get may <font color='blue'>vary</font>, as the <font color='blue'>random initialization</font> of the <font color='blue'>model head</font> might change the metrics it achieved. Here, we can see our model has an <font color='blue'>accuracy of 84.07%</font> on the validation set and an <font color='blue'>F1 score</font> of <font color='blue'>88.89</font>. Those are the two metrics used to evaluate results on the MRPC dataset for the <font color='blue'>GLUE benchmark</font>. The table in the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf) reported an <font color='blue'>F1 score of 88.9</font> for the <font color='blue'>base model</font>. That was the <font color='blue'>uncased model</font>, the same variant as the `bert-base-uncased` checkpoint we are using here, so our result is in line with the reported one.
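
For readers who want to see what these two numbers measure, the following is a hand-rolled sketch of binary accuracy and F1 in plain Python. It is not the Evaluate library's implementation; the function name `accuracy_and_f1` and the toy predictions and labels are made up for illustration.

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Hand-rolled binary accuracy and F1, for illustration only."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    # Confusion-matrix counts with respect to the positive class.
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Made-up predictions and labels (not taken from MRPC).
toy_preds = [1, 1, 0, 1, 0, 1]
toy_labels = [1, 0, 0, 1, 1, 1]

scores = accuracy_and_f1(toy_preds, toy_labels)
print(scores)
```

In this toy case four of six predictions are correct, and precision and recall are both 3/4, so F1 is 0.75.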

Wrapping everything together, we get our `compute_metrics()` function:

In [None]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

And to see it used in action to <font color='blue'>report metrics</font> at the <font color='blue'>end</font> of <font color='blue'>each epoch</font>, here is how we define a new `Trainer` with this `compute_metrics()` function:

In [None]:
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Note that we create a <font color='blue'>new `TrainingArguments`</font> with its <font color='blue'>`evaluation_strategy`</font> set to <font color='blue'>epoch</font> and a <font color='blue'>new model</font> — otherwise, we would just be continuing the training of the model we have already trained. To launch a new training run, we execute:

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
| --- | --- | --- | --- | --- |
| 1 | No log | 0.514786 | 0.776961 | 0.858034 |
| 2 | 0.504300 | 0.514033 | 0.845588 | 0.892675 |
| 3 | 0.269700 | 0.780903 | 0.838235 | 0.886986 |


TrainOutput(global_step=1377, training_loss=0.31747924095491786, metrics={'train_runtime': 217.113, 'train_samples_per_second': 50.683, 'train_steps_per_second': 6.342, 'total_flos': 405114969714960.0, 'train_loss': 0.31747924095491786, 'epoch': 3.0})

This time, it will report the <font color='blue'>validation loss</font> and <font color='blue'>metrics</font> at the <font color='blue'>end of each epoch</font> on top of the training loss. Again, the exact accuracy/F1 score you reach might be a bit different from what we found, because of the random head initialization of the model, but it should be in the same ballpark.
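
One way to read the per-epoch results is to check which epoch scored best on each validation metric. The sketch below simply copies the numbers from the run shown above into a list of dictionaries; the variable names are illustrative, and your own run will give different values.

```python
# Per-epoch validation metrics copied from the run shown above.
epoch_metrics = [
    {"epoch": 1, "validation_loss": 0.514786, "accuracy": 0.776961, "f1": 0.858034},
    {"epoch": 2, "validation_loss": 0.514033, "accuracy": 0.845588, "f1": 0.892675},
    {"epoch": 3, "validation_loss": 0.780903, "accuracy": 0.838235, "f1": 0.886986},
]

# Highest F1 and lowest validation loss across epochs.
best_by_f1 = max(epoch_metrics, key=lambda row: row["f1"])
best_by_loss = min(epoch_metrics, key=lambda row: row["validation_loss"])

print(best_by_f1["epoch"], best_by_loss["epoch"])  # 2 2
```

In this particular run both criteria point to epoch 2. The validation loss rises in epoch 3 while the training loss keeps falling, which may indicate that the model is starting to overfit.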

The <font color='blue'>Trainer</font> will work <font color='blue'>out of the box</font> on <font color='blue'>multiple GPUs or TPUs</font> and provides lots of options, like mixed-precision training (use `fp16=True` in your training arguments). We will go over everything it supports in Chapter 10.

This concludes the introduction to fine-tuning using the `Trainer` API. An <font color='blue'>example</font> of doing this for most <font color='blue'>common NLP tasks</font> will be given in [Chapter 7](https://huggingface.co/course/chapter7), but for now let's look at how to do the same thing in pure <font color='blue'>PyTorch</font>.

✏️ **Try it out!** <font color='blue'>Fine-tune</font> a model on the <font color='blue'>GLUE SST-2 dataset</font>, using the data processing you did in section 2.

**Preprocessing the GLUE SST-2 dataset:** We will first go through loading and preprocessing the GLUE SST-2 dataset with the 🤗 Datasets and 🤗 Transformers libraries. The <font color='blue'>SST-2 dataset</font> is a <font color='blue'>single-sentence text classification task</font>, making it slightly different from GLUE tasks such as MRPC that involve pairs of sentences. We'll cover loading the dataset, tokenization, and dynamic padding.

**1. Loading the Dataset:**

We'll start by loading the SST-2 dataset from the 🤗 Datasets library. This dataset consists of single sentences along with their corresponding labels.

In [None]:
from datasets import load_dataset
raw_datasets = load_dataset("glue", "sst2")

train-00000-of-00001.parquet:   0%|          | 0.00/3.11M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/72.8k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/148k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/67349 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/872 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1821 [00:00<?, ? examples/s]

**2. Tokenization and Preprocessing:**

Now that we have the raw dataset, let's preprocess the data by tokenizing the sentences using a pretrained tokenizer. We'll define a tokenization function and then apply it to the entire dataset.

In [None]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence"], truncation=True)

# Tokenize the entire dataset
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)




**3. Dynamic Padding:**

Dynamic padding allows us to pad the batch to the length of the longest sequence within that batch, instead of padding to the maximum sequence length in the entire dataset. This improves efficiency by reducing unnecessary padding.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Example: Select a few samples from the training set
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence"]}

# Apply dynamic padding using data_collator
batch = data_collator(samples)

{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 29]),
 'token_type_ids': torch.Size([8, 29]),
 'attention_mask': torch.Size([8, 29]),
 'labels': torch.Size([8])}

The batch dictionary now contains keys for `input_ids`, `attention_mask`, `token_type_ids`, and `labels`, each corresponding to a PyTorch tensor. The `attention_mask` marks real tokens with 1 and padding tokens with 0, and `token_type_ids` distinguishes the two sentences in paired tasks (for a single-sentence task like SST-2, it is all zeros).
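
Conceptually, dynamic padding can be sketched in a few lines of plain Python. This is not how `DataCollatorWithPadding` is implemented; `pad_batch` is a hypothetical helper, the token IDs are made up, and a padding token ID of 0 is assumed.

```python
def pad_batch(sequences, pad_token_id=0):
    """Pad every sequence to the longest length found in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs, not real tokenizer output.
toy_batch = pad_batch([[101, 2023, 102], [101, 2023, 2003, 2307, 102]])

print(toy_batch["input_ids"])       # [[101, 2023, 102, 0, 0], [101, 2023, 2003, 2307, 102]]
print(toy_batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Because the target length is computed per batch, a batch of short sentences is padded less than it would be with a fixed, dataset-wide maximum length.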

**4. Fine-Tuning:**

With the preprocessed data, we can now proceed to fine-tune our model on the SST-2 dataset. During training and evaluation, the `Trainer` uses the `data_collator` to build batches like the `batch` dictionary above and feeds them to the model.

This completes the preprocessing of the GLUE SST-2 dataset using Hugging Face Transformers. The combination of tokenization and dynamic padding allows us to efficiently prepare our data for fine-tuning and training classification models on various NLP tasks.

**Full Sequence: Fine-Tuning a Model on GLUE SST-2 Dataset using Hugging Face Transformers**

In this part, we will go through the process of fine-tuning a model on the GLUE SST-2 (Stanford Sentiment Treebank) dataset, a single-sentence classification task, using the Hugging Face Transformers library. We will cover defining a `Trainer`, training the model, and evaluating its performance.

**1. Load dataset:**

First, we need to preprocess the dataset using the steps provided in the previous part of the tutorial. This involves loading the GLUE SST-2 dataset, tokenizing the sentences, and applying dynamic padding.

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "sst2")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


**2. Defining TrainingArguments:**

Now, let's define the training arguments for the Trainer. These arguments will include the directory to save the trained model and other hyperparameters. We will use the TrainingArguments class for this.

In [None]:
from transformers import TrainingArguments
training_args = TrainingArguments("sst2-finetuned-model")
training_args

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=no,
eval_use_gather_object=False,
evaluation_strategy=None,
fp16=

**3. Defining the Model:**

Next, we'll define the model to be fine-tuned. We'll use the AutoModelForSequenceClassification class and specify the number of labels (in this case, 2 for binary classification).

In [None]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**4. Defining the Trainer:**

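Now, we can define the `Trainer` using the model, training arguments, datasets, data collator, and tokenizer. We also pass the `compute_metrics` function defined earlier; it loads the GLUE MRPC metric, so it reports both accuracy and F1. Because the training arguments above keep the default evaluation strategy (`no`), these metrics are computed only when we call `evaluate()`, not during training.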

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

**5. Fine-Tuning the Model:**

With the Trainer defined, we can now initiate the fine-tuning process using the `train()` method.

In [None]:
trainer.train()

| Step | Training Loss |
| --- | --- |
| 500 | 0.4279 |
| 1000 | 0.3803 |
| 1500 | 0.3493 |
| 2000 | 0.3327 |
| 2500 | 0.3198 |
| 3000 | 0.3008 |
| 3500 | 0.3059 |
| 4000 | 0.2715 |
| 4500 | 0.2824 |
| 5000 | 0.2768 |


TrainOutput(global_step=25257, training_loss=0.19856225573431782, metrics={'train_runtime': 2974.0039, 'train_samples_per_second': 67.938, 'train_steps_per_second': 8.493, 'total_flos': 3082513027395900.0, 'train_loss': 0.19856225573431782, 'epoch': 3.0})

The `Trainer` takes care of the training loop and reports the training loss every 500 steps. With the default evaluation strategy used here, it does not compute evaluation metrics during training. Make sure you have access to a GPU or TPU, as training on a CPU will be very slow.

**6. Evaluating the Model:**

After training is complete, we can evaluate the model's performance on the evaluation dataset. The Trainer's `evaluate()` method can be used to do this.

In [None]:
evaluation_results = trainer.evaluate(tokenized_datasets["validation"])
# Print each metric name with its value
for name, value in evaluation_results.items():
    print(f"{name}: {value}")

eval_loss: 0.4068796634674072
eval_accuracy: 0.9094036697247706
eval_f1: 0.9128996692392503
eval_runtime: 5.1052
eval_samples_per_second: 170.808
eval_steps_per_second: 21.351
epoch: 3.0


This will give us a dictionary of evaluation results, which can include metrics like accuracy, F1-score, etc.
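
The returned dictionary mixes quality metrics with timing statistics. As a small sketch, the snippet below copies the values printed above into a literal dictionary, keeps only the quality metrics, and converts the accuracy back into a count of correct predictions. Which keys count as "quality" is an illustrative choice, and the 872 validation examples are taken from the dataset loading output.

```python
# Values copied from the evaluation output above.
evaluation_results = {
    "eval_loss": 0.4068796634674072,
    "eval_accuracy": 0.9094036697247706,
    "eval_f1": 0.9128996692392503,
    "eval_runtime": 5.1052,
    "eval_samples_per_second": 170.808,
    "eval_steps_per_second": 21.351,
    "epoch": 3.0,
}

# Illustrative choice of keys that describe model quality rather than speed.
quality_keys = ("eval_loss", "eval_accuracy", "eval_f1")
quality = {key: round(evaluation_results[key], 4) for key in quality_keys}

# 872 validation examples: recover the number of correct predictions.
num_correct = round(evaluation_results["eval_accuracy"] * 872)

print(quality)      # {'eval_loss': 0.4069, 'eval_accuracy': 0.9094, 'eval_f1': 0.9129}
print(num_correct)  # 793
```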

**7. Saving the Fine-Tuned Model:**

To save the fine-tuned model, we can use the `save_model()` method provided by the Trainer.

In [None]:
trainer.save_model("fine_tuned_sst2_model")

In doing this, we have successfully fine-tuned a model on the GLUE SST-2 dataset using the Hugging Face Transformers library. This process can be adapted to other NLP tasks and datasets as well, allowing us to leverage the power of pre-trained transformer models for specific applications.