<a href="https://colab.research.google.com/github/educatorsRlearners/hugging_face_course/blob/main/03_fine_tuning_a_pretrained_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 3. Fine-Tuning a pretrained model

It's all well and good to use toy datasets, but what about fine-tuning a pretrained model on a dataset of your own? 

In this chapter we'll: 


*   prepare a large dataset from the Hub
*   use the ```Trainer``` API to fine-tune a model
*   use a custom training loop
*   leverage the  🤗 Accelerate library to run a custom training loop on a distributed setup



In [1]:
!pip install transformers datasets

Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting datasets
  Downloading datasets-1.17.0-py3-none-any.whl (306 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)

## [Processing the data](https://huggingface.co/course/chapter3/2?fw=pt)

This is how we would train a sequence classifier on one batch using PyTorch: 

In [2]:
import torch 
from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]

batch = tokenizer(sequences, 
                  padding=True,
                  truncation=True,
                  return_tensors="pt")

batch["labels"] = torch.tensor([1,1])

optimizer = AdamW(model.parameters())

loss = model(**batch).loss
loss.backward()
optimizer.step()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Of course the example above is just that: an example. If we want to get decent results, we're going to need to train on a much larger dataset. Now where can we find one of those? 

### Loading a dataset from the Hub


In [3]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

If we want to inspect our data, we index it like a dictionary: 

In [4]:
raw_train_dataset = raw_datasets["train"]

raw_train_dataset[0]

{'idx': 0,
 'label': 1,
 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

What are the different labels for the dataset above? 

In [5]:
raw_train_dataset.features["label"]

ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None)
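To make the label encoding concrete, here is a minimal pure-Python sketch of the mapping that the ```ClassLabel``` output above describes: the position of a name in the list is its integer label. The helper names ```int_to_name``` and ```name_to_int``` are hypothetical and written only for illustration; the 🤗 Datasets ```ClassLabel``` feature offers comparable conversions itself.

```python
# Label names copied from the ClassLabel output above.
# The list position of each name is its integer label.
label_names = ["not_equivalent", "equivalent"]


def int_to_name(label_id: int) -> str:
    """Map an integer label to its human-readable name (hypothetical helper)."""
    return label_names[label_id]


def name_to_int(name: str) -> int:
    """Map a label name back to its integer id (hypothetical helper)."""
    return label_names.index(name)


# The first training example has label 1, i.e. the two sentences are paraphrases.
print(int_to_name(1))
print(name_to_int("not_equivalent"))
```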

Now that we know what's in our dataset, we can start...

### Preprocessing a dataset

We've previously preprocessed single sentences, but what about pairs of sentences? For instance, what if we want to train a model to decide whether a sentence naturally follows on from the previous one, whether two questions are duplicates, or whether a passage is plagiarized? 

To do so, we need to preprocess sentence pairs like this: 

In [6]:
tokenized_pairs = tokenizer("My name is Evan.", "I work for the AI Guild")

print(tokenized_pairs['input_ids'])
print(tokenized_pairs['token_type_ids'])
print(tokenized_pairs["attention_mask"])

[101, 2026, 2171, 2003, 9340, 1012, 102, 1045, 2147, 2005, 1996, 9932, 9054, 102]
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


**Key Point**: ```token_type_ids``` (i.e., which tokens belong to which sequence) are only returned by tokenizers whose models were pretrained to make use of them. 

For example, BERT is pretrained on masked language modeling as well as next sentence prediction (i.e., "Does this sentence naturally follow on from the previous one?"), so its tokenizer returns ```token_type_ids```. DistilBERT, on the other hand, is not pretrained with that objective, so its tokenizer does not return them.
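As a rough illustration of how the three outputs relate, the sketch below rebuilds them by hand for the sentence pair tokenized above. It assumes the layout visible in that output, namely ```[CLS] A [SEP] B [SEP]``` with ids 101 and 102 for the special tokens; ```build_pair_inputs``` is a hypothetical helper, not part of the ```transformers``` API.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Assemble BERT-style inputs for a sentence pair (illustrative sketch only)."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], the first sentence, and the first [SEP];
    # segment 1 covers the second sentence and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # No padding here, so every position is a real token.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Token ids copied from the tokenizer output above, without the special tokens.
first = [2026, 2171, 2003, 9340, 1012]
second = [1045, 2147, 2005, 1996, 9932, 9054]

pair = build_pair_inputs(first, second)
print(pair["input_ids"])
print(pair["token_type_ids"])
print(pair["attention_mask"])
```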

BTW, if we have several pairs, we can pass them like this: 

In [7]:
tokenized_groups = tokenizer(
    ["My name is Evan.", "I'm going to the movies."], #first sentences
    ["I work for the AI Guild.", "This song rocks!"], 
    padding=True
)

print(tokenized_groups['input_ids'])
print(tokenized_groups['token_type_ids'])
print(tokenized_groups["attention_mask"])

[[101, 2026, 2171, 2003, 9340, 1012, 102, 1045, 2147, 2005, 1996, 9932, 9054, 1012, 102], [101, 1045, 1005, 1049, 2183, 2000, 1996, 5691, 1012, 102, 2023, 2299, 5749, 999, 102]]
[[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]


Now while the above method works, it returns a plain dictionary, so the entire tokenized dataset has to fit in RAM; unless we're using a toy dataset, we'll probably run out of memory. 

To avoid this issue, we can use the ```Dataset.map()``` method, which keeps our data as a ```Dataset``` backed by an Apache Arrow file on disk. 

In [8]:
def tokenize_function(example):
  return tokenizer(example["sentence1"], example["sentence2"],
                   truncation=True)

**Key Point**: we've omitted ```padding=True``` because padding is best done during batching. The samples don't need to be the same length until they are grouped into a batch for training; padding them all up front would just create a needlessly massive dataset. 

Now we can apply our tokenize function on the dataset in one go like so: 

In [9]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

  0%|          | 0/4 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [10]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 408
    })
    test: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 1725
    })
})
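To see why ```tokenize_function``` can index ```example["sentence1"]``` directly when ```batched=True```, it may help to sketch what a batched map does: the function receives a dictionary of lists (one list per column) and returns new columns, which are added next to the existing ones. The code below is a simplified pure-Python stand-in, not the real ```Dataset.map()``` implementation; ```batched_map``` and ```toy_tokenize``` are hypothetical names, and the default batch size of 1000 reflects my understanding of the library default.

```python
def batched_map(columns, function, batch_size=1000):
    """Apply `function` to slices of a column-oriented table and add the returned columns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of lists, one list per column.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        result = function(batch)
        for name, values in result.items():
            new_columns.setdefault(name, []).extend(values)
    # Existing columns are kept; returned columns are added alongside them.
    return {**columns, **new_columns}


def toy_tokenize(batch):
    """Stand-in for a tokenizer: counts whitespace-separated words in each pair."""
    return {
        "n_words": [
            len(a.split()) + len(b.split())
            for a, b in zip(batch["sentence1"], batch["sentence2"])
        ]
    }


toy = {"sentence1": ["a b", "c", "d e f"], "sentence2": ["x", "y z", "w"]}
out = batched_map(toy, toy_tokenize, batch_size=2)
print(sorted(out.keys()))
print(out["n_words"])
```

This mirrors the result above: the original ```sentence1``` and ```sentence2``` columns survive, and the tokenizer's outputs appear as additional columns.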

### Dynamic Padding

Fixed padding pads every input to the same given length. 

Dynamic padding pads every input to the length of the longest input in its batch. 

**NB**: DO NOT use dynamic padding on a TPU; TPUs prefer fixed tensor shapes, so batches of varying length slow them down.

How do we do dynamic padding with the ```transformers``` library? 

Like so:

In [11]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

It really is just that simple. 

Now let's try it out with some samples from our training data.

In [12]:
samples = tokenized_datasets['train'][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

So, based on what we see above, we need to pad this batch to a max length of 67. 

In [13]:
batch = data_collator(samples)

{k:v.shape for k, v in batch.items()}

{'attention_mask': torch.Size([8, 67]),
 'input_ids': torch.Size([8, 67]),
 'labels': torch.Size([8]),
 'token_type_ids': torch.Size([8, 67])}
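For intuition, here is a minimal pure-Python sketch of what dynamic padding does to the eight samples above. It is not the ```DataCollatorWithPadding``` implementation; ```pad_batch``` is a hypothetical helper, the pad id of 0 is an assumption, and the fixed length of 512 is only a hypothetical comparison point.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Sequence lengths copied from the output above; the token ids are placeholders.
lengths = [50, 59, 47, 67, 59, 50, 62, 32]
fake_batch = [[1] * n for n in lengths]

input_ids, attention_mask = pad_batch(fake_batch)

dynamic_tokens = len(input_ids) * len(input_ids[0])  # 8 rows padded to the batch maximum
fixed_tokens = len(lengths) * 512                    # 8 rows padded to a hypothetical fixed length
real_tokens = sum(sum(row) for row in attention_mask)

print(len(input_ids[0]), dynamic_tokens, fixed_tokens, real_tokens)
```

The gap between ```dynamic_tokens``` and ```fixed_tokens``` is the padding a model would otherwise have to process for nothing.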

## [Fine-tuning a model with the Trainer API](https://huggingface.co/course/chapter3/3?fw=pt)

The ```Trainer``` API allows us to easily train our model on our custom dataset by passing a myriad of specifications such as: 

* metrics
* hyperparameters
* model
* training, validation, and test datasets
* tokenizer
* data collator

The ```Trainer``` will then handle: 
* training
* evaluation
* prediction

**Key Point**: ```Trainer``` runs incredibly slowly on a CPU, so be sure to use a GPU, e.g., on Colab or Kaggle. 

Now that that's covered, the first step is to preprocess the data.

In [3]:
from datasets import load_dataset
from transformers import  AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
  return tokenizer(example['sentence1'], example['sentence2'], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)




### Training 

Before we define the ```Trainer``` we'll want to define the ```TrainingArguments``` like so: 

In [4]:
from transformers import TrainingArguments

training_args = TrainingArguments('test-trainer')

**Key Point**: To automatically push the model to the hub during training, pass ```push_to_hub=True``` in ```TrainingArguments```.

Next, we define our model: 

In [6]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=checkpoint,
    num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Now we can define our ```Trainer``` by passing all the objects we've created to this point:

In [7]:
from transformers import  Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets['validation'],
    data_collator=data_collator,
    tokenizer=tokenizer
)

**NB**: When you pass the ```tokenizer``` as we did above, the ```Trainer``` defaults to a ```DataCollatorWithPadding```, so you do not strictly need to define the ```data_collator``` yourself. However, in the spirit of explicit is better than implicit, we did it anyway 😀

Now, to fine-tune the model on our custom dataset, we simply call the ```train()``` method on our ```Trainer```:

In [8]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2, sentence1, idx.
***** Running training *****
  Num examples = 3668
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1377


| Step | Training Loss |
| --- | --- |
| 500 | 0.5708 |
| 1000 | 0.3687 |


Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1000
Configuration saved in test-trainer/checkpoint-1000/config.json
Model weights saved in test-trainer/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1000/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1377, training_loss=0.40630322846278993, metrics={'train_runtime': 206.6487, 'train_samples_per_second': 53.25, 'train_steps_per_second': 6.663, 'total_flos': 405470580750720.0, 'train_loss': 0.40630322846278993, 'epoch': 3.0})
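The numbers in the training log can be checked with a little arithmetic. The sketch below recomputes the total number of optimization steps and the throughput figures from the values reported above. The 500-step interval for logging and checkpointing is my reading of the ```TrainingArguments``` defaults, which would explain why only steps 500 and 1000 show up.

```python
import math

# Values copied from the training log above.
num_examples = 3668
batch_size = 8
num_epochs = 3
train_runtime = 206.6487  # seconds

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

# Assumed interval of 500 steps for both logging and checkpointing.
checkpoint_steps = list(range(500, total_steps + 1, 500))

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 2), round(steps_per_second, 3))
print(checkpoint_steps)
```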

This run doesn't tell us how well our model generalizes: the only thing reported is the training loss, because we neither asked the ```Trainer``` to evaluate during training nor gave it a ```compute_metrics``` function. Let's remedy that 😀

### [Evaluation](https://huggingface.co/course/chapter3/3?fw=pt#evaluation)

To get some predictions from our model, we use the ```predict()``` method of the ```Trainer```:


In [10]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

The following columns in the test set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2, sentence1, idx.
***** Running Prediction *****
  Num examples = 408
  Batch size = 8


(408, 2) (408,)


```predictions.predictions``` is a 2D array comprising the logits for each example (one row per example, one column per class):

In [15]:
predictions.predictions[0]

array([-3.0714695,  2.9938512], dtype=float32)
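The two numbers above are logits, i.e. unnormalized scores, one per class. As a small sketch of how they relate to a prediction, the snippet below applies a softmax to turn them into probabilities and picks the most likely class. The ```softmax``` helper is written by hand for illustration; in practice one would typically use a library routine.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits for the first validation example, copied from the output above.
logits = [-3.0714695, 2.9938512]

probs = softmax(logits)
predicted_class = probs.index(max(probs))

print(probs)
print(predicted_class)
```

Because softmax preserves the ordering of the scores, taking the argmax of the raw logits gives the same predicted class.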

In order to actually compare our predictions to our labels, we need to take the index with the max value on the second axis: 

In [20]:
import numpy as np 

preds = np.argmax(predictions.predictions, axis=-1)

Now we can compare our predictions with the labels: 

In [21]:
from datasets import load_metric

metric = load_metric("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.8529411764705882, 'f1': 0.8993288590604027}
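To make the two reported numbers less of a black box, here is a minimal sketch of how accuracy and F1 can be computed by hand for a binary task. It assumes that label 1 (```equivalent```) is treated as the positive class, which is my understanding of the GLUE MRPC metric; ```accuracy_and_f1``` is a hypothetical helper and the toy predictions are made up for illustration.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Compute accuracy and binary F1 from two equal-length lists of labels."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    true_pos = sum(p == positive and r == positive for p, r in pairs)
    false_pos = sum(p == positive and r != positive for p, r in pairs)
    false_neg = sum(p != positive and r == positive for p, r in pairs)

    accuracy = correct / len(pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Toy data, made up for illustration only.
toy_preds = [1, 1, 0, 1, 0, 1]
toy_refs = [1, 0, 0, 1, 1, 1]

scores = accuracy_and_f1(toy_preds, toy_refs)
print(scores)
```

As a sanity check on the real run, an accuracy of 0.8529 on 408 validation examples corresponds to 348 correct predictions.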

Now that we've done everything step-by-step, we can wrap the metric computation up in a function, ```compute_metrics```, and pass it to the ```Trainer``` along with new ```training_args``` that enable evaluation at the end of each epoch:

In [34]:
def compute_metrics(eval_preds):
  metric = load_metric("glue", "mrpc")
  logits, labels = eval_preds
  predictions = np.argmax(logits, axis=-1)
  return metric.compute(predictions=predictions, references=labels)

In [35]:
training_args = TrainingArguments('test-trainer', evaluation_strategy='epoch')
model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=checkpoint,
    num_labels=2
)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": 

All systems are now go! 

In [36]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2, sentence1, idx.
***** Running training *****
  Num examples = 3668
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1377


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
| --- | --- | --- | --- | --- |
| 1 | No log | 0.387581 | 0.840686 | 0.886165 |
| 2 | 0.535300 | 0.450535 | 0.85049 | 0.895726 |
| 3 | 0.283600 | 0.650574 | 0.85049 | 0.896435 |


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2, sentence1, idx.
***** Running Evaluation *****
  Num examples = 408
  Batch size = 8
Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2, sentence1, idx.
***** Running Evaluation *****
  Num examples = 408
  Batch size = 8
Saving model checkpoint to test-trainer/checkpoint-1000
Configuration saved in test-trainer/checkpoint-1000/config.json
Model weights saved in test-trainer/checkpoin

TrainOutput(global_step=1377, training_loss=0.3330001664836942, metrics={'train_runtime': 215.8797, 'train_samples_per_second': 50.973, 'train_steps_per_second': 6.379, 'total_flos': 405470580750720.0, 'train_loss': 0.3330001664836942, 'epoch': 3.0})

## [A full training](https://huggingface.co/course/chapter3/4?fw=pt) 

# START HERE - Use a GPU