# A full training

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
%%capture
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
#!pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl

Also, log in to Hugging Face.

In [2]:
from huggingface_hub import notebook_login

notebook_login()

Now we'll see how to achieve the same results as we did in the last section <font color='blue'>without using the `Trainer` class</font>. Again, we assume you have done the <font color='blue'>data processing</font> in <font color='blue'>section 2</font>. Here is a short summary covering everything you will need:

In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

**Prepare for training**

Before actually <font color='blue'>writing</font> our <font color='blue'>training loop</font>, we will need to <font color='blue'>define</font> a few <font color='blue'>objects</font>. The first ones are the <font color='blue'>dataloaders</font> we will use to <font color='blue'>iterate over batches</font>. But before we can define those dataloaders, we need to <font color='blue'>apply a bit of postprocessing</font> to our `tokenized_datasets`, to take care of some things that the <font color='blue'>`Trainer` did</font> for us <font color='blue'>automatically</font>. Specifically, we need to:
- <font color='blue'>Remove</font> the <font color='blue'>columns corresponding</font> to <font color='blue'>values</font> the <font color='blue'>model</font> does <font color='blue'>not expect</font> (like the `sentence1` and `sentence2` columns).
- <font color='blue'>Rename</font> the column <font color='blue'>`label`</font> to <font color='blue'>`labels`</font> (because the model expects the argument to be named `labels`).
- Set the format of the datasets so they <font color='blue'>return PyTorch tensors</font> instead of lists.

Our `tokenized_datasets` has one method for each of those steps:

In [4]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

We can then check that the result only has columns that our model will accept:

In [5]:
sorted(tokenized_datasets["train"].column_names)

['attention_mask', 'input_ids', 'labels', 'token_type_ids']

Now that this is done, we can <font color='blue'>define our dataloaders</font>:

In [6]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

To quickly check there is no mistake in the data processing, we can <font color='blue'>inspect a batch</font> like this:

In [7]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 65]),
 'token_type_ids': torch.Size([8, 65]),
 'attention_mask': torch.Size([8, 65])}

Note that the actual shapes will probably be slightly different for you since we set <font color='blue'>`shuffle=True`</font> for the training <font color='blue'>dataloader</font> and we are <font color='blue'>padding</font> to the <font color='blue'>maximum length</font> inside the batch.
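
To make the effect of per-batch padding concrete, here is a minimal pure-Python sketch of what a padding collator does conceptually. The helper `pad_batch`, the made-up token IDs, and `pad_id=0` are assumptions for illustration only; this is not the `DataCollatorWithPadding` implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy batches whose longest sequences differ (token IDs are made up).
batch_a = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
batch_b = pad_batch([[101, 7, 102], [101, 102]])

print(len(batch_a["input_ids"][0]))  # 4
print(len(batch_b["input_ids"][0]))  # 3
```

Because the padded width depends only on the longest sequence in each batch, shuffling changes which examples end up together and therefore the second dimension of each tensor.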

Now that we're completely finished with data preprocessing (a satisfying yet elusive goal for any ML practitioner), let's turn to the model. We <font color='blue'>instantiate it</font> exactly as we did in the previous section:

In [8]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


To make sure that everything will go smoothly during training, we <font color='blue'>pass our batch</font> to this <font color='blue'>model</font>:

In [9]:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(0.5699, grad_fn=<NllLossBackward0>) torch.Size([8, 2])


All 🤗 Transformers models will <font color='blue'>return the loss</font> when labels are provided, and we also get the <font color='blue'>logits</font> (two for each input in our batch, so a tensor of size 8 x 2).
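
For intuition about where that loss value comes from, the sketch below computes a mean cross-entropy from raw logits in plain Python. It assumes the classification head uses the standard cross-entropy loss (the `NllLossBackward0` in the output above suggests so); the function name and toy numbers are illustrative, not taken from the library.

```python
import math


def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the correct class, from raw logits."""
    total = 0.0
    for row, label in zip(logits, labels):
        # log-sum-exp, subtracting the max for numerical stability
        m = max(row)
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_norm - row[label]
    return total / len(labels)


# Toy batch of two examples with two classes each (values are made up).
logits = [[0.0, 0.0], [2.0, -1.0]]
labels = [1, 0]
loss = cross_entropy(logits, labels)
print(round(loss, 4))
```

A head that predicts both classes with equal probability gives a per-example loss of ln 2 (about 0.69), which is why a freshly initialized two-class head usually starts near that value.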

We're almost ready to <font color='blue'>write our training loop</font>! We're just <font color='blue'>missing two things</font>: an <font color='blue'>optimizer</font> and a <font color='blue'>learning rate scheduler</font>. Since we are trying to replicate what the `Trainer` was doing by hand, we will use the same defaults. The optimizer used by the `Trainer` is <font color='blue'>AdamW</font>, which is the same as Adam, but with a <font color='blue'>twist for weight decay regularization</font> (see “[Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101)” by Ilya Loshchilov and Frank Hutter):

In [10]:
from torch.optim import AdamW

# transformers.AdamW is deprecated; PyTorch's own AdamW is the drop-in replacement.
optimizer = AdamW(model.parameters(), lr=5e-5)

Finally, the <font color='blue'>learning rate scheduler</font> used by default is just a <font color='blue'>linear decay</font> from the <font color='blue'>maximum value</font> (5e-5) to <font color='blue'>0</font>. To properly define it, we need to <font color='blue'>know the number of training steps</font> we will take, which is the <font color='blue'>number of epochs</font> we want to run <font color='blue'>multiplied</font> by the <font color='blue'>number of training batches</font> (which is the length of our training dataloader). The `Trainer` uses three epochs by default, so we will follow that:

In [None]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)
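
The arithmetic behind the schedule can be sketched in plain Python. The function `linear_lr` below is a hypothetical stand-in written for illustration, not the library's scheduler; the 3,668 training examples come from the dataset loading output, and the batch size of 8 from the dataloader defined earlier.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


num_train_examples = 3668
batch_size = 8
num_epochs = 3

steps_per_epoch = math.ceil(num_train_examples / batch_size)  # 459 (last batch is partial)
num_training_steps = num_epochs * steps_per_epoch  # 1377

for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, 5e-5, num_training_steps))
```

With no warmup, the rate starts at the maximum on the first step and reaches exactly 0 on the last one.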

**The training loop**

One last thing: we will <font color='blue'>want to use the GPU</font> if we have access to one (on a CPU, training might take several hours instead of a couple of minutes). To do this, we <font color='blue'>define a device</font> we will <font color='blue'>put our model and our batches on</font>:

In [None]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

We are now ready to train! To get some sense of when training will be finished, we <font color='blue'>add a progress bar</font> over our number of training steps, using the `tqdm` library:

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

You can see that the <font color='blue'>core</font> of the <font color='blue'>training loop</font> looks a lot like the one in the <font color='blue'>introduction</font>. We didn't <font color='blue'>ask</font> for <font color='blue'>any reporting</font>, so this training loop will not tell us anything about how the model fares. We need to <font color='blue'>add an evaluation loop</font> for that.

**The evaluation loop**

As we did earlier, we will use a metric provided by the 🤗 Evaluate library. We've already seen the `metric.compute()` method, but <font color='blue'>metrics</font> can actually <font color='blue'>accumulate batches</font> for us as <font color='blue'>we go over the prediction loop</font> with the <font color='blue'>method `add_batch()`</font>. Once we have accumulated <font color='blue'>all the batches</font>, we can get the <font color='blue'>final result</font> with <font color='blue'>`metric.compute()`</font>. Here's how to implement all of this in an evaluation loop:

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()  # put the model in evaluation mode (disables dropout)
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():  # gradients are not needed for inference
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

Again, your results will be <font color='blue'>slightly different</font> because of the <font color='blue'>randomness in the model head</font> initialization and the <font color='blue'>data shuffling</font>, but they should be in the same ballpark.
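
The accumulate-then-compute pattern behind `add_batch()` and `compute()` can be illustrated with a small pure-Python class. `RunningBinaryMetric` is a hypothetical name and a simplified sketch for binary labels, not the 🤗 Evaluate implementation; it reports accuracy and F1, the two scores the GLUE MRPC metric returns.

```python
class RunningBinaryMetric:
    """Collect predictions batch by batch, then score them all at once."""

    def __init__(self):
        self.predictions = []
        self.references = []

    def add_batch(self, predictions, references):
        self.predictions.extend(predictions)
        self.references.extend(references)

    def compute(self):
        pairs = list(zip(self.predictions, self.references))
        correct = sum(p == r for p, r in pairs)
        tp = sum(p == 1 and r == 1 for p, r in pairs)
        fp = sum(p == 1 and r == 0 for p, r in pairs)
        fn = sum(p == 0 and r == 1 for p, r in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return {"accuracy": correct / len(pairs), "f1": f1}


metric_sketch = RunningBinaryMetric()
metric_sketch.add_batch(predictions=[1, 0, 1, 1], references=[1, 0, 0, 1])
metric_sketch.add_batch(predictions=[0, 1], references=[1, 1])
result = metric_sketch.compute()
print(result)
```

Scoring once over all accumulated examples matters for F1 in particular, since averaging per-batch F1 values would generally give a different number.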

✏️ **Try it out!** Modify the previous training loop to <font color='blue'>fine-tune your model</font> on the <font color='blue'>SST-2 dataset</font>.

In [None]:
# Data Preprocessing Section
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
from torch.utils.data import DataLoader

raw_datasets = load_dataset("glue", "sst2")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

tokenized_datasets = tokenized_datasets.remove_columns(["sentence", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

In [None]:
# Modeling Section
from torch.optim import AdamW  # Optimizer (transformers.AdamW is deprecated)
from transformers import AutoModelForSequenceClassification, get_scheduler  # Model and learning rate scheduler

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=5e-5)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

In [None]:
# Device and Progress Bar Section
import torch
from tqdm.auto import tqdm

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(device)
model.to(device)
progress_bar = tqdm(range(num_training_steps))

In [None]:
# Training Loop Section
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

model.eval()

**Supercharge your training loop with 🤗 Accelerate**

The training loop we defined earlier works fine on a <font color='blue'>single CPU or GPU</font>. But using the [🤗 Accelerate library](https://github.com/huggingface/accelerate), with just a few adjustments we can enable <font color='blue'>distributed training on multiple GPUs or TPUs</font>. Starting from the creation of the training and validation dataloaders, here is what our manual training loop looks like:

In [None]:
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

And here are the changes:

In [None]:
from accelerate import Accelerator # Added
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler

accelerator = Accelerator() # Added

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

# device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") # Deleted
# model.to(device) # Deleted

train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(train_dataloader, eval_dataloader, model, optimizer) # Added, main bulk of work

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # batch = {k: v.to(device) for k, v in batch.items()} # Deleted
        outputs = model(**batch)
        loss = outputs.loss
        # loss.backward() # Deleted
        accelerator.backward(loss) # Added

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

The first line to <font color='blue'>add</font> is the <font color='blue'>import line</font>. The second line <font color='blue'>instantiates an `Accelerator` object</font> that will look at the environment and initialize the proper distributed setup. <font color='blue'>🤗 Accelerate</font> handles the <font color='blue'>device placement for you</font>, so you can remove the lines that put the model on the device (or, if you prefer, change them to use `accelerator.device` instead of `device`).

Then the <font color='blue'>main bulk of the work</font> is done in the line that <font color='blue'>sends the dataloaders</font>, the <font color='blue'>model</font>, and the <font color='blue'>optimizer</font> to <font color='blue'>`accelerator.prepare()`</font>. This will <font color='blue'>wrap those objects</font> in the <font color='blue'>proper container</font> to make sure your <font color='blue'>distributed training works</font> as intended. The remaining changes to make are <font color='blue'>removing the line</font> that puts the <font color='blue'>batch on the device</font> (again, if you want to keep this you can just change it to use `accelerator.device`) and replacing `loss.backward()` with `accelerator.backward(loss)`.
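
One practical consequence of wrapping the dataloaders is that, in a multi-process run, each process only sees part of the batches. The sketch below is a conceptual illustration only: `shard_batches` is a hypothetical helper using a simple round-robin split, and the real library's sharding has more options (for example, how uneven splits are handled). The 459 batches and 4 processes are assumed numbers.

```python
def shard_batches(batches, num_processes, process_index):
    """Give each process every num_processes-th batch (round-robin)."""
    return batches[process_index::num_processes]


all_batches = list(range(459))  # stand-ins for the batches of one epoch
num_processes = 4               # assumed number of devices

per_process = [
    shard_batches(all_batches, num_processes, rank) for rank in range(num_processes)
]
lengths = [len(shard) for shard in per_process]
print(lengths)  # [115, 115, 115, 114]
```

This is why the number of training steps passed to the scheduler is computed from the length of the prepared dataloader rather than the original one: the per-process length is what determines how many optimizer steps each process takes.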

**Note:** In order to benefit from the speed-up offered by Cloud TPUs, we recommend <font color='blue'>padding</font> your samples to a <font color='blue'>fixed length</font> with the <font color='blue'>`padding="max_length"`</font> and `max_length` arguments of the tokenizer.
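
A fixed-length variant of the tokenization function might look like the sketch below. The factory name `make_fixed_length_tokenize_function` and the default `max_length=128` are assumptions chosen for illustration; `padding`, `max_length`, and `truncation` are the tokenizer arguments mentioned above.

```python
def make_fixed_length_tokenize_function(tokenizer, max_length=128):
    """Build a tokenize function that pads every example to the same length."""

    def tokenize_function(example):
        return tokenizer(
            example["sentence1"],
            example["sentence2"],
            truncation=True,         # cut sequences longer than max_length
            padding="max_length",    # pad shorter ones up to max_length
            max_length=max_length,
        )

    return tokenize_function


# Usage with the objects defined earlier in this notebook:
# tokenize_function = make_fixed_length_tokenize_function(tokenizer)
# tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```

With every example already at the same length, all batches share one shape. The trade-off is extra computation on padding tokens for short inputs.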

If you'd like to copy and paste it to play around, here's what the <font color='blue'>complete training loop</font> looks like with 🤗 Accelerate:

In [None]:
from accelerate import Accelerator
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler

accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Putting this in a <font color='blue'>`train.py` script</font> will make that <font color='blue'>script runnable</font> on any kind of <font color='blue'>distributed setup</font>. To try it out in your distributed setup, run the command:

In [None]:
!accelerate config

which will prompt you to answer a few questions and dump your answers in a configuration file used by this command:

In [None]:
!accelerate launch train.py

which will launch the distributed training. If you want to try this in a Notebook (for instance, to test it with TPUs on Colab), just paste the code in a <font color='blue'>`training_function()`</font> and run a last cell with:

In [None]:
from accelerate import notebook_launcher

notebook_launcher(training_function)

You can find more examples in the [🤗 Accelerate repo](https://github.com/huggingface/accelerate/tree/main/examples).