# A full training

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
%%capture
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
#!pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl

Also, log into Hugging face.

In [2]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Now we'll see how to achieve the same results as we did in the last section <font color='blue'> without using the `Trainer` class</font>. Again, we assume you have done the <font color='blue'>data processing</font> in <font color='blue'>
section 2</font>. Here is a short summary covering everything you will need:

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

**Prepare for training**

Before actually <font color='blue'>writing</font> our <font color='blue'>training loop</font>, we will need to <font color='blue'>define</font> a few <font color='blue'>objects</font>. The first ones are the <font color='blue'>dataloaders</font> we will use to <font color='blue'>iterate over batches</font>. But before we can define those dataloaders, we need to <font color='blue'>apply a bit of postprocessing</font> to our `tokenized_datasets`, to take care of some things that the <font color='blue'>`Trainer` did</font> for us <font color='blue'>automatically</font>. Specifically, we need to:
- <font color='blue'>Remove</font> the <font color='blue'>columns corresponding</font> to <font color='blue'>values</font> the <font color='blue'>model</font> does <font color='blue'>not expect</font> (like the `sentence1` and `sentence2` columns).
- <font color='blue'>Rename</font> the column <font color='blue'>`label`</font> to <font color='blue'>`labels`</font> (because the model expects the argument to be named `labels`).
- Set the format of the datasets so they <font color='blue'>return PyTorch tensors</font> instead of lists.

Our `tokenized_datasets` has one method for each of those steps:

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

We can then check that the result only has columns that our model will accept:

In [None]:
["attention_mask", "input_ids", "labels", "token_type_ids"]

['attention_mask', 'input_ids', 'labels', 'token_type_ids']

Now that this is done, we can <font color='blue'>define our dataloaders</font>:

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

To quickly check there is no mistake in the data processing, we can <font color='blue'>inspect a batch</font> like this:

In [None]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 72]),
 'token_type_ids': torch.Size([8, 72]),
 'attention_mask': torch.Size([8, 72])}

Note that the actual shapes will probably be slightly different for you since we set <font color='blue'>`shuffle=True`</font> for the training <font color='blue'>dataloader</font> and we are <font color='blue'>padding</font> to the <font color='blue'>maximum length</font> inside the batch.

Now that we're completely finished with data preprocessing (a satisfying yet elusive goal for any ML practitioner), let's turn to the model. We <font color='blue'>instantiate it</font> exactly as we did in the previous section:

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


To make sure that everything will go smoothly during training, we <font color='blue'>pass our batch</font> to this <font color='blue'>model</font>:

In [None]:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(0.7747, grad_fn=<NllLossBackward0>) torch.Size([8, 2])


All 🤗 Transformers models will <font color='blue'>return the loss</font> when labels are provided, and we also get the <font color='blue'>logits</font> (two for each input in our batch, so a tensor of size 8 x 2).

We're almost ready to <font color='blue'>write our training loop</font>! We're just <font color='blue'>missing two things</font>: an <font color='blue'>optimizer</font> and a <font color='blue'>learning rate scheduler</font>. Since we are trying to replicate what the `Trainer` was doing by hand, we will use the same defaults. The optimizer used by the `Trainer` is <font color='blue'>AdamW</font>, which is the same as Adam, but with a <font color='blue'>twist for weight decay regularization</font> (see “[Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101)” by Ilya Loshchilov and Frank Hutter):

In [None]:
import torch
from transformers import AdamW

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

Finally, the <font color='blue'>learning rate scheduler</font> used by default is just a <font color='blue'>linear decay</font> from the <font color='blue'>maximum value</font> (5e-5) to <font color='blue'>0</font>. To properly define it, we need to <font color='blue'>know the number of training steps</font> we will take, which is the <font color='blue'>number of epochs</font> we want to run <font color='blue'>multiplied</font> by the <font color='blue'>number of training batches</font> (which is the length of our training dataloader). The `Trainer` uses three epochs by default, so we will follow that:

In [None]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

1377


 **The training loop**

 One last thing: we will <font color='blue'>want to use the GPU</font> if we have access to one (on a CPU, training might take several hours instead of a couple of minutes). To do this, we <font color='blue'>define a device</font> we will <font color='blue'>put our model and our batches on</font>:

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

We are now ready to train! To get some sense of when training will be finished, we <font color='blue'>add a progress bar</font> over our number of training steps, using the `tqdm` library:

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/1377 [00:00<?, ?it/s]

You can see that the <font color='blue'>core</font> of the <font color='blue'>training loop</font> looks a lot like the one in the <font color='blue'>introduction</font>. We didn't <font color='blue'>ask</font> for <font color='blue'>any reporting</font>, so this training loop will not tell us anything about how the model fares. We need to <font color='blue'>add an evaluation loop</font> for that.

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

Downloading builder script:   0%|          | 0.00/5.75k [00:00<?, ?B/s]

{'accuracy': 0.8578431372549019, 'f1': 0.8993055555555556}

**The evaluation loop**

As we did earlier, we will use a metric provided by the 🤗 Evaluate library. We've already seen the <font color='blue'>`metric.compute()`</font> method, but <font color='blue'>metrics</font> can actually <font color='blue'>accumulate batches</font> for us as <font color='blue'>we go over the prediction loop</font> with the <font color='blue'>method `add_batch()`</font>. Once we have accumulated <font color='blue'>all the batches</font>, we can get the <font color='blue'>final result</font> with <font color='blue'>`metric.compute()`</font>. Here's how to implement all of this in an evaluation loop:

In [None]:
from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

model.eval()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1377 [00:00<?, ?it/s]

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

Again, your results will be <font color='blue'>slightly different</font> because of the <font color='blue'>randomness in the model head</font> initialization and the <font color='blue'>data shuffling</font>, but they should be in the same ballpark.

✏️ **Try it out!** Modify the previous training loop to <font color='blue'>fine-tune your model</font> on the <font color='blue'>SST-2 dataset</font>.

**Loading the GLUE SST-2 dataset using the Hugging Face Transformers library:** We will once again go through the process of preprocessing the GLUE SST-2 dataset using the Hugging Face Transformers library. The <font color='blue'>SST-2 dataset</font> is a <font color='blue'>single-sentence text classification task</font>, making it slightly different from other GLUE tasks that involve pairs of sentences. We'll cover loading the dataset, tokenization, and training.

**1. Loading the Dataset, Tokenization, and Preprocessing:**
We'll start by loading the SST-2 dataset from the 🤗 Datasets library. This dataset consists of single sentences along with their corresponding labels. After we download the raw dataset, we will preprocess the data by tokenizing the sentences using a pretrained tokenizer. We'll define a tokenization function and then apply it to the entire dataset.

In [5]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
from torch.utils.data import DataLoader

raw_datasets = load_dataset("glue", "sst2")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

tokenized_datasets = tokenized_datasets.remove_columns(["sentence", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

Map:   0%|          | 0/872 [00:00<?, ? examples/s]

**2. Modeling:** Next, we'll define the model to be fine-tuned. We'll use the AutoModelForSequenceClassification class.

In [7]:
from transformers import AutoModelForSequenceClassification
from transformers import AdamW, get_scheduler # Optimizer and a Learning Rate Scheduler

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=5e-5)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**3. Defining the Trainer and Fine-Tuning the Model:**

Now, we can define the Trainer using the model, training arguments, datasets, data collator, and tokenizer. Additionally, we'll specify the fine-tuning process using the `train()` method. The method `model.train()` tells your model that you are training the model and the method `mode.eval()` is used for evaluation (inference) mode. This helps inform layers such as Dropout and BatchNorm, which are designed to behave differently during training and evaluation. For instance, in training mode, BatchNorm updates a moving average on each new batch; whereas, for evaluation mode, these updates are frozen.

In [None]:
from tqdm.auto import tqdm

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(device)
model.to(device)
progress_bar = tqdm(range(num_training_steps))

cuda


  0%|          | 0/25257 [00:00<?, ?it/s]

In [None]:
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

**Supercharge your training loop with 🤗 Accelerate**

The training loop we defined earlier works fine on a <font color='blue'>single CPU or GPU</font>. But using the [🤗 Accelerate library](https://github.com/huggingface/accelerate), with just a few adjustments we can enable <font color='blue'>distributed training on multiple GPUs or TPUs</font>. Starting from the creation of the training and validation dataloaders, here is what our manual training loop looks like:

In [None]:
from accelerate import Accelerator
from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler

accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/25257 [00:00<?, ?it/s]

And here are the changes:

In [9]:
import torch
from tqdm.auto import tqdm
from accelerate import Accelerator # Added
from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler

accelerator = Accelerator() # Added

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") # Deleted
# model.to(device) # Deleted

train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(train_dataloader, eval_dataloader, model, optimizer) # Added, main bulk of work

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
   "linear",
   optimizer=optimizer,
   num_warmup_steps=0,
   num_training_steps=num_training_steps
  )

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
  for batch in train_dataloader:
    #batch = {k: v.to(device) for k, v in batch.items()} # Deleted
    outputs = model(**batch)
    loss = outputs.loss
    #loss.backward() # Deleted
    accelerator.backward(loss) # Added

    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    progress_bar.update(1)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/25257 [00:00<?, ?it/s]

The first line to <font color='blue'>add</font> is the <font color='blue'>import line</font>. The second line <font color='blue'>instantiates an `Accelerator` object</font> that will look at the environment and initialize the proper distributed setup. <font color='blue'>🤗 Accelerate</font> handles the <font color='blue'>device placement for you</font>, so you can remove the lines that put the model on the device (or, if you prefer, change them to use `accelerator.device` instead of device).

Then the <font color='blue'>main bulk of the work</font> is done in the line that <font color='blue'>sends the dataloaders</font>, the <font color='blue'>model</font>, and the <font color='blue'>optimizer</font> to <font color='blue'>`accelerator.prepare()`</font>. This will <font color='blue'>wrap those objects</font> in the <font color='blue'>proper container</font> to make sure your <font color='blue'>distributed training works</font> as intended. The remaining changes to make are <font color='blue'>removing the line</font> that puts the <font color='blue'>batch on the device</font> (again, if you want to keep this you can just change it to use `accelerator.device`) and replacing `loss.backward()` with `accelerator.backward(loss)`.

**Note:** In order to benefit from the speed-up offered by Cloud TPUs, we recommend <font color='blue'>padding</font> your samples to a <font color='blue'>fixed length</font> with the <font color='blue'>`padding="max_length"`</font> and `max_length` arguments of the tokenizer.

If you'd like to copy and paste it to play around, here's what the <font color='blue'>complete training loop</font> looks like with 🤗 Accelerate:

In [10]:
from accelerate import Accelerator
from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler

accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/25257 [00:00<?, ?it/s]

Putting this in a <font color='blue'>`train.py` script</font> will make that <font color='blue'>script runnable</font> on any kind of <font color='blue'>distributed setup</font>. To try it out in your distributed setup, run the command:

In [19]:
!accelerate config default

accelerate configuration saved at /root/.cache/huggingface/accelerate/default_config.yaml


which will prompt you to answer a few questions and dump your answers in a configuration file used by this command:

In [28]:
!ls -larth

total 24K
drwxr-xr-x 4 root root 4.0K Nov 22 14:22 .config
drwxr-xr-x 1 root root 4.0K Nov 22 14:23 sample_data
drwxr-xr-x 1 root root 4.0K Nov 26 04:47 ..
-rw-r--r-- 1 root root 6.2K Nov 26 06:07 train.py
drwxr-xr-x 1 root root 4.0K Nov 26 06:07 .


In [29]:
!accelerate launch ./train.py

Traceback (most recent call last):
  File "/content/./train.py", line 18, in <module>
    from ..data import SingleSentenceClassificationProcessor as Processor
ImportError: attempted relative import with no known parent package
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1168, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 763, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', './train.py']' returned non-zero exit status 1.


In [34]:
#!accelerate launch train.py
#Doesn't appear to work in a notebook environment

which will launch the distributed training. See [accelerate example notebooks](https://github.com/huggingface/notebooks/blob/main/examples/accelerate_examples/simple_nlp_example.ipynb) from which we can define a `training_function` from the Accelerate library.

In [48]:
def training_function(model):
    # Initialize accelerator
    accelerator = Accelerator()

    # To have only one message (and not 8) per logs of Transformers or Datasets, we set the logging verbosity
    # to INFO for the main process only.
    if accelerator.is_main_process:
        datasets.utils.logging.set_verbosity_warning()
        transformers.utils.logging.set_verbosity_info()
    else:
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()

    train_dataloader, eval_dataloader = create_dataloaders(
        train_batch_size=hyperparameters["train_batch_size"], eval_batch_size=hyperparameters["eval_batch_size"]
    )
    # The seed need to be set before we instantiate the model, as it will determine the random head.
    set_seed(hyperparameters["seed"])

    # Instantiate optimizer
    optimizer = AdamW(params=model.parameters(), lr=hyperparameters["learning_rate"])

    # Prepare everything
    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
    # prepare method.
    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader
    )

    num_epochs = hyperparameters["num_epochs"]
    # Instantiate learning rate scheduler after preparing the training dataloader as the prepare method
    # may change its length.
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=100,
        num_training_steps=len(train_dataloader) * num_epochs,
    )

    # Instantiate a progress bar to keep track of training. Note that we only enable it on the main
    # process to avoid having 8 progress bars.
    progress_bar = tqdm(range(num_epochs * len(train_dataloader)), disable=not accelerator.is_main_process)
    # Now we train the model
    for epoch in range(num_epochs):
        model.train()
        for step, batch in enumerate(train_dataloader):
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)

            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)

        model.eval()
        all_predictions = []
        all_labels = []

        for step, batch in enumerate(eval_dataloader):
            with torch.no_grad():
                outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)

            # We gather predictions and labels from the 8 TPUs to have them all.
            all_predictions.append(accelerator.gather(predictions))
            all_labels.append(accelerator.gather(batch["labels"]))

        # Concatenate all predictions and labels.
        # The last thing we need to do is to truncate the predictions and labels we concatenated
        # together as the prepared evaluation dataloader has a little bit more elements to make
        # batches of the same size on each process.
        all_predictions = torch.cat(all_predictions)[:len(tokenized_datasets["validation"])]
        all_labels = torch.cat(all_labels)[:len(tokenized_datasets["validation"])]

        eval_metric = metric.compute(predictions=all_predictions, references=all_labels)

        # Use accelerator.print to print only on the main process.
        accelerator.print(f"epoch {epoch}:", eval_metric)


def create_dataloaders(train_batch_size=8, eval_batch_size=32):
    train_dataloader = DataLoader(
        tokenized_datasets["train"], shuffle=True, batch_size=train_batch_size
    )
    eval_dataloader = DataLoader(
        tokenized_datasets["validation"], shuffle=False, batch_size=eval_batch_size
    )
    return train_dataloader, eval_dataloader

hyperparameters = {
    "learning_rate": 2e-5,
    "num_epochs": 3,
    "train_batch_size": 8, # Actual batch size will this x 8
    "eval_batch_size": 32, # Actual batch size will this x 8
    "seed": 42,
}

If you want to try this in a Notebook (for instance, to test it with TPUs on Colab), just paste the code in a <font color='blue'>`training_function()`</font> and run a last cell with:

In [49]:
import torch
from torch.utils.data import DataLoader

from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
    AdamW,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
    set_seed,
)

from tqdm.auto import tqdm

import datasets
import transformers


notebook_launcher(training_function, (model,))

Launching training on one GPU.




  0%|          | 0/25257 [00:00<?, ?it/s]

RuntimeError: stack expects each tensor to be equal size, but got [7] at entry 0 and [17] at entry 1

You can find more examples in the [🤗 Accelerate repo](https://github.com/huggingface/accelerate/tree/main/examples).