# HuggingFace NLP Course Chapter 3

## Fine-tuning a model with the Trainer API

[https://huggingface.co/learn/nlp-course/chapter3/3?fw=pt](https://huggingface.co/learn/nlp-course/chapter3/3?fw=pt)

Working through the first part quickly since it's familiar - I want to get to the part where it introduces the `Accelerate` library.

---

## Quick notes from previous section - Processing The Data

### dataset

We will use the MRPC (Microsoft Research Paraphrase Corpus) dataset as an example, introduced in a paper by William B. Dolan and Chris Brockett. The dataset consists of 5,801 pairs of sentences, with a label indicating whether they are paraphrases or not (i.e., whether both sentences mean the same thing). We’ve selected it for this chapter because it’s a small dataset, so it’s easy to experiment with training on it.

In [1]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")


In [2]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [3]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [4]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

## Using Trainer class

Practice with HF stuff first before `Accelerate` DIY

In [5]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# DATASET CONSISTS OF PAIRS OF SENTENCES
# NOTE IN CH03 "Processing the data" IT SHOWS THAT tokenizer("sentence 1", "sent 2", ...)
# CAN ALSO ACCEPT TWO LISTS OF SENTENCES (ONE PER COLUMN), SO BELOW CODE MAKES SENSE WITH batched=True
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


In [6]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer") # THE ONLY ARG THAT IS NEEDED IS NAME OF SAVE DIRECTORY

In [7]:
# 2 LABELS BECAUSE TASK IS "is sentence2 a paraphrase of sentence1 -> yes / no classification"
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)



## Evaluation

(In the course they first train the above without evaluation during training (no `evaluation_strategy`) or compute_metrics, then they make you redo it with these 2 things after.)

**NOTE: I SKIPPED THAT FIRST TRAINING RUN, SO THE CODE BELOW IS WITH AN UNTRAINED MODEL!**

---

Let’s see how we can build a useful compute_metrics() function and use it the next time we train. The function must take an EvalPrediction object (which is a named tuple with a predictions field and a label_ids field) and will return a dictionary mapping strings to floats (the strings being the names of the metrics returned, and the floats their values). To get some predictions from our model, we can use the Trainer.predict() command:

In [10]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)


The output of the predict() method is another named tuple with three fields: predictions, label_ids, and metrics. The metrics field will just contain the loss on the dataset passed, as well as some time metrics (how long it took to predict, in total and on average). Once we complete our compute_metrics() function and pass it to the Trainer, that field will also contain the metrics returned by compute_metrics().

As you can see, predictions is a two-dimensional array with shape 408 x 2 (408 being the number of elements in the dataset we used). Those are the logits for each element of the dataset we passed to predict() (as you saw in the previous chapter, all Transformer models return logits). To transform them into predictions that we can compare to our labels, we need to take the index with the maximum value on the second axis:

In [12]:
predictions.predictions[:10] # REMEMBER MY MODEL IS UNTRAINED

array([[-0.949354  , -0.05467544],
       [-0.93355924, -0.04643866],
       [-0.9492972 , -0.06445806],
       [-0.94276524, -0.06690286],
       [-0.9375537 , -0.05669558],
       [-0.9518137 , -0.05445227],
       [-0.94520247, -0.06065865],
       [-0.9233213 , -0.05202857],
       [-0.9564588 , -0.04163671],
       [-0.9583559 , -0.04877253]], dtype=float32)

In [14]:
predictions.label_ids[:10] # LABEL IDS IS WHAT THEY CALL THE "GOLD" / "REFERENCE"

array([1, 0, 0, 1, 0, 1, 0, 1, 1, 1])

In [13]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)

We can now compare those preds to the labels. To build our compute_metrics() function, we will rely on the metrics from the 🤗 Evaluate library. We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the evaluate.load() function. The object returned has a compute() method we can use to do the metric calculation:

In [16]:
%pip install evaluate



In [17]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.6838235294117647, 'f1': 0.8122270742358079}
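
A side note on these numbers, since the model is untrained: the logits printed earlier all favour class 1, so it looks like the model predicts `equivalent` for every pair. Assuming that, and assuming 279 of the 408 validation pairs are labelled 1 (a count inferred from the reported accuracy, not read from the dataset), the scores can be reproduced by hand:

```python
# Sketch: metrics for a classifier that always predicts class 1.
# ASSUMPTION: 279 positives out of 408, inferred from the accuracy above.
n_total = 408
n_positive = 279  # assumed, not checked against the dataset

true_pos = n_positive             # every positive is predicted positive
false_pos = n_total - n_positive  # every negative is also predicted positive
false_neg = 0                     # nothing is ever predicted negative

accuracy = true_pos / n_total
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, f1)
```

If that assumption holds, the 68% accuracy here is just the majority-class baseline and says nothing about the model having learned the task.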

Wrapping everything together, we get our compute_metrics() function:

In [19]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    # NAMING IS CONFUSING: "labels" HERE ARE THE label_ids FROM BEFORE, I.E. THE REFERENCE / GOLD LABELS
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

And to see it used in action to report metrics at the end of each epoch, here is how we define a new Trainer with this compute_metrics() function:

In [23]:
training_args = TrainingArguments(
    "test-trainer",
    evaluation_strategy="epoch",
    report_to="none",
)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [24]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.387825 | 0.835784 | 0.888889 |
| 2 | 0.515100 | 0.445928 | 0.852941 | 0.892857 |
| 3 | 0.300100 | 0.593101 | 0.860294 | 0.901893 |


TrainOutput(global_step=1377, training_loss=0.33896216154964226, metrics={'train_runtime': 126.4371, 'train_samples_per_second': 87.031, 'train_steps_per_second': 10.891, 'total_flos': 405114969714960.0, 'train_loss': 0.33896216154964226, 'epoch': 3.0})

# A Full Training

Doing the same as above but with more control in PyTorch

---

Repeat here the stuff we will need:

In [26]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


## Prepare for training

Before actually writing our training loop, we will need to define a few objects. The first ones are the dataloaders we will use to iterate over batches. But before we can define those dataloaders, we need to apply a bit of postprocessing to our tokenized_datasets, to take care of some things that the Trainer did for us automatically. Specifically, we need to:

* Remove the columns corresponding to values the model does not expect (like the sentence1 and sentence2 columns).
* Rename the column label to labels **(because the model expects the argument to be named labels)**.
* Set the format of the datasets so they return PyTorch tensors instead of lists.

Our tokenized_datasets has one method for each of those steps:

In [27]:
# REMINDER OF COLUMNS ETC
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [28]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])

tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

tokenized_datasets.set_format("torch")

tokenized_datasets["train"].column_names # ["attention_mask", "input_ids", "labels", "token_type_ids"]

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

Now we can easily define the dataloaders we need:

In [29]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)

eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

To quickly check there is no mistake in the data processing, we can inspect a batch like this:

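**SEE ABOVE: OUR BATCH SIZE IS 8 SO WE SHOULD SEE 8 AS THE FIRST DIMENSION IN THE TENSORS BELOW - THE SECOND DIMENSION (SEQUENCE LENGTH) VARIES FROM RUN TO RUN, SINCE THE COLLATOR PADS TO THE LONGEST SEQUENCE IN THE BATCH AND WE HAVE SHUFFLE=TRUE IN THE TRAIN DATALOADER**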

In [30]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 81]),
 'token_type_ids': torch.Size([8, 81]),
 'attention_mask': torch.Size([8, 81])}

Now turn to the model part:

In [31]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**GOOD PRACTICE SEND 1 BATCH THROUGH**

To make sure that everything will go smoothly during training, we pass our batch to this model:

In [34]:
outputs = model(**batch)
print(outputs)
print("----")
print(outputs.loss)
print(outputs.logits.shape)

SequenceClassifierOutput(loss=tensor(0.9363, grad_fn=<NllLossBackward0>), logits=tensor([[ 0.6608, -0.3372],
        [ 0.6689, -0.3575],
        [ 0.6599, -0.3340],
        [ 0.6668, -0.3561],
        [ 0.6559, -0.3264],
        [ 0.6782, -0.3260],
        [ 0.6623, -0.3201],
        [ 0.6589, -0.3497]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
----
tensor(0.9363, grad_fn=<NllLossBackward0>)
torch.Size([8, 2])
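
For reference, `outputs.loss` for this classification head is the mean cross-entropy over the batch (the `NllLossBackward0` in the output hints at that). A rough pure-Python sketch for a single row, using the first logits pair printed above; the label for that row is not shown in the output, so both cases are computed:

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax probability of the reference label."""
    log_sum_exp = math.log(sum(math.exp(x) for x in logits))
    return log_sum_exp - logits[label]

row = [0.6608, -0.3372]  # first row of the logits printed above

loss_if_0 = cross_entropy(row, 0)  # loss if the gold label is 0
loss_if_1 = cross_entropy(row, 1)  # loss if the gold label is 1
print(loss_if_0, loss_if_1)

# With two equal logits the loss would be exactly ln(2), about 0.693
print(cross_entropy([0.0, 0.0], 0))
```

A batch loss of 0.94, above ln(2), would be consistent with the fresh head leaning towards class 0 while most labels in that batch were 1, but the batch labels are not printed here so that is only a guess.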


We’re almost ready to write our training loop! We’re just missing two things: an optimizer and a learning rate scheduler. Since we are trying to replicate what the Trainer was doing by hand, we will use the same defaults. The optimizer used by the Trainer is AdamW, which is the same as Adam, but with a twist for weight decay regularization (see “Decoupled Weight Decay Regularization” by Ilya Loshchilov and Frank Hutter):

In [35]:
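# transformers.AdamW is deprecated (newer transformers releases may no longer ship it); use the PyTorch one
from torch.optim import AdamW

# weight_decay=0.0 matches the old transformers.AdamW default (torch's default is 0.01)
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)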



Finally, the learning rate scheduler used by default is just a linear decay from the maximum value (5e-5) to 0. To properly define it, we need to know the number of training steps we will take, which is the number of epochs we want to run multiplied by the number of training batches (which is the length of our training dataloader). The Trainer uses three epochs by default, so we will follow that:

In [36]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

1377
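
The 1377 can be checked by hand: 3668 training examples in batches of 8 gives 459 batches per epoch (the last one is partial), times 3 epochs. It also matches the `global_step=1377` reported by the Trainer run earlier. A minimal sketch of that arithmetic and of a linear decay with no warmup; `linear_lr` is a hypothetical helper written for illustration, not the `get_scheduler` implementation:

```python
import math

num_examples = 3668   # rows in the train split
batch_size = 8
num_epochs = 3
base_lr = 5e-5

# DataLoader keeps the last partial batch by default, hence the ceil
steps_per_epoch = math.ceil(num_examples / batch_size)
num_training_steps = num_epochs * steps_per_epoch

def linear_lr(step):
    """Learning rate after `step` scheduler steps, decaying linearly to 0 (no warmup)."""
    return base_lr * max(0.0, 1 - step / num_training_steps)

print(steps_per_epoch, num_training_steps)           # 459 1377
print(linear_lr(0), linear_lr(num_training_steps))   # 5e-05 0.0
```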


One last thing: we will want to use the GPU if we have access to one (on a CPU, training might take several hours instead of a couple of minutes). To do this, we define a device we will put our model and our batches on:

In [37]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

We are now ready to train! To get some sense of when training will be finished, we add a progress bar over our number of training steps, using the tqdm library:

In [38]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
debug = 0
for epoch in range(num_epochs):
    for batch in train_dataloader:
        if debug < 1:
            print(batch)
            debug = 123
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)


{'labels': tensor([1, 0, 0, 1, 0, 1, 1, 1]), 'input_ids': tensor([[  101,  2157,  2085,  1010,  2069,  2416,  2163,  2079,  1024,  6751,
          1010,  4174,  1010,  5135,  1010,  2047,  3933,  1010,  3448,  1010,
          1998,  5273,  1012,   102,  6410,  2015,  2550,  2011,  2069,  2416,
          2163,  1035,  6751,  1010,  4174,  1010,  5135,  1010,  2047,  3933,
          1010,  3448,  1998,  5273,  1035,  2085,  2031,  2107,  5433,  1012,
           102,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  1996, 13486, 12163,  3446,  3062,  2000,  1020,  1012,  1018,
          3867,  1010,  2091,  2013,  1037,  8001,  1020,  1012,  1021,  3867,
          1999,  2257,  1012,   102,  1996, 12163,  3446,  1999,  2624, 16877,
          2221, 13537,  2197,  3204,  2000,  1022,  1012,  1019,  3867,  1010,
          2091,  3053,  1037,  2440,  7017,  2391,  2013,  2257,  1012,   102,
             0,     0,     0,     0,     0,     0,     0,     0,     0],
      

You can see that the core of the training loop looks a lot like the one in the introduction. We didn’t ask for any reporting, so this training loop will not tell us anything about how the model fares. We need to add an evaluation loop for that.

## The evaluation loop

As we did earlier, we will use a metric provided by the 🤗 Evaluate library. We’ve already seen the metric.compute() method, but metrics can actually accumulate batches for us as we go over the prediction loop with the method add_batch(). Once we have accumulated all the batches, we can get the final result with metric.compute(). Here’s how to implement all of this in an evaluation loop:

In [39]:
import evaluate

metric = evaluate.load("glue", "mrpc")

model.eval()

debug = 0
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
        
    if debug < 1:
        print(outputs)
        debug = 9994
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

SequenceClassifierOutput(loss=tensor(0.3927, device='cuda:0'), logits=tensor([[-2.0384,  1.7079],
        [ 1.0465, -0.7586],
        [ 0.2972, -1.2662],
        [-1.2164,  0.8131],
        [ 0.5937, -1.1682],
        [-1.5595,  1.3339],
        [-1.2748,  1.0378],
        [-1.7863,  1.6049]], device='cuda:0'), hidden_states=None, attentions=None)


{'accuracy': 0.8529411764705882, 'f1': 0.8958333333333334}
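
The accumulate-then-compute pattern is worth seeing in isolation. A minimal sketch with a hypothetical `RunningAccuracy` class (made up for illustration; it is not how 🤗 Evaluate is implemented and it only tracks accuracy):

```python
class RunningAccuracy:
    """Toy stand-in for a metric with add_batch() / compute()."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        # Accumulate raw counts only; nothing is averaged until compute()
        self.correct += sum(int(p == r) for p, r in zip(predictions, references))
        self.total += len(references)

    def compute(self):
        return {"accuracy": self.correct / self.total}


toy_metric = RunningAccuracy()
toy_metric.add_batch(predictions=[1, 0, 1], references=[1, 0, 0])  # 2 of 3 correct
toy_metric.add_batch(predictions=[1], references=[1])              # 1 of 1 correct
result = toy_metric.compute()
print(result)  # {'accuracy': 0.75}
```

Accumulating counts matters when the last batch is smaller: averaging the two per-batch accuracies here would give about 0.83 instead of 0.75.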

## Supercharge your training loop with 🤗 Accelerate

The training loop we defined earlier works fine on a single CPU or GPU. But using the 🤗 Accelerate library, with just a few adjustments we can enable distributed training on multiple GPUs or TPUs. 

**NEW CODE BELOW - I COMMENTED OUT STUFF FROM PREVIOUS LOOP AND ADDED <---- TO NEW STUFF**

In [None]:
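from accelerate import Accelerator # <------
from torch.optim import AdamW  # transformers.AdamW is deprecated, so use the PyTorch one
from transformers import AutoModelForSequenceClassification, get_scheduler

accelerator = Accelerator() # <------

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
# weight_decay=0.0 matches the old transformers.AdamW default (torch's default is 0.01)
optimizer = AdamW(model.parameters(), lr=3e-5, weight_decay=0.0)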

#device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
#model.to(device)

train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
) # <--------------------------------------------------------------

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        #batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        #loss.backward()
        accelerator.backward(loss) # <---------------
        
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

The first line to add is the import line. The second line instantiates an Accelerator object that will look at the environment and initialize the proper distributed setup. 🤗 Accelerate handles the device placement for you, so you can remove the lines that put the model on the device (or, if you prefer, change them to use accelerator.device instead of device).

Then the main bulk of the work is done in the line that sends the dataloaders, the model, and the optimizer to accelerator.prepare(). This will wrap those objects in the proper container to make sure your distributed training works as intended. The remaining changes to make are removing the line that puts the batch on the device (again, if you want to keep this you can just change it to use accelerator.device) and replacing loss.backward() with accelerator.backward(loss).

**In order to benefit from the speed-up offered by Cloud TPUs, we recommend padding your samples to a fixed length with the `padding="max_length"` and `max_length` arguments of the tokenizer.**
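
To make the difference concrete: `DataCollatorWithPadding` pads each batch to its own longest sequence, so tensor shapes change from batch to batch, whereas `padding="max_length"` gives every batch the same shape. A pure-Python sketch of the two behaviours; `pad_batch` is a hypothetical helper, the token ids are made up, and pad id 0 is an assumption (it matches the zeros in the batch printed earlier):

```python
PAD_ID = 0  # assumed pad token id

def pad_batch(batch, max_length=None):
    """Pad lists of token ids to max_length, or to the longest sequence in the batch."""
    target = max_length if max_length is not None else max(len(seq) for seq in batch)
    padded = []
    for seq in batch:
        seq = seq[:target]  # truncate anything longer than the target
        padded.append(seq + [PAD_ID] * (target - len(seq)))
    return padded

batch_a = [[101, 7, 8, 102], [101, 9, 102]]
batch_b = [[101, 5, 102], [101, 102]]

# Dynamic padding: sequence length depends on the batch
dynamic_lengths = [len(pad_batch(b)[0]) for b in (batch_a, batch_b)]
# Fixed padding: same sequence length for every batch
fixed_lengths = [len(pad_batch(b, max_length=6)[0]) for b in (batch_a, batch_b)]
print(dynamic_lengths, fixed_lengths)  # [4, 3] [6, 6]
```

As far as I understand it, constant shapes help on TPUs because the compiled graph can be reused instead of being recompiled for every new sequence length; the price is extra padding tokens in short batches.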

---

Putting this in a train.py script will make that script runnable on any kind of distributed setup. To try it out in your distributed setup, run the command:

`accelerate config`

which will prompt you to answer a few questions and dump your answers in a configuration file used by this command:

`accelerate launch train.py`

which will launch the distributed training.

---

If you want to try this in a Notebook (for instance, to test it with TPUs on Colab), just paste the code in a training_function() and run a last cell with:


**UPDATE - THIS DOESN'T WORK AS WRITTEN: need to adjust (the NLP course snippet is wrong/incomplete - had to google elsewhere)**

**ALSO, cannot run it in this notebook since an `Accelerator` object was apparently already created in an earlier cell (see the ValueError at the end)**

- will redo in a new notebook

In [40]:
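from accelerate import Accelerator
from torch.optim import AdamW  # transformers.AdamW is deprecated, so use the PyTorch one
from transformers import AutoModelForSequenceClassification, get_scheduler

def training_function():
    # The Accelerator must only be created inside this function for notebook_launcher
    accelerator = Accelerator()

    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # weight_decay=0.0 matches the old transformers.AdamW default (torch's default is 0.01)
    optimizer = AdamW(model.parameters(), lr=3e-5, weight_decay=0.0)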

    train_dl, eval_dl, model, optimizer = accelerator.prepare(
        train_dataloader, eval_dataloader, model, optimizer
    )

    num_epochs = 3
    num_training_steps = num_epochs * len(train_dl)
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )

    progress_bar = tqdm(range(num_training_steps))

    model.train()
    for epoch in range(num_epochs):
        for batch in train_dl:
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)

            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)

In [43]:
from accelerate.utils import write_basic_config
write_basic_config()

PosixPath('/root/.cache/huggingface/accelerate/default_config.yaml')

In [45]:
from accelerate import notebook_launcher

notebook_launcher(training_function, num_processes=2)

ValueError: To launch a multi-GPU training from your notebook, the `Accelerator` should only be initialized inside your training function. Restart your notebook and make sure no cells initializes an `Accelerator`.