In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [2]:
from pynvml import (
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
    nvmlInit,
)


def print_gpu_utilization():
    """Print the total memory in use across all GPUs visible to NVML."""
    nvmlInit()
    memory_used = 0
    # NVML ignores CUDA_VISIBLE_DEVICES, so this counts every physical GPU.
    for index in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(index)
        memory_used += nvmlDeviceGetMemoryInfo(handle).used
    print(f"GPU memory occupied: {memory_used // 1024**2} MB.")


print_gpu_utilization()

GPU memory occupied: 620 MB.


In [3]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased").to('cuda:0')
print_gpu_utilization()

  from .autonotebook import tqdm as notebook_tqdm
Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the mo

GPU memory occupied: 2564 MB.


In [4]:
default_args = {
    "output_dir": "outputs",
    "evaluation_strategy": "no",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}

In [5]:
import numpy as np
from datasets import Dataset

seq_len, dataset_size = 512, 512
dummy_data = {
    "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
    # The upper bound is exclusive, so use 2 to draw labels from {0, 1}.
    "labels": np.random.randint(0, 2, (dataset_size,)),
}
ds = Dataset.from_dict(dummy_data)
ds.set_format("pt")

# Using Hugging Face Accelerate
So far we have used the Trainer to run the experiments, but a more flexible alternative is Hugging Face Accelerate. With Accelerate you have full control over the training loop and can essentially write it in pure PyTorch with a few minor modifications. In turn, the same loop can **scale across different infrastructures such as CPUs, GPUs, TPUs, or distributed multi-GPU setups** without code changes.

Let’s see what it takes to implement the tweaks used so far (gradient accumulation, gradient checkpointing, the Adafactor optimizer, and mixed precision) in Accelerate. We can still use TrainingArguments to hold the training settings:

In [6]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    optim="adafactor",
    **default_args,
)
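
As a quick sanity check on what these settings imply, the sketch below works out the effective batch size and the number of optimizer updates for one epoch over the 512-sample dummy dataset. The variable names are illustrative only, and `num_devices = 1` is an assumption based on `CUDA_VISIBLE_DEVICES` being set to a single GPU.

```python
# Values taken from the TrainingArguments and the dummy dataset above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
dataset_size = 512
num_devices = 1  # assumption: a single GPU is used for training

# Number of samples that contribute to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Forward/backward passes in one epoch, and how many of them end in an update.
micro_steps = dataset_size // (per_device_train_batch_size * num_devices)
optimizer_steps = micro_steps // gradient_accumulation_steps

print(f"Effective batch size: {effective_batch_size}")
print(f"Forward/backward passes per epoch: {micro_steps}")
print(f"Optimizer updates per epoch: {optimizer_steps}")
```

Because 512 is divisible by 4, no partially accumulated gradients are left over at the end of the epoch.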

The full example code with Accelerate is given below:

In [7]:
import time
from transformers import Adafactor
from accelerate import Accelerator
from torch.utils.data import DataLoader

# Wrap the dataset in a DataLoader
dataloader = DataLoader(ds, batch_size=training_args.per_device_train_batch_size)

# Enable gradient checkpointing
if training_args.gradient_checkpointing:
    model.gradient_checkpointing_enable()

# Define the Adafactor optimizer
optim = Adafactor(model.parameters(), beta1=training_args.adam_beta1)

# Request fp16 mixed precision; Accelerate applies it to the objects
# passed to prepare() below.
accelerator = Accelerator(mixed_precision='fp16')
model, optimizer, dataloader = accelerator.prepare(model, optim, dataloader)

# Set the model in training mode
model.train()

start_time = time.time()

# Main training loop
for step, batch in enumerate(dataloader, start=1):
    loss = model(**batch).loss

    # Normalize the loss so we get the average at the end of accumulation
    loss = loss / training_args.gradient_accumulation_steps

    accelerator.backward(loss)
    
    if step % training_args.gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

print(f'Time: {time.time() - start_time}')
print_gpu_utilization()

Time: 45.45839309692383
GPU memory occupied: 6432 MB.
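
The loop above divides the loss by `gradient_accumulation_steps` before calling `backward`. The toy sketch below illustrates why. It uses a one-parameter model with a hand-written gradient as a stand-in for autograd (all names and values here are hypothetical), and it scales the gradient directly, which is equivalent to scaling the loss because differentiation is linear.

```python
def grad_squared_error(w, x, t):
    """Gradient of the per-sample loss (w * x - t) ** 2 with respect to w."""
    return 2.0 * (w * x - t) * x


w = 0.5                      # toy model parameter
xs = [1.0, 2.0, 3.0, 4.0]    # four samples, one per micro-batch
ts = [2.0, 4.0, 6.0, 8.0]    # targets
accumulation_steps = len(xs)

# Reference: gradient of the mean loss over the full batch of four samples.
full_batch_grad = sum(grad_squared_error(w, x, t) for x, t in zip(xs, ts)) / len(xs)

# Gradient accumulation with the normalized loss, one sample at a time.
accumulated_grad = 0.0
for x, t in zip(xs, ts):
    accumulated_grad += grad_squared_error(w, x, t) / accumulation_steps

# Without the normalization, the accumulated gradient is too large
# by a factor of accumulation_steps.
unnormalized_grad = sum(grad_squared_error(w, x, t) for x, t in zip(xs, ts))

print(full_batch_grad, accumulated_grad, unnormalized_grad)
```

With the normalization, four micro-batches of size 1 produce the same update as a single batch of size 4, which is the behavior the training loop relies on.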


Compared with the same configuration run through the Trainer in pure transformers, the Accelerate loop in this run is somewhat slower and uses more memory:

* Time: 45.46 seconds (up from 42.19 seconds)
* GPU memory occupied: 6432 MB (up from 5142 MB)

Implementing these optimization techniques with Accelerate takes only a handful of lines of code and comes with the benefit of **more flexibility in the training loop**.
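
To put the two measurements in relative terms, the overhead can be computed directly from the numbers reported above. This is a small arithmetic sketch; the variable names are illustrative, and the figures come from a single run, so they should be read as indicative rather than as a benchmark.

```python
# Measurements reported above (single run each).
trainer_time_s, accelerate_time_s = 42.19, 45.46
trainer_mem_mb, accelerate_mem_mb = 5142, 6432

# Absolute differences between the two runs.
time_delta_s = accelerate_time_s - trainer_time_s
mem_delta_mb = accelerate_mem_mb - trainer_mem_mb

# Relative overhead of the Accelerate loop, in percent.
time_overhead_pct = (accelerate_time_s / trainer_time_s - 1) * 100
mem_overhead_pct = (accelerate_mem_mb / trainer_mem_mb - 1) * 100

print(f"Time: +{time_delta_s:.2f} s ({time_overhead_pct:.1f}% slower)")
print(f"Memory: +{mem_delta_mb} MB ({mem_overhead_pct:.1f}% more)")
```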