In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [2]:
from pynvml import *


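def print_gpu_utilization():
    nvmlInit()
    # Sum the used memory over every GPU that NVML can see, instead of
    # hard-coding indices 0 and 1 (which fails on a single-GPU machine).
    # NVML does not honor CUDA_VISIBLE_DEVICES, so GPUs that are not used
    # for training are counted as well.
    memory_used = 0
    for index in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(index)
        memory_used += nvmlDeviceGetMemoryInfo(handle).used
    print(f"GPU memory occupied: {memory_used//1024**2} MB.")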


def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()


print_gpu_utilization()

GPU memory occupied: 620 MB.


In [3]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased").to('cuda:0')
print_gpu_utilization()

  from .autonotebook import tqdm as notebook_tqdm
Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the mo

GPU memory occupied: 2564 MB.


In [4]:
default_args = {
    "output_dir": "outputs",
    "evaluation_strategy": "no",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}

In [5]:
import numpy as np
from datasets import Dataset

seq_len, dataset_size = 512, 512
dummy_data = {
    "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
    # The upper bound of randint is exclusive, so use 2 to get labels in {0, 1}.
    "labels": np.random.randint(0, 2, (dataset_size)),
}
ds = Dataset.from_dict(dummy_data)
ds.set_format("pt")

# Gradient Checkpointing
Even when we set the batch size to 1 and use gradient accumulation, we can still run out of memory when working with large models.

In order to compute the gradients during the backward pass, **all activations from the forward pass are normally saved**. This can create a big memory overhead.

Alternatively, one could discard all activations during the forward pass and recompute them on demand during the backward pass. This would, however, add significant computational overhead and slow down training.

**Gradient checkpointing** strikes a compromise between the two approaches and **saves selected activations** throughout the computational graph, so **only a fraction of the activations need to be re-computed for the gradients**. See this [article](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9) explaining the ideas behind gradient checkpointing.
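The trade-off described above can be illustrated with a toy example in plain Python. This is only a sketch under simplifying assumptions, not how PyTorch or the Trainer implement checkpointing: the "network" is a chain of scalar layers `y = tanh(w * x)`, and the function names and the `segment` parameter are made up for illustration. The baseline stores every layer input, while the checkpointed version stores only every `segment`-th input and recomputes the rest segment by segment during the backward pass.

```python
import math


def forward_layer(w, x):
    """One toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def backward_layer(w, x, grad_out):
    """Chain rule through one layer: grad_out * dy/dx."""
    y = math.tanh(w * x)
    return grad_out * (1.0 - y * y) * w


def grad_store_all(weights, x0):
    """Baseline: keep every layer input in memory for the backward pass."""
    inputs = []
    x = x0
    for w in weights:
        inputs.append(x)
        x = forward_layer(w, x)
    grad = 1.0  # gradient of the output with respect to itself
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad = backward_layer(w, x_in, grad)
    return grad, len(inputs)


def grad_checkpointed(weights, x0, segment=4):
    """Keep only every `segment`-th layer input and recompute the others."""
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = forward_layer(w, x)

    grad = 1.0
    peak_stored = len(checkpoints)
    # Process the segments from the last one to the first one.
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(weights))
        # Recompute the inputs of this segment from its checkpoint.
        seg_inputs = []
        x = checkpoints[start]
        for w in weights[start:end]:
            seg_inputs.append(x)
            x = forward_layer(w, x)
        # Simple upper bound on the number of values held at once.
        peak_stored = max(peak_stored, len(checkpoints) + len(seg_inputs))
        for w, x_in in zip(reversed(weights[start:end]), reversed(seg_inputs)):
            grad = backward_layer(w, x_in, grad)
    return grad, peak_stored


weights = [0.5 + 0.01 * i for i in range(16)]
full_grad, full_stored = grad_store_all(weights, 0.3)
ckpt_grad, ckpt_stored = grad_checkpointed(weights, 0.3, segment=4)
print(f"stored values: {full_stored} (all) vs. at most {ckpt_stored} (checkpointed)")
```

Both functions return the same gradient. The checkpointed version holds fewer values at any one time, at the cost of running each layer's forward computation a second time.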

To enable gradient checkpointing in the `Trainer`, we only need to pass `gradient_checkpointing=True` to the `TrainingArguments`. Everything else is handled under the hood:

In [7]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    per_device_train_batch_size=1, 
    gradient_accumulation_steps=4, 
    gradient_checkpointing=True, 
    **default_args
)

trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=ds
)

result = trainer.train()

print_summary(result)





Time: 63.28
Samples/second: 8.09
GPU memory occupied: 7078 MB.


Here we compare these results with the run that enabled gradient accumulation but disabled gradient checkpointing:

* Time: 48.18 s (increases to 63.28 s)
* Samples/second: 10.63 (decreases to 8.09)
* GPU memory occupied: 8590 MB **(decreases to 7078 MB)**

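We can see that this **saved some more memory**, but at the same time **training became slower**. A general rule of thumb is that **gradient checkpointing slows down training by about 20%**; in this run the drop in throughput was slightly larger, about 24% (from 10.63 to 8.09 samples per second).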
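To put the comparison above into relative terms, the changes can be computed directly from the reported numbers. The snippet below is a small sketch; the helper name `relative_change` and the dictionary keys are illustrative, and the values are copied from the list above.

```python
def relative_change(before, after):
    """Signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (without checkpointing, with checkpointing), copied from the comparison above
runs = {
    "time_s": (48.18, 63.28),
    "samples_per_s": (10.63, 8.09),
    "memory_mb": (8590, 7078),
}

changes = {name: relative_change(before, after) for name, (before, after) in runs.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```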

Next, let's train with a larger effective batch size of 64 (a per-device batch size of 8 with 8 gradient accumulation steps) to make better use of the available GPU resources.
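The batch size of 64 here is the effective batch size per optimizer step: the next cell uses `per_device_train_batch_size=8` and `gradient_accumulation_steps=8`. A minimal sketch of that arithmetic follows; the function names are illustrative, and the step count is a rough estimate that ignores how a given Trainer version treats a trailing partial accumulation window.

```python
import math


def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def approx_optimizer_steps(num_samples, per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Rough number of optimizer updates in one epoch."""
    batch = effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices)
    return math.ceil(num_samples / batch)


dataset_size = 512  # same size as the dummy dataset in this notebook

batch_8x8 = effective_batch_size(8, 8)
steps_8x8 = approx_optimizer_steps(dataset_size, 8, 8)
print(f"effective batch size: {batch_8x8}, about {steps_8x8} optimizer steps per epoch")
```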

In [6]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    per_device_train_batch_size=8, 
    gradient_accumulation_steps=8, 
    gradient_checkpointing=True, 
    **default_args
)

trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=ds
)

result = trainer.train()

print_summary(result)





Time: 46.53
Samples/second: 11.00
GPU memory occupied: 8030 MB.


Here we compare the results with the baseline metrics (with a batch size of 64 and gradient accumulation enabled):

* Time: 34.26 s (increases to 46.53 s)
* Samples/second: 14.94 (decreases to 11.00)
* GPU memory occupied: 20314 MB **(decreases to 8030 MB)**

Finally, let's find out **how large a batch size we can use to maximize the memory usage** of an NVIDIA GeForce RTX 3090 (with a total of 24 GB of CUDA memory). The answer is: **120**!
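The notebook does not say how this value was found. One possible approach, sketched here as an assumption, is a binary search over the batch size using a predicate that reports whether a given size fits in memory. In practice such a predicate could run a short training step and report failure on a CUDA out-of-memory error. Here `fake_fits` is a hypothetical stand-in whose limit of 120 simply mirrors the value reported above, so the sketch runs without a GPU.

```python
def largest_fitting_batch_size(fits, upper_bound):
    """Return the largest b in [1, upper_bound] for which fits(b) is True.

    Assumes that if a batch size fits, every smaller one fits too.
    Returns 0 if not even a batch size of 1 fits.
    """
    lo, hi = 0, upper_bound
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


def fake_fits(batch_size, limit=120):
    """Hypothetical stand-in for 'one training step ran without running out of memory'."""
    return batch_size <= limit


best = largest_fitting_batch_size(fake_fits, upper_bound=512)
print(f"largest batch size that fits: {best}")
```

With an upper bound of 512, the search needs about nine probes rather than one per candidate value, which matters when each probe is a real training step.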

In [6]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    per_device_train_batch_size=120, 
    gradient_accumulation_steps=1, 
    gradient_checkpointing=True, 
    **default_args
)

trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=ds
)

result = trainer.train()

print_summary(result)





Time: 43.78
Samples/second: 11.70
GPU memory occupied: 24600 MB.
