In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [2]:
from pynvml import *


def print_gpu_utilization():
    nvmlInit()
    # Sum the memory in use across every GPU that NVML can see.
    # NVML does not honor CUDA_VISIBLE_DEVICES, so hidden devices are counted too,
    # and a hard-coded index 1 would fail on a single-GPU machine.
    memory_used = 0
    for index in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(index)
        info = nvmlDeviceGetMemoryInfo(handle)
        memory_used += info.used
    print(f"GPU memory occupied: {memory_used//1024**2} MB.")


def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()


print_gpu_utilization()

GPU memory occupied: 620 MB.


In [3]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased").to('cuda:0')
print_gpu_utilization()

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the mo

GPU memory occupied: 2564 MB.


In [4]:
default_args = {
    "output_dir": "outputs",
    "evaluation_strategy": "no",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}

In [5]:
import numpy as np
from datasets import Dataset

seq_len, dataset_size = 512, 512
dummy_data = {
    "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
    # The upper bound of randint is exclusive, so use 2 to draw labels from {0, 1}.
    "labels": np.random.randint(0, 2, dataset_size),
}
ds = Dataset.from_dict(dummy_data)
ds.set_format("pt")

# Gradient Accumulation
The idea behind gradient accumulation is to **calculate the gradients for smaller batches and accumulate them**. When enough gradients are accumulated we run the model’s optimization step.

This way we can increase the effective batch size to numbers that would never fit into the GPU’s memory. The trade-off is that processing many small batches uses the GPU less efficiently, which can slow training down.
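To make the mechanics concrete, here is a minimal NumPy sketch of the accumulate-then-step pattern on a toy linear regression. It is only an illustration under simple assumptions (plain SGD, equally sized micro-batches) and is not how the Trainer implements it internally; the helper names `mse_grad` and `train_with_accumulation` are made up for this example.

```python
import numpy as np


def mse_grad(w, x, y):
    """Gradient of 0.5 * mean((x @ w - y) ** 2) with respect to w."""
    residual = x @ w - y
    return x.T @ residual / len(y)


def train_with_accumulation(w, x, y, micro_batch_size, accumulation_steps, lr=0.1):
    """One epoch of SGD that steps once every `accumulation_steps` micro-batches."""
    w = w.copy()
    accumulated = np.zeros_like(w)
    num_updates = 0
    for i in range(len(y) // micro_batch_size):
        start = i * micro_batch_size
        xb = x[start:start + micro_batch_size]
        yb = y[start:start + micro_batch_size]
        # Divide by accumulation_steps so the accumulated sum equals the
        # mean gradient over the whole effective batch.
        accumulated += mse_grad(w, xb, yb) / accumulation_steps
        if (i + 1) % accumulation_steps == 0:
            w -= lr * accumulated   # the optimization step
            accumulated[:] = 0.0    # reset for the next effective batch
            num_updates += 1
    return w, num_updates


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))
y = x @ np.array([1.0, -2.0, 0.5])
w0 = np.zeros(3)

# 4 micro-batches of 4 samples versus a single batch of 16 samples.
w_accum, updates_accum = train_with_accumulation(w0, x, y, micro_batch_size=4, accumulation_steps=4)
w_large, updates_large = train_with_accumulation(w0, x, y, micro_batch_size=16, accumulation_steps=1)

print(np.allclose(w_accum, w_large), updates_accum, updates_large)
```

In this toy setting both configurations take the same number of optimizer steps and end up with the same weights up to floating-point error, because the weights are not updated between micro-batches. Only the amount of data held in memory at once differs.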

We can use gradient accumulation in the Trainer by simply adding the <code>gradient_accumulation_steps</code> argument to TrainingArguments. Let’s see how it impacts the model’s memory footprint:

In [6]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    per_device_train_batch_size=1, # Note here
    gradient_accumulation_steps=4, # Note here
    **default_args
)

trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=ds
)

result = trainer.train()

print_summary(result)





Time: 48.18
Samples/second: 10.63
GPU memory occupied: 8590 MB.


Compared with the baseline run:

* Time: 40.96 seconds **(increased to 48.18 seconds)**
* Samples/second: 12.50 **(decreased to 10.63)**
* GPU memory occupied: 12852 MB **(decreased to 8590 MB)**

The **memory footprint dropped by about a third**, at the cost of a run that is **roughly 18% slower** than the vanilla run.
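As a quick check on how large these changes are relative to the baseline, the percentages can be computed directly from the numbers reported above. This is just arithmetic on the printed metrics; the dictionary keys and the `percent_change` helper are names chosen for this example.

```python
def percent_change(baseline, new):
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


# Metrics copied from the outputs above.
baseline = {"time_s": 40.96, "samples_per_s": 12.50, "memory_mb": 12852}
accumulated = {"time_s": 48.18, "samples_per_s": 10.63, "memory_mb": 8590}

changes = {name: percent_change(baseline[name], accumulated[name]) for name in baseline}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
# time_s: +17.6%
# samples_per_s: -15.0%
# memory_mb: -33.2%
```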

In general, you want to use as much of the GPU as possible. In the following run we therefore train with an effective batch size of 64 (a per-device batch size of 8 with 8 gradient accumulation steps) to **make better use of the available GPU resources**.

In [6]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    per_device_train_batch_size=8, # Note here
    gradient_accumulation_steps=8, # Note here
    **default_args
)

trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=ds
)

result = trainer.train()

print_summary(result)





Time: 34.26
Samples/second: 14.94
GPU memory occupied: 20314 MB.


Compared with the baseline run:

* Time: 40.96 seconds **(decreased to 34.26 seconds)**
* Samples/second: 12.50 **(increased to 14.94)**
* GPU memory occupied: 12852 MB **(increased to 20314 MB)**
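The effective batch sizes of the two accumulation runs above follow from multiplying the per-device batch size by the number of accumulation steps (and by the number of devices, assumed to be 1 here). The sketch below works this out for the 512-sample dummy dataset. The function names are hypothetical, and the step count is rounded up, which may differ from the Trainer's own rounding when the sizes do not divide evenly.

```python
import math


def effective_batch_size(per_device_batch_size, accumulation_steps, num_devices=1):
    """Number of samples that contribute to each optimizer step."""
    return per_device_batch_size * accumulation_steps * num_devices


def optimizer_steps_per_epoch(dataset_size, per_device_batch_size,
                              accumulation_steps, num_devices=1):
    """Approximate optimizer steps per epoch (exact when the sizes divide evenly)."""
    batch = effective_batch_size(per_device_batch_size, accumulation_steps, num_devices)
    return math.ceil(dataset_size / batch)


dataset_size = 512  # size of the dummy dataset used in this notebook
# (per_device_train_batch_size, gradient_accumulation_steps) for each run
runs = {"first run": (1, 4), "second run": (8, 8)}

summary = {}
for name, (batch_size, steps) in runs.items():
    summary[name] = (
        effective_batch_size(batch_size, steps),
        optimizer_steps_per_epoch(dataset_size, batch_size, steps),
    )
    print(f"{name}: effective batch size {summary[name][0]}, "
          f"{summary[name][1]} optimizer steps per epoch")
# first run: effective batch size 4, 128 optimizer steps per epoch
# second run: effective batch size 64, 8 optimizer steps per epoch
```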

**Think: What would happen if we used distributed training to do gradient accumulation in parallel?**