In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [2]:
from pynvml import (
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
    nvmlInit,
)


def print_gpu_utilization():
    nvmlInit()
    # Sum the used memory over every GPU that NVML reports. NVML ignores
    # CUDA_VISIBLE_DEVICES, so GPUs hidden from CUDA are counted as well.
    memory_used = 0
    for index in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(index)
        memory_used += nvmlDeviceGetMemoryInfo(handle).used
    print(f"GPU memory occupied: {memory_used//1024**2} MB.")


def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()


print_gpu_utilization()

GPU memory occupied: 620 MB.


In [3]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased").to('cuda:0')
print_gpu_utilization()

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the mo

GPU memory occupied: 2564 MB.


In [4]:
default_args = {
    "output_dir": "outputs",
    "evaluation_strategy": "no",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}

In [5]:
import numpy as np
from datasets import Dataset

seq_len, dataset_size = 512, 512
dummy_data = {
    "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
    # randint's upper bound is exclusive, so (0, 2) yields labels in {0, 1}
    "labels": np.random.randint(0, 2, dataset_size),
}
ds = Dataset.from_dict(dummy_data)
ds.set_format("pt")

# Mixed Precision Training
Let’s have a look at another method with which we can regain some speed: mixed precision training. The idea of mixed precision training is that **not all variables need to be stored in full (32-bit) floating point precision**. **If we can reduce the precision, the variables take less memory and the computations on them run faster.**

Here are the commonly used floating point data types. The choice among them affects both memory usage and throughput:

* fp32 (float32)
* fp16 (float16)
* bf16 (bfloat16)
* tf32 (CUDA internal data type)
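To make the differences between these formats concrete, the following sketch derives the largest finite value and the machine epsilon of each format from its bit layout. The bit layouts are the standard published ones; the `FORMATS` table and the helper names are illustrative only and do not come from any library.

```python
# (exponent bits, mantissa bits) per format; one sign bit is implied.
FORMATS = {
    "fp32": (8, 23),
    "fp16": (5, 10),
    "bf16": (8, 7),
    "tf32": (8, 10),  # used inside tensor-core math; still stored as 32 bits
}


def max_finite(exp_bits, man_bits):
    """Largest finite value of an IEEE-style binary floating point format."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2.0 ** -man_bits) * 2.0 ** bias


def epsilon(man_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -man_bits


for name, (exp_bits, man_bits) in FORMATS.items():
    print(f"{name}: max={max_finite(exp_bits, man_bits):.4g}, eps={epsilon(man_bits):.3g}")
```

The output shows the trade-off: fp16 has finer precision than bf16 but a much smaller range, while bf16 keeps the range of fp32 at a coarser precision.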

While fp32 and fp16 have been around for quite some time, bf16 and tf32 are only available on GPUs with the Ampere architecture or newer. TPUs support bf16 as well. Let’s start with the most commonly used method, which is **FP16 training**.

**Although the gradients are also computed in half precision, they are converted back to full precision for the optimization step, so no memory is saved here.**

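Because the **model weights are kept on the GPU in both 16-bit and 32-bit precision**, mixed precision can use **more GPU memory for the weights (1.5x the size of the original model)**. This matters most for small batch sizes, where the activations are small and there is little to save on them.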
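The 1.5x figure follows from simple arithmetic: 4 bytes per parameter for the fp32 weights plus 2 bytes per parameter for the fp16 copy. A rough sketch, assuming a parameter count of about 336 million for `bert-large-uncased` (an approximate figure) and counting weights only, with no gradients, optimizer states, or activations:

```python
BYTES_FP32 = 4
BYTES_FP16 = 2


def weight_memory_mb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in MB (1 MB = 1024**2 bytes)."""
    return num_params * bytes_per_param / 1024**2


num_params = 336_000_000  # assumed, approximate size of bert-large-uncased

fp32_only = weight_memory_mb(num_params, BYTES_FP32)
mixed = weight_memory_mb(num_params, BYTES_FP32 + BYTES_FP16)

print(f"fp32 weights only: {fp32_only:.0f} MB")
print(f"fp32 + fp16 copy:  {mixed:.0f} MB")
print(f"ratio:             {mixed / fp32_only:.1f}x")
```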

The main advantage comes from **saving the activations in half (16-bit) precision**. 
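In plain PyTorch, one training step of this kind can be sketched with `torch.autocast`. This is a minimal illustration under assumptions, not the code that `Trainer` runs internally: the tiny `nn.Linear` model and the random batch are placeholders, and on a machine without a GPU the sketch falls back to bf16 on the CPU so that it still executes.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
# fp16 autocast targets CUDA; fall back to bf16 on CPU (assumption for portability)
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

model = nn.Linear(16, 2).to(device)  # placeholder model; weights stay in fp32
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Loss scaling guards fp16 gradients against underflow; disabled on CPU
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(8, 16, device=device)
y = torch.randint(0, 2, (8,), device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, dtype=amp_dtype):
    logits = model(x)  # the matmul runs in reduced precision

# Compute the loss in full precision for numerical stability
loss = nn.functional.cross_entropy(logits.float(), y)

scaler.scale(loss).backward()
scaler.step(optimizer)  # the optimizer updates the fp32 weights
scaler.update()

print(logits.dtype, model.weight.dtype, model.weight.grad.dtype)
```

The activations (`logits`) come out in reduced precision, while the weights and the gradients seen by the optimizer remain in fp32.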

Since some computations are performed in full and some in half precision, this approach is called mixed precision training. Enabling mixed precision training is also just a matter of setting the <code>fp16</code> flag to True:

In [6]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    per_device_train_batch_size=4, 
    fp16=True, 
    **default_args
)

trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=ds
)

result = trainer.train()

print_summary(result)





Time: 23.46
Samples/second: 21.82
GPU memory occupied: 11224 MB.


Here we compare the results with the vanilla training.

* Time: 40.96 **(Decrease to 23.46 seconds)**
* Samples/second: 12.50 **(Increase to 21.82 samples per second)**
* GPU memory occupied: 12852 MB. **(Decrease to 11224 MB)**

We can see that this is about 1.75x as fast as the vanilla training, while using roughly 13% less GPU memory.

Next, let’s add the mixed precision training to the previous methods:

In [6]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    fp16=True,
    **default_args,
)

trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)





Time: 39.06
Samples/second: 13.11
GPU memory occupied: 7574 MB.


Here we compare the results with the one with only fp16 enabled:

* Time: 23.46 (Increase to 39.06 seconds)
* Samples/second: 21.82 (Decrease to 13.11 samples per second)
* GPU memory occupied: 11224 MB **(Decrease to 7574 MB)**

Here we compare the results with the run that only enabled gradient accumulation and gradient checkpointing:

* Time: 63.28 **(Decrease to 39.06 seconds)**
* Samples/second: 8.09 **(Increase to 13.11 samples per second)**
* GPU memory occupied: 7078 MB (Increase to 7574 MB)

We can see that with these methods combined we use roughly **40% less GPU memory** than the vanilla training (7574 MB versus 12852 MB) while also being **slightly faster** (39.06 versus 40.96 seconds).
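The comparisons above can be put side by side with a few lines of arithmetic. The sketch below uses only the numbers reported in this section; the `runs` dictionary and the `relative` helper are illustrative names.

```python
# (runtime in seconds, samples/second, GPU memory in MB), as reported above
runs = {
    "vanilla": (40.96, 12.50, 12852),
    "fp16": (23.46, 21.82, 11224),
    "grad. accumulation + checkpointing": (63.28, 8.09, 7078),
    "all three combined": (39.06, 13.11, 7574),
}

base_time, _, base_mem = runs["vanilla"]


def relative(name):
    """Return (speedup, memory fraction) of a run relative to vanilla."""
    time, _, mem = runs[name]
    return base_time / time, mem / base_mem


for name in runs:
    speedup, mem_fraction = relative(name)
    print(f"{name:36s} speedup {speedup:.2f}x, memory {mem_fraction:.0%} of vanilla")
```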