In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# Performance and Scalability
Training larger and larger transformer models and deploying them to production comes with a range of challenges. During training, your model can **require more GPU memory than is available** or be **very slow to train**, and when you deploy it for inference, it can be **overwhelmed with the throughput** that is required in the production environment.

## Efficient Training on a Single GPU
In this section we have a look at a few tricks to **reduce the memory footprint** and **speed up training** for large models and how they are integrated in the Trainer and Accelerate.

|**Method**|**Speed**|**Memory**|
|:-:|:-:|:-:|
|Gradient accumulation|No|Yes|
|Gradient checkpointing|No|Yes|
|Mixed precision training|Yes|No|
|Batch size|Yes|Yes|
|Optimizer choice|Yes|Yes|
|DataLoader|Yes|No|
|DeepSpeed Zero|No|Yes|

First, we set up two helper functions that print summary statistics for GPU memory usage and the training run:

In [2]:
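from pynvml import (
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
    nvmlInit,
)


def print_gpu_utilization():
    """Print the memory in use, summed over every GPU that NVML reports."""
    nvmlInit()
    memory_used = 0
    # NVML ignores CUDA_VISIBLE_DEVICES, so this covers all physical GPUs
    # and also works on machines with a single GPU.
    for index in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(index)
        memory_used += nvmlDeviceGetMemoryInfo(handle).used
    print(f"GPU memory occupied: {memory_used//1024**2} MB.")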


def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()


print_gpu_utilization()

GPU memory occupied: 620 MB.


That looks good: only a small amount of GPU memory is occupied, as we would expect before loading any models. If that is not the case on your machine, make sure to stop all processes that are using GPU memory.

However, not all free GPU memory is available to the user. When a model is loaded onto the GPU, the CUDA kernels are loaded as well, which can take up 1-2 GB of memory. To see how much that is, we load a tiny tensor onto the GPU, which triggers the kernels to be loaded.

In [3]:
import torch

torch.ones((1, 1)).to("cuda:0")
print_gpu_utilization()

GPU memory occupied: 1276 MB.


We see that the kernels alone take up about 650 MB of GPU memory (1276 - 620 = 656 MB). Now let’s see how much space the model uses.

First, we load the `bert-large-uncased` model. We load the model weights directly onto the GPU so that we can check how much space the weights alone use.

In [4]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased").to('cuda:0')
print_gpu_utilization()

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the mo

GPU memory occupied: 2564 MB.


We can see that the model weights alone take up $(2564 - 1276) / 1024 \approx 1.3$ GB of the GPU memory.
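The same arithmetic can be scripted to sanity-check the readings. The sketch below is only illustrative: the helper names `delta_gb` and `implied_fp32_params` are hypothetical, and it assumes that the entire memory increase consists of fp32 weights at 4 bytes per parameter, which ignores any allocator overhead.

```python
MB = 1024 ** 2  # bytes per MB, matching the divisor in print_gpu_utilization()


def delta_gb(before_mb, after_mb):
    """Memory growth between two readings, in GB (1 GB = 1024 MB)."""
    return (after_mb - before_mb) / 1024


def implied_fp32_params(delta_mb):
    """Parameter count that would fill `delta_mb` at 4 bytes per parameter."""
    return delta_mb * MB // 4


kernels_mb = 1276 - 620   # readings before and after the first CUDA call
weights_mb = 2564 - 1276  # readings before and after loading the model

print(f"CUDA kernels: {kernels_mb} MB")
print(f"Model weights: {delta_gb(1276, 2564):.2f} GB")
print(f"Implied fp32 parameters: {implied_fp32_params(weights_mb):,}")
```

Because the readings are rounded to whole megabytes and include allocator overhead, the implied parameter count is an estimate rather than the exact size of the model.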

Now we can start training the model and see how the GPU memory consumption changes. First, we set up a few standard training arguments that we will use across all our experiments:

In [5]:
default_args = {
    "output_dir": "outputs",
    "evaluation_strategy": "no",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}

Then we create some dummy data: random token IDs between 100 and 30000 and binary labels for a classifier. In total, we get 512 sequences, each of length 512, and store them in a `Dataset` with PyTorch format.

In [6]:
import numpy as np
from datasets import Dataset

seq_len, dataset_size = 512, 512
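dummy_data = {
    # Random token IDs in [100, 30000).
    "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
    # The upper bound is exclusive, so use 2 to draw labels from {0, 1}.
    "labels": np.random.randint(0, 2, dataset_size),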
}
ds = Dataset.from_dict(dummy_data)
ds.set_format("pt")

As a first experiment, we use the Trainer to train the model without any further modifications, with a batch size of 4:

In [7]:
from transformers import TrainingArguments, Trainer, logging

logging.set_verbosity_error()

training_args = TrainingArguments(
    per_device_train_batch_size=4,
    **default_args,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds,
)

result = trainer.train()

print_summary(result)



{'train_runtime': 40.9582, 'train_samples_per_second': 12.501, 'train_steps_per_second': 3.125, 'train_loss': 0.03662079945206642, 'epoch': 1.0}
Time: 40.96
Samples/second: 12.50
GPU memory occupied: 12852 MB.


We see that even a relatively small batch size of 4 almost fills up our GPU’s memory. However, a larger batch size can often result in faster model convergence or better performance. So ideally we want to tune the batch size to our model’s needs and not to the GPU’s limitations.
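The reported throughput follows directly from the dataset size, the batch size, and the runtime. Here is a small sketch that recomputes it from the numbers above. The variable names are illustrative, and it assumes a single device and no gradient accumulation, so that each batch corresponds to one optimizer step.

```python
import math

dataset_size = 512       # number of dummy sequences
batch_size = 4           # per_device_train_batch_size
train_runtime = 40.9582  # seconds, taken from the Trainer metrics above

steps_per_epoch = math.ceil(dataset_size / batch_size)
samples_per_second = dataset_size / train_runtime
steps_per_second = steps_per_epoch / train_runtime

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Samples/second: {samples_per_second:.3f}")
print(f"Steps/second: {steps_per_second:.3f}")
```

The recomputed values agree with `train_samples_per_second` and `train_steps_per_second` in the metrics dictionary, which is a quick way to confirm that the batch size was applied as intended.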

## Anatomy of Model's Operations
What’s interesting is that we use much more memory than the size of the model. To better understand why this is the case, let’s have a look at a model’s operations and memory needs.

The Transformer architecture includes three main groups of operations, listed below in order of decreasing compute intensity.

### Tensor Contractions
Linear layers and components of Multi-Head Attention all do **batched matrix-matrix multiplications**. These operations are the most compute-intensive part of training a transformer.

### Statistical Normalizations
Softmax and layer normalization are less compute-intensive, and involve one or more **reduction operations**, the result of which is then applied via a map.

### Element-wise Operators
These are the remaining operators: **biases, dropout, activations, and residual connections**. These are the least compute-intensive operations.

This knowledge is helpful when analyzing performance bottlenecks.
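To make the difference in compute intensity concrete, here is a rough FLOP count for one linear layer and the bias add that follows it. It is a back-of-the-envelope sketch, not a profiler: the function names are hypothetical, a multiply-add is counted as 2 FLOPs, and the shapes (batch size 4, sequence length 512, hidden size 1024) are assumptions chosen for illustration.

```python
def matmul_flops(batch, seq_len, d_in, d_out):
    """Tensor contraction: (batch * seq_len, d_in) @ (d_in, d_out)."""
    return 2 * batch * seq_len * d_in * d_out


def elementwise_flops(batch, seq_len, d_out):
    """Element-wise op such as a bias add: one FLOP per output element."""
    return batch * seq_len * d_out


batch, seq_len, hidden = 4, 512, 1024  # illustrative shapes (assumed)

contraction = matmul_flops(batch, seq_len, hidden, hidden)
bias_add = elementwise_flops(batch, seq_len, hidden)

print(f"Linear layer matmul: {contraction:,} FLOPs")
print(f"Bias add:            {bias_add:,} FLOPs")
print(f"Ratio:               {contraction // bias_add}x")
```

Under these assumptions the ratio equals twice the hidden size, independent of batch size and sequence length, which is why the matrix multiplications dominate the compute.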

## Anatomy of Model's Memory
We've seen that training the model uses much more memory than just putting the model on the GPU. This is because there are many components during training that use GPU memory.

The following components occupy GPU memory during training:
1. **model weights**
   * 4 bytes * number of parameters for fp32 training
   * 6 bytes * number of parameters for **mixed precision training** (maintains a model in fp32 and one in fp16 in memory)
2. **optimizer states**
   * 8 bytes * number of parameters for normal AdamW (maintains 2 states)
   * 2 bytes * number of parameters for 8-bit AdamW optimizers like [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
   * 4 bytes * number of parameters for optimizers like SGD with momentum (maintains only 1 state)
3. **gradients**
   * 4 bytes * number of parameters for either fp32 or mixed precision training (gradients are always kept in fp32)
4. **forward activations** (saved for gradient computation)
   * size depends on many factors, the key ones being sequence length, hidden size and batch size.
5. **temporary buffers**
   * Temporary variables are released once the calculation is done, but in the meantime they can require additional memory and trigger a CUDA out-of-memory error. It is therefore crucial to free them explicitly as soon as they are no longer needed.
6. **functionality-specific memory**
   * Your software may have special memory needs. For example, when generating text with beam search, it has to maintain multiple copies of inputs and outputs.
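The per-parameter costs of the first three components can be combined into a rough estimate of the static memory needed for training. The following is a minimal sketch under the byte counts listed above. The dictionary keys, the function names, and the parameter count of 340 million are hypothetical and used for illustration only.

```python
# Bytes per parameter, taken from the list above.
WEIGHTS = {"fp32": 4, "mixed": 6}
OPTIMIZER = {"adamw": 8, "adamw_8bit": 2, "sgd_momentum": 4}
GRADIENTS = 4  # kept in fp32 in both precision modes


def static_bytes_per_param(precision="fp32", optimizer="adamw"):
    """Weights + optimizer states + gradients, in bytes per parameter."""
    return WEIGHTS[precision] + OPTIMIZER[optimizer] + GRADIENTS


def static_memory_gb(num_params, precision="fp32", optimizer="adamw"):
    """Static training memory in GB (1 GB = 1024**3 bytes)."""
    return num_params * static_bytes_per_param(precision, optimizer) / 1024 ** 3


num_params = 340_000_000  # hypothetical parameter count for illustration

for precision in WEIGHTS:
    for optimizer in OPTIMIZER:
        gb = static_memory_gb(num_params, precision, optimizer)
        print(f"{precision:>5} + {optimizer:<12}: {gb:.2f} GB")
```

Forward activations, temporary buffers, and functionality-specific memory come on top of this and depend on batch size and sequence length, so the result is best read as a lower bound.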

So there are potentially a few places where we could save GPU memory or speed up operations. 