## Running on multiple GPUs using Hugging Face Transformers

Naive pipeline parallelism is supported out of the box. For this, simply load the model with device_map="auto", which automatically places the different layers on the available GPUs.
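To make the idea of layer placement concrete, here is a minimal pure-Python sketch. It is not the placement algorithm the library actually uses, which also accounts for per-device memory limits, the embeddings, and the output head. The helpers `split_layers` and `run_order` are hypothetical names introduced for illustration, and the figure of 32 decoder layers is an assumption about the model used in this exercise.

```python
def split_layers(num_layers: int, num_gpus: int) -> dict[int, int]:
    """Assign contiguous blocks of layers to GPUs as evenly as possible.

    Hypothetical helper for illustration only; real placement also
    weighs layer sizes against the free memory on each device.
    """
    base, extra = divmod(num_layers, num_gpus)
    device_map = {}
    layer = 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs take one additional layer.
        count = base + (1 if gpu < extra else 0)
        for _ in range(count):
            device_map[layer] = gpu
            layer += 1
    return device_map


def run_order(device_map: dict[int, int]) -> list[int]:
    """List the GPUs visited in one forward pass, in order.

    Layers execute strictly in sequence, so only one GPU works at a time.
    """
    order = []
    for layer in sorted(device_map):
        if not order or order[-1] != device_map[layer]:
            order.append(device_map[layer])
    return order


# Assumed: 32 decoder layers split over the two GPUs of this exercise.
placement = split_layers(num_layers=32, num_gpus=2)
print(placement[0], placement[15], placement[16], placement[31])  # 0 0 1 1
print(run_order(placement))  # [0, 1]
```

Because the activations pass from one GPU to the next in sequence, this scheme spreads the memory footprint but does not speed up a single forward pass.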

Your task:

1. Create a pod with two 24GB GPUs.

2. Load the model with device_map="auto" and see how much VRAM is used. Note that from_pretrained takes device_map, not device, so the runs labelled device="auto" and device_map="auto" below both use this setting. It places the layers across the available GPUs automatically and runs them in sequence, so only one GPU is busy at a time.


In [1]:
model_path = "/ssdshare/share/Meta-Llama-3-8B-Instruct/"
# TODO(Your Task): Load the model to multiple GPUs and check the GPU memory usage
from accelerate.utils import release_memory
from transformers import AutoModelForCausalLM
import torch


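def get_gpu_memory(model):
    # Peak memory allocated on the current CUDA device only (cuda:0 by default), in GiB.
    memory = torch.cuda.max_memory_allocated() / (1024**3)
    # This clears only the local name; the caller's `model` still holds the weights,
    # so they stay on the GPU until the caller rebinds or deletes that variable.
    model = release_memory(model)
    # Reset the peak counter to the amount that is currently allocated.
    torch.cuda.reset_peak_memory_stats()
    return memory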


model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="cuda:0"
)
print(f'device_map="cuda:0": {get_gpu_memory(model)}GB')
# Printed as device="auto", but this load uses the same device_map="auto" as the next one.
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
print(f'device="auto": {get_gpu_memory(model)}GB')
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
print(f'device_map="auto": {get_gpu_memory(model)}GB')

device_map="cuda:0": 14.95752763748169GB


device="auto": 21.623756885528564GB


device_map="auto": 13.33245849609375GB
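
When a model is spread over several GPUs, a single call to `torch.cuda.max_memory_allocated()` reports one device only. The following is a hedged sketch of a per-device measurement, not the assignment's reference solution. The helper names `bytes_to_gib`, `peak_memory_per_gpu`, and `free_gpu_memory` are hypothetical, and the commented usage assumes the `model_path` and imports from the cell above.

```python
import gc


def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 1024**3


def peak_memory_per_gpu() -> dict[int, float]:
    """Return the peak allocated memory in GiB for every visible GPU."""
    import torch  # imported lazily so bytes_to_gib works without PyTorch

    return {
        i: bytes_to_gib(torch.cuda.max_memory_allocated(i))
        for i in range(torch.cuda.device_count())
    }


def free_gpu_memory() -> None:
    """Release cached blocks and reset the peak counters on every GPU.

    Call this only after the caller has deleted its own reference to the
    model (for example with `del model`); otherwise the weights stay allocated.
    """
    import torch

    gc.collect()
    torch.cuda.empty_cache()
    for i in range(torch.cuda.device_count()):
        torch.cuda.reset_peak_memory_stats(i)


# Assumed usage, reusing names from the cell above:
# model = AutoModelForCausalLM.from_pretrained(
#     model_path, torch_dtype=torch.bfloat16, device_map="auto"
# )
# print(peak_memory_per_gpu())
# del model
# free_gpu_memory()
```

Deleting the reference in the calling scope before loading the next model keeps two sets of weights from being resident at the same time.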


The GPU memory usage of loading the model to only one GPU is \_\_\_\_\_\_\_\_.

The GPU memory usage of loading the model with device="auto" is \_\_\_\_\_\_\_\_. The GPU memory usage of loading the model with device_map="auto" is \_\_\_\_\_\_\_\_.

The number of GPUs you used is \_\_\_\_\_\_\_\_.

Do the numbers above make sense?


In [3]:
print("""
The GPU memory usage of loading the model to only one GPU is 15GB.

The GPU memory usage of loading the model with device="auto" is 22GB. The GPU memory usage of loading the model with device_map="auto" is 13GB.

The number of GPUs you used is 2.

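They don't make sense: the memory usage with device="auto" is significantly higher than with the other two methods even though the code loads it with the same device_map="auto" setting, and the usage with device_map="cuda:0" is almost as low as that with device_map="auto". The likely cause is the measurement itself: the helper reads the peak on cuda:0 only, and the previous model is still in memory while the next one loads, so 22GB is about 15GB + 6.7GB and 13GB is about 2 x 6.7GB.
""")


The GPU memory usage of loading the model to only one GPU is 15GB.

The GPU memory usage of loading the model with device="auto" is 22GB. The GPU memory usage of loading the model with device_map="auto" is 13GB.

The number of GPUs you used is 2.

They don't make sense: the memory usage with device="auto" is significantly higher than with the other two methods even though the code loads it with the same device_map="auto" setting, and the usage with device_map="cuda:0" is almost as low as that with device_map="auto". The likely cause is the measurement itself: the helper reads the peak on cuda:0 only, and the previous model is still in memory while the next one loads, so 22GB is about 15GB + 6.7GB and 13GB is about 2 x 6.7GB.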
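
A short arithmetic check helps interpret the three measurements. The sketch below is one reading that is consistent with the printed numbers, not a confirmed diagnosis. It assumes a parameter count of 8,030,261,248 for the model (an assumption not stated in this notebook) and 2 bytes per parameter for bfloat16. The three measured values are copied from the output above.

```python
GIB = 1024**3

# Assumed parameter count for the model; not stated in the notebook.
num_params = 8_030_261_248
bytes_per_param = 2  # bfloat16 stores each parameter in 2 bytes

weights_gib = num_params * bytes_per_param / GIB

# Measured peaks copied from the output above (as reported on cuda:0).
first = 14.95752763748169    # device_map="cuda:0"
second = 21.623756885528564  # labelled device="auto"
third = 13.33245849609375    # device_map="auto"

# If the previous model is still resident while the next one loads,
# each later peak is "what was left on cuda:0" plus "the new share".
share_from_second = second - first  # new share on top of the full model
share_from_third = third / 2        # two equal shares resident together

print(f"{weights_gib:.2f}")        # 14.96
print(f"{share_from_second:.2f}")  # 6.67
print(f"{share_from_third:.2f}")   # 6.67
```

The first figure matches the expected size of the bfloat16 weights, and the other two agree on a share of about 6.67 GiB on cuda:0, which is what leftover weights from the previous load would produce.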

