
Getting OOM error in 4xTeslaT4. Azure VM NC64as_T4_v3 #6

Open
saptarshidatta96 opened this issue Nov 27, 2023 · 3 comments
saptarshidatta96 commented Nov 27, 2023

Hi,

I am running on 4x Tesla T4, so the total VRAM is around 4 × 16 = 64 GB. The Azure VM being used is NC64as_T4_v3.

The command I am running is:
torchrun --nnodes=1 --nproc-per-node=4 train.py

I am getting the error below across all 4 GPUs. A sample error for GPU 3 is:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.49GiB. GPU3 has a total capacity of 14.58 GiB of which 233.75MiB is free.

I was under the impression that the model would be distributed across the 4 GPUs with a cumulative VRAM size of 64 GB, and that I would not need to use QLoRA for fine-tuning.

Can you please tell me if I am missing something?
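For reference, my understanding is that plain torchrun + DDP keeps a full replica of the model on every rank, so the 4 GPUs do not pool into a single 64 GB space; something like FSDP would be needed to actually shard the parameters. A rough sketch of what I mean (build_model is just a placeholder, not this repo's train.py):

```python
# Rough sketch only -- build_model() is a stand-in, not the actual model from train.py.
# With plain DDP every rank holds a full copy of the model; FSDP shards the parameters
# and optimizer state across ranks, shrinking the per-GPU footprint.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def build_model():
    return torch.nn.Linear(4096, 4096)  # placeholder module

dist.init_process_group("nccl")  # torchrun sets the required env vars
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = FSDP(build_model().to(local_rank))  # sharded across the 4 GPUs instead of replicated
```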

saptarshidatta96 changed the title from "Getting OOM error in 4xTeslaT4" to "Getting OOM error in 4xTeslaT4. Azure VM NC64as_T4_v3" on Nov 27, 2023
shepardyan commented

Same here. I used a quad-RTX 4090 setup (~96GB VRAM) for testing, but it still ran into OOM.


shun1267 commented Apr 6, 2024

I was able to run the code successfully on a machine with 4x RTX 3090 (96 GB of VRAM in total) by setting "train_batch_size" and "validation_batch_size" both to 1 in "train.py". (As suggested, you may also lower the learning rate.)

[Execution results for 3 epochs (screenshot)]

Based on the nvidia-smi output, it seems the execution requires close to the full 96GB of VRAM, as all available memory was nearly used up during the process.

[nvidia-smi output (screenshot)]
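For anyone else trying this, the change is roughly the following. Only the two variable names come from train.py as described above; the dummy datasets and DataLoader wiring here are just illustrative:

```python
# Illustrative only: the names train_batch_size / validation_batch_size come from the
# comment above; the datasets are dummies standing in for whatever train.py loads.
import torch
from torch.utils.data import DataLoader, TensorDataset

train_batch_size = 1        # reduced to 1 to fit in 4x 24 GB of VRAM
validation_batch_size = 1

train_dataset = TensorDataset(torch.randn(8, 16), torch.zeros(8, dtype=torch.long))
val_dataset = TensorDataset(torch.randn(4, 16), torch.zeros(4, dtype=torch.long))

train_loader = DataLoader(train_dataset, batch_size=train_batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=validation_batch_size)
```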


shepardyan commented Apr 7, 2024


Thank you for your test! I was able to run the code on my quad-4090 setup now (with both batch sizes set to 1). That said, on quad 4090s the performance may not be satisfactory due to the limited card-to-card bandwidth without NVLink.
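As a quick way to check the interconnect on a setup like this, `nvidia-smi topo -m` prints the link matrix, and PyTorch can report whether peer-to-peer access between cards is available. A small sketch (illustrative, not from this repo):

```python
# Report whether each GPU pair can use direct peer-to-peer access.
# Consumer 4090s typically lack NVLink, so card-to-card traffic goes over PCIe.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} <-> GPU{j}: P2P {'available' if ok else 'not available'}")
```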
