
Getting OOM error in 4xTeslaT4. Azure VM NC64as_T4_v3 #6

Open
saptarshidatta96 opened this issue Nov 27, 2023 · 3 comments
saptarshidatta96 commented Nov 27, 2023

Hi,

I am running on 4x Tesla T4, so the total VRAM is around 4 × 16 = 64 GB. The Azure VM being used is NC64as_T4_v3.

The command I am running is:
torchrun --nnodes=1 --nproc-per-node=4 train.py

I am getting the error below across all 4 GPUs. A sample error for GPU 3 is:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.49GiB. GPU3 has a total capacity of 14.58 GiB of which 233.75MiB is free.

I was under the impression that the model would be distributed across the 4 GPUs with a cumulative VRAM size of 64 GB, and that I would not need to use QLoRA for fine-tuning.

Can you please tell me if I am missing something?
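For reference, my understanding is that plain torchrun + DDP keeps a full replica of the model on every rank, so the 4 GPUs do not pool into a single 64 GB space; something like FSDP would be needed to actually shard the parameters. A rough sketch of what I mean (build_model is just a placeholder, not this repo's train.py):

```python
# Rough sketch only -- build_model() is a stand-in, not the actual model from train.py.
# With plain DDP every rank holds a full copy of the model; FSDP shards the parameters
# and optimizer state across ranks, shrinking the per-GPU footprint.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def build_model():
    return torch.nn.Linear(4096, 4096)  # placeholder module

dist.init_process_group("nccl")  # torchrun sets the required env vars
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = FSDP(build_model().to(local_rank))  # sharded across the 4 GPUs instead of replicated
```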

saptarshidatta96 changed the title from "Getting OOM error in 4xTeslaT4" to "Getting OOM error in 4xTeslaT4. Azure VM NC64as_T4_v3" on Nov 27, 2023
shepardyan commented

Same here. I used a quad-RTX 4090 setup (~96GB VRAM) for testing, but it still ran into OOM.


shun1267 commented Apr 6, 2024

I was able to run the code successfully on a machine with 4x RTX 3090 (96 GB of VRAM in total) by setting "train_batch_size" and "validation_batch_size" both to 1 in "train.py". (As suggested, you may also lower the learning rate.)

[Execution results for 3 epochs (screenshot)]

Based on the nvidia-smi output, it seems the execution requires close to the full 96GB of VRAM, as all available memory was nearly used up during the process.

[nvidia-smi output (screenshot)]
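For anyone else trying this, the change is roughly the following. Only the two variable names come from train.py as described above; the dummy datasets and DataLoader wiring here are just illustrative:

```python
# Illustrative only: the names train_batch_size / validation_batch_size come from the
# comment above; the datasets are dummies standing in for whatever train.py loads.
import torch
from torch.utils.data import DataLoader, TensorDataset

train_batch_size = 1        # reduced to 1 to fit in 4x 24 GB of VRAM
validation_batch_size = 1

train_dataset = TensorDataset(torch.randn(8, 16), torch.zeros(8, dtype=torch.long))
val_dataset = TensorDataset(torch.randn(4, 16), torch.zeros(4, dtype=torch.long))

train_loader = DataLoader(train_dataset, batch_size=train_batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=validation_batch_size)
```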


shepardyan commented Apr 7, 2024


Thank you for your test! I was able to run the code on my quad-4090 setup now (with both batch sizes set to 1). That said, on quad 4090s the performance may not be satisfactory due to the limited card-to-card bandwidth without NVLink.
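As a quick way to check the interconnect on a setup like this, `nvidia-smi topo -m` prints the link matrix, and PyTorch can report whether peer-to-peer access between cards is available. A small sketch (illustrative, not from this repo):

```python
# Report whether each GPU pair can use direct peer-to-peer access.
# Consumer 4090s typically lack NVLink, so card-to-card traffic goes over PCIe.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} <-> GPU{j}: P2P {'available' if ok else 'not available'}")
```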
