Getting OOM error on 4x Tesla T4 (Azure VM NC64as_T4_v3) #6
Hi,
I am running on 4x Tesla T4 GPUs, so the total vRAM is about 4 × 16 GB = 64 GB. The Azure VM being used is NC64as_T4_v3.
The command I am running is:
torchrun --nnodes=1 --nproc-per-node=4 train.py
I am getting the error below on all 4 GPUs. A sample error from GPU 3:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.49 GiB. GPU 3 has a total capacity of 14.58 GiB of which 233.75 MiB is free.
I was under the impression that the model would be distributed across the 4 GPUs, with a cumulative vRAM of 64 GB, and that I would not need to use QLoRA for fine-tuning.
Can you please tell me if I am missing something?
Comments:
Same here. I used a quad-RTX 4090 setup (~96 GB VRAM) for testing, but it still ran into OOM.
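Both reports are consistent with each process holding a full model replica rather than a shard: with plain DDP, torchrun --nproc-per-node=4 starts four processes that each load the entire model, so the effective budget is a single GPU's memory (~14.6 GiB on a T4), not the cumulative 64 GB. As a minimal sketch only (this is not the repository's actual train.py, and the model below is a placeholder), sharding parameters, gradients, and optimizer state across ranks with PyTorch FSDP would look roughly like this:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# torchrun sets LOCAL_RANK for each spawned process.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(4096, 4096)  # stand-in for whatever model train.py builds

# FSDP shards the wrapped module's parameters, gradients, and optimizer state
# across all ranks, so peak per-GPU memory scales down roughly with world size
# instead of each GPU holding a full replica.
model = FSDP(model, device_id=local_rank)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```

Launched the same way (torchrun --nnodes=1 --nproc-per-node=4 sketch.py), each rank would then hold roughly a quarter of the parameters plus its own activation memory, though activations and optimizer choice can still push a 16 GB card over the limit.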