how to run the largest possible model on a single A100 80Gb #101
Comments
I had an idea that you might use GPU virtualisation to make it appear like you had two GPUs. Well, it wasn't my idea, it was ChatGPT's.
See: memory requirements for each model size. It's a small repo that's part of my own experiments, where I also document things I've learned. While the official documentation is lacking right now, you can also learn from the good discussions around this project in various GitHub issues. If I'm not mistaken, 65B needs a GPU cluster with a total of ~250GB of VRAM in fp16 precision, or half that in int8. (I'm not affiliated with FAIR.)
It seems that torch.distributed does not support GPU virtualization.
I was able to run the 13B and 30B (batch size 1) models on a single A100-80GB. I used a script [1] to reshard the models and ran torchrun with --nproc_per_node 1. [1] https://gist.github.com/benob/4850a0210b01672175942203aa36d300
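For anyone curious what the resharding actually does, here is a rough sketch of the idea; the linked gist is the tested version, and the per-weight split dimensions below are my assumption based on LLaMA's column-/row-parallel layout, so treat this as illustrative rather than a drop-in replacement.

```python
# Sketch: merge model-parallel LLaMA checkpoint shards into one MP=1 checkpoint.
# Assumption: column-parallel weights (wq, wk, wv, w1, w3, output) are split along
# dim 0, row-parallel weights (wo, w2) and tok_embeddings along dim 1, and norm
# weights are simply replicated across shards.
import glob
import torch

def merge_shards(ckpt_dir: str, out_path: str) -> None:
    paths = sorted(glob.glob(f"{ckpt_dir}/consolidated.*.pth"))
    # Note: this loads every shard into RAM at once, which is why resharding
    # the 30B/65B models needs a lot of system memory.
    shards = [torch.load(p, map_location="cpu") for p in paths]

    def cat_dim(key: str) -> int:
        # -1 means the tensor is replicated (e.g. layer norms, rope.freqs).
        if any(s in key for s in ("wq", "wk", "wv", "w1", "w3", "output")):
            return 0   # column-parallel: split along the output dimension
        if any(s in key for s in ("wo", "w2", "tok_embeddings")):
            return 1   # row-parallel / embedding: split along the input dimension
        return -1

    merged = {}
    for key in shards[0]:
        dim = cat_dim(key)
        merged[key] = shards[0][key] if dim < 0 else torch.cat(
            [s[key] for s in shards], dim=dim
        )
    torch.save(merged, out_path)

# Example (hypothetical paths); copy params.json alongside and run with MP=1.
merge_shards("30B", "30B-merged/consolidated.00.pth")
```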
@benob Thanks for the script! I successfully ran 13B with it, but resharding the 30B model failed because I ran out of memory (128GB of RAM is obviously not enough for this). Is there any way you could share your combined 30B model so I can try to run it on my A6000-48GB? Thank you so much in advance!
For one, you can't even run the 33B model in 16-bit mode on a 48GB card; you would need another 16GB+ of VRAM. But using the 8-bit fork you can load it without needing to reshard it.
@USBhost Got it, thanks for the clarification! Can you direct me to the 8-bit model/code for 33B so I can continue my experiments? :)
oobabooga/text-generation-webui#147 (comment). I'm a fellow A6000 user, too.
@USBhost Thank you! Have you managed to run the 33B model with it? I still get OOMs after model quantization.
I'm using ooba: `python server.py --listen --model LLaMA-30B --load-in-8bit --cai-chat`. If you just want to use LLaMA in 8-bit, then run with only 1 node.
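For reference, the webui's --load-in-8bit path goes through the Hugging Face transformers + bitsandbytes integration. A minimal stand-alone sketch, assuming the weights have already been converted to HF format (the local path below is hypothetical) and that a transformers build with LLaMA support and bitsandbytes is installed:

```python
# Sketch: load a converted LLaMA checkpoint in 8-bit without resharding.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/llama-30b-hf"  # hypothetical path to HF-converted weights
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,   # quantize weights to int8 with bitsandbytes at load time
    device_map="auto",   # place layers across available GPU/CPU memory
)

inputs = tokenizer(
    "The largest model that fits on an A100 80GB is", return_tensors="pt"
).to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```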
The linked memory requirement calculation table is adding the wrong rows together, I think. Correcting it gives roughly 70GB for 65B in 8-bit precision, which seems to more closely match the actual VRAM usage people are reporting in oobabooga/text-generation-webui#147. If this is true, then 65B should fit on a single A100 80GB after all.
Yours looks almost correct. I have updated the table based on your findings. To prevent all sorts of confusion, let's keep the precision in fp16 (before 8-bit quantization). I need to point out that when people report their actual VRAM usage, they never state the model arguments; the most important ones are the maximum batch size and maximum sequence length, since those drive the size of the Transformer kv cache. Based on the Transformer kv cache formula, the memory requirements in fp16 precision work out as follows.
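As a minimal sketch of that kind of estimate (not the commenters' exact spreadsheet): weights plus the standard kv cache size of 2 × n_layers × batch × seq_len × d_model elements, using the published LLaMA-65B hyperparameters. The batch size and sequence length below are assumptions; they are exactly the arguments people rarely report.

```python
# Rough fp16 memory estimate for LLaMA-65B: weights + kv cache.
n_params   = 65.2e9   # published LLaMA-65B parameter count
n_layers   = 80
d_model    = 8192     # = 64 heads * 128 head dim
seq_len    = 2048     # assumption
batch_size = 1        # assumption
bytes_fp16 = 2

weights_gb = n_params * bytes_fp16 / 1e9
# kv cache: K and V per layer, each of shape [batch, seq_len, d_model] in fp16
kv_cache_gb = 2 * n_layers * batch_size * seq_len * d_model * bytes_fp16 / 1e9

total_fp16 = weights_gb + kv_cache_gb
print(f"weights  ~{weights_gb:.1f} GB")    # ~130.4 GB
print(f"kv cache ~{kv_cache_gb:.1f} GB")   # ~5.4 GB
print(f"total    ~{total_fp16:.1f} GB fp16, ~{total_fp16 / 2:.1f} GB int8")
```

Small differences from the figures quoted below come down to rounding and what overhead you choose to count.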
Sounds right to me. I have also checked the above model memory requirements against the GPU requirements (from my repo), and the two closely match up now. For 65B: 133GB in fp16, 66.5GB in int8. Your estimate is 70.2GB; I guess the small difference is rounding error. 66.5GB of VRAM is within the range of a single A100 80GB, so it should fit.
4-bit for LLaMA is underway: oobabooga/text-generation-webui#177. 65B in int4 fits on a single 40GB GPU, even further reducing the cost to access this powerful model. Approximate int4 LLaMA VRAM usage: 30B should fit on 20GB+ cards, and 7B will fit on cards with under 8GB, or even 6GB, if they support int8.
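A back-of-the-envelope check of those int4 numbers, assuming ~0.5 bytes per parameter for the weights only; real usage adds the kv cache, activations, and whatever overhead the 4-bit kernels need.

```python
# Weight-only int4 footprint per LLaMA size (published parameter counts).
param_counts = {"7B": 6.7e9, "13B": 13.0e9, "30B": 32.5e9, "65B": 65.2e9}

for name, n in param_counts.items():
    print(f"{name}: ~{n * 0.5 / 1e9:.1f} GB of weights in int4")
# 7B: ~3.4 GB, 13B: ~6.5 GB, 30B: ~16.2 GB, 65B: ~32.6 GB
```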
I'm well aware. It's a matter of time.
Exciting, but the big question is: does this technique reduce model performance (both inference throughput and accuracy)? To my understanding, I couldn't find credible results proving that the model performance closely matches LLM.int8(). Furthermore, the custom CUDA kernel implementation makes deployment harder by a large margin. I can wait a little longer for bitsandbytes 4-bit. There is a lot more beyond GPTQ and bitsandbytes, like SmoothQuant, etc. I actually maintain a list of possible model compression and acceleration papers and methods here, if you're interested. Disclaimer: I'm not an expert in this area.
Thanks for the heads-up. I will update and add this to my repo accordingly.
That particular 4-bit implementation is mostly a proof of concept at this point. Bitsandbytes may be getting some 4-bit functionality towards the end of this month. Best to wait for that.
Old issue. With Llama 2, you should be able to run inference on the 70B model on a single A100 GPU with enough memory.
I was able to get the 7B model to work. It looks like I might be able to run the 33B version?
Will I need to merge the checkpoint files (.pth) and set MP = 1 to run on a single GPU?
It would be great if FAIR could provide some guidance on VRAM requirements.