
how to run the largest possible model on a single A100 80Gb #101

Closed
huey2531 opened this issue Mar 4, 2023 · 16 comments
Labels
model-usage (issues related to how models are used/loaded)

Comments


huey2531 commented Mar 4, 2023

I was able to get the 7B model to work. It looks like I might be able to run the 33B version?
Will I need to merge the checkpoint files (.pth) to run on a single GPU, and set MP = 1?

It would be great if FAIR could provide some guidance on VRAM requirements.

@elephantpanda

I had an idea that you might use GPU virtualisation to make it appear like you had two GPUs. Well, it wasn't my idea, it was ChatGPT's.


cedrickchee commented Mar 4, 2023

It would be great if FAIR could provide some guidance on VRAM requirements.

See: memory requirements for each model size.

It's a small page that's part of my own experiments, where I also document things I've learned. While the official documentation is lacking right now, you can also learn from the good discussions around this project in various GitHub issues.

If I'm not wrong, 65B needs a GPU cluster with a total of ~250GB of VRAM in fp16 precision, or half that in int8.

(I'm not affiliated with FAIR.)


benob commented Mar 4, 2023

I had an idea that you might use GPU virtualisation to make it appear like you had two GPUs. Well, it wasn't my idea, it was ChatGPT's.

Seems that torch.distributed does not support GPU virtualization.


benob commented Mar 4, 2023

I was able to run the 13B and 30B (batch size 1) models on a single A100-80GB. I used a script [1] to reshard the models and ran torchrun with --nproc_per_node 1.

[1] https://gist.github.com/benob/4850a0210b01672175942203aa36d300
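
For anyone curious what the reshard step involves, here is a minimal sketch of the kind of merge such a script performs. This is an illustration under assumptions, not benob's actual gist: the per-weight shard dimensions follow the reference model-parallel layout, and the paths are placeholders.

```python
# Sketch: merge MP=2 LLaMA checkpoint shards into a single MP=1 checkpoint.
# Assumption: ColumnParallel weights (wq/wk/wv/w1/w3/output) are split along
# dim 0, RowParallel weights (wo/w2) and the token embeddings along dim 1,
# and everything else (norms, rope.freqs) is replicated across shards.
import os
import torch

DIM0_SUFFIXES = ("wq.weight", "wk.weight", "wv.weight",
                 "w1.weight", "w3.weight", "output.weight")
DIM1_SUFFIXES = ("wo.weight", "w2.weight", "tok_embeddings.weight")

def merge_shards(shard_paths, out_path):
    shards = [torch.load(p, map_location="cpu") for p in shard_paths]
    merged = {}
    for key in shards[0]:
        if key.endswith(DIM0_SUFFIXES):
            merged[key] = torch.cat([s[key] for s in shards], dim=0)
        elif key.endswith(DIM1_SUFFIXES):
            merged[key] = torch.cat([s[key] for s in shards], dim=1)
        else:
            merged[key] = shards[0][key]  # replicated; take any shard's copy
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    torch.save(merged, out_path)

# Placeholder paths for a 13B checkpoint split into two shards.
merge_shards(["13B/consolidated.00.pth", "13B/consolidated.01.pth"],
             "13B-merged/consolidated.00.pth")
```

With the original params.json and tokenizer copied alongside the merged file, the model can then be launched with torchrun --nproc_per_node 1 as described above.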


baleksey commented Mar 4, 2023

@benob Thanks for the script! I successfully ran 13B with it, but I failed while resharding the 30B model as I ran out of memory (128GB of RAM is obviously not enough for this). Is there any way you could share your combined 30B model so I can try to run it on my A6000-48GB? Thank you so much in advance!


USBhost commented Mar 4, 2023

@benob Thanks for the script! I successfully ran 13B with it, but I failed while resharding the 30B model as I ran out of memory (128GB of RAM is obviously not enough for this). Is there any way you could share your combined 30B model so I can try to run it on my A6000-48GB? Thank you so much in advance!

For one, you can't even run the 33B model in 16-bit mode; you would need another 16GB+ of VRAM.

But using the 8-bit fork, you can load it without needing to reshard it.
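
For context, an 8-bit load in the Hugging Face stack (which the webui builds on) looks roughly like the sketch below. This assumes the weights have already been converted to HF format at a placeholder local path and that bitsandbytes is installed; it is not the fork's exact code.

```python
# Rough sketch of 8-bit inference via transformers + bitsandbytes.
# "./llama-30b-hf" is a placeholder path to HF-format LLaMA weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./llama-30b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,   # quantize weights to int8 at load time (bitsandbytes)
    device_map="auto",   # place layers on the available GPU(s) automatically
)

prompt = "The largest model that fits on a single A100 80GB is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```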


baleksey commented Mar 4, 2023

@USBhost Got it, thanks for the clarification! Can you point me to the 8-bit model/code for 33B so I can continue my experiments? :)


USBhost commented Mar 4, 2023

@USBhost Got it, thanks for the clarification! Can you point me to the 8-bit model/code for 33B so I can continue my experiments? :)

oobabooga/text-generation-webui#147 (comment)

I'm also a fellow A6000 user.


baleksey commented Mar 5, 2023

@USBhost Thank you! Have you managed to run the 33B model with it? I still get OOMs after model quantization...


USBhost commented Mar 5, 2023

@USBhost Thank you! Have you managed to run the 33B model with it? I still get OOMs after model quantization...

I'm using ooba: python server.py --listen --model LLaMA-30B --load-in-8bit --cai-chat

If you just want to use LLaMA-8bit, then only run with --nproc_per_node 1.


MarkSchmidty commented Mar 5, 2023

It would be great if FAIR could provide some guidance on VRAM requirements.

See: memory requirements for each model size.

It's a small page that's part of my own experiments, where I also document things I've learned. While the official documentation is lacking right now, you can also learn from the good discussions around this project in various GitHub issues.

If I'm not wrong, 65B needs a GPU cluster with a total of ~250GB of VRAM in fp16 precision, or half that in int8.

(I'm not affiliated with FAIR.)

The linked memory requirement calculation table is adding the wrong rows together, I think. The corrected table should look like:

Memory requirements in 8-bit precision:

Model (on disk)***          13     24     60    120
Memory Requirements (GB)   6.7     13   32.5   65.2
Cache (GB)                   1      2      3      5
Total (GB)                 7.7     16   35.5   70.2

This more closely matches the actual VRAM usage people are reporting in oobabooga/text-generation-webui#147.

For example:
"LLaMA-7B: 9225MiB"
"LLaMA-13B: 16249MiB"
"The 30B uses around 35GB of vram at 8bit."

If this is true then 65B should fit on a single A100 80GB after all.


cedrickchee commented Mar 6, 2023

@MarkSchmidty

The linked memory requirement calculation table is adding the wrong rows together

Not exactly.

The corrected table should look like: ...

Yours looks almost correct. I have updated the table based on your findings (a copy below).

Memory requirements in 8-bit precision:

To prevent all sorts of confusion, let's keep the precision in fp16 (before 8-bit quantization).

I need to point out that when people report their actual VRAM usage, they never state the model arguments. The most important ones are max_batch_size and max_seq_len, which impact the VRAM required (too large and you run into OOM). FAIR should really set max_batch_size to 1 by default; it's 32 now.

Based on the Transformer KV cache formula, and with max_batch_size = 1 and max_seq_len = 1024, the table now looks like this:

Memory requirements in fp16 precision.

Model Params (Billions)    6.7     13   32.5   65.2
Model on disk (GB)***       13     26     65    130
Cache (GB)                   1      1      2      3
Total (GB)                  14     27     67    133
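
For anyone who wants to reproduce the arithmetic, here is a small sketch of the calculation (layer counts and hidden sizes are from the LLaMA paper; the cache term is 2 tensors (K and V) * n_layers * dim * max_seq_len * max_batch_size * 2 bytes for fp16):

```python
# Sketch: estimate LLaMA memory = weights + preallocated fp16 KV cache.
CONFIGS = {          # params (billions), n_layers, hidden dim
    "7B":  (6.7, 32, 4096),
    "13B": (13.0, 40, 5120),
    "33B": (32.5, 60, 6656),
    "65B": (65.2, 80, 8192),
}

def estimate_gb(name, bytes_per_weight=2, max_seq_len=1024, max_batch_size=1):
    params_b, n_layers, dim = CONFIGS[name]
    weights = params_b * 1e9 * bytes_per_weight
    kv_cache = 2 * n_layers * dim * max_seq_len * max_batch_size * 2  # K and V, fp16
    return (weights + kv_cache) / 1e9

for name in CONFIGS:
    print(f"{name}: ~{estimate_gb(name):.0f} GB in fp16, "
          f"~{estimate_gb(name, bytes_per_weight=1):.0f} GB in int8")
```

Passing bytes_per_weight=1 gives the int8 column, and 0.5 gives a rough int4 estimate (quantization overhead not included).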

This seems to more closely match up with what I'm seeing people report their actual VRAM usage

Sounds right to me. I have also checked the above model memory requirements against the GPU requirements (from my repo). The two closely match up now.

If this is true then 65B should fit on a single A100 80GB after all.

133GB in fp16, 66.5GB in int8. Your estimate is 70.2GB; I guess the small difference is rounding error. 66.5GB of VRAM is within the range of 1x A100 80GB, so it should fit, though I couldn't confirm it at the time. It's true now: more people have gotten it working. One example is this:

I'm running LLaMA-65B on a single A100 80GB with 8bit quantization. ... The output is at least as good as davinci.


MarkSchmidty commented Mar 7, 2023

4-bit for LLaMA is underway: oobabooga/text-generation-webui#177

65B in int4 fits on a single 40GB GPU, further reducing the cost of access to this powerful model.

Int4 LLaMA VRAM usage is approximately as follows:

Model Params (Billions)    6.7     13   32.5   65.2
Model on disk (GB)          13     26     65    130
Int4 model size (GB)       3.2    6.5   16.2   32.5
Cache (GB)                   1      2      2      3
Total VRAM used (GB)       4.2    8.5   18.2   35.5

(30B should fit on 20GB+ cards, and 7B will fit on cards with under 8GB or even 6GB of VRAM, if they support int8.)


cedrickchee commented Mar 8, 2023

4-bit for LLaMA

I'm well aware; it's a matter of time.
This 4-bit post-training quantization technique is different from the one that Tim Dettmers is working on.

65B in int4 fits on a single 40GB GPU, further reducing the cost of access to this powerful model.

Exciting, but the big question is: does this technique reduce model performance (both inference throughput and accuracy)? To my understanding, I couldn't find reliable results proving that the model performance closely matches LLM.int8(). Furthermore, the custom CUDA kernel implementation makes deployment harder by a large margin. I can wait a little longer for bitsandbytes 4-bit. There is a lot more out there beyond GPTQ and bitsandbytes, like SmoothQuant, etc. I actually maintain a list of possible model compression and acceleration papers and methods here, if you're interested.

Disclaimer: I'm not an expert in this area.

Int4 LLaMA VRAM usage is approximately as follows:

Thanks for the heads-up. I will update and add this to my repo accordingly.

@MarkSchmidty

That particular 4-bit implementation is mostly a proof of concept at this point. Bitsandbytes may be getting some 4-bit functionality towards the end of this month. Best to wait for that.

@WuhanMonkey added the model-usage label (issues related to how models are used/loaded) on Sep 6, 2023
@amitsangani

Old issue. With Llama 2, you should be able to run inference with the Llama 70B model on a single A100 GPU with enough memory.
