how to run the largest possible model on a single A100 80Gb #101
Comments
I had an idea that you might use GPU virtualisation to make it appear like you had two GPUs. Well, it wasn't my idea, it was ChatGPT's.
See: memory requirements for each model size. It's a small repo that's part of my own experiments, where I also document things I've learned. While the official documentation is lacking right now, you can also learn from the good discussions around this project in various GitHub issues. If I'm not mistaken, 65B needs a GPU cluster with a total of ~250GB of VRAM in fp16 precision, or half that in int8. (I'm not affiliated with FAIR.)
It seems that torch.distributed does not support GPU virtualization.
I was able to run the 13B and 30B (batch size 1) models on a single A100-80GB. I used a script [1] to reshard the models and ran torchrun with --nproc_per_node 1. [1] https://gist.github.com/benob/4850a0210b01672175942203aa36d300
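For anyone curious what the resharding actually does, here is a rough sketch of the idea; the linked gist is the tested version, and the per-weight split dimensions below are my assumption based on LLaMA's column-/row-parallel layout, so treat this as illustrative rather than a drop-in replacement.

```python
# Sketch: merge model-parallel LLaMA checkpoint shards into one MP=1 checkpoint.
# Assumption: column-parallel weights (wq, wk, wv, w1, w3, output) are split along
# dim 0, row-parallel weights (wo, w2) and tok_embeddings along dim 1, and norm
# weights are simply replicated across shards.
import glob
import torch

def merge_shards(ckpt_dir: str, out_path: str) -> None:
    paths = sorted(glob.glob(f"{ckpt_dir}/consolidated.*.pth"))
    # Note: this loads every shard into RAM at once, which is why resharding
    # the 30B/65B models needs a lot of system memory.
    shards = [torch.load(p, map_location="cpu") for p in paths]

    def cat_dim(key: str) -> int:
        # -1 means the tensor is replicated (e.g. layer norms, rope.freqs).
        if any(s in key for s in ("wq", "wk", "wv", "w1", "w3", "output")):
            return 0   # column-parallel: split along the output dimension
        if any(s in key for s in ("wo", "w2", "tok_embeddings")):
            return 1   # row-parallel / embedding: split along the input dimension
        return -1

    merged = {}
    for key in shards[0]:
        dim = cat_dim(key)
        merged[key] = shards[0][key] if dim < 0 else torch.cat(
            [s[key] for s in shards], dim=dim
        )
    torch.save(merged, out_path)

# Example (hypothetical paths); copy params.json alongside and run with MP=1.
merge_shards("30B", "30B-merged/consolidated.00.pth")
```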
@benob Thanks for the script! I successfully ran 13B with it, but resharding the 30B model failed because I ran out of memory (128GB of RAM is obviously not enough for this). Is there any way you could share your combined 30B model so I can try to run it on my A6000-48GB? Thank you so much in advance!
For one, you can't even run the 33B model in 16-bit mode on a 48GB card; you would need another 16GB+ of VRAM. But using the 8-bit fork you can load it without needing to reshard it.
@USBhost Got it, thanks for the clarification! Can you direct me to the 8-bit model/code for 33B so I can continue my experiments? :)
oobabooga/text-generation-webui#147 (comment). I'm a fellow A6000 user, too.
@USBhost Thank you! Have you managed to run the 33B model with it? I still get OOMs after model quantization.
I'm using ooba: `python server.py --listen --model LLaMA-30B --load-in-8bit --cai-chat`. If you just want to use LLaMA in 8-bit, then run with only 1 node.
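For reference, the webui's --load-in-8bit path goes through the Hugging Face transformers + bitsandbytes integration. A minimal stand-alone sketch, assuming the weights have already been converted to HF format (the local path below is hypothetical) and that a transformers build with LLaMA support and bitsandbytes is installed:

```python
# Sketch: load a converted LLaMA checkpoint in 8-bit without resharding.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/llama-30b-hf"  # hypothetical path to HF-converted weights
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,   # quantize weights to int8 with bitsandbytes at load time
    device_map="auto",   # place layers across available GPU/CPU memory
)

inputs = tokenizer(
    "The largest model that fits on an A100 80GB is", return_tensors="pt"
).to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```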
The linked memory requirement calculation table is adding the wrong rows together, I think. Correcting it gives roughly 70GB for 65B in 8-bit precision, which seems to more closely match the actual VRAM usage people are reporting in oobabooga/text-generation-webui#147. If this is true, then 65B should fit on a single A100 80GB after all.
Yours looks almost correct. I have updated the table based on your findings. To prevent all sorts of confusion, let's keep the precision in fp16 (before 8-bit quantization). I need to point out that when people report their actual VRAM usage, they never state the model arguments; the most important ones are the maximum batch size and maximum sequence length, since those drive the size of the Transformer kv cache. Based on the Transformer kv cache formula, the memory requirements in fp16 precision work out as follows.
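As a minimal sketch of that kind of estimate (not the commenters' exact spreadsheet): weights plus the standard kv cache size of 2 × n_layers × batch × seq_len × d_model elements, using the published LLaMA-65B hyperparameters. The batch size and sequence length below are assumptions; they are exactly the arguments people rarely report.

```python
# Rough fp16 memory estimate for LLaMA-65B: weights + kv cache.
n_params   = 65.2e9   # published LLaMA-65B parameter count
n_layers   = 80
d_model    = 8192     # = 64 heads * 128 head dim
seq_len    = 2048     # assumption
batch_size = 1        # assumption
bytes_fp16 = 2

weights_gb = n_params * bytes_fp16 / 1e9
# kv cache: K and V per layer, each of shape [batch, seq_len, d_model] in fp16
kv_cache_gb = 2 * n_layers * batch_size * seq_len * d_model * bytes_fp16 / 1e9

total_fp16 = weights_gb + kv_cache_gb
print(f"weights  ~{weights_gb:.1f} GB")    # ~130.4 GB
print(f"kv cache ~{kv_cache_gb:.1f} GB")   # ~5.4 GB
print(f"total    ~{total_fp16:.1f} GB fp16, ~{total_fp16 / 2:.1f} GB int8")
```

Small differences from the figures quoted below come down to rounding and what overhead you choose to count.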
Sounds right to me. I have also checked the above model memory requirements against the GPU requirements (from my repo), and the two closely match up now. For 65B: 133GB in fp16, 66.5GB in int8. Your estimate is 70.2GB; I guess the small difference is rounding error. 66.5GB of VRAM is within the range of a single A100 80GB, so it should fit.
4-bit for LLaMA is underway: oobabooga/text-generation-webui#177. 65B in int4 fits on a single 40GB GPU, even further reducing the cost to access this powerful model. Approximate int4 LLaMA VRAM usage: 30B should fit on 20GB+ cards, and 7B will fit on cards with under 8GB, or even 6GB, if they support int8.
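A back-of-the-envelope check of those int4 numbers, assuming ~0.5 bytes per parameter for the weights only; real usage adds the kv cache, activations, and whatever overhead the 4-bit kernels need.

```python
# Weight-only int4 footprint per LLaMA size (published parameter counts).
param_counts = {"7B": 6.7e9, "13B": 13.0e9, "30B": 32.5e9, "65B": 65.2e9}

for name, n in param_counts.items():
    print(f"{name}: ~{n * 0.5 / 1e9:.1f} GB of weights in int4")
# 7B: ~3.4 GB, 13B: ~6.5 GB, 30B: ~16.2 GB, 65B: ~32.6 GB
```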
I'm well aware. It's a matter of time.
Exciting, but the big question is: does this technique reduce model performance (both inference throughput and accuracy)? To my understanding, I couldn't find credible results proving that the model performance closely matches LLM.int8(). Furthermore, the custom CUDA kernel implementation makes deployment harder by a large margin. I can wait a little longer for bitsandbytes 4-bit. There is a lot more beyond GPTQ and bitsandbytes, like SmoothQuant, etc. I actually maintain a list of possible model compression and acceleration papers and methods here, if you're interested. Disclaimer: I'm not an expert in this area.
Thanks for the heads-up. I will update and add this to my repo accordingly.
That particular 4-bit implementation is mostly a proof of concept at this point. Bitsandbytes may be getting some 4-bit functionality towards the end of this month. Best to wait for that.
Old issue. With Llama 2, you should be able to run inference on the 70B model on a single A100 GPU with enough memory.
I was able to get the 7B model to work. It looks like I might be able to run the 33B version?
Will I need to merge the checkpoint files (.pth) and set MP = 1 to run on a single GPU?
It would be great if FAIR could provide some guidance on VRAM requirements.