
The inference performance of 8xH100+NVLink is worse than that of 4xA100 PCIe #4747

Closed
yirunwang opened this issue Jan 3, 2024 · 16 comments

@yirunwang

I tested llama.cpp on two systems, one with 4x A100 GPUs and the other with 8x H100 GPUs. The test results show that the inference performance of 8x H100 + NVLink (21 tokens per second) is worse than that of 4x A100 PCIe (31 tokens per second), which is very strange! Can anyone help explain this behavior? How can I improve the H100 performance? Thanks

[two screenshots of the benchmark output attached]
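For context, a rough command-line sketch of the kind of comparison described above, assuming a local llama.cpp build and a 7B Q4_K_M GGUF file; the model path, prompt, and token counts are illustrative, not the exact commands used here:

# benchmark prompt processing and generation speed with all layers offloaded
./llama-bench -m ./models/model-7b-q4_K_M.gguf -ngl 99

# or time generation directly with the main example
./main -m ./models/model-7b-q4_K_M.gguf -ngl 99 -n 256 -p "Explain NVLink in one paragraph."
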
@JohannesGaessler
Collaborator

I didn't test or optimize the CUDA code for H100s or A100s, but I would very much suspect that on such fast GPUs, for a 7B q4_K_M model, the synchronization overhead is higher than any potential speed gain. Just run models on a single GPU if you can.

@yirunwang
Author

yirunwang commented Jan 3, 2024

@JohannesGaessler Yes, it is a synchronization overhead issue. I just tested with a single A100 and it performed much better than 4 GPUs (72 tokens/second vs. 31). Thanks a lot.

@cmp-nct
Contributor

cmp-nct commented Jan 4, 2024

It's because of the tensor split: it's complex and requires up to thousands of synchronizations per token per GPU.
Maybe we'll see a layer-split option some day; that should solve it.
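For reference, the multi-GPU behavior described above is what the --tensor-split option controls at this point in time; a rough sketch, with an illustrative model path and ratios:

# split every tensor across 4 GPUs in equal proportions (the per-tensor splitting described above)
./main -m ./models/model-7b-q4_K_M.gguf -ngl 99 -ts 1,1,1,1
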

@yirunwang
Author

It's because of the tensor split: it's complex and requires up to thousands of synchronizations per token per GPU. Maybe we'll see a layer-split option some day; that should solve it.

Can TensorRT-LLM solve this issue?

@cmp-nct
Contributor

cmp-nct commented Jan 6, 2024

It's because of the tensor split: it's complex and requires up to thousands of synchronizations per token per GPU. Maybe we'll see a layer-split option some day; that should solve it.

Can TensorRT-LLM solve this issue?

I'd say that's slightly unrelated, because llama.cpp uses custom kernels for custom quantizations. I don't know much about Nvidia's solution, but my guess is that it operates at fp16 and might support fp8 on the latest generation.

No, the solution is layer-wise splitting of tensors (#4055).
Right now we split each tensor at a certain ratio between cards, so if you have 8 cards, each tensor is split 8 ways.
This is a very complex approach: in theory you might be able to compute the tensors in parallel (gaining speed), but in llama.cpp that is not the case (because of how the internal loops split tensors up). So the complex approach comes with an extreme amount of GPU synchronizations.

I first stumbled upon this mechanism when I attempted to add broadcasted multiplication (for Falcon) into the GPU kernel and realized I was looking at ten thousand GPU synchronizations across my 2 GPUs for just one token. These synchronizations alone made it slower than CPU-bound computation of the same tensor.
NVLink might speed up the memory transfers, but the total accumulated latencies will just eat that up.

The solution is to give up on the highly complex tensor splitting and instead split the computation by layers. That way a card does not have to synchronize hundreds to thousands of times; it just needs to receive one tensor at the beginning and deliver the result at the end.
This can be further optimized in some cases, for example by having some memory transfers run in the background while a tensor is being calculated.

The EXL2 framework uses layer-splitting for that reason. I recently asked the author and he assumes that running a 7B model on 8 H100 cards is as fast as on 1 H100 card (no benefit, no slowdown).

So, in my opinion, the solution is to implement the simpler layer split into llama.cpp. However, that currently has no support, and I lack the time for a full implementation that might not even get accepted, as it has to dig deep into offloading and OP functions.

@slaren
Collaborator

slaren commented Jan 6, 2024

Layer splitting will be added in #4766

@cmp-nct
Contributor

cmp-nct commented Jan 6, 2024

Layer splitting will be added in #4766

Wow, great job, I've lobbied for that for quite a while.
That should improve inference massively on any semi-modern GPU mix.

@JohannesGaessler
Collaborator

I lack the time for a full implementation that might not even get accepted, as it has to dig deep into offloading and OP functions.

My general stance on things like this that I don't consider a priority is that I won't implement them myself, but I will still review other people's PRs if they want to implement them.

@yirunwang
Author

I used the same hard drive and performed a single-GPU test using CUDA_VISIBLE_DEVICES=0, but the A100 still performs slightly better than the H100 (71 tokens/second vs. 66 tokens/second). Can someone explain this? Thanks.

MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF"
MODEL_BASENAME = "llama-2-7b-chat.Q4_K_M.gguf"
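A rough sketch of the equivalent single-GPU test from the command line, assuming a recent huggingface_hub CLI and a local llama.cpp build; the local paths are illustrative:

# download the quantized model from Hugging Face
huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir ./models

# restrict llama.cpp to a single GPU and offload all layers to it
CUDA_VISIBLE_DEVICES=0 ./main -m ./models/llama-2-7b-chat.Q4_K_M.gguf -ngl 99 -n 128 -p "Hello"
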

@cmp-nct
Contributor

cmp-nct commented Jan 10, 2024

I used the same hard drive and performed a single-GPU test using CUDA_VISIBLE_DEVICES=0, but the A100 still performs slightly better than the H100 (71 tokens/second vs. 66 tokens/second). Can someone explain this? Thanks.

MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF" MODEL_BASENAME = "llama-2-7b-chat.Q4_K_M.gguf"

The new backend will resolve the parallelism problems; once we have pipelining, it should also significantly speed up large-context processing.

Regarding your A100 and H100 results, those GPUs typically perform similarly to the 3090 and the 4090.
I'd assume you should get 80+ and 120+ tokens/second generation speed on those with Llama-2 7B.

So both cards are too slow, assuming you use full GPU offload (-ngl).
I wonder if maybe the cards are downclocked or have a low power limit set?

@yirunwang
Author

I used the same hard drive and performed a single-GPU test using CUDA_VISIBLE_DEVICES=0, but the A100 still performs slightly better than the H100 (71 tokens/second vs. 66 tokens/second). Can someone explain this? Thanks.
MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF" MODEL_BASENAME = "llama-2-7b-chat.Q4_K_M.gguf"

The new backend will resolve the parallelism problems; once we have pipelining, it should also significantly speed up large-context processing.

Regarding your A100 and H100 results, those GPUs typically perform similarly to the 3090 and the 4090. I'd assume you should get 80+ and 120+ tokens/second generation speed on those with Llama-2 7B.

So both cards are too slow, assuming you use full GPU offload (-ngl). I wonder if maybe the cards are downclocked or have a low power limit set?

@cmp-nct here are the clock and power readings of the A100 system:

~$ nvidia-smi -q -d CLOCK

==============NVSMI LOG==============

Timestamp : Wed Jan 10 08:58:04 2024
Driver Version : 535.129.03
CUDA Version : 12.2

Attached GPUs : 4
GPU 00000000:4F:00.0
Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1512 MHz
Video : 1275 MHz
Applications Clocks
Graphics : 1410 MHz
Memory : 1512 MHz
Default Applications Clocks
Graphics : 1410 MHz
Memory : 1512 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1512 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
SM Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Memory Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A

GPU 00000000:52:00.0
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 1512 MHz
Video : 795 MHz
Applications Clocks
Graphics : 1410 MHz
Memory : 1512 MHz
Default Applications Clocks
Graphics : 1410 MHz
Memory : 1512 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1512 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
SM Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Memory Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A

GPU 00000000:D5:00.0
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 1512 MHz
Video : 795 MHz
Applications Clocks
Graphics : 1410 MHz
Memory : 1512 MHz
Default Applications Clocks
Graphics : 1410 MHz
Memory : 1512 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1512 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
SM Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Memory Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A

GPU 00000000:D6:00.0
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 1512 MHz
Video : 795 MHz
Applications Clocks
Graphics : 1410 MHz
Memory : 1512 MHz
Default Applications Clocks
Graphics : 1410 MHz
Memory : 1512 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1512 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
SM Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Memory Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
~$ nvidia-smi -q -d POWER

==============NVSMI LOG==============

Timestamp : Wed Jan 10 09:03:55 2024
Driver Version : 535.129.03
CUDA Version : 12.2

Attached GPUs : 4
GPU 00000000:4F:00.0
GPU Power Readings
Power Draw : 62.23 W
Current Power Limit : 300.00 W
Requested Power Limit : 300.00 W
Default Power Limit : 300.00 W
Min Power Limit : 150.00 W
Max Power Limit : 300.00 W
Power Samples
Duration : 2.38 sec
Number of Samples : 119
Max : 62.52 W
Min : 61.93 W
Avg : 62.11 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A

GPU 00000000:52:00.0
GPU Power Readings
Power Draw : 47.65 W
Current Power Limit : 300.00 W
Requested Power Limit : 300.00 W
Default Power Limit : 300.00 W
Min Power Limit : 150.00 W
Max Power Limit : 300.00 W
Power Samples
Duration : 2.38 sec
Number of Samples : 119
Max : 47.66 W
Min : 47.46 W
Avg : 47.58 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A

GPU 00000000:D5:00.0
GPU Power Readings
Power Draw : 51.11 W
Current Power Limit : 300.00 W
Requested Power Limit : 300.00 W
Default Power Limit : 300.00 W
Min Power Limit : 150.00 W
Max Power Limit : 300.00 W
Power Samples
Duration : 2.38 sec
Number of Samples : 119
Max : 51.22 W
Min : 51.00 W
Avg : 51.12 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A

GPU 00000000:D6:00.0
GPU Power Readings
Power Draw : 46.11 W
Current Power Limit : 300.00 W
Requested Power Limit : 300.00 W
Default Power Limit : 300.00 W
Min Power Limit : 150.00 W
Max Power Limit : 300.00 W
Power Samples
Duration : 2.38 sec
Number of Samples : 119
Max : 46.31 W
Min : 46.02 W
Avg : 46.18 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A

@cmp-nct
Contributor

cmp-nct commented Jan 10, 2024

I used the same hard drive and performed a single-GPU test using CUDA_VISIBLE_DEVICES=0, but the A100 still performs slightly better than the H100 (71 tokens/second vs. 66 tokens/second). Can someone explain this? Thanks.
MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF" MODEL_BASENAME = "llama-2-7b-chat.Q4_K_M.gguf"

@cmp-nct here are the clock and power readings of the A100 system:
[nvidia-smi CLOCK and POWER output quoted above]

I do not have an A100 or H100 system as a reference; I'm using the slightly cheaper 4090/3090 :)
You'd need to look at the clock while the card is at full load, so you can see what frequency it actually runs at.

The power target appears to be too low: an A100 should be 400W according to Google, and the H100 should be 350W.
A 300W TDP would explain your lower A100 performance compared to a 3090 at 350W.

I found contradicting information, as some servers are at 350W and some at 400W.
Here it's listed as 300W: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/PB-10577-001_v02.pdf
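One way to watch the effective clocks and power draw while generation is running, as a rough sketch using standard nvidia-smi query fields polled once per second:

# poll SM/memory clocks, power draw and limit, and utilization every second during the benchmark
nvidia-smi --query-gpu=index,clocks.sm,clocks.mem,power.draw,power.limit,utilization.gpu --format=csv -l 1
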

@yirunwang
Author

The new backend will resolve the parallelism problems; once we have pipelining, it should also significantly speed up large-context processing.

@cmp-nct When will the new backend be released? Do you have a schedule? Thanks

@slaren
Collaborator

slaren commented Jan 24, 2024

The change that allows splitting models across multiple GPUs at the layer level has already been merged, and this is now the default behavior when using multiple GPUs with llama.cpp. There is another change in the works (#4918) that will enable pipeline parallelism to improve multi-GPU performance when processing large batches or prompts.
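For reference, a rough sketch of how the split behavior can be selected on the command line after that change (flag names as of early 2024; check ./main --help for the exact options in your build):

# layer split (the new default with multiple GPUs): whole layers are assigned to each GPU
./main -m ./models/llama-2-7b-chat.Q4_K_M.gguf -ngl 99 -sm layer

# row split (the previous behavior): every tensor is split across all GPUs at the given ratio
./main -m ./models/llama-2-7b-chat.Q4_K_M.gguf -ngl 99 -sm row -ts 1,1,1,1
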

@cmp-nct
Contributor

cmp-nct commented Jan 24, 2024

Just as Slaren said, that's the answer.
I raised the point that we need layer splits at least 4 times; it was always turned down.

Slaren made a beautiful implementation of it, and it already works great. With the pipeline feature, llama.cpp will be useful even in real power servers.

@jughurta

I confirm the problem: the results with H100 are worse than the results on A100. Has anyone found the cause of this problem?

I had 4x A100 PCIe and switched to 4x H100 hoping to get better results with llama.cpp, but it's quite the opposite.

Has anyone found a solution to this problem?
