
The inference performance of 8xH100+NVLink is worse than that of 4xA100 PCIe #4747

Closed
yirunwang opened this issue Jan 3, 2024 · 16 comments

@yirunwang

I tested llama.cpp on two systems, one with 4x A100 GPUs and the other with 8x H100 GPUs. The test results show that the inference performance of 8x H100 + NVLink (21 tokens per second) is worse than that of 4x A100 PCIe (31 tokens per second), which is very strange! Can anyone help explain this behavior? How can I improve the H100 performance? Thanks

[two screenshots of the benchmark output attached]
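For context, a rough command-line sketch of the kind of comparison described above, assuming a local llama.cpp build and a 7B Q4_K_M GGUF file; the model path, prompt, and token counts are illustrative, not the exact commands used here:

# benchmark prompt processing and generation speed with all layers offloaded
./llama-bench -m ./models/model-7b-q4_K_M.gguf -ngl 99

# or time generation directly with the main example
./main -m ./models/model-7b-q4_K_M.gguf -ngl 99 -n 256 -p "Explain NVLink in one paragraph."
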
@JohannesGaessler
Collaborator

I didn't test or optimize the CUDA code for H100s or A100s, but I would very much suspect that on such fast GPUs, for a 7B q4_K_M model, the synchronization overhead is higher than any potential speed gain. Just run models on a single GPU if you can.

@yirunwang
Author

yirunwang commented Jan 3, 2024

@JohannesGaessler Yes, it is a synchronization overhead issue. I just tested with a single A100 and it performed much better than 4 GPUs (72 tokens/second vs. 31). Thanks a lot.

@cmp-nct
Contributor

cmp-nct commented Jan 4, 2024

It's because of the tensor split: it's complex and requires up to thousands of synchronizations per token per GPU.
Maybe we'll see a layer-split option some day; that should solve it.
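For reference, the multi-GPU behavior described above is what the --tensor-split option controls at this point in time; a rough sketch, with an illustrative model path and ratios:

# split every tensor across 4 GPUs in equal proportions (the per-tensor splitting described above)
./main -m ./models/model-7b-q4_K_M.gguf -ngl 99 -ts 1,1,1,1
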

@yirunwang
Author

It's because of the tensor split: it's complex and requires up to thousands of synchronizations per token per GPU. Maybe we'll see a layer-split option some day; that should solve it.

Can TensorRT-LLM solve this issue?

@cmp-nct
Contributor

cmp-nct commented Jan 6, 2024

It's because of the tensor split: it's complex and requires up to thousands of synchronizations per token per GPU. Maybe we'll see a layer-split option some day; that should solve it.

Can TensorRT-LLM solve this issue?

I'd say that's slightly unrelated, because llama.cpp uses custom kernels for custom quantizations. I don't know much about Nvidia's solution, but my guess is that it operates at fp16 and might support fp8 on the latest generation.

No, the solution is layer-wise splitting of tensors (#4055).
Right now we split each tensor at a certain ratio between cards, so if you have 8 cards, each tensor is split 8 ways.
This is a very complex approach: in theory you might be able to compute the tensors in parallel (gaining speed), but in llama.cpp that is not the case (because of how the internal loops split tensors up). So the complex approach comes with an extreme amount of GPU synchronizations.

I first stumbled upon this mechanism when I attempted to add broadcasted multiplication (for Falcon) into the GPU kernel and realized I was looking at ten thousand GPU synchronizations across my 2 GPUs for just one token. These synchronizations alone made it slower than CPU-bound computation of the same tensor.
NVLink might speed up the memory transfers, but the total accumulated latencies will just eat that up.

The solution is to give up on the highly complex tensor splitting and instead split the computation by layers. That way a card does not have to synchronize hundreds to thousands of times; it just needs to receive one tensor at the beginning and deliver the result at the end.
This can be further optimized in some cases, for example by having some memory transfers run in the background while a tensor is being calculated.

The EXL2 framework uses layer-splitting for that reason. I recently asked the author and he assumes that running a 7B model on 8 H100 cards is as fast as on 1 H100 card (no benefit, no slowdown).

So, in my opinion, the solution is to implement the simpler layer split into llama.cpp. However, that currently has no support, and I lack the time for a full implementation that might not even get accepted, as it has to dig deep into offloading and OP functions.

@slaren
Collaborator

slaren commented Jan 6, 2024

Layer splitting will be added in #4766

@cmp-nct
Contributor

cmp-nct commented Jan 6, 2024

Layer splitting will be added in #4766

Wow, great job, I've lobbied for that for quite a while.
That should improve inference massively on any semi-modern GPU mix.

@JohannesGaessler
Collaborator

I lack the time for a full implementation that might not even get accepted, as it has to dig deep into offloading and OP functions.

My general stance on things like this that I don't consider a priority is that I won't implement them myself, but I will still review other people's PRs if they want to implement them.

@yirunwang
Author

I used the same hard drive and performed a single-GPU test using CUDA_VISIBLE_DEVICES=0, but the A100 still performs slightly better than the H100 (71 tokens/second vs. 66 tokens/second). Can someone explain this? Thanks.

MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF"
MODEL_BASENAME = "llama-2-7b-chat.Q4_K_M.gguf"
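A rough sketch of the equivalent single-GPU test from the command line, assuming a recent huggingface_hub CLI and a local llama.cpp build; the local paths are illustrative:

# download the quantized model from Hugging Face
huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir ./models

# restrict llama.cpp to a single GPU and offload all layers to it
CUDA_VISIBLE_DEVICES=0 ./main -m ./models/llama-2-7b-chat.Q4_K_M.gguf -ngl 99 -n 128 -p "Hello"
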

@cmp-nct
Contributor

cmp-nct commented Jan 10, 2024

I used the same hard drive and performed a single-GPU test using CUDA_VISIBLE_DEVICES=0, but the A100 still performs slightly better than the H100 (71 tokens/second vs. 66 tokens/second). Can someone explain this? Thanks.

MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF" MODEL_BASENAME = "llama-2-7b-chat.Q4_K_M.gguf"

The new backend will resolve the parallelism problems; once we have pipelining, it should also significantly speed up large-context processing.

Regarding your A100 and H100 results, those GPUs typically perform similarly to the 3090 and the 4090.
I'd assume you should get 80+ and 120+ tokens/second generation speed on those with Llama-2 7B.

So both cards are too slow, assuming you use full GPU offload (-ngl).
I wonder if maybe the cards are downclocked or have a low power limit set?

@yirunwang
Author

I used the same hard drive and performed a single-GPU test using CUDA_VISIBLE_DEVICES=0, but the A100 still performs slightly better than the H100 (71 tokens/second vs. 66 tokens/second). Can someone explain this? Thanks.
MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF" MODEL_BASENAME = "llama-2-7b-chat.Q4_K_M.gguf"

The new backend will resolve the parallelism problems; once we have pipelining, it should also significantly speed up large-context processing.

Regarding your A100 and H100 results, those GPUs typically perform similarly to the 3090 and the 4090. I'd assume you should get 80+ and 120+ tokens/second generation speed on those with Llama-2 7B.

So both cards are too slow, assuming you use full GPU offload (-ngl). I wonder if maybe the cards are downclocked or have a low power limit set?

@cmp-nct here are the clock and power readings of the A100 system:

~$ nvidia-smi -q -d CLOCK

==============NVSMI LOG==============

Timestamp : Wed Jan 10 08:58:04 2024
Driver Version : 535.129.03
CUDA Version : 12.2

Attached GPUs : 4
GPU 00000000:4F:00.0
Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1512 MHz
Video : 1275 MHz
Applications Clocks
Graphics : 1410 MHz
Memory : 1512 MHz
Default Applications Clocks
Graphics : 1410 MHz
Memory : 1512 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1512 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
SM Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Memory Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A

GPU 00000000:52:00.0
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 1512 MHz
Video : 795 MHz
Applications Clocks
Graphics : 1410 MHz
Memory : 1512 MHz
Default Applications Clocks
Graphics : 1410 MHz
Memory : 1512 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1512 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
SM Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Memory Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A

GPU 00000000:D5:00.0
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 1512 MHz
Video : 795 MHz
Applications Clocks
Graphics : 1410 MHz
Memory : 1512 MHz
Default Applications Clocks
Graphics : 1410 MHz
Memory : 1512 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1512 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
SM Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Memory Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A

GPU 00000000:D6:00.0
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 1512 MHz
Video : 795 MHz
Applications Clocks
Graphics : 1410 MHz
Memory : 1512 MHz
Default Applications Clocks
Graphics : 1410 MHz
Memory : 1512 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1512 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
SM Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Memory Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
~$ nvidia-smi -q -d POWER

==============NVSMI LOG==============

Timestamp : Wed Jan 10 09:03:55 2024
Driver Version : 535.129.03
CUDA Version : 12.2

Attached GPUs : 4
GPU 00000000:4F:00.0
GPU Power Readings
Power Draw : 62.23 W
Current Power Limit : 300.00 W
Requested Power Limit : 300.00 W
Default Power Limit : 300.00 W
Min Power Limit : 150.00 W
Max Power Limit : 300.00 W
Power Samples
Duration : 2.38 sec
Number of Samples : 119
Max : 62.52 W
Min : 61.93 W
Avg : 62.11 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A

GPU 00000000:52:00.0
GPU Power Readings
Power Draw : 47.65 W
Current Power Limit : 300.00 W
Requested Power Limit : 300.00 W
Default Power Limit : 300.00 W
Min Power Limit : 150.00 W
Max Power Limit : 300.00 W
Power Samples
Duration : 2.38 sec
Number of Samples : 119
Max : 47.66 W
Min : 47.46 W
Avg : 47.58 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A

GPU 00000000:D5:00.0
GPU Power Readings
Power Draw : 51.11 W
Current Power Limit : 300.00 W
Requested Power Limit : 300.00 W
Default Power Limit : 300.00 W
Min Power Limit : 150.00 W
Max Power Limit : 300.00 W
Power Samples
Duration : 2.38 sec
Number of Samples : 119
Max : 51.22 W
Min : 51.00 W
Avg : 51.12 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A

GPU 00000000:D6:00.0
GPU Power Readings
Power Draw : 46.11 W
Current Power Limit : 300.00 W
Requested Power Limit : 300.00 W
Default Power Limit : 300.00 W
Min Power Limit : 150.00 W
Max Power Limit : 300.00 W
Power Samples
Duration : 2.38 sec
Number of Samples : 119
Max : 46.31 W
Min : 46.02 W
Avg : 46.18 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A

@cmp-nct
Contributor

cmp-nct commented Jan 10, 2024

I used the same hard drive and performed a single-GPU test using CUDA_VISIBLE_DEVICES=0, but the A100 still performs slightly better than the H100 (71 tokens/second vs. 66 tokens/second). Can someone explain this? Thanks.
MODEL_ID = "TheBloke/Llama-2-7b-Chat-GGUF" MODEL_BASENAME = "llama-2-7b-chat.Q4_K_M.gguf"

@cmp-nct here are the clock and power readings of the A100 system:
[nvidia-smi CLOCK and POWER output quoted above]

I do not have an A100 or H100 system as a reference; I'm using the slightly cheaper 4090/3090 :)
You'd need to look at the clock while the card is at full load, so you can see what frequency it actually runs at.

The power target appears to be too low: an A100 should be 400W according to Google, and the H100 should be 350W.
A 300W TDP would explain your lower A100 performance compared to a 3090 at 350W.

I found contradicting information, as some servers are at 350W and some at 400W.
Here it's listed as 300W: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/PB-10577-001_v02.pdf
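One way to watch the effective clocks and power draw while generation is running, as a rough sketch using standard nvidia-smi query fields polled once per second:

# poll SM/memory clocks, power draw and limit, and utilization every second during the benchmark
nvidia-smi --query-gpu=index,clocks.sm,clocks.mem,power.draw,power.limit,utilization.gpu --format=csv -l 1
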

@yirunwang
Author

The new backend will resolve the parallelism problems; once we have pipelining, it should also significantly speed up large-context processing.

@cmp-nct When will the new backend be released? Do you have a schedule? Thanks

@slaren
Collaborator

slaren commented Jan 24, 2024

The change that allows splitting models across multiple GPUs at the layer level has already been merged, and this is now the default behavior when using multiple GPUs with llama.cpp. There is another change in the works (#4918) that will enable pipeline parallelism to improve multi-GPU performance when processing large batches or prompts.
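For reference, a rough sketch of how the split behavior can be selected on the command line after that change (flag names as of early 2024; check ./main --help for the exact options in your build):

# layer split (the new default with multiple GPUs): whole layers are assigned to each GPU
./main -m ./models/llama-2-7b-chat.Q4_K_M.gguf -ngl 99 -sm layer

# row split (the previous behavior): every tensor is split across all GPUs at the given ratio
./main -m ./models/llama-2-7b-chat.Q4_K_M.gguf -ngl 99 -sm row -ts 1,1,1,1
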

@cmp-nct
Contributor

cmp-nct commented Jan 24, 2024

Just as Slaren said, that's the answer.
I raised the point that we need layer splits at least 4 times; it was always turned down.

Slaren made a beautiful implementation of it, and it already works great. With the pipeline feature, llama.cpp will be useful even in real power servers.

@jughurta

I confirm the problem: the results with H100 are worse than the results on A100. Has anyone found the cause of this problem?

I had 4x A100 PCIe and switched to 4x H100 hoping to get better results with llama.cpp, but it's quite the opposite.

Has anyone found a solution to this problem?
