
The BaM bandwidth stops increasing when the number of NVMe drives is more than 7 #17

LiangZhou9527 opened this issue Aug 7, 2023 · 5 comments

@LiangZhou9527

Hi there,

I'm doing benchmark testing on a machine configured with some H800 GPUs and 8 NVMe drives dedicated to BaM.

The GPU is connected via PCIe Gen5 x16 and each NVMe drive via PCIe Gen4 x4, which means that in theory the max bandwidth of the GPU is around 60 GB/s and the max bandwidth of a single NVMe drive is around 7.5 GB/s.
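As a rough sanity check of those numbers (a back-of-the-envelope sketch that only accounts for 128b/130b line encoding and ignores TLP/DLLP protocol overhead, which lowers the achievable bandwidth further):

# PCIe estimate: GT/s per lane * 128/130 encoding / 8 bits per byte * lanes
awk 'BEGIN {
  gen5_x16 = 32 * 128/130 / 8 * 16;  # GPU link:  ~63.0 GB/s
  gen4_x4  = 16 * 128/130 / 8 * 4;   # NVMe link: ~7.9 GB/s
  printf "Gen5 x16: %.1f GB/s, Gen4 x4: %.2f GB/s\n", gen5_x16, gen4_x4
}'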

But according to my testing with "nvm-block-bench", the result is not as expected. I summarize the results here: https://raw.githubusercontent.com/LiangZhou9527/some_stuff/8b48038465858846f864e43cef6d0e6df787a2c2/BaM%20bandwidth%20and%20the%20number%20of%20NVMe.png

In the picture you can see that the bandwidth with 6 NVMe drives and 7 NVMe drives is almost the same, but when the number of drives reaches 8, the bandwidth drops a lot.

Any thoughts about what is happening here?

By the way, I didn't enable the IOMMU on my machine. The benchmark command line is shown below; I executed it 8 times, each time with a different --n_ctrls value (1, 2, ..., 8).

./bin/nvm-block-bench --threads=262144 --blk_size=64 --reqs=1 --pages=262144 --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=0 --num_queues=128 --random=true -S 1 --n_ctrls=1
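For completeness, the sweep can be scripted like this (a sketch of what I ran; only --n_ctrls varies, all other flags are as above):

for n in 1 2 3 4 5 6 7 8; do
  ./bin/nvm-block-bench --threads=262144 --blk_size=64 --reqs=1 --pages=262144 \
    --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=0 \
    --num_queues=128 --random=true -S 1 --n_ctrls=$n
done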

@msharmavikram
Collaborator

Wow. There are two pieces of awesome news here.

We have not tested the Hopper generation or a Gen5 CPU yet, so we are super excited to see the benchmark and bring-up working out of the box. Thanks for this first piece of awesome news!

We are also delighted to see linear scaling up to 5 SSDs. Agreed, the bandwidth is lower than expected and stops scaling, but these are the first Gen5 platform results we are aware of, so thanks for the second piece of awesome news. We faced similar trends when we moved from Gen3 to Gen4, and we want to help you debug this issue. We likely will not get access to a Gen5 platform immediately, so can we schedule a call to discuss what can be done (I believe you know my email address)?

We have a bunch of theories, and the only way to determine what may be going wrong is to validate each of them. The IOMMU is definitely one of the usual culprits, but we also need to understand the PCIe topology and the capabilities of the Gen5 root complex. We have previously faced issues where the CPU was wrongly configured for such high throughput, and we need to rule that out. There is a bit of debugging to be done for the Gen5 platform, and we want to help here!
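As a starting point, it would help to confirm the negotiated link speed/width of the GPU, the SSDs, and any switch/root ports in between, and whether the kernel brought up an IOMMU at all. A generic sketch (the 0000:18:00.0 address is just a placeholder; substitute the addresses from your lspci -tv output):

# Compare advertised vs. negotiated PCIe link speed/width for one device
sudo lspci -vv -s 0000:18:00.0 | grep -E 'LnkCap|LnkSta'
# Check whether the kernel enabled an IOMMU at boot
sudo dmesg | grep -i -E 'iommu|dmar'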

Lastly, can you try the following:

./bin/nvm-block-bench --threads=$((1024*1024)) --blk_size=64 --reqs=1 --pages=$((1024*1024)) --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=0 --num_queues=128 --random=true -S 1 --n_ctrls=1

I'm curious to see if latency is an issue here.

msharmavikram added the enhancement (New feature or request) label on Aug 7, 2023
@LiangZhou9527
Author

LiangZhou9527 commented Aug 8, 2023

Hi @msharmavikram ,

We likely will not get access to a Gen5 platform immediately, so can we schedule a call to discuss what can be done (I believe you know my email address)?

Much appreciated that you're lending me a hand with this issue. Yes, I know your email address, and I'm happy to schedule a call once more information is available and things are clearer.

The IOMMU is definitely one of the usual culprits, but we also need to understand the PCIe topology and the capabilities of the Gen5 root complex.

I didn't enable the IOMMU on my host; there is no output when I run the command "cat /proc/cmdline | grep iommu". I also attached the PCIe topology collected by running "lspci -tv" and "lspci -vv"; please refer to the attached "lspci -tv" and "lspci -vv" outputs.
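For reference, the kernel command line only shows explicitly passed parameters, and an IOMMU can still be enabled by firmware or by a kernel default; a quick way to double-check from sysfs (a sketch):

# If an IOMMU is active, these are populated; empty output suggests it is off
ls /sys/class/iommu/
ls /sys/kernel/iommu_groups/ 2>/dev/null | wc -l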

Lastly, can you try the following: ./bin/nvm-block-bench --threads=$((1024*1024)) --blk_size=64 --reqs=1 --pages=$((1024*1024)) --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=0 --num_queues=128 --random=true -S 1 --n_ctrls=1

Please refer to the output below:

SQs: 135 CQs: 135 n_qps: 128
n_ranges_bits: 6
n_ranges_mask: 63
pages_dma: 0x7f9540010000 21a020410000
HEREN
Cond1
100000 8 1 100000
Finish Making Page Cache
finished creating cache
0000:18:00.0
atlaunch kernel
Elapsed Time: 686253 Number of Ops: 1048576 Data Size (bytes): 4294967296
Ops/sec: 1.52797e+06 Effective Bandwidth(GB/S): 5.82876
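For what it's worth, the reported numbers are consistent with Data Size / Elapsed Time, assuming the elapsed time is in microseconds and "GB" here means GiB (a quick check):

awk 'BEGIN {
  bytes = 4294967296; us = 686253; ops = 1048576;
  printf "Ops/sec: %.5e\n", ops / (us / 1e6);                   # ~1.52797e+06
  printf "Bandwidth: %.5f GiB/s\n", bytes / (us / 1e6) / 2^30   # ~5.82876
}'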

@msharmavikram
Collaborator

Will look forward to your email.

Meanwhile, can you try one more command and increase the number of SSDs from 1 to 8? (The one below is for 8 SSDs.)

./bin/nvm-block-bench --threads=$((1024*1024*4)) --blk_size=64 --reqs=1 --pages=$((1024*1024*4)) --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=0 --num_queues=135 --random=true -S 1 --n_ctrls=8
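A sweep version, if it helps (a sketch that only varies --n_ctrls and keeps the other flags as in the command above; scale --threads/--pages with the SSD count if that works better on your setup):

for n in $(seq 1 8); do
  ./bin/nvm-block-bench --threads=$((1024*1024*4)) --blk_size=64 --reqs=1 \
    --pages=$((1024*1024*4)) --queue_depth=1024 --page_size=4096 \
    --num_blks=2097152 --gpu=0 --num_queues=135 --random=true -S 1 --n_ctrls=$n
done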

@LiangZhou9527
Author

LiangZhou9527 commented Aug 8, 2023

Hi @msharmavikram ,

Here's the log for 1 to 8 SSDs; the result is similar to what I summarized before.

https://raw.githubusercontent.com/LiangZhou9527/some_stuff/main/1-8.log

Please note that the line "in Controller::Controller, path = /dev/libnvm0" is for debugging only; it does not impact the performance results.

@msharmavikram
Collaborator

msharmavikram commented Aug 8, 2023

I believe these are Intel SSDs; at least that's how it looks. What are the max IOPS for 4 KB and 512 B accesses?

The issue seems to come from the IOMMU, the PCIe switch, or the CPU. We want to determine whether the limit is bandwidth or IOPS. Let's try the 1 to 8 SSD configurations with page_size=512 instead of 4 KB.
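For example, something like this for the 8-SSD case, and similarly for 1 through 7 (a sketch based on the earlier command, only changing --page_size; the other flags may need tuning for 512 B accesses):

./bin/nvm-block-bench --threads=$((1024*1024*4)) --blk_size=64 --reqs=1 --pages=$((1024*1024*4)) --queue_depth=1024 --page_size=512 --num_blks=2097152 --gpu=0 --num_queues=135 --random=true -S 1 --n_ctrls=8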

Let's see what it shows.

(Reach out by email, as we might require additional support from the vendors here: Broadcom and Intel.)
