vmm: add NVIDIA GPUDirect P2P support. #6235

Merged
merged 1 commit into cloud-hypervisor:main on Feb 29, 2024

Conversation

@thomasbarrett (Contributor) commented Feb 26, 2024

On platforms where PCIe P2P is supported, inject a PCI capability into the NVIDIA GPU to indicate support.

A previous PR added support for PCIe P2P between VFIO devices. The NVIDIA driver does not utilize PCIe P2P unless it detects hardware support. The PCIe specification doesn't require platforms to support PCIe P2P traffic between root ports (although many systems do). Within a virtual machine, information about the host PCIe bridge is lost, so NVIDIA recommends the use of an injected PCI capability to signal support for PCIe P2P between GPUs.

This PR adds a new VFIO device argument, x_nv_gpudirect_clique (--device path=<PCI_DEVICE_PATH>,x_nv_gpudirect_clique=<CLIQUE_ID>), that injects a PCI capability indicating the "P2P clique" the GPU belongs to. This option is only valid for modern datacenter NVIDIA GPUs (Turing, Ampere, Hopper, and Lovelace). The NVIDIA-provided specification can be found here.
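For example, two GPUs can be placed in the same P2P clique at launch time as follows (the PCI device paths, clique ID, and other VM options below are hypothetical and only illustrate the syntax):

cloud-hypervisor \
    --cpus boot=8 \
    --memory size=64G,hugepages=on \
    --kernel vmlinux \
    --disk path=guest.raw \
    --device path=/sys/bus/pci/devices/0000:41:00.0/,x_nv_gpudirect_clique=0 \
             path=/sys/bus/pci/devices/0000:42:00.0/,x_nv_gpudirect_clique=0

GPUs that are given the same clique ID are advertised to the guest driver as one P2P-capable group.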

With this feature enabled, we measured a decrease in GPU P2P latency from 12 us (with the fallback shared-memory communication mechanism) to 1.4 us.

The nvidia-smi utility can be used to confirm P2P support. For example, the following cloud-hypervisor guest has 4 NVIDIA L40S GPUs.

nvidia-smi topo -p2p r
 	GPU0	GPU1	GPU2	GPU3	
 GPU0	X	OK	OK	OK	
 GPU1	OK	X	OK	OK	
 GPU2	OK	OK	X	OK	
 GPU3	OK	OK	OK	X	

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown
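
P2P latency can also be measured from inside the guest, for instance with the p2pBandwidthLatencyTest sample from NVIDIA's cuda-samples repository (the PR does not state which benchmark produced the numbers above, and the sample's location and build steps vary between cuda-samples releases, so the steps below are only a sketch):

# Illustrative only; assumes the CUDA toolkit is installed in the guest.
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest   # reports per-GPU-pair bandwidth and latency with P2P enabled and disabled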

@rbradford (Member)

I slightly prefer the naming QEMU has - x-nv-gpudirect-clique - will also make it easier for folks to google, wdyt?

@thomasbarrett (Contributor, Author)

Yeah that makes sense to me @rbradford. Do you like x-nv-gpudirect-clique or nv-gpudirect-clique? We don’t have a “x-” naming convention in cloud-hypervisor?

@rbradford (Member)

> Yeah that makes sense to me @rbradford. Do you like x-nv-gpudirect-clique or nv-gpudirect-clique? We don’t have a “x-” naming convention in cloud-hypervisor?

Yes, I think the x- prefix is a good idea.

@thomasbarrett force-pushed the gpudirectp2p branch 2 times, most recently from 9e2989d to fdb7a6b on February 28, 2024 19:22
@rbradford (Member)

@thomasbarrett the CI configuration has changed so please rebase your work on the latest "main" branch

On platforms where PCIe P2P is supported, inject a PCI capability into
NVIDIA GPU to indicate support.

Signed-off-by: Thomas Barrett <tbarrett@crusoeenergy.com>
@rbradford (Member) left a comment


Thanks @thomasbarrett - without access to this specific hardware I'm trusting your testing :-)

@rbradford added this pull request to the merge queue on Feb 29, 2024
@rbradford removed this pull request from the merge queue due to a manual request on Feb 29, 2024
@rbradford added this pull request to the merge queue on Feb 29, 2024
The github-merge-queue bot removed this pull request from the merge queue due to failed status checks on Feb 29, 2024
@rbradford added this pull request to the merge queue on Feb 29, 2024
Merged via the queue into cloud-hypervisor:main with commit b750c33 Feb 29, 2024
28 checks passed