
Add CDI support to --devices  #3864

Closed

@klueska

Description

Feature Request:
Add support to the --device flag for processing fully-qualified CDI device names in addition to standard device-node paths.

Support for this was added in podman v4.1.0 in May 2022, and it would be good to have feature parity with it in docker.
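
As a rough sketch of what this could look like on the engine side, the reference CDI Go library exposes a helper for telling a fully-qualified CDI device name apart from a plain device-node path. The import path and surrounding wiring below are assumptions for illustration, not existing docker code:

package main

import (
	"fmt"

	"github.com/container-orchestrated-devices/container-device-interface/pkg/cdi"
)

// classifyDevice decides how a --device value could be handled:
// fully-qualified CDI names (vendor.com/class=name) would go through CDI
// resolution, anything else is treated as a host device-node path.
func classifyDevice(value string) string {
	if cdi.IsQualifiedName(value) {
		return "cdi"
	}
	return "path"
}

func main() {
	fmt.Println(classifyDevice("nvidia.com/gpu=gpu0")) // cdi
	fmt.Println(classifyDevice("/dev/nvidia0"))        // path
}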

Background:
The Container Device Interface (CDI) is a CNCF-sponsored initiative to standardize the way in which complex devices are exposed to containers. It is based on the Container Network Interface (CNI) model and specification.

With CDI, a "device" is more than just a single device node under /dev.

Instead, CDI defines a device as a high-level concept mapped to a specific set of OCI runtime spec modifications. These modifications cover not just the injection of device nodes, but also filesystem mounts, environment variables, and container lifecycle hooks.

Vendors are responsible for writing CDI specs that define their devices in terms of these modifications, and CDI-enabled container runtimes are responsible for reading these specs and making the proper modifications when requested to do so.
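
To make the flow concrete, here is a minimal Go sketch of how a CDI-enabled engine could apply these modifications to an OCI runtime spec, assuming the reference CDI Go library's registry API (GetRegistry / InjectDevices); the wrapper function and error handling are illustrative only:

package main

import (
	"fmt"
	"log"

	"github.com/container-orchestrated-devices/container-device-interface/pkg/cdi"
	oci "github.com/opencontainers/runtime-spec/specs-go"
)

// injectCDIDevices merges the containerEdits of each requested CDI device
// (device nodes, mounts, env vars, hooks) into the OCI runtime spec.
func injectCDIDevices(spec *oci.Spec, devices ...string) error {
	// By default the registry loads vendor specs from /etc/cdi and /var/run/cdi.
	registry := cdi.GetRegistry()
	if err := registry.Refresh(); err != nil {
		return fmt.Errorf("refreshing CDI registry: %w", err)
	}
	if unresolved, err := registry.InjectDevices(spec, devices...); err != nil {
		return fmt.Errorf("unresolved CDI devices %v: %w", unresolved, err)
	}
	return nil
}

func main() {
	spec := &oci.Spec{Process: &oci.Process{}, Linux: &oci.Linux{}}
	if err := injectCDIDevices(spec, "nvidia.com/gpu=gpu0"); err != nil {
		log.Fatal(err)
	}
	log.Printf("mounts after injection: %d", len(spec.Mounts))
}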

Container runtimes that already include support for CDI include podman, cri-o, and containerd.

Moreover, a new feature in Kubernetes called Dynamic Resource Allocation (DRA) uses CDI under the hood to do its device injection (with the assumption that a CDI-enabled runtime such as cri-o or containerd is there to support it).

As a concrete example, consider the following abbreviated CDI spec for injecting an NVIDIA GPU into a container:

---
cdiVersion: 0.4.0
kind: nvidia.com/gpu
devices:
- name: gpu0
  containerEdits:
   deviceNodes:
   - path: /dev/nvidia0
   - path: /dev/nvidiactl
   mounts:
   - containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.91.03
     hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.91.03
     options: [ro, nosuid, nodev, bind]
   - containerPath: /usr/lib/x86_64-linux-gnu/libcuda.so.460.91.03
     hostPath: /usr/lib/x86_64-linux-gnu/libcuda.so.460.91.03
     options: [ro, nosuid, nodev, bind]
   - containerPath: /usr/bin/nvidia-smi
     hostPath: /usr/bin/nvidia-smi
     options: [ro, nosuid, nodev, bind]
   - containerPath: /usr/bin/nvidia-persistenced
     hostPath: /usr/bin/nvidia-persistenced
     options: [ro, nosuid, nodev, bind]
   - containerPath: /var/run/nvidia-persistenced/socket
     hostPath: /var/run/nvidia-persistenced/socket
     options: [ro, nosuid, nodev, bind]
   hooks:
   - hookName: createContainer
     path: /usr/bin/nvidia-ctk
     args:
     - /usr/bin/nvidia-ctk
     - hook
     - update-ldcache
     - --folder
     - /usr/lib/x86_64-linux-gnu

This spec defines a device named gpu0, associated with vendor nvidia.com and device class gpu -- resulting in the fully-qualified CDI device name nvidia.com/gpu=gpu0.

Passing this fully-qualified device name to a CDI-enabled runtime ensures that not only is the /dev/nvidia0 device node injected into the container, but so are the required /dev/nvidiactl control device and a set of host-level libraries and hooks that make working with NVIDIA devices in containers easier.
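
For reference, the name format itself can be handled with helpers from the same library; a small sketch, with usage here being illustrative only:

package main

import (
	"fmt"
	"log"

	"github.com/container-orchestrated-devices/container-device-interface/pkg/cdi"
)

func main() {
	// "nvidia.com/gpu=gpu0" splits into vendor "nvidia.com", class "gpu", device "gpu0".
	vendor, class, device, err := cdi.ParseQualifiedName("nvidia.com/gpu=gpu0")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(vendor, class, device)

	// Recombining the parts yields the same fully-qualified name.
	fmt.Println(cdi.QualifiedName(vendor, class, device))
}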

Note: For those familiar with nvidia-docker and/or the nvidia-container-toolkit, CDI effectively eliminates the need for these tools, because everything they have traditionally been responsible for can now be encoded in a CDI spec.

As such, saving the above spec under /etc/cdi/nvidia.yaml and running podman as below results in the container starting with the desired modifications:

$ podman --version
podman version 4.1.0
$ podman run --device nvidia.com/gpu=gpu0 ubuntu:20.04 nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-2e5daa54-7530-ede9-5b96-d4e31fbbe7f8)
