[WIP] feat: GPU passthrough for actors via gVisor nvproxy + cuda-checkpoint#96

Draft
Davanum Srinivas (dims) wants to merge 3 commits into agent-substrate:main from dims:feat/gpu-passthrough
@dims Davanum Srinivas (dims) commented May 27, 2026

Summary

End-to-end GPU passthrough for substrate actors via gVisor's nvproxy + cuda-checkpoint path.

Changes

  • CRD: pkg/api/v1alpha1/actortemplate_types.go::Container grows an optional Resources *ContainerResources carrying GPU *GPUResource{Count, Device, DriverCapabilities, DriverVersion}. zz_generated.deepcopy.go and manifests/.../ate.dev_actortemplates.yaml regenerated via go generate ./....

  • Protos: internal/proto/ateletpb and internal/proto/ateompb Container messages gain a GpuSpec gpu field mirroring the CRD shape.

  • ateapi → atelet: cmd/ateapi/internal/controlapi/gpu.go adds toAteletGpuSpec(*v1alpha1.ContainerResources) *ateletpb.GpuSpec; the resume and suspend workflows populate it on each ateletpb.Container.

  • atelet → ateom: cmd/atelet/main.go projects ateletpb.GpuSpec → ateompb.GpuSpec via toAteomGpuSpec. cmd/atelet/oci.go::prepareOCIDirectory gains a gpu *ateletpb.GpuSpec parameter and an addGPUToOCISpec() helper that injects /dev/nvidia* device nodes (host major/minor) into Linux.Devices and bind-mounts /usr/local/bin/cuda-checkpoint plus the wrapper script when the workload requests GPU. Pause containers pass nil.

  • ateom-gvisor: cmd/ateom-gvisor/runsc.go gains a gpu *ateompb.GpuSpec field on the runsc struct (populated via firstGPUSpec from the workload's containers). gpuGlobalFlags() emits --nvproxy [--nvproxy-driver-version=X] [--nvproxy-allowed-driver-capabilities=...] on runsc create/checkpoint/restore. gpuSaveRestoreFlags() is gated to the root container only (the supervisor sub-container restore must not re-invoke the wrapper, and gVisor's nvproxy auto-registers cuda-checkpoint internally on release-20260520.0).

  • Wrapper: hack/cuda-checkpoint-wrapper.sh (20 lines). Idempotent cuda-checkpoint --toggle over every CUDA-touching PID found in /proc/*/maps. Skips only $$ (self), not PID 1, because inside a substrate sandbox the workload is PID 1. Used by the bare-metal validate-bare.sh in the demo dir; substrate proper leans on gVisor's internal cuda-checkpoint registration.

  • Tests: cmd/ateom-gvisor/runsc_test.go (linux build tag) covers gpuGlobalFlags, gpuSaveRestoreFlags, and firstGPUSpec. The duration-string regression (30s, not 30000) is asserted explicitly. go test ./pkg/api/v1alpha1/... is clean.
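
To make the flag emission described for ateom-gvisor concrete, here is a minimal sketch of what a gpuGlobalFlags-style helper could look like. The GpuSpec struct below is a hypothetical stand-in for ateompb.GpuSpec: the field names follow the CRD shape listed above, but the Go types and the comma-joined capabilities format are assumptions, not the PR's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// GpuSpec is a hypothetical stand-in for ateompb.GpuSpec. Field names follow
// the CRD shape in the description; the types are assumptions.
type GpuSpec struct {
	Count              int32
	Device             string
	DriverCapabilities []string
	DriverVersion      string
}

// gpuGlobalFlags returns the runsc global flags for a GPU workload.
// A nil spec means no GPU was requested, so no flags are emitted.
func gpuGlobalFlags(gpu *GpuSpec) []string {
	if gpu == nil {
		return nil
	}
	flags := []string{"--nvproxy"}
	if gpu.DriverVersion != "" {
		flags = append(flags, "--nvproxy-driver-version="+gpu.DriverVersion)
	}
	if len(gpu.DriverCapabilities) > 0 {
		// Joining with commas is an assumption about the expected value format.
		caps := strings.Join(gpu.DriverCapabilities, ",")
		flags = append(flags, "--nvproxy-allowed-driver-capabilities="+caps)
	}
	return flags
}

func main() {
	spec := &GpuSpec{
		Count:              1,
		DriverVersion:      "580.126.09",
		DriverCapabilities: []string{"compute", "utility"},
	}
	fmt.Println(strings.Join(gpuGlobalFlags(spec), " "))
}
```

Returning nil for a nil spec keeps non-GPU actors on the exact runsc command line they had before, which is the property the unit tests mentioned above would most plausibly pin down.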

Constraints (from the gVisor source crawl)

  • NVIDIA driver must be R570+ (cuda-checkpoint NVML support) AND appear in runsc nvproxy list-supported-drivers. release-20260520.0 supports 16 versions across 535/550/570/580/590.
  • Driver version must match across checkpoint and restore — migration across driver upgrades is rejected.
  • x86_64 only. cuda-checkpoint isn't supported on arm64.
  • Operational gotcha (documented in the demo README): on hosts with nvidia-persistenced running, replace /run/nvidia-persistenced/socket with a regular file — gVisor's gofer can't bind-mount Unix sockets, and nvidia-container-cli hard-codes that mount.
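
The first two constraints lend themselves to a pre-flight check. The sketch below is illustrative only: checkDriver and checkRestore are hypothetical helper names, and the supported-driver list is passed in by the caller rather than obtained by running runsc nvproxy list-supported-drivers.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// driverMajor extracts the leading branch number, e.g. 580 from "580.126.09".
func driverMajor(version string) (int, error) {
	head, _, _ := strings.Cut(version, ".")
	return strconv.Atoi(head)
}

// checkDriver applies the first constraint: R570 or newer, and present in the
// list the caller obtained from `runsc nvproxy list-supported-drivers`.
func checkDriver(version string, supported []string) error {
	major, err := driverMajor(version)
	if err != nil {
		return fmt.Errorf("unparseable driver version %q: %w", version, err)
	}
	if major < 570 {
		return fmt.Errorf("driver %s is older than R570", version)
	}
	for _, s := range supported {
		if s == version {
			return nil
		}
	}
	return fmt.Errorf("driver %s is not in the nvproxy supported list", version)
}

// checkRestore applies the second constraint: the driver version recorded at
// checkpoint time must equal the one present at restore time.
func checkRestore(checkpointed, current string) error {
	if checkpointed != current {
		return errors.New("driver version changed between checkpoint and restore")
	}
	return nil
}

func main() {
	// Illustrative list; the real one comes from the runsc binary in use.
	supported := []string{"570.195.03", "580.126.09"}
	fmt.Println(checkDriver("580.126.09", supported))
	fmt.Println(checkRestore("570.195.03", "580.126.09"))
}
```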

Test plan

  • go vet ./pkg/... ./internal/proto/... ./cmd/atelet/... ./cmd/ateapi/...
  • GOOS=linux go build ./cmd/atelet ./cmd/ateapi ./cmd/ateom-gvisor
  • go test ./pkg/api/v1alpha1/...

Wires the ActorTemplate CRD, atelet's OCI builder, and ateom-gvisor's
runsc invocations end-to-end so a containerised actor can declare GPU
intent and have it survive checkpoint/restore through gVisor's
official cuda-checkpoint path.

Changes:

- pkg/api/v1alpha1/actortemplate_types.go: Container grows an
  optional Resources block carrying a GPUResource. Generated
  zz_generated.deepcopy.go + ate.dev_actortemplates.yaml regenerated
  via go generate.

- internal/proto/ateletpb + ateompb: Container grows a GpuSpec field
  mirroring the CRD shape. cmd/atelet projects it from ActorTemplate
  into the ateom workload spec.

- cmd/atelet/oci.go::prepareOCIDirectory: when a container requests
  GPU, inject /dev/nvidia* device entries (host major/minor) and
  bind-mount /usr/local/bin/cuda-checkpoint plus the wrapper script
  read-only into the bundle. The pause container never gets GPU.

- cmd/ateom-gvisor/runsc.go: per-sandbox runsc struct learns to emit
  --nvproxy, --nvproxy-driver-version, --nvproxy-allowed-driver-
  capabilities on create/restore, and --save-restore-exec-argv +
  --save-restore-exec-timeout on checkpoint/restore. Without these,
  runsc panics with `nvproxy.frontendFDMemmapFile is not saveable`
  the moment an actor with live CUDA state is checkpointed.

- hack/cuda-checkpoint-wrapper.sh: the script runsc invokes inside
  the sandbox via --save-restore-exec-argv. Enumerates CUDA-touching
  PIDs by grepping /proc/*/maps and toggles each via cuda-checkpoint.
  Skips only $$ (self), not PID 1 -- the workload is PID 1 inside
  the sandbox, and skipping it makes the script find nothing.
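
The wrapper itself is a shell script, but its PID-selection rule can be sketched in Go to show why PID 1 must not be skipped. This is an illustration under assumptions: matching on the substring "libcuda" is a guess at the grep pattern, and touchesCUDA, selectPIDs, and readProcMaps are hypothetical names. The sketch prints the commands it would run instead of executing them.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strconv"
	"strings"
)

// touchesCUDA reports whether a /proc/<pid>/maps dump references libcuda.
// The "libcuda" substring is an assumption; the wrapper's real pattern is
// not shown in this PR description.
func touchesCUDA(maps string) bool {
	return strings.Contains(maps, "libcuda")
}

// selectPIDs returns the CUDA-touching PIDs in ascending order, skipping only
// self. PID 1 is deliberately kept: inside the sandbox it is the workload.
func selectPIDs(mapsByPID map[int]string, self int) []int {
	var pids []int
	for pid, maps := range mapsByPID {
		if pid != self && touchesCUDA(maps) {
			pids = append(pids, pid)
		}
	}
	sort.Ints(pids)
	return pids
}

// readProcMaps loads every readable <procRoot>/<pid>/maps file.
func readProcMaps(procRoot string) map[int]string {
	out := map[int]string{}
	paths, _ := filepath.Glob(filepath.Join(procRoot, "[0-9]*", "maps"))
	for _, p := range paths {
		pid, err := strconv.Atoi(filepath.Base(filepath.Dir(p)))
		if err != nil {
			continue // directory name is not a PID
		}
		data, err := os.ReadFile(p)
		if err != nil {
			continue // process exited or maps is unreadable
		}
		out[pid] = string(data)
	}
	return out
}

func main() {
	for _, pid := range selectPIDs(readProcMaps("/proc"), os.Getpid()) {
		fmt.Printf("cuda-checkpoint --toggle --pid %d\n", pid)
	}
}
```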

Constraints from the gVisor source crawl:
  * driver must be R570+ and in runsc nvproxy list-supported-drivers
  * driver version must match across checkpoint and restore
  * x86_64 only (cuda-checkpoint unsupported on arm64)
  * /run/nvidia-persistenced/socket must be a regular file on the
    host, because gVisor's gofer can't bind-mount Unix sockets and
    nvidia-container-cli hard-codes this bind regardless of
    persistenced state.

Verified on an NVIDIA L40S with driver 580.126.09:
  nvidia-smi works inside runsc-gpu sandboxes;
  cuda-checkpoint --toggle drains a live CUDA context.

go vet ./..., GOOS=linux go build ./cmd/..., go test
./pkg/api/v1alpha1/... all clean.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
@dims Davanum Srinivas (dims) changed the title feat: GPU passthrough for actors via gVisor nvproxy + cuda-checkpoint [WIP] feat: GPU passthrough for actors via gVisor nvproxy + cuda-checkpoint May 27, 2026
@dims Davanum Srinivas (dims) marked this pull request as draft May 27, 2026 12:11
Davanum Srinivas (dims) added a commit to dims/openshell-driver-substrate that referenced this pull request May 27, 2026
The previous commit (0b46450) landed examples/gpu-counter/ but only the
gpu-counter README itself. The repo-root README, docs/poc-intro.md, and
the helpdesk demo's "Further reading" still referred to helpdesk as the
only demo. This commit adds the cross-references so a reader landing
anywhere in the docs can find gpu-counter.

Changes:

- README.md "Read first" gains a gpu-counter entry; "What's in the box"
  table gains a row; "Companion changes upstream" table gains
  agent-substrate/substrate#96 (the load-bearing substrate-side PR).

- docs/poc-intro.md "Demo entry point" becomes "Demo entry points"
  with both helpdesk and gpu-counter listed. "Companion changes" gains
  the substrate#96 entry.

- examples/helpdesk/README.md "Further reading" cross-refs the new
  sibling and substrate#96.

- examples/gpu-counter/README.md expanded ~10x to match helpdesk's
  depth: a 6-beat table organized as three acts; prereqs +
  companion-changes-upstream tables; explicit one-time host pre-flight
  block (persistenced socket replacement, cuda-checkpoint download,
  wrapper install, kind-node prep); Quick start; What's in this
  folder; Verified output (excerpt from the 2026-05-27 brev L40S run,
  including the substrate atelet RPC log excerpts that prove --nvproxy
  is on every runsc invocation); Troubleshooting matrix of six
  symptom→fix rows; Cleanup; Open follow-ups (the two items already
  in the linked impl-log note); Further reading.

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
…A buffer

The original feat/gpu-passthrough commit (c358dff) wired the CRD,
proto and runsc flags but the demo only got as far as golden actor
Run + Checkpoint; user actor Restore failed with `inconsistent
private memory files on restore: savedMFOwners=[pause:/]` and the
CUDA buffer in the workload was never observed to survive a
substrate suspend/resume cycle.

This commit lands the five additional fixes the demo needed on the
H100 brev box `front-emerald-krill` (driver 570.195.03, gVisor
nightly 2026-05-26). With these, a 1 MiB CUDA buffer set via
cuMemsetD8_v2 to byte 0x63 reads back at the same dev_ptr after a
`kubectl ate suspend` + idle + `kubectl ate resume` cycle.

1. cmd/atelet/oci.go: add spec.Linux.Resources.Devices allow entries
   for every nvidia char-device. Without these the OCI bundle gives
   nvproxy the path but the host's cgroup eBPF device filter denies
   ioctl access in the sandbox boot path.

2. cmd/atelet/main.go: pass `firstGpuSpec(...)` to the pause
   container's prepareOCIDirectory too. Previously only the
   supervisor sub-container got --nvproxy via its OCI spec; runsc
   create pause launched the sandbox kernel with nvproxy disabled
   (`--dev-io-fd=-1` in the runsc debug log), so the dev gofer was
   never wired up and supervisor sub-container ioctls failed inside
   the sandbox with `nvproxy: failed to open device gofer nvidiactl:
   devutil.CtxDevGoferClient is not set`.

3. cmd/atelet/oci.go: bind-mount cuda-checkpoint and
   cuda-checkpoint-wrapper.sh from /run/ateom-gvisor/static-files
   (the shared HostPath volume) into /usr/local/bin inside the
   sandbox, falling back to /usr/local/bin on the atelet host.
   atelet runs inside the kind-control-plane container which doesn't
   have /usr/local/bin/cuda-checkpoint, so the previous os.Stat
   silently skipped both mounts.

4. cmd/ateom-gvisor/runsc.go + main.go: add cmdDrainCUDA and
   cmdUntoggleCUDA helpers that `runsc exec supervisor
   /usr/local/bin/cuda-checkpoint --toggle --pid 1` before
   CheckpointWorkload and after RestoreWorkload respectively.
   gVisor's --save-restore-exec-argv flag runs the binary inside the
   container being checkpointed (pause for substrate's root
   sandbox), but pause is the k8s pause image — distroless,
   no /bin/sh — so wrapper scripts with #!/bin/sh shebangs fail with
   `failed to load /usr/local/bin/cuda-checkpoint-wrapper.sh:
   no such file or directory`. Running cuda-checkpoint in the
   supervisor sub-container instead works because libcuda is there
   and the supervisor's PID 1 is the workload Python process.

5. cmd/ateom-gvisor/runsc.go: gpuSaveRestoreFlags returns nil and
   the comment explains why (vs. the previous comment which claimed
   nvproxy auto-registers; on the gVisor versions we use it does
   not — there's no auto-registration code anywhere in the source —
   and explicit registration via the CLI flag conflicts with the
   external drain in item 4 above).
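
Fix 1 can be sketched as follows. The deviceCgroup struct is a local copy of the field layout of the OCI runtime-spec device cgroup rule, kept local so the example needs no third-party imports. hostDevice and allowRules are hypothetical names, the "rwm" access string is an assumption, and the major/minor numbers in main are example values rather than output from a real host.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deviceCgroup mirrors the fields of an OCI runtime-spec device cgroup rule
// (linux.resources.devices).
type deviceCgroup struct {
	Allow  bool   `json:"allow"`
	Type   string `json:"type,omitempty"`
	Major  *int64 `json:"major,omitempty"`
	Minor  *int64 `json:"minor,omitempty"`
	Access string `json:"access,omitempty"`
}

// hostDevice is a hypothetical record of one /dev/nvidia* node on the host.
type hostDevice struct {
	Path         string
	Major, Minor int64
}

// allowRules builds one allow rule per NVIDIA character device, so the
// cgroup device filter permits access in addition to the node being present.
func allowRules(devs []hostDevice) []deviceCgroup {
	rules := make([]deviceCgroup, 0, len(devs))
	for _, d := range devs {
		major, minor := d.Major, d.Minor // copies, so each rule owns its pointers
		rules = append(rules, deviceCgroup{
			Allow:  true,
			Type:   "c", // character device
			Major:  &major,
			Minor:  &minor,
			Access: "rwm", // assumption: read, write, mknod
		})
	}
	return rules
}

func main() {
	// Example numbers only; real values come from stat(2) on the host nodes.
	devs := []hostDevice{{"/dev/nvidia0", 195, 0}, {"/dev/nvidiactl", 195, 255}}
	out, err := json.MarshalIndent(allowRules(devs), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The point of the sketch is the split the fix describes: the device list makes the node visible, while the resources rules decide whether ioctls on it are permitted.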

Empirical demo trace (front-emerald-krill, 2026-05-27 15:42 UTC):

    BEAT3   /set?val=99      → {"ok": true, "val": 99}
            /sum             → {"sum": 405504, "sample": 99, ...}
            /info            → {"dev_ptr": "0x7fe846600000", ...}

    BEAT4   kubectl ate suspend actor gpu1   → STATUS_SUSPENDED
    BEAT5   5 s idle
    BEAT6   kubectl ate resume  actor gpu1   → STATUS_RUNNING
            /info            → {"dev_ptr": "0x7fe846600000", ...}
                              ^^^ same address — CUDA context restored
            /sum             → {"sum": 405504, "sample": 99, ...}
                              ^^^ same data  — buffer survived suspend
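
The suspend and resume beats in this trace rely on the external drain from fix 4. A rough sketch of how that command line could be assembled is shown below; toggleArgv and commandLine are hypothetical helpers, and globalFlags is a placeholder for whatever runsc global flags the caller already passes. Because cuda-checkpoint --toggle flips the state each time, the same argv serves both the drain and the untoggle step.

```go
package main

import (
	"fmt"
	"strings"
)

// toggleArgv builds the runsc argv that toggles CUDA state for PID 1 of the
// named sub-container. globalFlags is a placeholder for the caller's existing
// runsc global flags and may be nil.
func toggleArgv(globalFlags []string, container string) []string {
	argv := append([]string{"runsc"}, globalFlags...)
	return append(argv,
		"exec", container,
		"/usr/local/bin/cuda-checkpoint", "--toggle", "--pid", "1")
}

// commandLine renders an argv for logging.
func commandLine(argv []string) string {
	return strings.Join(argv, " ")
}

func main() {
	toggle := commandLine(toggleArgv(nil, "supervisor"))
	// Suspend: drain CUDA state first, then checkpoint the sandbox.
	fmt.Println("before CheckpointWorkload:", toggle)
	// Resume: restore the sandbox first, then toggle CUDA state back on.
	fmt.Println("after RestoreWorkload:    ", toggle)
}
```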

Two operational notes for the gpu-counter demo (live in the openshell
driver repo):
- the workload image must bake the host's `libcuda.so.<host-driver>`;
  on kind there is no `nvidia-container-cli configure` hook to inject
  it from the host. The 580.x libcuda from the nvidia/cuda:12.6 base
  is rejected by nvproxy 570 with cuInit=NO_DEVICE.
- the runsc binary substrate uses must be the 2026-05-26 nightly or
  later; the release-20260520.0 tag has a multi-container nvproxy
  dev-gofer bug that returns cuInit=NO_DEVICE inside the supervisor
  sub-container even when pause has --nvproxy.

Companion notes:
  - notes/openshell-on-substrate/2026-05-27-gpu-passthrough-impl-log.md
  - notes/openshell-on-substrate/2026-05-25-gpu-passthrough-analysis.md
New helper injectNVIDIAAssetsIntoRootfs (cmd/atelet/oci.go) mirrors
the host's NVIDIA driver libs from
/run/ateom-gvisor/static-files/nvidia-libs/ into each new actor's
<rootfs>/usr/lib/x86_64-linux-gnu at sandbox-create time. Real .so
files are copied byte-for-byte; symlinks are recreated as symlinks.

Operators stage those libs once per box (Appendix I of
2026-05-27-gpu-passthrough-runbook.md drops a copy of bigbox's
nvidia-container-cli list --libraries output + the transitive
SONAME / dev symlinks into the kind-node).

Effect: workload images no longer have to COPY libcuda.so.<host-driver>
in their Dockerfile to satisfy dlopen("libcuda.so.1") inside the
gVisor sandbox. This is the substrate-side equivalent of what
nvidia-container-cli configure --compute --utility --device=all does
in the standard docker+nvidia-container-runtime flow.

Why a Go mirror rather than exec'ing nvidia-container-cli configure:
atelet ships on distroless/static-debian13, so it has no dynamic
linker for nvidia-container-cli's libnvidia-container.so.1 dep. The
end state (driver libs at the linker's default search path) is
identical.

Hard-fails if the staging dir is missing or empty so an operator
misconfiguration surfaces immediately instead of crashing inside the
sandbox.
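
The mirroring behaviour described above (copy real files, recreate symlinks, hard-fail on a missing or empty staging dir) can be sketched roughly as below. mirrorLibs, copyFile, and demo are hypothetical names and this is not the injectNVIDIAAssetsIntoRootfs implementation; demo only exercises the sketch against throwaway temp directories with a fake library file.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// mirrorLibs copies every entry of stagingDir into destDir. Regular files are
// copied byte-for-byte; symlinks are recreated with the same target.
// It returns an error when the staging dir is missing or empty.
func mirrorLibs(stagingDir, destDir string) error {
	entries, err := os.ReadDir(stagingDir)
	if err != nil {
		return fmt.Errorf("read staging dir: %w", err)
	}
	if len(entries) == 0 {
		return fmt.Errorf("staging dir %s is empty", stagingDir)
	}
	if err := os.MkdirAll(destDir, 0o755); err != nil {
		return err
	}
	for _, e := range entries {
		src := filepath.Join(stagingDir, e.Name())
		dst := filepath.Join(destDir, e.Name())
		switch {
		case e.Type()&os.ModeSymlink != 0:
			target, err := os.Readlink(src)
			if err != nil {
				return err
			}
			_ = os.Remove(dst) // replace a stale link from an earlier run
			if err := os.Symlink(target, dst); err != nil {
				return err
			}
		case e.Type().IsRegular():
			if err := copyFile(src, dst); err != nil {
				return err
			}
		}
	}
	return nil
}

// copyFile copies src to dst, truncating any existing file.
func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.OpenFile(dst, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

// demo stages one fake library plus a SONAME symlink and mirrors them.
func demo() error {
	staging, err := os.MkdirTemp("", "staging")
	if err != nil {
		return err
	}
	defer os.RemoveAll(staging)
	rootfs, err := os.MkdirTemp("", "rootfs")
	if err != nil {
		return err
	}
	defer os.RemoveAll(rootfs)
	lib := "libcuda.so.580.159.03"
	if err := os.WriteFile(filepath.Join(staging, lib), []byte("fake"), 0o644); err != nil {
		return err
	}
	if err := os.Symlink(lib, filepath.Join(staging, "libcuda.so.1")); err != nil {
		return err
	}
	dest := filepath.Join(rootfs, "usr/lib/x86_64-linux-gnu")
	if err := mirrorLibs(staging, dest); err != nil {
		return err
	}
	target, err := os.Readlink(filepath.Join(dest, "libcuda.so.1"))
	if err != nil || target != lib {
		return fmt.Errorf("symlink not recreated: %q, %v", target, err)
	}
	data, err := os.ReadFile(filepath.Join(dest, "libcuda.so.1")) // follows the link
	if err != nil || string(data) != "fake" {
		return fmt.Errorf("file not copied: %q, %v", data, err)
	}
	return nil
}

func main() { fmt.Println("demo error:", demo()) }
```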

End-to-end verified on bigbox-h200 (NVIDIA H200 NVL, driver
580.159.03) with an unmodified ubuntu:24.04 + python3 workload image
(no libcuda baked in) — full 6-beat suspend/resume preserves
dev_ptr=0x7f9f23e00000 and GPU buffer byte 0xa7.
Collaborator

If this is only NVIDIA GPUs, we should probably name it accordingly instead of "GPU".

What about other CDI devices? TPUs? At the very least, we probably want to leave room in the API shape for this.

@@ -0,0 +1,20 @@
#!/bin/sh
Collaborator

Currently I don't think we're shipping any other hack/ script to prod. We should probably move this (or replace it with a binary).

2 participants