[WIP] feat: GPU passthrough for actors via gVisor nvproxy + cuda-checkpoint by dims · Pull Request #96 · agent-substrate/substrate

Davanum Srinivas (dims) · 2026-05-27T12:09:16Z

Summary

End-to-end GPU passthrough for substrate actors via gVisor's nvproxy + cuda-checkpoint path.

Changes

CRD — pkg/api/v1alpha1/actortemplate_types.go::Container grows an optional Resources *ContainerResources carrying GPU *GPUResource{Count, Device, DriverCapabilities, DriverVersion}. zz_generated.deepcopy.go and manifests/.../ate.dev_actortemplates.yaml regenerated via go generate ./....
Protos — internal/proto/ateletpb and internal/proto/ateompb Container messages gain a GpuSpec gpu field mirroring the CRD shape.
ateapi → atelet — cmd/ateapi/internal/controlapi/gpu.go adds toAteletGpuSpec(*v1alpha1.ContainerResources) *ateletpb.GpuSpec; the resume and suspend workflows populate it on each ateletpb.Container.
atelet → ateom — cmd/atelet/main.go projects ateletpb.GpuSpec → ateompb.GpuSpec via toAteomGpuSpec. cmd/atelet/oci.go::prepareOCIDirectory gains a gpu *ateletpb.GpuSpec parameter and an addGPUToOCISpec() helper that, when the workload requests GPU, injects /dev/nvidia* device nodes (host major/minor) into Linux.Devices and bind-mounts /usr/local/bin/cuda-checkpoint plus the wrapper script. Pause containers pass nil.
ateom-gvisor — cmd/ateom-gvisor/runsc.go gains a gpu *ateompb.GpuSpec field on the runsc struct (populated via firstGPUSpec from the workload's containers). gpuGlobalFlags() emits --nvproxy [--nvproxy-driver-version=X] [--nvproxy-allowed-driver-capabilities=...] on runsc create/checkpoint/restore. gpuSaveRestoreFlags() is gated to the root container only (the supervisor sub-container restore must not re-invoke the wrapper, and gVisor's nvproxy auto-registers cuda-checkpoint internally on release-20260520.0).
Wrapper — hack/cuda-checkpoint-wrapper.sh (20 lines). Idempotent cuda-checkpoint --toggle over every CUDA-touching PID found in /proc/*/maps. Only skips $$ (self) — not PID 1, because inside a substrate sandbox the workload is PID 1. Used by the bare-metal validate-bare.sh in the demo dir; substrate proper leans on gVisor's internal cuda-checkpoint registration.
Tests — cmd/ateom-gvisor/runsc_test.go (linux build tag) covers gpuGlobalFlags, gpuSaveRestoreFlags, firstGPUSpec. The duration-string regression (30s, not 30000) is asserted explicitly. go test ./pkg/api/v1alpha1/... clean.
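
For readers who want a concrete picture of the CRD → proto projection listed above, here is a minimal sketch. The field names (Count, Device, DriverCapabilities, DriverVersion) come from the description; the field types, the local `GpuSpec` stand-in for the generated ateletpb message, and the `toGpuSpec` helper name are assumptions made for illustration and may not match the PR's code.

```go
package main

import "fmt"

// GPUResource carries the fields named in the PR description.
// The Go types chosen here are guesses, not the PR's definitions.
type GPUResource struct {
	Count              int32
	Device             string
	DriverCapabilities []string
	DriverVersion      string
}

// ContainerResources is the optional block hung off Container.
type ContainerResources struct {
	GPU *GPUResource
}

// GpuSpec is a hypothetical stand-in for the generated ateletpb.GpuSpec.
type GpuSpec struct {
	Count              int32
	Device             string
	DriverCapabilities []string
	DriverVersion      string
}

// toGpuSpec returns nil when no GPU is requested, so "no Resources block"
// and "Resources without GPU" are treated the same way downstream.
func toGpuSpec(r *ContainerResources) *GpuSpec {
	if r == nil || r.GPU == nil {
		return nil
	}
	return &GpuSpec{
		Count:  r.GPU.Count,
		Device: r.GPU.Device,
		// Copy the slice so later edits to the CRD object do not leak through.
		DriverCapabilities: append([]string(nil), r.GPU.DriverCapabilities...),
		DriverVersion:      r.GPU.DriverVersion,
	}
}

func main() {
	res := &ContainerResources{GPU: &GPUResource{
		Count:              1,
		Device:             "all",
		DriverCapabilities: []string{"compute", "utility"},
		DriverVersion:      "570.195.03",
	}}
	fmt.Printf("%+v\n", *toGpuSpec(res))
	fmt.Println(toGpuSpec(nil) == nil)
}
```

Returning nil for the no-GPU case keeps every later stage (OCI spec, runsc flags) on its existing non-GPU path without extra branching.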

Constraints (from the gVisor source crawl)

NVIDIA driver must be R570+ (cuda-checkpoint NVML support) AND appear in runsc nvproxy list-supported-drivers. release-20260520.0 supports 16 versions across 535/550/570/580/590.
Driver version must match across checkpoint and restore — migration across driver upgrades is rejected.
x86_64 only. cuda-checkpoint isn't supported on arm64.
Operational gotcha (documented in the demo README): on hosts with nvidia-persistenced running, replace /run/nvidia-persistenced/socket with a regular file — gVisor's gofer can't bind-mount Unix sockets, and nvidia-container-cli hard-codes that mount.
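
The constraints above could be checked up front before a sandbox is created. The following is a rough sketch under stated assumptions: `checkGPUConstraints` and `checkRestoreDriver` are hypothetical helpers, not functions from this PR, and the supported-driver list is assumed to have been collected separately (for example from `runsc nvproxy list-supported-drivers`).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// driverBranch extracts the leading branch number, e.g. 570 from "570.195.03".
func driverBranch(version string) (int, error) {
	head, _, _ := strings.Cut(version, ".")
	return strconv.Atoi(head)
}

// checkGPUConstraints applies the three host-side rules: x86_64 only,
// driver branch R570 or newer, and driver present in the supported list.
func checkGPUConstraints(goarch, driver string, supported []string) error {
	if goarch != "amd64" {
		return fmt.Errorf("cuda-checkpoint requires x86_64, got %s", goarch)
	}
	branch, err := driverBranch(driver)
	if err != nil {
		return fmt.Errorf("unparseable driver version %q: %w", driver, err)
	}
	if branch < 570 {
		return fmt.Errorf("driver %s is older than R570", driver)
	}
	for _, s := range supported {
		if s == driver {
			return nil
		}
	}
	return fmt.Errorf("driver %s is not in the nvproxy supported list", driver)
}

// checkRestoreDriver rejects a restore onto a different driver version
// than the one the checkpoint was taken with.
func checkRestoreDriver(checkpointed, current string) error {
	if checkpointed != current {
		return fmt.Errorf("checkpoint taken on driver %s, host runs %s", checkpointed, current)
	}
	return nil
}

func main() {
	supported := []string{"570.195.03", "580.126.09"}
	fmt.Println(checkGPUConstraints("amd64", "570.195.03", supported))
	fmt.Println(checkGPUConstraints("arm64", "570.195.03", supported))
	fmt.Println(checkRestoreDriver("570.195.03", "580.126.09"))
}
```

Failing at admission time gives the operator a readable error instead of a cuInit failure inside the sandbox.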

Test plan

go vet ./pkg/... ./internal/proto/... ./cmd/atelet/... ./cmd/ateapi/...
GOOS=linux go build ./cmd/atelet ./cmd/ateapi ./cmd/ateom-gvisor
go test ./pkg/api/v1alpha1/...

Wires the ActorTemplate CRD, atelet's OCI builder, and ateom-gvisor's runsc invocations end-to-end so a containerised actor can declare GPU intent and have it survive checkpoint/restore through gVisor's official cuda-checkpoint path.

Changes:

- pkg/api/v1alpha1/actortemplate_types.go: Container grows an optional Resources block carrying a GPUResource. Generated zz_generated.deepcopy.go + ate.dev_actortemplates.yaml regenerated via go generate.
- internal/proto/ateletpb + ateompb: Container grows a GpuSpec field mirroring the CRD shape. cmd/atelet projects it from ActorTemplate into the ateom workload spec.
- cmd/atelet/oci.go::prepareOCIDirectory: when a container requests GPU, inject /dev/nvidia* device entries (host major/minor) and bind-mount /usr/local/bin/cuda-checkpoint plus the wrapper script read-only into the bundle. The pause container never gets GPU.
- cmd/ateom-gvisor/runsc.go: per-sandbox runsc struct learns to emit --nvproxy, --nvproxy-driver-version, --nvproxy-allowed-driver-capabilities on create/restore, and --save-restore-exec-argv + --save-restore-exec-timeout on checkpoint/restore. Without these, runsc panics with `nvproxy.frontendFDMemmapFile is not saveable` the moment an actor with live CUDA state is checkpointed.
- hack/cuda-checkpoint-wrapper.sh: the script runsc invokes inside the sandbox via --save-restore-exec-argv. Enumerates CUDA-touching PIDs by grepping /proc/*/maps and toggles each via cuda-checkpoint. Skips only $$ (self), not PID 1 -- the workload is PID 1 inside the sandbox, and skipping it makes the script find nothing.

Constraints from the gVisor source crawl:

* driver must be R570+ and in runsc nvproxy list-supported-drivers
* driver version must match across checkpoint and restore
* x86_64 only (cuda-checkpoint unsupported on arm64)
* /run/nvidia-persistenced/socket must be a regular file on the host, because gVisor's gofer can't bind-mount Unix sockets and nvidia-container-cli hard-codes this bind regardless of persistenced state.

Verified on an NVIDIA L40S with driver 580.126.09: nvidia-smi works inside runsc-gpu sandboxes; cuda-checkpoint --toggle drains a live CUDA context.

go vet ./..., GOOS=linux go build ./cmd/..., go test ./pkg/api/v1alpha1/... all clean.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
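
To make the runsc flag handling described in this commit concrete, here is a minimal sketch of how the nvproxy flags might be assembled from a GPU spec. The flag names are the ones quoted above; the `GpuSpec` struct is a local stand-in, `sketchGPUGlobalFlags` is a hypothetical reconstruction rather than the PR's gpuGlobalFlags, and joining capabilities with commas is an assumption about the flag's expected value format.

```go
package main

import (
	"fmt"
	"strings"
)

// GpuSpec is a local stand-in for the ateompb.GpuSpec message.
type GpuSpec struct {
	DriverVersion      string
	DriverCapabilities []string
}

// sketchGPUGlobalFlags returns the extra runsc flags for a GPU workload.
// A nil spec yields no flags, so non-GPU sandboxes are left untouched.
func sketchGPUGlobalFlags(gpu *GpuSpec) []string {
	if gpu == nil {
		return nil
	}
	flags := []string{"--nvproxy"}
	if gpu.DriverVersion != "" {
		flags = append(flags, "--nvproxy-driver-version="+gpu.DriverVersion)
	}
	if len(gpu.DriverCapabilities) > 0 {
		// Assumption: the flag takes a comma-separated capability list.
		flags = append(flags, "--nvproxy-allowed-driver-capabilities="+
			strings.Join(gpu.DriverCapabilities, ","))
	}
	return flags
}

func main() {
	spec := &GpuSpec{
		DriverVersion:      "580.126.09",
		DriverCapabilities: []string{"compute", "utility"},
	}
	fmt.Println(sketchGPUGlobalFlags(spec))
	fmt.Println(len(sketchGPUGlobalFlags(nil)))
}
```

Keeping the optional flags conditional means an ActorTemplate that only sets a GPU count still gets plain `--nvproxy` and lets runsc detect the driver version itself.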

The previous commit (0b46450) landed examples/gpu-counter/ but only the gpu-counter README itself. The repo-root README, docs/poc-intro.md, and the helpdesk demo's "Further reading" still referred to helpdesk as the only demo. This commit adds the cross-references so a reader landing anywhere in the docs can find gpu-counter.

Changes:

- README.md "Read first" gains a gpu-counter entry; "What's in the box" table gains a row; "Companion changes upstream" table gains agent-substrate/substrate#96 (the load-bearing substrate-side PR).
- docs/poc-intro.md "Demo entry point" becomes "Demo entry points" with both helpdesk and gpu-counter listed. "Companion changes" gains the substrate#96 entry.
- examples/helpdesk/README.md "Further reading" cross-refs the new sibling and substrate#96.
- examples/gpu-counter/README.md expanded ~10x to match helpdesk's depth: a 6-beat table organized as three acts; prereqs + companion-changes-upstream tables; explicit one-time host pre-flight block (persistenced socket replacement, cuda-checkpoint download, wrapper install, kind-node prep); Quick start; What's in this folder; Verified output (excerpt from the 2026-05-27 brev L40S run, including the substrate atelet RPC log excerpts that prove --nvproxy is on every runsc invocation); Troubleshooting matrix of six symptom→fix rows; Cleanup; Open follow-ups (the two items already in the linked impl-log note); Further reading.

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>

…A buffer

The original feat/gpu-passthrough commit (c358dff) wired the CRD, proto and runsc flags but the demo only got as far as golden actor Run + Checkpoint; user actor Restore failed with `inconsistent private memory files on restore: savedMFOwners=[pause:/]` and the CUDA buffer in the workload was never observed to survive a substrate suspend/resume cycle.

This commit lands the five additional fixes the demo needed on the H100 brev box `front-emerald-krill` (driver 570.195.03, gVisor nightly 2026-05-26). With these, a 1 MiB CUDA buffer set via cuMemsetD8_v2 to byte 0x63 reads back at the same dev_ptr after a `kubectl ate suspend` + idle + `kubectl ate resume` cycle.

1. cmd/atelet/oci.go: add spec.Linux.Resources.Devices allow entries for every nvidia char-device. Without these the OCI bundle gives nvproxy the path but the host's cgroup eBPF device filter denies ioctl access in the sandbox boot path.
2. cmd/atelet/main.go: pass `firstGpuSpec(...)` to the pause container's prepareOCIDirectory too. Previously only the supervisor sub-container got --nvproxy via its OCI spec; runsc create pause launched the sandbox kernel with nvproxy disabled (`--dev-io-fd=-1` in the runsc debug log), so the dev gofer was never wired up and supervisor sub-container ioctls failed inside the sandbox with `nvproxy: failed to open device gofer nvidiactl: devutil.CtxDevGoferClient is not set`.
3. cmd/atelet/oci.go: bind-mount cuda-checkpoint and cuda-checkpoint-wrapper.sh from /run/ateom-gvisor/static-files (the shared HostPath volume) into /usr/local/bin inside the sandbox, falling back to /usr/local/bin on the atelet host. atelet runs inside the kind-control-plane container which doesn't have /usr/local/bin/cuda-checkpoint, so the previous os.Stat silently skipped both mounts.
4. cmd/ateom-gvisor/runsc.go + main.go: add cmdDrainCUDA and cmdUntoggleCUDA helpers that `runsc exec supervisor /usr/local/bin/cuda-checkpoint --toggle --pid 1` before CheckpointWorkload and after RestoreWorkload respectively. gVisor's --save-restore-exec-argv flag runs the binary inside the container being checkpointed (pause for substrate's root sandbox), but pause is the k8s pause image (distroless, no /bin/sh), so wrapper scripts with #!/bin/sh shebangs fail with `failed to load /usr/local/bin/cuda-checkpoint-wrapper.sh: no such file or directory`. Running cuda-checkpoint in the supervisor sub-container instead works because libcuda is there and the supervisor's PID 1 is the workload Python process.
5. cmd/ateom-gvisor/runsc.go: gpuSaveRestoreFlags returns nil and the comment explains why (vs. the previous comment which claimed nvproxy auto-registers; on the gVisor versions we use it does not -- there's no auto-registration code anywhere in the source -- and explicit registration via the CLI flag conflicts with the external drain in #4).

Empirical demo trace (front-emerald-krill, 2026-05-27 15:42 UTC):

  BEAT3  /set?val=99 → {"ok": true, "val": 99}
         /sum → {"sum": 405504, "sample": 99, ...}
         /info → {"dev_ptr": "0x7fe846600000", ...}
  BEAT4  kubectl ate suspend actor gpu1 → STATUS_SUSPENDED
  BEAT5  5 s idle
  BEAT6  kubectl ate resume actor gpu1 → STATUS_RUNNING
         /info → {"dev_ptr": "0x7fe846600000", ...}
                 ^^^ same address -- CUDA context restored
         /sum → {"sum": 405504, "sample": 99, ...}
                 ^^^ same data -- buffer survived suspend

Two operational notes for the gpu-counter demo (live in the openshell driver repo):

- the workload image must bake the host's `libcuda.so.<host-driver>`; on kind there is no `nvidia-container-cli configure` hook to inject it from the host. The 580.x libcuda from the nvidia/cuda:12.6 base is rejected by nvproxy 570 with cuInit=NO_DEVICE.
- the runsc binary substrate uses must be the 2026-05-26 nightly or later; the release-20260520.0 tag has a multi-container nvproxy dev-gofer bug that returns cuInit=NO_DEVICE inside the supervisor sub-container even when pause has --nvproxy.

Companion notes:

- notes/openshell-on-substrate/2026-05-27-gpu-passthrough-impl-log.md
- notes/openshell-on-substrate/2026-05-25-gpu-passthrough-analysis.md
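
The external drain from fix 4 amounts to one `runsc exec` per direction, ordered around the checkpoint and restore calls. The sketch below illustrates that ordering only. `toggleArgv`, `suspendWithDrain`, and `resumeWithUntoggle` are hypothetical names, the checkpoint/restore steps are passed in as plain functions, and the runsc global flags a real invocation would also carry (such as --nvproxy) are omitted.

```go
package main

import (
	"fmt"
	"strconv"
)

const cudaCheckpoint = "/usr/local/bin/cuda-checkpoint"

// toggleArgv builds the command line that toggles CUDA state for one PID
// inside the named sub-container. Global runsc flags are left out here.
func toggleArgv(container string, pid int) []string {
	return []string{"runsc", "exec", container, cudaCheckpoint,
		"--toggle", "--pid", strconv.Itoa(pid)}
}

// step is any action that can fail, e.g. running a runsc subcommand.
type step func() error

// suspendWithDrain toggles CUDA first; the checkpoint runs only if that succeeds.
func suspendWithDrain(toggle, checkpoint step) error {
	if err := toggle(); err != nil {
		return fmt.Errorf("drain CUDA before checkpoint: %w", err)
	}
	return checkpoint()
}

// resumeWithUntoggle restores the sandbox, then toggles CUDA back on.
func resumeWithUntoggle(restore, toggle step) error {
	if err := restore(); err != nil {
		return err
	}
	if err := toggle(); err != nil {
		return fmt.Errorf("untoggle CUDA after restore: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(toggleArgv("supervisor", 1))

	var order []string
	record := func(name string) step {
		return func() error { order = append(order, name); return nil }
	}
	_ = suspendWithDrain(record("toggle"), record("checkpoint"))
	_ = resumeWithUntoggle(record("restore"), record("toggle"))
	fmt.Println(order)
}
```

Because cuda-checkpoint --toggle flips state, the same argv serves both directions; only its position relative to checkpoint and restore changes.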

New helper injectNVIDIAAssetsIntoRootfs (cmd/atelet/oci.go) mirrors the host's NVIDIA driver libs from /run/ateom-gvisor/static-files/nvidia-libs/ into each new actor's <rootfs>/usr/lib/x86_64-linux-gnu at sandbox-create time. Real .so files are copied byte-for-byte; symlinks are recreated as symlinks. Operators stage those libs once per box (Appendix I of 2026-05-27-gpu-passthrough-runbook.md drops a copy of bigbox's nvidia-container-cli list --libraries output + the transitive SONAME / dev symlinks into the kind-node).

Effect: workload images no longer have to COPY libcuda.so.<host-driver> in their Dockerfile to satisfy dlopen("libcuda.so.1") inside the gVisor sandbox. This is the substrate-side equivalent of what nvidia-container-cli configure --compute --utility --device=all does in the standard docker+nvidia-container-runtime flow.

Why a Go mirror rather than exec'ing nvidia-container-cli configure: atelet ships on distroless/static-debian13, so it has no dynamic linker for nvidia-container-cli's libnvidia-container.so.1 dep. The end state (driver libs at the linker's default search path) is identical.

Hard-fails if the staging dir is missing or empty so an operator misconfiguration surfaces immediately instead of crashing inside the sandbox.

End-to-end verified on bigbox-h200 (NVIDIA H200 NVL, driver 580.159.03) with an unmodified ubuntu:24.04 + python3 workload image (no libcuda baked in) -- full 6-beat suspend/resume preserves dev_ptr=0x7f9f23e00000 and GPU buffer byte 0xa7.
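
The mirroring behaviour described in this commit (copy regular files, recreate symlinks, fail hard on a missing or empty staging dir) can be sketched with the standard library alone. This is an illustrative approximation, not the PR's injectNVIDIAAssetsIntoRootfs: `mirrorLibs`, `copyFile`, and `runDemo` are hypothetical names, the sketch handles only a flat directory, and the demo uses a fake library file in a temp dir.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// mirrorLibs copies regular files byte-for-byte and recreates symlinks from
// stagingDir into destDir. It fails fast when stagingDir is missing or empty.
func mirrorLibs(stagingDir, destDir string) error {
	entries, err := os.ReadDir(stagingDir)
	if err != nil {
		return fmt.Errorf("read staging dir %s: %w", stagingDir, err)
	}
	if len(entries) == 0 {
		return fmt.Errorf("staging dir %s is empty", stagingDir)
	}
	if err := os.MkdirAll(destDir, 0o755); err != nil {
		return err
	}
	for _, e := range entries {
		src := filepath.Join(stagingDir, e.Name())
		dst := filepath.Join(destDir, e.Name())
		switch {
		case e.Type()&os.ModeSymlink != 0:
			target, err := os.Readlink(src)
			if err != nil {
				return err
			}
			// Replace any stale entry so repeated runs stay idempotent.
			if err := os.Remove(dst); err != nil && !os.IsNotExist(err) {
				return err
			}
			if err := os.Symlink(target, dst); err != nil {
				return err
			}
		case e.Type().IsRegular():
			if err := copyFile(src, dst); err != nil {
				return err
			}
		}
	}
	return nil
}

// copyFile writes the contents of src to dst, creating or truncating dst.
func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.OpenFile(dst, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

// runDemo stages one fake library plus a SONAME symlink in a temp dir,
// mirrors them, and reports what landed in the destination.
func runDemo() (content, linkTarget string, err error) {
	root, err := os.MkdirTemp("", "nvidia-libs-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(root)
	staging := filepath.Join(root, "staging")
	dest := filepath.Join(root, "rootfs", "usr", "lib", "x86_64-linux-gnu")
	if err = os.MkdirAll(staging, 0o755); err != nil {
		return "", "", err
	}
	if err = os.WriteFile(filepath.Join(staging, "libcuda.so.0.0.0"), []byte("fake"), 0o644); err != nil {
		return "", "", err
	}
	if err = os.Symlink("libcuda.so.0.0.0", filepath.Join(staging, "libcuda.so.1")); err != nil {
		return "", "", err
	}
	if err = mirrorLibs(staging, dest); err != nil {
		return "", "", err
	}
	data, err := os.ReadFile(filepath.Join(dest, "libcuda.so.1")) // follows the symlink
	if err != nil {
		return "", "", err
	}
	linkTarget, err = os.Readlink(filepath.Join(dest, "libcuda.so.1"))
	return string(data), linkTarget, err
}

func main() {
	content, target, err := runDemo()
	if err != nil {
		fmt.Println("demo failed:", err)
		os.Exit(1)
	}
	fmt.Printf("libcuda.so.1 -> %s (content %q)\n", target, content)
}
```

Recreating symlinks rather than copying their targets keeps the SONAME chain intact, so dlopen("libcuda.so.1") resolves the same way it would on the host.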

Benjamin Elder (BenTheElder)

If this is only nvidia GPUs, we should probably name it appropriately instead of "GPU"

What about other CDI Devices? TPUs? At the very least we probably want to leave API shape for this.

Benjamin Elder (BenTheElder) · 2026-05-28T03:23:07Z

@@ -0,0 +1,20 @@
+#!/bin/sh


Currently I don't think we're shipping any other hack/ script to prod. We should probably move this (or replace it with a binary)

Davanum Srinivas (dims) changed the title ~~feat: GPU passthrough for actors via gVisor nvproxy + cuda-checkpoint~~ [WIP] feat: GPU passthrough for actors via gVisor nvproxy + cuda-checkpoint May 27, 2026

Davanum Srinivas (dims) marked this pull request as draft May 27, 2026 12:11

Davanum Srinivas (dims) added 2 commits May 27, 2026 12:11

Benjamin Elder (BenTheElder) reviewed May 28, 2026


a4-a4s1 Bot mentioned this pull request May 28, 2026

fix(atelet): prevent path traversal in OCI tar extraction #101

Open

2 tasks

Davanum Srinivas (dims) wants to merge 3 commits into agent-substrate:main from dims:feat/gpu-passthrough
