Use Antigravity in README by rakyll · Pull Request #4 · agent-substrate/substrate

Jaana Dogan (rakyll) · 2026-05-20T00:34:35Z

No description provided.

Benjamin Elder (BenTheElder)

LGTM

…A buffer

The original feat/gpu-passthrough commit (c358dff) wired the CRD, proto and runsc flags but the demo only got as far as golden actor Run + Checkpoint; user actor Restore failed with `inconsistent private memory files on restore: savedMFOwners=[pause:/]` and the CUDA buffer in the workload was never observed to survive a substrate suspend/resume cycle.

This commit lands the five additional fixes the demo needed on the H100 brev box `front-emerald-krill` (driver 570.195.03, gVisor nightly 2026-05-26). With these, a 1 MiB CUDA buffer set via cuMemsetD8_v2 to byte 0x63 reads back at the same dev_ptr after a `kubectl ate suspend` + idle + `kubectl ate resume` cycle.

1. cmd/atelet/oci.go: add spec.Linux.Resources.Devices allow entries for every nvidia char-device. Without these the OCI bundle gives nvproxy the path but the host's cgroup eBPF device filter denies ioctl access in the sandbox boot path.

2. cmd/atelet/main.go: pass `firstGpuSpec(...)` to the pause container's prepareOCIDirectory too. Previously only the supervisor sub-container got --nvproxy via its OCI spec; runsc create pause launched the sandbox kernel with nvproxy disabled (`--dev-io-fd=-1` in the runsc debug log), so the dev gofer was never wired up and supervisor sub-container ioctls failed inside the sandbox with `nvproxy: failed to open device gofer nvidiactl: devutil.CtxDevGoferClient is not set`.

3. cmd/atelet/oci.go: bind-mount cuda-checkpoint and cuda-checkpoint-wrapper.sh from /run/ateom-gvisor/static-files (the shared HostPath volume) into /usr/local/bin inside the sandbox, falling back to /usr/local/bin on the atelet host. atelet runs inside the kind-control-plane container which doesn't have /usr/local/bin/cuda-checkpoint, so the previous os.Stat silently skipped both mounts.

4. cmd/ateom-gvisor/runsc.go + main.go: add cmdDrainCUDA and cmdUntoggleCUDA helpers that `runsc exec supervisor /usr/local/bin/cuda-checkpoint --toggle --pid 1` before CheckpointWorkload and after RestoreWorkload respectively. gVisor's --save-restore-exec-argv flag runs the binary inside the container being checkpointed (pause for substrate's root sandbox), but pause is the k8s pause image (distroless, no /bin/sh), so wrapper scripts with #!/bin/sh shebangs fail with `failed to load /usr/local/bin/cuda-checkpoint-wrapper.sh: no such file or directory`. Running cuda-checkpoint in the supervisor sub-container instead works because libcuda is there and the supervisor's PID 1 is the workload Python process.

5. cmd/ateom-gvisor/runsc.go: gpuSaveRestoreFlags returns nil and the comment explains why (vs. the previous comment which claimed nvproxy auto-registers; on the gVisor versions we use it does not, as there's no auto-registration code anywhere in the source, and explicit registration via the CLI flag conflicts with the external drain in agent-substrate#4).

Empirical demo trace (front-emerald-krill, 2026-05-27 15:42 UTC):

    BEAT3  /set?val=99  → {"ok": true, "val": 99}
           /sum         → {"sum": 405504, "sample": 99, ...}
           /info        → {"dev_ptr": "0x7fe846600000", ...}
    BEAT4  kubectl ate suspend actor gpu1 → STATUS_SUSPENDED
    BEAT5  5 s idle
    BEAT6  kubectl ate resume actor gpu1 → STATUS_RUNNING
           /info        → {"dev_ptr": "0x7fe846600000", ...}
                          ^^^ same address: CUDA context restored
           /sum         → {"sum": 405504, "sample": 99, ...}
                          ^^^ same data: buffer survived suspend

Two operational notes for the gpu-counter demo (live in the openshell driver repo):

- the workload image must bake the host's `libcuda.so.<host-driver>`; on kind there is no `nvidia-container-cli configure` hook to inject it from the host. The 580.x libcuda from the nvidia/cuda:12.6 base is rejected by nvproxy 570 with cuInit=NO_DEVICE.
- the runsc binary substrate uses must be the 2026-05-26 nightly or later; the release-20260520.0 tag has a multi-container nvproxy dev-gofer bug that returns cuInit=NO_DEVICE inside the supervisor sub-container even when pause has --nvproxy.

Companion notes:

- notes/openshell-on-substrate/2026-05-27-gpu-passthrough-impl-log.md
- notes/openshell-on-substrate/2026-05-25-gpu-passthrough-analysis.md
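The pass criterion in the demo trace (same `dev_ptr` and same `sum`/`sample` before suspend and after resume) could be checked mechanically. The following is a minimal Go sketch of that comparison, not code from the commit: the `snapshot` type and the `parseSnapshot` and `survived` helpers are hypothetical names, and the JSON bodies contain only the fields shown in the trace (the fields elided there with `...` are left out).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// snapshot holds the fields the trace compares across suspend/resume.
// Hypothetical type: the field names come from the trace, the struct does not.
type snapshot struct {
	DevPtr string `json:"dev_ptr"`
	Sum    int64  `json:"sum"`
	Sample int    `json:"sample"`
}

// parseSnapshot merges one /info body and one /sum body into a snapshot.
// json.Unmarshal leaves struct fields untouched when the key is absent,
// so decoding both bodies into the same value combines them.
func parseSnapshot(infoBody, sumBody []byte) (snapshot, error) {
	var s snapshot
	if err := json.Unmarshal(infoBody, &s); err != nil {
		return s, fmt.Errorf("decode /info: %w", err)
	}
	if err := json.Unmarshal(sumBody, &s); err != nil {
		return s, fmt.Errorf("decode /sum: %w", err)
	}
	return s, nil
}

// survived returns nil when the buffer address and contents are unchanged.
func survived(before, after snapshot) error {
	if before.DevPtr != after.DevPtr {
		return fmt.Errorf("dev_ptr changed: %s -> %s", before.DevPtr, after.DevPtr)
	}
	if before.Sum != after.Sum || before.Sample != after.Sample {
		return fmt.Errorf("data changed: sum %d -> %d, sample %d -> %d",
			before.Sum, after.Sum, before.Sample, after.Sample)
	}
	return nil
}

func main() {
	// Bodies reduced to the fields visible in the trace.
	info := []byte(`{"dev_ptr": "0x7fe846600000"}`)
	sum := []byte(`{"sum": 405504, "sample": 99}`)

	before, err := parseSnapshot(info, sum) // BEAT3 responses
	if err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	after, err := parseSnapshot(info, sum) // BEAT6 responses
	if err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	if err := survived(before, after); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("buffer survived suspend/resume")
}
```

In a real run the two snapshots would come from HTTP calls to the workload before `kubectl ate suspend` and after `kubectl ate resume`; the sketch reuses the literal bodies only to stay self-contained.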

Use Antigravity in README

9f50596

Benjamin Elder (BenTheElder) approved these changes May 20, 2026


Benjamin Elder (BenTheElder) merged commit e24f170 into main May 20, 2026
4 checks passed

Tim Hockin (thockin) deleted the agy branch May 20, 2026 00:43

Benjamin Elder (BenTheElder) self-assigned this May 21, 2026

Benjamin Elder (BenTheElder) added the documentation Improvements or additions to documentation label May 21, 2026


Use Antigravity in README #4

Benjamin Elder (BenTheElder) merged 1 commit into main from agy


2 participants
