Use Antigravity in README #4
Merged
Benjamin Elder (BenTheElder), Collaborator, approved these changes on May 20, 2026 and left a comment:
LGTM
Davanum Srinivas (dims) added a commit to dims/substrate that referenced this pull request on May 27, 2026:
…A buffer

The original feat/gpu-passthrough commit (c358dff) wired the CRD, proto and runsc flags but the demo only got as far as golden actor Run + Checkpoint; user actor Restore failed with `inconsistent private memory files on restore: savedMFOwners=[pause:/]` and the CUDA buffer in the workload was never observed to survive a substrate suspend/resume cycle.

This commit lands the five additional fixes the demo needed on the H100 brev box `front-emerald-krill` (driver 570.195.03, gVisor nightly 2026-05-26). With these, a 1 MiB CUDA buffer set via cuMemsetD8_v2 to byte 0x63 reads back at the same dev_ptr after a `kubectl ate suspend` + idle + `kubectl ate resume` cycle.

1. cmd/atelet/oci.go: add spec.Linux.Resources.Devices allow entries for every nvidia char-device. Without these the OCI bundle gives nvproxy the path but the host's cgroup eBPF device filter denies ioctl access in the sandbox boot path.

2. cmd/atelet/main.go: pass `firstGpuSpec(...)` to the pause container's prepareOCIDirectory too. Previously only the supervisor sub-container got --nvproxy via its OCI spec; runsc create pause launched the sandbox kernel with nvproxy disabled (`--dev-io-fd=-1` in the runsc debug log), so the dev gofer was never wired up and supervisor sub-container ioctls failed inside the sandbox with `nvproxy: failed to open device gofer nvidiactl: devutil.CtxDevGoferClient is not set`.

3. cmd/atelet/oci.go: bind-mount cuda-checkpoint and cuda-checkpoint-wrapper.sh from /run/ateom-gvisor/static-files (the shared HostPath volume) into /usr/local/bin inside the sandbox, falling back to /usr/local/bin on the atelet host. atelet runs inside the kind-control-plane container which doesn't have /usr/local/bin/cuda-checkpoint, so the previous os.Stat silently skipped both mounts.

4. cmd/ateom-gvisor/runsc.go + main.go: add cmdDrainCUDA and cmdUntoggleCUDA helpers that `runsc exec supervisor /usr/local/bin/cuda-checkpoint --toggle --pid 1` before CheckpointWorkload and after RestoreWorkload respectively. gVisor's --save-restore-exec-argv flag runs the binary inside the container being checkpointed (pause for substrate's root sandbox), but pause is the k8s pause image (distroless, no /bin/sh), so wrapper scripts with #!/bin/sh shebangs fail with `failed to load /usr/local/bin/cuda-checkpoint-wrapper.sh: no such file or directory`. Running cuda-checkpoint in the supervisor sub-container instead works because libcuda is there and the supervisor's PID 1 is the workload Python process.

5. cmd/ateom-gvisor/runsc.go: gpuSaveRestoreFlags returns nil and the comment explains why (vs. the previous comment which claimed nvproxy auto-registers; on the gVisor versions we use it does not, as there's no auto-registration code anywhere in the source, and explicit registration via the CLI flag conflicts with the external drain in agent-substrate#4).

Empirical demo trace (front-emerald-krill, 2026-05-27 15:42 UTC):

    BEAT3  /set?val=99 → {"ok": true, "val": 99}
           /sum  → {"sum": 405504, "sample": 99, ...}
           /info → {"dev_ptr": "0x7fe846600000", ...}
    BEAT4  kubectl ate suspend actor gpu1 → STATUS_SUSPENDED
    BEAT5  5 s idle
    BEAT6  kubectl ate resume actor gpu1 → STATUS_RUNNING
           /info → {"dev_ptr": "0x7fe846600000", ...}
                   ^^^ same address: CUDA context restored
           /sum  → {"sum": 405504, "sample": 99, ...}
                   ^^^ same data: buffer survived suspend

Two operational notes for the gpu-counter demo (live in the openshell driver repo):

- the workload image must bake the host's `libcuda.so.<host-driver>`; on kind there is no `nvidia-container-cli configure` hook to inject it from the host. The 580.x libcuda from the nvidia/cuda:12.6 base is rejected by nvproxy 570 with cuInit=NO_DEVICE.
- the runsc binary substrate uses must be the 2026-05-26 nightly or later; the release-20260520.0 tag has a multi-container nvproxy dev-gofer bug that returns cuInit=NO_DEVICE inside the supervisor sub-container even when pause has --nvproxy.

Companion notes:

- notes/openshell-on-substrate/2026-05-27-gpu-passthrough-impl-log.md
- notes/openshell-on-substrate/2026-05-25-gpu-passthrough-analysis.md
No description provided.