Use Antigravity in README#4

Merged
Benjamin Elder (BenTheElder) merged 1 commit into main from agy
May 20, 2026
Conversation

@rakyll
Collaborator

No description provided.

Collaborator

LGTM

@BenTheElder Benjamin Elder (BenTheElder) merged commit e24f170 into main May 20, 2026
4 checks passed
@thockin Tim Hockin (thockin) deleted the agy branch May 20, 2026 00:43
@BenTheElder Benjamin Elder (BenTheElder) added the documentation Improvements or additions to documentation label May 21, 2026
Davanum Srinivas (dims) added a commit to dims/substrate that referenced this pull request May 27, 2026
…A buffer

The original feat/gpu-passthrough commit (c358dff) wired the CRD,
proto and runsc flags but the demo only got as far as golden actor
Run + Checkpoint; user actor Restore failed with `inconsistent
private memory files on restore: savedMFOwners=[pause:/]` and the
CUDA buffer in the workload was never observed to survive a
substrate suspend/resume cycle.

This commit lands the five additional fixes the demo needed on the
H100 brev box `front-emerald-krill` (driver 570.195.03, gVisor
nightly 2026-05-26). With these, a 1 MiB CUDA buffer set via
cuMemsetD8_v2 to byte 0x63 reads back at the same dev_ptr after a
`kubectl ate suspend` + idle + `kubectl ate resume` cycle.

1. cmd/atelet/oci.go: add spec.Linux.Resources.Devices allow entries
   for every nvidia char-device. Without these the OCI bundle gives
   nvproxy the path but the host's cgroup eBPF device filter denies
   ioctl access in the sandbox boot path.

2. cmd/atelet/main.go: pass `firstGpuSpec(...)` to the pause
   container's prepareOCIDirectory too. Previously only the
   supervisor sub-container got --nvproxy via its OCI spec, so
   `runsc create pause` launched the sandbox kernel with nvproxy
   disabled (`--dev-io-fd=-1` in the runsc debug log). As a result
   the dev gofer was never wired up, and supervisor sub-container
   ioctls failed inside the sandbox with `nvproxy: failed to open
   device gofer nvidiactl: devutil.CtxDevGoferClient is not set`.

3. cmd/atelet/oci.go: bind-mount cuda-checkpoint and
   cuda-checkpoint-wrapper.sh from /run/ateom-gvisor/static-files
   (the shared HostPath volume) into /usr/local/bin inside the
   sandbox, falling back to /usr/local/bin on the atelet host.
   atelet runs inside the kind-control-plane container, which does
   not have /usr/local/bin/cuda-checkpoint, so the previous os.Stat
   check failed and both mounts were silently skipped.

4. cmd/ateom-gvisor/runsc.go + main.go: add cmdDrainCUDA and
   cmdUntoggleCUDA helpers that run `runsc exec supervisor
   /usr/local/bin/cuda-checkpoint --toggle --pid 1` before
   CheckpointWorkload and after RestoreWorkload respectively.
   gVisor's --save-restore-exec-argv flag runs the binary inside the
   container being checkpointed (pause, for substrate's root
   sandbox), but pause is the k8s pause image (distroless, no
   /bin/sh), so wrapper scripts with #!/bin/sh shebangs fail with
   `failed to load /usr/local/bin/cuda-checkpoint-wrapper.sh:
   no such file or directory`. Running cuda-checkpoint in the
   supervisor sub-container works instead because libcuda is there
   and the supervisor's PID 1 is the workload Python process.

5. cmd/ateom-gvisor/runsc.go: gpuSaveRestoreFlags now returns nil, and
   its comment explains why. The previous comment claimed that nvproxy
   auto-registers; on the gVisor versions we use it does not (there is
   no auto-registration code anywhere in the source), and explicit
   registration via the CLI flag conflicts with the external drain in
   item 4 above.

Empirical demo trace (front-emerald-krill, 2026-05-27 15:42 UTC):

    BEAT3   /set?val=99      → {"ok": true, "val": 99}
            /sum             → {"sum": 405504, "sample": 99, ...}
            /info            → {"dev_ptr": "0x7fe846600000", ...}

    BEAT4   kubectl ate suspend actor gpu1   → STATUS_SUSPENDED
    BEAT5   5 s idle
    BEAT6   kubectl ate resume  actor gpu1   → STATUS_RUNNING
            /info            → {"dev_ptr": "0x7fe846600000", ...}
                              ^^^ same address — CUDA context restored
            /sum             → {"sum": 405504, "sample": 99, ...}
                              ^^^ same data  — buffer survived suspend

Two operational notes for the gpu-counter demo (which lives in the
openshell driver repo):
- the workload image must bake the host's `libcuda.so.<host-driver>`;
  on kind there is no `nvidia-container-cli configure` hook to inject
  it from the host. The 580.x libcuda from the nvidia/cuda:12.6 base
  is rejected by nvproxy 570 with cuInit=NO_DEVICE.
- the runsc binary that substrate uses must be the 2026-05-26 nightly or
  later; the release-20260520.0 tag has a multi-container nvproxy
  dev-gofer bug that returns cuInit=NO_DEVICE inside the supervisor
  sub-container even when pause has --nvproxy.

Companion notes:
  - notes/openshell-on-substrate/2026-05-27-gpu-passthrough-impl-log.md
  - notes/openshell-on-substrate/2026-05-25-gpu-passthrough-analysis.md