fix nvidia runtime edge case #1552

Merged
luke-lombardi merged 2 commits into main from ll/bump-blobcache-and-fix-void
Mar 11, 2026

Conversation

Contributor

@luke-lombardi luke-lombardi commented Mar 11, 2026

Summary by cubic

Preserve NVIDIA_VISIBLE_DEVICES during CDI generation to prevent void from hiding GPUs from the worker. Adds GPU discovery edge-case tests and improves allocation error messages.

  • Bug Fixes

    • Capture and restore NVIDIA_VISIBLE_DEVICES around nvidia-ctk cdi generate so the worker’s env isn’t overwritten with void.
    • Enriched “not enough GPUs” error with requested, allocable, visible, configured, already_allocated, and current NVIDIA_VISIBLE_DEVICES.
    • Added tests for AvailableGPUDevices: mismatched visibility, missing /proc entries, procfs errors, and query failures.
  • Dependencies

    • Bumped github.com/VictoriaMetrics/metrics and github.com/beam-cloud/blobcache-v2.
    • Moved github.com/opencontainers/runc and github.com/containerd/console to indirect; removed sdk/uv.lock.
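
The enriched "not enough GPUs" error listed under Bug Fixes could look roughly like the sketch below. This is not the code from the PR: the function name `gpuAllocationError`, its parameters, and the exact message layout are assumptions made for illustration. Only the field names (requested, allocable, visible, configured, already_allocated, and the current `NVIDIA_VISIBLE_DEVICES`) come from the summary.

```go
package main

import (
	"fmt"
	"os"
)

// gpuAllocationError is a hypothetical helper (not taken from the PR).
// It builds one error that carries every value an operator would need
// to see why an allocation request could not be satisfied.
func gpuAllocationError(requested int, allocable, visible, configured, alreadyAllocated []int, visibleDevicesEnv string) error {
	return fmt.Errorf(
		"not enough GPUs: requested=%d allocable=%v visible=%v configured=%v already_allocated=%v NVIDIA_VISIBLE_DEVICES=%q",
		requested, allocable, visible, configured, alreadyAllocated, visibleDevicesEnv,
	)
}

func main() {
	// Example values only: two GPUs requested, one of two already in use.
	err := gpuAllocationError(
		2,
		[]int{0},    // allocable
		[]int{0, 1}, // visible
		[]int{0, 1}, // configured
		[]int{1},    // already allocated
		os.Getenv("NVIDIA_VISIBLE_DEVICES"),
	)
	fmt.Println(err)
}
```

Quoting the environment value with `%q` keeps an empty or unset variable distinguishable from a value such as `void` in the log line.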

Written for commit 5b67b27. Summary will update on new commits.

@luke-lombardi luke-lombardi merged commit 5cdcdb0 into main Mar 11, 2026
2 of 3 checks passed
@luke-lombardi luke-lombardi deleted the ll/bump-blobcache-and-fix-void branch March 11, 2026 16:10
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 5 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="pkg/worker/nvidia.go">

<violation number="1" location="pkg/worker/nvidia.go:47">
P2: Preserve whether `NVIDIA_VISIBLE_DEVICES` was originally unset; using `Getenv` + unconditional `Setenv` can incorrectly create an empty env var.</violation>
</file>
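
One way to address the violation above is to record whether the variable was set at all, using `os.LookupEnv` instead of `os.Getenv`, and to unset it again when it was originally absent. The sketch below is a minimal illustration under that assumption, not the code in `pkg/worker/nvidia.go`: `withPreservedEnv`, `applyEnv`, `describeEnv`, and `fakeGenerate` are hypothetical names, and `fakeGenerate` only imitates the side effect attributed to `nvidia-ctk cdi generate` rather than invoking it.

```go
package main

import (
	"fmt"
	"os"
)

const visibleDevicesKey = "NVIDIA_VISIBLE_DEVICES"

// applyEnv sets key to value, or removes key entirely when set is false.
// Errors from Setenv/Unsetenv are ignored here for brevity.
func applyEnv(key, value string, set bool) {
	if set {
		os.Setenv(key, value)
		return
	}
	os.Unsetenv(key)
}

// withPreservedEnv runs fn and afterwards puts key back exactly as it was,
// including the "was never set" case that Getenv cannot distinguish from "".
func withPreservedEnv(key string, fn func() error) error {
	prev, wasSet := os.LookupEnv(key)
	defer applyEnv(key, prev, wasSet)
	return fn()
}

// fakeGenerate stands in for the CDI generation step (assumption): it only
// reproduces the reported symptom of the variable being overwritten with "void".
func fakeGenerate() error {
	return os.Setenv(visibleDevicesKey, "void")
}

// describeEnv reports whether key is unset or what it is set to.
func describeEnv(key string) string {
	if v, ok := os.LookupEnv(key); ok {
		return fmt.Sprintf("set to %q", v)
	}
	return "unset"
}

func main() {
	applyEnv(visibleDevicesKey, "", false) // start with the variable unset
	_ = withPreservedEnv(visibleDevicesKey, fakeGenerate)
	fmt.Println(describeEnv(visibleDevicesKey)) // prints: unset

	applyEnv(visibleDevicesKey, "0,1", true) // start with two visible devices
	_ = withPreservedEnv(visibleDevicesKey, fakeGenerate)
	fmt.Println(describeEnv(visibleDevicesKey)) // prints: set to "0,1"
}
```

Restoring in a `defer` means the original state comes back even when the wrapped step returns an error.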

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread pkg/worker/nvidia.go
3 participants