fix strict isolation bug #1554

Merged
luke-lombardi merged 1 commit into main from ll/gpu-fix-3
Mar 11, 2026
Contributor

@luke-lombardi luke-lombardi commented Mar 11, 2026

Summary by cubic

Fixes strict GPU isolation by resolving the runtime-injected NVIDIA_VISIBLE_DEVICES and enforcing it during device discovery. Prevents workers from seeing unintended GPUs and improves allocation error messages.

  • Bug Fixes
    • Resolve visible devices via a child shell process to read the hook-injected NVIDIA_VISIBLE_DEVICES; fall back to the current env on error.
    • Pass the resolved value into NvidiaInfoClient and use it for filtering and error reporting.
    • Treat "void" and "" as no access; only "all" exposes all GPUs.
    • Keep a per-manager snapshot of visible devices instead of reading PID 1’s env each time.
    • Refactor tests for clarity and add coverage for the void, empty, single-UUID, and non-zero PCI domain cases.
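
The visibility rules in the list above can be sketched in Go as follows. This is a minimal illustration, not the code from this PR: the `GPU` type and the `filterVisible` helper are hypothetical names, and matching list entries by index as well as by UUID is an assumption (the summary only mentions the single-UUID case).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// GPU is a hypothetical stand-in for the device record the worker tracks.
type GPU struct {
	Index int
	UUID  string
}

// filterVisible returns the subset of all that a given NVIDIA_VISIBLE_DEVICES
// value exposes: "" and "void" expose nothing, "all" exposes everything, and
// any other value is treated as a comma-separated list of UUIDs or indices
// (index matching is an assumption of this sketch).
func filterVisible(visible string, all []GPU) []GPU {
	v := strings.TrimSpace(visible)
	switch v {
	case "", "void":
		return nil
	case "all":
		// Copy so callers cannot mutate the discovery result.
		return append([]GPU(nil), all...)
	}

	allowed := make(map[string]struct{})
	for _, tok := range strings.Split(v, ",") {
		if tok = strings.TrimSpace(tok); tok != "" {
			allowed[tok] = struct{}{}
		}
	}

	var out []GPU
	for _, g := range all {
		_, byUUID := allowed[g.UUID]
		_, byIndex := allowed[strconv.Itoa(g.Index)]
		if byUUID || byIndex {
			out = append(out, g)
		}
	}
	return out
}

func main() {
	gpus := []GPU{{Index: 0, UUID: "GPU-aaa"}, {Index: 1, UUID: "GPU-bbb"}}
	for _, v := range []string{"", "void", "all", "GPU-bbb"} {
		fmt.Printf("%q -> %d visible\n", v, len(filterVisible(v, gpus)))
	}
}
```

The point of the explicit `switch` is that an unset or empty variable never falls through to "everything is visible"; only the literal `all` does.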

Written for commit 5cd7c8c. Summary will update on new commits.

@luke-lombardi luke-lombardi requested a review from mernit March 11, 2026 19:36
@luke-lombardi luke-lombardi merged commit e87add0 into main Mar 11, 2026
2 of 4 checks passed
@luke-lombardi luke-lombardi deleted the ll/gpu-fix-3 branch March 11, 2026 19:36
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="pkg/worker/gpu_info_test.go">

<violation number="1" location="pkg/worker/gpu_info_test.go:178">
P2: These tests stub `resolveVisibleDevices` itself, so they don't verify the real resolution logic and can pass even if production behavior is broken.</violation>
</file>

<file name="pkg/worker/nvidia.go">

<violation number="1" location="pkg/worker/nvidia.go:117">
P1: This type assertion can panic when `infoClient` is any `GPUInfoClient` implementation other than `*NvidiaInfoClient`, turning a normal allocation error path into a runtime crash.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread pkg/worker/nvidia.go
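
One common way to address the P1 finding on `pkg/worker/nvidia.go` is the comma-ok form of the type assertion, which reports a mismatch instead of panicking. The sketch below reuses the names `GPUInfoClient` and `NvidiaInfoClient` from the review, but their shapes here are assumed: the `AvailableGPUs` method, the `visibleDevices` field, and the `visibleDevicesForError` helper are hypothetical and only illustrate the pattern.

```go
package main

import "fmt"

// GPUInfoClient mirrors the interface name from the review; its method set
// here is a placeholder, not the real one.
type GPUInfoClient interface {
	AvailableGPUs() int
}

// NvidiaInfoClient carries the resolved visible-devices value (assumed field).
type NvidiaInfoClient struct {
	visibleDevices string
}

func (c *NvidiaInfoClient) AvailableGPUs() int { return 0 }

// fakeInfoClient is any other implementation, such as a test double.
type fakeInfoClient struct{}

func (fakeInfoClient) AvailableGPUs() int { return 0 }

// visibleDevicesForError extracts the value used in allocation error messages.
// The comma-ok assertion keeps the error path from panicking when the client
// is not a *NvidiaInfoClient, and the nil check covers a typed nil pointer.
func visibleDevicesForError(c GPUInfoClient) string {
	if nv, ok := c.(*NvidiaInfoClient); ok && nv != nil {
		return nv.visibleDevices
	}
	return "unknown"
}

func main() {
	fmt.Println(visibleDevicesForError(&NvidiaInfoClient{visibleDevices: "GPU-abc"}))
	fmt.Println(visibleDevicesForError(fakeInfoClient{}))
}
```

An alternative design would be to expose the value through a method on the interface itself, which removes the assertion entirely at the cost of widening the interface.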
Comment thread pkg/worker/gpu_info_test.go
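
For the P2 finding on `pkg/worker/gpu_info_test.go`, one possible approach is to stub only the narrow step that touches the outside world, so the trimming and fallback logic still runs under test. The sketch below is hypothetical: `lookup` stands in for the child-process read (whose actual command is not shown in this PR), and `resolveVisibleDevicesWith` is an illustrative name, not the function in the repository.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// lookup is a hypothetical seam for the child-process read of
// NVIDIA_VISIBLE_DEVICES. Production code would back it with a real process.
type lookup func() (string, error)

// errChild is a sample failure used to exercise the fallback path.
var errChild = errors.New("child process failed")

// resolveVisibleDevicesWith holds the logic worth testing: use the child's
// answer when available, otherwise fall back to the current environment.
func resolveVisibleDevicesWith(child lookup, getenv func(string) string) string {
	v, err := child()
	if err != nil {
		return getenv("NVIDIA_VISIBLE_DEVICES")
	}
	return strings.TrimSpace(v)
}

func main() {
	// Simulate a failing child so the value comes from this process's env.
	failing := func() (string, error) { return "", errChild }
	fmt.Printf("%q\n", resolveVisibleDevicesWith(failing, os.Getenv))
}
```

A test can then pass a fake `lookup` and a fake `getenv` and still cover both branches of the real resolution logic.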