different approach to fixing isolation bug #1556

Merged
luke-lombardi merged 1 commit into main from ll/different-approach
Mar 11, 2026
@luke-lombardi luke-lombardi commented Mar 11, 2026

Summary by cubic

Fixes GPU isolation by resolving the assigned GPU UUID from the kubelet device plugin checkpoint instead of from the environment of PID 1. Workers now reliably select the correct GPU and avoid the NVIDIA_VISIBLE_DEVICES=void bug.

  • Bug Fixes
    • Read /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint to map POD_UID to nvidia.com/gpu UUIDs, with fallback to NVIDIA_VISIBLE_DEVICES.
    • Mount /var/lib/kubelet/device-plugins read-only into worker pods in both external and local schedulers.
    • Inject POD_UID into worker env for lookup.
    • Add tests for checkpoint parsing and update the integration test to validate correct GPU resolution and allocation.

Written for commit 95ea521. Summary will update on new commits.

@luke-lombardi luke-lombardi merged commit be2fdbc into main Mar 11, 2026
2 of 3 checks passed
@luke-lombardi luke-lombardi deleted the ll/different-approach branch March 11, 2026 20:31
@cubic-dev-ai cubic-dev-ai bot left a comment
1 issue found across 6 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="pkg/worker/gpu_info.go">

<violation number="1" location="pkg/worker/gpu_info.go:68">
P2: This returns only one `DeviceIDs` map value, which can drop assigned GPUs when IDs are spread across multiple entries. Flatten all UUID slices before joining.</violation>
</file>
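
To illustrate the violation above: if `DeviceIDs` is a map of UUID slices, returning a single map value loses any GPUs stored under the other keys. The sketch below contrasts the two behaviours. Both function names are hypothetical, and the `map[string][]string` type is an assumption about how the checkpoint field is decoded, not a copy of `pkg/worker/gpu_info.go`.

```go
package main

import (
	"sort"
	"strings"
)

// firstValueOnly (hypothetical) reproduces the reported problem: it returns
// after the first map value, so UUIDs under the other keys are dropped.
func firstValueOnly(deviceIDs map[string][]string) string {
	for _, uuids := range deviceIDs {
		return strings.Join(uuids, ",")
	}
	return ""
}

// flattenDeviceIDs (hypothetical) gathers the UUIDs from every map value
// before joining, which is the fix the review suggests.
func flattenDeviceIDs(deviceIDs map[string][]string) string {
	var all []string
	for _, uuids := range deviceIDs {
		all = append(all, uuids...)
	}
	sort.Strings(all) // Go map iteration order is random; sort for stable output
	return strings.Join(all, ",")
}
```

With two GPUs recorded under different keys, `firstValueOnly` returns only one UUID (and which one can vary between runs), while `flattenDeviceIDs` returns both.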

Comment thread pkg/worker/gpu_info.go
2 participants