Collaborator

@un-def un-def commented Nov 5, 2025

Previously, KubernetesCompute used only the GPU from the first offer to set node affinity. If that GPU type was not available (e.g., another job or even a non-dstack pod had already taken it), the job eventually failed with FAILED_TO_START_DUE_TO_NO_CAPACITY, even if other GPUs matched the run spec requirements.

Now, we inspect all nodes and request all suitable GPUs, so that any one of them can satisfy the job.
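
For illustration, here is a minimal sketch of what an "any of" node affinity could look like when built as a plain dict. It is not the code from this PR: the function name `build_gpu_node_affinity` is hypothetical, and the label key `nvidia.com/gpu.product` assumes that nodes are labeled by NVIDIA GPU Feature Discovery.

```python
from typing import Any

# Assumption: nodes carry the label set by NVIDIA GPU Feature Discovery.
GPU_PRODUCT_LABEL = "nvidia.com/gpu.product"


def build_gpu_node_affinity(gpu_products: list[str]) -> dict[str, Any]:
    """Build a nodeAffinity dict that matches nodes with any of the given GPUs.

    Hypothetical helper for illustration only.
    """
    # Deduplicate and sort so the resulting manifest is deterministic.
    values = sorted(set(gpu_products))
    if not values:
        raise ValueError("at least one GPU product is required")
    return {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
                {
                    "matchExpressions": [
                        # The "In" operator matches a node whose label value
                        # equals any one of the listed values.
                        {
                            "key": GPU_PRODUCT_LABEL,
                            "operator": "In",
                            "values": values,
                        }
                    ]
                }
            ]
        }
    }
```

With a single `In` expression, the scheduler can place the pod on any node whose GPU is in the list, rather than only on nodes with the first offer's GPU.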

In addition, we now use the upper bounds of Ranges (CPU, memory, disk) as limits. GPU is the exception: its request must equal its limit, as GPUs cannot be overcommitted.
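
The mapping from ranges to requests and limits can be sketched as follows. This is an illustrative example under stated assumptions, not the PR's implementation: the `Range` dataclass is a simplified stand-in for the real type, `build_resources` is a hypothetical helper, and the GPU resource name `nvidia.com/gpu` assumes the NVIDIA device plugin.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: GPUs are exposed by the NVIDIA device plugin.
GPU_RESOURCE = "nvidia.com/gpu"


@dataclass
class Range:
    """Simplified stand-in for a min/max range; max=None means unbounded."""

    min: int
    max: Optional[int] = None


def build_resources(
    cpu: Range, memory_gib: Range, disk_gib: Range, gpu_count: int
) -> dict[str, dict[str, str]]:
    """Map ranges to container resource requests and limits (hypothetical helper)."""
    requests: dict[str, str] = {}
    limits: dict[str, str] = {}
    for name, rng, unit in (
        ("cpu", cpu, ""),
        ("memory", memory_gib, "Gi"),
        ("ephemeral-storage", disk_gib, "Gi"),
    ):
        # The lower bound becomes the request.
        requests[name] = f"{rng.min}{unit}"
        # The upper bound, if any, becomes the limit.
        if rng.max is not None:
            limits[name] = f"{rng.max}{unit}"
    if gpu_count > 0:
        # Extended resources such as GPUs cannot be overcommitted,
        # so the request and the limit must be equal.
        requests[GPU_RESOURCE] = str(gpu_count)
        limits[GPU_RESOURCE] = str(gpu_count)
    return {"requests": requests, "limits": limits}
```

For example, `build_resources(Range(2, 8), Range(16, 64), Range(100), gpu_count=1)` requests 2 CPUs with a limit of 8, sets no disk limit because the disk range has no upper bound, and sets the GPU request and limit to the same value.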

Part-of: #3126

@un-def un-def requested review from jvstme and r4victor November 5, 2025 15:05
@un-def un-def merged commit ea555f3 into master Nov 6, 2025
28 checks passed
@un-def un-def deleted the issue_3126_k8s_request_all_gpu_models branch November 6, 2025 07:56