fix(server): pass GPU resource limits to Kubernetes pods#782
Merged
Pangjiping merged 1 commit into alibaba:main on Apr 25, 2026
Conversation
The sandbox schema advertises `resourceLimits.gpu` and the Docker runtime honors it (alibaba#775), but the Kubernetes runtime silently dropped the value: `_build_main_container` in `services/k8s/provider_common.py` passed the raw `resource_limits` dict straight into `V1ResourceRequirements.limits` and `.requests`, so a `gpu` key flowed through unchanged. Kubernetes treats `gpu` as an unknown extended resource — the device plugin never sees the request, and pods schedule with no GPU even when clients ask for one. Both `AgentSandboxProvider` and `BatchSandboxProvider` go through this single chokepoint, so both runtime modes were affected.

Add a small `_translate_resource_limits_for_k8s` helper that reuses the existing `parse_gpu_request` from `services/helpers.py`, translates the portable `gpu` key to the canonical NVIDIA extended-resource name (`nvidia.com/gpu`), and pops the raw `gpu` key so it cannot leak onto the pod as an unknown resource. Hardcoding `nvidia.com/gpu` mirrors the NVIDIA-only scope of alibaba#775 (which uses Docker `DeviceRequest` `capabilities=[["gpu"]]`); other vendor keys (`amd.com/gpu`, `gpu.intel.com/i915`) can be added as a follow-up.

Reject the `"all"` sentinel with HTTP 400 rather than silently dropping it. Docker accepts `"all"` (mapped to an unbounded `DeviceRequest`), but Kubernetes extended resources require an integer count, and silent fallback would mask the misconfiguration. This mirrors PR alibaba#775's "failures remain visible rather than silent" principle.

The Kubernetes Windows profile path already strips the entire resources block in `apply_windows_profile_overrides` (`pop("resources", None)`), so GPU requests on Windows are still suppressed on the outer pod — no separate change is required there.

Closes alibaba#781.
Pangjiping
approved these changes
Apr 25, 2026
Pangjiping (Collaborator) left a comment
Approved.
Clean fix — the single-chokepoint approach in `_build_main_container` correctly covers both providers. Good error signaling (rejecting `"all"` with 400 instead of silently falling back), proper reuse of `parse_gpu_request`, and thorough test coverage across unit and provider levels.
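The behaviors the review calls out — the loud `"all"` rejection and the invalid-value drop — could be exercised with unit tests roughly like these (a sketch against a local `translate()` stand-in, not the PR's actual tests, which go through the provider path with a `create_custom_object` mock):

```python
import pytest


def translate(limits: dict) -> dict:
    """Local stand-in for the PR's _translate_resource_limits_for_k8s."""
    out = dict(limits)
    raw = out.pop("gpu", None)  # always strip the portable key
    if raw == "all":
        raise ValueError('gpu="all" is not supported on Kubernetes')
    if raw is not None and str(raw).isdigit() and int(raw) > 0:
        out["nvidia.com/gpu"] = str(raw)
    return out


def test_rejects_all_sentinel():
    # Docker maps "all" to an unbounded DeviceRequest; K8s needs an int.
    with pytest.raises(ValueError):
        translate({"gpu": "all"})


@pytest.mark.parametrize("bad", ["0", "-1", "bad", ""])
def test_invalid_values_are_dropped(bad):
    # The raw "gpu" key must never leak through as an unknown resource.
    assert translate({"gpu": bad}) == {}


def test_cpu_memory_pass_through():
    assert translate({"cpu": "2", "memory": "4Gi", "gpu": "1"}) == {
        "cpu": "2",
        "memory": "4Gi",
        "nvidia.com/gpu": "1",
    }
```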
Summary
`CreateSandboxRequest.resourceLimits.gpu` is documented in the schema and honored by the Docker runtime (fix(server): pass GPU resource limits to Docker containers #775), but the Kubernetes runtime silently dropped it. `_build_main_container` in `services/k8s/provider_common.py` passed the raw `resource_limits` dict straight into `V1ResourceRequirements.limits` and `.requests`, so a `gpu` key flowed through unchanged. Kubernetes treats `gpu` as an unknown extended resource — the NVIDIA device plugin never sees the request, and pods schedule with no GPU. Both `AgentSandboxProvider` and `BatchSandboxProvider` route through this single chokepoint, so both K8s providers were affected.

Added `_translate_resource_limits_for_k8s` in `provider_common.py` that reuses the existing `parse_gpu_request` from `services/helpers.py`, translates the portable `gpu` key to the canonical `nvidia.com/gpu` extended-resource name, and pops the raw `gpu` key so it cannot leak onto the pod as an unknown resource. Hardcoding `nvidia.com/gpu` mirrors the NVIDIA-only scope of #775 (Docker `DeviceRequest` with `capabilities=[["gpu"]]`); other vendor keys (`amd.com/gpu`, `gpu.intel.com/i915`) can be a follow-up.

Rejected the `"all"` sentinel with HTTP 400 rather than silently dropping it. Docker accepts `"all"` (unbounded `DeviceRequest`), but Kubernetes extended resources require an integer count, and silent fallback would mask the misconfiguration. Mirrors #775's "failures remain visible rather than silent" principle.

Scope note: the Kubernetes Windows profile path (`apply_windows_profile_overrides`) already strips the entire resources block via `pop("resources", None)`, so GPU on Windows is still implicitly suppressed on the outer pod — no separate change is needed there.

Failure mode: on clusters without the NVIDIA device plugin, the pod surfaces a clear scheduling failure (it stays `Pending` with `Insufficient nvidia.com/gpu`), so GPU request failures remain visible rather than silent.

Testing
Focused checks from `server/AGENTS.md` — result: 141 passed, ruff clean; `uv run pyright` on the changed file: 0 errors / 0 warnings.

Broader validation per `server/AGENTS.md`: `uv run pytest` (full server suite) — 787 passed.

Added tests:
- `tests/k8s/test_provider_common.py` (new) — 7 unit tests covering positive-int translation, the raw-gpu-key strip regression, cpu/memory passthrough, the no-gpu no-mutation guard, `"all"` rejection, parametrized invalid-value drop (`"0"`, `"-1"`, `"bad"`, `""`), and the empty-dict path.
- `tests/k8s/test_agent_sandbox_provider.py` — `test_create_workload_translates_gpu_to_nvidia_extended_resource`, `test_create_workload_without_gpu_omits_nvidia_extended_resource`, `test_create_workload_rejects_gpu_all_sentinel` (full provider path through a `create_custom_object` mock).
- `tests/k8s/test_batchsandbox_provider.py` — the same three tests on the BatchSandbox path.

Integration / e2e note: `tests/k8s/` are mock-based provider tests (no live cluster), matching the convention used in #775. End-to-end verification of the actual `nvidia.com/gpu` extended-resource scheduling requires a Kubernetes cluster with the NVIDIA device plugin and a GPU-capable node, which I do not have available locally. Happy to run a real-cluster verification if a maintainer can point me at a CI lane or fixture.

Breaking Changes
Backward compatible — `resourceLimits.gpu` was previously a silently-ignored key on the Kubernetes runtime. Clients that passed it now get it honored; clients that didn't pass it observe no change. Clients that previously passed `"all"` (which was already silently dropped on K8s) now receive a clear 400 instead of an unobservable failure.

Checklist
The schema already documents `resourceLimits.gpu`; no doc changes needed.