fix(server): pass GPU resource limits to Kubernetes pods#782

Merged
Pangjiping merged 1 commit into alibaba:main from
bytkim:fix/server/k8s-gpu-passthrough
Apr 25, 2026

Conversation

Contributor

@bytkim bytkim commented Apr 25, 2026

Summary

  • CreateSandboxRequest.resourceLimits.gpu is documented in the schema and honored by the Docker runtime (#775), but the Kubernetes runtime silently dropped it. _build_main_container in services/k8s/provider_common.py passed the raw resource_limits dict straight into V1ResourceRequirements.limits and .requests, so a gpu key flowed through unchanged. Kubernetes treats gpu as an unknown extended resource — the NVIDIA device plugin never sees the request, and pods schedule with no GPU. Both AgentSandboxProvider and BatchSandboxProvider route through this single chokepoint, so both K8s providers were affected.
  • Adds _translate_resource_limits_for_k8s in provider_common.py that reuses the existing parse_gpu_request from services/helpers.py, translates the portable gpu key to the canonical nvidia.com/gpu extended-resource name, and pops the raw gpu key so it cannot leak onto the pod as an unknown resource. Hardcoding nvidia.com/gpu mirrors the NVIDIA-only scope of #775 (Docker DeviceRequest capabilities=[["gpu"]]); other vendor keys (amd.com/gpu, gpu.intel.com/i915) can be a follow-up.
  • Rejects the "all" sentinel with HTTP 400 rather than silently dropping it. Docker accepts "all" (an unbounded DeviceRequest), but Kubernetes extended resources require an integer count, and a silent fallback would mask the misconfiguration. Mirrors #775's "failures remain visible rather than silent" principle.
  • Closes #781 ([BUG] Kubernetes runtime silently drops resourceLimits.gpu).
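For readers skimming the PR, the translation step can be sketched roughly as follows. This is a minimal stand-alone illustration, not the actual implementation: the real helper reuses parse_gpu_request from services/helpers.py and surfaces rejection as an HTTP 400 through the server's error path, which is stubbed here with a plain ValueError.

```python
NVIDIA_GPU_RESOURCE = "nvidia.com/gpu"

def translate_resource_limits_for_k8s(resource_limits: dict) -> dict:
    """Map the portable 'gpu' key to the nvidia.com/gpu extended resource.

    Sketch only: the real helper delegates GPU parsing to parse_gpu_request
    and raises an HTTP 400 instead of ValueError.
    """
    limits = dict(resource_limits)   # copy: never mutate the caller's dict
    gpu = limits.pop("gpu", None)    # always strip the raw key so it cannot
    if gpu is None:                  # leak onto the pod as an unknown resource
        return limits
    if gpu == "all":
        # Docker maps "all" to an unbounded DeviceRequest, but Kubernetes
        # extended resources require an integer count -- reject loudly.
        raise ValueError('resourceLimits.gpu="all" is not supported on Kubernetes')
    try:
        count = int(gpu)
    except (TypeError, ValueError):
        count = 0
    if count > 0:                    # invalid values ("0", "-1", "bad", "") are dropped
        limits[NVIDIA_GPU_RESOURCE] = str(count)
    return limits
```

With this shape, `{"cpu": "2", "gpu": "1"}` becomes `{"cpu": "2", "nvidia.com/gpu": "1"}`, ready to feed into V1ResourceRequirements.limits and .requests, while cpu/memory keys pass through untouched.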

Scope note: The Kubernetes Windows profile path (apply_windows_profile_overrides) already strips the entire resources block via pop("resources", None), so GPU on Windows is still implicitly suppressed on the outer pod — no separate change is needed there.

Failure mode: On clusters without the NVIDIA device plugin, the pod will surface a clear scheduling failure (Pod stays in Pending with Insufficient nvidia.com/gpu), so GPU request failures remain visible rather than silent.

Testing

  • Not run (explain why)
  • Unit tests
  • Integration tests (see note below)
  • e2e / manual verification

Focused checks from server/AGENTS.md:

uv run ruff check
uv run pytest tests/k8s/test_provider_common.py tests/k8s/test_agent_sandbox_provider.py tests/k8s/test_batchsandbox_provider.py

Result: 141 passed; ruff clean; uv run pyright on the changed file: 0 errors, 0 warnings.

Broader validation per server/AGENTS.md: uv run pytest (full server suite) — 787 passed.

Added tests:

  • tests/k8s/test_provider_common.py (new) — 7 unit tests covering positive int translation, raw-gpu-key strip regression, cpu/memory passthrough, no-gpu no-mutation guard, "all" rejection, parametrized invalid-value drop ("0", "-1", "bad", ""), and the empty-dict path.
  • tests/k8s/test_agent_sandbox_provider.py — test_create_workload_translates_gpu_to_nvidia_extended_resource, test_create_workload_without_gpu_omits_nvidia_extended_resource, test_create_workload_rejects_gpu_all_sentinel (full provider path through a create_custom_object mock).
  • tests/k8s/test_batchsandbox_provider.py — same three tests on the BatchSandbox path.
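For illustration, the unit-level checks described above might look roughly like this. Names are hypothetical, and the `_translate` stand-in at the top only mimics the behavior under test so the snippet runs stand-alone (the real suite exercises the actual helper and uses pytest parametrization).

```python
def _translate(limits: dict) -> dict:
    # Stand-in for _translate_resource_limits_for_k8s, inlined so this
    # snippet is self-contained; mirrors the behavior described above.
    out = dict(limits)
    gpu = out.pop("gpu", None)
    if gpu == "all":
        raise ValueError("gpu='all' is not supported on Kubernetes")
    if gpu is not None and str(gpu).isdigit() and int(gpu) > 0:
        out["nvidia.com/gpu"] = str(gpu)
    return out

def test_translates_gpu_to_nvidia_extended_resource():
    out = _translate({"cpu": "2", "gpu": "1"})
    assert out["nvidia.com/gpu"] == "1"
    assert "gpu" not in out      # regression guard: raw key must not leak

def test_invalid_gpu_values_are_dropped():
    for bad in ("0", "-1", "bad", ""):   # parametrized in the real suite
        assert "nvidia.com/gpu" not in _translate({"gpu": bad})

def test_all_sentinel_is_rejected():
    try:
        _translate({"gpu": "all"})
    except ValueError:
        return
    raise AssertionError("expected ValueError for gpu='all'")
```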

Integration / e2e note: tests/k8s/ are mock-based provider tests (no live cluster), matching the convention used in #775. End-to-end verification of the actual nvidia.com/gpu extended-resource scheduling requires a Kubernetes cluster with the NVIDIA device plugin and a GPU-capable node, which I do not have available locally. Happy to run a real-cluster verification if a maintainer can point me at a CI lane or fixture.

Breaking Changes

  • None
  • Yes (describe impact and migration path)

Backward compatible — resourceLimits.gpu was previously a silently-ignored key on the Kubernetes runtime. Clients that passed it now get it honored; clients that didn't pass it observe no change. Clients that previously passed "all" (which was already silently dropped on K8s) now receive a clear 400 instead of an unobservable failure.

Checklist

  • Linked Issue or clearly described motivation — [BUG] Kubernetes runtime silently drops resourceLimits.gpu #781
  • Added/updated docs (if needed) — schema example and SDK docstring already documented gpu; no doc changes needed.
  • Added/updated tests (if needed) — helper unit tests + positive integration + regression guard on both K8s providers (mirrors #775's testing bar).
  • Security impact considered — GPU passthrough on K8s is opt-in (only triggered when a caller requests it), respects all existing security gates (security context, RBAC, namespace policies), and does not weaken the default posture. The translator strips unknown keys, so there is no path for arbitrary extended-resource injection through resourceLimits.
  • Backward compatibility considered

The sandbox schema advertises `resourceLimits.gpu` and the Docker runtime
honors it (alibaba#775), but the Kubernetes runtime silently dropped the value:
`_build_main_container` in `services/k8s/provider_common.py` passed the
raw `resource_limits` dict straight into `V1ResourceRequirements.limits`
and `.requests`, so a `gpu` key flowed through unchanged. Kubernetes
treats `gpu` as an unknown extended resource — the device plugin never
sees the request, and pods schedule with no GPU even when clients ask
for one. Both `AgentSandboxProvider` and `BatchSandboxProvider` go
through this single chokepoint, so both runtime modes were affected.

Add a small `_translate_resource_limits_for_k8s` helper that reuses the
existing `parse_gpu_request` from `services/helpers.py`, translates the
portable `gpu` key to the canonical NVIDIA extended-resource name
(`nvidia.com/gpu`), and pops the raw `gpu` key so it cannot leak onto
the pod as an unknown resource. Hardcoding `nvidia.com/gpu` mirrors the
NVIDIA-only scope of alibaba#775 (which uses Docker `DeviceRequest`
`capabilities=[["gpu"]]`); other vendor keys (`amd.com/gpu`,
`gpu.intel.com/i915`) can be added as a follow-up.

Reject the `"all"` sentinel with HTTP 400 rather than silently dropping
it. Docker accepts `"all"` (mapped to an unbounded `DeviceRequest`),
but Kubernetes extended resources require an integer count, and silent
fallback would mask the misconfiguration. This mirrors PR alibaba#775's
"failures remain visible rather than silent" principle.

The Kubernetes Windows profile path already strips the entire resources
block in `apply_windows_profile_overrides` (`pop("resources", None)`),
so GPU requests on Windows are still suppressed on the outer pod —
no separate change is required there.

Closes alibaba#781.
@Pangjiping Pangjiping added bug Something isn't working component/server labels Apr 25, 2026
@Pangjiping Pangjiping self-assigned this Apr 25, 2026
Collaborator

@Pangjiping Pangjiping left a comment


Approved.

Clean fix — the single-chokepoint approach in _build_main_container correctly covers both providers. Good error signaling (rejecting "all" with 400 instead of silently falling back), proper reuse of parse_gpu_request, and thorough test coverage across unit and provider levels.

@Pangjiping Pangjiping merged commit c315d6a into alibaba:main Apr 25, 2026
15 of 16 checks passed