fix(server): pass GPU resource limits to Kubernetes pods#782

Merged
Pangjiping merged 1 commit into alibaba:main from
bytkim:fix/server/k8s-gpu-passthrough
Apr 25, 2026

Conversation

Contributor

@bytkim bytkim commented Apr 25, 2026

Summary

  • CreateSandboxRequest.resourceLimits.gpu is documented in the schema and honored by the Docker runtime (#775), but the Kubernetes runtime silently dropped it. _build_main_container in services/k8s/provider_common.py passed the raw resource_limits dict straight into V1ResourceRequirements.limits and .requests, so a gpu key flowed through unchanged. Kubernetes treats gpu as an unknown extended resource — the NVIDIA device plugin never sees the request, and pods schedule with no GPU. Both AgentSandboxProvider and BatchSandboxProvider route through this single chokepoint, so both K8s providers were affected.
  • Adds _translate_resource_limits_for_k8s in provider_common.py that reuses the existing parse_gpu_request from services/helpers.py, translates the portable gpu key to the canonical nvidia.com/gpu extended-resource name, and pops the raw gpu key so it cannot leak onto the pod as an unknown resource. Hardcoding nvidia.com/gpu mirrors the NVIDIA-only scope of #775 (Docker DeviceRequest capabilities=[["gpu"]]); other vendor keys (amd.com/gpu, gpu.intel.com/i915) can be a follow-up.
  • Rejects the "all" sentinel with HTTP 400 rather than silently dropping it. Docker accepts "all" (an unbounded DeviceRequest), but Kubernetes extended resources require an integer count, and a silent fallback would mask the misconfiguration. Mirrors #775's "failures remain visible rather than silent" principle.
  • Closes #781 ([BUG] Kubernetes runtime silently drops resourceLimits.gpu).
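For readers skimming the PR, the translation step can be sketched roughly as follows. This is a minimal stand-alone illustration, not the actual implementation: the real helper reuses parse_gpu_request from services/helpers.py and surfaces rejection as an HTTP 400 through the server's error path, which is stubbed here with a plain ValueError.

```python
NVIDIA_GPU_RESOURCE = "nvidia.com/gpu"

def translate_resource_limits_for_k8s(resource_limits: dict) -> dict:
    """Map the portable 'gpu' key to the nvidia.com/gpu extended resource.

    Sketch only: the real helper delegates GPU parsing to parse_gpu_request
    and raises an HTTP 400 instead of ValueError.
    """
    limits = dict(resource_limits)   # copy: never mutate the caller's dict
    gpu = limits.pop("gpu", None)    # always strip the raw key so it cannot
    if gpu is None:                  # leak onto the pod as an unknown resource
        return limits
    if gpu == "all":
        # Docker maps "all" to an unbounded DeviceRequest, but Kubernetes
        # extended resources require an integer count -- reject loudly.
        raise ValueError('resourceLimits.gpu="all" is not supported on Kubernetes')
    try:
        count = int(gpu)
    except (TypeError, ValueError):
        count = 0
    if count > 0:                    # invalid values ("0", "-1", "bad", "") are dropped
        limits[NVIDIA_GPU_RESOURCE] = str(count)
    return limits
```

With this shape, `{"cpu": "2", "gpu": "1"}` becomes `{"cpu": "2", "nvidia.com/gpu": "1"}`, ready to feed into V1ResourceRequirements.limits and .requests, while cpu/memory keys pass through untouched.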

Scope note: The Kubernetes Windows profile path (apply_windows_profile_overrides) already strips the entire resources block via pop("resources", None), so GPU on Windows is still implicitly suppressed on the outer pod — no separate change is needed there.

Failure mode: On clusters without the NVIDIA device plugin, the pod will surface a clear scheduling failure (Pod stays in Pending with Insufficient nvidia.com/gpu), so GPU request failures remain visible rather than silent.

Testing

  • Not run (explain why)
  • Unit tests
  • Integration tests (see note below)
  • e2e / manual verification

Focused checks from server/AGENTS.md:

uv run ruff check
uv run pytest tests/k8s/test_provider_common.py tests/k8s/test_agent_sandbox_provider.py tests/k8s/test_batchsandbox_provider.py

Result: 141 passed; ruff clean; uv run pyright on the changed file: 0 errors, 0 warnings.

Broader validation per server/AGENTS.md: uv run pytest (full server suite) — 787 passed.

Added tests:

  • tests/k8s/test_provider_common.py (new) — 7 unit tests covering positive int translation, raw-gpu-key strip regression, cpu/memory passthrough, no-gpu no-mutation guard, "all" rejection, parametrized invalid-value drop ("0", "-1", "bad", ""), and the empty-dict path.
  • tests/k8s/test_agent_sandbox_provider.py — test_create_workload_translates_gpu_to_nvidia_extended_resource, test_create_workload_without_gpu_omits_nvidia_extended_resource, test_create_workload_rejects_gpu_all_sentinel (full provider path through a create_custom_object mock).
  • tests/k8s/test_batchsandbox_provider.py — same three tests on the BatchSandbox path.
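For illustration, the unit-level checks described above might look roughly like this. Names are hypothetical, and the `_translate` stand-in at the top only mimics the behavior under test so the snippet runs stand-alone (the real suite exercises the actual helper and uses pytest parametrization).

```python
def _translate(limits: dict) -> dict:
    # Stand-in for _translate_resource_limits_for_k8s, inlined so this
    # snippet is self-contained; mirrors the behavior described above.
    out = dict(limits)
    gpu = out.pop("gpu", None)
    if gpu == "all":
        raise ValueError("gpu='all' is not supported on Kubernetes")
    if gpu is not None and str(gpu).isdigit() and int(gpu) > 0:
        out["nvidia.com/gpu"] = str(gpu)
    return out

def test_translates_gpu_to_nvidia_extended_resource():
    out = _translate({"cpu": "2", "gpu": "1"})
    assert out["nvidia.com/gpu"] == "1"
    assert "gpu" not in out      # regression guard: raw key must not leak

def test_invalid_gpu_values_are_dropped():
    for bad in ("0", "-1", "bad", ""):   # parametrized in the real suite
        assert "nvidia.com/gpu" not in _translate({"gpu": bad})

def test_all_sentinel_is_rejected():
    try:
        _translate({"gpu": "all"})
    except ValueError:
        return
    raise AssertionError("expected ValueError for gpu='all'")
```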

Integration / e2e note: tests/k8s/ are mock-based provider tests (no live cluster), matching the convention used in #775. End-to-end verification of the actual nvidia.com/gpu extended-resource scheduling requires a Kubernetes cluster with the NVIDIA device plugin and a GPU-capable node, which I do not have available locally. Happy to run a real-cluster verification if a maintainer can point me at a CI lane or fixture.

Breaking Changes

  • None
  • Yes (describe impact and migration path)

Backward compatible — resourceLimits.gpu was previously a silently-ignored key on the Kubernetes runtime. Clients that passed it now get it honored; clients that didn't pass it observe no change. Clients that previously passed "all" (which was already silently dropped on K8s) now receive a clear 400 instead of an unobservable failure.

Checklist

  • Linked Issue or clearly described motivation — [BUG] Kubernetes runtime silently drops resourceLimits.gpu #781
  • Added/updated docs (if needed) — schema example and SDK docstring already documented gpu; no doc changes needed.
  • Added/updated tests (if needed) — helper unit tests + positive integration + regression guard on both K8s providers (mirrors #775's testing bar).
  • Security impact considered — GPU passthrough on K8s is opt-in (only triggered when a caller requests it), respects all existing security gates (security context, RBAC, namespace policies), and does not weaken the default posture. The translator strips unknown keys, so there is no path for arbitrary extended-resource injection through resourceLimits.
  • Backward compatibility considered

The sandbox schema advertises `resourceLimits.gpu` and the Docker runtime
honors it (alibaba#775), but the Kubernetes runtime silently dropped the value:
`_build_main_container` in `services/k8s/provider_common.py` passed the
raw `resource_limits` dict straight into `V1ResourceRequirements.limits`
and `.requests`, so a `gpu` key flowed through unchanged. Kubernetes
treats `gpu` as an unknown extended resource — the device plugin never
sees the request, and pods schedule with no GPU even when clients ask
for one. Both `AgentSandboxProvider` and `BatchSandboxProvider` go
through this single chokepoint, so both runtime modes were affected.

Add a small `_translate_resource_limits_for_k8s` helper that reuses the
existing `parse_gpu_request` from `services/helpers.py`, translates the
portable `gpu` key to the canonical NVIDIA extended-resource name
(`nvidia.com/gpu`), and pops the raw `gpu` key so it cannot leak onto
the pod as an unknown resource. Hardcoding `nvidia.com/gpu` mirrors the
NVIDIA-only scope of alibaba#775 (which uses Docker `DeviceRequest`
`capabilities=[["gpu"]]`); other vendor keys (`amd.com/gpu`,
`gpu.intel.com/i915`) can be added as a follow-up.

Reject the `"all"` sentinel with HTTP 400 rather than silently dropping
it. Docker accepts `"all"` (mapped to an unbounded `DeviceRequest`),
but Kubernetes extended resources require an integer count, and silent
fallback would mask the misconfiguration. This mirrors PR alibaba#775's
"failures remain visible rather than silent" principle.

The Kubernetes Windows profile path already strips the entire resources
block in `apply_windows_profile_overrides` (`pop("resources", None)`),
so GPU requests on Windows are still suppressed on the outer pod —
no separate change is required there.

Closes alibaba#781.
@Pangjiping Pangjiping added bug Something isn't working component/server labels Apr 25, 2026
@Pangjiping Pangjiping self-assigned this Apr 25, 2026
Collaborator

@Pangjiping Pangjiping left a comment


Approved.

Clean fix — the single-chokepoint approach in _build_main_container correctly covers both providers. Good error signaling (rejecting "all" with 400 instead of silently falling back), proper reuse of parse_gpu_request, and thorough test coverage across unit and provider levels.

@Pangjiping Pangjiping merged commit c315d6a into alibaba:main Apr 25, 2026
15 of 16 checks passed