feat(api): expose runtimeClassName on InferenceServiceSpec (closes #375) #380
Merged
Adds a RuntimeClassName field on InferenceServiceSpec that the deployment builder forwards directly to PodSpec.RuntimeClassName. Most commonly set to "nvidia" on clusters where the NVIDIA Container Runtime is not the cluster default. Without it, GPU pods schedule onto the GPU node but never get the device files bind-mounted, and the container fails at runtime with "no CUDA-capable device is detected".

Originally surfaced as a Discord question that we filed as #375 yesterday. The fix is local: one new optional field on the CRD, one new line in the PodSpec construction, plus a unit test that asserts both the set and unset paths.

Helm chart CRD synced via make chart-crds. RBAC unchanged. The field is optional and nil-safe; existing clusters see no behavior change.

Signed-off-by: Christopher Maher <chris@mahercode.io>
Closes #375.
Summary
Adds `RuntimeClassName *string` to `InferenceServiceSpec`. The deployment builder forwards it to `PodSpec.RuntimeClassName`. Most commonly set to `"nvidia"` on clusters where the NVIDIA Container Runtime is not configured as the cluster default.

Originally surfaced via a Discord question yesterday: a community member's GPU pods were scheduling onto the GPU node but never getting the device files bind-mounted. They needed to set `runtimeClassName: nvidia` on the Pod, and there was no way to plumb that through `InferenceServiceSpec`. The available workarounds were either reconfiguring containerd globally or running a Kyverno mutating webhook against the inference label selector. Both are clunkier than just having the field on the CRD.

Changes

| File | Change |
| --- | --- |
| `api/v1alpha1/inferenceservice_types.go` | `` RuntimeClassName *string `json:"runtimeClassName,omitempty"` `` field with usage docs |
| `api/v1alpha1/zz_generated.deepcopy.go` | `make generate` |
| `config/crd/bases/...` | `make manifests` |
| `charts/llmkube/templates/crds/inferenceservices.yaml` | `make chart-crds` |
| `internal/controller/deployment_builder.go` | `RuntimeClassName: isvc.Spec.RuntimeClassName` in PodSpec construction |
| `internal/controller/inferenceservice_deployment_test.go` | Set path asserts `*PodSpec.RuntimeClassName == "nvidia"`, unset path asserts `PodSpec.RuntimeClassName == nil` |

Net: +125 lines, no deletions. Field is optional and nil-safe; existing clusters see no behavior change.
Why now
Tagged for the 0.7.6 release window (the same window that closed #374 via PR #376). It's a small, isolated, user-visible feature that closes a real Discord-reported gap. Doesn't overlap with any other open PR (#340 lives in `runtime_*.go`; this lives in `deployment_builder.go`).

Out of scope (deferred to follow-ups if needed)

- `runtimeClass` example in the README; happy to add as a one-paragraph follow-up if the team wants it

Test plan

- `make manifests generate fmt vet` clean
- `make chart-crds` synced (no drift)
- `go test ./internal/controller/... ./pkg/agent/...` passes (new tests included)
- `golangci-lint v2.4.0` reports 0 issues
- With `runtimeClassName: nvidia` set, confirm `kubectl get pod -o yaml` shows the expected `runtimeClassName` value

Related