fix(controller): recreate InferenceService Deployment on immutable selector change (#606) #607

Merged
Defilan merged 1 commit into defilantech:main from Defilan:fix/inferenceservice-selector-immutable-recreate
Jun 2, 2026

Defilan commented Jun 2, 2026

Important

Upgrade impact (not an API break, but plan for it). On the first reconcile
after upgrading, each InferenceService Deployment created by a pre-0.8.0
operator is deleted and recreated to migrate its immutable selector. That
is a one-time pod restart + model reload for affected services (minutes of
unavailability for large models, once). 0.8.x-created Deployments are
unaffected.

What

Migrates InferenceService Deployments whose pod selector predates the current operator's selector label set, by recreating them instead of hot-looping on an immutable in-place update.

Why

Fixes #606

A Deployment created by an older operator carries a smaller selector than 0.8.x generates: pre-0.8 operators set {app: <name>}, while 0.8.x also sets inference.llmkube.dev/service: <name>. Deployment.spec.selector is immutable, so the wholesale Spec replace + Update in reconcileDeployment failed on every attempt with "field is immutable". The controller hot-looped and could not reconcile any pre-0.8 InferenceService Deployment. Observed live after upgrading a cluster to 0.8.1 (~14 errors/min across every pre-0.8 service).
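To make the mismatch concrete, here is a minimal sketch of the two label sets and the equality check that tells them apart. The label keys come from the description above; the function names (legacySelector, currentSelector, selectorsEqual) and the service name "llama" are hypothetical and are not taken from the operator's code.

```go
package main

import "fmt"

// legacySelector mirrors the pre-0.8 selector described above: {app: <name>}.
// Hypothetical helper, for illustration only.
func legacySelector(name string) map[string]string {
	return map[string]string{"app": name}
}

// currentSelector mirrors the 0.8.x selector: the legacy label plus
// inference.llmkube.dev/service. Hypothetical helper, for illustration only.
func currentSelector(name string) map[string]string {
	return map[string]string{
		"app":                           name,
		"inference.llmkube.dev/service": name,
	}
}

// selectorsEqual reports whether two label sets hold exactly the same
// key/value pairs. A nil map and an empty map compare as equal.
func selectorsEqual(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

func main() {
	old, want := legacySelector("llama"), currentSelector("llama")
	// The legacy selector is a strict subset of the current one, so an
	// in-place update would have to change the immutable field.
	fmt.Println("needs migration:", !selectorsEqual(old, want)) // prints "needs migration: true"
}
```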

How

In reconcileDeployment, after fetching the existing Deployment, compare its selector against the desired one. On a mismatch, which cannot be fixed in place because the field is immutable, delete the Deployment and recreate it with the correct selector; if the old one is still terminating, requeue. The recreate happens once per affected Deployment and is logged; the brief pod churn is unavoidable for an immutable-selector change. Service selectors are mutable and unaffected.
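The decision described above can be sketched as a small pure function. This is an illustration under stated assumptions, not the controller's actual code: the existingDeployment type, the action values, and decide are hypothetical names, and Terminating stands in for a Deployment whose deletion has started but not finished.

```go
package main

import "fmt"

// action is the hypothetical outcome of comparing selectors.
type action string

const (
	actionUpdate   action = "update in place"
	actionRecreate action = "delete and recreate"
	actionRequeue  action = "requeue until deletion finishes"
)

// existingDeployment is a simplified, hypothetical view of the fetched object.
type existingDeployment struct {
	Selector    map[string]string
	Terminating bool // assumed stand-in for "deletion already in progress"
}

// sameLabels reports whether two label sets hold exactly the same pairs.
func sameLabels(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

// decide picks the next step: matching selectors allow the usual update,
// a mismatch forces a recreate, and a still-terminating object means waiting.
func decide(existing existingDeployment, desired map[string]string) action {
	if sameLabels(existing.Selector, desired) {
		return actionUpdate
	}
	if existing.Terminating {
		return actionRequeue
	}
	return actionRecreate
}

func main() {
	desired := map[string]string{
		"app":                           "llama",
		"inference.llmkube.dev/service": "llama",
	}
	legacy := existingDeployment{Selector: map[string]string{"app": "llama"}}

	fmt.Println(decide(legacy, desired)) // prints "delete and recreate"
	legacy.Terminating = true
	fmt.Println(decide(legacy, desired)) // prints "requeue until deletion finishes"
	fmt.Println(decide(existingDeployment{Selector: desired}, desired)) // prints "update in place"
}
```

Keeping the decision separate from the API calls makes each branch easy to cover in a unit test without an apiserver.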

Testing

New envtest spec (inferenceservice_selector_migration_test.go) seeds a Deployment with the pre-0.8 {app}-only selector and calls reconcileDeployment with the model ready, asserting the selector is migrated to the current label set with no error. Because envtest runs a real apiserver, the immutable-selector rejection is genuine, so the test is red before the fix and green after.

Checklist

  • Tests added/updated (envtest regression, faithful red-before-fix)
  • make test passes locally (full controller suite green)
  • make lint passes locally (native + GOOS=linux)
  • Commit signed off (DCO)
  • No CRD/RBAC/API changes (controller logic only)

…lector change (defilantech#606)

A Deployment created by an older operator can carry a smaller pod selector
than the current operator generates (pre-0.8 used {app: <name>}; 0.8.x also
adds inference.llmkube.dev/service). Deployment.spec.selector is immutable, so
the in-place Update in reconcileDeployment failed permanently with
"field is immutable" and the controller hot-looped, unable to reconcile any
pre-0.8 InferenceService Deployment (observed live on a cluster upgraded to
0.8.1: ~14 errors/min across every pre-0.8 service).

Detect the selector mismatch before the update and migrate by deleting and
recreating the Deployment with the correct selector, requeuing if the old
Deployment is still terminating. One-time, logged recreate.

Test: an envtest spec seeds a Deployment with the pre-0.8 {app}-only selector
and asserts reconcileDeployment migrates it to the current selector without
error. The real apiserver rejects the immutable in-place update, so the test
reproduces the production failure faithfully (red before the fix).

Fixes defilantech#606

Signed-off-by: Christopher Maher <chris@mahercode.io>

codecov Bot commented Jun 2, 2026

Codecov Report

❌ Patch coverage is 42.85714% with 8 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
internal/controller/inferenceservice_controller.go | 42.85% | 6 Missing and 2 partials ⚠️

@Defilan Defilan force-pushed the fix/inferenceservice-selector-immutable-recreate branch from 4dedd17 to 63aeb86 Compare June 2, 2026 06:43
@Defilan Defilan merged commit faf9151 into defilantech:main Jun 2, 2026
45 checks passed
@github-actions github-actions Bot mentioned this pull request Jun 2, 2026

Development

Successfully merging this pull request may close these issues.

[BUG] InferenceService controller hot-loops "spec.selector is immutable" instead of recreating Deployments from an older operator
