Skip to content

fix(controller): stop hot-spinning on unreachable file:// model sources (closes #405)#412

Merged
Defilan merged 2 commits intodefilantech:mainfrom
Defilan:fix/model-controller-hotspin
May 7, 2026
Merged

fix(controller): stop hot-spinning on unreachable file:// model sources (closes #405)#412
Defilan merged 2 commits intodefilantech:mainfrom
Defilan:fix/model-controller-hotspin

Conversation

@Defilan
Copy link
Copy Markdown
Member

@Defilan Defilan commented May 7, 2026

Summary

Closes #405. When a Model's spec.source is a file:// URL pointing at a path that exists on the metal-agent host but not inside the controller pod (the canonical Mac kind / k3s + out-of-cluster agent topology), the reconciler entered a tight loop: every iteration returned an error, controller-runtime's rate-limited workqueue requeued at ~5ms initial backoff ramping into hundreds of reconciles per second, and a Mac kind cluster pinned a CPU core at ~295% for 35 hours before anyone noticed.

Root cause: returning a non-nil error to controller-runtime invokes the rate limiter, which then dominates the schedule and eats the RequeueAfter interval that was supposed to space retries.

Fix

New isUnrecoverableFetchError helper in source.go detects fs.ErrNotExist and fs.ErrPermission (the canonical "this won't heal without operator intervention" errors). When the fetch fails with one of these, the reconciler:

  • still marks the Model Failed with a clear Degraded condition
  • still emits the failure metric
  • but returns ctrl.Result{RequeueAfter: 5*time.Minute}, nil (note the nil) so controller-runtime treats this as a "successful, recheck later" reconcile rather than a transient error

Steady-state cost: one reconcile every 5 minutes until the operator fixes the spec. Recoverable errors (network timeouts, transient apiserver issues) keep the original err+RequeueAfter shape so the rate limiter still provides backoff for genuinely transient problems.

Testing

Unit (6 new cases on isUnrecoverableFetchError):

  • nil error → false
  • fs.ErrNotExist directly → true
  • fs.ErrPermission directly → true
  • The exact fmt.Errorf("failed to open local model file: %w", os.Open-ENOENT) wrap shape from copyLocalModel → true (the Model controller hot-spins on file:// sources unreachable from controller pod #405 reproduction; load-bearing assertion)
  • Generic non-fs error → false
  • Wrapped non-fs error → false
  • Double-wrapped fs.ErrNotExist → true (errors.Is walks the wrap chain)

Reconcile-level (2 new + 2 updated):

  • New: "no rate-limited tight retry" test asserts err == nil, RequeueAfter == 5m, Requeue == false, and Failed + Degraded/CopyFailed condition for a missing file:// source. The nil err assertion is the load-bearing regression guard.
  • New: "repeated reconciles stay on fixed interval" loops three times against the same broken source and asserts each iteration returns the same shape (this catches future changes that take a different code path on subsequent reconciles).
  • Updated: should set Failed with CopyFailed for nonexistent local file flipped from asserting err.HaveOccurred() to asserting err == nil (the new behavior).
  • Updated: should not leave partial file at final path on fetch failure same flip; the atomic-rename guarantee is now checked against status conditions instead of err return.

Documentation

Added a "file:// caveat for hybrid topologies" paragraph to the Model.spec.source GoDoc explaining the controller-pod-readability constraint, what happens when the path is unreachable (Failed + 5min backoff, not hot-spin), and the two workarounds (pre-stage on pvc:// or use the equivalent https://huggingface.co/.../<file>.gguf URL). make manifests regenerated the Model CRD description.

Compatibility

No CRD field changes. Behavior change is limited to the error-return shape on unreachable sources: pre-fix, controller-runtime saw err != nil and tight-retried; post-fix, controller-runtime sees err == nil and respects the 5m RequeueAfter. Operators won't notice unless they were debugging the hot-spin (in which case CPU drops from 200-300% to flat).

Closes #405

…es (closes defilantech#405)

When a Model's spec.source is a file:// URL pointing at a path that exists on
the metal-agent host but not inside the controller pod (the canonical Mac
kind / k3s + out-of-cluster agent topology), the reconciler entered a tight
loop: every iteration returned an error, controller-runtime's rate-limited
workqueue requeued at ~5ms initial backoff ramping into hundreds of
reconciles per second, and a Mac kind cluster pinned a CPU core at ~295%
for 35 hours in the wild before anyone noticed.

The root cause was the err+RequeueAfter return shape: even with a 5-minute
RequeueAfter set, returning a non-nil error invokes the rate limiter, which
then dominates the schedule and eats the requeue interval.

This change adds an `isUnrecoverableFetchError` helper in source.go that
detects fs.ErrNotExist and fs.ErrPermission (the canonical "this won't heal
without operator intervention" errors). When the fetch fails with one of
these, the reconciler:

- still marks the Model Failed with a clear Degraded condition
- still emits the failure metric
- but returns `ctrl.Result{RequeueAfter: 5*time.Minute}, nil` (note the
  nil) so controller-runtime treats this as a "successful, recheck later"
  reconcile rather than a transient error. Steady-state cost: one
  reconcile every 5 minutes until the operator fixes the spec.

Recoverable errors (network timeouts, transient apiserver issues, etc.)
keep the original `err+RequeueAfter` shape so the rate limiter still
provides backoff for genuinely transient problems.

Tests
-----

- Six new unit cases for `isUnrecoverableFetchError` covering nil, the
  exact `fmt.Errorf` wrap shape that copyLocalModel produces (the defilantech#405
  reproduction), generic non-fs errors, and double-wrapped chains.
- New "no rate-limited tight retry" reconcile test that asserts err == nil,
  RequeueAfter == 5m, and Failed/Degraded condition for a missing file://
  source. The nil-err assertion is the load-bearing one: any future change
  that regresses the hot-spin guard fails this test.
- New "repeated reconciles stay on fixed interval" test that loops three
  times against the same broken source and asserts each iteration returns
  the same nil-err + 5m + no-immediate-requeue result.
- Updated two existing tests that previously asserted err.HaveOccurred() on
  this path; they now assert the post-fix nil-err shape and check the
  Failed condition instead.

Documentation
-------------

- Added a "file:// caveat for hybrid topologies" paragraph to the
  Model.spec.source GoDoc explaining the controller-pod-readability
  constraint, what happens when the path is unreachable (Failed +
  5min backoff, not hot-spin), and the two workarounds (pre-stage on
  pvc:// or use the equivalent https://huggingface.co/.../<file>.gguf URL).
- `make manifests` regenerated the Model CRD with the updated description.

Closes defilantech#405

Signed-off-by: Christopher Maher <chris@mahercode.io>
@Defilan Defilan added the bug Something isn't working label May 7, 2026
@Defilan Defilan self-assigned this May 7, 2026
@Defilan Defilan added component/controller Related to the operator controller priority/high High priority size/small Small effort (< 1 day) labels May 7, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Two CI fixes for defilantech#412:

1. Lint: golangci-lint via staticcheck flagged `result.Requeue` as
   deprecated (SA1019) in newer controller-runtime. Removed the two
   `Expect(result.Requeue).To(BeFalse())` assertions; the
   `Expect(result.RequeueAfter).To(Equal(5*time.Minute))` assertion is
   sufficient — RequeueAfter being set to a positive duration means
   controller-runtime schedules the next reconcile at that time, which
   is what the test was actually trying to assert.

2. CRD Sync Check: the Model CRD doc update on spec.source landed in
   config/crd/bases/inference.llmkube.dev_models.yaml via `make
   manifests` but the chart-bundled copy at
   charts/llmkube/templates/crds/models.yaml was not regenerated. Ran
   `make chart-crds` to sync.

No behavior change; tests still green locally.

Refs defilantech#405

Signed-off-by: Christopher Maher <chris@mahercode.io>
@Defilan Defilan merged commit 4ac6f57 into defilantech:main May 7, 2026
19 checks passed
@github-actions github-actions Bot mentioned this pull request May 7, 2026
Defilan added a commit to Defilan/LLMKube that referenced this pull request May 9, 2026
…refs P0.8)

Adds docs/operations/runbooks/ with:

- README.md: runbook index, conventions (Trigger / Diagnose / Mitigate /
  Resolve / Verify / Related), tabletop cadence, authoring guide, and
  a planned-runbooks list of the failure modes still to cover.

- controller-hot-spin-on-file-source.md: triage for the defilantech#405 / PR defilantech#412
  scenario where a Model's spec.source is a file:// path unreachable
  from the controller pod. Covers diagnose, immediate mitigations, and
  the structural fix shipped in defilantech#412.

- metal-agent-memory-pressure.md: triage for the watchdog-driven
  Warning / Critical / Evicted condition path on Apple Silicon hosts.
  Maps to the K8s events shipped in defilantech#411 (MemoryPressureLevelChanged,
  Evicted, EvictionSkipped, RespawnBlocked) and the priority/protection
  mechanics from the underlying watchdog (PRs defilantech#382 + defilantech#386 in 0.7.6).

Each runbook follows a consistent six-section shape so on-call engineers
can find and follow them under pressure. Quarterly tabletop cadence
documented in the README to keep them honest.

Refs P0.8 (operational runbooks for Tier-1 oncall material)

Signed-off-by: Christopher Maher <chris@mahercode.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working component/controller Related to the operator controller priority/high High priority size/small Small effort (< 1 day)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Model controller hot-spins on file:// sources unreachable from controller pod

1 participant