fix(controller): stop hot-spinning on unreachable file:// model sources (closes #405)#412
Merged
Defilan merged 2 commits intodefilantech:mainfrom May 7, 2026
Merged
Conversation
…es (closes defilantech#405) When a Model's spec.source is a file:// URL pointing at a path that exists on the metal-agent host but not inside the controller pod (the canonical Mac kind / k3s + out-of-cluster agent topology), the reconciler entered a tight loop: every iteration returned an error, controller-runtime's rate-limited workqueue requeued at ~5ms initial backoff ramping into hundreds of reconciles per second, and a Mac kind cluster pinned a CPU core at ~295% for 35 hours in the wild before anyone noticed. The root cause was the err+RequeueAfter return shape: even with a 5-minute RequeueAfter set, returning a non-nil error invokes the rate limiter, which then dominates the schedule and eats the requeue interval. This change adds an `isUnrecoverableFetchError` helper in source.go that detects fs.ErrNotExist and fs.ErrPermission (the canonical "this won't heal without operator intervention" errors). When the fetch fails with one of these, the reconciler: - still marks the Model Failed with a clear Degraded condition - still emits the failure metric - but returns `ctrl.Result{RequeueAfter: 5*time.Minute}, nil` (note the nil) so controller-runtime treats this as a "successful, recheck later" reconcile rather than a transient error. Steady-state cost: one reconcile every 5 minutes until the operator fixes the spec. Recoverable errors (network timeouts, transient apiserver issues, etc.) keep the original `err+RequeueAfter` shape so the rate limiter still provides backoff for genuinely transient problems. Tests ----- - Six new unit cases for `isUnrecoverableFetchError` covering nil, the exact `fmt.Errorf` wrap shape that copyLocalModel produces (the defilantech#405 reproduction), generic non-fs errors, and double-wrapped chains. - New "no rate-limited tight retry" reconcile test that asserts err == nil, RequeueAfter == 5m, and Failed/Degraded condition for a missing file:// source. The nil-err assertion is the load-bearing one: any future change that regresses the hot-spin guard fails this test. - New "repeated reconciles stay on fixed interval" test that loops three times against the same broken source and asserts each iteration returns the same nil-err + 5m + no-immediate-requeue result. - Updated two existing tests that previously asserted err.HaveOccurred() on this path; they now assert the post-fix nil-err shape and check the Failed condition instead. Documentation ------------- - Added a "file:// caveat for hybrid topologies" paragraph to the Model.spec.source GoDoc explaining the controller-pod-readability constraint, what happens when the path is unreachable (Failed + 5min backoff, not hot-spin), and the two workarounds (pre-stage on pvc:// or use the equivalent https://huggingface.co/.../<file>.gguf URL). - `make manifests` regenerated the Model CRD with the updated description. Closes defilantech#405 Signed-off-by: Christopher Maher <chris@mahercode.io>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Two CI fixes for defilantech#412: 1. Lint: golangci-lint via staticcheck flagged `result.Requeue` as deprecated (SA1019) in newer controller-runtime. Removed the two `Expect(result.Requeue).To(BeFalse())` assertions; the `Expect(result.RequeueAfter).To(Equal(5*time.Minute))` assertion is sufficient — RequeueAfter being set to a positive duration means controller-runtime schedules the next reconcile at that time, which is what the test was actually trying to assert. 2. CRD Sync Check: the Model CRD doc update on spec.source landed in config/crd/bases/inference.llmkube.dev_models.yaml via `make manifests` but the chart-bundled copy at charts/llmkube/templates/crds/models.yaml was not regenerated. Ran `make chart-crds` to sync. No behavior change; tests still green locally. Refs defilantech#405 Signed-off-by: Christopher Maher <chris@mahercode.io>
Merged
Defilan
added a commit
to Defilan/LLMKube
that referenced
this pull request
May 9, 2026
…refs P0.8) Adds docs/operations/runbooks/ with: - README.md: runbook index, conventions (Trigger / Diagnose / Mitigate / Resolve / Verify / Related), tabletop cadence, authoring guide, and a planned-runbooks list of the failure modes still to cover. - controller-hot-spin-on-file-source.md: triage for the defilantech#405 / PR defilantech#412 scenario where a Model's spec.source is a file:// path unreachable from the controller pod. Covers diagnose, immediate mitigations, and the structural fix shipped in defilantech#412. - metal-agent-memory-pressure.md: triage for the watchdog-driven Warning / Critical / Evicted condition path on Apple Silicon hosts. Maps to the K8s events shipped in defilantech#411 (MemoryPressureLevelChanged, Evicted, EvictionSkipped, RespawnBlocked) and the priority/protection mechanics from the underlying watchdog (PRs defilantech#382 + defilantech#386 in 0.7.6). Each runbook follows a consistent six-section shape so on-call engineers can find and follow them under pressure. Quarterly tabletop cadence documented in the README to keep them honest. Refs P0.8 (operational runbooks for Tier-1 oncall material) Signed-off-by: Christopher Maher <chris@mahercode.io>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #405. When a
Model'sspec.sourceis afile://URL pointing at a path that exists on the metal-agent host but not inside the controller pod (the canonical Mac kind / k3s + out-of-cluster agent topology), the reconciler entered a tight loop: every iteration returned an error, controller-runtime's rate-limited workqueue requeued at ~5ms initial backoff ramping into hundreds of reconciles per second, and a Mac kind cluster pinned a CPU core at ~295% for 35 hours before anyone noticed.Root cause: returning a non-nil error to controller-runtime invokes the rate limiter, which then dominates the schedule and eats the
RequeueAfterinterval that was supposed to space retries.Fix
New
isUnrecoverableFetchErrorhelper insource.godetectsfs.ErrNotExistandfs.ErrPermission(the canonical "this won't heal without operator intervention" errors). When the fetch fails with one of these, the reconciler:Failedwith a clearDegradedconditionctrl.Result{RequeueAfter: 5*time.Minute}, nil(note the nil) so controller-runtime treats this as a "successful, recheck later" reconcile rather than a transient errorSteady-state cost: one reconcile every 5 minutes until the operator fixes the spec. Recoverable errors (network timeouts, transient apiserver issues) keep the original
err+RequeueAftershape so the rate limiter still provides backoff for genuinely transient problems.Testing
Unit (6 new cases on
isUnrecoverableFetchError):fs.ErrNotExistdirectly → truefs.ErrPermissiondirectly → truefmt.Errorf("failed to open local model file: %w", os.Open-ENOENT)wrap shape fromcopyLocalModel→ true (the Model controller hot-spins on file:// sources unreachable from controller pod #405 reproduction; load-bearing assertion)fs.ErrNotExist→ true (errors.Is walks the wrap chain)Reconcile-level (2 new + 2 updated):
err == nil,RequeueAfter == 5m,Requeue == false, andFailed+Degraded/CopyFailedcondition for a missingfile://source. Thenilerr assertion is the load-bearing regression guard.should set Failed with CopyFailed for nonexistent local fileflipped from asserting err.HaveOccurred() to assertingerr == nil(the new behavior).should not leave partial file at final path on fetch failuresame flip; the atomic-rename guarantee is now checked against status conditions instead of err return.Documentation
Added a "file:// caveat for hybrid topologies" paragraph to the
Model.spec.sourceGoDoc explaining the controller-pod-readability constraint, what happens when the path is unreachable (Failed + 5min backoff, not hot-spin), and the two workarounds (pre-stage onpvc://or use the equivalenthttps://huggingface.co/.../<file>.ggufURL).make manifestsregenerated the Model CRD description.Compatibility
No CRD field changes. Behavior change is limited to the error-return shape on unreachable sources: pre-fix, controller-runtime saw
err != niland tight-retried; post-fix, controller-runtime seeserr == niland respects the 5mRequeueAfter. Operators won't notice unless they were debugging the hot-spin (in which case CPU drops from 200-300% to flat).Closes #405