fix(hermeneus): graceful degradation when CC provider becomes unavailable (#3158) #3183
Merged
forkwright merged 1 commit into main on Apr 15, 2026
…able (#3158)

When the Claude Code subprocess provider becomes unavailable mid-turn (process crash, binary disappeared, auth expiry), the server now degrades gracefully instead of returning opaque 500 errors.

Three root causes fixed:

1. ProviderInit errors (CC binary spawn failures) were ignored by the health state machine: they fell through to the wildcard arm in record_error(). Now they follow the same Degraded/Down transition path as ApiRequest errors, so the circuit breaker activates after repeated failures.
2. ProviderInit and CC-specific ApiRequest errors were not classified as retryable, so the execute stage's degraded-mode fallback (distillation cache or honest "unavailable" message) never activated. Now is_retryable() returns true for ProviderInit and CC subprocess error messages.
3. ProviderInit, PipelineStage("unavailable"), and ServiceDegraded errors mapped to 500 Internal Server Error via pylon's catch-all. Now they map to 503 Service Unavailable, giving clients a clear signal to retry.

Gate-Passed: kanon 0.1.0
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
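The first root cause is easiest to see in code. The following is a minimal sketch of a health state machine in which `ProviderInit` shares the Degraded/Down path with `ApiRequest` instead of falling into the wildcard arm. It is not the hermeneus implementation: the `LlmError` and `Health` types, the `HealthTracker` struct, the `record_success` method, and the `DOWN_AFTER` threshold are all assumptions made for illustration. Only the variant names `ProviderInit` and `ApiRequest` and the function name `record_error` come from the commit message.

```rust
/// Hypothetical error type; only the variant names `ProviderInit` and
/// `ApiRequest` are taken from the commit message.
#[derive(Debug, Clone, PartialEq)]
pub enum LlmError {
    ProviderInit(String),
    ApiRequest(String),
    Other(String),
}

/// Hypothetical health states for a single provider.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Up,
    Degraded { consecutive_failures: u32 },
    Down,
}

/// Assumed threshold: the provider is marked Down on the third
/// consecutive failure. The real value is not stated in the PR.
pub const DOWN_AFTER: u32 = 3;

pub struct HealthTracker {
    state: Health,
}

impl HealthTracker {
    pub fn new() -> Self {
        Self { state: Health::Up }
    }

    pub fn state(&self) -> Health {
        self.state
    }

    /// Records a provider error. `ProviderInit` is matched explicitly
    /// alongside `ApiRequest`, so it no longer hits the wildcard arm.
    pub fn record_error(&mut self, err: &LlmError) {
        match err {
            LlmError::ProviderInit(_) | LlmError::ApiRequest(_) => {
                self.state = match self.state {
                    Health::Up => Health::Degraded { consecutive_failures: 1 },
                    Health::Degraded { consecutive_failures }
                        if consecutive_failures + 1 >= DOWN_AFTER =>
                    {
                        Health::Down
                    }
                    Health::Degraded { consecutive_failures } => Health::Degraded {
                        consecutive_failures: consecutive_failures + 1,
                    },
                    Health::Down => Health::Down,
                };
            }
            // Errors unrelated to provider availability leave health untouched.
            _ => {}
        }
    }

    /// A successful call resets the tracker (an assumed recovery policy).
    pub fn record_success(&mut self) {
        self.state = Health::Up;
    }

    /// The circuit breaker is considered open once the provider is Down.
    pub fn circuit_open(&self) -> bool {
        self.state == Health::Down
    }
}
```

Listing the variants explicitly, rather than relying on `_ => {}`, means a newly added availability-related error has to be consciously routed to one arm or the other.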
forkwright pushed a commit that referenced this pull request on Apr 15, 2026
🤖 I have created a release *beep* *boop*

---

## [0.18.0](v0.17.0...v0.18.0) (2026-04-15)

### Features

* **aletheia:** add benchmark CLI for LongMemEval and LoCoMo ([#3195](#3195)) ([3374a10](3374a10))
* **eidos:** add schema_version to TrainingRecord ([#3186](#3186)) ([dc3e0f2](dc3e0f2))
* **episteme,eidos,melete:** parameterize knowledge constants via taxis config ([#2306](#2306)) ([#3132](#3132)) ([22564b4](22564b4))
* **nous,taxis:** self-tuning feedback loop ([#2306](#2306) wave 6) ([#3137](#3137)) ([c1e8b70](c1e8b70))
* **organon,agora,dianoia:** parameterize tool and planning constants via taxis config ([#2306](#2306)) ([#3133](#3133)) ([96d71b3](96d71b3))
* **organon,agora:** wire tool and channel constants to taxis config reads ([#2306](#2306)) ([#3136](#3136)) ([5562860](5562860))
* **pylon,hermeneus,daemon:** parameterize infra constants via taxis config ([#2306](#2306)) ([#3130](#3130)) ([81727b7](81727b7))
* **taxis,organon,aletheia:** parameter registry + agent tool + CLI describe ([#2306](#2306)) ([#3135](#3135)) ([9b52daf](9b52daf))
* **training:** enrich records with episteme labels and add shard rotation ([#3193](#3193)) ([86563bc](86563bc))

### Bug Fixes

* **aletheia:** detect fjall lock contention in CLI memory commands ([#3181](#3181)) ([4741469](4741469))
* archived sessions, review-skills lock, lock handling, inclusive language ([#3171](#3171)) ([6bb51a9](6bb51a9))
* audit expect() calls: add missing annotations, fix misleading messages ([#3231](#3231)) ([#3310](#3310)) ([82d5e4c](82d5e4c))
* deployment upgrade path, CC provider routing, clippy zero-warnings ([#3154](#3154)) ([0d5d304](0d5d304))
* fsync temp scripts to prevent ETXTBSY, set GTK dark theme for CSD ([#3156](#3156)) ([e2f956d](e2f956d)), closes [#3146](#3146)
* **hermeneus:** graceful degradation when CC provider becomes unavailable ([#3158](#3158)) ([#3183](#3183)) ([86a0ca5](86a0ca5))
* **krites:** replace 137 unreachable!() with proper error returns ([#3172](#3172)) ([a1a3347](a1a3347)), closes [#3169](#3169)
* **mneme:** include mneme-engine in default features ([#3187](#3187)) ([453def5](453def5))
* **mneme:** tighten training capture quality gate ([#3178](#3178)) ([#3185](#3185)) ([13733c6](13733c6))
* pricing, CC parser, export validation, credential refresh, session field naming ([#3168](#3168)) ([ce1a488](ce1a488))
* **proskenion:** embed CSS via include_str for reliable theme loading ([#3155](#3155)) ([ca0885e](ca0885e)), closes [#3145](#3145)
* **pylon:** return 404 for archived sessions on GET ([#3196](#3196)) ([#3204](#3204)) ([20904ba](20904ba))
* **pylon:** surface root cause in SSE turn_failed errors ([#3182](#3182)) ([76c38b7](76c38b7))
* resolve all high-severity security lint findings from kanon QA ([#3170](#3170)) ([0a87d23](0a87d23)), closes [#3169](#3169)
* **security:** address CodeQL cleartext and hardcoded crypto alerts ([#3201](#3201)) ([6a79b89](6a79b89))
* **security:** redact sensitive data in log output (CodeQL cleartext-logging) ([#3200](#3200)) ([42bd197](42bd197))
* **security:** validate paths before filesystem operations ([#3203](#3203)) ([b34264e](b34264e))
* surface silent failures in hermeneus, agora, nous, and pylon ([#3311](#3311)) ([49cfc4b](49cfc4b))
* systemd readiness probe and OpenAPI version tracking ([#3302](#3302)) ([f946568](f946568))
* token refresh error handling, restart backoff reset, atomic deploy ([#3262](#3262)) ([42ff625](42ff625))
* zombie actor cleanup and stale architecture docs ([#3248](#3248), [#3244](#3244)) ([#3299](#3299)) ([57489ae](57489ae))

### Documentation

* wave 7 constants completion audit ([#2306](#2306)) ([#3134](#3134)) ([1f8f93e](1f8f93e))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>



Summary
- `ProviderInit` errors (CC binary spawn failures) now transition the health state machine through Degraded -> Down, activating the circuit breaker after repeated failures
- `ProviderInit` and CC-specific `ApiRequest` errors are now classified as retryable, enabling the execute stage's degraded-mode fallback (distillation cache or honest "unavailable" message)
- `ProviderInit`, `PipelineStage("unavailable")`, and `ServiceDegraded` errors now return 503 Service Unavailable instead of 500 Internal Server Error

Closes #3158
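The retryability and status-code changes summarized above could look roughly like the sketch below. It assumes a single hypothetical `TurnError` enum, whereas the real errors are spread across hermeneus, nous, and pylon. The `Internal` variant, the `status_code` method, the helper `is_cc_subprocess_message`, and its marker strings are invented for illustration. How the actual code recognizes CC subprocess messages is not shown in this PR description.

```rust
/// Hypothetical unified error type. Variant names other than `Internal`
/// mirror those named in the PR; their real shapes may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum TurnError {
    ProviderInit(String),
    ApiRequest(String),
    PipelineStage(String),
    ServiceDegraded,
    Internal(String),
}

/// Assumed heuristic for spotting CC subprocess failures in an
/// `ApiRequest` message. The marker strings are placeholders.
fn is_cc_subprocess_message(msg: &str) -> bool {
    const MARKERS: [&str; 2] = ["subprocess", "claude code"];
    let lower = msg.to_lowercase();
    MARKERS.iter().any(|marker| lower.contains(marker))
}

impl TurnError {
    /// Retryable errors let the execute stage try its degraded-mode fallback.
    pub fn is_retryable(&self) -> bool {
        match self {
            TurnError::ProviderInit(_) => true,
            TurnError::ApiRequest(msg) => is_cc_subprocess_message(msg),
            _ => false,
        }
    }

    /// Maps availability-related errors to 503 and everything else to 500.
    pub fn status_code(&self) -> u16 {
        match self {
            TurnError::ProviderInit(_) | TurnError::ServiceDegraded => 503,
            TurnError::PipelineStage(stage) if stage == "unavailable" => 503,
            _ => 500,
        }
    }
}
```

Returning a plain `u16` keeps the sketch free of any HTTP framework dependency. In a real handler the value would be converted to whatever status type the server uses.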
Test plan
- `cargo check --workspace` passes
- `cargo test -p hermeneus -p nous -p pylon` passes (10 new tests)
- `cargo clippy --workspace` zero new warnings
- `ProviderInit` is retryable, transitions health to Degraded/Down, maps to 503 through all error paths (direct hermeneus, nous::Llm wrapper, PipelineStage, ServiceDegraded)

🤖 Generated with Claude Code