fix(hermeneus): graceful degradation when CC provider becomes unavailable (#3158) #3183

Merged
forkwright merged 1 commit into main from fix/cc-provider-graceful-degradation
Apr 15, 2026

@forkwright
Owner

Summary

  • Health tracking: ProviderInit errors (CC binary spawn failures) now transition the health state machine through Degraded -> Down, activating the circuit breaker after repeated failures
  • Retryable classification: ProviderInit and CC-specific ApiRequest errors are now classified as retryable, enabling the execute stage's degraded-mode fallback (distillation cache or honest "unavailable" message)
  • HTTP status mapping: ProviderInit, PipelineStage("unavailable"), and ServiceDegraded errors now return 503 Service Unavailable instead of 500 Internal Server Error
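
The health-tracking change in the first bullet can be pictured with a minimal sketch. Everything here is hypothetical and for illustration only: `ErrorKind`, `Health`, `HealthTracker`, and the `DOWN_THRESHOLD` value of 3 are assumed names and numbers, not the actual hermeneus types. Only the idea comes from the PR: `record_error()` treats ProviderInit like ApiRequest, so repeated failures move the state through Degraded to Down.

```rust
/// Hypothetical error kinds; the real hermeneus error enum is not shown in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorKind {
    ProviderInit,
    ApiRequest,
    Other,
}

/// Hypothetical health states mirroring the Degraded -> Down path.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Up,
    Degraded { consecutive_failures: u32 },
    Down,
}

/// Assumed threshold: the number of consecutive failures that opens the breaker.
pub const DOWN_THRESHOLD: u32 = 3;

pub struct HealthTracker {
    state: Health,
}

impl HealthTracker {
    pub fn new() -> Self {
        Self { state: Health::Up }
    }

    pub fn state(&self) -> Health {
        self.state
    }

    /// Records a failure. ProviderInit shares the ApiRequest arm, so it no
    /// longer falls through to the arm that ignores the error.
    pub fn record_error(&mut self, kind: ErrorKind) {
        match kind {
            ErrorKind::ProviderInit | ErrorKind::ApiRequest => {
                self.state = match self.state {
                    Health::Up => Health::Degraded { consecutive_failures: 1 },
                    Health::Degraded { consecutive_failures }
                        if consecutive_failures + 1 >= DOWN_THRESHOLD =>
                    {
                        Health::Down
                    }
                    Health::Degraded { consecutive_failures } => Health::Degraded {
                        consecutive_failures: consecutive_failures + 1,
                    },
                    Health::Down => Health::Down,
                };
            }
            // Errors unrelated to provider availability leave the state unchanged.
            ErrorKind::Other => {}
        }
    }

    /// A successful call resets the tracker.
    pub fn record_success(&mut self) {
        self.state = Health::Up;
    }

    /// The circuit breaker is open once the provider is considered Down.
    pub fn circuit_open(&self) -> bool {
        self.state == Health::Down
    }
}
```

With the assumed threshold of 3, the first two ProviderInit failures leave the tracker in Degraded and the third moves it to Down, at which point `circuit_open()` returns true.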

Closes #3158

Test plan

  • cargo check --workspace passes
  • cargo test -p hermeneus -p nous -p pylon passes (10 new tests)
  • cargo clippy --workspace reports zero new warnings
  • New tests verify: ProviderInit is retryable, transitions health to Degraded/Down, maps to 503 through all error paths (direct hermeneus, nous::Llm wrapper, PipelineStage, ServiceDegraded)
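
As a rough illustration of what the 503 checks in the last bullet exercise, here is a sketch of a status mapping that covers the four error paths named above. The enums and `status_code()` are hypothetical stand-ins, not pylon's real types or handler. In particular, sending every other error to 500 is an assumption modeled on the catch-all behavior described in this PR.

```rust
/// Hypothetical provider-level error (stand-in for the hermeneus error type).
#[derive(Debug)]
pub enum ProviderError {
    ProviderInit(String),
    Other(String),
}

/// Hypothetical server-level error covering the paths listed in the test plan.
#[derive(Debug)]
pub enum ServerError {
    /// Provider error surfaced directly.
    Provider(ProviderError),
    /// Provider error wrapped by an LLM layer (assumed shape of the nous::Llm wrapper).
    Llm(ProviderError),
    /// Pipeline stage failure, carrying a stage label.
    PipelineStage(String),
    ServiceDegraded,
    Internal(String),
}

/// Maps an error to an HTTP status code.
/// 503 means the condition is temporary and the client may retry.
pub fn status_code(err: &ServerError) -> u16 {
    match err {
        ServerError::Provider(ProviderError::ProviderInit(_))
        | ServerError::Llm(ProviderError::ProviderInit(_)) => 503,
        ServerError::PipelineStage(stage) if stage.as_str() == "unavailable" => 503,
        ServerError::ServiceDegraded => 503,
        // Assumption: everything else keeps the catch-all 500.
        _ => 500,
    }
}
```

Matching on the wrapped variant as well as the direct one is the point of the sketch: a wrapper that hides the ProviderInit cause would otherwise send the request back to the 500 catch-all.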

🤖 Generated with Claude Code

…able (#3158)

When the Claude Code subprocess provider becomes unavailable mid-turn
(process crash, binary disappeared, auth expiry), the server now
degrades gracefully instead of returning opaque 500 errors.

Three root causes fixed:

1. ProviderInit errors (CC binary spawn failures) were ignored by the
   health state machine — they fell through to the wildcard arm in
   record_error(). Now they follow the same Degraded/Down transition
   path as ApiRequest errors, so the circuit breaker activates after
   repeated failures.

2. ProviderInit and CC-specific ApiRequest errors were not classified
   as retryable, so the execute stage's degraded-mode fallback
   (distillation cache or honest "unavailable" message) never
   activated. Now is_retryable() returns true for ProviderInit and
   CC subprocess error messages.

3. ProviderInit, PipelineStage("unavailable"), and ServiceDegraded
   errors mapped to 500 Internal Server Error via pylon's catch-all.
   Now they map to 503 Service Unavailable, giving clients a clear
   signal to retry.
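
A minimal sketch of the second root cause, under stated assumptions: `ProviderError`, the marker strings in `CC_SUBPROCESS_MARKERS`, and `degraded_reply()` are hypothetical and only illustrate the shape of the fix. The actual messages that is_retryable() matches and the real fallback code in the execute stage are not shown in this commit.

```rust
/// Hypothetical provider error; the String payloads carry the error message.
#[derive(Debug, PartialEq)]
pub enum ProviderError {
    ProviderInit(String),
    ApiRequest(String),
    Other(String),
}

/// Assumed substrings that identify a CC subprocess failure in an ApiRequest
/// message. The real matching rules are not shown in the PR.
const CC_SUBPROCESS_MARKERS: [&str; 2] = ["claude code subprocess", "cc process exited"];

impl ProviderError {
    /// ProviderInit is always retryable; ApiRequest only when the message
    /// looks like a CC subprocess failure.
    pub fn is_retryable(&self) -> bool {
        match self {
            ProviderError::ProviderInit(_) => true,
            ProviderError::ApiRequest(msg) => {
                let lower = msg.to_lowercase();
                CC_SUBPROCESS_MARKERS.iter().any(|m| lower.contains(*m))
            }
            ProviderError::Other(_) => false,
        }
    }
}

/// Degraded-mode fallback: for a retryable error, answer from a cached
/// distillation if one exists, otherwise with an honest "unavailable" message.
/// Non-retryable errors are returned to the caller unchanged.
pub fn degraded_reply(err: ProviderError, cached: Option<&str>) -> Result<String, ProviderError> {
    if !err.is_retryable() {
        return Err(err);
    }
    Ok(match cached {
        Some(summary) => summary.to_owned(),
        None => "The model provider is currently unavailable. Please retry shortly.".to_owned(),
    })
}
```

The fallback only runs when the error is classified as retryable, which is why the misclassification described in item 2 kept it from ever activating.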

Gate-Passed: kanon 0.1.0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@forkwright forkwright merged commit 86a0ca5 into main Apr 15, 2026
3 checks passed
@forkwright forkwright deleted the fix/cc-provider-graceful-degradation branch April 15, 2026 03:29
forkwright pushed a commit that referenced this pull request Apr 15, 2026
🤖 I have created a release *beep* *boop*
---


## [0.18.0](v0.17.0...v0.18.0) (2026-04-15)


### Features

* **aletheia:** add benchmark CLI for LongMemEval and LoCoMo
([#3195](#3195))
([3374a10](3374a10))
* **eidos:** add schema_version to TrainingRecord
([#3186](#3186))
([dc3e0f2](dc3e0f2))
* **episteme,eidos,melete:** parameterize knowledge constants via taxis
config ([#2306](#2306))
([#3132](#3132))
([22564b4](22564b4))
* **nous,taxis:** self-tuning feedback loop
([#2306](#2306) wave 6)
([#3137](#3137))
([c1e8b70](c1e8b70))
* **organon,agora,dianoia:** parameterize tool and planning constants
via taxis config
([#2306](#2306))
([#3133](#3133))
([96d71b3](96d71b3))
* **organon,agora:** wire tool and channel constants to taxis config
reads ([#2306](#2306))
([#3136](#3136))
([5562860](5562860))
* **pylon,hermeneus,daemon:** parameterize infra constants via taxis
config ([#2306](#2306))
([#3130](#3130))
([81727b7](81727b7))
* **taxis,organon,aletheia:** parameter registry + agent tool + CLI
describe ([#2306](#2306))
([#3135](#3135))
([9b52daf](9b52daf))
* **training:** enrich records with episteme labels and add shard
rotation ([#3193](#3193))
([86563bc](86563bc))


### Bug Fixes

* **aletheia:** detect fjall lock contention in CLI memory commands
([#3181](#3181))
([4741469](4741469))
* archived sessions, review-skills lock, lock handling, inclusive
language ([#3171](#3171))
([6bb51a9](6bb51a9))
* audit expect() calls — add missing annotations, fix misleading
messages ([#3231](#3231))
([#3310](#3310))
([82d5e4c](82d5e4c))
* deployment upgrade path, CC provider routing, clippy zero-warnings
([#3154](#3154))
([0d5d304](0d5d304))
* fsync temp scripts to prevent ETXTBSY, set GTK dark theme for CSD
([#3156](#3156))
([e2f956d](e2f956d)),
closes [#3146](#3146)
* **hermeneus:** graceful degradation when CC provider becomes
unavailable
([#3158](#3158))
([#3183](#3183))
([86a0ca5](86a0ca5))
* **krites:** replace 137 unreachable!() with proper error returns
([#3172](#3172))
([a1a3347](a1a3347)),
closes [#3169](#3169)
* **mneme:** include mneme-engine in default features
([#3187](#3187))
([453def5](453def5))
* **mneme:** tighten training capture quality gate
([#3178](#3178))
([#3185](#3185))
([13733c6](13733c6))
* pricing, CC parser, export validation, credential refresh, session
field naming
([#3168](#3168))
([ce1a488](ce1a488))
* **proskenion:** embed CSS via include_str for reliable theme loading
([#3155](#3155))
([ca0885e](ca0885e)),
closes [#3145](#3145)
* **pylon:** return 404 for archived sessions on GET
([#3196](#3196))
([#3204](#3204))
([20904ba](20904ba))
* **pylon:** surface root cause in SSE turn_failed errors
([#3182](#3182))
([76c38b7](76c38b7))
* resolve all high-severity security lint findings from kanon QA
([#3170](#3170))
([0a87d23](0a87d23)),
closes [#3169](#3169)
* **security:** address CodeQL cleartext and hardcoded crypto alerts
([#3201](#3201))
([6a79b89](6a79b89))
* **security:** redact sensitive data in log output (CodeQL
cleartext-logging)
([#3200](#3200))
([42bd197](42bd197))
* **security:** validate paths before filesystem operations
([#3203](#3203))
([b34264e](b34264e))
* surface silent failures in hermeneus, agora, nous, and pylon
([#3311](#3311))
([49cfc4b](49cfc4b))
* systemd readiness probe and OpenAPI version tracking
([#3302](#3302))
([f946568](f946568))
* token refresh error handling, restart backoff reset, atomic deploy
([#3262](#3262))
([42ff625](42ff625))
* zombie actor cleanup and stale architecture docs
([#3248](#3248),
[#3244](#3244))
([#3299](#3299))
([57489ae](57489ae))


### Documentation

* wave 7 constants completion audit
([#2306](#2306))
([#3134](#3134))
([1f8f93e](1f8f93e))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

Server process exits when CC provider becomes unavailable during a turn