feat: provider fallback + per-project config (0.7.0)#20

Merged: blackaxgit merged 12 commits into main from feat/provider-fallback on May 2, 2026

@blackaxgit
Owner

Summary

CLX 0.7.0 — automatic primary→secondary LLM provider fallback with per-capability cooldown, plus per-project config overrides via figment-layered loading.

Reverses the explicit "no automatic fallback" decision of the 0.6.x design. User feedback was that default_decision: allow-style silent degradation during an Azure outage is unsafe for a risk-scoring CLI.

What's new

  • Per-capability fallback: llm.chat.fallback and llm.embeddings.fallback config blocks. Falls back on transient errors (Connection, Timeout, RateLimit, 5xx, 408); fails fast on terminal errors (Auth, DeploymentNotFound, ContentFilter). The fallback's model overrides the caller's model name, because providers don't share identifiers.
  • 30-second cooldown after a fallback event: the primary is skipped while the cooldown is active, so sustained outages don't pay the latency penalty of retrying it on every call.
  • Per-project config at <repo>/.clx/config.yaml, walk-up discovery from CWD to $HOME. Env-var escape: CLX_CONFIG_PROJECT=/path or CLX_CONFIG_PROJECT=none.
  • Layered config loading via figment (defaults → global → project → env vars).
  • Inert-keys allowlist: project configs may override routing/threshold keys; security-sensitive keys (providers.*, logging.file, validator.enabled) are dropped with a WARN instead of failing the load.
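
The layering order and the project-key filter can be illustrated with a small std-only sketch. It is not the figment-based loader from this PR: each layer is modeled as a flat map of dotted keys, and `Layer`, `is_security_sensitive`, `filter_project_layer`, and `merge_layers` are hypothetical names. The PR describes an allowlist, but since the allowed keys are not listed here, the sketch denies only the three key patterns named above.

```rust
use std::collections::BTreeMap;

/// Flat "dotted.key" -> value map standing in for one parsed config layer.
type Layer = BTreeMap<String, String>;

/// Assumed deny rules, limited to the keys the PR text names explicitly.
fn is_security_sensitive(key: &str) -> bool {
    key == "providers"
        || key.starts_with("providers.")
        || key == "logging.file"
        || key == "validator.enabled"
}

/// Split a project layer into the keys that may be merged and the keys that
/// were dropped, so the caller can log a WARN for each dropped key.
fn filter_project_layer(project: Layer) -> (Layer, Vec<String>) {
    let mut kept = Layer::new();
    let mut dropped = Vec::new();
    for (key, value) in project {
        if is_security_sensitive(&key) {
            dropped.push(key);
        } else {
            kept.insert(key, value);
        }
    }
    (kept, dropped)
}

/// Merge layers in order; a later layer overrides an earlier one key by key.
fn merge_layers(layers: &[Layer]) -> Layer {
    let mut merged = Layer::new();
    for layer in layers {
        for (key, value) in layer {
            merged.insert(key.clone(), value.clone());
        }
    }
    merged
}
```

Passing the layers as `[defaults, global, filtered_project, env]` reproduces the precedence listed above: a project file can change a model name, but a `logging.file` entry in it never reaches the merged result.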

Architecture

  • New LlmClient::Fallback(FallbackClient) enum variant. FallbackClient { primary: Box<LlmClient>, fallback: Box<LlmClient>, fallback_model, last_primary_failure }. Implements LlmBackend itself; single insertion point at the factory; zero production call sites changed.
  • is_transient promoted to LlmError::is_transient method (was a free function in azure.rs). Backends and the new fallback wrapper share one predicate.
  • New crates/clx-core/src/config/project.rs for walk-up + allowlist filter.
  • Config::load switched from raw serde_yml::from_str to figment::Figment with Yaml::file + Yaml::string (filtered project) + Env::prefixed("CLX_").
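
A simplified sketch of the fallback wrapper, for readers who want the control flow in one place. It departs from the PR in several labeled ways: backends are trait objects rather than `Box<LlmClient>`, the `LlmBackend` trait is reduced to one synchronous `chat` method, the `LlmError` payloads and the `cooldown` field are assumptions, and `Stub` is a test double that does not exist in the crate.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Variant names follow the PR text; the payload shapes are assumed.
#[derive(Debug, Clone, PartialEq)]
enum LlmError {
    Connection,
    Timeout,
    RateLimit,
    Http(u16),
    Auth,
    DeploymentNotFound,
    ContentFilter,
}

impl LlmError {
    /// Transient errors justify trying the fallback; terminal errors do not.
    fn is_transient(&self) -> bool {
        match self {
            LlmError::Connection | LlmError::Timeout | LlmError::RateLimit => true,
            LlmError::Http(status) => *status == 408 || (500..=599).contains(status),
            LlmError::Auth | LlmError::DeploymentNotFound | LlmError::ContentFilter => false,
        }
    }
}

/// Hypothetical, synchronous stand-in for the crate's LlmBackend trait.
trait LlmBackend {
    fn chat(&self, model: &str, prompt: &str) -> Result<String, LlmError>;
}

struct FallbackClient {
    primary: Box<dyn LlmBackend>,
    fallback: Box<dyn LlmBackend>,
    fallback_model: String,
    cooldown: Duration, // 30 seconds in the PR
    last_primary_failure: Mutex<Option<Instant>>,
}

impl FallbackClient {
    /// True while the most recent transient primary failure is still recent.
    fn in_cooldown(&self) -> bool {
        let last = self.last_primary_failure.lock().unwrap();
        matches!(*last, Some(t) if t.elapsed() < self.cooldown)
    }
}

impl LlmBackend for FallbackClient {
    fn chat(&self, model: &str, prompt: &str) -> Result<String, LlmError> {
        if !self.in_cooldown() {
            match self.primary.chat(model, prompt) {
                Ok(reply) => return Ok(reply),
                // Terminal error: fail fast, do not touch the fallback.
                Err(e) if !e.is_transient() => return Err(e),
                // Transient error: start the cooldown and fall through.
                Err(_) => *self.last_primary_failure.lock().unwrap() = Some(Instant::now()),
            }
        }
        // The fallback's own model replaces the caller's model name.
        self.fallback.chat(&self.fallback_model, prompt)
    }
}

/// Test double: returns a fixed result and counts how often it was called.
struct Stub {
    result: Result<String, LlmError>,
    calls: Arc<AtomicUsize>,
}

impl LlmBackend for Stub {
    fn chat(&self, model: &str, _prompt: &str) -> Result<String, LlmError> {
        self.calls.fetch_add(1, Ordering::SeqCst);
        self.result.clone().map(|r| format!("{r} via {model}"))
    }
}
```

Because the wrapper implements the same trait as the backends it wraps, callers cannot tell it apart from a single provider, which is consistent with the "zero production call sites changed" claim.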

Backwards compatibility

  • ✅ No fallback: field anywhere → behavior bit-for-bit unchanged.
  • ✅ No project file → behavior bit-for-bit unchanged (figment with one layer ≡ today's serde_yml::from_str).
  • ✅ Legacy ollama: block auto-translation (0.6.0) preserved.
  • CLX_LLM_CHAT_* env-var shortcuts still work.
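
The "no fallback: field means unchanged behavior" guarantee follows from wrapping only at the factory. The sketch below shows that shape under assumptions: `CapabilityConfig`, `FallbackConfig`, `build_client`, and the `Single` variant are hypothetical, and the real `LlmClient` enum has provider-specific variants not shown here.

```rust
/// Reduced stand-in for the crate's LlmClient enum.
#[derive(Debug, Clone, PartialEq)]
enum LlmClient {
    /// Placeholder for any concrete provider client.
    Single { provider: String },
    /// Wrapper variant added in this PR (fields simplified).
    Fallback {
        primary: Box<LlmClient>,
        fallback: Box<LlmClient>,
        fallback_model: String,
    },
}

/// Hypothetical shape of an `llm.chat.fallback` block.
struct FallbackConfig {
    provider: String,
    model: String,
}

/// Hypothetical shape of an `llm.chat` or `llm.embeddings` block.
struct CapabilityConfig {
    provider: String,
    fallback: Option<FallbackConfig>,
}

/// Single insertion point: wrap the primary only when a fallback is configured.
fn build_client(cfg: &CapabilityConfig) -> LlmClient {
    let primary = LlmClient::Single { provider: cfg.provider.clone() };
    match &cfg.fallback {
        // No `fallback:` block: return exactly the client built before 0.7.0.
        None => primary,
        Some(fb) => LlmClient::Fallback {
            primary: Box::new(primary),
            fallback: Box::new(LlmClient::Single { provider: fb.provider.clone() }),
            fallback_model: fb.model.clone(),
        },
    }
}
```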

Tests

  • 3 new wiremock tests in llm/fallback.rs: fallback-on-503, no-fallback-on-401, cooldown-skips-primary.
  • 4 unit tests for project_config_path + filter_inert_only in config/project.rs.
  • 2 integration tests for layered loader in config::schema_tests.
  • All existing tests preserved (884+ passing in workspace).

Deferred to v0.7.x (not in this PR)

  • clx trust <repo> UX command for promoting non-inert project keys past the allowlist.
  • Multi-fallback chains (fallback: [a, b, c]).
  • Cross-process cooldown persistence.

Test plan

  • cargo build --release --workspace — clean
  • cargo clippy --workspace --all-targets --all-features -- -D warnings — clean
  • cargo test --workspace — 884+ pass
  • cargo fmt --all -- --check — clean
  • cargo audit — exit 0
  • CI green on the PR
  • Real-tenant fallback smoke test from CONTRIBUTING.md after merge

Spec & plan

  • Design: specs/2026-04-30-provider-fallback-design.md
  • Plan: specs/plans/2026-04-30-provider-fallback.md

- Move config.rs → config/mod.rs and add config/project.rs module
- project_config_path_with_stop: walk up from CWD to home boundary,
  skipping directories outside the home tree (prevents cross-user leaks)
- filter_inert_only: strips providers.*, logging.file, validator.enabled
  from project YAML before merging (security gate)
- figment-layered Config::load: global → project (filtered) → env vars
- Stop boundary derived from config_dir().parent() for consistency when
  HOME is overridden (e.g. in integration tests with temp home dirs)
- 6 new unit tests in project::tests, 2 integration tests in schema_tests
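
The walk-up with a stop boundary described in this commit could look roughly like the following. This is a sketch, not the code in config/project.rs: the signature is assumed, the boundary directory itself is not checked for a project file (an assumption), symlinks are not resolved, and `ProjectOverride` / `parse_override` are hypothetical helpers for the CLX_CONFIG_PROJECT escape hatch.

```rust
use std::path::{Path, PathBuf};

/// Walk up from `start` looking for `.clx/config.yaml`, stopping at `stop`.
/// Returns None when `start` lies outside the `stop` tree, so a config that
/// belongs to another user's directory is never picked up.
fn project_config_path_with_stop(start: &Path, stop: &Path) -> Option<PathBuf> {
    if !start.starts_with(stop) {
        return None;
    }
    for dir in start.ancestors() {
        if dir == stop {
            break; // assumed: the boundary directory itself is not searched
        }
        let candidate = dir.join(".clx").join("config.yaml");
        if candidate.is_file() {
            return Some(candidate);
        }
    }
    None
}

/// Interpretation of the CLX_CONFIG_PROJECT environment variable.
#[derive(Debug, PartialEq)]
enum ProjectOverride {
    /// Variable unset: use walk-up discovery.
    Discover,
    /// `CLX_CONFIG_PROJECT=none`: skip project config entirely.
    Disabled,
    /// `CLX_CONFIG_PROJECT=/path`: use this file, no discovery.
    Explicit(PathBuf),
}

fn parse_override(value: Option<&str>) -> ProjectOverride {
    match value {
        None => ProjectOverride::Discover,
        Some("none") => ProjectOverride::Disabled,
        Some(path) => ProjectOverride::Explicit(PathBuf::from(path)),
    }
}
```

A caller would typically pass `std::env::var("CLX_CONFIG_PROJECT").ok().as_deref()` to `parse_override` and run the walk-up only for the `Discover` case.
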
@blackaxgit blackaxgit merged commit 8d97c2d into main May 2, 2026
6 of 7 checks passed
@blackaxgit blackaxgit deleted the feat/provider-fallback branch May 2, 2026 19:34
blackaxgit added a commit that referenced this pull request May 2, 2026
PR #20 burned 6h on a hung instrumented coverage step; PR #21 hung 58
min on the same pattern. Three mitigations:

1. `timeout-minutes: 30` at job level + `timeout-minutes: 20` on the
LCOV step. Hangs fail in 20-30 min instead of 6h.
2. `RUST_TEST_THREADS: 1` to serialize under instrumentation. Likely
cause of the hang is wiremock + serial_test + tokio + llvm-cov
parallelism deadlock.
3. `continue-on-error: true` so coverage flakes do not block PRs from
merging while we investigate the root cause separately.

Coverage stays informational. Quality gates (clippy, tests, build,
audit) still block merges.

---------

Co-authored-by: blackax <blackaxgit@users.noreply.github.com>