Skip to content

feat: mid-session circuit breaker for proxy mode (Phase 3)#51

Merged
risjai merged 3 commits into
masterfrom
feat/proxy-circuit-breaker
Apr 12, 2026
Merged

feat: mid-session circuit breaker for proxy mode (Phase 3)#51
risjai merged 3 commits into
masterfrom
feat/proxy-circuit-breaker

Conversation

@risjai
Copy link
Copy Markdown
Collaborator

@risjai risjai commented Apr 12, 2026

Summary

Implements Phase 3 from my-docs/plan-proxy-resilience.md / my-docs/plan-proxy-circuit-breaker.md.

Test plan

  • python3 -m pytest python/tests/test_circuit_breaker.py -v — 23 new tests pass
  • python3 -m pytest python/tests/ -v — 156 passed, 4 skipped, 0 regressions
  • cargo clippy --workspace — zero warnings
  • Manual: start proxy, init(mode="proxy"), make calls, kill proxy → next call succeeds via direct upstream
  • Manual: verify steps recorded in local store after circuit trip
  • Manual: restart proxy after 30s → circuit closes, resumes proxy mode

🤖 Generated with Claude Code

Add a ProxyCircuitBreaker that detects proxy failure mid-session and
transparently falls through to direct recording. The circuit breaker:

- Detects failures from LLM SDK connection errors (APIConnectionError,
  APITimeoutError) — no per-request health checks, zero latency on the
  happy path
- Trips to OPEN after 2 consecutive connection failures
- In OPEN state, preemptively skips the proxy and retries via a
  throwaway client pointing at the original upstream URL (thread-safe,
  no shared mutable state)
- Creates a local Store + Recorder for direct-mode recording while
  proxy is down (session named "<original> (proxy-fallback)")
- After 30s recovery timeout, probes the proxy in HALF_OPEN state
- On successful probe, tears down direct resources and resumes proxy
  mode

Design decisions (from review):
- Throwaway client for retries (resolves thread-safety issue with
  base_url swap — review finding #1)
- Preemptive skip in OPEN state (zero latency penalty — finding #2)
- New session in local store on trip (simple, no cross-store merge —
  finding #3)
- APITimeoutError included in detection (finding #5)

Known limitations:
- Mid-stream proxy failure cannot be retried (LLM responses are
  non-deterministic)
- Timeout detection takes SDK timeout period (60-120s) × threshold

23 new tests: error detection (7), state machine (13), integration (3).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator Author

@risjai risjai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: PR #51 — Mid-Session Circuit Breaker (Phase 3)

520 lines of new code in circuit_breaker.py, 17 lines modified in patch.py, 328 lines of tests. The architecture is clean: separate file, throwaway clients for thread safety, preemptive skip in OPEN state. All five plan review findings are addressed. There are a few issues to fix before merge.


Finding 1: str | None type hint breaks Python 3.9 [BUG]

Severity: High — build-breaking for supported Python versions

pyproject.toml declares requires-python = ">=3.9". The str | None union syntax is Python 3.10+:

def __init__(
    self,
    proxy_url: str,
    original_openai_url: str | None,  # 💥 SyntaxError on Python 3.9
    original_anthropic_url: str | None,
    ...

Fix: Use Optional[str] from typing, or add from __future__ import annotations.


Finding 2: APITimeoutError may cause false trips [DESIGN]

Severity: Medium — proxy is fine, upstream is slow

The CB includes APITimeoutError in _is_connection_error() per plan review finding #5. But a timeout can originate from the upstream provider, not the proxy:

  1. SDK connects to proxy (success)
  2. Proxy forwards to upstream (upstream is slow)
  3. SDK times out → APITimeoutError
  4. CB records failure → after 2 timeouts, trips to OPEN

The proxy was healthy — the upstream was slow. Tripping the CB removes proxy recording while gaining nothing (the upstream is still slow).

APIConnectionError is safe because the proxy returns HTTP 502 for upstream failures — a connection error only fires when the proxy itself is unreachable. APITimeoutError has no such distinction.

Options:

  • (a) Remove APITimeoutError from detection. Connection errors alone are sufficient for the "proxy is down" signal.
  • (b) Keep it but add a shorter timeout threshold for the proxy connection phase specifically (requires httpx internals).
  • (c) Keep it and document as a known tradeoff. Accept that very rare upstream timeouts might cause false trips that self-correct in 30s.

Recommend (a) — it's simpler and more correct. Upstream timeouts are common; proxy crashes are rare. Mixing them contaminates the signal.


Finding 3: Premature record_success on streaming calls [BUG]

Severity: Medium — failure counter never trips for flaky proxies

In all 6 patch methods, record_success() is called immediately after original() returns:

def patched(messages_self, *args, **kwargs):
    if cb.should_try_proxy():
        try:
            result = original(messages_self, *args, **kwargs)
            cb.record_success()  # ← called here
            return result

For streaming calls, original() returns a stream/manager object — no data has been received yet. The TCP connection to the proxy succeeded, but the proxy could die mid-stream. record_success() resets failure_count to 0.

Impact: A proxy that establishes connections but dies mid-stream will never trip the CB because every call records a success before data flows. The failure count keeps resetting.

Fix: For streaming methods, defer record_success(). Don't call it at the patch level — let it be called when the stream completes successfully. This requires either:

  • Wrapping the stream with a CB-aware wrapper that calls record_success() on stream completion
  • Or accepting this as a known limitation (mid-stream failures are already documented as non-retriable)

If accepting as a known limitation, at minimum don't call record_success() for streaming calls — just leave the failure count unchanged:

if not kwargs.get("stream", False):
    cb.record_success()

Finding 4: No test for the actual retry path [TESTS]

Severity: Medium — the most critical behavior is untested

The 23 tests cover error detection (7), state machine (13), and integration wiring (3). None test the core value proposition: calling a patched SDK method → proxy fails → fallthrough to direct upstream → call succeeds → step recorded.

This would require mocking both the proxy-bound original method (to raise APIConnectionError) and the direct client creation (to return a mock response). It's the most important behavioral test for this feature.

Suggested test:

def test_openai_call_retries_on_proxy_failure(self):
    """Patched create() retries via direct upstream when proxy is down."""
    # Setup: CB in CLOSED state, mock original to raise APIConnectionError,
    # mock openai.OpenAI to return a working client
    # Assert: call succeeds, failure recorded, step recorded in direct store

Finding 5: Throwaway client only copies api_key [MINOR]

Severity: Low — works for standard configs, breaks for custom ones

direct_client = openai.OpenAI(
    base_url=self.original_openai_url,
    api_key=existing_client.api_key,
)

This doesn't copy organization, default_headers, max_retries, timeout, or custom http_client settings. Users with Azure OpenAI, custom proxies, or org-scoped keys would get auth errors on retry.

Minimum fix: Also copy organization for OpenAI:

direct_client = openai.OpenAI(
    base_url=self.original_openai_url,
    api_key=existing_client.api_key,
    organization=getattr(existing_client, 'organization', None),
)

Fully robust would copy timeout and max_retries too, but api_key + organization covers 95% of configurations.


Finding 6: Multiple concurrent probes in HALF_OPEN [MINOR]

Severity: Low — can miss recovery, self-corrects in 30s

Standard circuit breakers allow only ONE probe in HALF_OPEN. This implementation allows all concurrent calls through:

  1. Thread A's probe succeeds → record_success() → CLOSED
  2. Thread B's probe fails (race) → record_failure() → HALF_OPEN, sees state=CLOSED now → records failure_count=1, stays CLOSED

This is fine. But the worse race:

  1. Thread B's probe fails first → record_failure() → HALF_OPEN state → reopens to OPEN
  2. Thread A's probe succeeds → record_success() → state is OPEN (not HALF_OPEN) → just resets failure_count, doesn't close

Now the proxy is healthy but the CB stays OPEN for another 30s. Self-corrects on next probe. Acceptable for a local dev tool, but worth noting.


Finding 7: Throwaway client per retry — no cleanup [MINOR]

Severity: Low — resource accumulation under high load

Every retry creates a new openai.OpenAI() / anthropic.Anthropic() client (which creates an httpx.Client with a connection pool). These are never explicitly closed. Under high-frequency OPEN-state retries, clients accumulate until GC.

Optional improvement: Create the throwaway client once in _trip_to_open(), store it, and reuse during OPEN state. Tear down in _close_circuit().

Not a blocker — LLM call frequency is naturally bounded, and GC handles cleanup.


Summary

# Finding Severity Action
1 str | None breaks Python 3.9 High (bug) Fix before merge
2 APITimeoutError may false-trip Medium (design) Remove from detection or document
3 Premature record_success on streams Medium (bug) Skip for streaming calls
4 No test for actual retry path Medium (test) Add integration test
5 Throwaway client only copies api_key Low Copy organization too
6 Multiple HALF_OPEN probes Low Document, acceptable
7 Throwaway client per retry Low Optional optimization

Verdict: Fix #1 (Python 3.9 compat) before merge — it's a build-breaking bug. Strongly recommend fixing #3 (premature success on streams) too. The rest are non-blocking.

- Fix #1 (HIGH): Add `from __future__ import annotations` for Python
  3.9 compat — `str | None` syntax is 3.10+
- Fix #2 (MEDIUM): Remove APITimeoutError from detection — timeouts
  may originate from slow upstream, not dead proxy. False trips when
  proxy is healthy but LLM is slow.
- Fix #3 (MEDIUM): Skip record_success() for streaming calls — stream
  object returned doesn't mean data has flowed. Proxy could die
  mid-stream after failure count was reset.
- Fix #4 (MEDIUM): Add retry-path test — verifies direct recorder
  stores steps correctly after circuit trips to OPEN
- Fix #5 (LOW): Copy `organization` in throwaway OpenAI client for
  org-scoped API key compatibility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@risjai risjai enabled auto-merge (squash) April 12, 2026 15:58
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@risjai risjai merged commit 2f7cac3 into master Apr 12, 2026
4 checks passed
@risjai risjai deleted the feat/proxy-circuit-breaker branch April 12, 2026 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant