feat: mid-session circuit breaker for proxy mode (Phase 3)#51
Conversation
Add a ProxyCircuitBreaker that detects proxy failure mid-session and transparently falls through to direct recording. The circuit breaker: - Detects failures from LLM SDK connection errors (APIConnectionError, APITimeoutError) — no per-request health checks, zero latency on the happy path - Trips to OPEN after 2 consecutive connection failures - In OPEN state, preemptively skips the proxy and retries via a throwaway client pointing at the original upstream URL (thread-safe, no shared mutable state) - Creates a local Store + Recorder for direct-mode recording while proxy is down (session named "<original> (proxy-fallback)") - After 30s recovery timeout, probes the proxy in HALF_OPEN state - On successful probe, tears down direct resources and resumes proxy mode Design decisions (from review): - Throwaway client for retries (resolves thread-safety issue with base_url swap — review finding #1) - Preemptive skip in OPEN state (zero latency penalty — finding #2) - New session in local store on trip (simple, no cross-store merge — finding #3) - APITimeoutError included in detection (finding #5) Known limitations: - Mid-stream proxy failure cannot be retried (LLM responses are non-deterministic) - Timeout detection takes SDK timeout period (60-120s) × threshold 23 new tests: error detection (7), state machine (13), integration (3). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
risjai
left a comment
There was a problem hiding this comment.
Code Review: PR #51 — Mid-Session Circuit Breaker (Phase 3)
520 lines of new code in circuit_breaker.py, 17 lines modified in patch.py, 328 lines of tests. The architecture is clean: separate file, throwaway clients for thread safety, preemptive skip in OPEN state. All five plan review findings are addressed. There are a few issues to fix before merge.
Finding 1: str | None type hint breaks Python 3.9 [BUG]
Severity: High — build-breaking for supported Python versions
pyproject.toml declares requires-python = ">=3.9". The str | None union syntax is Python 3.10+:
def __init__(
self,
proxy_url: str,
original_openai_url: str | None, # 💥 SyntaxError on Python 3.9
original_anthropic_url: str | None,
...Fix: Use Optional[str] from typing, or add from __future__ import annotations.
Finding 2: APITimeoutError may cause false trips [DESIGN]
Severity: Medium — proxy is fine, upstream is slow
The CB includes APITimeoutError in _is_connection_error() per plan review finding #5. But a timeout can originate from the upstream provider, not the proxy:
- SDK connects to proxy (success)
- Proxy forwards to upstream (upstream is slow)
- SDK times out →
APITimeoutError - CB records failure → after 2 timeouts, trips to OPEN
The proxy was healthy — the upstream was slow. Tripping the CB removes proxy recording while gaining nothing (the upstream is still slow).
APIConnectionError is safe because the proxy returns HTTP 502 for upstream failures — a connection error only fires when the proxy itself is unreachable. APITimeoutError has no such distinction.
Options:
- (a) Remove
APITimeoutErrorfrom detection. Connection errors alone are sufficient for the "proxy is down" signal. - (b) Keep it but add a shorter timeout threshold for the proxy connection phase specifically (requires httpx internals).
- (c) Keep it and document as a known tradeoff. Accept that very rare upstream timeouts might cause false trips that self-correct in 30s.
Recommend (a) — it's simpler and more correct. Upstream timeouts are common; proxy crashes are rare. Mixing them contaminates the signal.
Finding 3: Premature record_success on streaming calls [BUG]
Severity: Medium — failure counter never trips for flaky proxies
In all 6 patch methods, record_success() is called immediately after original() returns:
def patched(messages_self, *args, **kwargs):
if cb.should_try_proxy():
try:
result = original(messages_self, *args, **kwargs)
cb.record_success() # ← called here
return resultFor streaming calls, original() returns a stream/manager object — no data has been received yet. The TCP connection to the proxy succeeded, but the proxy could die mid-stream. record_success() resets failure_count to 0.
Impact: A proxy that establishes connections but dies mid-stream will never trip the CB because every call records a success before data flows. The failure count keeps resetting.
Fix: For streaming methods, defer record_success(). Don't call it at the patch level — let it be called when the stream completes successfully. This requires either:
- Wrapping the stream with a CB-aware wrapper that calls
record_success()on stream completion - Or accepting this as a known limitation (mid-stream failures are already documented as non-retriable)
If accepting as a known limitation, at minimum don't call record_success() for streaming calls — just leave the failure count unchanged:
if not kwargs.get("stream", False):
cb.record_success()Finding 4: No test for the actual retry path [TESTS]
Severity: Medium — the most critical behavior is untested
The 23 tests cover error detection (7), state machine (13), and integration wiring (3). None test the core value proposition: calling a patched SDK method → proxy fails → fallthrough to direct upstream → call succeeds → step recorded.
This would require mocking both the proxy-bound original method (to raise APIConnectionError) and the direct client creation (to return a mock response). It's the most important behavioral test for this feature.
Suggested test:
def test_openai_call_retries_on_proxy_failure(self):
"""Patched create() retries via direct upstream when proxy is down."""
# Setup: CB in CLOSED state, mock original to raise APIConnectionError,
# mock openai.OpenAI to return a working client
# Assert: call succeeds, failure recorded, step recorded in direct storeFinding 5: Throwaway client only copies api_key [MINOR]
Severity: Low — works for standard configs, breaks for custom ones
direct_client = openai.OpenAI(
base_url=self.original_openai_url,
api_key=existing_client.api_key,
)This doesn't copy organization, default_headers, max_retries, timeout, or custom http_client settings. Users with Azure OpenAI, custom proxies, or org-scoped keys would get auth errors on retry.
Minimum fix: Also copy organization for OpenAI:
direct_client = openai.OpenAI(
base_url=self.original_openai_url,
api_key=existing_client.api_key,
organization=getattr(existing_client, 'organization', None),
)Fully robust would copy timeout and max_retries too, but api_key + organization covers 95% of configurations.
Finding 6: Multiple concurrent probes in HALF_OPEN [MINOR]
Severity: Low — can miss recovery, self-corrects in 30s
Standard circuit breakers allow only ONE probe in HALF_OPEN. This implementation allows all concurrent calls through:
- Thread A's probe succeeds →
record_success()→ CLOSED - Thread B's probe fails (race) →
record_failure()→ HALF_OPEN, sees state=CLOSED now → records failure_count=1, stays CLOSED
This is fine. But the worse race:
- Thread B's probe fails first →
record_failure()→ HALF_OPEN state → reopens to OPEN - Thread A's probe succeeds →
record_success()→ state is OPEN (not HALF_OPEN) → just resets failure_count, doesn't close
Now the proxy is healthy but the CB stays OPEN for another 30s. Self-corrects on next probe. Acceptable for a local dev tool, but worth noting.
Finding 7: Throwaway client per retry — no cleanup [MINOR]
Severity: Low — resource accumulation under high load
Every retry creates a new openai.OpenAI() / anthropic.Anthropic() client (which creates an httpx.Client with a connection pool). These are never explicitly closed. Under high-frequency OPEN-state retries, clients accumulate until GC.
Optional improvement: Create the throwaway client once in _trip_to_open(), store it, and reuse during OPEN state. Tear down in _close_circuit().
Not a blocker — LLM call frequency is naturally bounded, and GC handles cleanup.
Summary
| # | Finding | Severity | Action |
|---|---|---|---|
| 1 | str | None breaks Python 3.9 |
High (bug) | Fix before merge |
| 2 | APITimeoutError may false-trip |
Medium (design) | Remove from detection or document |
| 3 | Premature record_success on streams |
Medium (bug) | Skip for streaming calls |
| 4 | No test for actual retry path | Medium (test) | Add integration test |
| 5 | Throwaway client only copies api_key |
Low | Copy organization too |
| 6 | Multiple HALF_OPEN probes | Low | Document, acceptable |
| 7 | Throwaway client per retry | Low | Optional optimization |
Verdict: Fix #1 (Python 3.9 compat) before merge — it's a build-breaking bug. Strongly recommend fixing #3 (premature success on streams) too. The rest are non-blocking.
- Fix #1 (HIGH): Add `from __future__ import annotations` for Python 3.9 compat — `str | None` syntax is 3.10+ - Fix #2 (MEDIUM): Remove APITimeoutError from detection — timeouts may originate from slow upstream, not dead proxy. False trips when proxy is healthy but LLM is slow. - Fix #3 (MEDIUM): Skip record_success() for streaming calls — stream object returned doesn't mean data has flowed. Proxy could die mid-stream after failure count was reset. - Fix #4 (MEDIUM): Add retry-path test — verifies direct recorder stores steps correctly after circuit trips to OPEN - Fix #5 (LOW): Copy `organization` in throwaway OpenAI client for org-scoped API key compatibility Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
base_urlstate (addresses review finding Fix CI badge to reference master branch #1 from plan review)"<original> (proxy-fallback)"in local store (finding Add MCP server for AI assistant integration #3)APIConnectionErrorandAPITimeoutError(finding Bump Python SDK to v0.3.0 #5)patch.py:_init_proxy()creates + installs CB,uninit()tears it downImplements Phase 3 from
my-docs/plan-proxy-resilience.md/my-docs/plan-proxy-circuit-breaker.md.Test plan
python3 -m pytest python/tests/test_circuit_breaker.py -v— 23 new tests passpython3 -m pytest python/tests/ -v— 156 passed, 4 skipped, 0 regressionscargo clippy --workspace— zero warningsinit(mode="proxy"), make calls, kill proxy → next call succeeds via direct upstream🤖 Generated with Claude Code