fix(eval): surface provider failure cause and harden trajectory redaction #834
…tion
Issue #830 fix#2: durably surface the upstream provider error cause for triage, and keep it (plus all newly-surfaced failure bodies) free of leaked secrets.
Provider failure cause (Option A):
- The pi/litellm failure callback now falls back to kwargs['exception'] when response_obj is None, so error.message carries the real upstream reason (e.g. a context-window overflow) instead of the literal 'None'. It rides the existing redacted llm_trajectory.jsonl pipeline; result.json stays status-code-only, so the #546/#564 posture is preserved.
Redaction hardening (Option A now surfaces full provider bodies for ALL failure types, so redact_trajectory_text must cover them safely):
- Carriers fire on json.dumps-escaped JSON (\"k\": \"v\") at any escape depth -- the form Trajectory.to_jsonl produces -- not just raw/dict-repr.
- Fix catastrophic O(n^2) ReDoS on long alnum runs (a base64 image field stalled the redactor ~45s): length-cap the variable name prefixes and left-anchor the env/JSON carriers.
- New coverage: URL userinfo for credential-bearing schemes (incl. un-encoded '@' in passwords), sk-benchflow/svcacct/admin/or-v1 and gsk_/xai-/r8_/hf_/fw_ and JWT key families, AWS_SECRET_ACCESS_KEY / *_ACCOUNT_KEY, GCP PEM private-key blocks, master_key/private_key carriers, AWS SigV4 / Azure SAS query params.
- Avoid over-redaction: kebab slugs (sk-proj-refactor-...), primary_key / foreign_key, ?key=name, short ?sig=<version>, vscode:// deep-links, and eyJ-prefixed method chains.
Refs #830 (fix#1 landed in #831).
Verified: 108 redaction/logging tests incl. ReDoS timing guards; full suite 4528 passed; two adversarial verification passes; ruff + ty clean.
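The fallback in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in the diff: `failure_detail` and `error_record` are hypothetical stand-ins for the PR's `_failure_detail` and the callback's record-building step, and the `type`/`message` field names are assumed from the `error.type` / `error.message` wording on this page.

```python
from typing import Any, Mapping, Optional


def failure_detail(response_obj: Optional[Any], kwargs: Mapping[str, Any]) -> Optional[Any]:
    """Pick the object that best describes a failed provider call.

    Hypothetical stand-in for the PR's _failure_detail; the real signature may differ.
    """
    # Identity check, not truthiness: an empty-but-real response object still wins.
    if response_obj is not None:
        return response_obj
    # No response object was passed; fall back to the raised exception, if any.
    return kwargs.get("exception")


def error_record(response_obj: Optional[Any], kwargs: Mapping[str, Any]) -> dict:
    """Build the error fields for a trajectory line (field names are assumed)."""
    detail = failure_detail(response_obj, kwargs)
    return {"type": type(detail).__name__, "message": str(detail)}


if __name__ == "__main__":
    # Made-up exception text, standing in for a real provider reject.
    exc = RuntimeError("context window exceeded")
    print(error_record(None, {"exception": exc}))
```

If neither a response object nor an exception is available, this sketch still yields the string "None"; how the real callback handles that case is not shown on this page.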
Greptile Summary
This PR surfaces upstream provider failure reasons (e.g. context-window overflow) that were previously swallowed as the literal string "None".
Confidence Score: 4/5
Safe to merge; the functional change is a one-liner fallback and the redaction hardening is extensively tested with both leak and over-redaction guards.
The provider-failure fix is minimal and correct: _failure_detail uses is not None (not truthiness), so falsy-but-valid response objects are handled safely. The redaction expansion is complex, but every new pattern is paired with a parametrized test that verifies both the positive (secret redacted) and negative (non-secret preserved) case, plus ReDoS timing guards. The two comments are style-level: a comment that overstates the escape-depth bound, and a scenario where traceback.format_exc() could diverge from error.message if litellm clears the exception context before invoking the async callback.
src/benchflow/trajectories/types.py: the redaction pattern list is now long and intricate; future additions should continue to follow the length-capped, left-anchored pattern established here to avoid re-introducing ReDoS.
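To make the "length-capped, left-anchored" advice concrete, here is a small sketch of one env-style carrier written that way. It is an assumption-laden illustration, not a pattern from types.py: the 64-character cap, the `KEY|TOKEN|SECRET` suffix list, and the `redact_env` name are all invented for the example; only the `_NAME` name and the `***REDACTED***` marker appear on this page.

```python
import re
import time

# Assumed cap: the PR caps the name prefix, but the exact bound is not shown here.
_NAME = r"[A-Za-z0-9_]{0,64}"

# Left anchor: a match may only start where the previous char is not a name char,
# so the engine does not retry from every offset inside one long alphanumeric run.
ENV_CARRIER = re.compile(
    r"(?<![A-Za-z0-9_])(" + _NAME + r"(?:KEY|TOKEN|SECRET))=(\S+)"
)


def redact_env(text: str) -> str:
    """Replace the value of NAME_KEY=value style carriers, keeping the name."""
    return ENV_CARRIER.sub(r"\1=***REDACTED***", text)


if __name__ == "__main__":
    print(redact_env("AWS_SECRET_ACCESS_KEY=abc123 done"))
    blob = "A" * 200_000  # stands in for a long base64 field
    start = time.perf_counter()
    redact_env(blob)
    print(f"long run scanned in {time.perf_counter() - start:.3f}s")
```

Without the lookbehind and the cap, an unbounded `[A-Za-z0-9_]*` prefix would be retried from every offset of the run and backtrack across the rest of it each time, which is where the quadratic cost comes from.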
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant LiteLLM
participant CB as async_log_failure_event
participant FD as _failure_detail
participant RD as redact_trajectory_text
participant JL as llm_trajectory.jsonl
LiteLLM->>CB: "response_obj=None, kwargs[exception]=ContextWindowError"
CB->>FD: "response_obj=None, exception=ContextWindowError"
FD-->>CB: "detail = ContextWindowError (fix: was None before)"
CB->>JL: "write {error.type=ContextWindowError, error.message=...tokens...}"
Note over CB,JL: "response field stays null"
LiteLLM->>CB: "response_obj=ErrorResponse"
CB->>FD: "response_obj=ErrorResponse"
FD-->>CB: "detail = ErrorResponse (existing path unchanged)"
CB->>JL: "write {error.type=ErrorResponse, ...}"
JL->>RD: "json.dumps-escaped line with potential secrets"
Note over RD: "_ESCQ handles backslash-escaped quotes"
Note over RD: "_NAME cap prevents ReDoS"
RD-->>JL: "line with ***REDACTED*** substitutions"
# form is how a provider error body embedded in ``error.message`` reaches the
# redactor after ``Trajectory.to_jsonl`` runs ``json.dumps`` over the record
# before redacting it (#830). The ``{0,8}`` cap keeps it ReDoS-bounded.
_ESCQ = r'\\{0,8}["\']?'
_ESCQ comment overstates escape-depth coverage
The comment says "at any escape depth" but \\{0,8} tops out at 8 backslashes, which reliably covers ~3 levels of json.dumps nesting (level 1 → 1 backslash per quote, level 2 → 3, level 3 → 7). A fourth json.dumps would produce 15 backslashes and would not be matched. In practice the deepest realistic nesting is a single to_jsonl pass over a record that already contains an escaped-JSON string (level 2), so the bound is perfectly adequate — the comment should say "up to ~3 nesting levels" rather than "any nesting depth" to avoid misleading future maintainers who widen the patterns.
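The backslash counts in this comment can be checked with a short script. It is a sketch under one assumption: that "nesting level n" means a quote character has passed through n successive json.dumps calls. The pattern text is copied from the quoted `_ESCQ` line; `escaped_quote` is a helper invented for the example.

```python
import json
import re

# Pattern text copied from the _ESCQ fragment quoted in the review thread.
ESCQ = re.compile(r'\\{0,8}["\']?')


def escaped_quote(depth: int) -> str:
    """Return how a single '"' looks after `depth` nested json.dumps passes."""
    text = '"'
    for _ in range(depth):
        # json.dumps wraps the string in quotes; drop them to keep only the payload.
        text = json.dumps(text)[1:-1]
    return text


if __name__ == "__main__":
    for depth in range(1, 5):
        form = escaped_quote(depth)
        backslashes = len(form) - 1  # every char but the final quote is a backslash
        covered = ESCQ.fullmatch(form) is not None
        print(depth, backslashes, covered)
```

Each pass doubles the existing backslashes and adds one for the quote, so depth n carries 2^n - 1 of them: 1, 3, 7, then 15. The first three fit under the cap of 8; the fourth does not.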
…age (#835) Follow-up to #834 (Greptile P2). Add _failure_traceback(detail): when format_exc() is the 'NoneType: None' sentinel (litellm cleared the exception context) but detail is the cause recovered from kwargs['exception'], format that exception directly so the traceback isn't blank under a meaningful error.message.
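A rough sketch of the behaviour this follow-up describes, assuming `detail` is whatever the failure callback recovered (possibly the exception from kwargs['exception']). The function below is a hypothetical stand-in for `_failure_traceback`, not the code from #835.

```python
import traceback
from typing import Optional

# What traceback.format_exc() returns when no exception is being handled.
_NO_EXCEPTION = "NoneType: None"


def failure_traceback(detail: Optional[object]) -> str:
    """Hypothetical stand-in for _failure_traceback(detail)."""
    formatted = traceback.format_exc()
    if formatted.strip() == _NO_EXCEPTION and isinstance(detail, BaseException):
        # No live exception context: format the recovered exception object directly.
        return "".join(
            traceback.format_exception(type(detail), detail, detail.__traceback__)
        )
    # Either an exception is active, or there is nothing better to format.
    return formatted


if __name__ == "__main__":
    # Made-up exception text, standing in for a real provider reject.
    print(failure_traceback(ValueError("context window exceeded")))
```

Called outside an `except` block with a recovered exception, this returns that exception's own formatted text rather than the blank sentinel, so the traceback field stays consistent with error.message.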
Summary
Closes the remaining open half of #830 (fix #2 — fix #1 landed in #831): durably surface the upstream provider error cause for triage, and keep it — plus every newly-surfaced failure body — free of leaked secrets.
Provider failure cause (Option A)
The pi/litellm failure callback set error.message = str(response_obj), but on a deterministic provider reject (context-window overflow, etc.) litellm fires the hook with response_obj=None, so the message became the literal "None" and the real cause was lost on teardown. It now falls back to kwargs['exception'], so error.message carries the real upstream reason. This rides the existing redacted llm_trajectory.jsonl pipeline; result.json stays status-code-only, so the #546/#564 secret posture is untouched.
Redaction hardening
Option A now routes the full provider error body into error.message for all failure types, so redact_trajectory_text had to be hardened. Two adversarial verification passes drove these out (each empirically reproduced, then locked with a test):
- Carriers fire on json.dumps-escaped JSON (\"k\": \"v\") at any escape depth (the exact form Trajectory.to_jsonl produces), not just raw/dict-repr.
- Fix catastrophic O(n^2) ReDoS on long alnum runs (a base64 image field stalled the redactor ~45s): length-cap the variable name prefixes and left-anchor the env/JSON carriers.
- New coverage: URL userinfo for credential-bearing schemes (incl. un-encoded @ in passwords); sk-benchflow/svcacct/admin/or-v1, gsk_/xai-/r8_/hf_/fw_, and JWT key families; AWS_SECRET_ACCESS_KEY / *_ACCOUNT_KEY; GCP PEM private-key blocks; master_key/private_key carriers; AWS SigV4 / Azure SAS query params.
- Avoid over-redaction: kebab slugs (sk-proj-refactor-...), primary_key / foreign_key, ?key=name, short ?sig=<version>, vscode:// deep-links, eyJ-prefixed method chains.
Reproduction → Verification
Test plan
- pytest tests/trajectories/test_redaction.py tests/test_litellm_logging.py: 108 passed (incl. escaped-JSON, double-escape, ReDoS timing guards, AWS/GCP key vectors, over-redaction guards)
- ruff check + ruff format --check + ty check clean
Related
Refs #830 (intentionally left open — a few LOW residuals noted below are deferred). fix #1: #831.
Consciously deferred (LOW / noted)
- … & truncates at the & (query-sibling protection prioritized; real keys don't contain &).
- Signature= inside an Authorization header (over-redaction risk on the common word "signature").
- … clickhouse://, kafka://).