fix(eval): surface provider failure cause and harden trajectory redaction by Yiminnn · Pull Request #834 · benchflow-ai/benchflow

Yiminnn · 2026-06-27T02:55:19Z

Summary

Closes the remaining open half of #830 (fix #2 — fix #1 landed in #831): durably surface the upstream provider error cause for triage, and keep it — plus every newly-surfaced failure body — free of leaked secrets.

Provider failure cause (Option A)

The pi/litellm failure callback set error.message = str(response_obj), but on a deterministic provider reject (context-window overflow, etc.) litellm fires the hook with response_obj=None, so the message became the literal "None" and the real cause was lost on teardown. It now falls back to kwargs['exception'], so error.message carries the real upstream reason. This rides the existing redacted llm_trajectory.jsonl pipeline; result.json stays status-code-only, so the #546/#564 secret posture is untouched.

Redaction hardening

Option A now routes the full provider error body into error.message for all failure types, so redact_trajectory_text had to be hardened. Two adversarial verification passes drove these out (each empirically reproduced, then locked with a test):

Escaped-JSON bypass (was leaking): carriers now fire on json.dumps-escaped JSON (\"k\": \"v\") at any escape depth — the exact form Trajectory.to_jsonl produces — not just raw/dict-repr.
Catastrophic O(n²) ReDoS (pre-existing, exposed here): a base64 image field stalled the redactor ~45 s → now <0.5 s. Variable name prefixes are length-capped and the env/JSON carriers are left-anchored.
New coverage: URL userinfo for credential-bearing schemes (incl. un-encoded @ in passwords); sk-benchflow/svcacct/admin/or-v1, gsk_/xai-/r8_/hf_/fw_, and JWT key families; AWS_SECRET_ACCESS_KEY / *_ACCOUNT_KEY; GCP PEM private-key blocks; master_key/private_key carriers; AWS SigV4 / Azure SAS query params.
Over-redaction guards: kebab slugs (sk-proj-refactor-...), primary_key/foreign_key, ?key=name, short ?sig=<version>, vscode:// deep-links, eyJ-prefixed method chains.

Reproduction → Verification

# Before (issue #830): cause lost
error.message = "None"   (response_obj=None; real reason only in the deleted proxy stderr.log)

# After: cause preserved (redacted), result.json unchanged
error.message = "...Requested token count exceeds the model's maximum context length of 16384 tokens..."

Test plan

pytest tests/trajectories/test_redaction.py tests/test_litellm_logging.py — 108 passed (incl. escaped-JSON, double-escape, ReDoS timing guards, AWS/GCP key vectors, over-redaction guards)
Full suite: 4528 passed (4 pre-existing unrelated failures, confirmed on the clean base)
ruff check + ruff format --check + ty check clean
Two adversarial verification passes (leak / over-redaction / ReDoS), all findings fixed + re-verified

A secret value containing a literal & truncates at the & (query-sibling protection prioritized; real keys don't contain &).
SigV4 Signature= inside an Authorization header (over-redaction risk on the common word "signature").
Non-allowlisted credential URL schemes (e.g. clickhouse://, kafka://).

🤖 Generated with Claude Code

…tion Issue #830 fix#2: durably surface the upstream provider error cause for triage, and keep it (plus all newly-surfaced failure bodies) free of leaked secrets. Provider failure cause (Option A): - The pi/litellm failure callback now falls back to kwargs['exception'] when response_obj is None, so error.message carries the real upstream reason (e.g. a context-window overflow) instead of the literal 'None'. It rides the existing redacted llm_trajectory.jsonl pipeline; result.json stays status-code-only, so the #546/#564 posture is preserved. Redaction hardening (Option A now surfaces full provider bodies for ALL failure types, so redact_trajectory_text must cover them safely): - Carriers fire on json.dumps-escaped JSON (\"k\": \"v\") at any escape depth -- the form Trajectory.to_jsonl produces -- not just raw/dict-repr. - Fix catastrophic O(n^2) ReDoS on long alnum runs (a base64 image field stalled the redactor ~45s): length-cap the variable name prefixes and left-anchor the env/JSON carriers. - New coverage: URL userinfo for credential-bearing schemes (incl. un-encoded '@' in passwords), sk-benchflow/svcacct/admin/or-v1 and gsk_/xai-/r8_/hf_/fw_ and JWT key families, AWS_SECRET_ACCESS_KEY / *_ACCOUNT_KEY, GCP PEM private-key blocks, master_key/private_key carriers, AWS SigV4 / Azure SAS query params. - Avoid over-redaction: kebab slugs (sk-proj-refactor-...), primary_key / foreign_key, ?key=name, short ?sig=<version>, vscode:// deep-links, and eyJ-prefixed method chains. Refs #830 (fix#1 landed in #831). Verified: 108 redaction/logging tests incl. ReDoS timing guards; full suite 4528 passed; two adversarial verification passes; ruff + ty clean.

greptile-apps · 2026-06-27T03:04:12Z

Greptile Summary

This PR surfaces upstream provider failure reasons (e.g. context-window overflow) that were previously swallowed as the literal string \"None\" when litellm fires the failure callback with response_obj=None. It also substantially hardens the trajectory redaction layer against the new surface area introduced by routing full provider error bodies into error.message.

Provider failure fix: _failure_detail() falls back to kwargs['exception'] when response_obj is None, so error.type/message carry the real upstream reason. The existing response field is unchanged (stays null), and result.json remains status-code-only.
Redaction hardening: New and updated patterns cover escaped-JSON bypass (via _ESCQ), ReDoS on base64 image fields (via length-capped _NAME), URL userinfo credentials, curated query-param secrets, JWT tokens, PEM key blocks, and a dozen additional provider key families (Groq, xAI, Replicate, HuggingFace, Fireworks, AWS, Azure). Extensive parametrized tests—including ReDoS timing guards and over-redaction guards—are added for each.

Confidence Score: 4/5

Safe to merge; the functional change is a one-liner fallback and the redaction hardening is extensively tested with both leak and over-redaction guards.

The provider-failure fix is minimal and correct — _failure_detail uses is not None (not truthiness), so falsy-but-valid response objects are handled safely. The redaction expansion is complex but every new pattern is paired with a parametrized test that verifies both the positive (secret redacted) and negative (non-secret preserved) case, plus ReDoS timing guards. The two comments are style-level: a comment that overstates the escape-depth bound, and a scenario where traceback.format_exc() could diverge from error.message if litellm clears the exception context before invoking the async callback.

src/benchflow/trajectories/types.py — the redaction pattern list is now long and intricate; future additions should continue to follow the length-capped, left-anchored pattern established here to avoid re-introducing ReDoS.

Important Files Changed

Filename	Overview
src/benchflow/providers/litellm_logging.py	Adds _failure_detail() inside the embedded callback module source to fall back to kwargs['exception'] when response_obj is None; async_log_failure_event is updated to use it for error.type/message while keeping the response field unchanged.
src/benchflow/trajectories/types.py	Adds _ESCQ / _SECVAL / _NAME helper atoms and substantially expands _REDACTION_PATTERNS with PEM key, JWT, URL userinfo, query-param, and provider key family patterns. ReDoS is addressed with length caps and left-anchors throughout.
tests/test_litellm_logging.py	Adds _callback_namespace() helper and four new tests covering _failure_detail() unit behavior and an end-to-end failure-event write path; async test is correctly handled by the project's asyncio_mode=auto config.
tests/trajectories/test_redaction.py	Adds comprehensive parametrized tests for all new redaction vectors (25+ leak cases, 20+ over-redaction guards, double-escape path, and ReDoS timing guards). Coverage appears exhaustive for the documented threat model.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant LiteLLM
    participant CB as async_log_failure_event
    participant FD as _failure_detail
    participant RD as redact_trajectory_text
    participant JL as llm_trajectory.jsonl

    LiteLLM->>CB: "response_obj=None, kwargs[exception]=ContextWindowError"
    CB->>FD: "response_obj=None, exception=ContextWindowError"
    FD-->>CB: "detail = ContextWindowError (fix: was None before)"
    CB->>JL: "write {error.type=ContextWindowError, error.message=...tokens...}"
    Note over CB,JL: "response field stays null"

    LiteLLM->>CB: "response_obj=ErrorResponse"
    CB->>FD: "response_obj=ErrorResponse"
    FD-->>CB: "detail = ErrorResponse (existing path unchanged)"
    CB->>JL: "write {error.type=ErrorResponse, ...}"

    JL->>RD: "json.dumps-escaped line with potential secrets"
    Note over RD: "_ESCQ handles backslash-escaped quotes"
    Note over RD: "_NAME cap prevents ReDoS"
    RD-->>JL: "line with ***REDACTED*** substitutions"

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant LiteLLM
    participant CB as async_log_failure_event
    participant FD as _failure_detail
    participant RD as redact_trajectory_text
    participant JL as llm_trajectory.jsonl

    LiteLLM->>CB: "response_obj=None, kwargs[exception]=ContextWindowError"
    CB->>FD: "response_obj=None, exception=ContextWindowError"
    FD-->>CB: "detail = ContextWindowError (fix: was None before)"
    CB->>JL: "write {error.type=ContextWindowError, error.message=...tokens...}"
    Note over CB,JL: "response field stays null"

    LiteLLM->>CB: "response_obj=ErrorResponse"
    CB->>FD: "response_obj=ErrorResponse"
    FD-->>CB: "detail = ErrorResponse (existing path unchanged)"
    CB->>JL: "write {error.type=ErrorResponse, ...}"

    JL->>RD: "json.dumps-escaped line with potential secrets"
    Note over RD: "_ESCQ handles backslash-escaped quotes"
    Note over RD: "_NAME cap prevents ReDoS"
    RD-->>JL: "line with ***REDACTED*** substitutions"

Comments Outside Diff (1)

src/benchflow/providers/litellm_logging.py, line 225-239 (link)

traceback field may not align with the newly-surfaced error.message

traceback.format_exc() captures the exception that is currently active at the moment the callback fires. For the primary case this PR fixes — litellm calling the hook from within a try/except with response_obj=None — the active exception is the same object stored in kwargs['exception'], so traceback and error.message will agree. However, if litellm's plumbing clears the exception context before invoking the callback (or routes through an async boundary that resets it), format_exc() returns "NoneType: None\n" while error.message now carries the real cause. A reader of the log then sees a meaningful error.message but a useless empty traceback. Adding traceback.format_exception(detail) as a fallback when format_exc() returns the None sentinel would make the two fields consistent.

_{Reviews (1): Last reviewed commit: "fix(eval): surface provider failure caus..." | Re-trigger Greptile}

greptile-apps · 2026-06-27T03:04:15Z

+# form is how a provider error body embedded in ``error.message`` reaches the
+# redactor after ``Trajectory.to_jsonl`` runs ``json.dumps`` over the record
+# before redacting it (#830). The ``{0,8}`` cap keeps it ReDoS-bounded.
+_ESCQ = r'\\{0,8}["\']?'


_ESCQ comment overstates escape-depth coverage

The comment says "at any escape depth" but \\{0,8} tops out at 8 backslashes, which reliably covers ~3 levels of json.dumps nesting (level 1 → 1 backslash per quote, level 2 → 3, level 3 → 7). A fourth json.dumps would produce 15 backslashes and would not be matched. In practice the deepest realistic nesting is a single to_jsonl pass over a record that already contains an escaped-JSON string (level 2), so the bound is perfectly adequate — the comment should say "up to ~3 nesting levels" rather than "any nesting depth" to avoid misleading future maintainers who widen the patterns.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

…age (#835) Follow-up to #834 (Greptile P2). Add _failure_traceback(detail): when format_exc() is the 'NoneType: None' sentinel (litellm cleared the exception context) but detail is the cause recovered from kwargs['exception'], format that exception directly so the traceback isn't blank under a meaningful error.message.

Yiminnn temporarily deployed to pypi-internal-preview June 27, 2026 02:55 — with GitHub Actions Inactive

greptile-apps Bot reviewed Jun 27, 2026

View reviewed changes

Yiminnn mentioned this pull request Jun 27, 2026

Context-length errors classified as provider_error/transient (HTTP 500) → 12× retry storm + Router cooldown on a permanent 400 #830

Closed

Yiminnn merged commit 35ebe39 into main Jun 27, 2026
8 checks passed

Yiminnn mentioned this pull request Jun 27, 2026

fix(eval): keep failure traceback consistent with surfaced error.message #835

Merged

2 tasks

bingran-you mentioned this pull request Jun 27, 2026

Add train convert Prime SFT export #828

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(eval): surface provider failure cause and harden trajectory redaction#834

fix(eval): surface provider failure cause and harden trajectory redaction#834
Yiminnn merged 1 commit into
mainfrom
fix/830-proxy-failure-detail

Yiminnn commented Jun 27, 2026

Uh oh!

greptile-apps Bot commented Jun 27, 2026 •

edited

Loading

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Jun 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Yiminnn commented Jun 27, 2026

Summary

Provider failure cause (Option A)

Redaction hardening

Reproduction → Verification

Test plan

Related

Consciously deferred (LOW / noted)

Uh oh!

greptile-apps Bot commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented Jun 27, 2026 •

edited

Loading