fix(eval): emit llm_trajectory.jsonl for streaming claude-agent-acp rollouts#839

Merged
Yiminnn merged 2 commits into main from fix/acp-anthropic-messages-trajectory
Jun 27, 2026

@Yiminnn Yiminnn commented Jun 27, 2026

Collaborator

Summary

claude-agent-acp did not write trajectory/llm_trajectory.jsonl for successful rollouts (0/50), while pi did (44/44) and claude-acp's error path did (11/17). This breaks trajectory parity and makes successful claude rollouts unusable for any LLM-trajectory consumer (export / RL training).

Root cause

The per-rollout LiteLLM proxy captures the LLM trajectory from its litellm_settings.callbacks success/failure events. BenchFlow's vllm/ provider maps the upstream to openai/<model>. For the anthropic /v1/messages route against an openai/-prefixed upstream, LiteLLM 1.89 bridges through the OpenAI Responses API (/responses), and its Responses streaming adapter never invokes the success callback. A successful streaming /v1/messages call therefore produces zero callback records, server.trajectory.exchanges stays empty, and _write_llm_trajectory early-returns and writes nothing.

  • claude-agent-acp (Claude Code) streams /v1/messages → hits the gap (0/50 on success).
  • pi calls /v1/chat/completions directly → streaming success logs fine (44/44).
  • claude-acp errors go through post_call_failure_hook (failures do log) → that's the 11/17 on the error path.
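
The failure mode above can be sketched as follows. This is a minimal illustration, not BenchFlow's code: `TrajectoryRecorder`, `on_success`, `on_failure` and `write_llm_trajectory` are hypothetical names standing in for the proxy callback and the `_write_llm_trajectory` step. Only the early return on an empty exchange list mirrors the description.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


class TrajectoryRecorder:
    """Hypothetical stand-in for the proxy-side callback that collects exchanges."""

    def __init__(self) -> None:
        self.exchanges: list[dict] = []

    def on_success(self, request: dict, response: dict) -> None:
        # Only reached when the proxy actually fires its success callback.
        self.exchanges.append({"request": request, "response": response, "ok": True})

    def on_failure(self, request: dict, error: str) -> None:
        self.exchanges.append({"request": request, "error": error, "ok": False})


def write_llm_trajectory(recorder: TrajectoryRecorder, rollout_dir: Path) -> Path | None:
    """Write one JSON object per exchange; write nothing if none were captured."""
    if not recorder.exchanges:
        return None  # the early return that leaves a rollout without a file
    out = rollout_dir / "trajectory" / "llm_trajectory.jsonl"
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as fh:
        for exchange in recorder.exchanges:
            fh.write(json.dumps(exchange) + "\n")
    return out
```

If the success callback is never invoked, `recorder.exchanges` stays empty and no file appears, even though the model call itself succeeded.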

Fix

Set litellm.use_chat_completions_url_for_anthropic_messages via the proxy's litellm_settings, so the anthropic /v1/messages bridge uses /chat/completions — the universally-supported endpoint whose streaming path logs success correctly. Native anthropic/ providers don't use the Responses bridge (unaffected); the proxy is BenchFlow-owned and per-rollout, so there is no external blast radius.
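
A minimal sketch of what the generated proxy config could look like with the flag set. `build_proxy_config` and the `trajectory_logger.instance` callback path are hypothetical and assumed for illustration. Only the `litellm_settings` flag name and the `openai/<model>` mapping come from the description above.

```python
import json


def build_proxy_config(model: str, api_base: str) -> dict:
    """Hypothetical builder for a per-rollout LiteLLM proxy config."""
    return {
        "model_list": [
            {
                "model_name": model,
                "litellm_params": {
                    # The vllm/ provider maps the upstream to openai/<model>.
                    "model": f"openai/{model}",
                    "api_base": api_base,
                },
            }
        ],
        "litellm_settings": {
            # Hypothetical callback path; the real one lives in BenchFlow.
            "callbacks": ["trajectory_logger.instance"],
            # Route the anthropic /v1/messages bridge through /chat/completions.
            "use_chat_completions_url_for_anthropic_messages": True,
        },
    }


config = build_proxy_config("my-model", "http://localhost:8000/v1")
print(json.dumps(config["litellm_settings"], indent=2))
```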

Verification

Reproduced with a live-proxy harness (the real proxy started via the runtime + a mock OpenAI-compatible upstream), driving the exact openai/ route the vllm/ provider produces:

| case (openai/ route = #833 config) | upstream hit | success rec | exchanges | file |
| --- | --- | --- | --- | --- |
| messages-stream (claude-acp), before | /responses | 0 | 0 | |
| messages-stream (claude-acp), after | /chat/completions | 1 | 1 | |
| chat-stream (pi), after | /chat/completions | 1 | 1 | ✓ (unchanged) |

A regression test (test_proxy_config_forces_chat_completions_for_anthropic_messages) asserts the flag is present in the generated proxy config. tests/test_litellm_config.py + tests/test_litellm_hardening.py pass (57 passed).

Closes #833

…ollouts

LiteLLM routes openai/-prefixed upstreams (e.g. the vllm provider) through its
Responses-API adapter for the anthropic /v1/messages route. That adapter's
streaming path never fires the success callback, so streaming claude-agent-acp
rollouts wrote no trajectory/llm_trajectory.jsonl on the success path (0/50),
while pi (/v1/chat/completions) and the error path were unaffected.

Set litellm.use_chat_completions_url_for_anthropic_messages so the bridge uses
/chat/completions, whose streaming path logs success correctly.

Closes #833
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview June 27, 2026 06:34 — with GitHub Actions Inactive
@bingran-you added labels on Jun 27, 2026: bug (Something isn't working); P1 (Important debt: must fix soon, but does not block the current release); status:ready (Triaged, unassigned, available to claim); review:pending (PR is ready-for-review, no reviewer engagement yet); area:eval (Issue / PR lives primarily in the "eval" subsystem); area:diagnostics (Issue / PR lives primarily in the "diagnostics" subsystem).
@bingran-you

Collaborator

Users Simulation automation review (2026-06-27T12:21Z): ready for this simulation scope.

I verified the PR through cost-light CLI/SDK and focused code paths rather than a live expensive model rollout. The LiteLLM proxy flag is wired in the canonical config builder, focused runtime/hardening checks passed, and the trajectory validator still fails closed when trajectory/llm_trajectory.jsonl is missing.

Commands/evidence:

  • uv run pytest tests/test_litellm_config.py -q
  • uv run pytest tests/test_litellm_runtime.py -q
  • uv run pytest tests/test_litellm_hardening.py -q
  • uv run ruff check src/benchflow/providers/litellm_config.py tests/test_litellm_config.py
  • uv run ty check src/benchflow/providers/litellm_config.py
  • CLI/SDK dry paths: benchflow --help, benchflow eval run --help, benchflow eval list --help, benchflow skills --help, benchflow hub check --help, benchflow sandbox --help, Python SDK import smoke.
  • validate_run_artifacts.py passed on a healthy synthetic rollout and failed closed after removing llm_trajectory.jsonl.
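
For readers unfamiliar with the term, "fails closed" in the last check can be sketched like this. `validate_rollout` and `MissingTrajectoryError` are hypothetical names, not the contents of validate_run_artifacts.py, and treating an empty file as a failure is an assumption of this sketch.

```python
import tempfile
from pathlib import Path


class MissingTrajectoryError(RuntimeError):
    """Raised when a rollout directory lacks a usable LLM trajectory."""


def validate_rollout(rollout_dir: Path) -> int:
    """Return the number of trajectory records, raising if the file is absent or empty."""
    path = rollout_dir / "trajectory" / "llm_trajectory.jsonl"
    if not path.is_file():
        # Fail closed: a missing file is an error, never a silent pass.
        raise MissingTrajectoryError(f"missing {path}")
    records = [line for line in path.read_text(encoding="utf-8").splitlines() if line.strip()]
    if not records:
        # Assumption of this sketch: an empty file is as unusable as a missing one.
        raise MissingTrajectoryError(f"empty {path}")
    return len(records)
```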

Residual risk: the regression test is static config coverage, not a real streamed claude-agent-acp provider canary, so this does not prove live Claude streaming end to end.

…t-completions flag

The flag is applied via LiteLLM's generic litellm_settings -> setattr path,
which does not raise on an unknown key. Assert litellm still exposes
use_chat_completions_url_for_anthropic_messages so a future LiteLLM rename
fails CI instead of silently regressing #833.
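
Why the extra assertion matters can be sketched with a stand-in module. `apply_settings` is a hypothetical simplification of a generic settings-to-attributes loop, and the `SimpleNamespace` objects are placeholders, not the real litellm package.

```python
from types import SimpleNamespace

FLAG = "use_chat_completions_url_for_anthropic_messages"


def apply_settings(module, settings: dict) -> None:
    """Hypothetical simplification: copy each setting onto the module with setattr."""
    for key, value in settings.items():
        setattr(module, key, value)  # does not raise for a key the module never defined


def assert_flag_supported(module, flag: str = FLAG) -> None:
    """Guard to run against the pristine module, before any settings are applied."""
    if not hasattr(module, flag):
        raise AssertionError(f"{flag} is no longer exposed; the setting would be ignored")


# Stand-ins: one library version that defines the flag, one where it was renamed.
current = SimpleNamespace(**{FLAG: False})
renamed = SimpleNamespace(some_new_flag_name=False)

assert_flag_supported(current)  # passes
try:
    assert_flag_supported(renamed)
except AssertionError as exc:
    print(f"guard tripped: {exc}")

# Without the guard, applying the setting to the renamed version succeeds silently.
apply_settings(renamed, {FLAG: True})
```

Against the real package the equivalent check would presumably be a `hasattr` call on the freshly imported module, performed before the proxy applies its settings.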
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview June 27, 2026 18:15 — with GitHub Actions Inactive
@Yiminnn Yiminnn merged commit bc429df into main Jun 27, 2026
9 checks passed
@Yiminnn Yiminnn deleted the fix/acp-anthropic-messages-trajectory branch June 27, 2026 18:23

Development

Successfully merging this pull request may close these issues.

claude-agent-acp does not emit trajectory/llm_trajectory.jsonl for successful rollouts (pi does)

2 participants