fix(eval): emit llm_trajectory.jsonl for streaming claude-agent-acp rollouts by Yiminnn · Pull Request #839 · benchflow-ai/benchflow

Yiminnn · 2026-06-27T06:34:01Z

Summary

claude-agent-acp did not write trajectory/llm_trajectory.jsonl for successful rollouts (0/50), while pi did (44/44) and claude-acp's error path did (11/17). This breaks trajectory parity and makes successful claude rollouts unusable for any LLM-trajectory consumer (export / RL training).

Root cause

The per-rollout LiteLLM proxy captures the LLM trajectory from the success/failure events delivered to its litellm_settings.callbacks. BenchFlow's vllm/ provider maps the upstream to openai/<model>. For the anthropic /v1/messages route against an openai/-prefixed upstream, LiteLLM 1.89 bridges through the OpenAI Responses API (/responses), and its Responses streaming adapter never invokes the success callback. A successful streaming /v1/messages call therefore produces zero callback records, server.trajectory.exchanges stays empty, and _write_llm_trajectory returns early without writing a file.

claude-agent-acp (Claude Code) streams /v1/messages → hits the gap (0/50 on success).
pi calls /v1/chat/completions directly → streaming success logs fine (44/44).
claude-acp errors go through post_call_failure_hook (failures do log) → that's the 11/17 on the error path.
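
The failure chain above can be illustrated with a minimal sketch. The names here (`Trajectory`, `write_llm_trajectory`, `simulate`) are hypothetical stand-ins, not BenchFlow's actual code; the only behavior taken from the description is that an empty exchange list leads to an early return and no file on disk.

```python
from __future__ import annotations

import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Trajectory:
    """Hypothetical stand-in for server.trajectory: one entry per logged callback."""

    exchanges: list[dict] = field(default_factory=list)


def write_llm_trajectory(trajectory: Trajectory, out_dir: Path) -> Path | None:
    """Write one JSON object per exchange; skip the file when nothing was logged."""
    if not trajectory.exchanges:
        # No success or failure callback fired, so there is nothing to write.
        return None
    path = out_dir / "trajectory" / "llm_trajectory.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        for exchange in trajectory.exchanges:
            fh.write(json.dumps(exchange) + "\n")
    return path


def simulate(route_fires_success_callback: bool) -> bool:
    """Return True when a trajectory file ends up on disk for one streamed call."""
    trajectory = Trajectory()
    if route_fires_success_callback:
        # Assumed record shape, for illustration only.
        trajectory.exchanges.append({"route": "/chat/completions", "status": "success"})
    with tempfile.TemporaryDirectory() as tmp:
        return write_llm_trajectory(trajectory, Path(tmp)) is not None
```

With this model, a route whose streaming path never fires the success callback always ends with no file, regardless of whether the upstream call itself succeeded.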

Fix

Set litellm.use_chat_completions_url_for_anthropic_messages via the proxy's litellm_settings, so the anthropic /v1/messages bridge uses /chat/completions — the universally-supported endpoint whose streaming path logs success correctly. Native anthropic/ providers don't use the Responses bridge (unaffected); the proxy is BenchFlow-owned and per-rollout, so there is no external blast radius.
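
As a rough illustration of where the flag lives, the sketch below builds a proxy config dict with the flag under `litellm_settings`. The helper name `build_proxy_config`, the callback entry, and the model values are placeholders, not BenchFlow's actual config builder, and the value `True` is an assumption. Only the `use_chat_completions_url_for_anthropic_messages` key, the `litellm_settings` and `callbacks` sections, and the `openai/` prefix come from the description above.

```python
from typing import Any

FLAG = "use_chat_completions_url_for_anthropic_messages"


def build_proxy_config(model_alias: str, upstream_model: str, api_base: str) -> dict[str, Any]:
    """Return a LiteLLM-proxy-style config dict (hypothetical builder)."""
    return {
        "model_list": [
            {
                "model_name": model_alias,
                "litellm_params": {
                    # The vllm/ provider is described as mapping to openai/<model>.
                    "model": f"openai/{upstream_model}",
                    "api_base": api_base,
                },
            }
        ],
        "litellm_settings": {
            # Placeholder callback name; the real trajectory logger is not shown in the PR.
            "callbacks": ["trajectory_logger.instance"],
            # Route the anthropic /v1/messages bridge through /chat/completions.
            FLAG: True,
        },
    }
```

A dict like this would typically be serialized to the YAML file the proxy is started with; that serialization step is left out here.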

Verification

Reproduced with a live-proxy harness (the real proxy started via the runtime + a mock OpenAI-compatible upstream), driving the exact openai/ route the vllm/ provider produces:

| case (openai/ route = #833 config) | upstream hit | success rec | exchanges | file |
| --- | --- | --- | --- | --- |
| messages-stream (claude-acp), before | `/responses` | 0 | 0 | ✗ |
| messages-stream (claude-acp), after | `/chat/completions` | 1 | 1 | ✓ |
| chat-stream (pi), after | `/chat/completions` | 1 | 1 | ✓ (unchanged) |

A regression test (test_proxy_config_forces_chat_completions_for_anthropic_messages) asserts the flag is present in the generated proxy config. tests/test_litellm_config.py + tests/test_litellm_hardening.py pass (57 passed).

Closes #833

…ollouts

LiteLLM routes openai/-prefixed upstreams (e.g. the vllm provider) through its Responses-API adapter for the anthropic /v1/messages route. That adapter's streaming path never fires the success callback, so streaming claude-agent-acp rollouts wrote no trajectory/llm_trajectory.jsonl on the success path (0/50), while pi (/v1/chat/completions) and the error path were unaffected.

Set litellm.use_chat_completions_url_for_anthropic_messages so the bridge uses /chat/completions, whose streaming path logs success correctly.

Closes #833

bingran-you · 2026-06-27T12:21:56Z

Users Simulation automation review (2026-06-27T12:21Z): ready for this simulation scope.

I verified the PR through cost-light CLI/SDK and focused code paths rather than a live expensive model rollout. The LiteLLM proxy flag is wired in the canonical config builder, focused runtime/hardening checks passed, and the trajectory validator still fails closed when trajectory/llm_trajectory.jsonl is missing.

Commands/evidence:

uv run pytest tests/test_litellm_config.py -q
uv run pytest tests/test_litellm_runtime.py -q
uv run pytest tests/test_litellm_hardening.py -q
uv run ruff check src/benchflow/providers/litellm_config.py tests/test_litellm_config.py
uv run ty check src/benchflow/providers/litellm_config.py
CLI/SDK dry paths: benchflow --help, benchflow eval run --help, benchflow eval list --help, benchflow skills --help, benchflow hub check --help, benchflow sandbox --help, Python SDK import smoke.
validate_run_artifacts.py passed on a healthy synthetic rollout and failed closed after removing llm_trajectory.jsonl.
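
The fail-closed behavior in the last check can be sketched as follows. This is not the contents of validate_run_artifacts.py, which the PR does not show; `validate_rollout` and `demo` are hypothetical, and the empty-file check is an extra assumption beyond what the review describes.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

REQUIRED = Path("trajectory") / "llm_trajectory.jsonl"


def validate_rollout(rollout_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means the rollout passes."""
    errors: list[str] = []
    path = rollout_dir / REQUIRED
    if not path.is_file():
        # Fail closed: a missing trajectory is an error, never a silent skip.
        errors.append(f"missing {REQUIRED.as_posix()}")
    elif path.stat().st_size == 0:
        # Assumption: an empty file is treated as unusable too.
        errors.append(f"empty {REQUIRED.as_posix()}")
    return errors


def demo() -> tuple[list[str], list[str]]:
    """Validate a healthy synthetic rollout, then the same rollout without the file."""
    with tempfile.TemporaryDirectory() as tmp:
        rollout = Path(tmp)
        target = rollout / REQUIRED
        target.parent.mkdir(parents=True)
        target.write_text('{"status": "success"}\n', encoding="utf-8")
        healthy = validate_rollout(rollout)
        target.unlink()
        broken = validate_rollout(rollout)
    return healthy, broken
```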

Residual risk: the regression test is static config coverage, not a real streamed claude-agent-acp provider canary, so this does not prove live Claude streaming end to end.

…t-completions flag

The flag is applied via LiteLLM's generic litellm_settings -> setattr path, which does not raise on an unknown key. Assert litellm still exposes use_chat_completions_url_for_anthropic_messages so a future LiteLLM rename fails CI instead of silently regressing #833.
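
A guard of the kind this commit describes might look like the sketch below. The helper `missing_settings` is hypothetical, and a `types.SimpleNamespace` stands in for the `litellm` module so the example runs without the dependency; in a real test the module passed in would be the imported `litellm`.

```python
import types
from typing import Iterable

FLAG = "use_chat_completions_url_for_anthropic_messages"


def missing_settings(module: object, names: Iterable[str]) -> list[str]:
    """Return the setting names that the module no longer exposes as attributes."""
    return [name for name in names if not hasattr(module, name)]


def check_flag_still_exists(module: object) -> None:
    """Raise AssertionError if the flag was renamed or removed upstream."""
    missing = missing_settings(module, [FLAG])
    # setattr on an unknown key would succeed silently, so check explicitly.
    assert not missing, f"settings no longer exposed by the module: {missing}"


# Stand-ins for the litellm module, before and after a hypothetical rename.
current = types.SimpleNamespace(**{FLAG: False})
renamed = types.SimpleNamespace(some_other_flag=False)
```

Calling `check_flag_still_exists(current)` passes, while `check_flag_still_exists(renamed)` raises, which is the CI failure the commit message is aiming for.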

Yiminnn temporarily deployed to pypi-internal-preview June 27, 2026 06:34 — with GitHub Actions Inactive

Yiminnn mentioned this pull request Jun 27, 2026

claude-agent-acp does not emit trajectory/llm_trajectory.jsonl for successful rollouts (pi does) #833

Closed

Yiminnn temporarily deployed to pypi-internal-preview June 27, 2026 18:15 — with GitHub Actions Inactive

Yiminnn merged commit bc429df into main Jun 27, 2026
9 checks passed

Yiminnn deleted the fix/acp-anthropic-messages-trajectory branch June 27, 2026 18:23

Yiminnn merged 2 commits into main from fix/acp-anthropic-messages-trajectory
