fix(eval): emit llm_trajectory.jsonl for streaming claude-agent-acp rollouts #839
Merged
…ollouts

LiteLLM routes openai/-prefixed upstreams (e.g. the vllm provider) through its Responses-API adapter for the anthropic /v1/messages route. That adapter's streaming path never fires the success callback, so streaming claude-agent-acp rollouts wrote no trajectory/llm_trajectory.jsonl on the success path (0/50), while pi (/v1/chat/completions) and the error path were unaffected.

Set litellm.use_chat_completions_url_for_anthropic_messages so the bridge uses /chat/completions, whose streaming path logs success correctly.

Closes #833
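To make the failure mode concrete, here is a minimal sketch of why a silent success callback leaves no file behind. It is not BenchFlow's actual writer: `write_llm_trajectory`, its signature, and the record shape are hypothetical stand-ins, and the only behavior borrowed from the PR is the early return on an empty exchange list.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def write_llm_trajectory(exchanges: list[dict], rollout_dir: Path) -> Path | None:
    """Hypothetical stand-in for the writer: skip the file when nothing was captured."""
    if not exchanges:
        # No callback records were collected, so there is nothing to write.
        return None
    out = rollout_dir / "trajectory" / "llm_trajectory.jsonl"
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as fh:
        for exchange in exchanges:
            fh.write(json.dumps(exchange) + "\n")
    return out


def demo() -> tuple[bool, bool]:
    """Compare a rollout with no callback records against one with a record."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Streaming success whose callback never fired: no records, no file.
        silent = write_llm_trajectory([], root / "streaming")
        # A route whose success callback did fire: one record, one file.
        logged = write_llm_trajectory(
            [{"route": "/v1/chat/completions"}], root / "logged"
        )
        return silent is None, logged is not None and logged.exists()
```

Calling `demo()` exercises both cases: the empty list yields no file, while a single captured record yields a one-line JSONL file.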
Users Simulation automation review (2026-06-27T12:21Z): ready for this simulation scope. I verified the PR through cost-light CLI/SDK and focused code paths rather than a live expensive model rollout. The LiteLLM proxy flag is wired in the canonical config builder, focused runtime/hardening checks passed, and the trajectory validator still fails closed when Commands/evidence:
Residual risk: the regression test is static config coverage, not a real streamed |
…t-completions flag

The flag is applied via LiteLLM's generic litellm_settings -> setattr path, which does not raise on an unknown key. Assert litellm still exposes use_chat_completions_url_for_anthropic_messages so a future LiteLLM rename fails CI instead of silently regressing #833.
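The guard described in this commit can be illustrated without importing LiteLLM. The sketch below uses `types.SimpleNamespace` objects as stand-ins for the `litellm` module, and `apply_settings` and `missing_settings` are hypothetical helpers, not LiteLLM or BenchFlow functions. It shows that a plain `setattr` loop accepts a stale key without complaint, which is why an explicit `hasattr` check is needed and why it must run before the settings are applied.

```python
from types import SimpleNamespace

FLAG = "use_chat_completions_url_for_anthropic_messages"


def apply_settings(target, settings: dict) -> None:
    """Mimic a generic settings -> setattr path: unknown keys are accepted silently."""
    for key, value in settings.items():
        setattr(target, key, value)


def missing_settings(target, settings: dict) -> list:
    """Return the setting names the target does not already expose."""
    return [key for key in settings if not hasattr(target, key)]


settings = {FLAG: True}

# Stand-in for a library version that still exposes the flag.
current = SimpleNamespace(**{FLAG: False})
# Stand-in for a future version where the flag was renamed (name is made up).
renamed = SimpleNamespace(some_new_flag_name=False)

# The guard has to run BEFORE applying, because setattr creates the attribute.
ok = missing_settings(current, settings)
stale = missing_settings(renamed, settings)

# Applying to the renamed stand-in does not raise; it just sets a dead attribute
# while the real (renamed) flag keeps its default.
apply_settings(renamed, settings)
```

In a real test the same check would reduce to asserting `hasattr(litellm, FLAG)`, assuming the attribute is defined at module level as the commit message implies.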
Summary
claude-agent-acp did not write trajectory/llm_trajectory.jsonl for successful rollouts (0/50), while pi did (44/44) and claude-acp's error path did (11/17). This breaks trajectory parity and makes successful claude rollouts unusable for any LLM-trajectory consumer (export / RL training).

Root cause

The per-rollout LiteLLM proxy captures the LLM trajectory from its litellm_settings.callbacks success/failure events. BenchFlow's vllm/ provider maps the upstream to openai/<model>. For the anthropic /v1/messages route against an openai/-prefixed upstream, LiteLLM 1.89 bridges through the OpenAI Responses API (/responses), and its Responses streaming adapter never invokes the success callback, so streaming /v1/messages success produces zero callback records → empty server.trajectory.exchanges → _write_llm_trajectory early-returns and writes nothing.

- claude-agent-acp (Claude Code) streams /v1/messages → hits the gap (0/50 on success).
- pi calls /v1/chat/completions directly → streaming success logs fine (44/44).
- post_call_failure_hook (failures do log) → that's the 11/17 on the error path.

Fix

Set litellm.use_chat_completions_url_for_anthropic_messages via the proxy's litellm_settings, so the anthropic /v1/messages bridge uses /chat/completions, the universally-supported endpoint whose streaming path logs success correctly. Native anthropic/ providers don't use the Responses bridge (unaffected); the proxy is BenchFlow-owned and per-rollout, so there is no external blast radius.

Verification

Reproduced with a live-proxy harness (the real proxy started via the runtime + a mock OpenAI-compatible upstream), driving the exact openai/ route the vllm/ provider produces:

[Results table lost in extraction; surviving endpoint cells: /responses, /chat/completions, /chat/completions]

A regression test (test_proxy_config_forces_chat_completions_for_anthropic_messages) asserts the flag is present in the generated proxy config. tests/test_litellm_config.py + tests/test_litellm_hardening.py pass (57 passed).

Closes #833
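As a rough illustration of the static config coverage mentioned under Verification, the sketch below builds a LiteLLM-style proxy config dict and checks that the flag is present. `build_proxy_config` and its parameters are hypothetical and do not reflect BenchFlow's canonical config builder; only the `model_list`, `litellm_params`, and `litellm_settings` keys and the flag name are taken from LiteLLM's proxy config format and this PR. Callback wiring is omitted because the PR does not name the callback.

```python
FLAG = "use_chat_completions_url_for_anthropic_messages"


def build_proxy_config(model_alias: str, upstream_model: str, api_base: str) -> dict:
    """Hypothetical builder: route an alias to an openai/-prefixed upstream."""
    return {
        "model_list": [
            {
                "model_name": model_alias,
                "litellm_params": {
                    # The vllm/ provider maps the upstream to openai/<model>.
                    "model": f"openai/{upstream_model}",
                    "api_base": api_base,
                },
            }
        ],
        # Applied by the proxy onto the litellm module; forces the
        # /v1/messages bridge onto /chat/completions.
        "litellm_settings": {FLAG: True},
    }


def config_forces_chat_completions(config: dict) -> bool:
    """Static check: the flag must be present and enabled in litellm_settings."""
    return config.get("litellm_settings", {}).get(FLAG) is True


# Placeholder values for illustration only.
example = build_proxy_config("agent-model", "my-model", "http://localhost:8000/v1")
```

A check like `config_forces_chat_completions(example)` only proves the flag reaches the generated config; it does not exercise a streamed request, which matches the residual risk noted in the review.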