Fix structured output leaks in tool-calling loops#5897
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughExecutor code now disables structured response_model when tools are involved; core and lite agents attempt direct JSON validation via response_format.model_validate_json(...) (catching ValidationError) before falling back to Converter-based parsing. Test cassettes and a unit test were updated for Gemini 2.5 and tool-driven flows. ChangesResponse Model Handling and Structured Output Parsing
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
lib/crewai/tests/agents/test_agent_executor.py (2)
263-263: ⚡ Quick winUse
.get()for more robust assertion.Same concern as line 241: direct dictionary access with
call_args.kwargs["response_model"]will raise KeyError if the key doesn't exist. Using.get()makes the test more robust.🔧 Proposed refactor for robustness
- assert get_llm_response_mock.call_args.kwargs["response_model"] is None + assert get_llm_response_mock.call_args.kwargs.get("response_model") is None🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/crewai/tests/agents/test_agent_executor.py` at line 263, The test is asserting get_llm_response_mock.call_args.kwargs["response_model"] which will raise KeyError if the key is missing; change the assertion to use .get("response_model") on get_llm_response_mock.call_args.kwargs (i.e., replace direct indexing with call_args.kwargs.get("response_model")) so the test asserts that the value is None and won't error if the key is absent.
241-241: ⚡ Quick winUse
.get()for more robust assertion.Direct dictionary access with
call_args.kwargs["response_model"]will raise KeyError if the key doesn't exist. Using.get()makes the test more robust and provides clearer failure messages if the implementation changes to omit the parameter entirely.🔧 Proposed refactor for robustness
- assert get_llm_response_mock.call_args.kwargs["response_model"] is None + assert get_llm_response_mock.call_args.kwargs.get("response_model") is None🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/crewai/tests/agents/test_agent_executor.py` at line 241, The test uses direct dict access get_llm_response_mock.call_args.kwargs["response_model"], which will raise KeyError if the key is missing; change the assertion to use .get("response_model") on get_llm_response_mock.call_args.kwargs (e.g., assert get_llm_response_mock.call_args.kwargs.get("response_model") is None) so the test is robust to the key being absent and yields clearer failure messages while still verifying the intended None value for response_model.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@lib/crewai/tests/agents/test_agent_executor.py`:
- Line 263: The test is asserting
get_llm_response_mock.call_args.kwargs["response_model"] which will raise
KeyError if the key is missing; change the assertion to use
.get("response_model") on get_llm_response_mock.call_args.kwargs (i.e., replace
direct indexing with call_args.kwargs.get("response_model")) so the test asserts
that the value is None and won't error if the key is absent.
- Line 241: The test uses direct dict access
get_llm_response_mock.call_args.kwargs["response_model"], which will raise
KeyError if the key is missing; change the assertion to use
.get("response_model") on get_llm_response_mock.call_args.kwargs (e.g., assert
get_llm_response_mock.call_args.kwargs.get("response_model") is None) so the
test is robust to the key being absent and yields clearer failure messages while
still verifying the intended None value for response_model.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 4139e894-c081-432d-a14d-3b0314c8f948
📒 Files selected for processing (5)
lib/crewai/src/crewai/agent/core.pylib/crewai/src/crewai/agents/crew_agent_executor.pylib/crewai/src/crewai/experimental/agent_executor.pylib/crewai/src/crewai/lite_agent.pylib/crewai/tests/agents/test_agent_executor.py
…-47-structured-output-tools
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@lib/crewai/scripts/structured_output_with_tools_runner.py`:
- Around line 44-46: Replace the direct print call with the project-approved
logger: after calling StructuredOutputFlow().kickoff() and obtaining result (and
its result.pydantic), log the output via the configured logging facility instead
of print; locate the main block where StructuredOutputFlow.kickoff() is invoked
and swap the print(result.pydantic) for a logger call (e.g., use the module or
package logger) so the script satisfies the Ruff T201 lint rule.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: a8037508-a1b0-47e9-9ba3-41a6c9aa0258
📒 Files selected for processing (3)
lib/crewai/scripts/structured_output_with_tools_runner.pylib/crewai/src/crewai/agent/core.pylib/crewai/src/crewai/lite_agent.py
- Changed the model from "claude-3-5-haiku-20241022" to "claude-sonnet-4-6" in the test setup. - Updated the request and response formats in the YAML test cassette to reflect the new tool structure and improved content formatting. - Adjusted the expected response body to match the new output format from the assistant, including changes in tool usage and response details. - Increased rate limit values in the response headers for better testing scenarios.
- Introduced a custom matcher for Bedrock requests to normalize regional endpoints, ensuring consistent behavior across AWS regions. - Updated the VCR configuration to utilize the new matcher. - Adjusted test cassette to replace the original Bedrock endpoint with a placeholder for improved testing consistency. - Modified response body and headers in the test cassette to reflect updated expected values.
| content=f"Failed to parse output into response format after retries: {e.message}", | ||
| color="yellow", | ||
| ) | ||
|
|
There was a problem hiding this comment.
LiteAgent still leaks structured output
Medium Severity
This PR disables passing response_model during tool loops in CrewAgentExecutor and the experimental AgentExecutor, but LiteAgent._invoke_loop still passes the full response_model on every get_llm_response call even when tools are configured. Structured-output enforcement can still run alongside ReAct tool steps in LiteAgent, which is the leak the rest of the change set is meant to prevent.
Reviewed by Cursor Bugbot for commit ab93679. Configure here.
| "match_on": ["method", "scheme", "host", "port", "path"], | ||
| "match_on": ["method", "scheme", "bedrock_host", "port", "path"], |
There was a problem hiding this comment.
This is going to make the cassettes more flaky
…-47-structured-output-tools
| """Register custom VCR matchers for each test cassette session.""" | ||
| vcr.register_matcher("bedrock_host", bedrock_host_matcher) | ||
|
|
||
|
|
There was a problem hiding this comment.
Bedrock VCR matcher unused
Medium Severity
A custom bedrock_host VCR matcher is registered in pytest_recording_configure, but vcr_config still uses default match_on with host. Replay therefore compares raw regional Bedrock hostnames instead of the normalized placeholder, so cassettes recorded in one AWS region fail in CI or locally when another region is used.
Reviewed by Cursor Bugbot for commit efed5fc. Configure here.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 88bd2b9. Configure here.
| status_code = response.get("status", {}).get("code") | ||
| if isinstance(status_code, int) and status_code >= 400: | ||
| # Avoid persisting auth/model errors when re-recording without valid AWS creds. | ||
| return None |
There was a problem hiding this comment.
Filtering all 400+ responses breaks error-path cassettes
Medium Severity
The new _filter_response_headers logic unconditionally drops all HTTP responses with status code >= 400 from VCR cassette recordings, not just Bedrock auth errors. Any test that exercises error-handling paths (rate limiting, validation errors, 404s, etc.) across any provider will silently lose those recorded responses. When tests replay, the missing interaction will cause CannotSendRequest or cassette-mismatch failures. The comment says this is to "avoid persisting auth/model errors when re-recording without valid AWS creds," but the filter applies globally to all providers and all error types.
Reviewed by Cursor Bugbot for commit 88bd2b9. Configure here.


Note
Medium Risk
Changes core agent executor LLM call behavior for agents that combine tools and response_format; mitigated by targeted logic and new unit tests, but affects production kickoff paths.
Overview
Agents with tools and a task response format no longer ask the LLM for structured output on every step in ReAct or native tool loops—
response_modelis cleared wheneveroriginal_toolsis set, and native tool rounds always passresponse_model=None. Structured shaping is intended for the final answer path, not mid-loop tool turns.Post-run formatting in
AgentandLiteAgentnow triesmodel_validate_jsonon the raw output first, then falls back to the existing Converter path if validation fails.Test/VCR plumbing normalizes Bedrock regional hosts for cassette matching, redacts
x-amz-security-token, skips recording HTTP ≥400 responses, and refreshes Bedrock/Anthropic cassettes (e.g. Claude Sonnet 4.6). New unit tests assertresponse_modelis not passed during tool loops.Reviewed by Cursor Bugbot for commit c853196. Bugbot is set up for automated code reviews on this repo. Configure here.
Summary by CodeRabbit
Bug Fixes
Tests
Chores