🤖 Fix test flake by simplifying prompt and clarifying unlimited steps #406

ammar-agent · 2025-10-24T01:29:05Z

Problem

The openai-web-search.test.ts integration test was flaking in CI with timeouts after 120+ seconds:

- Stream emitted 100+ events but never completed with stream-end
- Pattern: repeated reasoning-delta → reasoning-end → tool-call-start → tool-call-end cycles
- 15 tool calls observed before timeout
- Test failed on all 3 retry attempts
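
A minimal sketch of how the symptom above could be checked from a captured event list. Only the event type names come from this PR. The `StreamEvent` shape and the `summarizeStream` helper are hypothetical and are not part of cmux or the AI SDK.

```typescript
// Hypothetical event shape: only the `type` names are taken from the PR text.
type StreamEventType =
  | "reasoning-delta"
  | "reasoning-end"
  | "tool-call-start"
  | "tool-call-end"
  | "stream-end";

interface StreamEvent {
  type: StreamEventType;
}

interface StreamSummary {
  totalEvents: number;
  toolCalls: number;
  completed: boolean;
}

// Count finished tool calls and record whether stream-end was ever seen.
function summarizeStream(events: StreamEvent[]): StreamSummary {
  let toolCalls = 0;
  let completed = false;
  for (const event of events) {
    if (event.type === "tool-call-end") toolCalls += 1;
    if (event.type === "stream-end") completed = true;
  }
  return { totalEvents: events.length, toolCalls, completed };
}

// Build a trace that mimics the failure: 15 reasoning/tool cycles, no stream-end.
const stuckTrace: StreamEvent[] = [];
for (let i = 0; i < 15; i += 1) {
  stuckTrace.push({ type: "reasoning-delta" });
  stuckTrace.push({ type: "reasoning-end" });
  stuckTrace.push({ type: "tool-call-start" });
  stuckTrace.push({ type: "tool-call-end" });
}

const stuckSummary = summarizeStream(stuckTrace);
console.log(stuckSummary);
```

A summary with a high `toolCalls` count and `completed: false` matches the looping pattern described above. A real trace would contain many more reasoning-delta events per cycle than this synthetic one.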

CI Run: https://github.com/coder/cmux/actions/runs/18766377932/job/53542148133

Root Cause

The test prompt was too complex for a reasoning model:

Find gold price → compute price² → compute Collatz sequence steps to reach 1
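
To make the size of the original task concrete, here is a sketch of the arithmetic the model was asked to perform. The gold price of 4000 is a made-up placeholder, not a value from the test, and `collatzSteps` is a hypothetical helper written for this illustration.

```typescript
// Largest integer a JavaScript number can represent exactly (2^53 - 1).
const MAX_SAFE = 9007199254740991;

// Count how many Collatz steps (n -> n/2 if even, 3n+1 if odd) it takes to reach 1.
function collatzSteps(start: number): number {
  if (start < 1 || start > MAX_SAFE || Math.floor(start) !== start) {
    throw new RangeError("start must be a positive safe integer");
  }
  let n = start;
  let steps = 0;
  while (n !== 1) {
    n = n % 2 === 0 ? n / 2 : 3 * n + 1;
    if (n > MAX_SAFE) {
      throw new RangeError("trajectory exceeded the safe integer range");
    }
    steps += 1;
  }
  return steps;
}

// Placeholder price (assumption): the real value is volatile and came from web_search.
const assumedGoldPrice = 4000;
const priceSquared = assumedGoldPrice * assumedGoldPrice; // 16,000,000

console.log("price squared:", priceSquared);
console.log("Collatz steps to reach 1:", collatzSteps(priceSquared));
```

The computation is trivial for code, but a reasoning model has to work through it token by token, on an input that changes every time the price is looked up again.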

With thinkingLevel: 'high' + web_search, this caused the model to enter excessive tool call loops:

- Searching for gold prices repeatedly (volatile data)
- Extensive reasoning about the huge number (price² is in the millions)
- Never reaching a satisfactory conclusion within 120 seconds

This is NOT a bug in the unlimited steps configuration - models MUST be able to run for hours or even days with unlimited tool calls for autonomous workflows.
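
The trade-off can be sketched with a toy agent loop. This is a simulation under stated assumptions, not the AI SDK's implementation: `runAgentLoop` and `stepLimit` are hypothetical names that only loosely mirror the `stopWhen: stepCountIs(100000)` setting quoted in the commit messages, and the 40-step workload is invented for illustration.

```typescript
// Hypothetical types for illustration only; this is not the AI SDK's API.
interface StepResult {
  finished: boolean;
}

type StopCondition = (stepsTaken: number) => boolean;

interface LoopOutcome {
  stepsTaken: number;
  reason: "finished" | "step-limit";
}

// Stop once the given number of steps has been taken.
const stepLimit = (limit: number): StopCondition => (stepsTaken) =>
  stepsTaken >= limit;

// Run steps until the model finishes on its own or the stop condition fires.
function runAgentLoop(
  step: (index: number) => StepResult,
  stopWhen: StopCondition
): LoopOutcome {
  let stepsTaken = 0;
  while (true) {
    const result = step(stepsTaken);
    stepsTaken += 1;
    if (result.finished) return { stepsTaken, reason: "finished" };
    if (stopWhen(stepsTaken)) return { stepsTaken, reason: "step-limit" };
  }
}

// Invented workload: a legitimate task that needs 40 tool-call rounds.
const longTask = (index: number): StepResult => ({ finished: index === 39 });

const capped = runAgentLoop(longTask, stepLimit(25));
const effectivelyUnlimited = runAgentLoop(longTask, stepLimit(100000));

console.log(capped);
console.log(effectivelyUnlimited);
```

With the low cap the loop is cut off before the task finishes, while the 100k limit lets it run to completion. That is the reason this PR fixes the flake in the test prompt rather than in the step limit.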

Solution

1. Clarified unlimited steps intent: Added a comment explaining that the 100k step limit is intentionally high to support long-running autonomous workflows
2. Simplified test prompt: Changed to a simple weather query + picnic decision
   - Still tests the reasoning + web_search combination
   - Much less likely to cause excessive loops
   - Still validates the original bug fix (itemId errors)
3. Reduced thinking level: Changed from high to medium to avoid excessive deliberation
4. Adjusted timeouts: Reduced from 150s/120s to 120s/90s for the simpler task

Testing

Type checking passes. The test still validates the same bug fix with a more stable prompt.

Generated with cmux

Reasoning models (especially gpt-5-codex) can get stuck in infinite tool call loops when combined with web_search and high reasoning effort. This was causing the openai-web-search.test.ts integration test to timeout after 120+ seconds with 15+ tool calls and no completion.

Root cause: The stream was using `stopWhen: stepCountIs(100000)` which effectively allowed unlimited tool calls. With reasoning models, the model can keep calling tools indefinitely without reaching a final answer.

Fix: Replace unlimited steps with `maxSteps: 25` to prevent infinite loops while still allowing reasonable multi-turn tool use. This value is chosen based on observed failure (15 tool calls) with some buffer.

The AI SDK will now stop the stream after 25 tool call rounds, ensuring the stream completes and emits stream-end even if the model gets stuck.

Fixes: https://github.com/coder/cmux/actions/runs/18766377932

…oops" This reverts commit 5d49733.

The openai-web-search.test.ts was flaking because the prompt (gold price + Collatz sequence computation) was too complex and causing the reasoning model to enter excessive tool call loops that didn't complete within 120s.

Changes:

1. **Clarified unlimited steps intent**: Added comment explaining that models MUST be able to run for hours or days with unlimited tool calls for autonomous workflows. The 100k step limit is intentionally high.
2. **Simplified test prompt**: Changed from complex math (Collatz on price²) to simple weather + picnic decision. This still tests reasoning + web_search combination but is much less likely to cause excessive loops.
3. **Reduced thinking level**: Changed from 'high' to 'medium' to avoid excessive deliberation while still ensuring reasoning is present.
4. **Adjusted timeouts**: Reduced from 150s/120s to 120s/90s since simpler task should complete faster.

The test still validates the original bug fix (itemId errors with reasoning + web_search) but with a more stable prompt that's less likely to timeout.

Fixes: https://github.com/coder/cmux/actions/runs/18766377932

ammar-agent added 3 commits October 23, 2025 20:28

Revert "🤖 Fix test flake: limit tool call steps to prevent infinite l…

40dedb1

…oops" This reverts commit 5d49733.

ammar-agent changed the title ~~🤖 Fix test flake: limit tool call steps to prevent infinite loops~~ 🤖 Fix test flake by simplifying prompt and clarifying unlimited steps Oct 24, 2025

ammario added this pull request to the merge queue Oct 24, 2025

Merged via the queue into main with commit 4c70f5b Oct 24, 2025
13 checks passed

ammario deleted the investigate-workspace-selection-flake branch October 24, 2025 02:18
