@ammar-agent ammar-agent commented Oct 24, 2025

Problem

The openai-web-search.test.ts integration test was flaking in CI with timeouts after 120+ seconds:

  • Stream emitted 100+ events but never completed with stream-end
  • Pattern: repeated reasoning-delta → reasoning-end → tool-call-start → tool-call-end cycles
  • 15 tool calls observed before timeout
  • Test failed on all 3 retry attempts
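The symptom above (many tool calls, no terminal event) can be checked mechanically. The following is a minimal sketch, assuming the captured events are available as an array of event-name strings; `summarizeEvents` and `StreamSummary` are hypothetical names for illustration and are not part of the cmux test helpers.

```typescript
// Event names are taken from the PR description; the string-array shape is an assumption.
type StreamSummary = { toolCalls: number; completed: boolean };

// Count tool calls and report whether the stream ever emitted its terminal event.
function summarizeEvents(events: string[]): StreamSummary {
  let toolCalls = 0;
  let completed = false;
  for (const event of events) {
    if (event === "tool-call-start") toolCalls += 1;
    if (event === "stream-end") completed = true;
  }
  return { toolCalls, completed };
}

// Rebuild the observed pattern: 15 repeated cycles with no stream-end.
const cycle = ["reasoning-delta", "reasoning-end", "tool-call-start", "tool-call-end"];
const stuckStream: string[] = [];
for (let i = 0; i < 15; i++) {
  stuckStream.push(...cycle);
}

const stuckSummary = summarizeEvents(stuckStream);
console.log(stuckSummary);
```

A summary with a high `toolCalls` count and `completed: false` matches the failure described here: the model keeps working but never finishes within the test timeout.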

CI Run: https://github.com/coder/cmux/actions/runs/18766377932/job/53542148133

Root Cause

The test prompt was too complex for a reasoning model:

Find gold price → compute price² → compute Collatz sequence steps to reach 1

With thinkingLevel: 'high' + web_search, this caused the model to enter excessive tool call loops:

  • Searching for gold prices repeatedly (volatile data)
  • Extensive reasoning about the huge number (price² is in the millions)
  • Never reaching a satisfactory conclusion within 120 seconds
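For reference, the arithmetic the old prompt asked for is cheap to compute directly, which suggests the cost came from the model's repeated searching and deliberation rather than from the math itself. This is a rough sketch; `collatzSteps` is a hypothetical helper and the price is an illustrative placeholder, not a value taken from the test.

```typescript
// Count how many Collatz steps it takes for `start` to reach 1.
function collatzSteps(start: number): number {
  if (!Number.isSafeInteger(start) || start < 1) {
    throw new RangeError("start must be a positive safe integer");
  }
  let n = start;
  let steps = 0;
  while (n !== 1) {
    n = n % 2 === 0 ? n / 2 : 3 * n + 1;
    // Guard against leaving the range where number arithmetic is exact.
    if (!Number.isSafeInteger(n)) {
      throw new RangeError("intermediate value exceeded the safe integer range");
    }
    steps += 1;
  }
  return steps;
}

// Hypothetical whole-dollar gold price, chosen only for illustration.
const price = 4000;
const squared = price * price; // 16000000
console.log(squared, collatzSteps(squared));
```

Because the live gold price changes between searches, the input to this computation is a moving target, which is consistent with the repeated-search behaviour listed above.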

This is NOT a bug in the unlimited steps configuration - models MUST be able to run for hours or even days with unlimited tool calls for autonomous workflows.

Solution

  1. Clarified unlimited steps intent: Added comment explaining that the 100k step limit is intentionally high to support long-running autonomous workflows

  2. Simplified test prompt: Changed to simple weather query + picnic decision

    • Still tests reasoning + web_search combination
    • Much less likely to cause excessive loops
    • Still validates the original bug fix (itemId errors)
  3. Reduced thinking level: Changed from high to medium to avoid excessive deliberation

  4. Adjusted timeouts: Reduced from 150s/120s to 120s/90s for the simpler task
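To illustrate why the high step limit in item 1 is compatible with a fast test, here is a toy model of a step-capped tool loop. It is not the AI SDK's implementation (the commit message in this PR quotes the real setting as `stopWhen: stepCountIs(100000)`); `runWithStepCap` and `modelStep` are hypothetical names used only for this sketch.

```typescript
type LoopResult = { steps: number; finished: boolean };

// `modelStep` stands in for one reasoning + tool-call round and
// returns true once the model produces a final answer.
function runWithStepCap(
  modelStep: (step: number) => boolean,
  maxSteps: number
): LoopResult {
  for (let step = 1; step <= maxSteps; step++) {
    if (modelStep(step)) {
      return { steps: step, finished: true };
    }
  }
  // The cap was reached without a final answer.
  return { steps: maxSteps, finished: false };
}

// A run that settles on its own stops early, however high the cap is.
const settlesQuickly = runWithStepCap((step) => step >= 3, 100000);

// A run that never settles is only ever stopped by the cap.
const neverSettles = runWithStepCap(() => false, 25);

console.log(settlesQuickly, neverSettles);
```

In this model the cap only matters for runs that never converge, so simplifying the prompt (making the run converge) addresses the flake without lowering the limit that long-running autonomous workflows depend on.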

Testing

Type checking passes. The test still validates the same bug fix with a more stable prompt.


Generated with cmux

Reasoning models (especially gpt-5-codex) can get stuck in infinite tool
call loops when combined with web_search and high reasoning effort. This
was causing the openai-web-search.test.ts integration test to time out
after 120+ seconds with 15+ tool calls and no completion.

Root cause: The stream was using `stopWhen: stepCountIs(100000)` which
effectively allowed unlimited tool calls. With reasoning models, the model
can keep calling tools indefinitely without reaching a final answer.

Fix: Replace unlimited steps with `maxSteps: 25` to prevent infinite loops
while still allowing reasonable multi-turn tool use. This value is chosen
based on observed failure (15 tool calls) with some buffer.

The AI SDK will now stop the stream after 25 tool call rounds, ensuring
the stream completes and emits stream-end even if the model gets stuck.

Fixes: https://github.com/coder/cmux/actions/runs/18766377932

The openai-web-search.test.ts was flaking because the prompt (gold price +
Collatz sequence computation) was too complex and caused the reasoning
model to enter excessive tool call loops that didn't complete within 120s.

Changes:

1. **Clarified unlimited steps intent**: Added comment explaining that models
   MUST be able to run for hours or days with unlimited tool calls for
   autonomous workflows. The 100k step limit is intentionally high.

2. **Simplified test prompt**: Changed from complex math (Collatz on price²)
   to simple weather + picnic decision. This still tests reasoning +
   web_search combination but is much less likely to cause excessive loops.

3. **Reduced thinking level**: Changed from 'high' to 'medium' to avoid
   excessive deliberation while still ensuring reasoning is present.

4. **Adjusted timeouts**: Reduced from 150s/120s to 120s/90s since the simpler
   task should complete faster.

The test still validates the original bug fix (itemId errors with reasoning +
web_search) but with a more stable prompt that's less likely to time out.

Fixes: https://github.com/coder/cmux/actions/runs/18766377932
@ammar-agent ammar-agent changed the title from "🤖 Fix test flake: limit tool call steps to prevent infinite loops" to "🤖 Fix test flake by simplifying prompt and clarifying unlimited steps" Oct 24, 2025
@ammario ammario added this pull request to the merge queue Oct 24, 2025
Merged via the queue into main with commit 4c70f5b Oct 24, 2025
13 checks passed
@ammario ammario deleted the investigate-workspace-selection-flake branch October 24, 2025 02:18