[Bug] Intermittent "tool_call_id is not found" errors - State management issue #382

@Clawiee

Description

Pre-checks

  • I have searched existing issues and am aware that this is not a duplicate.

Deployment Method

Cloud / Self-hosted

Error Messages

[LLM Error] HTTP 400: {
  "error": {
    "message": "No tool call found for function call output with call_id call_xxxxx.",
    "type": "invalid_request_error",
    "param": null,
    "code": null
  }
}

And sometimes:

[LLM Error] HTTP 400: {"error":{"type":"invalid_request_error","message":"tool_call_id is not found"},"type":"error"}

Problem Description

A user reports that LLM tool-calling errors occur frequently during conversations. The LLM model itself does not appear to be the cause: the same operation works after opening a new chat window. This points to a state management issue in the frontend or in WebSocket session handling.

Symptoms

  1. Intermittent tool call failures during multi-turn conversations
  2. Errors disappear after refreshing/opening a new chat window
  3. The error suggests a tool_call_id mismatch between:
    • The ID generated by the backend when storing tool calls
    • The ID sent back by the frontend in subsequent requests
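
One way to confirm this kind of mismatch before a request reaches the LLM is to scan the outgoing message list for tool results whose ID was never declared by an earlier assistant message. The sketch below assumes chat-completions-style messages (the shape shown in the websocket.py excerpt in this issue). `find_orphan_tool_results` is a hypothetical helper, not an existing function in the codebase.

```python
from typing import Any


def find_orphan_tool_results(conversation: list[dict[str, Any]]) -> list[str]:
    """Return tool_call_ids of tool results that no earlier assistant message declared."""
    declared: set[str] = set()
    orphans: list[str] = []
    for message in conversation:
        if message.get("role") == "assistant":
            # Collect every ID announced in this assistant turn.
            for call in message.get("tool_calls") or []:
                declared.add(call["id"])
        elif message.get("role") == "tool":
            tool_call_id = message.get("tool_call_id")
            if tool_call_id not in declared:
                orphans.append(str(tool_call_id))
    return orphans


if __name__ == "__main__":
    history = [
        {"role": "assistant", "content": None, "tool_calls": [
            {"id": "call_a", "type": "function",
             "function": {"name": "search", "arguments": "{}"}},
        ]},
        {"role": "tool", "tool_call_id": "call_a", "content": "ok"},
        {"role": "tool", "tool_call_id": "call_b", "content": "stale"},
    ]
    print(find_orphan_tool_results(history))  # ['call_b']
```

Logging the returned IDs together with the session ID just before the LLM request might show whether the stale ID originates in stored history or in a frontend payload.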

Root Cause Analysis (Preliminary)

Based on code analysis of backend/app/api/websocket.py:

1. Tool Call ID Generation Issue

In websocket.py, tool_call messages are saved with a synthetic ID:

# Line ~470-490 in websocket.py
tc_id = f"call_{msg.id}"  # synthetic tool_call_id
conversation.append({
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": tc_id,
        "type": "function",
        "function": {"name": tc_name, "arguments": ...},
    }],
})

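msg.id is a database UUID, so the synthetic ID has the form call_<uuid>. When the tool result is later sent to the LLM, the frontend might supply a tool_call_id in a different format, in which case the LLM API cannot match the result to the original call.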
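
A minimal sketch of one way to rule out format drift on the backend side: derive the ID in a single helper and build both the assistant tool_calls entry and the matching tool message from the same stored record. `StoredToolCall`, `synthetic_tool_call_id`, and `to_llm_messages` are hypothetical names used for illustration. Only the `call_{msg.id}` format comes from the excerpt above.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class StoredToolCall:
    """Hypothetical shape of a persisted tool-call row."""
    id: str                    # database UUID of the message row
    name: str                  # tool (function) name
    arguments: dict[str, Any]  # arguments the model supplied
    result: str                # serialized tool output


def synthetic_tool_call_id(message_id: str) -> str:
    # Single source of truth for the "call_<uuid>" format.
    return f"call_{message_id}"


def to_llm_messages(row: StoredToolCall) -> list[dict[str, Any]]:
    """Build the assistant tool call and its tool result with one shared ID."""
    tc_id = synthetic_tool_call_id(row.id)
    return [
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": tc_id,
                "type": "function",
                "function": {"name": row.name, "arguments": json.dumps(row.arguments)},
            }],
        },
        {"role": "tool", "tool_call_id": tc_id, "content": row.result},
    ]
```

Because both messages are produced from one row in one place, an ID supplied by the client never has to be trusted when history is rebuilt.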

2. Frontend State Mismatch

When the frontend sends tool results back to the backend (for logging purposes), it should include the same tool_call_id that was generated. If there's any state desynchronization (e.g., from caching, race conditions, or WebSocket reconnection), the ID might not match.

3. Session/Context Reset Effect

Opening a new chat window:

  • Resets frontend state
  • Creates a new WebSocket connection
  • Re-fetches clean chat history
  • Re-initializes all tool call tracking

This is why the "workaround" of opening a new window fixes the issue temporarily.

Affected Areas

Based on code search, similar patterns exist in:

  • backend/app/api/websocket.py - WebSocket chat handler
  • backend/app/api/feishu.py - Feishu channel handler
  • backend/app/services/scheduler.py - Scheduler service
  • backend/app/services/trigger_daemon.py - Trigger daemon

Expected vs Actual Behavior

Expected: Tool calls work consistently across all turns in a conversation.

Actual: Tool calls may fail with "tool_call_id is not found" after some interactions, requiring a session refresh.

Logs / Screenshots

N/A - This is a state management issue that is difficult to reproduce consistently.

Proposed Solutions

Solution 1: Consistent Tool Call ID Generation (Frontend + Backend)

Standardize tool_call_id generation across all components:

# Generate a deterministic ID that can be recreated from known data
import hashlib

def generate_tool_call_id(agent_id: str, session_id: str, tool_name: str, timestamp: float) -> str:
    data = f"{agent_id}:{session_id}:{tool_name}:{timestamp}"
    return f"call_{hashlib.md5(data.encode()).hexdigest()[:16]}"
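
One caveat with the function above, if the frontend is expected to recreate the same ID: a float timestamp is formatted differently across languages (Python renders 1700000000.0 as "1700000000.0", JavaScript renders it as "1700000000"), so the hashes would diverge. A possible variant, assuming both sides can agree on integer milliseconds, is sketched below. `generate_tool_call_id_ms` and the `timestamp_ms` parameter are illustrative names, not part of the proposal as written.

```python
import hashlib


def generate_tool_call_id_ms(agent_id: str, session_id: str, tool_name: str, timestamp_ms: int) -> str:
    # int() guards against a float sneaking in and changing the string form.
    data = f"{agent_id}:{session_id}:{tool_name}:{int(timestamp_ms)}"
    # MD5 is used here only as a short fingerprint, not for security.
    return f"call_{hashlib.md5(data.encode()).hexdigest()[:16]}"


if __name__ == "__main__":
    first = generate_tool_call_id_ms("agent-1", "sess-1", "search", 1700000000000)
    second = generate_tool_call_id_ms("agent-1", "sess-1", "search", 1700000000000.0)
    print(first == second)  # True
```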

Solution 2: Strict Tool Call ID Validation

Add validation when receiving tool results to handle mismatches gracefully:

import logging

logger = logging.getLogger(__name__)

# get_stored_tool_call and get_most_recent_tool_call are placeholder lookups
async def handle_tool_result(session_id: str, tool_call_id: str, result: str):
    # Look up the stored tool call by ID first
    stored = await get_stored_tool_call(tool_call_id)
    if not stored:
        # Fall back to the most recent tool call for this session
        stored = await get_most_recent_tool_call(session_id)
        if stored:
            logger.warning(f"Tool call ID mismatch: {tool_call_id} -> {stored.id}")
    # Proceed with matched tool call...

Solution 3: Frontend State Recovery

When the frontend detects a WebSocket reconnection or session refresh, it should:

  1. Clear any cached tool call state
  2. Re-fetch the latest chat history
  3. Initialize tool call tracking from the fresh state

Investigation Needed

  1. Trace the exact flow: How does the frontend track tool_call_ids across turns?
  2. Check WebSocket reconnection logic: Does reconnection properly reset state?
  3. Examine message persistence: Is there any race condition when saving tool calls?
  4. Review frontend caching: Any localStorage/sessionStorage that might be stale?

Additional Context

This issue was reported by a user who noted that the problem recurs but can be "fixed" by refreshing the browser/chat window, which suggests a state management issue rather than an LLM or backend logic problem.

Priority

High - This affects the core tool-calling functionality of the platform.

Willing to contribute?

  • I'd be interested in working on this.
