[Bug] Intermittent "tool_call_id is not found" errors - State management issue

### Pre-checks

- [x] I have searched [existing issues](https://github.com/dataelement/Clawith/issues) and this is not a duplicate.

### Deployment Method

Cloud / Self-hosted

### Error Messages

```
[LLM Error] HTTP 400: {
  "error": {
    "message": "No tool call found for function call output with call_id call_xxxxx.",
    "type": "invalid_request_error",
    "param": null,
    "code": null
  }
}
```

And sometimes:
```
[LLM Error] HTTP 400: {"error":{"type":"invalid_request_error","message":"tool_call_id is not found"},"type":"error"}
```

### Problem Description

A user reports that LLM tool-calling errors occur frequently during conversations. The issue does **not appear to be related to the LLM model itself**: the same operation works fine after opening a new chat window. This points to a **state management issue** in the frontend or WebSocket session handling.

### Symptoms

1. **Intermittent tool call failures** during multi-turn conversations
2. **Errors disappear after refreshing/opening a new chat window**
3. The error suggests a `tool_call_id` mismatch between:
   - The ID generated by the backend when storing tool calls
   - The ID sent back by the frontend in subsequent requests
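One way to check for this kind of mismatch is to scan the message list just before it is sent to the LLM and flag any tool result whose ID was never declared by an earlier assistant message. This is a minimal sketch, assuming OpenAI-style chat messages (`role`, `tool_calls`, `tool_call_id`); the function name `find_orphaned_tool_results` is hypothetical and does not come from the codebase.

```python
from typing import Any


def find_orphaned_tool_results(conversation: list[dict[str, Any]]) -> list[str]:
    """Return the tool_call_ids of tool results with no earlier matching tool call."""
    declared: set[str] = set()
    orphaned: list[str] = []
    for message in conversation:
        if message.get("role") == "assistant":
            # Collect every ID the assistant declared in this turn.
            for call in message.get("tool_calls") or []:
                declared.add(call["id"])
        elif message.get("role") == "tool":
            call_id = message.get("tool_call_id", "")
            if call_id not in declared:
                orphaned.append(call_id)
    return orphaned
```

Logging the returned IDs together with the session ID right before each LLM request may help show whether the mismatch comes from stored history or from data supplied at request time.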

### Root Cause Analysis (Preliminary)

Based on code analysis of `backend/app/api/websocket.py`:

#### 1. Tool Call ID Generation Issue

In `websocket.py`, tool_call messages are saved with a **synthetic ID**:
```python
# Line ~470-490 in websocket.py
tc_id = f"call_{msg.id}"  # synthetic tool_call_id
conversation.append({
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": tc_id,
        "type": "function",
        "function": {"name": tc_name, "arguments": ...},
    }],
})
```

`msg.id` is a database UUID. If the tool result that is later sent to the LLM carries a `tool_call_id` in a **different** format (for example, one sent back by the frontend), the LLM API cannot match it to this tool call.
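To keep the two sides from diverging, the assistant `tool_calls` entry and its matching `tool` message can be built from the same ID in one place. The following is a sketch, assuming OpenAI-style message shapes; `build_tool_exchange` and its parameters are hypothetical and not part of `websocket.py`.

```python
import json
from typing import Any


def build_tool_exchange(
    message_id: str, tool_name: str, arguments: dict[str, Any], result: str
) -> list[dict[str, Any]]:
    """Build the assistant tool call and its tool result from a single shared ID."""
    tc_id = f"call_{message_id}"  # same synthetic scheme as in the excerpt above
    return [
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": tc_id,
                    "type": "function",
                    "function": {"name": tool_name, "arguments": json.dumps(arguments)},
                }
            ],
        },
        # The result reuses tc_id, so the pair cannot drift apart.
        {"role": "tool", "tool_call_id": tc_id, "content": result},
    ]
```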

#### 2. Frontend State Mismatch

When the frontend sends tool results back to the backend (for logging purposes), it should include the same `tool_call_id` that was generated. If there's any state desynchronization (e.g., from caching, race conditions, or WebSocket reconnection), the ID might not match.

#### 3. Session/Context Reset Effect

Opening a new chat window:
- Resets frontend state
- Creates a new WebSocket connection
- Re-fetches clean chat history
- Re-initializes all tool call tracking

This is why the "workaround" of opening a new window fixes the issue temporarily.

### Affected Areas

Based on code search, similar patterns exist in:
- `backend/app/api/websocket.py` - WebSocket chat handler
- `backend/app/api/feishu.py` - Feishu channel handler
- `backend/app/services/scheduler.py` - Scheduler service
- `backend/app/services/trigger_daemon.py` - Trigger daemon

### Expected vs Actual Behavior

**Expected**: Tool calls work consistently across all turns in a conversation.

**Actual**: Tool calls may fail with "tool_call_id is not found" after some interactions, requiring a session refresh.

### Logs / Screenshots

N/A - This is a state management issue that is difficult to reproduce consistently.

### Proposed Solutions

#### Solution 1: Consistent Tool Call ID Generation (Frontend + Backend)

Standardize tool_call_id generation across all components:

```python
# Generate a deterministic ID that can be recreated from known data
import hashlib

def generate_tool_call_id(agent_id: str, session_id: str, tool_name: str, timestamp: float) -> str:
    data = f"{agent_id}:{session_id}:{tool_name}:{timestamp}"
    return f"call_{hashlib.md5(data.encode()).hexdigest()[:16]}"
```
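Note that the hash above is reproducible only if the exact `timestamp` value is persisted and reused. A possible variant, assuming each tool call already has a stored message ID, derives the ID from persisted identifiers alone. `TOOL_CALL_NAMESPACE` and `tool_call_id_from_message` are hypothetical names, and the namespace string is a placeholder.

```python
import uuid

# Hypothetical fixed namespace; any constant UUID works as long as
# every component uses the same one.
TOOL_CALL_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "tool-calls.example.invalid")


def tool_call_id_from_message(session_id: str, message_id: str) -> str:
    """Derive a stable tool_call_id from identifiers that are already persisted."""
    digest = uuid.uuid5(TOOL_CALL_NAMESPACE, f"{session_id}:{message_id}").hex
    return f"call_{digest[:24]}"
```

Because `uuid.uuid5` is a pure function of its inputs, any component that knows the session and message IDs can recompute the same value without shared state.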

#### Solution 2: Strict Tool Call ID Validation

Add validation when receiving tool results to handle mismatches gracefully:

```python
import logging

logger = logging.getLogger(__name__)


async def handle_tool_result(session_id: str, tool_call_id: str, result: str):
    # get_stored_tool_call / get_most_recent_tool_call are lookups not defined in this issue.
    # Try to find by ID, fall back to most recent if not found
    stored = await get_stored_tool_call(tool_call_id)
    if not stored:
        # Try to find the most recent tool call for this session
        stored = await get_most_recent_tool_call(session_id)
        if stored:
            logger.warning("Tool call ID mismatch: %s -> %s", tool_call_id, stored.id)
    # Proceed with matched tool call...
```
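If silently rebinding a result to the most recent call is considered too lenient, a stricter variant could reject unknown IDs with a descriptive error instead. This is a minimal in-memory sketch; `ToolCallRegistry` and `UnknownToolCallError` are hypothetical names, and real storage would presumably live in the database.

```python
class UnknownToolCallError(LookupError):
    """Raised when a tool result references an ID that was never registered."""


class ToolCallRegistry:
    """In-memory stand-in for the pending tool calls of one session."""

    def __init__(self) -> None:
        self._pending: dict[str, str] = {}  # tool_call_id -> tool name

    def register(self, tool_call_id: str, tool_name: str) -> None:
        self._pending[tool_call_id] = tool_name

    def resolve(self, tool_call_id: str) -> str:
        """Return the tool name for a result and mark the call as answered."""
        try:
            return self._pending.pop(tool_call_id)
        except KeyError:
            known = sorted(self._pending)
            raise UnknownToolCallError(
                f"tool_call_id {tool_call_id!r} not found; pending IDs: {known}"
            ) from None
```

Failing loudly with the list of pending IDs makes a mismatch visible in the logs at the point where it occurs, rather than as an HTTP 400 from the LLM provider.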

#### Solution 3: Frontend State Recovery

When the frontend detects a WebSocket reconnection or session refresh, it should:
1. Clear any cached tool call state
2. Re-fetch the latest chat history
3. Initialize tool call tracking from the fresh state
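The three steps can be sketched in a language-neutral way. Python is used here only for consistency with the other snippets in this issue, and the real frontend code will differ. `ToolCallTracker`, `on_reconnect`, and the `fetch_history` callback are hypothetical names.

```python
from typing import Any, Callable

Message = dict[str, Any]


class ToolCallTracker:
    """Tracks the tool_call_ids known to the client for the current session."""

    def __init__(self, fetch_history: Callable[[], list[Message]]) -> None:
        self._fetch_history = fetch_history
        self.known_ids: set[str] = set()

    def on_reconnect(self) -> None:
        # 1. Clear any cached tool call state.
        self.known_ids.clear()
        # 2. Re-fetch the latest chat history.
        history = self._fetch_history()
        # 3. Rebuild tracking from the fresh history only.
        for message in history:
            for call in message.get("tool_calls") or []:
                self.known_ids.add(call["id"])
```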

### Investigation Needed

1. **Trace the exact flow**: How does the frontend track tool_call_ids across turns?
2. **Check WebSocket reconnection logic**: Does reconnection properly reset state?
3. **Examine message persistence**: Is there any race condition when saving tool calls?
4. **Review frontend caching**: Any localStorage/sessionStorage that might be stale?

### Additional Context

This issue was reported by a user who noted that the problem recurs but can be "fixed" by refreshing the browser/chat window, which suggests a state management issue rather than an LLM or backend logic problem.

### Priority

**High** - This affects the core tool-calling functionality of the platform.

### Willing to contribute?

- [ ] I'd be interested in working on this.
