A2A send_message_to_agent fails on transient LLM timeout, aborts the whole task, and returns an empty error message to the UI #143

@Congregalis

Deployment Method

Source (setup.sh)

Steps to Reproduce

  1. Agent A (Caption) sends a long task_request to Agent B (Nova) using send_message_to_agent.
  2. Agent B starts a multi-step workflow involving LLM calls and tool usage (web_search in this case).
  3. One of the intermediate LLM requests times out.
  4. The entire A2A request fails immediately and returns a generic empty error string.

Expected vs Actual Behavior

Actual Result

  • The task is aborted mid-execution.
  • The UI shows only "Message send error:" with no cause after the colon.
  • The backend log shows httpcore.ReadTimeout / httpx.ReadTimeout.

Expected Result

  • A transient timeout during a long A2A task should not silently collapse into an empty UI error.
  • The returned error should include a concrete cause.
  • Long-running A2A flows should not be fragile to a single transient timeout.
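
One way the third expectation could be met is a bounded retry with exponential backoff around the LLM call. The following is a minimal sketch, not the project's implementation: `call_with_retry`, `make_call`, `retries`, and `base_delay` are hypothetical names, and the built-in `TimeoutError` stands in for `httpx.ReadTimeout` so the example runs without third-party packages.

```python
import asyncio


async def call_with_retry(make_call, *, retries=2, base_delay=0.5,
                          transient=(TimeoutError,)):
    """Await make_call(), retrying on transient errors with exponential backoff.

    make_call: zero-argument callable returning a fresh awaitable per attempt
               (hypothetical; in the real code this would wrap the LLM request).
    retries:   number of extra attempts after the first failure.
    transient: exception types treated as retryable (assumption: timeouts only).
    """
    attempt = 0
    while True:
        try:
            return await make_call()
        except transient:
            if attempt >= retries:
                # Out of attempts: re-raise so the caller can report the cause.
                raise
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts.
            await asyncio.sleep(base_delay * 2 ** attempt)
            attempt += 1
```

In the real backend the `transient` tuple would presumably contain `httpx.ReadTimeout`. Whether a retry is safe depends on the step being idempotent, which this sketch does not check.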

Logs / Screenshots

1. The A2A request to Nova (Agent B) was actually started

From .data/log/backend.log:

3459 2026-03-20 13:38:15 | INFO | ... | [LLM] Raw arguments for send_message_to_agent (len=655): '{"agent_name": "Nova", "message": "...", "msg_type": "task_request"}'
3460 2026-03-20 13:38:15 | INFO | ... | [LLM] Calling tool: send_message_to_agent({'agent_name': 'Nova', 'message': '...', 'msg_type': 'task_request'})

This confirms the failing action was specifically send_message_to_agent targeting Nova.

2. Nova had already entered a multi-step long-running flow

3472 2026-03-20 13:38:55 | INFO | ... | app.services.autonomy_service:check_and_enforce:62 - L2: Executing web_search for agent Nova with notification
3480 2026-03-20 13:39:51 | INFO | ... | app.services.autonomy_service:check_and_enforce:62 - L2: Executing web_search for agent Nova with notification
3484 2026-03-20 13:40:08 | INFO | ... | HTTP Request: POST https://ark.cn-beijing.volces.com/api/coding/v3/chat/completions "HTTP/1.1 200 OK"
3485 2026-03-20 13:40:15 | INFO | ... | HTTP Request: POST https://ark.cn-beijing.volces.com/api/coding/v3/chat/completions "HTTP/1.1 200 OK"
3486 2026-03-20 13:40:20 | INFO | ... | HTTP Request: POST https://ark.cn-beijing.volces.com/api/coding/v3/chat/completions "HTTP/1.1 200 OK"

This shows Nova was not failing immediately. It had already started working through a long task with multiple successful LLM/tool steps before the failure occurred.

3. The real failure was a timeout during an LLM request

3519 httpcore.ReadTimeout
3524 File "/Users/congregalis/Code/Clawith/backend/app/services/agent_tools.py", line 2806, in _send_message_to_agent
3525     response = await llm_client.complete(
3527 File "/Users/congregalis/Code/Clawith/backend/app/services/llm_client.py", line 410, in complete
3528     response = await client.post(url, json=payload, headers=self._get_headers())
3555 httpx.ReadTimeout

This is the critical evidence. The failure is a timeout while waiting for an LLM response inside _send_message_to_agent, not an invalid tool argument, not a missing agent, and not a frontend-only issue.

4. The empty UI error is explained by how the exception is rendered

At the time of failure, the original code returned:

return f"❌ Message send error: {str(e)[:200]}"

For httpx.ReadTimeout, str(e) can be empty. Reproduced locally:

ReadTimeout
str(e)= ''

That explains why the UI showed only:

Message send error:

with no additional detail.
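
A possible way to avoid the empty message is to always include the exception class name and fall back to it when `str(e)` is empty. This is a hedged sketch: `describe_exception` is a hypothetical helper, not a function from the codebase, and the built-in `TimeoutError` is used as a stand-in for `httpx.ReadTimeout` because it also stringifies to an empty string when raised without arguments.

```python
def describe_exception(exc: BaseException, limit: int = 200) -> str:
    """Return a non-empty, length-limited description of an exception.

    The class name is always included, so exceptions whose str() is empty
    still produce a readable cause.
    """
    name = type(exc).__name__
    detail = str(exc).strip()
    text = f"{name}: {detail}" if detail else name
    return text[:limit]


# Stand-in for the timeout case: str(TimeoutError()) is '' just like the
# ReadTimeout reproduced above.
print(repr(str(TimeoutError())))
print(describe_exception(TimeoutError()))
print(describe_exception(ValueError("agent not found")))
```

With such a helper, the return statement quoted above could interpolate `describe_exception(e)` instead of `str(e)[:200]`, so a timeout would be reported by its class name instead of an empty string.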

Labels: bug (Something isn't working)
