A2A `send_message_to_agent` fails on transient LLM timeout, aborts the whole task, and returns an empty error message to the UI #143

## Pre-checks

- [x] I have searched [existing issues](https://github.com/dataelement/Clawith/issues) and this is not a duplicate.

## Deployment Method

Source (setup.sh)

## Steps to Reproduce

1. `Agent A` (Caption) sends a long `task_request` to `Agent B` (Nova) using `send_message_to_agent`.
2. `Agent B` starts a multi-step workflow involving LLM calls and tool usage (`web_search` in this case).
3. One of the intermediate LLM requests times out.
4. The entire A2A request fails immediately, and the error string returned to the UI contains no cause.


## Expected vs Actual Behavior

### Actual Result

- The task is aborted mid-execution.
- The UI shows only `Message send error:` with no cause.
- The backend log shows `httpcore.ReadTimeout` / `httpx.ReadTimeout`.

<img width="957" height="80" alt="Image" src="https://github.com/user-attachments/assets/bc648e35-c07a-439e-9410-291b2be91c9e" />

### Expected Result

- A transient timeout during a long A2A task should not silently collapse into an empty UI error.
- The returned error should include a concrete cause.
- Long-running A2A flows should not be fragile to a single transient timeout.
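
One possible way to make a long A2A flow tolerate a single transient timeout is to retry the individual LLM call instead of failing the whole task. The following is only a minimal sketch, not Clawith code: `retry_transient` and its parameters are hypothetical names, and the built-in `TimeoutError` stands in for `httpx.ReadTimeout` so the example runs with the standard library alone.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry_transient(
    call: Callable[[], Awaitable[T]],
    *,
    transient: Tuple[Type[BaseException], ...] = (TimeoutError,),
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Await call(), retrying transient errors with exponential backoff.

    Hypothetical helper: in the real code path, `transient` would probably
    need to include httpx.ReadTimeout; TimeoutError is a stdlib stand-in.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except transient:
            if attempt == attempts:
                raise  # out of retries: let the caller report the cause
            # Exponential backoff plus a little jitter to avoid retry bursts.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
    raise AssertionError("unreachable")
```

Under these assumptions, the LLM call inside the tool could be wrapped as `await retry_transient(lambda: some_llm_call(...))`, where `some_llm_call` is a placeholder for whatever coroutine performs the request. Errors that are not listed in `transient` propagate immediately, so invalid arguments or a missing agent would still fail fast.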

## Logs / Screenshots

### 1. The A2A request to Nova (Agent B) was actually started

From `.data/log/backend.log`:

```text
3459 2026-03-20 13:38:15 | INFO | ... | [LLM] Raw arguments for send_message_to_agent (len=655): '{"agent_name": "Nova", "message": "...", "msg_type": "task_request"}'
3460 2026-03-20 13:38:15 | INFO | ... | [LLM] Calling tool: send_message_to_agent({'agent_name': 'Nova', 'message': '...', 'msg_type': 'task_request'})
```

This confirms the failing action was specifically `send_message_to_agent` targeting `Nova`.

### 2. Nova had already entered a multi-step long-running flow

```text
3472 2026-03-20 13:38:55 | INFO | ... | app.services.autonomy_service:check_and_enforce:62 - L2: Executing web_search for agent Nova with notification
3480 2026-03-20 13:39:51 | INFO | ... | app.services.autonomy_service:check_and_enforce:62 - L2: Executing web_search for agent Nova with notification
3484 2026-03-20 13:40:08 | INFO | ... | HTTP Request: POST https://ark.cn-beijing.volces.com/api/coding/v3/chat/completions "HTTP/1.1 200 OK"
3485 2026-03-20 13:40:15 | INFO | ... | HTTP Request: POST https://ark.cn-beijing.volces.com/api/coding/v3/chat/completions "HTTP/1.1 200 OK"
3486 2026-03-20 13:40:20 | INFO | ... | HTTP Request: POST https://ark.cn-beijing.volces.com/api/coding/v3/chat/completions "HTTP/1.1 200 OK"
```

This shows Nova was not failing immediately. It had already started working through a long task with multiple successful LLM/tool steps before the failure occurred.

### 3. The real failure was a timeout during an LLM request

```text
3519 httpcore.ReadTimeout
3524 File "/Users/congregalis/Code/Clawith/backend/app/services/agent_tools.py", line 2806, in _send_message_to_agent
3525     response = await llm_client.complete(
3527 File "/Users/congregalis/Code/Clawith/backend/app/services/llm_client.py", line 410, in complete
3528     response = await client.post(url, json=payload, headers=self._get_headers())
3555 httpx.ReadTimeout
```

This is the critical evidence. The failure is a timeout while waiting for an LLM response inside `_send_message_to_agent`, not an invalid tool argument, not a missing agent, and not a frontend-only issue.

### 4. The empty UI error is explained by how the exception is rendered

At the time of failure, the original code returned:

```python
return f"❌ Message send error: {str(e)[:200]}"
```

For `httpx.ReadTimeout`, `str(e)` can be empty. Reproduced locally:

```text
ReadTimeout
str(e)= ''
```

That explains why the UI showed only:

```text
Message send error:
```

with no additional detail.
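
A small change to how the exception is rendered would avoid the empty message. The sketch below is an illustration under assumptions rather than the project's actual fix: `describe_exception` and `format_send_error` are hypothetical names, and the built-in `TimeoutError` is used as a stand-in for `httpx.ReadTimeout` because it also stringifies to an empty string when constructed without arguments.

```python
def describe_exception(exc: BaseException, limit: int = 200) -> str:
    """Return 'TypeName: message', or just 'TypeName' when str(exc) is empty."""
    name = type(exc).__name__
    message = str(exc).strip()
    text = f"{name}: {message}" if message else name
    return text[:limit]


def format_send_error(exc: BaseException) -> str:
    # Same prefix as the original return value, but never an empty cause.
    return f"❌ Message send error: {describe_exception(exc)}"


# Stand-in for the failing case: an exception whose str() is empty.
print(format_send_error(TimeoutError()))
# prints "❌ Message send error: TimeoutError"
```

Including the exception type name keeps the 200-character cap from the original code while guaranteeing that the UI always shows at least what kind of error occurred.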
