[Bug] LLM call fails with ReadTimeout after 3 retry attempts #44

@Clawiee

Tags: bug, api, performance
Quality Rating: ⭐ 7/10


Reporter: 董江涵

Description

When the agent processes user requests that involve LLM API calls, the system intermittently fails with the error:

[LLM call error] ReadTimeout: Connection failed after 3 attempts

The error has occurred multiple times during normal conversation flows, and each time the agent failed to respond at all. It appears to be related to the timeout configuration and retry strategy used for the LLM API call.

Steps to Reproduce

  1. Interact with an agent on the Clawith platform in a normal conversation
  2. Send a message that requires the agent to generate a response via LLM
  3. Under certain conditions (possibly high server load, large context, or network instability), the LLM call times out
  4. The system retries 3 times and then surfaces the raw error to the user

Expected Behavior

  • The system should have a sufficiently long timeout for LLM API calls (especially for complex/long-context requests)
  • The retry mechanism should use exponential backoff (e.g., 2s → 4s → 8s) rather than immediate retries
  • If all retries fail, the user should see a friendly error message (e.g., "The service is temporarily busy, please try again") instead of the raw internal error [LLM call error] ReadTimeout: Connection failed after 3 attempts
  • Consider supporting streaming responses to reduce perceived latency and avoid timeout issues
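
The backoff and friendly-error behavior described above could be sketched roughly as follows. This is a minimal illustration, not Clawith's actual code: `call_with_backoff`, `LLMUnavailableError`, and `FRIENDLY_MESSAGE` are hypothetical names, and the built-in `TimeoutError` stands in for whatever timeout exception the platform's HTTP client raises.

```python
import time
from typing import Callable, List, Optional, Tuple, Type, TypeVar

T = TypeVar("T")

# Assumed wording, taken from the example message in this issue.
FRIENDLY_MESSAGE = "The service is temporarily busy, please try again."


class LLMUnavailableError(Exception):
    """Hypothetical error that carries a user-safe message plus the raw cause."""

    def __init__(self, user_message: str, cause: Optional[BaseException]):
        super().__init__(user_message)
        self.user_message = user_message
        self.cause = cause  # keep the raw error for logs, not for the user


def backoff_delays(retries: int, base: float = 2.0) -> List[float]:
    """Delay before each retry: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(retries)]


def call_with_backoff(
    call: Callable[[], T],
    retries: int = 3,
    base: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on timeout with exponentially growing delays."""
    delays = backoff_delays(retries, base)
    last_error: Optional[BaseException] = None
    for attempt in range(retries + 1):  # one initial attempt plus `retries` retries
        try:
            return call()
        except retry_on as exc:
            last_error = exc
            if attempt < retries:
                sleep(delays[attempt])
    # All attempts failed: surface a friendly message instead of the raw error.
    raise LLMUnavailableError(FRIENDLY_MESSAGE, last_error)
```

The caller would catch `LLMUnavailableError` and show `user_message` in the chat, while logging `cause` internally. Adding random jitter to each delay is a common refinement that is left out here for brevity.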

Actual Behavior

  • The LLM API call times out after the default timeout period
  • The system retries exactly 3 times with no apparent backoff strategy
  • After 3 failures, the raw error message [LLM call error] ReadTimeout: Connection failed after 3 attempts is displayed directly to the user
  • The agent's response is lost completely: no partial response or fallback is shown
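
One possible way to avoid losing the whole response is to consume the LLM output as a stream and keep whatever text arrived before the timeout. The sketch below assumes the response is exposed as an iterable of text chunks; `collect_stream` is a hypothetical helper, and the built-in `TimeoutError` again stands in for the client's own timeout exception.

```python
from typing import Iterable, List, Tuple


def collect_stream(chunks: Iterable[str]) -> Tuple[str, bool]:
    """Join streamed chunks and return (text, completed).

    If the stream times out midway, the text received so far is returned
    with completed=False so the caller can show a partial answer.
    """
    parts: List[str] = []
    try:
        for chunk in chunks:
            parts.append(chunk)
    except TimeoutError:
        return "".join(parts), False
    return "".join(parts), True
```

With this shape, the caller can decide whether to display the partial text, retry, or fall back to a friendly error message.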

Suggested Improvements

  1. Increase timeout — Raise the HTTP request timeout from the default (likely 30s) to 60–120s for LLM calls
  2. Exponential backoff — Implement retry with increasing delays (e.g., 2s, 4s, 8s)
  3. Increase retry count — Consider 5 retries instead of 3
  4. Streaming support — Use streaming mode for LLM responses to avoid long-wait timeouts
  5. Graceful error handling — Show a user-friendly message instead of the raw error
  6. Connection pooling — Reuse HTTP connections to reduce connection establishment overhead
  7. Configurable timeout — Allow timeout and retry parameters to be configurable per model/provider
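
Items 1, 3, and 7 above could be combined into a per-model policy object, sketched below. Everything here is an assumption for illustration: `LLMCallPolicy`, `POLICY_OVERRIDES`, `policy_for`, and the `"example-provider/small-model"` key are hypothetical, and the numbers simply mirror the values suggested in this issue.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LLMCallPolicy:
    """Hypothetical timeout and retry settings for one model or provider."""

    connect_timeout_s: float = 10.0
    read_timeout_s: float = 120.0  # upper end of the suggested 60-120s range
    max_retries: int = 5
    backoff_base_s: float = 2.0
    stream: bool = True


DEFAULT_POLICY = LLMCallPolicy()

# Illustrative override table; the key is a made-up model identifier.
POLICY_OVERRIDES: Dict[str, LLMCallPolicy] = {
    "example-provider/small-model": LLMCallPolicy(read_timeout_s=60.0, max_retries=3),
}


def policy_for(
    model_key: str,
    overrides: Optional[Dict[str, LLMCallPolicy]] = None,
) -> LLMCallPolicy:
    """Return the policy for a model, falling back to the default policy."""
    table = POLICY_OVERRIDES if overrides is None else overrides
    return table.get(model_key, DEFAULT_POLICY)
```

In practice the override table would probably be loaded from the platform's configuration rather than hard-coded, so operators can tune slow providers without a code change.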

Additional Context

  • This error has been observed multiple times during the same session
  • The error occurs during regular conversational interactions (not particularly large prompts)
  • Environment: Clawith platform, agent conversation mode
