[Feature] Unify token-budget fallback across channel LLM entry points #576

@Y1fe1Zh0u

What problem does this solve?

Follow-up to #572: token-budget fallback is being addressed in the WebSocket main chat path and the Feishu path, but Clawith has no single model invocation entry point shared by every channel.

Current code paths show several independent ingress lanes:

  • WebSocket main chat: backend/app/api/websocket.py builds conversation context and calls call_llm_with_failover(...).
  • Feishu: backend/app/api/feishu.py loads channel history and routes through _call_agent_llm(...), including primary/fallback model handling.
  • WeCom / WeChat / Discord and similar channels load history in their own service files before calling _call_agent_llm(...) or another LLM helper.
  • Background service paths such as heartbeat, task execution, reporting, supervision, and agent-to-agent/service flows may call call_llm(...) or related helpers directly.

Because these paths assemble history and invoke models independently, a fix in one path can leave other channels exposed to the same oversized-history/token-budget failure. Behavior can therefore diverge by channel even when the agent, model, and conversation history are otherwise identical.

Concrete examples from the current codebase:

  • backend/app/api/websocket.py truncates conversation context before call_llm_with_failover(...).
  • backend/app/api/feishu.py normalizes and truncates history inside _call_agent_llm(...).
  • backend/app/services/wecom_stream.py, backend/app/services/wechat_channel.py, and backend/app/services/discord_gateway.py each load recent ChatMessage history before invoking the agent LLM path.
  • Other service-level callers under backend/app/services/ use call_llm(...)/agent LLM helpers for non-channel work.

Proposed solution

Discuss and design a shared abstraction for model-call context preparation, rather than patching each channel independently.

The shared layer should probably own:

  • history normalization before replaying persisted chat messages into the LLM;
  • token-budget-aware context trimming using the target model's context window;
  • consistent primary/fallback model behavior for retryable failures;
  • consistent handling of tool-call/tool-result message pairs, so trimming does not break provider protocol requirements;
  • channel metadata only as input parameters, not duplicated call logic;
  • a way for background/task/agent-to-agent calls to opt into the same budget governance without pretending they are user chat channels.
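
To make the trimming and tool-pair points above concrete, here is a minimal sketch of budget-aware trimming. It is not taken from the Clawith codebase: the names `estimate_tokens`, `group_turns`, and `trim_to_budget` are hypothetical, messages are assumed to be OpenAI-style dicts, and the character-based token estimate is only a stand-in for a real tokenizer.

```python
from typing import Callable

Message = dict  # assumed OpenAI-style dict: {"role": ..., "content": ..., ...}


def estimate_tokens(message: Message) -> int:
    """Crude stand-in for a tokenizer: ~4 characters per token plus overhead."""
    content = message.get("content") or ""
    return len(str(content)) // 4 + 4


def group_turns(messages: list[Message]) -> list[list[Message]]:
    """Group messages so an assistant tool call stays with its tool results.

    Tool results that do not follow an assistant message carrying
    `tool_calls` are treated as orphans and dropped.
    """
    groups: list[list[Message]] = []
    for msg in messages:
        if msg.get("role") == "tool":
            if groups and groups[-1][0].get("tool_calls"):
                groups[-1].append(msg)
            continue  # either attached above or an orphan; never its own group
        groups.append([msg])
    return groups


def trim_to_budget(
    messages: list[Message],
    context_window: int,
    reserved_for_reply: int = 1024,
    count: Callable[[Message], int] = estimate_tokens,
) -> list[Message]:
    """Keep system messages plus the newest groups that fit the budget."""
    system = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    budget = context_window - reserved_for_reply - sum(count(m) for m in system)

    kept: list[list[Message]] = []
    for group in reversed(group_turns(rest)):  # walk newest to oldest
        cost = sum(count(m) for m in group)
        if cost > budget:
            break  # drop this group and everything older
        kept.append(group)
        budget -= cost

    kept.reverse()  # restore chronological order
    return system + [m for group in kept for m in group]
```

Trimming whole groups rather than single messages is what keeps a tool call and its results together: a pair is either kept in full or dropped in full.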

This does not necessarily mean every channel must use a single monolithic function. The discussion should decide whether the right shape is:

  • a reusable prepare_llm_messages(...) helper used by all entry points;
  • a shared channel invocation service that wraps history loading + model resolution + fallback;
  • or a smaller token-budget guard inserted immediately before any call_llm(...) / _call_agent_llm(...) / call_llm_with_failover(...) call.
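
As a rough illustration of the third option, a guard could take the existing call as a parameter and re-prepare the messages for each candidate model's context window. Everything here is hypothetical: `RetryableLLMError`, `call_with_budget_and_fallback`, and the `(model_name, context_window)` tuples are illustrative, and the real signatures of call_llm(...) and related helpers are not assumed.

```python
import asyncio
from typing import Awaitable, Callable

Message = dict
# Hypothetical call shape: (model_name, messages) -> reply text.
LLMCall = Callable[[str, list[Message]], Awaitable[str]]
# Hypothetical preparer: (messages, context_window) -> trimmed messages.
Preparer = Callable[[list[Message], int], list[Message]]


class RetryableLLMError(Exception):
    """Hypothetical marker for failures worth retrying on another model."""


async def call_with_budget_and_fallback(
    call: LLMCall,
    models: list[tuple[str, int]],  # (model_name, context_window), primary first
    messages: list[Message],
    prepare: Preparer,
) -> str:
    """Try each model in order, trimming history to that model's window."""
    if not models:
        raise ValueError("no models configured")

    last_error: Exception | None = None
    for name, window in models:
        prepared = prepare(messages, window)  # budget is per target model
        try:
            return await call(name, prepared)
        except RetryableLLMError as exc:
            last_error = exc  # fall through to the next model
    assert last_error is not None
    raise last_error
```

Passing `call` and `prepare` in as parameters would let channel code and background jobs share the same guard while keeping their own history loading, which is one way to read the "opt in without pretending to be a chat channel" requirement.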

Acceptance criteria for the discussion

  • Inventory all current LLM invocation entry points, including WebSocket, Feishu, WeCom, WeChat, Teams/Slack/Discord-style channels, heartbeat/task/reporting, and agent-to-agent flows.
  • Decide which paths must share token-budget fallback semantics and which paths need special behavior.
  • Define the shared API boundary and ownership so future channel additions do not need to rediscover the same trimming/fallback rules.
  • Add regression coverage around at least one non-WebSocket channel with oversized history once the abstraction is implemented.
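
The inventory step could start from a small static scan instead of a manual search. The sketch below uses Python's standard `ast` module to list call sites of the three helper names mentioned in this issue. The functions `find_llm_calls` and `inventory` are hypothetical, and the scan only matches by function name, so calls made through aliases or differently named wrappers would be missed.

```python
import ast
from pathlib import Path

# Helper names taken from this issue; extend as the inventory grows.
TARGETS = {"call_llm", "_call_agent_llm", "call_llm_with_failover"}


def find_llm_calls(source: str) -> list[tuple[int, str]]:
    """Return (line_number, function_name) for each matching call in source."""
    hits: list[tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):  # call_llm(...)
            name = func.id
        elif isinstance(func, ast.Attribute):  # module.call_llm(...)
            name = func.attr
        else:
            continue
        if name in TARGETS:
            hits.append((node.lineno, name))
    return sorted(hits)


def inventory(root: str) -> dict[str, list[tuple[int, str]]]:
    """Scan every .py file under root and map file path to its call sites."""
    result: dict[str, list[tuple[int, str]]] = {}
    for path in sorted(Path(root).rglob("*.py")):
        hits = find_llm_calls(path.read_text(encoding="utf-8"))
        if hits:
            result[str(path)] = hits
    return result
```

Running `inventory("backend/app")` from the repository root would give a first list of entry points to classify into "must share fallback semantics" and "needs special behavior".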

Willing to contribute?

  • I'd be interested in working on this.

Labels: enhancement (New feature or request)
