fix: reduce DB connection pool exhaustion during LLM calls#650
Conversation
…ld time during LLM calls All channel message handlers (Feishu, DingTalk, WeCom, Slack, Discord, WeChat) now follow a three-phase pattern: - Phase 1: Short transaction — load config, save user message → commit + close - Phase 2: LLM call with no DB session held - Phase 3: New short transaction — save assistant reply Previously, a single DB connection was held for the entire request lifecycle, including slow LLM calls (30s+). This caused connection pool exhaustion under concurrent load. Changes: - Split _call_agent_llm into _load_agent_and_model + _call_llm_with_config - Each channel handler manages its own connection lifecycle - New short transactions for saving assistant replies reload ChatSession - Background tasks (heartbeat, trigger_daemon) get new trace IDs for log correlation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 217a686ffe
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| _task_db.add(task_obj) | ||
| await _task_db.commit() | ||
| await _task_db.refresh(task_obj) | ||
| _task_id = str(task_obj.id) |
There was a problem hiding this comment.
Keep task IDs as UUIDs when dispatching execution
When a Feishu reply triggers task auto-creation, this converts the freshly-created Task.id to a string before passing it to execute_task, while every other caller passes a uuid.UUID and execute_task immediately uses the value in UUID-column queries/inserts (Task.id == task_id, TaskLog(task_id=task_id)). In the asyncpg/PostgreSQL path this can fail UUID binding (or prevent the task lookup), so the task is acknowledged in chat but never actually starts; keep the UUID scalar rather than stringifying it.
Useful? React with 👍 / 👎.
…claim to prevent duplicate triggers
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 49426ef52f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| @@ -1,106 +0,0 @@ | |||
| services: | |||
There was a problem hiding this comment.
Restore the default Compose entrypoint
Deleting the repository-root docker-compose.yml breaks the documented Docker deployment path: I searched the repo and the README/setup scripts still tell users to run docker compose up -d from the project root, while the only remaining Compose files are under deploy/ and are not picked up by Docker Compose unless users pass -f. Fresh installs or restarts following the documented command will now fail with no default compose file found.
Useful? React with 👍 / 👎.
Organization-level GitHub Models endpoint returns 403 on free plan. Switch to personal endpoint to bypass org restriction.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9e4f5097e2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| async with _async_session() as db: | ||
| result = await db.execute( | ||
| select(ChannelConfig).where( | ||
| ChannelConfig.agent_id == agent_id, | ||
| ChannelConfig.channel_type == "feishu", | ||
| ) | ||
| ) | ||
| ) | ||
| config = result.scalar_one_or_none() | ||
| config = result.scalar_one_or_none() | ||
| # Pre-load agent and model configs for LLM call (avoids extra session later) | ||
| _agent_model, _llm_model, _fallback_model = await _load_agent_and_model(db, agent_id) |
There was a problem hiding this comment.
Keep Feishu DB work inside a managed session
For text events, this async with closes the session immediately after loading config/model, but the handler later reuses the same db object for history lookup, user resolution, and message persistence outside any context manager before a manual close much later. If a branch returns or raises before that close (for example the channel-user resolution error path), the reopened transaction is not reliably rolled back/closed, and the Feishu contact HTTP call still runs with a DB transaction checked out, so this path can continue leaking or exhausting pool connections instead of reducing them.
Useful? React with 👍 / 👎.
| if: ${{ inputs.use_ai_notes }} | ||
| shell: bash | ||
| env: | ||
| GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
| GITHUB_TOKEN: ${{ secrets.MODELS_TOKEN }} | ||
| run: | |
There was a problem hiding this comment.
Skip AI notes when MODELS_TOKEN is absent
With use_ai_notes defaulting to true, repositories that have not configured the new MODELS_TOKEN secret will run this step with an empty bearer token; the curl call then returns an error status and test "$status_code" -lt 400 stops the job before the fallback release notes step can run. Since the input text says AI notes are used “when MODELS_TOKEN is configured,” guard this step on the secret being present or make the default false so releases do not fail by default.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b472b1797d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| # Resolve via storage backend (supports local FS and S3) | ||
| storage = get_storage_backend() | ||
| storage_key = normalize_storage_key(f"{agent_id}/{path}") | ||
| if not await storage.exists(storage_key) or not await storage.is_file(storage_key): |
There was a problem hiding this comment.
Reject traversal before building the publish key
For publish requests whose path contains .., normalize_storage_key(f"{agent_id}/{path}") collapses the leading agent component away (for example ../<other-agent-id>/page.html normalizes to <other-agent-id>/page.html). The subsequent exists/read_text checks then operate on that escaped key, so an agent/tool invocation that knows another storage key can publish HTML outside its own workspace; keep an explicit traversal rejection or verify the normalized key still starts with this agent's prefix.
Useful? React with 👍 / 👎.
Overview
All channel message handlers (Feishu, DingTalk, WeCom, Slack, Discord, WeChat) now follow a three-phase pattern to avoid holding database connections during slow LLM calls.
Problem
Previously, a single DB connection was held for the entire request lifecycle, including LLM calls that could take 30+ seconds. The connection sat idle but unavailable during that time. Under high concurrency, the connection pool was quickly exhausted.
Changes
Each channel handler is split into three phases:
Specifically:
_call_agent_llminto_load_agent_and_model+_call_llm_with_configImpact
last_message_atconsistency