
fix: reduce system prompt 78% to fix Qwen3.5 timeouts + MCP runtime status (#609)#617

Merged
itomek merged 14 commits into main from
609-agent-ui-mcp-clientserver-support-and-configuration-v2
Mar 27, 2026
Conversation

@itomek
Collaborator

@itomek itomek commented Mar 25, 2026

Summary

  • Perf: Reduce ChatAgent system prompt from ~17,600 → ~3,853 tokens (78%) to fix Qwen3.5-35B timeouts on local hardware
  • Feat: MCP server runtime status exposed in Agent UI Settings modal (issue #609: Agent UI: MCP client/server support and configuration (v0.17.1))
  • Fix: Increase chat timeouts 120/180s → 600s; fix stale log message; fix MCPTool.get() AttributeError in tools endpoint

Changes

System prompt optimization (perf: reduce system prompt...)

  • Restructured _get_system_prompt() from 733 → 280 lines using two-tier RAG gating:
    • Tier 1 (always present): SMART DISCOVERY, FILE SEARCH — needed because RAG tools are always registered
    • Tier 2 (gated on has_indexed): FACTUAL ACCURACY, DOCUMENT SILENCE, POST-INDEX QUERY, etc.
  • Merged duplicate platform blocks (current OS only)
  • Removed: merge conflict marker <<<<<<< HEAD, hardcoded AVAILABLE TOOLS REFERENCE, both UNSUPPORTED FEATURES blocks, duplicate rules
  • Increased streaming/non-streaming timeouts to 600s
  • Added "Sending to model..." SSE status events for better UX
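
The two-tier gating above can be sketched as a simple conditional assembly of prompt sections. This is an illustrative sketch only, not the actual `_get_system_prompt()` implementation; the rule names are placeholders for the real rule text.

```python
# Hypothetical sketch of two-tier RAG gating; _TIER1_RULES/_TIER2_RULES
# stand in for the real rule blocks in _get_system_prompt().
_TIER1_RULES = "SMART DISCOVERY WORKFLOW: ...\nFILE SEARCH: ..."  # always present
_TIER2_RULES = "FACTUAL ACCURACY: ...\nDOCUMENT SILENCE: ..."     # needs indexed docs

def get_system_prompt(has_indexed: bool) -> str:
    # Tier 1 is always included because RAG tools are always registered;
    # Tier 2 only makes sense once documents are actually indexed.
    parts = ["You are ChatAgent.", _TIER1_RULES]
    if has_indexed:
        parts.append(_TIER2_RULES)
    return "\n\n".join(parts)
```

With `has_indexed=False` the prompt stays small, which is the common no-docs case that was timing out.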

MCP runtime status (feat: MCP runtime status in Agent UI)

  • MCPClientManager._failed: tracks connection errors from load_from_config()
  • MCPClientManager.get_status_report(): returns {name, connected, tool_count, error} for all servers
  • GET /api/mcp/status: REST endpoint returning cached runtime status
  • SSE mcp_status event emitted after each agent setup
  • Settings modal: MCP Servers section showing ✓ connected (N tools) or ✗ failed status
  • 6 new unit tests for _failed tracking and get_status_report()

Bug fixes

  • list_mcp_server_tools was calling .get() on MCPTool dataclass (pre-existing AttributeError)
  • remove_server() now clears _failed to prevent ghost entries in status report
  • has_docs → has_indexed for Tier 2 gating (library docs aren't indexed yet)

Tests

  • 6 new tests: MCPClientManager._failed tracking and get_status_report()
  • 2 new tests: Tier 1/Tier 2 prompt gating correctness

Test plan

  • uv run python -m pytest tests/unit/mcp/ tests/test_chat_agent.py -x -q — 136 pass, 1 pre-existing network failure
  • gaia chat --ui → send "hi" with no documents → responds without timeout (was timing out with Qwen3.5-35B)
  • Settings modal → MCP Servers section shows connected/failed servers after first chat
  • GET /api/mcp/status returns {"servers": []} before any chat, populated after

Closes #609

itomek added 3 commits March 25, 2026 10:21
…timeouts

The ChatAgent's _get_system_prompt() was sending a 17,648-token system
prompt unconditionally, causing every message to timeout with Qwen3.5-35B
on local hardware (Lemonade's ~5min timeout exceeded before first token).

Changes:
- Merge duplicate platform blocks into one (current OS only, ~50 lines saved)
- Two-tier RAG gating: Tier 1 (discovery/tool-usage rules) always present
  since RAG tools are always registered; Tier 2 (query/factual-accuracy
  rules) only injected when has_indexed=True
- Remove hardcoded AVAILABLE TOOLS REFERENCE (duplicated auto-generated list)
- Remove both UNSUPPORTED FEATURES blocks, replace with 3-line version
- Deduplicate MULTI-FACT QUERY RULE and CONVERSATION SUMMARY RULE
- Remove git merge conflict marker (<<<<<<< HEAD) from prompt string
- Condense WRONG/RIGHT examples throughout

Result: no-docs prompt ~3,853 tokens (was ~17,648), 78% reduction.

Also increase streaming and non-streaming chat timeouts from 120/180s to
600s, add "Sending to model..." status events for better UX feedback, and
un-suppress print_processing_start() in SSE handler.
Tracks which MCP servers are actually connected at runtime and exposes
this status in the Settings modal.

Backend:
- MCPClientManager._failed: records connection errors from load_from_config()
- MCPClientManager.get_status_report(): returns connected/failed status
  with tool counts for all configured servers
- MCPClientMixin.get_mcp_status_report(): delegates to _mcp_manager
- _chat_helpers.py: emits mcp_status SSE event after each agent setup;
  caches last-known status for the REST endpoint
- GET /api/mcp/status: returns cached runtime MCP status
- Fix: list_mcp_server_tools used .get() on MCPTool dataclass (AttributeError)
- Fix: remove_server now clears _failed to avoid ghost entries

Frontend:
- types/index.ts: MCPServerStatus type, mcp_status StreamEventType
- services/api.ts: getMCPRuntimeStatus()
- SettingsModal.tsx: MCP Servers section showing connected/failed status
  with tool counts (only shown when MCP servers are configured)

Tests: 6 new tests for MCPClientManager._failed tracking and get_status_report()
Verify that:
- Tier 2 rules (FACTUAL ACCURACY, POST-INDEX QUERY, etc.) are absent
  from the system prompt when no documents are indexed
- Tier 2 rules appear after a document is indexed and rebuild_system_prompt()
  is called (has_indexed=True gates the Tier 2 block)
- Tier 1 rules (SMART DISCOVERY WORKFLOW, FILE SEARCH) are always present
  since RAG tools are always registered unconditionally
@github-actions github-actions Bot added agents mcp MCP integration changes tests Test changes labels Mar 25, 2026
…iptions

Replace Prompts.format_chat_history() + generate() with structured
messages + chat() in send_messages() and send_messages_stream(). This
prevents the model seeing nested/malformed ChatML tokens that caused it
to recite system prompt instructions instead of responding normally.

Also trims tool descriptions to first line only (~1660 token savings),
adds heartbeat status events during prompt prefill, and gates the SD
mixin prompt on sd_default_model presence.

Fixes: model generating garbage text ("According to my Instructions")
instead of valid JSON responses in GAIA Agent UI.
@github-actions github-actions Bot added the chat Chat SDK changes label Mar 25, 2026
@itomek itomek self-assigned this Mar 25, 2026
…prompt

Restores all removed example pairs and multi-turn worked examples into
the modular prompt structure (base_prompt, tool_rules, discovery_rules,
rag_query_rules, data_file_rules). Examples were removed for token savings
but are necessary for LLM adherence to complex behavioral rules.

Also moves POST-INDEX QUERY RULE from conditional rag_query_rules to
always-present tool_rules — Smart Discovery can trigger indexing
mid-conversation even before docs are initially indexed, so the rule
must be present from the start. Updates test to reflect this design.

Removes dead variable has_docs (defined but never used).
Comment thread on src/gaia/agents/chat/agent.py
@kovtcharov
Collaborator

Recommend creating your own MCP-focused agent.

DocumentLibrary, FileBrowser, and ChatView drag-and-drop now call
api.attachDocument() after indexing so the agent's RAG system receives
the session's documents. Also fixes context bar to show session-scoped
docs, changes X button to detach (not hard-delete), evicts agent cache
on detach, and adds early-exit guard in read_file for binary formats.

Closes #609
@itomek itomek force-pushed the 609-agent-ui-mcp-clientserver-support-and-configuration-v2 branch from 4b40932 to 5e2a5e4 on March 26, 2026 19:13
itomek and others added 2 commits March 26, 2026 15:39
Auto-detect when frontend source files are newer than the built dist
and run `npm run build` before starting the server. Also add no-cache
headers to index.html responses so browsers and tunnel proxies always
pick up rebuilt assets. Add VS Code Dev Tunnels to CORS origins.

Co-Authored-By: Tomasz Iniewicz <infancy_shred.0d@icloud.com>
@github-actions github-actions Bot added the cli CLI changes label Mar 26, 2026
itomek and others added 4 commits March 26, 2026 19:10
Disable uvicorn access logs by default (enable with --debug flag).
Gate frontend console info/debug/timed logs behind ?debug URL param
or localStorage, keeping only warnings and errors visible.

Co-Authored-By: Tomasz Iniewicz <infancy_shred.0d@icloud.com>
- Add _maybe_load_expected_model() pre-flight check in _chat_helpers.py
  that detects when Lemonade has no chat-capable model loaded (empty
  all_models_loaded list or embedding-only) and calls load_model()
  before process_query(). Lemonade silently hangs HTTP connections in
  this state instead of returning an error, causing 100-900s hangs.
- Suppress false-alarm "Wrong model" banner in ConnectionBanner.tsx
  Case 4 when the embedding model is transiently active after indexing.
- Add 10s connection timeout to MCPClientManager.load_from_config() so
  a hanging MCP stdio server cannot block agent construction indefinitely
  before the pre-flight check is reached.
- Add 12 unit tests covering all pre-flight scenarios including fast
  path, embedding-only, no model, error handling, and concurrency.
_chat_helpers.py:
- Add _model_load_lock (threading.Lock) to prevent concurrent load_model()
  calls from multiple sessions arriving simultaneously with no model active
- Add _maybe_load_expected_model() pre-flight check that inspects Lemonade's
  all_models_loaded before process_query(). When no llm/vlm is active
  (empty list or embedding-only), calls blocking load_model() and emits a
  "Loading LLM model..." SSE status event. This prevents the 100-900s silent
  hang caused by Lemonade accepting chat completions but producing zero tokens
  when no text-generation model is loaded.
- Call _maybe_load_expected_model() in both the streaming (_run_agent) and
  non-streaming (_do_chat) paths.
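
The double-checked locking pattern described above can be sketched like this. The Lemonade client calls (`get_loaded_models`, `load_model`) are illustrative stand-ins for the real API, and the helper name mirrors the one in the commit message.

```python
import threading

_model_load_lock = threading.Lock()

def maybe_load_expected_model(client, expected: str, emit_status) -> None:
    """Pre-flight: load a chat model if none (or only an embedding model) is active.

    Sketch under assumed client methods get_loaded_models()/load_model();
    the real implementation inspects Lemonade's all_models_loaded list.
    """
    def _needs_load() -> bool:
        loaded = client.get_loaded_models()  # e.g. [] or ["nomic-embed"]
        return expected not in loaded

    if not _needs_load():
        return  # fast path: no lock taken when the right model is active
    with _model_load_lock:  # serialize concurrent sessions
        if _needs_load():   # re-check inside the lock before loading
            emit_status("Loading LLM model...")
            client.load_model(expected)
```

The re-check inside the lock is what prevents two simultaneous sessions from both issuing a blocking `load_model()` call.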

ConnectionBanner.tsx:
- Suppress Case 4 "Wrong model loaded" banner when embedding_model_loaded is
  true. The embedding model is transiently active after RAG indexing; the
  pre-flight in _chat_helpers.py loads the correct LLM before the first query.

mcp_client_manager.py:
- Add cancel_futures=True to pool.shutdown() so pending MCP connections are
  not started after the 10s timeout fires.
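
The parallel-connect-with-timeout behavior described above can be sketched with `concurrent.futures`. This is a simplified model, not the actual `load_from_config()`; `connect_fn` stands in for the real per-server connection logic.

```python
from concurrent.futures import ThreadPoolExecutor, wait

_CONNECT_TIMEOUT = 10.0  # mirrors the 10s timeout mentioned above

def connect_all(servers: dict, connect_fn) -> tuple[dict, dict]:
    """Connect to each server in parallel; record timeouts/errors as failed.

    cancel_futures=True ensures connections still queued after the timeout
    are never started; threads already blocked in I/O are abandoned.
    """
    connected, failed = {}, {}
    pool = ThreadPoolExecutor(max_workers=max(len(servers), 1))
    futures = {pool.submit(connect_fn, cfg): name for name, cfg in servers.items()}
    done, pending = wait(futures, timeout=_CONNECT_TIMEOUT)
    for fut in done:
        name = futures[fut]
        try:
            connected[name] = fut.result()
        except Exception as exc:
            failed[name] = str(exc)
    for fut in pending:
        failed[futures[fut]] = f"timed out after {_CONNECT_TIMEOUT}s"
    pool.shutdown(wait=False, cancel_futures=True)  # don't start queued work
    return connected, failed
```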
Cover re-check inside lock returning non-200 (load_model proceeds)
and re-check raising ConnectError (outer handler catches, warning SSE
emitted). Brings test count to 14.
@itomek itomek marked this pull request as ready for review March 27, 2026 00:32
@itomek itomek requested a review from kovtcharov-amd as a code owner March 27, 2026 00:32
@kovtcharov
Collaborator

@claude review this PR in detail. Identify if there's any technical debt or bugs.

@github-actions
Contributor

github-actions Bot commented Mar 27, 2026

Claude finished @kovtcharov's task (View job)


PR Review: Technical Debt and Bug Analysis

Reviewing PR #617 in detail for technical debt and bugs...

  • Read PR diff and changed files
  • Review system prompt optimization in ChatAgent
  • Review MCP runtime status implementation
  • Review bug fixes and new tests
  • Identify technical debt and potential issues

Summary

This PR makes significant improvements to system prompt size (78% reduction), MCP runtime status visibility, and timeout handling. The implementation is generally solid with good test coverage. However, I've identified several issues ranging from minor to important.

Issues Found

🟡 Important Issues

1. Potential Race Condition in MCP Status Cache (src/gaia/ui/_chat_helpers.py:721-726)

The _mcp_status_cache is updated inside the agent thread while get_cached_mcp_status() reads from the main thread. Although there's a lock, the update pattern could lead to stale reads:

# Inside _run_agent() thread
if hasattr(agent, "get_mcp_status_report"):
    mcp_report = agent.get_mcp_status_report()
    with _mcp_status_lock:
        _mcp_status_cache[:] = mcp_report  # Modifies in-place

Recommendation: Consider using copy.deepcopy() when returning from get_cached_mcp_status() to prevent potential mutation issues, or ensure the lock is held during the entire read operation.

2. MCPClientManager ThreadPoolExecutor Lifecycle (src/gaia/mcp/client/mcp_client_manager.py:220-250)

The parallel connection code creates a new ThreadPoolExecutor per call to load_from_config(). While shutdown(wait=False, cancel_futures=True) is called, threads stuck in readline() become daemon threads that exit with the process. This is fine for normal operation, but could leave resources hanging if load_from_config() is called repeatedly.

pool = ThreadPoolExecutor(max_workers=len(to_connect))
# ... later ...
pool.shutdown(wait=False, cancel_futures=True)

Recommendation: Consider reusing a single thread pool or adding a comment documenting this behavior is intentional.

3. Agent Cache Eviction Policy (src/gaia/ui/_chat_helpers.py:94-110)

The agent cache uses a simple FIFO eviction when full:

if session_id not in _agent_cache and len(_agent_cache) >= _MAX_CACHED_AGENTS:
    oldest = next(iter(_agent_cache))  # FIFO eviction
    del _agent_cache[oldest]

Technical Debt: This evicts the oldest entry regardless of activity. A recently active session could be evicted if it was the oldest, while inactive sessions remain cached.

Recommendation: Consider LRU eviction or track last-access timestamps.

🟢 Minor Issues

4. Missing Type Annotations (src/gaia/ui/_chat_helpers.py:53-62)

Several module-level variables lack type annotations:

_agent_cache: dict = {}  # Could be: Dict[str, Dict[str, Any]]
_mcp_status_cache: list = []  # Could be: List[Dict[str, Any]]

5. Hardcoded Magic Numbers

  • _CONNECT_TIMEOUT = 10.0 in mcp_client_manager.py:209 - should be configurable
  • _MAX_CACHED_AGENTS = 10 in _chat_helpers.py:57 - should be configurable
  • _STREAM_TIMEOUT = 600 in _chat_helpers.py:796 - duplicated with non-streaming timeout

Recommendation: Consider extracting these to a config module or environment variables.

6. Incomplete Error Message in _maybe_load_expected_model (src/gaia/ui/_chat_helpers.py:355-364)

The warning message "Could not auto-load LLM. Check that Lemonade is running." is generic. The actual exception is logged but not shown to the user.

Recommendation: Include more specific error info in the SSE status event.

✅ Good Practices Observed

  1. Comprehensive Test Coverage: New tests for _failed tracking and get_status_report() (6 tests), plus Tier 1/Tier 2 prompt gating tests (2 tests).

  2. Proper Lock Usage: The _mcp_status_lock and _agent_cache_lock are used correctly for thread synchronization.

  3. Defensive Coding: Good use of pop(name, None) in remove_server() to clear _failed entries without raising KeyError.

  4. Clear Code Organization: The two-tier RAG gating is well-documented and the separation between discovery rules (always present) and query rules (gated on has_indexed) is logical.

  5. SSE Event Emission: The mcp_status SSE event is emitted after agent setup, allowing the frontend to update MCP status without a separate API call.

Addressing Reviewer Concern

@kovtcharov raised a concern about system prompt reduction potentially reverting agent performance.

Analysis: The prompt reduction is well-structured:

  • Tier 1 (always present): SMART DISCOVERY, FILE SEARCH - needed because RAG tools are always registered
  • Tier 2 (gated on has_indexed): FACTUAL ACCURACY, DOCUMENT SILENCE, POST-INDEX QUERY, etc.

The key insight is that Tier 2 rules only make sense when documents are indexed. Including them when no documents are indexed adds ~14,000 tokens of irrelevant instructions that can confuse the model and cause timeouts.

Tests Added: test_tier2_rag_rules_absent_without_indexed_docs and test_tier2_rag_rules_present_after_indexing verify the gating logic is correct.

Recommendation: This change should improve agent performance for the common case (no docs indexed yet) while maintaining full capability when documents are present. The concern is valid but appears to be addressed by the two-tier approach.

No Critical Bugs Found

The PR is well-implemented overall. The issues identified are technical debt and optimization opportunities rather than bugs that would cause incorrect behavior.


claude added 2 commits March 27, 2026 01:02
- Remove redundant `import shutil` in cli.py (already imported at top)
- Fix test_system_status_llm_health_fields_have_safe_defaults to pin
  LEMONADE_BASE_URL so it passes regardless of environment variables
- Reformat test_chat_preflight.py (black formatting)

https://claude.ai/code/session_01E8XUu1vYUvGs6wGYoDSUsy
- Use copy.deepcopy() in get_cached_mcp_status() to prevent callers
  from mutating cached dicts after the lock is released (race condition)
- Add type parameters to _agent_cache and _mcp_status_cache annotations

https://claude.ai/code/session_01E8XUu1vYUvGs6wGYoDSUsy
@itomek itomek added this pull request to the merge queue Mar 27, 2026
Merged via the queue into main with commit 2d08088 Mar 27, 2026
36 checks passed
@itomek itomek deleted the 609-agent-ui-mcp-clientserver-support-and-configuration-v2 branch March 27, 2026 15:39
@itomek itomek mentioned this pull request Mar 27, 2026
4 tasks
github-merge-queue Bot pushed a commit that referenced this pull request Mar 27, 2026
## Summary

Release v0.17.0 — **GAIA Agent UI**, eval benchmark framework, tool
execution guardrails, system prompt optimization, and security
hardening.

### Files Changed
- **`docs/releases/v0.17.0.mdx`** — Comprehensive release notes (new
file)
- **`docs/docs.json`** — Added `releases/v0.17.0` to Releases tab,
updated navbar to `v0.17.0 · Lemonade 10.0.0`
- **`src/gaia/version.py`** — Already at `0.17.0` on main (no change
needed)

### Release Highlights

**New Features:**
- **GAIA Agent UI** — Full-stack privacy-first desktop chat with
streaming responses, 53+ format document Q&A, ngrok tunnel for mobile,
page-level citations, session management (PR #428)
- **Agent UI Eval Framework** — `gaia eval agent` command with
7-dimension weighted scoring across 34 scenarios, redesigned Settings
modal, `<think>` block display, performance stats (PR #607)
- **Tool Execution Guardrails** — Blocking confirmation popup
(Allow/Deny/Always Allow) before write/shell tools, 60s timeout (PR
#565, #604)
- **Device Support Detection** — AMD Ryzen AI Max + Radeon ≥24GB
detection, `--base-url` remote bypass, `GAIA_SKIP_DEVICE_CHECK` override
(PR #593)
- **Terminal UI Design** — Typewriter welcome page, pixelated AMD
cursor, glassmorphism, `prefers-reduced-motion` support (PR #568)

**Performance:**
- **78% System Prompt Reduction** — 17,600 → 3,853 tokens via two-tier
RAG gating, 600s chat timeout, MCP runtime status display (PR #617)

**Security:**
- **TOCTOU Race Condition** — Atomic `O_NOFOLLOW` + `fstat` fix in
document upload, per-file `asyncio.Lock` (PR #564)

**Bug Fixes:**
- LRU eviction silent failure + new
`--max-indexed-files`/`--max-total-chunks` CLI flags (PR #567)
- Lemonade v10 device key renames: `npu` → `amd_npu`, `gpu` →
`amd_igpu`/`amd_dgpu` (PR #548)
- Agent UI rendering, Windows paths, JSON safety regex, RAG indexing
guards (PR #566, #604, #605)
- Restored accidentally reverted changes from PRs #564, #565, #568 (PR
#608)

### Post-Merge
After merging, tag and push:
```bash
git checkout main && git pull
git tag v0.17.0 && git push origin v0.17.0
```
CI runs `validate-release` → `publish-release`. PyPI gated on Kalin
approval.

## Test plan
- [ ] `docs.json` is valid JSON and renders on Mintlify
- [ ] `validate_release_notes.py` passes for v0.17.0
- [ ] `version.py` reads `0.17.0`
- [ ] Release notes content matches actual PR changes

Labels

chat Chat SDK changes cli CLI changes mcp MCP integration changes tests Test changes

Development

Successfully merging this pull request may close these issues.

Agent UI: MCP client/server support and configuration (v0.17.1)

3 participants