
fix: reduce system prompt 78% to fix Qwen3.5 timeouts + MCP runtime status (#609)#617

Merged
itomek merged 14 commits into main from
609-agent-ui-mcp-clientserver-support-and-configuration-v2
Mar 27, 2026
Conversation

@itomek
Collaborator

@itomek itomek commented Mar 25, 2026

Summary

  • Perf: Reduce ChatAgent system prompt from ~17,600 → ~3,853 tokens (78%) to fix Qwen3.5-35B timeouts on local hardware
  • Feat: MCP server runtime status exposed in Agent UI Settings modal (issue #609: Agent UI: MCP client/server support and configuration (v0.17.1))
  • Fix: Increase chat timeouts 120/180s → 600s; fix stale log message; fix MCPTool.get() AttributeError in tools endpoint

Changes

System prompt optimization (perf: reduce system prompt...)

  • Restructured _get_system_prompt() from 733 → 280 lines using two-tier RAG gating:
    • Tier 1 (always present): SMART DISCOVERY, FILE SEARCH — needed because RAG tools are always registered
    • Tier 2 (gated on has_indexed): FACTUAL ACCURACY, DOCUMENT SILENCE, POST-INDEX QUERY, etc.
  • Merged duplicate platform blocks (current OS only)
  • Removed: merge conflict marker <<<<<<< HEAD, hardcoded AVAILABLE TOOLS REFERENCE, both UNSUPPORTED FEATURES blocks, duplicate rules
  • Increased streaming/non-streaming timeouts to 600s
  • Added "Sending to model..." SSE status events for better UX
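
The two-tier gating above can be sketched as a simple conditional assembly of prompt sections. This is an illustrative sketch only, not the actual `_get_system_prompt()` implementation; the rule names are placeholders for the real rule text.

```python
# Hypothetical sketch of two-tier RAG gating; _TIER1_RULES/_TIER2_RULES
# stand in for the real rule blocks in _get_system_prompt().
_TIER1_RULES = "SMART DISCOVERY WORKFLOW: ...\nFILE SEARCH: ..."  # always present
_TIER2_RULES = "FACTUAL ACCURACY: ...\nDOCUMENT SILENCE: ..."     # needs indexed docs

def get_system_prompt(has_indexed: bool) -> str:
    # Tier 1 is always included because RAG tools are always registered;
    # Tier 2 only makes sense once documents are actually indexed.
    parts = ["You are ChatAgent.", _TIER1_RULES]
    if has_indexed:
        parts.append(_TIER2_RULES)
    return "\n\n".join(parts)
```

With `has_indexed=False` the prompt stays small, which is the common no-docs case that was timing out.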

MCP runtime status (feat: MCP runtime status in Agent UI)

  • MCPClientManager._failed: tracks connection errors from load_from_config()
  • MCPClientManager.get_status_report(): returns {name, connected, tool_count, error} for all servers
  • GET /api/mcp/status: REST endpoint returning cached runtime status
  • SSE mcp_status event emitted after each agent setup
  • Settings modal: MCP Servers section showing ✓ connected (N tools) or ✗ failed status
  • 6 new unit tests for _failed tracking and get_status_report()

Bug fixes

  • list_mcp_server_tools was calling .get() on MCPTool dataclass (pre-existing AttributeError)
  • remove_server() now clears _failed to prevent ghost entries in status report
  • has_docs → has_indexed for Tier 2 gating (library docs aren't indexed yet)

Tests

  • 6 new tests: MCPClientManager._failed tracking and get_status_report()
  • 2 new tests: Tier 1/Tier 2 prompt gating correctness

Test plan

  • uv run python -m pytest tests/unit/mcp/ tests/test_chat_agent.py -x -q — 136 pass, 1 pre-existing network failure
  • gaia chat --ui → send "hi" with no documents → responds without timeout (was timing out with Qwen3.5-35B)
  • Settings modal → MCP Servers section shows connected/failed servers after first chat
  • GET /api/mcp/status returns {"servers": []} before any chat, populated after

Closes #609

itomek added 3 commits March 25, 2026 10:21
…timeouts

The ChatAgent's _get_system_prompt() was sending a 17,648-token system
prompt unconditionally, causing every message to timeout with Qwen3.5-35B
on local hardware (Lemonade's ~5min timeout exceeded before first token).

Changes:
- Merge duplicate platform blocks into one (current OS only, ~50 lines saved)
- Two-tier RAG gating: Tier 1 (discovery/tool-usage rules) always present
  since RAG tools are always registered; Tier 2 (query/factual-accuracy
  rules) only injected when has_indexed=True
- Remove hardcoded AVAILABLE TOOLS REFERENCE (duplicated auto-generated list)
- Remove both UNSUPPORTED FEATURES blocks, replace with 3-line version
- Deduplicate MULTI-FACT QUERY RULE and CONVERSATION SUMMARY RULE
- Remove git merge conflict marker (<<<<<<< HEAD) from prompt string
- Condense WRONG/RIGHT examples throughout

Result: no-docs prompt ~3,853 tokens (was ~17,648), 78% reduction.

Also increase streaming and non-streaming chat timeouts from 120/180s to
600s, add "Sending to model..." status events for better UX feedback, and
un-suppress print_processing_start() in SSE handler.
Tracks which MCP servers are actually connected at runtime and exposes
this status in the Settings modal.

Backend:
- MCPClientManager._failed: records connection errors from load_from_config()
- MCPClientManager.get_status_report(): returns connected/failed status
  with tool counts for all configured servers
- MCPClientMixin.get_mcp_status_report(): delegates to _mcp_manager
- _chat_helpers.py: emits mcp_status SSE event after each agent setup;
  caches last-known status for the REST endpoint
- GET /api/mcp/status: returns cached runtime MCP status
- Fix: list_mcp_server_tools used .get() on MCPTool dataclass (AttributeError)
- Fix: remove_server now clears _failed to avoid ghost entries

Frontend:
- types/index.ts: MCPServerStatus type, mcp_status StreamEventType
- services/api.ts: getMCPRuntimeStatus()
- SettingsModal.tsx: MCP Servers section showing connected/failed status
  with tool counts (only shown when MCP servers are configured)

Tests: 6 new tests for MCPClientManager._failed tracking and get_status_report()
Verify that:
- Tier 2 rules (FACTUAL ACCURACY, POST-INDEX QUERY, etc.) are absent
  from the system prompt when no documents are indexed
- Tier 2 rules appear after a document is indexed and rebuild_system_prompt()
  is called (has_indexed=True gates the Tier 2 block)
- Tier 1 rules (SMART DISCOVERY WORKFLOW, FILE SEARCH) are always present
  since RAG tools are always registered unconditionally
@github-actions github-actions Bot added agents mcp MCP integration changes tests Test changes labels Mar 25, 2026
…iptions

Replace Prompts.format_chat_history() + generate() with structured
messages + chat() in send_messages() and send_messages_stream(). This
prevents the model seeing nested/malformed ChatML tokens that caused it
to recite system prompt instructions instead of responding normally.

Also trims tool descriptions to first line only (~1660 token savings),
adds heartbeat status events during prompt prefill, and gates the SD
mixin prompt on sd_default_model presence.

Fixes: model generating garbage text ("According to my Instructions")
instead of valid JSON responses in GAIA Agent UI.
@github-actions github-actions Bot added the chat Chat SDK changes label Mar 25, 2026
@itomek itomek self-assigned this Mar 25, 2026
…prompt

Restores all removed example pairs and multi-turn worked examples into
the modular prompt structure (base_prompt, tool_rules, discovery_rules,
rag_query_rules, data_file_rules). Examples were removed for token savings
but are necessary for LLM adherence to complex behavioral rules.

Also moves POST-INDEX QUERY RULE from conditional rag_query_rules to
always-present tool_rules — Smart Discovery can trigger indexing
mid-conversation even before docs are initially indexed, so the rule
must be present from the start. Updates test to reflect this design.

Removes dead variable has_docs (defined but never used).
Comment thread on src/gaia/agents/chat/agent.py
@kovtcharov
Collaborator

Recommend creating your own MCP-focused agent.

DocumentLibrary, FileBrowser, and ChatView drag-and-drop now call
api.attachDocument() after indexing so the agent's RAG system receives
the session's documents. Also fixes context bar to show session-scoped
docs, changes X button to detach (not hard-delete), evicts agent cache
on detach, and adds early-exit guard in read_file for binary formats.

Closes #609
@itomek itomek force-pushed the 609-agent-ui-mcp-clientserver-support-and-configuration-v2 branch from 4b40932 to 5e2a5e4 on March 26, 2026 19:13
itomek and others added 2 commits March 26, 2026 15:39
Auto-detect when frontend source files are newer than the built dist
and run `npm run build` before starting the server. Also add no-cache
headers to index.html responses so browsers and tunnel proxies always
pick up rebuilt assets. Add VS Code Dev Tunnels to CORS origins.

Co-Authored-By: Tomasz Iniewicz <infancy_shred.0d@icloud.com>
@github-actions github-actions Bot added the cli CLI changes label Mar 26, 2026
itomek and others added 4 commits March 26, 2026 19:10
Disable uvicorn access logs by default (enable with --debug flag).
Gate frontend console info/debug/timed logs behind ?debug URL param
or localStorage, keeping only warnings and errors visible.

Co-Authored-By: Tomasz Iniewicz <infancy_shred.0d@icloud.com>
- Add _maybe_load_expected_model() pre-flight check in _chat_helpers.py
  that detects when Lemonade has no chat-capable model loaded (empty
  all_models_loaded list or embedding-only) and calls load_model()
  before process_query(). Lemonade silently hangs HTTP connections in
  this state instead of returning an error, causing 100-900s hangs.
- Suppress false-alarm "Wrong model" banner in ConnectionBanner.tsx
  Case 4 when the embedding model is transiently active after indexing.
- Add 10s connection timeout to MCPClientManager.load_from_config() so
  a hanging MCP stdio server cannot block agent construction indefinitely
  before the pre-flight check is reached.
- Add 12 unit tests covering all pre-flight scenarios including fast
  path, embedding-only, no model, error handling, and concurrency.
_chat_helpers.py:
- Add _model_load_lock (threading.Lock) to prevent concurrent load_model()
  calls from multiple sessions arriving simultaneously with no model active
- Add _maybe_load_expected_model() pre-flight check that inspects Lemonade's
  all_models_loaded before process_query(). When no llm/vlm is active
  (empty list or embedding-only), calls blocking load_model() and emits a
  "Loading LLM model..." SSE status event. This prevents the 100-900s silent
  hang caused by Lemonade accepting chat completions but producing zero tokens
  when no text-generation model is loaded.
- Call _maybe_load_expected_model() in both the streaming (_run_agent) and
  non-streaming (_do_chat) paths.
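
The double-checked locking pattern described above can be sketched like this. The Lemonade client calls (`get_loaded_models`, `load_model`) are illustrative stand-ins for the real API, and the helper name mirrors the one in the commit message.

```python
import threading

_model_load_lock = threading.Lock()

def maybe_load_expected_model(client, expected: str, emit_status) -> None:
    """Pre-flight: load a chat model if none (or only an embedding model) is active.

    Sketch under assumed client methods get_loaded_models()/load_model();
    the real implementation inspects Lemonade's all_models_loaded list.
    """
    def _needs_load() -> bool:
        loaded = client.get_loaded_models()  # e.g. [] or ["nomic-embed"]
        return expected not in loaded

    if not _needs_load():
        return  # fast path: no lock taken when the right model is active
    with _model_load_lock:  # serialize concurrent sessions
        if _needs_load():   # re-check inside the lock before loading
            emit_status("Loading LLM model...")
            client.load_model(expected)
```

The re-check inside the lock is what prevents two simultaneous sessions from both issuing a blocking `load_model()` call.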

ConnectionBanner.tsx:
- Suppress Case 4 "Wrong model loaded" banner when embedding_model_loaded is
  true. The embedding model is transiently active after RAG indexing; the
  pre-flight in _chat_helpers.py loads the correct LLM before the first query.

mcp_client_manager.py:
- Add cancel_futures=True to pool.shutdown() so pending MCP connections are
  not started after the 10s timeout fires.
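
The parallel-connect-with-timeout behavior described above can be sketched with `concurrent.futures`. This is a simplified model, not the actual `load_from_config()`; `connect_fn` stands in for the real per-server connection logic.

```python
from concurrent.futures import ThreadPoolExecutor, wait

_CONNECT_TIMEOUT = 10.0  # mirrors the 10s timeout mentioned above

def connect_all(servers: dict, connect_fn) -> tuple[dict, dict]:
    """Connect to each server in parallel; record timeouts/errors as failed.

    cancel_futures=True ensures connections still queued after the timeout
    are never started; threads already blocked in I/O are abandoned.
    """
    connected, failed = {}, {}
    pool = ThreadPoolExecutor(max_workers=max(len(servers), 1))
    futures = {pool.submit(connect_fn, cfg): name for name, cfg in servers.items()}
    done, pending = wait(futures, timeout=_CONNECT_TIMEOUT)
    for fut in done:
        name = futures[fut]
        try:
            connected[name] = fut.result()
        except Exception as exc:
            failed[name] = str(exc)
    for fut in pending:
        failed[futures[fut]] = f"timed out after {_CONNECT_TIMEOUT}s"
    pool.shutdown(wait=False, cancel_futures=True)  # don't start queued work
    return connected, failed
```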
Cover re-check inside lock returning non-200 (load_model proceeds)
and re-check raising ConnectError (outer handler catches, warning SSE
emitted). Brings test count to 14.
@itomek itomek marked this pull request as ready for review March 27, 2026 00:32
@itomek itomek requested a review from kovtcharov-amd as a code owner March 27, 2026 00:32
@kovtcharov
Collaborator

@claude review this PR in detail. Identify if there's any technical debt or bugs.

@github-actions
Contributor

github-actions Bot commented Mar 27, 2026

Claude finished @kovtcharov's task (View job)


PR Review: Technical Debt and Bug Analysis

Reviewing PR #617 in detail for technical debt and bugs...

  • Read PR diff and changed files
  • Review system prompt optimization in ChatAgent
  • Review MCP runtime status implementation
  • Review bug fixes and new tests
  • Identify technical debt and potential issues

Summary

This PR makes significant improvements to system prompt size (78% reduction), MCP runtime status visibility, and timeout handling. The implementation is generally solid with good test coverage. However, I've identified several issues ranging from minor to important.

Issues Found

🟡 Important Issues

1. Potential Race Condition in MCP Status Cache (src/gaia/ui/_chat_helpers.py:721-726)

The _mcp_status_cache is updated inside the agent thread while get_cached_mcp_status() reads from the main thread. Although there's a lock, the update pattern could lead to stale reads:

# Inside _run_agent() thread
if hasattr(agent, "get_mcp_status_report"):
    mcp_report = agent.get_mcp_status_report()
    with _mcp_status_lock:
        _mcp_status_cache[:] = mcp_report  # Modifies in-place

Recommendation: Consider using copy.deepcopy() when returning from get_cached_mcp_status() to prevent potential mutation issues, or ensure the lock is held during the entire read operation.

2. MCPClientManager ThreadPoolExecutor Lifecycle (src/gaia/mcp/client/mcp_client_manager.py:220-250)

The parallel connection code creates a new ThreadPoolExecutor per call to load_from_config(). While shutdown(wait=False, cancel_futures=True) is called, threads stuck in readline() become daemon threads that exit with the process. This is fine for normal operation, but could leave resources hanging if load_from_config() is called repeatedly.

pool = ThreadPoolExecutor(max_workers=len(to_connect))
# ... later ...
pool.shutdown(wait=False, cancel_futures=True)

Recommendation: Consider reusing a single thread pool or adding a comment documenting this behavior is intentional.

3. Agent Cache Eviction Policy (src/gaia/ui/_chat_helpers.py:94-110)

The agent cache uses a simple FIFO eviction when full:

if session_id not in _agent_cache and len(_agent_cache) >= _MAX_CACHED_AGENTS:
    oldest = next(iter(_agent_cache))  # FIFO eviction
    del _agent_cache[oldest]

Technical Debt: This evicts the oldest entry regardless of activity. A recently active session could be evicted if it was the oldest, while inactive sessions remain cached.

Recommendation: Consider LRU eviction or track last-access timestamps.

🟢 Minor Issues

4. Missing Type Annotations (src/gaia/ui/_chat_helpers.py:53-62)

Several module-level variables lack type annotations:

_agent_cache: dict = {}  # Could be: Dict[str, Dict[str, Any]]
_mcp_status_cache: list = []  # Could be: List[Dict[str, Any]]

5. Hardcoded Magic Numbers

  • _CONNECT_TIMEOUT = 10.0 in mcp_client_manager.py:209 - should be configurable
  • _MAX_CACHED_AGENTS = 10 in _chat_helpers.py:57 - should be configurable
  • _STREAM_TIMEOUT = 600 in _chat_helpers.py:796 - duplicated with non-streaming timeout

Recommendation: Consider extracting these to a config module or environment variables.

6. Incomplete Error Message in _maybe_load_expected_model (src/gaia/ui/_chat_helpers.py:355-364)

The warning message "Could not auto-load LLM. Check that Lemonade is running." is generic. The actual exception is logged but not shown to the user.

Recommendation: Include more specific error info in the SSE status event.

✅ Good Practices Observed

  1. Comprehensive Test Coverage: New tests for _failed tracking and get_status_report() (6 tests), plus Tier 1/Tier 2 prompt gating tests (2 tests).

  2. Proper Lock Usage: The _mcp_status_lock and _agent_cache_lock are used correctly for thread synchronization.

  3. Defensive Coding: Good use of pop(name, None) in remove_server() to clear _failed entries without raising KeyError.

  4. Clear Code Organization: The two-tier RAG gating is well-documented and the separation between discovery rules (always present) and query rules (gated on has_indexed) is logical.

  5. SSE Event Emission: The mcp_status SSE event is emitted after agent setup, allowing the frontend to update MCP status without a separate API call.

Addressing Reviewer Concern

@kovtcharov raised a concern about system prompt reduction potentially reverting agent performance.

Analysis: The prompt reduction is well-structured:

  • Tier 1 (always present): SMART DISCOVERY, FILE SEARCH - needed because RAG tools are always registered
  • Tier 2 (gated on has_indexed): FACTUAL ACCURACY, DOCUMENT SILENCE, POST-INDEX QUERY, etc.

The key insight is that Tier 2 rules only make sense when documents are indexed. Including them when no documents are indexed adds ~14,000 tokens of irrelevant instructions that can confuse the model and cause timeouts.

Tests Added: test_tier2_rag_rules_absent_without_indexed_docs and test_tier2_rag_rules_present_after_indexing verify the gating logic is correct.

Recommendation: This change should improve agent performance for the common case (no docs indexed yet) while maintaining full capability when documents are present. The concern is valid but appears to be addressed by the two-tier approach.

No Critical Bugs Found

The PR is well-implemented overall. The issues identified are technical debt and optimization opportunities rather than bugs that would cause incorrect behavior.


claude added 2 commits March 27, 2026 01:02
- Remove redundant `import shutil` in cli.py (already imported at top)
- Fix test_system_status_llm_health_fields_have_safe_defaults to pin
  LEMONADE_BASE_URL so it passes regardless of environment variables
- Reformat test_chat_preflight.py (black formatting)

https://claude.ai/code/session_01E8XUu1vYUvGs6wGYoDSUsy
- Use copy.deepcopy() in get_cached_mcp_status() to prevent callers
  from mutating cached dicts after the lock is released (race condition)
- Add type parameters to _agent_cache and _mcp_status_cache annotations

https://claude.ai/code/session_01E8XUu1vYUvGs6wGYoDSUsy
@itomek itomek added this pull request to the merge queue Mar 27, 2026
Merged via the queue into main with commit 2d08088 Mar 27, 2026
36 checks passed
@itomek itomek deleted the 609-agent-ui-mcp-clientserver-support-and-configuration-v2 branch March 27, 2026 15:39
@itomek itomek mentioned this pull request Mar 27, 2026
4 tasks
github-merge-queue Bot pushed a commit that referenced this pull request Mar 27, 2026
## Summary

Release v0.17.0 — **GAIA Agent UI**, eval benchmark framework, tool
execution guardrails, system prompt optimization, and security
hardening.

### Files Changed
- **`docs/releases/v0.17.0.mdx`** — Comprehensive release notes (new
file)
- **`docs/docs.json`** — Added `releases/v0.17.0` to Releases tab,
updated navbar to `v0.17.0 · Lemonade 10.0.0`
- **`src/gaia/version.py`** — Already at `0.17.0` on main (no change
needed)

### Release Highlights

**New Features:**
- **GAIA Agent UI** — Full-stack privacy-first desktop chat with
streaming responses, 53+ format document Q&A, ngrok tunnel for mobile,
page-level citations, session management (PR #428)
- **Agent UI Eval Framework** — `gaia eval agent` command with
7-dimension weighted scoring across 34 scenarios, redesigned Settings
modal, `<think>` block display, performance stats (PR #607)
- **Tool Execution Guardrails** — Blocking confirmation popup
(Allow/Deny/Always Allow) before write/shell tools, 60s timeout (PR
#565, #604)
- **Device Support Detection** — AMD Ryzen AI Max + Radeon ≥24GB
detection, `--base-url` remote bypass, `GAIA_SKIP_DEVICE_CHECK` override
(PR #593)
- **Terminal UI Design** — Typewriter welcome page, pixelated AMD
cursor, glassmorphism, `prefers-reduced-motion` support (PR #568)

**Performance:**
- **78% System Prompt Reduction** — 17,600 → 3,853 tokens via two-tier
RAG gating, 600s chat timeout, MCP runtime status display (PR #617)

**Security:**
- **TOCTOU Race Condition** — Atomic `O_NOFOLLOW` + `fstat` fix in
document upload, per-file `asyncio.Lock` (PR #564)

**Bug Fixes:**
- LRU eviction silent failure + new
`--max-indexed-files`/`--max-total-chunks` CLI flags (PR #567)
- Lemonade v10 device key renames: `npu` → `amd_npu`, `gpu` →
`amd_igpu`/`amd_dgpu` (PR #548)
- Agent UI rendering, Windows paths, JSON safety regex, RAG indexing
guards (PR #566, #604, #605)
- Restored accidentally reverted changes from PRs #564, #565, #568 (PR
#608)

### Post-Merge
After merging, tag and push:
```bash
git checkout main && git pull
git tag v0.17.0 && git push origin v0.17.0
```
CI runs `validate-release` → `publish-release`. PyPI gated on Kalin
approval.

## Test plan
- [ ] `docs.json` is valid JSON and renders on Mintlify
- [ ] `validate_release_notes.py` passes for v0.17.0
- [ ] `version.py` reads `0.17.0`
- [ ] Release notes content matches actual PR changes

Labels

chat Chat SDK changes cli CLI changes mcp MCP integration changes tests Test changes

Development

Successfully merging this pull request may close these issues.

Agent UI: MCP client/server support and configuration (v0.17.1)

3 participants