
feat(agents): add Chat Lite + Settings model/ctx/memory controls #802

Merged
kovtcharov-amd merged 32 commits into main from feature/mac-4b-default on Apr 29, 2026

Conversation

@kovtcharov (Collaborator)

Summary

Ships Chat Lite — a 4B-model sibling of the built-in Chat Agent for Macs and other hardware that cannot host the 35B default — plus the three Settings controls needed to make swapping models practical: an Active Model override, a Context Size picker, and per-agent Memory Warnings. Chat Agent and its defaults are untouched; this is purely additive.

Threads

  • New chat-lite agent (registry.py) — reuses ChatAgent but presets model_id to Qwen3-4B-Instruct-2507-GGUF (falls back to Qwen3-4B-GGUF). Appears alongside Chat in the picker. Why: 35B won't load on ~8-16GB machines, so users were stuck with no working out-of-the-box option.

  • AgentInfo.min_memory_gb — new optional field on registrations/manifests/API. Chat Lite declares 5.0, Chat keeps None. Settings renders a Memory Warnings block only for agents whose requirement exceeds memory_available_gb. Why: warn before the user wastes time picking an agent that will OOM.

  • Settings: Active Model — text field bound to the existing custom_model setting, with "Use agent default" placeholder. Empty → agent's registered models[0] wins (unchanged backend logic). Why: users needed a visible way to swap models per-agent without editing settings JSON.

  • Settings: Context Size — preset chips (4K / 8K / 16K / 32K) plus numeric input, Apply reloads the active model via the existing /api/system/load-model. Why: matches what Lemonade's lemonade load --ctx-size already supports but the UI never exposed.

  • expected_model_loaded respects registered agents (system.py) — used to hardcode the 35B default, so Chat Lite's 4B always tripped "Wrong model loaded". Now accepts any registered agent's preferred model as valid. Why: the old check is wrong the moment you have more than one agent with different model preferences.

  • Pre-flight load handles wrong-model + small-ctx (_chat_helpers.py) — _maybe_load_expected_model used to short-circuit if any LLM was active, even the wrong one, and never checked ctx. It now requires the specific expected model with ctx ≥ 32K; otherwise it reloads. Why: Lemonade auto-loads requested models at its default 4096 ctx, silently truncating ChatAgent's >7K-token system prompt, producing an empty stream. This is what blocked Chat Lite from ever returning a response. _ensure_model_loaded also gains a 32K fallback for models not in the built-in MODELS registry (same rationale).
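
A minimal sketch of the new pre-flight decision, assuming a status dict that carries the loaded model id and its context size (field names here are illustrative, not the actual Lemonade status schema):

def needs_reload(status: dict, expected_model: str, min_ctx: int = 32768) -> bool:
    loaded = status.get("model_loaded")  # currently loaded model id, or None
    ctx = status.get("ctx_size") or 0    # missing/zero ctx counts as "needs reload"
    # Old behavior: skip the load whenever any LLM was active, even the wrong
    # model sitting at Lemonade's 4096 default ctx. New behavior:
    return loaded != expected_model or ctx < min_ctx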

Test plan

  • python -m pytest tests/unit/agents/ tests/unit/chat/ui/test_agents_router.py — 156 pass, incl. 10 new
  • python util/lint.py --black --isort — green
  • cd src/gaia/apps/webui && npm run build — clean, 0 new warnings
  • End-to-end on Mac: create a chat-lite session, send "Reply with exactly: hello from chat-lite" — model auto-loads, streams back "hello from chat-lite" at ~60 tok/s
  • Second message in same session — no reload (pre-flight correctly short-circuits when model + ctx already match)
  • Open Settings modal on a running instance: verify Active Model, Context Size, and Memory Warnings sections render and persist round-trips

Out of scope

Two related improvements deferred to follow-ups:

  • Lemonade's admin API lives on port 13305 in v10.x but GAIA still defaults to localhost:8000. Users currently need LEMONADE_BASE_URL=http://localhost:13305/api/v1 as an env var. A port probe would remove that step.
  • _try_reload_with_ctx reloads whatever model is currently loaded, not the target model. Harmless after the pre-flight fix above, but worth cleaning up.

@github-actions bot added the agents, llm (LLM backend changes), tests (Test changes), electron (Electron app changes), and performance (Performance-critical changes) labels on Apr 18, 2026
kovtcharov added a commit that referenced this pull request Apr 18, 2026
… + validate ctx_size

Follows up on the architecture review of PR #802 with three polish
fixes; no behaviour change vs the previous commit on main agents
paths except the `ctx_size` validation.

1. **Single source of truth for the 32K context requirement.**
   ``DEFAULT_CONTEXT_SIZE`` now lives in ``gaia.llm.lemonade_client``
   and is re-exported by ``lemonade_manager`` for backwards compat.
   The router's ``_MIN_CONTEXT_SIZE`` is aliased to it. Eliminates
   the three side-by-side copies (each carrying a "must match ..."
   comment — exactly the smell review flagged as drift-prone).

2. **Extract ``_session_agent_kwargs`` helper** in ``_chat_helpers``.
   The four ChatAgent/registry.create_agent call sites used to
   repeat the same 4-field bundle (rag_documents, library_documents,
   allowed_paths, ui_session_id). Centralising it means adding a
   new field — or forgetting one, which is what bit us last time
   and caused Chat Lite's PermissionError spiral — happens in one
   place. Unknown kwargs are still filtered by the per-agent
   factory, so this remains safe for manifest agents that don't
   recognise all fields.

3. **Validate ``LoadModelRequest.ctx_size > 0``.** Manual testing
   showed ``ctx_size: -1`` and ``ctx_size: 0`` were silently
   accepted by the endpoint and then failed deep in Lemonade with
   no actionable error. ``Field(None, gt=0)`` now surfaces a 422
   at the boundary with a readable message.
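
   A minimal reproduction of that boundary check (the ``Field(None, gt=0)``
   line is quoted from the change; the surrounding model is a sketch):

       from typing import Optional
       from pydantic import BaseModel, Field

       class LoadModelRequest(BaseModel):
           model: str                                   # illustrative field
           ctx_size: Optional[int] = Field(None, gt=0)  # None means "use default"; 0/-1 rejected

       # LoadModelRequest(model="x", ctx_size=0) raises ValidationError,
       # which FastAPI surfaces as a 422 with a readable message.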

Verified with the existing 153-test suite and a live end-to-end
sweep over the chat-lite agent on a Mac: cold auto-load, warm
reuse, agent switch, parallel sessions (3 simultaneous), ctx
reload 32K→64K, long 6K-token input, empty message, malformed
load-model payload — all pass.
@kovtcharov kovtcharov self-assigned this Apr 18, 2026
@kovtcharov kovtcharov requested a review from itomek-amd April 18, 2026 04:47
@github-actions (Contributor)

Summary

Ships the Gaia Lite agent (ChatAgent + 4B model for low-memory hardware) and three Settings controls (Active Model override, Context Size picker, Memory Warnings) — purely additive and well-bounded. The real gem is the _maybe_load_expected_model fix in src/gaia/ui/_chat_helpers.py:407-467: the old short-circuit ("any LLM loaded → skip") silently truncated ChatAgent's >7K-token prompt when Lemonade had auto-loaded a model at 4096 ctx, producing empty streams. The new check (right model and ctx ≥ 32K) is the only reason Gaia Lite can ever return a response. Strong test coverage (10 new tests), clean consolidation of DEFAULT_CONTEXT_SIZE into a single source of truth, thoughtful comments at every non-obvious branch.

Issues Found

🟢 Minor

Silent except Exception in _canonical_agent_type (src/gaia/ui/_chat_helpers.py:93-106)

canonical_id is a pure dict lookup on a string — it physically cannot raise unless agent_type is unhashable, which the type hint already rules out. The blanket except Exception: return agent_type is silent degradation that CLAUDE.md's "fail loudly" rule tries to eliminate. Either drop the try/except entirely or narrow it:

def _canonical_agent_type(agent_type: str) -> str:
    """Resolve legacy agent-type aliases (e.g. ``chat-lite`` → ``gaia-lite``).

    Keeps the per-session agent cache from thrashing when a client mixes the
    old and new IDs within the same session — both resolve to the same
    canonical ID and therefore the same cache entry.
    """
    registry = _agent_registry
    if registry is None:
        return agent_type
    return registry.canonical_id(agent_type)

memory_available_gb defaults to 0.0 in src/gaia/ui/models.py:25

If psutil ever fails to populate memory_available_gb (e.g. an exception path in system_status), the default 0.0 makes a.min_memory_gb > 0 trip the Memory Warnings banner for every agent with a declared requirement — a confusing false-positive. Cheap guard: check a sentinel (status.memory_available_gb > 0) inside the filter in SettingsModal.tsx:549-551, or make the backend field Optional[float] = None and gate rendering on != null.

                        const warnings = status.memory_available_gb > 0
                            ? agents.filter(
                                (a) => a.min_memory_gb != null && a.min_memory_gb > status.memory_available_gb,
                            )
                            : [];

Source-shape regex test is fragile (tests/unit/chat/ui/test_chat_helpers.py:1493-1525)

The regex-over-source assertion for "streaming branch passes rag_file_paths=[]" will break if a future refactor adds a line break inside the call (e.g. a formatter change), even though the behavior is still correct. The comment acknowledges the tradeoff, but you could get the same ratcheting effect by monkeypatching registry.create_agent and asserting on the received kwargs — no mock for the full SSE/Lemonade stack required. Keep as-is if the cost/benefit feels right; flagging because a false CI red in 6 months will not be obvious.
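
A sketch of that alternative; the entry point and import paths are assumptions for illustration:

import pytest
from gaia.agents import registry           # assumed import path
from gaia.ui import _chat_helpers          # assumed module under test

def test_streaming_passes_empty_rag_file_paths(monkeypatch):
    received = {}

    def fake_create_agent(agent_type, **kwargs):
        received.update(kwargs)
        raise RuntimeError("captured")     # stop before the SSE/Lemonade stack runs

    monkeypatch.setattr(registry, "create_agent", fake_create_agent)
    with pytest.raises(RuntimeError):
        _chat_helpers.start_stream("session-1", "hi")  # hypothetical entry point
    assert received.get("rag_file_paths") == []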

Memory threshold comment vs. constant (src/gaia/agents/registry.py:102-109)

The 5 GB floor is justified as "Q4_K_M weights ~2.5 GB + context + runtime headroom". Nice rationale — consider lifting 5.0 to a module constant (_GAIA_LITE_MIN_MEMORY_GB = 5.0) so the number and the justification live together and don't drift if a bigger checkpoint is added to the models list later.

update_settings null-vs-empty-string docstring (src/gaia/ui/routers/system.py:430-435)

The new docstring is correct in its current wire format, but Pydantic absolutely can distinguish null from "unset" if the field uses Optional[str] = Field(default=None) + checks "custom_model" in request.model_fields_set. The comment bakes in a client-side workaround ("send \"\"") that future maintainers might puzzle over. Consider a one-line "see SettingsUpdateRequest for why" pointer, or clarify that this is a GAIA choice, not a Pydantic limitation.
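
What that distinction looks like in Pydantic v2 (the request model here is a sketch, not GAIA's actual SettingsUpdateRequest):

from typing import Optional
from pydantic import BaseModel, Field

class SettingsUpdateRequest(BaseModel):
    custom_model: Optional[str] = Field(default=None)

explicit_null = SettingsUpdateRequest.model_validate({"custom_model": None})
assert "custom_model" in explicit_null.model_fields_set                # sent null: clear the override
assert "custom_model" not in SettingsUpdateRequest().model_fields_set  # never sent: leave as-is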

Strengths

  • Single source of truth for DEFAULT_CONTEXT_SIZE — consolidated in src/gaia/llm/lemonade_client.py:100-108 and re-exported from lemonade_manager and routers/system. Exactly the kind of consolidation that prevents the "ctx ceiling drifted between modules" bug class.
  • _session_agent_kwargs helper in src/gaia/ui/_chat_helpers.py:342-365 dedupes four call sites that were already a foot-gun (the non-streaming-vs-streaming rag_file_paths=[] asymmetry is now documented at the call sites rather than being a silent timebomb).
  • Legacy chat-lite alias (_LEGACY_ID_ALIASES in registry.py:107-113) avoids a DB migration for persisted sessions, is plumbed through get, create_agent, and resolve_model (the last one is easy to miss and has a specific regression test), and the alias map is trivial to extend on the next rename.
  • Manifest factory kwarg filtering (registry.py:428-449) correctly identifies that the dynamic class's **kwargs forwards to BaseAgent.__init__ and introspects that signature rather than the class's — the comment explicitly explains why MRO-walking is a footgun.
  • ctx_size: Optional[int] = Field(None, gt=0) in LoadModelRequest (system.py:468) — boundary-level validation that returns a clean 422 instead of a cryptic Lemonade error. Exactly what CLAUDE.md's "fail loudly at the boundary" rule asks for.
  • MessageBubble "Gaia"-prefix strip (MessageBubble.tsx:242-251) — the word-boundary regex correctly handles both "Gaia Lite" and a hypothetical "Gaiadocs"; the inline comment earns its place.

Verdict

Approve with suggestions — no blocking issues. The except Exception and the memory_available_gb = 0.0 default are real but low-severity, and the other flags are nice-to-haves. Ship it after the author decides which suggestions to apply.

@kovtcharov (Collaborator, Author)

@claude troubleshoot why our CI/CD pipeline is failing.

@github-actions (Contributor)

CI/CD Failure Triage for PR #802

Two of the required checks are red on commit 6fd1e6f (run list):

🟡 1. Code Quality (Lint) — 2 issues

Job log

  1. Black formatting: src/gaia/mcp/mcp_bridge.py (the "error": {…} block around L332). Fix: python util/lint.py --black --fix
  2. Agent conventions: schemas/agent-manifest.schema.json is stale vs. the manifest model (this PR touches AgentInfo.min_memory_gb and resolve_model in registry.py). Fix: python util/gen_manifest_schema.py

Note: the manifest-schema regeneration is the genuine tell here — the schema generator wasn't re-run after min_memory_gb landed, so it drifts from the runtime Pydantic model. That's the same drift the existing linter is designed to catch.

🟡 2. Unit Tests — 6 failures, all caused by this PR

Job log: 6 failed, 1484 passed, 16 skipped

All six are stale tests that still encode the old short-circuit behavior, not bugs in the new code. They directly contradict the "Pre-flight load handles wrong-model + small-ctx" thread in the PR description:

tests/unit/test_chat_preflight.py (4 tests) — pre-flight now requires the right model and ctx ≥ 32K, not just "any LLM is loaded":

  • test_llm_active_skips_load — asserted LemonadeClient not called; actual call: load_model('Qwen3.5-35B-A3B-GGUF', ctx_size=32768, prompt=False)
  • test_vlm_active_skips_load — same
  • test_concurrent_second_thread_skips_load — same
  • test_sse_not_called_on_fast_path — asserted _emit not called; actual calls: "Loading LLM model..." + "Could not auto-load LLM..."

tests/unit/test_lemonade_model_loading.py (2 tests): _ensure_model_loaded now passes ctx_size=32768 for models not in the built-in MODELS registry (the 32K fallback described in the PR):

  • test_calls_load_when_model_not_loaded — expected ctx_size=None, got ctx_size=32768
  • test_model_loaded_when_not_present — same

Suggested fix plan

# 1. Reformat + regen schema
python util/lint.py --black --fix
python util/gen_manifest_schema.py

# 2. Update the 6 stale tests to assert the new behavior:
#    - test_chat_preflight.py: wrong-model or small-ctx case should now expect a reload + SSE emit
#    - test_lemonade_model_loading.py: expected `ctx_size=32768` on models outside MODELS registry
python -m pytest tests/unit/test_chat_preflight.py tests/unit/test_lemonade_model_loading.py -xvs

# 3. Full lint + unit suite before re-pushing
python util/lint.py --all
python -m pytest tests/unit/

All other checks (API Tests, CLI/Linux, CLI/Windows, Agent SDK, RAG, Security, CodeQL, Lemonade Embeddings) are green, so scope is narrow.

@github-actions bot added the mcp (MCP integration changes), cli (CLI changes), and eval (Evaluation framework changes) labels on Apr 26, 2026
@kovtcharov removed the agents label on Apr 26, 2026
@github-actions bot added the devops (DevOps/infrastructure changes), agents, and dependencies (Dependency updates) labels on Apr 26, 2026
@kovtcharov (Collaborator, Author)

@claude review this PR.

@github-actions (Contributor)

PR Review — feat(agents): add Chat Lite + Settings model/ctx/memory controls

I reviewed the current state of this PR (HEAD 2c967a4, 31 commits, +6857/-2219 across 68 files). The Gaia Lite + Settings work itself looks solid — the previous review covered that scope well. But the PR has materially changed since then and I see three blockers plus a serious scope concern.


🔴 Blockers (must fix before merge)

1. eval/runner.py is Windows-incompatible — breaks 50+ tests.

src/gaia/eval/runner.py:20 unconditionally imports fcntl for the new process-wide lock (commit 404ee397). fcntl is POSIX-only, so on Windows the module load itself fails with ModuleNotFoundError: No module named 'fcntl' and every test that imports gaia.eval.runner errors out. The Test Eval Tool (Windows) job shows ~50 failures, all with this exact stack.

The lock comment already acknowledges "better degraded than dead" for the OSError path on /tmp — apply the same logic to platform: if sys.platform == "win32": yield; return, or use msvcrt.locking() for cross-platform support. As written, this PR cannot ship to Windows users.
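
A minimal sketch of that guard, assuming the lock is a contextmanager (a real cross-platform fix might prefer msvcrt.locking):

import sys
from contextlib import contextmanager

if sys.platform == "win32":
    fcntl = None   # POSIX-only module; the unconditional import is what breaks Windows
else:
    import fcntl

@contextmanager
def eval_lock(path="/tmp/gaia-eval-agent.lock"):
    if fcntl is None:   # Windows: degrade to a no-op lock ("better degraded than dead")
        yield
        return
    with open(path, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield           # flock is released when the file closes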

2. schemas/agent-manifest.schema.json is stale — lint job fails.

Code Quality is red on the same issue I flagged on Apr 20: the min_memory_gb field landed in the Pydantic model but the generated schema wasn't regenerated. One-shot fix:

python util/gen_manifest_schema.py
git add schemas/agent-manifest.schema.json && git commit -m "chore: regenerate manifest schema"

3. Merge conflicts against main.

gh pr view reports mergeable: CONFLICTING. The last merge from main was dbf0100a, on day 12 of this branch's life; a fresh rebase will surface the conflicts and let you resolve them deliberately rather than at merge time.


🟡 Important — scope creep is hiding the real review surface

The PR title and description still describe the original 6 threads (Gaia Lite agent, 3 Settings controls, expected_model_loaded, pre-flight load). The actual diff now contains at least 11 additional themes that aren't mentioned:

| Theme | Evidence | LoC |
| --- | --- | --- |
| Synthetic eval corpus generator | eval/corpus/gen_real_world.py + 19 docs + 4 .xlsx | +1303 |
| --agent-type flag for eval | src/gaia/eval/runner.py | ~200 |
| Process-wide eval lock (the fcntl one) | src/gaia/eval/runner.py | ~200 |
| Tool-call envelope recovery + context-overflow retry | src/gaia/agents/base/agent.py | ~540 |
| Typed Lemonade errors + auto-retry | src/gaia/llm/providers/lemonade.py | ~160 |
| UI polish (cards, sidebar, ctx badge, drag-drop, varied greetings) | src/gaia/apps/webui/src/components/* | ~700 |
| Live model-download progress + auto-load + banner priority | src/gaia/apps/webui/src/components/ConnectionBanner.tsx | ~140 |
| Removed PyPI wheel-dist verification gate | .github/workflows/pypi.yml, util/verify_wheel_dist.py | −392 +33 |
| Removed AGENTS.md (168 lines) | repo root | −168 |
| Removed 6 test modules: test_lemonade_manager_preload.py, test_init_ctx_size.py, test_export_import.py, test_code_index_sdk.py, test_ui_extras.py, test_verify_wheel_dist.py | tests/unit/ | −981 |
| Docs edits across agent-ui, code-index, custom-agent, patterns, plugin-registry, setup-wizard | docs/ | ~250 |

The reliability/UI polish work is mostly good in isolation — but I cannot do an honest review of "Chat Lite + Settings controls" when the diff also silently removes the PyPI wheel-dist gate (which was added specifically to prevent the webui-bundling regression that motivated docs/sdk/sdks/agent-ui.mdx) and 6 test modules totaling ~1000 LoC.

Two things would unblock review:

  1. Update the PR description to enumerate every theme with a why, especially the deletions:
    • Why is verify_wheel_dist gone? Has the underlying regression class been ruled out?
    • Why is AGENTS.md deleted while CLAUDE.md still references AI-agent conventions?
    • Why were those 6 test modules removed — were the modules-under-test deleted, or just the tests?
  2. Or split — at minimum, separate the eval corpus + reliability fixes into their own PRs. The original Chat Lite scope is small and reviewable; bundled this way it isn't.

🟢 What's still good

  • The original 6 threads still hold up; suggestions from the Apr 20 review (the silent except Exception in _canonical_agent_type, the memory_available_gb = 0.0 default, the source-shape regex test) are worth checking but none are blockers.
  • Test additions are thorough — tests/unit/agents/test_parse_error_recovery.py (+185), the registry tests (+566), and the chat-helpers expansion (+114) are all well-targeted.
  • The _chat_helpers.py consolidation around _session_agent_kwargs and _maybe_load_expected_model (right model and ctx ≥ 32K) is the real load-bearing fix and it's well-commented.

Suggested path forward

# 1. Fix Windows fcntl
# In src/gaia/eval/runner.py, guard the import + helpers behind sys.platform.

# 2. Regenerate schema
python util/gen_manifest_schema.py

# 3. Rebase and re-run CI
git fetch origin && git rebase origin/main

# 4. Update PR description to match the actual diff (or split)

Once Windows + lint are green and the description matches the diff, this is reviewable for merge. Tagging @kovtcharov-amd for the scope/split decision since the deletions of AGENTS.md, the PyPI gate, and ~1000 LoC of tests are call-it-out-loud changes that need a maintainer eyes-on regardless of how the rest of the review goes.

itomek pushed a commit that referenced this pull request Apr 29, 2026
… + validate ctx_size

@itomek force-pushed the feature/mac-4b-default branch from 2c967a4 to 5f51005 on April 29, 2026 18:09
@github-actions (Contributor)

Summary

Substantively this is a strong, well-reasoned PR — the Chat Lite agent + Settings controls solve a real "35B won't load on this box" UX hole, and the pre-flight model + ctx≥32K fix is the right diagnosis for the silent-truncation bug. Inline comments are unusually good at explaining why (legacy-alias rationale, ctx-size floor reasoning, Gemma-vs-Qwen platform split). Most of my notes are about scope/description hygiene rather than the code itself.

Issues

🟡 Important

1. PR description is stale relative to the actual implementation.

The description claims the agent ID is chat-lite and the primary model is Qwen3-4B-Instruct-2507-GGUF. The code actually ships:

  • Agent ID is gaia-lite (chat-lite is preserved as a legacy alias for stored sessions).
  • Primary models are Qwen3.5-4B-GGUF (Darwin) and Gemma-4-E4B-it-GGUF (Linux/Windows), with platform-conditional ordering.

src/gaia/agents/registry.py:212-258 and the _LEGACY_ID_ALIASES plumbing are the right design, but the PR description is what reviewers and future archaeologists will read first — please bring it in line with reality before merge.

2. Scope is much wider than the description signals.

The description lists ~6 threads centered on Chat Lite + Settings. The diff also contains, none of which are mentioned:

  • Typed Lemonade error hierarchy (LemonadeError/LemonadeModelNotLoadedError/LemonadeContextOverflowError/LemonadeNetworkError) with _classify_chat_exception walker and one-shot retry-after-reload in both streaming and non-streaming paths.
  • Tool-call parse-error recovery loop in agents/base/agent.py (synthetic recovery prompt, 3-attempt cap, friendly fallback) + new tests/unit/agents/test_parse_error_recovery.py.
  • Context-overflow _shrink_messages_for_overflow + retry-after-trim.
  • eval/runner.py fcntl-based single-runner lock with stale-PID reclaim and GAIA_EVAL_NO_LOCK bypass.
  • system.py SSE-consuming download-progress streaming + _auto_load_after_download + new DownloadProgress model.
  • Auto-titling background task in _chat_helpers.py (Jaccard-overlap heuristic, 30 s throttle, fire-and-forget httpx call).
  • WebUI red-color/cursor redesign (~10 CSS files, MessageBubble/Sidebar/ConnectionBanner restyle).
  • New eval/corpus/gen_real_world.py (1,303 lines) + 19 generated documents + 4 xlsx files + 4 baseline/postfix eval/results/*.json snapshots.

Each of those has its own correctness surface. A reviewer reading the description will skip them; a future bisector tracking down (say) an eval-lock or auto-title regression will not realize this PR is the introduction. Either split or — more pragmatically given the size — expand the description to thread-list every distinct change with a one-line why. This is exactly the case CLAUDE.md's "PR Descriptions — Tight and Value-Focused" rule warns about: "If the PR really does bundle many threads, group them — don't list 16 commits."

3. tunnel-friendly-error.png (147 KB) committed at the repo root.

pr-files.txt:71 adds a binary screenshot with no references in code, docs, or .gitignore updates that would explain it. Looks like a debugging artifact that slipped in alongside pr-diff.txt / .claude-pr/. Please drop it before merge.

# (delete tunnel-friendly-error.png from the worktree before push)

🟢 Minor

4. uv.lock quietly drops the Python floor from 3.13 → 3.12 (uv.lock:3) without a corresponding pyproject.toml/setup.py change. Either it's intentional (in which case python_requires in the source-of-truth metadata should match, and the change deserves a one-liner in the description because it broadens supported Python versions) or it slipped in from a local-env bisect (in which case revert).

5. _classify_chat_exception walks __cause__ only, not __context__ (src/gaia/ui/_chat_helpers.py:117-121). Implicit chains from a bare raise ... inside an except block (no from e) won't be matched by the typed-isinstance pass. The substring fallback at :126 covers most of the gap, but if you ever rely on the typed-instance metadata (e.g. LemonadeContextOverflowError.retryable), an implicit chain will skip that branch.

    cur: Optional[BaseException] = exc
    while cur is not None:
        if isinstance(cur, LemonadeError):
            return cur
        cur = cur.__cause__ or cur.__context__

6. Auto-title task can race the next user turn for the Lemonade slot. _maybe_update_session_title fires _generate_session_title against /v1/chat/completions immediately after a stream finishes, without acquiring model_load_lock or any other coordinator. If the user's next message arrives while titling is in-flight and triggers a reload (e.g. they switched agents in the UI), both calls contend for Lemonade's single inference slot. Not a correctness bug — Lemonade serializes — but it can stretch the user-visible "thinking" delay by however long the title generation takes. Worth either gating titling behind the same lock or short-circuiting it when a load is pending.

7. eval/runner.py _acquire_eval_lock writes the PID after stale-lock reclaim, but if the os.write fails the lock fd stays open and the next failure message will print holder=-1. Cosmetic — the lock itself works — but worth a try/except around the PID stamping so the diagnostic stays useful.

8. gaia_lite_factory silently filters unknown kwargs via valid_fields (registry.py:229-235). This is the established pattern across the codebase, so I'm not asking you to change it here, but it does mask kwarg typos at call sites. Consider a follow-up that logs at debug when a kwarg is dropped — would catch a class of silent-config-bug.

Strengths

  • Inline rationale is excellent. The legacy-alias docstring (registry.py:78-96), the platform-split block (:200-215), the _GAIA_LITE_MIN_MEMORY_GB = 5.0 derivation (:217-224), and the 32K-fallback comments in lemonade_client.py all explain why, not what. This is the kind of comment that ages well — preserve this discipline.
  • Single-source-of-truth discipline. canonical_id() resolving aliases through one path; _GAIA_LITE_MODELS[0] driving both UI and factory preset; DEFAULT_CONTEXT_SIZE re-exported instead of re-declared. These are the right moves and they paid off in test cleanup.
  • Test coverage is real, not nominal. test_parse_error_recovery.py, the alias-resolution suite in test_registry.py, the test_right_model_wrong_ctx_triggers_reload case, and the test_known_model_uses_registry_ctx_size regression guard all hit failure modes that would otherwise return as production bug reports. Mocks are at sensible boundaries.
  • "Fail loudly" rule respected. _canonical_agent_type propagates AttributeError rather than swallowing it; typed Lemonade errors carry actionable user_message text; pre-flight reload emits a visible "Model reloaded — retrying…" SSE.

Verdict

Approve with suggestions.

The blockers are documentation, not code: please update the PR description (Issue 1) and drop tunnel-friendly-error.png (Issue 3) before merging. The uv.lock Python-version question (Issue 4) is worth a one-line clarification in the description either way. Everything else is non-blocking — minor robustness and ergonomic notes.

@itomek (Collaborator)

itomek commented Apr 29, 2026

Cross-platform local validation

Every push to this branch was validated on both local Linux (Docker) and local Windows (SSH) before going out — so we don't burn an hour of CI to find platform-specific regressions.

Workflow

  1. Commit the working tree, then bundle the rebased branch + commit: git bundle create /tmp/pr802.bundle origin/main..HEAD.
  2. Transfer the bundle to the test box (docker cp for Linux, scp for Windows).
  3. On the test box: git fetch /tmp/pr802.bundle "HEAD:refs/heads/test/pr802-…" && git checkout test/pr802-… (a throwaway branch — never touches the box's existing checkout).
  4. Run the same gates CI runs:
    • python util/lint.py --black --isort --pylint --flake8
    • python -m pytest tests/unit/ tests/test_eval.py

Results for b2a952f0

| Platform | Lint | tests/test_eval.py | tests/unit/ |
| --- | --- | --- | --- |
| macOS host | ✅ clean | ✅ 100 / 100 | ✅ 1708 passed, 15 skipped, 0 failed |
| Linux container | ✅ clean | ✅ 100 / 100 | ✅ 1695 passed, 23 skipped, 0 failed |
| Windows (real win32) | n/a | 100 / 100 | 1695 passed, 15 skipped, 13 failed* |

*The 13 Windows-only failures are pre-existing Linux-specific tests (os.geteuid mock patches, sys.platform="linux" monkeypatches) in tests/unit/test_init_command.py::TestInstallViaPpa (7 tests, Ubuntu-PPA installer paths) and tests/unit/installer/test_uninstall_command.py (6 tests, POSIX path/permission paths). None have skipif win32 guards. CI's Run Unit Tests job is Linux-only, so it doesn't surface them. These tests would also fail against origin/main on Windows — not a regression from this PR.
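
For reference, the guard those tests lack is a one-line pytest marker (test name below is hypothetical):

import sys
import pytest

@pytest.mark.skipif(sys.platform == "win32", reason="POSIX-only: patches os.geteuid")
def test_install_via_ppa_paths():
    ...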

What this catches

  • Linux-specific import paths and lazy-import side effects. The new if sys.platform == "win32": fcntl = None; else: import fcntl guard in eval/runner.py was verified to import cleanly on Linux and on real Windows (platform=win32, fcntl=None) before pushing — the original blocker fix that made all of tests/test_eval.py go green on Windows.
  • File-system case-sensitivity bugs that pass on macOS but fail on Linux.
  • Lockfile resolution differences between hosts.
  • Real-Windows behaviour the macOS-side monkeypatch.setattr(runner, "fcntl", None) test simulates — verified end-to-end here, not just by the GitHub Actions Windows runner.

One-time setup quirks

  • Linux: project venv was missing pyfakefs (declared in setup.py's dev extras), making tests/unit/installer/test_uninstall_command.py ERROR on collection. uv pip install pyfakefs clears it.
  • Windows: tests/unit/chat/ui/test_*.py needs the [ui] extra (FastAPI) on top of [dev]. uv pip install -e ".[dev,ui]" from a fresh venv is the one-shot recipe.

Chat Lite is a 4B-model sibling of the built-in Chat Agent for
hardware that cannot host the 35B default (Macs, low-memory boxes).
Same tools, same system prompt, just pinned to Qwen3-4B-Instruct-2507
via the ChatAgent config. ChatAgent and its registration are
untouched.

Agent metadata gains `min_memory_gb`, exposed via `/api/agents`.
Settings renders a Memory Warnings section when any registered agent
declares a requirement above the free memory reading.

Two Settings additions:
  * Active Model — text override bound to the existing `custom_model`
    setting. Empty means "use agent default".
  * Context Size — preset chips (4K/8K/16K/32K) + numeric input that
    reloads the active model with the chosen ctx via the existing
    `/api/system/load-model` endpoint.

Two correctness fixes made Chat Lite actually usable end-to-end:

  * `/api/system/status` previously flagged any non-35B model as
    "Wrong model loaded". It now accepts any registered agent's
    preferred model, so Chat Lite's 4B doesn't trip the banner.

  * `_maybe_load_expected_model` short-circuited when any LLM was
    active, even the wrong one. It now requires the specific expected
    model with ctx >= 32K, otherwise reloads. Without this, Lemonade
    auto-loaded requested models at its default 4096 ctx, silently
    truncated ChatAgent's >7K-token system prompt, and returned an
    empty stream. `_ensure_model_loaded` also falls back to 32K when
    the model is not in the built-in MODELS registry.

Tests: 10 new unit tests covering registration, factory presets,
min_memory_gb propagation through manifests and the agents API, plus
a coexistence check that Chat Agent's defaults stay unchanged.
Frontend `npm run build` passes with 0 new warnings.
kovtcharov and others added 22 commits April 29, 2026 15:57
Two related fixes for the same eval failure mode (Qwen 4B getting
``finish_reason=length`` mid-tool-call on the no_sycophancy scenario):

1. **Output budget too tight.** ``AgentConfig.max_tokens`` was 4096,
   which Qwen3.5-4B exhausted while serialising long tool-call argument
   strings (1000+-char ``summary_type`` blob in the case the eval
   surfaced). With our 32K ctx_size and a ~7.7K-token system prompt +
   history, 8K of output budget leaves plenty of room and still keeps
   ~24K for input. Going much higher would steal from input-history
   budget without measured gain.

2. **Misleading error message.** When the cap was hit we raised
   "Increase --ctx-size for model X" — but ``finish_reason=length``
   from the OpenAI completions API specifically signals the *output
   token cap*, not the context window. ctx_size and max_tokens are
   separate limits, and conflating them sent users off chasing the
   wrong knob (the user had already loaded at 32K ctx). The new error
   names ``AgentConfig.max_tokens`` directly.

Both are agent-runtime changes (affect ChatAgent generally), not
gaia-lite specific. Test suite green; full registry + chat-ui suites
pass with no behaviour-coupled assertions to update.
…op session links

Bundles four user-asked tweaks discovered while iterating on gaia-lite:

UI / Agent UI
  * ChatView.tsx — chat header model badge now shows the loaded
    context window inline ("Qwen3.5-4B-GGUF · 32K"). Title attr carries
    the full "context window: NN,NNN tokens" so a hover gives the
    precise number. Mismatched ctx (e.g. eval reload at 4K) is now
    visible at a glance instead of surfacing only as a chat-time error.
  * ChatView.tsx + Sidebar.tsx — dropped the "#abc1234" session-link
    pills from both the chat header and every sidebar row. They were
    clipboard-copy hash badges; nobody used them and they ate the row.
  * MessageBubble.tsx — every message bubble (user + assistant) now
    has a ``title`` tooltip with the full absolute send timestamp.
    Previously only assistant bubbles had this via the stats footer.
  * ChatView.css — small ``.model-ctx-size`` rule so the suffix is
    dimmer than the model name (eye lands on model first).

Chat agent personality
  * src/gaia/agents/chat/agent.py — replaced the 2-example "RIGHT"
    list in the GREETING RULE with 10 varied openers + an explicit
    "VARY YOUR PHRASING" rule. The original prompt's recency bias
    pinned the model on a single canned "Hey! What are you working
    on?" every conversation; the rotation list breaks that pattern
    while keeping the warm/curious/no-feature-pitch invariant.
    Added a new WRONG example calling out the lock-in behaviour
    explicitly so the model can self-detect it.

Eval discipline
  * CLAUDE.md — new rule: "Run agent evals SERIALLY, never in
    parallel." Today's session lost three eval runs to two
    concurrent ``gaia eval agent`` invocations race-evicting each
    other's models out of Lemonade's single-tenant slot, surfacing
    as bogus 4K-ctx errors and INFRA_ERROR. Documents the exact
    failure modes + the ``ps aux | grep "gaia eval" | wc -l`` =
    0 sanity check before kicking off a new run.

Backend changes confined to comments + system-prompt strings —
agent runtime contracts unchanged. 212 unit tests pass; lint clean;
frontend bundle rebuilt.
The runaway-eval failure mode the user kept hitting today: a parent
agent in --fix mode shells out a second `gaia eval agent ...` while the
first is still running, both invocations talk to the same Lemonade
Server, Lemonade has a single-tenant LLM slot, the runs race-evict each
other's models, and the whole thing surfaces as nondeterministic
`n_ctx=4096` overflows or `model_load_error: llama-server failed to
start`. CLAUDE.md now documents the rule; this commit enforces it.

src/gaia/eval/runner.py
  * `_acquire_eval_lock()` — context manager around fcntl.flock on
    /tmp/gaia-eval-agent.lock. LOCK_EX | LOCK_NB so a second invocation
    fails fast (exit 2) with an actionable message naming the holder
    PID and run age, not by hanging.
  * Stale-lock recovery: if the holder PID is gone, the lock file is
    reclaimed automatically (no manual rm needed).
  * Escape hatch: GAIA_EVAL_NO_LOCK=1 skips the guard for unit tests
    or callers that genuinely manage Lemonade out of band.
  * AgentEvalRunner.run() now wraps the per-scenario loop in the
    lock; audit_only mode skips it (no Lemonade contention).
    Body extracted to _run_locked() so the wrapper stays thin.
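
A hedged sketch of the lock's shape (approximates _acquire_eval_lock; the real
message and error handling are richer). Since the kernel releases a flock when
its holder dies, a stale PID file alone never blocks; the stamp exists so the
fail-fast diagnostic can name the holder:

import os
import fcntl
from contextlib import contextmanager

LOCK_PATH = "/tmp/gaia-eval-agent.lock"

@contextmanager
def acquire_eval_lock():
    if os.environ.get("GAIA_EVAL_NO_LOCK") == "1":      # escape hatch
        yield
        return
    fd = os.open(LOCK_PATH, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # fail fast, never hang
    except BlockingIOError:
        holder = os.read(fd, 32).decode() or "unknown"
        os.close(fd)
        print(f"another gaia eval run holds the lock (pid {holder})")
        raise SystemExit(2)                             # exit 2, as described above
    os.ftruncate(fd, 0)
    os.write(fd, str(os.getpid()).encode())             # stamp holder PID for diagnostics
    try:
        yield
    finally:
        os.close(fd)                                    # closing the fd releases the flock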

UI follow-on (separate concern, same touch):

src/gaia/apps/webui/src/components/AgentActivity.css
  * The "N steps · M tools" expand-activity bar used to render as a
    full-width terminal panel with border + uppercase 11px text on
    every assistant turn — visually loud for content the user rarely
    expands. Now it's an inline 10px chip at 0.55 opacity, expanding
    to full opacity + bg + border on hover/focus, and forced to full
    opacity when the run is active or has errors (states the user
    actually needs to notice).

Verification: subprocess test confirms two concurrent invocations
produce exit-2 + clear error in the second one. Lint clean.
Every GAIA reply was rendering with the same vertical AMD-red border
+ red-tinted background as ``.msg-error``, so a normal multi-turn chat
read as a stack of warnings. Reserved that visual language for actual
errors only.

src/gaia/apps/webui/src/components/MessageBubble.css
  * .msg-assistant: dropped the ``border-left: 2px solid var(--amd-red)``
    that was duplicating .msg-error's left rail. Assistant messages now
    distinguish themselves from user messages purely via the elevated
    panel background, the GAIA avatar + name in the header, and the
    left-aligned text direction. Subtle is the point — a long thread
    should read as a calm conversation, not warnings.
  * Removed the ``[data-theme="dark"] .msg-assistant`` override that
    layered ``rgba(237, 28, 36, 0.02)`` (red haze) on top of the
    already-neutral --bg-assistant-msg variable in dark mode.
  * Avatar dark-mode treatment: ``border-color: rgba(237,28,36,0.25)``
    and ``box-shadow: 0 0 8px rgba(237,28,36,0.12)`` (red glow) →
    neutral ``var(--border)`` + a 1-px white-tinted ring. Brand red
    is still present in the small "GAIA" name text + the user-avatar
    fill; just not bleeding into every UI surface.

Error styling unchanged — .msg-error still owns the AMD-red left
border + red background tint, and is now the only place that visual
appears so it actually stands out.

Pure CSS — no behavioural change. Bundle rebuilt; refresh to see.
Three connected polish passes on the assistant bubble:

src/gaia/apps/webui/src/components/MessageBubble.css
  * **Rounded card.** .msg-assistant was a hard-edged horizontal strip;
    now it's a 14-px-radius panel that visually sits *inside* the chat
    column instead of cutting it horizontally. ``width: calc(100% - 16px)``
    inset so the rounding actually shows.
  * **3D depth.** Layered box-shadow (4 stops: inset top-edge highlight,
    crisp 1-px contact, soft mid-distance, wider ambient) for a subtle
    elevation. Hover lifts ~1 px and softens the ambient — small enough
    not to read as a "click me" CTA, big enough to feel responsive.
    Dark-mode variant uses heavier alphas (0.30–0.55) because shadows
    on near-black backgrounds disappear at typical light-mode values.
  * **Softer fade-in.** Replaced the 200 ms slide+scale with a 380 ms
    fade+drift+deblur (assistant: 450 ms). cubic-bezier(0.22, 0.61, 0.36, 1)
    eases out smoothly. Dropped the 0.99 scale step — scaling rounded
    cards causes brief subpixel blur on retina. Added a 2-px filter:blur
    transition that mimics an "image developing" effect on entry —
    very subtle but adds polish without slowing perception.

User messages unchanged: still flat, transparent, right-aligned. Only
the GAIA reply panel is the rounded 3D card. Shadow tokens are pure
CSS — no extra DOM, no JS, ~zero runtime cost.
Same direction as the GAIA message card (b5289a3): rounder, lighter,
less red bleed, subtle elevation on the active state.

src/gaia/apps/webui/src/components/Sidebar.css

  * Session items
    - radius-md (8px) → radius-lg (10px) for softer corners
    - Padding tweaked +1px vertically so rows don't feel crammed
    - Hover now adds a faint border-light ring in addition to the
      bg-hover fill — gives shape to the row before clicking
    - Active row's box-shadow was a 10-px red glow + inset 28-px
      red haze; replaced with a calm multi-layer (top highlight +
      contact + mid) elevation in light mode, deeper black-alpha
      shadows in dark mode. Brand-red is still present in the
      left-edge indicator and a 6%-alpha bg tint, but no longer
      bleeding into the surrounding sidebar.
    - Press-feedback scale tightened from 0.982 → 0.985 (less
      "trampoline", more "tap").

  * Search input
    - radius-md → radius-lg, matching the session items
    - Now sits on a tertiary background with an inset hairline
      shadow, reading as gently recessed (the inverse of the
      session-active which reads as raised — coherent stack metaphor)
    - Focus ring: was border 0.4 alpha + 12-px red glow ("warning"
      vibe); now a soft 3-px 0.08-alpha brand-red halo + 0.35-alpha
      border. Reads as "focused", not "errored".

  * Bottom status bar
    - Tiny 1-px inset top highlight in addition to the existing
      border-top — gives a soft "lip" between the scrolling session
      list and the fixed status row, matching the layered-card feel.

Pure CSS; no DOM changes. Bundle rebuilt.
The textarea had a custom 10×18 px solid-AMD-red block cursor with a
0.5-alpha red glow blinking at 1 Hz on every focus. On a screen the
user stares at all session long, that's an obvious "warning indicator"
treatment for a primitive that should be invisible until interacted with.

src/gaia/apps/webui/src/components/ChatView.tsx
  * Removed ``getCaretXY`` helper, ``_computedStyleCache``, the
    ``caret`` state + ``updateCaret``/``setCaret`` callbacks, the
    inline ``caretColor: 'transparent'`` override, and the
    ``<span className="input-cursor" />`` element. The textarea now
    relies on the browser's native caret like every other text input.
  * onFocus/onBlur/onSelect handlers tied only to caret tracking are
    gone; nothing else cared about that state.

src/gaia/apps/webui/src/components/ChatView.css
  * Removed the ``.input-cursor`` rule. ``@keyframes cursorBlink``
    stays in styles/index.css — it's still used by AgentActivity and
    MessageBubble streaming indicators (those are in-line content, not
    always-on UI chrome, so the blink is appropriate there).

Side benefit: drops a per-keystroke ``requestAnimationFrame`` +
mirror-div DOM creation that was running on every input change.
The previous "softened" active state still carried two pieces of the
error visual vocabulary:

  1. A vertical AMD-red gradient bar on the left edge (::before
     pseudo-element) — same shape and color as .msg-error's
     left rail. Active row read as "this row is broken".
  2. In dark mode: rgba(237,28,36,0.06) background + 0.12-α red border
     — selected row looked like a tinted warning band.

Both gone. Active state is now communicated purely via elevation +
neutral background lift, matching the GAIA message-card vocabulary
exactly (the same 4-stop layered shadow: inset top highlight,
contact, mid-distance, ambient).

src/gaia/apps/webui/src/components/Sidebar.css

  * .session-item::before — entire pseudo-element + the hover/active
    transform-scaleY rules removed. The selected row no longer has
    a red strip; selection is the elevation + bg-active.
  * .session-item.active light-mode shadow stack matched 1:1 to
    .msg-assistant for visual coherence.
  * .session-item.active dark-mode background switched from
    rgba(237,28,36,0.06) → rgba(255,255,255,0.04). Border color
    likewise from rgba(237,28,36,0.12) → rgba(255,255,255,0.08).
    Heavier black shadows for the dark-on-dark elevation read.
  * sessionActivateDarkBg keyframe (which animated a red
    background flood + inset red glow on activation) deleted.
    sessionActivate keyframe simplified — the 1.5-px overshoot
    bounce removed for a calmer settle.

Brand red is now reserved for: the small "GAIA" name text in
message headers, the user-avatar fill, and .msg-error. That's it.
…color

Two threads:

(1) GAIA auto-titling (the running chat agent renames its own session).

src/gaia/ui/_chat_helpers.py
  * New ``_maybe_update_session_title`` helper. Fired fire-and-forget
    after every assistant turn finishes; calls the same Lemonade chat-
    completion endpoint to generate a 3-6 word tab title.
  * Trigger rules:
      - Title is one of the defaults (New Chat / New Task / Untitled
        / Chat / empty) → first-response pass
      - User message has ≤ 0.15 word-overlap with current title AND
        is ≥ 25 chars → topic-shift pass (sketched after this commit message)
  * Skipped when title starts with "Eval:" (eval framework owns those)
    and throttled to one update / 30 s / session so concurrent fires
    don't pile up.
  * LLM call uses temperature 0.3 + max_tokens 24, stripping common
    title artifacts ("Title:", quote-wrapping, trailing punctuation)
    that small models add despite the instruction.
  * Background task reference pinned in _active_sse_handlers under a
    ``_titlebg:<sid>`` key so GC doesn't kill it mid-flight; cleared
    in the done callback.

src/gaia/ui/routers/chat.py
  * Same hook in the non-streaming path immediately after
    db.add_message(assistant). Wrapped in try/except so an auto-title
    failure can never block the user's response.

(2) Neutral selection color.

src/gaia/apps/webui/src/styles/index.css
  * ::selection background: rgba(237, 28, 36, 0.18) (AMD red) →
    rgba(56, 132, 255, 0.22) (cool blue). Selecting any text used to
    paint a red highlight on top of red errors / red focus rings /
    formerly-red active session — every interaction read as a warning
    state. Standard cool-blue selection is what users expect.

Tests: 46 chat-ui unit tests pass; lint clean.
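
A minimal sketch of the trigger rules flagged above, assuming whitespace
tokenization (the real helper's tokenizer and default-title set may differ):

DEFAULT_TITLES = {"new chat", "new task", "untitled", "chat", ""}

def should_retitle(user_msg: str, title: str) -> bool:
    if title.startswith("Eval:"):
        return False                           # eval framework owns these titles
    if title.strip().lower() in DEFAULT_TITLES:
        return True                            # first-response pass
    if len(user_msg) < 25:
        return False                           # too short to signal a topic shift
    msg_words = set(user_msg.lower().split())
    title_words = set(title.lower().split())
    overlap = len(msg_words & title_words) / max(len(msg_words | title_words), 1)
    return overlap <= 0.15                     # topic-shift pass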
…lures

The chat path had no recovery for the three failure modes the user
keeps hitting in the UI:
  * "No model loaded: <model>" — Lemonade evicted between turns
  * "request (N tokens) exceeds the available context size (M)"
    — wrong-ctx model load
  * "Network error: CURL Timeout was reached" — Lemonade busy/hung

Each one previously surfaced as a wall of raw JSON inside the chat
bubble and required the user to hand-restart the model. Now:

src/gaia/llm/providers/lemonade.py
  * Added typed exceptions ``LemonadeError``,
    ``LemonadeModelNotLoadedError``, ``LemonadeContextOverflowError``,
    ``LemonadeNetworkError``. Each carries (a) a short ``.user_message``
    suitable for direct chat-bubble rendering and (b) the raw payload
    on ``.payload`` for diagnostic logging. Class-level ``retryable``
    flag declares whether the chat layer should auto-retry.
  * ``_classify_lemonade_response`` walks the response envelope —
    including the nested ``details.response.error`` shape Lemonade
    uses for backend_error wrappers — and returns the matching typed
    exception or a generic LemonadeError when the shape is novel.
  * The single ``raise ValueError(f"Unexpected response format from
    Lemonade Server: {response}")`` at the dict-of-choices guard now
    routes through the classifier, so callers see typed errors instead
    of a string-blob ValueError.

src/gaia/ui/_chat_helpers.py
  * ``_classify_chat_exception`` walks the cause chain AND the
    stringified message text to detect typed errors even when AgentSDK
    wraps the original LemonadeError in a generic ValueError /
    RuntimeError (which it does in several paths).
  * ``_run_agent`` (the streaming worker thread) now does ONE automatic
    retry on retryable Lemonade errors. On a model-not-loaded /
    network-error first failure: forces a fresh ``_maybe_load_expected_model``
    at our 32K ctx, emits a "Model reloaded — retrying..." status SSE
    so the user sees the recovery (not silent retry), and re-runs
    ``agent.process_query``. If THAT fails too, the friendly
    user_message is surfaced instead of the raw exception.
  * The catch-all ``except Exception`` block now prefers
    ``LemonadeError.user_message`` over ``str(e)``, so even
    non-retryable errors (context overflow) come through as a
    plain-English explanation instead of a JSON blob.

Tests: full chat-ui suite (46 tests) passes; lint clean. The retry
itself is exercised end-to-end the next time you trigger a model
eviction — start a fresh chat session and you'll see the recovery
status rather than the wall-of-JSON failure mode.
…t path

Layer 2 + 3 of the chat-resilience push (continuation of 0795a5d).

src/gaia/ui/_chat_helpers.py

  * **Pre-flight ctx_size guard.** ``_maybe_load_expected_model``
    previously checked ``active_ctx and active_ctx < N`` — short-circuit
    evaluation meant a 0-or-missing ctx slipped through, leaving a
    broken model in place. New guard explicitly treats missing ctx as
    "needs reload" so a model loaded with no recipe_options.ctx_size
    (which is exactly the broken state we want to recover from) gets
    re-loaded at our 32K canonical (one-line comparison after this commit message).

  * **Bounded reload timeout.** Pre-flight now passes ``timeout=120``
    to ``LemonadeClient.load_model`` instead of inheriting the default
    DEFAULT_MODEL_LOAD_TIMEOUT (12000 s / 200 min). Cold load of a 4B
    GGUF on consumer hardware fits in <60 s; if it hasn't completed in
    120 s something is genuinely wrong and we'd rather surface the
    failure than block the chat thread for hours.

  * **Non-streaming chat now retries identically to streaming.** The
    one-shot retry on transient Lemonade errors (model evicted between
    turns, network blip) was only in the streaming worker before; the
    non-streaming path raised straight to the user. Mirrored the same
    classify → reload → retry sequence in ``_get_chat_response`` so
    both paths recover the same way.

  * **Friendly error mapping in non-streaming.** The catch-all
    ``except Exception`` in ``_get_chat_response`` was returning a
    stock "I'm having trouble connecting..." string regardless of the
    actual failure mode. Now it prefers the typed
    ``LemonadeError.user_message`` (e.g. "This conversation got too
    long for the model's context window"), falling back to the stock
    copy only when classification fails.

Together with 0795a5d, this closes the three failure modes the user
keeps hitting in the UI:
  * Wrong-ctx model load (now caught by tighter guard, reloaded at 32K)
  * Mid-conversation eviction (caught by retry on both streaming/non-streaming)
  * Lemonade hang during reload (bounded by 120-s timeout instead of 200 min)

Tests: 60 chat-ui + preflight tests pass.
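
The guard bug from the first bullet, in two lines (values illustrative):

active_ctx = 0                                                # broken state: ctx missing/zero
MIN_CTX = 32768

old_needs_reload = bool(active_ctx and active_ctx < MIN_CTX)  # False: 0 is falsy, the bug
new_needs_reload = (active_ctx or 0) < MIN_CTX                # True: missing ctx reloads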
NameError regression introduced in 1a807d2 (auto-titling commit) —
the streaming-path background task at the bottom of
``_stream_chat_response`` called ``_effective_model(agent, model_id)``,
but ``agent`` lives inside the producer thread's local scope and
isn't visible to the outer generator. Every streaming chat turn was
therefore raising NameError after the response otherwise completed,
which the catch-all then surfaced as "Sorry, something went wrong on
my end".

src/gaia/ui/_chat_helpers.py
  * Capture ``session_model = model_id`` in the outer scope right
    after custom-model resolution; reference that in the auto-title
    task instead of the inaccessible ``agent``.
  * Comment explains why the indirection — same model id the
    pre-flight + agent factory used, just made visible to the
    post-stream cleanup.

Verified by re-running a fresh streaming chat turn end-to-end against
gaia-lite: 3.1 s round-trip, ``done=True``, no error events, response
"4" persisted correctly.

Found via the chaos-test harness from the same reliability push —
which was the point of layer 4.
…bling raw error

Small models (4B-class) occasionally emit malformed native tool_calls
envelopes — e.g. a 1000+ char summary_type argument that gets truncated
mid-string. Previously _parse_llm_response raised ValueError uncaught,
the exception bubbled through the chat helper, and the user saw:

  Agent error: Malformed native tool_calls envelope: Expecting ','
  delimiter: line 1 column 220 (char 219)

The fix wraps the parse call in process_query with a try/except that:

1. Logs the parse error to error_history (type=tool_call_parse_error)
2. Appends a synthetic recovery prompt instructing the model to retry
   with documented enum values (or fall back to plain text)
3. Continues the loop so the next LLM call has clean conversation
4. After 3 consecutive parse failures, gives up gracefully with a
   friendly fallback rather than spamming the user (loop sketched below)

Same handler also catches finish_reason=length (tool-call truncated
mid-arguments) and parallel tool_calls (NotImplementedError), since
both manifest as the same user-facing failure mode.

Surfaced by GAIA eval baseline scenario:
- personality/honest_limitation Turn 2: 1000+ char summary_type arg
- rag_quality/negation_handling Turns 1-2: finish_reason=length

Tests: tests/unit/agents/test_parse_error_recovery.py — covers the
parse-error path, the 3-strikes graceful giveup, and the underlying
ValueError still raises from _parse_llm_response itself.
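
A hedged sketch of the loop's shape (process_query's real control flow is
larger; names here are illustrative):

MAX_PARSE_FAILURES = 3

def run_turn(call_llm, parse_response, messages):
    failures = 0
    while True:
        raw = call_llm(messages)
        try:
            return parse_response(raw)     # _parse_llm_response analogue; still raises
        except ValueError as err:          # malformed native tool_calls envelope
            failures += 1
            if failures >= MAX_PARSE_FAILURES:
                return "I kept hitting tool-call errors; please rephrase the request."
            messages.append({
                "role": "user",
                "content": (
                    f"Your previous tool call was malformed ({err}). Retry using "
                    "the documented enum values, or answer in plain text instead."
                ),
            })                             # synthetic recovery prompt, then loop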
Two failure modes surfaced by GAIA eval baseline against gaia-lite on
Qwen3.5-4B-GGUF:

1. Context-overflow mid-loop (5 of 8 baseline failures)
   When a multi-step ReAct turn accumulates tool results from several
   search_file/index_document calls, the cumulative `messages` array
   eventually exceeds the model's 32K context window. Lemonade returns
   exceed_context_size and the chat helper surfaces "This conversation
   got too long for the model's context window. Start a fresh task..."

   Fix: wrap the LLM-call try/except in process_query (both streaming
   and non-streaming branches) with a retry loop. When we detect a
   context-overflow exception (substring match on the upstream error
   text — typed errors get wrapped by AgentSDK), trim the messages
   array to {first user input + last 4 entries} and retry ONCE. If
   the retry also fails, return a friendly message asking the user to
   start fresh — no more raw exception leaks.

   Surfaced by:
   - tool_selection/smart_discovery T1 (search → index → blow up)
   - error_recovery/file_not_found T1+T2 (failed index → deep search)
   - error_recovery/search_empty_fallback T1
   - captured/captured_eval_smart_discovery

2. list_windows had no macOS branch
   The tool returned "Window listing not available. Install pywinauto
   (Windows) or wmctrl (Linux)." on Mac, and the agent reported "I
   can't list open windows on this Mac" — judged FAIL.

   Fix: add a Darwin branch that uses osascript with System Events to
   return the visible (non-background) processes — equivalent to what
   the user sees in Mission Control. No new dependency: osascript
   ships with macOS.
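
   A sketch of the Darwin probe under the assumption that plain
   subprocess plumbing matches the real tool wiring; the System Events
   script is standard AppleScript:

     import platform
     import subprocess

     def list_windows_darwin():
         script = ('tell application "System Events" to get name of '
                   "(processes where background only is false)")
         out = subprocess.run(
             ["osascript", "-e", script],
             capture_output=True, text=True, check=True,
         )
         # osascript returns a comma-separated list of visible apps
         return [name.strip() for name in out.stdout.split(",")]

     if platform.system() == "Darwin":
         print(list_windows_darwin())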

   Surfaced by web_system/list_windows.

Tests: extended tests/unit/agents/test_parse_error_recovery.py with
TestProcessQueryRecoversOnContextOverflow — covers the trim+retry
success path AND the after-retry-still-fails graceful fallback.
…keep latest

Previous trim strategy (keep first + last 4 messages) didn't help when the
final tool result itself was huge (e.g. RAG query returning many KB of
chunks). After trim we still couldn't fit, so the agent gave up.

New _shrink_messages_for_overflow helper:
- Keep the original user query intact
- Keep the LATEST tool result intact (model needs it to answer)
- Replace older tool results with a tiny stub
  ("[tool result omitted -- context overflow recovery]")
- Truncate verbose assistant chain-of-thought to 800 chars

This preserves the structural shape of the conversation so the model
still understands what tools have been called, but drops the bulk of
the bytes. Same shrink applied in both streaming and non-streaming
branches of process_query.
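
A sketch of the helper, assuming OpenAI-style message dicts with "role"
and "content" keys:

  STUB = "[tool result omitted -- context overflow recovery]"

  def _shrink_messages_for_overflow(messages):
      tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
      last_tool = tool_idx[-1] if tool_idx else None
      shrunk = []
      for i, m in enumerate(messages):
          m = dict(m)  # don't mutate the caller's history
          if m["role"] == "tool" and i != last_tool:
              m["content"] = STUB  # older tool results become stubs
          elif m["role"] == "assistant" and isinstance(m.get("content"), str):
              m["content"] = m["content"][:800]  # cap chain-of-thought
          shrunk.append(m)  # user query + latest tool result stay intact
      return shrunk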

Surfaced by tool_selection/smart_discovery T1: agent indexed handbook
correctly, but the indexed-content tool result + chain-of-thought
combined to push past 32K. Old trim retained both and still failed;
new shrink keeps just the latest result so retry succeeds.

Tests still pass — the existing context-overflow tests now exercise
the new path implicitly via the same "raise then succeed" pattern.

The chat helper has a one-shot retry that calls _maybe_load_expected_model
to reload the model at the canonical 32K ctx — but only fires for errors
where ``retryable=True``. LemonadeContextOverflowError was always non-
retryable, so when a previous load left Qwen3.5-4B at 4096-ctx (e.g.
after an embedding model swap or auto-load default), the agent surfaced
"This conversation got too long" to the user even though the actual
remedy was a model reload.

Now LemonadeContextOverflowError.retryable is set dynamically based on
the reported n_ctx:

- If n_ctx < 32768 → retryable=True. The model was loaded with the
  wrong ctx_size; the chat layer's one-shot retry will reload at 32K
  via _maybe_load_expected_model and the same prompt will fit.
- If n_ctx == 32768 → retryable=False. This is a genuine "conversation
  too big" situation; retry won't help, surface the friendly message.
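
A sketch of the dynamic flag; the class shape is illustrative, not the
exact provider code:

  CANONICAL_CTX = 32768

  class LemonadeContextOverflowError(Exception):
      def __init__(self, message, n_ctx=None):
          super().__init__(message)
          self.n_ctx = n_ctx
          # Wrong-ctx loads are fixable by a reload; a genuine overflow
          # at the canonical 32K (or an unknown n_ctx) is not.
          self.retryable = n_ctx is not None and n_ctx < CANONICAL_CTX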

Two parsing paths updated:
- _classify_lemonade_response (provider-side, structured payload):
  reads n_ctx from nested error.details.response.error.n_ctx
- _classify_chat_exception (chat-layer fallback for when AgentSDK
  re-raises with str(original)): regex-extracts the ctx number from
  the textual message ("context size (4096 tokens)")
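
The textual fallback reduces to one regex over the quoted message
format; a sketch:

  import re

  def _ctx_from_message(text):
      m = re.search(r"context size \((\d+) tokens\)", text)
      return int(m.group(1)) if m else None

  assert _ctx_from_message("... context size (4096 tokens) ...") == 4096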

Surfaced by tool_selection/smart_discovery — backend was loaded with
n_ctx=4096 after a fresh restart sequence, the original handbook+RAG
turn legitimately needed ~18K tokens, and we were stuck refusing the
turn instead of reloading.
…load

The previous commit made LemonadeContextOverflowError retryable when n_ctx
< 32K, but the chat helper's reload-and-retry never fired because the
agent's own try/except in process_query catches the exception first and
runs its in-loop trim-and-retry instead. The trim doesn't help when the
real issue is a 4K-loaded model — the second request goes to the same
broken backend.

Fix: detect the wrong-ctx-loaded sub-case in agent.py (substring match
on "context size (4096|8192|16384" / "n_ctx': 4096|8192|16384") and
RE-RAISE instead of trimming. This bubbles the typed error up to the
chat helper, where _classify_chat_exception now reads it as retryable
and triggers _maybe_load_expected_model to reload at 32K before the
chat layer's own one-shot retry.

Genuine "conversation too big to fit even at 32K" still goes through
the trim+retry+friendly-fallback path as before.

Both streaming and non-streaming branches updated symmetrically.
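
A sketch of the dispatch; the marker list expands the patterns quoted
above, and the except-block usage is shown as comments because it lives
inside process_query's loop:

  WRONG_CTX_MARKERS = (
      "context size (4096", "context size (8192", "context size (16384",
      "n_ctx': 4096", "n_ctx': 8192", "n_ctx': 16384",
  )

  def _is_wrong_ctx_overflow(exc):
      return any(marker in str(exc) for marker in WRONG_CTX_MARKERS)

  # Inside process_query's except block:
  #     if _is_wrong_ctx_overflow(exc):
  #         raise  # bubble up; chat helper reloads at 32K and retries
  #     messages = _shrink_messages_for_overflow(messages)  # in-loop path
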
…lope

The OpenAI tool_call spec says ``function.arguments`` is a JSON string,
but llama.cpp 4B-class models occasionally emit it as a pre-parsed
dict. ``json.loads(dict)`` raised a TypeError that our recovery layer
didn't catch (it only listened for ValueError / NotImplementedError),
so the exception bubbled out as:

  Agent error: the JSON object must be str, bytes or bytearray, not dict

Surfaced by tool_selection/smart_discovery T1 after the previous
context-overflow fix unblocked a path that was previously masked by
that earlier failure.

Now we accept either shape:
- str/bytes → json.loads as before
- dict → use directly
- empty/None → empty dict
- anything else → raise ValueError with a recovery-friendly message
  so the parse-error retry kicks in
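
A sketch of the shape-tolerant decode; the helper name is hypothetical:

  import json

  def _coerce_tool_arguments(raw):
      if isinstance(raw, dict):
          return raw  # pre-parsed dict: use directly
      if raw in (None, "", b""):
          return {}  # empty/None: empty dict
      if isinstance(raw, (str, bytes, bytearray)):
          return json.loads(raw)  # spec-compliant JSON string
      raise ValueError(  # recovery-friendly: the parse retry catches this
          f"Unsupported tool_call arguments type: {type(raw).__name__}"
      )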

The substring detection ("context size (4096", etc.) only fires when the
raw payload is preserved in the exception. When AgentSDK re-raises with
the typed LemonadeContextOverflowError's friendly user_message
("This conversation got too long..."), the n_ctx detail is gone, so
substring matching always missed.

Fix: when a context-overflow fires AND substring match doesn't say
"wrong ctx", probe the Lemonade health endpoint via httpx to read the
LLM's actual ctx_size. If < 32K, treat it as wrong-ctx and re-raise so
the chat helper reload-and-retry kicks in.

The probe times out fast (3s) and returns False on any failure, so the
caller cleanly falls through to in-loop trim if the probe is unreliable.
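
A sketch of the probe; the endpoint path and JSON shape are assumptions
about the Lemonade health payload, not confirmed API:

  import httpx

  def _loaded_ctx_below(threshold=32768,
                        base="http://localhost:8000/api/v1"):
      try:
          resp = httpx.get(f"{base}/health", timeout=3.0)
          resp.raise_for_status()
          ctx = resp.json().get("ctx_size")
          return ctx is not None and ctx < threshold
      except Exception:
          return False  # unreliable probe: fall through to in-loop trim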

Surfaced by another smart_discovery rerun: ctx was 4096, my earlier
substring guards missed because str(exception) was already friendly.

The 19 real_world scenarios were SKIPPED_NO_DOCUMENT because the
referenced corpus files did not exist on disk. Add a deterministic
generator that authors all 19 documents from a single Python data
structure, idempotent on re-run.

Documents are synthetic paraphrases (no copyrighted source text) but
contain every ground-truth fact each scenario references. XLSX files
are flattened by RAGSDK._extract_text_from_xlsx into row-keyed prose
that surfaces SKUs, totals, and notes for the chunker.

Re-run with: python eval/corpus/gen_real_world.py
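
The idempotent core is small; a sketch with an illustrative document
table (the real one carries all 19 scenarios):

  from pathlib import Path

  DOCS = {
      "handbook.md": "Synthetic paraphrase carrying the ground-truth facts...",
      # ...one entry per scenario document
  }

  def generate(corpus_dir="eval/corpus/real_world"):
      root = Path(corpus_dir)
      root.mkdir(parents=True, exist_ok=True)
      for name, body in DOCS.items():
          (root / name).write_text(body)  # same input -> same bytes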

CI blockers:
- src/gaia/eval/runner.py: guard `import fcntl` and the `_acquire_eval_lock`
  body so Windows degrades to a no-op (the lock guards a Lemonade race that
  doesn't happen on Windows dev boxes; see the sketch below). Unblocks
  Test Eval Tool (Windows), which was failing with ModuleNotFoundError on
  every test in tests/test_eval.py.
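
A sketch of the guard; lock-file handling simplified:

  try:
      import fcntl
  except ImportError:  # Windows: module doesn't exist
      fcntl = None

  def _acquire_eval_lock(lock_path):
      if fcntl is None:
          return None  # the guarded race doesn't occur on Windows
      fh = open(lock_path, "w")
      fcntl.flock(fh, fcntl.LOCK_EX)
      return fh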

Apr-20 review actionable items:
- src/gaia/ui/_chat_helpers.py: drop the blanket `except Exception` in
  _canonical_agent_type — canonical_id is a pure dict lookup, and CLAUDE.md
  says fail loudly.
- src/gaia/ui/models.py: memory_available_gb -> Optional[float] = None.
  Avoids a false memory-warning banner if psutil ever silently falls back.
  TS type and SettingsModal updated with null guards.
- src/gaia/agents/registry.py: lift the 5 GB gaia-lite floor to a named
  constant _GAIA_LITE_MIN_MEMORY_GB so the rationale stays adjacent to the value.
- src/gaia/ui/routers/system.py: clarify update_settings docstring re Pydantic
  vs GAIA convention; widen psutil exception handler so an OSError from
  virtual_memory() (containers/seccomp) doesn't 500 the status endpoint.

Tests + lint hygiene:
- tests/unit/test_chat_preflight.py: 6 stale tests now exercise the new
  pre-flight semantics (right model + ctx >= 32K). New
  test_right_model_wrong_ctx_triggers_reload covers the negative case.
- tests/unit/test_lemonade_model_loading.py: assert ctx_size=32768 fallback
  for unknown models. New test_known_model_uses_registry_ctx_size guards
  the registry-lookup path.
- tests/test_eval.py: new test_acquire_eval_lock_windows_noop with fcntl=None.
- tests/unit/chat/ui/test_chat_helpers.py: source-shape regex updated for
  the new _build_create_kwargs(...) call shape; new TestCanonicalAgentType
  class locks in the removal of the silent except.
- src/gaia/llm/providers/lemonade.py: unused `is_err` -> `_is_err`.
- src/gaia/ui/_chat_helpers.py: use module-level `_re` consistently.
- tests/unit/agents/test_registry.py: missing `from unittest.mock import patch`.
- src/gaia/apps/webui/package-lock.json: bumped 0.17.3 -> 0.17.4 to match
  package.json.
…loor, walk __context__

- Delete tunnel-friendly-error.png — debug screenshot that slipped in via
  upstream commit f0844d0; no references in code/docs (#3 from Apr-29 review).
- Restore uv.lock requires-python ">=3.13" to match origin/main (was silently
  narrowed to ">=3.12" in the same upstream commit). setup.py's python_requires
  stays >=3.10; the lock now no longer drifts from main (#4 from Apr-29 review).
- Restore src/gaia/apps/webui/package-lock.json to origin/main (revert my
  drive-by 0.17.3 -> 0.17.4 bump). Main itself has the package.json=0.17.4 vs
  lockfile=0.17.3 drift; the auto-correction triggered the heavy Build
  Installers workflow on this PR, which then timed out at the workflow's
  hardcoded 90s state-ready poll while still downloading the ~3 GB
  Gemma-4-E4B-it-GGUF. Reverting eliminates the unrelated CI noise; the
  lockfile/package.json drift is its own tech debt.
- _classify_chat_exception now walks __context__ as well as __cause__ so
  implicit exception chains (raise ... inside an except block, with no
  `from`) preserve typed-class metadata like
  LemonadeContextOverflowError.retryable (#5 from Apr-29 review; see the
  sketch below).
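
A sketch of the chain walk; the real classifier inspects more than the
retryable attribute:

  def _walk_exception_chain(exc):
      seen = set()
      while exc is not None and id(exc) not in seen:
          seen.add(id(exc))
          yield exc
          # __cause__ covers `raise ... from err`; __context__ covers
          # the implicit chain from raising inside an except block.
          exc = exc.__cause__ or exc.__context__

  def _is_retryable(exc):
      return any(getattr(e, "retryable", False)
                 for e in _walk_exception_chain(exc))
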
@itomek force-pushed the feature/mac-4b-default branch from b2a952f to 62d0d2f on April 29, 2026 at 20:05
@kovtcharov-amd added this pull request to the merge queue on Apr 29, 2026
Merged via the queue into main with commit 37e35eb Apr 29, 2026
45 of 47 checks passed
@kovtcharov-amd deleted the feature/mac-4b-default branch on April 29, 2026 at 22:20
kovtcharov added a commit that referenced this pull request Apr 29, 2026
Bring the feature branch back to green by addressing the cluster of CI
failures that landed when the memory v2 work merged with main.  All fixes
are mechanical or scoped to test isolation — no behavioural change to
the memory pipeline itself.

- Restore lost merge-conflict state in `ChatView.tsx` and `Sidebar.tsx`:
  the `getSessionHash` import, the `hashCopied`/`copied` state, and the
  `handleCopyHash` callback were all dropped during the merge — the Vite
  build was failing on missing identifiers across PyPI Build Check and
  all three Build Installers jobs.

- Lint/Pylint cleanup so the `Code Quality (Lint)` job is green again:
  remove unused vars/imports, drop dead `if x != x` branches, and
  promote a few pointless lambdas to method references in
  `agents/base/discovery.py`.  Reorder `routers/memory.py` imports
  to satisfy isort.

- Tighten `_canonical_agent_type` to surface `AttributeError` instead
  of swallowing it (matches the existing regression test added in #802;
  was failing locally and in CI Unit Tests).

- Add an explicit `GAIA_MEMORY_DISABLED=1` opt-out to `MemoryMixin.init_memory`.
  The Path Validator security tests, Unit Tests, and Chat Agent Tests
  jobs all instantiate `ChatAgent`/`CodeAgent` without a Lemonade server
  available; the memory v2 hard-requirement on the embedding service
  fails them.  This is a deliberate, named opt-out (not a silent
  fallback) — tests that exercise memory itself clear the variable
  via the new `tests/unit/conftest.py` autouse fixture and the
  `_mock_v2_init_context` helper, so memory test coverage is unchanged.
  CI workflows that don't need memory now set the env var explicitly.
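
A sketch of the opt-out; attribute names are simplified stand-ins:

  import os

  class MemoryMixin:
      def init_memory(self):
          if os.environ.get("GAIA_MEMORY_DISABLED") == "1":
              self.memory = None  # deliberate, named opt-out for CI/tests
              return
          # memory v2 hard-requires the embedding service
          self.memory = self._connect_embedding_service()
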
@github-actions bot mentioned this pull request May 1, 2026

Labels

agents, cli, dependencies, devops, electron, eval, llm, mcp, performance, tests
