feat(agents): add Chat Lite + Settings model/ctx/memory controls#802
feat(agents): add Chat Lite + Settings model/ctx/memory controls#802kovtcharov-amd merged 32 commits intomainfrom
Conversation
… + validate ctx_size Follows up on the architecture review of PR #802 with three polish fixes; no behaviour change vs the previous commit on main agents paths except the `ctx_size` validation. 1. **Single source of truth for the 32K context requirement.** ``DEFAULT_CONTEXT_SIZE`` now lives in ``gaia.llm.lemonade_client`` and is re-exported by ``lemonade_manager`` for backwards compat. The router's ``_MIN_CONTEXT_SIZE`` is aliased to it. Eliminates the three side-by-side copies (each carrying a "must match ..." comment — exactly the smell review flagged as drift-prone). 2. **Extract ``_session_agent_kwargs`` helper** in ``_chat_helpers``. The four ChatAgent/registry.create_agent call sites used to repeat the same 4-field bundle (rag_documents, library_documents, allowed_paths, ui_session_id). Centralising it means adding a new field — or forgetting one, which is what bit us last time and caused Chat Lite's PermissionError spiral — happens in one place. Unknown kwargs are still filtered by the per-agent factory, so this remains safe for manifest agents that don't recognise all fields. 3. **Validate ``LoadModelRequest.ctx_size > 0``.** Manual testing showed ``ctx_size: -1`` and ``ctx_size: 0`` were silently accepted by the endpoint and then failed deep in Lemonade with no actionable error. ``Field(None, gt=0)`` now surfaces a 422 at the boundary with a readable message. Verified with the existing 153-test suite and a live end-to-end sweep over the chat-lite agent on a Mac: cold auto-load, warm reuse, agent switch, parallel sessions (3 simultaneous), ctx reload 32K→64K, long 6K-token input, empty message, malformed load-model payload — all pass.
SummaryShips the Gaia Lite agent (ChatAgent + 4B model for low-memory hardware) and three Settings controls (Active Model override, Context Size picker, Memory Warnings) — purely additive and well-bounded. The real gem is the Issues Found🟢 MinorSilent
If Source-shape regex test is fragile ( The regex-over-source assertion for "streaming branch passes Memory threshold comment vs. constant ( The 5 GB floor is justified as "Q4_K_M weights ~2.5 GB + context + runtime headroom". Nice rationale — consider lifting
The new docstring is correct in its current wire format, but Pydantic absolutely can distinguish Strengths
VerdictApprove with suggestions — no blocking issues. The |
|
@claude troubleshoot why our CI/CD pipeline is failing. |
CI/CD Failure Triage for PR #802Two of the required checks are red on commit 🟡 1. Code Quality (Lint) — 2 issues
Note: the manifest-schema regeneration is the genuine tell here — the schema generator wasn't re-run after 🟡 2. Unit Tests — 6 failures, all caused by this PRJob log — All six are stale tests that still encode the old short-circuit behavior, not bugs in the new code. They directly contradict the "Pre-flight load handles wrong-model + small-ctx" thread in the PR description:
Suggested fix plan# 1. Reformat + regen schema
python util/lint.py --black --fix
python util/gen_manifest_schema.py
# 2. Update the 6 stale tests to assert the new behavior:
# - test_chat_preflight.py: wrong-model or small-ctx case should now expect a reload + SSE emit
# - test_lemonade_model_loading.py: expected `ctx_size=32768` on models outside MODELS registry
python -m pytest tests/unit/test_chat_preflight.py tests/unit/test_lemonade_model_loading.py -xvs
# 3. Full lint + unit suite before re-pushing
python util/lint.py --all
python -m pytest tests/unit/All other checks (API Tests, CLI/Linux, CLI/Windows, Agent SDK, RAG, Security, CodeQL, Lemonade Embeddings) are green, so scope is narrow. |
|
@claude review this PR. |
PR Review —
|
| Theme | Evidence | LoC |
|---|---|---|
| Synthetic eval corpus generator | eval/corpus/gen_real_world.py + 19 docs + 4 .xlsx |
+1303 |
--agent-type flag for eval |
src/gaia/eval/runner.py |
~200 |
| Process-wide eval lock (the fcntl one) | src/gaia/eval/runner.py |
~200 |
| Tool-call envelope recovery + context-overflow retry | src/gaia/agents/base/agent.py |
~540 |
| Typed Lemonade errors + auto-retry | src/gaia/llm/providers/lemonade.py |
~160 |
| UI polish (cards, sidebar, greetings, ctx badge, drag-drop, varied greetings) | src/gaia/apps/webui/src/components/* |
~700 |
| Live model-download progress + auto-load + banner priority | src/gaia/apps/webui/src/components/ConnectionBanner.tsx |
~140 |
| Removed PyPI wheel-dist verification gate | .github/workflows/pypi.yml, util/verify_wheel_dist.py |
−392 +33 |
Removed AGENTS.md (168 lines) |
repo root | −168 |
Removed tests/unit/test_lemonade_manager_preload.py, test_init_ctx_size.py, test_export_import.py, test_code_index_sdk.py, test_ui_extras.py, test_verify_wheel_dist.py |
tests | −981 |
Docs edits across agent-ui, code-index, custom-agent, patterns, plugin-registry, setup-wizard |
docs/ |
~250 |
The reliability/UI polish work is mostly good in isolation — but I cannot do an honest review of "Chat Lite + Settings controls" when the diff also silently removes the PyPI wheel-dist gate (which was added specifically to prevent the webui-bundling regression that motivated docs/sdk/sdks/agent-ui.mdx) and 6 test modules totaling ~1000 LoC.
Two things would unblock review:
- Update the PR description to enumerate every theme with a why, especially the deletions:
- Why is
verify_wheel_distgone? Has the underlying regression class been ruled out? - Why is
AGENTS.mddeleted while CLAUDE.md still references AI-agent conventions? - Why were those 6 test modules removed — were the modules-under-test deleted, or just the tests?
- Why is
- Or split — at minimum, separate the eval corpus + reliability fixes into their own PRs. The original Chat Lite scope is small and reviewable; bundled this way it isn't.
🟢 What's still good
- The original 6 threads still hold up; suggestions from the Apr 20 review (the silent
except Exceptionin_canonical_agent_type, thememory_available_gb = 0.0default, the source-shape regex test) are worth checking but none are blockers. - Test additions are thorough —
tests/unit/agents/test_parse_error_recovery.py(+185), the registry tests (+566), and the chat-helpers expansion (+114) are all well-targeted. - The
_chat_helpers.pyconsolidation around_session_agent_kwargsand_maybe_load_expected_model(right model and ctx ≥ 32K) is the real load-bearing fix and it's well-commented.
Suggested path forward
# 1. Fix Windows fcntl
# In src/gaia/eval/runner.py, guard the import + helpers behind sys.platform.
# 2. Regenerate schema
python util/gen_manifest_schema.py
# 3. Rebase and re-run CI
git fetch origin && git rebase origin/main
# 4. Update PR description to match the actual diff (or split)Once Windows + lint are green and the description matches the diff, this is reviewable for merge. Tagging @kovtcharov-amd for the scope/split decision since the deletions of AGENTS.md, the PyPI gate, and ~1000 LoC of tests are call-it-out-loud changes that need a maintainer eyes-on regardless of how the rest of the review goes.
… + validate ctx_size Follows up on the architecture review of PR #802 with three polish fixes; no behaviour change vs the previous commit on main agents paths except the `ctx_size` validation. 1. **Single source of truth for the 32K context requirement.** ``DEFAULT_CONTEXT_SIZE`` now lives in ``gaia.llm.lemonade_client`` and is re-exported by ``lemonade_manager`` for backwards compat. The router's ``_MIN_CONTEXT_SIZE`` is aliased to it. Eliminates the three side-by-side copies (each carrying a "must match ..." comment — exactly the smell review flagged as drift-prone). 2. **Extract ``_session_agent_kwargs`` helper** in ``_chat_helpers``. The four ChatAgent/registry.create_agent call sites used to repeat the same 4-field bundle (rag_documents, library_documents, allowed_paths, ui_session_id). Centralising it means adding a new field — or forgetting one, which is what bit us last time and caused Chat Lite's PermissionError spiral — happens in one place. Unknown kwargs are still filtered by the per-agent factory, so this remains safe for manifest agents that don't recognise all fields. 3. **Validate ``LoadModelRequest.ctx_size > 0``.** Manual testing showed ``ctx_size: -1`` and ``ctx_size: 0`` were silently accepted by the endpoint and then failed deep in Lemonade with no actionable error. ``Field(None, gt=0)`` now surfaces a 422 at the boundary with a readable message. Verified with the existing 153-test suite and a live end-to-end sweep over the chat-lite agent on a Mac: cold auto-load, warm reuse, agent switch, parallel sessions (3 simultaneous), ctx reload 32K→64K, long 6K-token input, empty message, malformed load-model payload — all pass.
2c967a4 to
5f51005
Compare
SummarySubstantively this is a strong, well-reasoned PR — the Chat Lite agent + Settings controls solve a real "35B won't load on this box" UX hole, and the pre-flight Issues🟡 Important1. PR description is stale relative to the actual implementation. The description claims the agent ID is
2. Scope is much wider than the description signals. The description lists ~6 threads centered on Chat Lite + Settings. The diff also contains, none of which are mentioned:
Each of those has its own correctness surface. A reviewer reading the description will skip them; a future bisector tracking down (say) an eval-lock or auto-title regression will not realize this PR is the introduction. Either split or — more pragmatically given the size — expand the description to thread-list every distinct change with a one-line why. This is exactly the case CLAUDE.md's "PR Descriptions — Tight and Value-Focused" rule warns about: "If the PR really does bundle many threads, group them — don't list 16 commits." 3.
🟢 Minor4. 5. 6. Auto-title task can race the next user turn for the Lemonade slot. 7. 8. Strengths
VerdictApprove with suggestions. The blockers are documentation, not code: please update the PR description (Issue 1) and drop |
Cross-platform local validationEvery push to this branch was validated on both local Linux (Docker) and local Windows (SSH) before going out — so we don't burn an hour of CI to find platform-specific regressions. Workflow
Results for
|
| Platform | Lint | tests/test_eval.py |
tests/unit/ |
|---|---|---|---|
| macOS host | ✅ clean | ✅ 100 / 100 | ✅ 1708 passed, 15 skipped, 0 failed |
| Linux container | ✅ clean | ✅ 100 / 100 | ✅ 1695 passed, 23 skipped, 0 failed |
Windows (real win32) |
n/a | ✅ 100 / 100 | 1695 passed, 15 skipped, 13 failed* |
*The 13 Windows-only failures are pre-existing Linux-specific tests (os.geteuid mock patches, sys.platform="linux" monkeypatches) in tests/unit/test_init_command.py::TestInstallViaPpa (7 tests, Ubuntu-PPA installer paths) and tests/unit/installer/test_uninstall_command.py (6 tests, POSIX path/permission paths). None have skipif win32 guards. CI's Run Unit Tests job is Linux-only, so it doesn't surface them. These tests would also fail against origin/main on Windows — not a regression from this PR.
What this catches
- Linux-specific
importpaths and lazy-import side effects. The newif sys.platform == "win32": fcntl = None; else: import fcntlguard ineval/runner.pywas verified to import cleanly on Linux and on real Windows (platform=win32,fcntl=None) before pushing — the original blocker fix that made all oftests/test_eval.pygo green on Windows. - File-system case-sensitivity bugs that pass on macOS but fail on Linux.
- Lockfile resolution differences between hosts.
- Real-Windows behaviour the macOS-side
monkeypatch.setattr(runner, "fcntl", None)test simulates — verified end-to-end here, not just by the GitHub Actions Windows runner.
One-time setup quirks
- Linux: project venv was missing
pyfakefs(declared insetup.py's dev extras), makingtests/unit/installer/test_uninstall_command.pyERROR on collection.uv pip install pyfakefsclears it. - Windows:
tests/unit/chat/ui/test_*.pyneeds the[ui]extra (FastAPI) on top of[dev].uv pip install -e ".[dev,ui]"from a fresh venv is the one-shot recipe.
Chat Lite is a 4B-model sibling of the built-in Chat Agent for
hardware that cannot host the 35B default (Macs, low-memory boxes).
Same tools, same system prompt, just pinned to Qwen3-4B-Instruct-2507
via the ChatAgent config. ChatAgent and its registration are
untouched.
Agent metadata gains `min_memory_gb`, exposed via `/api/agents`.
Settings renders a Memory Warnings section when any registered agent
declares a requirement above the free memory reading.
Two Settings additions:
* Active Model — text override bound to the existing `custom_model`
setting. Empty means "use agent default".
* Context Size — preset chips (4K/8K/16K/32K) + numeric input that
reloads the active model with the chosen ctx via the existing
`/api/system/load-model` endpoint.
Two correctness fixes made Chat Lite actually usable end-to-end:
* `/api/system/status` previously flagged any non-35B model as
"Wrong model loaded". It now accepts any registered agent's
preferred model, so Chat Lite's 4B doesn't trip the banner.
* `_maybe_load_expected_model` short-circuited when any LLM was
active, even the wrong one. It now requires the specific expected
model with ctx >= 32K, otherwise reloads. Without this, Lemonade
auto-loaded requested models at its default 4096 ctx, silently
truncated ChatAgent's >7K-token system prompt, and returned an
empty stream. `_ensure_model_loaded` also falls back to 32K when
the model is not in the built-in MODELS registry.
Tests: 10 new unit tests covering registration, factory presets,
min_memory_gb propagation through manifests and the agents API, plus
a coexistence check that Chat Agent's defaults stay unchanged.
Frontend `npm run build` passes with 0 new warnings.
Two related fixes for the same eval failure mode (Qwen 4B getting ``finish_reason=length`` mid-tool-call on the no_sycophancy scenario): 1. **Output budget too tight.** ``AgentConfig.max_tokens`` was 4096, which Qwen3.5-4B exhausted while serialising long tool-call argument strings (1000+-char ``summary_type`` blob in the case the eval surfaced). With our 32K ctx_size and a ~7.7K-token system prompt + history, 8K of output budget leaves plenty of room and still keeps ~24K for input. Going much higher would steal from input-history budget without measured gain. 2. **Misleading error message.** When the cap was hit we raised "Increase --ctx-size for model X" — but ``finish_reason=length`` from the OpenAI completions API specifically signals the *output token cap*, not the context window. ctx_size and max_tokens are separate limits, and conflating them sent users off chasing the wrong knob (the user had already loaded at 32K ctx). The new error names ``AgentConfig.max_tokens`` directly. Both are agent-runtime changes (affect ChatAgent generally), not gaia-lite specific. Test suite green; full registry + chat-ui suites pass with no behaviour-coupled assertions to update.
…op session links
Bundles four user-asked tweaks discovered while iterating on gaia-lite:
UI / Agent UI
* ChatView.tsx — chat header model badge now shows the loaded
context window inline ("Qwen3.5-4B-GGUF · 32K"). Title attr carries
the full "context window: NN,NNN tokens" so a hover gives the
precise number. Mismatched ctx (e.g. eval reload at 4K) is now
visible at a glance instead of surfacing only as a chat-time error.
* ChatView.tsx + Sidebar.tsx — dropped the "#abc1234" session-link
pills from both the chat header and every sidebar row. They were
clipboard-copy hash badges; nobody used them and they ate the row.
* MessageBubble.tsx — every message bubble (user + assistant) now
has a ``title`` tooltip with the full absolute send timestamp.
Previously only assistant bubbles had this via the stats footer.
* ChatView.css — small ``.model-ctx-size`` rule so the suffix is
dimmer than the model name (eye lands on model first).
Chat agent personality
* src/gaia/agents/chat/agent.py — replaced the 2-example "RIGHT"
list in the GREETING RULE with 10 varied openers + an explicit
"VARY YOUR PHRASING" rule. The original prompt's recency bias
pinned the model on a single canned "Hey! What are you working
on?" every conversation; the rotation list breaks that pattern
while keeping the warm/curious/no-feature-pitch invariant.
Added a new WRONG example calling out the lock-in behaviour
explicitly so the model can self-detect it.
Eval discipline
* CLAUDE.md — new rule: "Run agent evals SERIALLY, never in
parallel." Today's session lost three eval runs to two
concurrent ``gaia eval agent`` invocations race-evicting each
other's models out of Lemonade's single-tenant slot, surfacing
as bogus 4K-ctx errors and INFRA_ERROR. Documents the exact
failure modes + the ``ps aux | grep "gaia eval" | wc -l`` =
0 sanity check before kicking off a new run.
Backend changes confined to comments + system-prompt strings —
agent runtime contracts unchanged. 212 unit tests pass; lint clean;
frontend bundle rebuilt.
The runaway-eval failure mode the user kept hitting today: a parent
agent in --fix mode shells out a second `gaia eval agent ...` while the
first is still running, both invocations talk to the same Lemonade
Server, Lemonade has a single-tenant LLM slot, the runs race-evict each
other's models, and the whole thing surfaces as nondeterministic
`n_ctx=4096` overflows or `model_load_error: llama-server failed to
start`. CLAUDE.md now documents the rule; this commit enforces it.
src/gaia/eval/runner.py
* `_acquire_eval_lock()` — context manager around fcntl.flock on
/tmp/gaia-eval-agent.lock. LOCK_EX | LOCK_NB so a second invocation
fails fast (exit 2) with an actionable message naming the holder
PID and run age, not by hanging.
* Stale-lock recovery: if the holder PID is gone, the lock file is
reclaimed automatically (no manual rm needed).
* Escape hatch: GAIA_EVAL_NO_LOCK=1 skips the guard for unit tests
or callers that genuinely manage Lemonade out of band.
* AgentEvalRunner.run() now wraps the per-scenario loop in the
lock; audit_only mode skips it (no Lemonade contention).
Body extracted to _run_locked() so the wrapper stays thin.
UI follow-on (separate concern, same touch):
src/gaia/apps/webui/src/components/AgentActivity.css
* The "N steps · M tools" expand-activity bar used to render as a
full-width terminal panel with border + uppercase 11px text on
every assistant turn — visually loud for content the user rarely
expands. Now it's an inline 10px chip at 0.55 opacity, expanding
to full opacity + bg + border on hover/focus, and forced to full
opacity when the run is active or has errors (states the user
actually needs to notice).
Verification: subprocess test confirms two concurrent invocations
produce exit-2 + clear error in the second one. Lint clean.
Every GAIA reply was rendering with the same vertical AMD-red border
+ red-tinted background as ``.msg-error``, so a normal multi-turn chat
read as a stack of warnings. Reserved that visual language for actual
errors only.
src/gaia/apps/webui/src/components/MessageBubble.css
* .msg-assistant: dropped the ``border-left: 2px solid var(--amd-red)``
that was duplicating .msg-error's left rail. Assistant messages now
distinguish themselves from user messages purely via the elevated
panel background, the GAIA avatar + name in the header, and the
left-aligned text direction. Subtle is the point — a long thread
should read as a calm conversation, not warnings.
* Removed the ``[data-theme="dark"] .msg-assistant`` override that
layered ``rgba(237, 28, 36, 0.02)`` (red haze) on top of the
already-neutral --bg-assistant-msg variable in dark mode.
* Avatar dark-mode treatment: ``border-color: rgba(237,28,36,0.25)``
and ``box-shadow: 0 0 8px rgba(237,28,36,0.12)`` (red glow) →
neutral ``var(--border)`` + a 1-px white-tinted ring. Brand red
is still present in the small "GAIA" name text + the user-avatar
fill; just not bleeding into every UI surface.
Error styling unchanged — .msg-error still owns the AMD-red left
border + red background tint, and is now the only place that visual
appears so it actually stands out.
Pure CSS — no behavioural change. Bundle rebuilt; refresh to see.
Three connected polish passes on the assistant bubble:
src/gaia/apps/webui/src/components/MessageBubble.css
* **Rounded card.** .msg-assistant was a hard-edged horizontal strip;
now it's a 14-px-radius panel that visually sits *inside* the chat
column instead of cutting it horizontally. ``width: calc(100% - 16px)``
inset so the rounding actually shows.
* **3D depth.** Layered box-shadow (4 stops: inset top-edge highlight,
crisp 1-px contact, soft mid-distance, wider ambient) for a subtle
elevation. Hover lifts ~1 px and softens the ambient — small enough
not to read as a "click me" CTA, big enough to feel responsive.
Dark-mode variant uses heavier alphas (0.30–0.55) because shadows
on near-black backgrounds disappear at typical light-mode values.
* **Softer fade-in.** Replaced the 200 ms slide+scale with a 380 ms
fade+drift+deblur (assistant: 450 ms). cubic-bezier(0.22, 0.61, 0.36, 1)
eases out smoothly. Dropped the 0.99 scale step — scaling rounded
cards causes brief subpixel blur on retina. Added a 2-px filter:blur
transition that mimics an "image developing" effect on entry —
very subtle but adds polish without slowing perception.
User messages unchanged: still flat, transparent, right-aligned. Only
the GAIA reply panel is the rounded 3D card. Shadow tokens are pure
CSS — no extra DOM, no JS, ~zero runtime cost.
Same direction as the GAIA message card (b5289a3): rounder, lighter, less red bleed, subtle elevation on the active state. src/gaia/apps/webui/src/components/Sidebar.css * Session items - radius-md (8px) → radius-lg (10px) for softer corners - Padding tweaked +1px vertically so rows don't feel crammed - Hover now adds a faint border-light ring in addition to the bg-hover fill — gives shape to the row before clicking - Active row's box-shadow was a 10-px red glow + inset 28-px red haze; replaced with a calm multi-layer (top highlight + contact + mid) elevation in light mode, deeper black-alpha shadows in dark mode. Brand-red is still present in the left-edge indicator and a 6%-alpha bg tint, but no longer bleeding into the surrounding sidebar. - Press-feedback scale tightened from 0.982 → 0.985 (less "trampoline", more "tap"). * Search input - radius-md → radius-lg, matching the session items - Now sits on a tertiary background with an inset hairline shadow, reading as gently recessed (the inverse of the session-active which reads as raised — coherent stack metaphor) - Focus ring: was border 0.4 alpha + 12-px red glow ("warning" vibe); now a soft 3-px 0.08-alpha brand-red halo + 0.35-alpha border. Reads as "focused", not "errored". * Bottom status bar - Tiny 1-px inset top highlight in addition to the existing border-top — gives a soft "lip" between the scrolling session list and the fixed status row, matching the layered-card feel. Pure CSS; no DOM changes. Bundle rebuilt.
The textarea had a custom 10×18 px solid-AMD-red block cursor with a
0.5-alpha red glow blinking at 1 Hz on every focus. On a screen the
user is staring at all session, that's an obvious "warning indicator"
treatment for a primitive that should be invisible until interacted with.
src/gaia/apps/webui/src/components/ChatView.tsx
* Removed ``getCaretXY`` helper, ``_computedStyleCache``, the
``caret`` state + ``updateCaret``/``setCaret`` callbacks, the
inline ``caretColor: 'transparent'`` override, and the
``<span className="input-cursor" />`` element. The textarea now
relies on the browser's native caret like every other text input.
* onFocus/onBlur/onSelect handlers tied only to caret tracking are
gone; nothing else cared about that state.
src/gaia/apps/webui/src/components/ChatView.css
* Removed the ``.input-cursor`` rule. ``@keyframes cursorBlink``
stays in styles/index.css — it's still used by AgentActivity and
MessageBubble streaming indicators (those are in-line content, not
always-on UI chrome, so the blink is appropriate there).
Side benefit: drops a per-keystroke ``requestAnimationFrame`` +
mirror-div DOM creation that was running on every input change.
The previous "softened" active state still carried two pieces of the
error visual vocabulary:
1. A vertical AMD-red gradient bar on the left edge (::before
pseudo-element) — same shape and color as .msg-error's
left rail. Active row read as "this row is broken".
2. In dark mode: rgba(237,28,36,0.06) background + 0.12-α red border
— selected row looked like a tinted warning band.
Both gone. Active state is now communicated purely via elevation +
neutral background lift, matching the GAIA message-card vocabulary
exactly (the same 4-stop layered shadow: inset top highlight,
contact, mid-distance, ambient).
src/gaia/apps/webui/src/components/Sidebar.css
* .session-item::before — entire pseudo-element + the hover/active
transform-scaleY rules removed. The selected row no longer has
a red strip; selection is the elevation + bg-active.
* .session-item.active light-mode shadow stack matched 1:1 to
.msg-assistant for visual coherence.
* .session-item.active dark-mode background switched from
rgba(237,28,36,0.06) → rgba(255,255,255,0.04). Border color
likewise from rgba(237,28,36,0.12) → rgba(255,255,255,0.08).
Heavier black shadows for the dark-on-dark elevation read.
* sessionActivateDarkBg keyframe (which animated a red
background flood + inset red glow on activation) deleted.
sessionActivate keyframe simplified — the 1.5-px overshoot
bounce removed for a calmer settle.
Brand red is now reserved for: the small "GAIA" name text in
message headers, the user-avatar fill, and .msg-error. That's it.
…color
Two threads:
(1) GAIA auto-titling (the running chat agent renames its own session).
src/gaia/ui/_chat_helpers.py
* New ``_maybe_update_session_title`` helper. Fired fire-and-forget
after every assistant turn finishes; calls the same Lemonade chat-
completion endpoint to generate a 3-6 word tab title.
* Trigger rules:
- Title is one of the defaults (New Chat / New Task / Untitled
/ Chat / empty) → first-response pass
- User message has ≤ 0.15 word-overlap with current title AND
is ≥ 25 chars → topic-shift pass
* Skipped when title starts with "Eval:" (eval framework owns those)
and throttled to one update / 30 s / session so concurrent fires
don't pile up.
* LLM call uses temperature 0.3 + max_tokens 24, stripping common
title artifacts ("Title:", quote-wrapping, trailing punctuation)
that small models add despite the instruction.
* Background task reference pinned in _active_sse_handlers under a
``_titlebg:<sid>`` key so GC doesn't kill it mid-flight; cleared
in the done callback.
src/gaia/ui/routers/chat.py
* Same hook in the non-streaming path immediately after
db.add_message(assistant). Wrapped in try/except so an auto-title
failure can never block the user's response.
(2) Neutral selection color.
src/gaia/apps/webui/src/styles/index.css
* ::selection background: rgba(237, 28, 36, 0.18) (AMD red) →
rgba(56, 132, 255, 0.22) (cool blue). Selecting any text used to
paint a red highlight on top of red errors / red focus rings /
formerly-red active session — every interaction read as a warning
state. Standard cool-blue selection is what users expect.
Tests: 46 chat-ui unit tests pass; lint clean.
…lures
The chat path had no recovery for the three failure modes the user
keeps hitting in the UI:
* "No model loaded: <model>" — Lemonade evicted between turns
* "request (N tokens) exceeds the available context size (M)"
— wrong-ctx model load
* "Network error: CURL Timeout was reached" — Lemonade busy/hung
Each one previously surfaced as a wall of raw JSON inside the chat
bubble and required the user to hand-restart the model. Now:
src/gaia/llm/providers/lemonade.py
* Added typed exceptions ``LemonadeError``,
``LemonadeModelNotLoadedError``, ``LemonadeContextOverflowError``,
``LemonadeNetworkError``. Each carries (a) a short ``.user_message``
suitable for direct chat-bubble rendering and (b) the raw payload
on ``.payload`` for diagnostic logging. Class-level ``retryable``
flag declares whether the chat layer should auto-retry.
* ``_classify_lemonade_response`` walks the response envelope —
including the nested ``details.response.error`` shape Lemonade
uses for backend_error wrappers — and returns the matching typed
exception or a generic LemonadeError when the shape is novel.
* The single ``raise ValueError(f"Unexpected response format from
Lemonade Server: {response}")`` at the dict-of-choices guard now
routes through the classifier, so callers see typed errors instead
of a string-blob ValueError.
src/gaia/ui/_chat_helpers.py
* ``_classify_chat_exception`` walks the cause chain AND the
stringified message text to detect typed errors even when AgentSDK
wraps the original LemonadeError in a generic ValueError /
RuntimeError (which it does in several paths).
* ``_run_agent`` (the streaming worker thread) now does ONE automatic
retry on retryable Lemonade errors. On a model-not-loaded /
network-error first failure: forces a fresh ``_maybe_load_expected_model``
at our 32K ctx, emits a "Model reloaded — retrying..." status SSE
so the user sees the recovery (not silent retry), and re-runs
``agent.process_query``. If THAT fails too, the friendly
user_message is surfaced instead of the raw exception.
* The catch-all ``except Exception`` block now prefers
``LemonadeError.user_message`` over ``str(e)``, so even
non-retryable errors (context overflow) come through as a
plain-English explanation instead of a JSON blob.
Tests: full chat-ui suite (46 tests) passes; lint clean. The retry
itself is exercised end-to-end the next time you trigger a model
eviction — start a fresh chat session and you'll see the recovery
status rather than the wall-of-JSON failure mode.
…t path Layer 2 + 3 of the chat-resilience push (continuation of 0795a5d). src/gaia/ui/_chat_helpers.py * **Pre-flight ctx_size guard.** ``_maybe_load_expected_model`` previously checked ``active_ctx and active_ctx < N`` — short-circuit evaluation meant a 0-or-missing ctx slipped through, leaving a broken model in place. New guard explicitly treats missing ctx as "needs reload" so a model loaded with no recipe_options.ctx_size (which is exactly the broken state we want to recover from) gets re-loaded at our 32K canonical. * **Bounded reload timeout.** Pre-flight now passes ``timeout=120`` to ``LemonadeClient.load_model`` instead of inheriting the default DEFAULT_MODEL_LOAD_TIMEOUT (12000 s / 200 min). Cold load of a 4B GGUF on consumer hardware fits in <60 s; if it hasn't completed in 120 s something is genuinely wrong and we'd rather surface the failure than block the chat thread for hours. * **Non-streaming chat now retries identically to streaming.** The one-shot retry on transient Lemonade errors (model evicted between turns, network blip) was only in the streaming worker before; the non-streaming path raised straight to the user. Mirrored the same classify → reload → retry sequence in ``_get_chat_response`` so both paths recover the same way. * **Friendly error mapping in non-streaming.** The catch-all ``except Exception`` in ``_get_chat_response`` was returning a stock "I'm having trouble connecting..." string regardless of the actual failure mode. Now it prefers the typed ``LemonadeError.user_message`` (e.g. "This conversation got too long for the model's context window"), falling back to the stock copy only when classification fails. Together with 0795a5d, this closes the three failure modes the user keeps hitting in the UI: * Wrong-ctx model load (now caught by tighter guard, reloaded at 32K) * Mid-conversation eviction (caught by retry on both streaming/non-streaming) * Lemonade hang during reload (bounded by 120-s timeout instead of 200 min) Tests: 60 chat-ui + preflight tests pass.
NameError regression introduced in 1a807d2 (auto-titling commit) — the streaming-path background task at the bottom of ``_stream_chat_response`` called ``_effective_model(agent, model_id)``, but ``agent`` lives inside the producer thread's local scope and isn't visible to the outer generator. Every streaming chat turn was therefore raising NameError after the response otherwise completed, which the catch-all then surfaced as "Sorry, something went wrong on my end". src/gaia/ui/_chat_helpers.py * Capture ``session_model = model_id`` in the outer scope right after custom-model resolution; reference that in the auto-title task instead of the inaccessible ``agent``. * Comment explains why the indirection — same model id the pre-flight + agent factory used, just made visible to the post-stream cleanup. Verified by re-running a fresh streaming chat turn end-to-end against gaia-lite: 3.1 s round-trip, ``done=True``, no error events, response "4" persisted correctly. Found via the chaos-test harness from the same reliability push — which was the point of layer 4.
…bling raw error Small models (4B-class) occasionally emit malformed native tool_calls envelopes — e.g. a 1000+ char summary_type argument that gets truncated mid-string. Previously _parse_llm_response raised ValueError uncaught, the exception bubbled through the chat helper, and the user saw: Agent error: Malformed native tool_calls envelope: Expecting ',' delimiter: line 1 column 220 (char 219) The fix wraps the parse call in process_query with a try/except that: 1. Logs the parse error to error_history (type=tool_call_parse_error) 2. Appends a synthetic recovery prompt instructing the model to retry with documented enum values (or fall back to plain text) 3. Continues the loop so the next LLM call has clean conversation 4. After 3 consecutive parse failures, gives up gracefully with a friendly fallback rather than spamming the user Same handler also catches finish_reason=length (tool-call truncated mid-arguments) and parallel tool_calls (NotImplementedError), since both manifest as the same user-facing failure mode. Surfaced by GAIA eval baseline scenario: - personality/honest_limitation Turn 2: 1000+ char summary_type arg - rag_quality/negation_handling Turns 1-2: finish_reason=length Tests: tests/unit/agents/test_parse_error_recovery.py — covers the parse-error path, the 3-strikes graceful giveup, and the underlying ValueError still raises from _parse_llm_response itself.
Two failure modes surfaced by GAIA eval baseline against gaia-lite on
Qwen3.5-4B-GGUF:
1. Context-overflow mid-loop (5 of 8 baseline failures)
When a multi-step ReAct turn accumulates tool results from several
search_file/index_document calls, the cumulative `messages` array
eventually exceeds the model's 32K context window. Lemonade returns
exceed_context_size and the chat helper surfaces "This conversation
got too long for the model's context window. Start a fresh task..."
Fix: wrap the LLM-call try/except in process_query (both streaming
and non-streaming branches) with a retry loop. When we detect a
context-overflow exception (substring match on the upstream error
text — typed errors get wrapped by AgentSDK), trim the messages
array to {first user input + last 4 entries} and retry ONCE. If
the retry also fails, return a friendly message asking the user to
start fresh — no more raw exception leaks.
Surfaced by:
- tool_selection/smart_discovery T1 (search → index → blow up)
- error_recovery/file_not_found T1+T2 (failed index → deep search)
- error_recovery/search_empty_fallback T1
- captured/captured_eval_smart_discovery
2. list_windows had no macOS branch
The tool returned "Window listing not available. Install pywinauto
(Windows) or wmctrl (Linux)." on Mac, and the agent reported "I
can't list open windows on this Mac" — judged FAIL.
Fix: add a Darwin branch that uses osascript with System Events to
return the visible (non-background) processes — equivalent to what
the user sees in Mission Control. No new dependency: osascript
ships with macOS.
Surfaced by web_system/list_windows.
Tests: extended tests/unit/agents/test_parse_error_recovery.py with
TestProcessQueryRecoversOnContextOverflow — covers the trim+retry
success path AND the after-retry-still-fails graceful fallback.
…keep latest
Previous trim strategy (keep first + last 4 messages) didn't help when the
final tool result itself was huge (e.g. RAG query returning many KB of
chunks). After trim we still couldn't fit, so the agent gave up.
New _shrink_messages_for_overflow helper:
- Keep the original user query intact
- Keep the LATEST tool result intact (model needs it to answer)
- Replace older tool results with a tiny stub
("[tool result omitted -- context overflow recovery]")
- Truncate verbose assistant chain-of-thought to 800 chars
This preserves the structural shape of the conversation so the model
still understands what tools have been called, but drops the bulk of
the bytes. Same shrink applied in both streaming and non-streaming
branches of process_query.
Surfaced by tool_selection/smart_discovery T1: agent indexed handbook
correctly, but the indexed-content tool result + chain-of-thought
combined to push past 32K. Old trim retained both and still failed;
new shrink keeps just the latest result so retry succeeds.
Tests still pass — the existing context-overflow tests now exercise
the new path implicitly via the same "raise then succeed" pattern.
The chat helper has a one-shot retry that calls _maybe_load_expected_model
to reload the model at the canonical 32K ctx — but only fires for errors
where ``retryable=True``. LemonadeContextOverflowError was always non-
retryable, so when a previous load left Qwen3.5-4B at 4096-ctx (e.g.
after an embedding model swap or auto-load default), the agent surfaced
"This conversation got too long" to the user even though the actual
remedy was a model reload.
Now LemonadeContextOverflowError.retryable is set dynamically based on
the reported n_ctx:
- If n_ctx < 32768 → retryable=True. The model was loaded with the
wrong ctx_size; the chat layer's one-shot retry will reload at 32K
via _maybe_load_expected_model and the same prompt will fit.
- If n_ctx == 32768 → retryable=False. This is a genuine "conversation
too big" situation; retry won't help, surface the friendly message.
Two parsing paths updated:
- _classify_lemonade_response (provider-side, structured payload):
reads n_ctx from nested error.details.response.error.n_ctx
- _classify_chat_exception (chat-layer fallback for when AgentSDK
re-raises with str(original)): regex-extracts the ctx number from
the textual message ("context size (4096 tokens)")
Surfaced by tool_selection/smart_discovery — backend was loaded with
n_ctx=4096 after a fresh restart sequence, the original handbook+RAG
turn legitimately needed ~18K tokens, and we were stuck refusing the
turn instead of reloading.
…load The previous commit made LemonadeContextOverflowError retryable when n_ctx < 32K, but the chat helper's reload-and-retry never fired because the agent's own try/except in process_query catches the exception first and runs its in-loop trim-and-retry instead. The trim doesn't help when the real issue is a 4K-loaded model — the second request goes to the same broken backend. Fix: detect the wrong-ctx-loaded sub-case in agent.py (substring match on "context size (4096|8192|16384" / "n_ctx': 4096|8192|16384") and RE-RAISE instead of trimming. This bubbles the typed error up to the chat helper, where _classify_chat_exception now reads it as retryable and triggers _maybe_load_expected_model to reload at 32K before the chat layer's own one-shot retry. Genuine "conversation too big to fit even at 32K" still goes through the trim+retry+friendly-fallback path as before. Both streaming and non-streaming branches updated symmetrically.
…lope The OpenAI tool_call spec says ``function.arguments`` is a JSON string, but llama.cpp 4B-class models occasionally emit it as a pre-parsed dict. ``json.loads(dict)`` raised TypeError which our recovery layer didn't catch (it only listened for ValueError / NotImplementedError), so the exception bubbled out as: Agent error: the JSON object must be str, bytes or bytearray, not dict Surfaced by tool_selection/smart_discovery T1 after the previous context-overflow fix unblocked a path that was previously masked by that earlier failure. Now we accept either shape: - str/bytes → json.loads as before - dict → use directly - empty/None → empty dict - anything else → raise ValueError with a recovery-friendly message so the parse-error retry kicks in
The substring detection ("context size (4096", etc.) only fires when the
raw payload is preserved in the exception. When AgentSDK re-raises with
the typed LemonadeContextOverflowError's friendly user_message
("This conversation got too long..."), the n_ctx detail is gone, so
substring matching always missed.
Fix: when a context-overflow fires AND substring match doesn't say
"wrong ctx", probe the Lemonade health endpoint via httpx to read the
LLM's actual ctx_size. If < 32K, treat it as wrong-ctx and re-raise so
the chat helper reload-and-retry kicks in.
The probe times out fast (3s) and returns False on any failure, so the
caller cleanly falls through to in-loop trim if the probe is unreliable.
Surfaced by another smart_discovery rerun: ctx was 4096, my earlier
substring guards missed because str(exception) was already friendly.
The 19 real_world scenarios were SKIPPED_NO_DOCUMENT because the referenced corpus files did not exist on disk. Add a deterministic generator that authors all 19 documents from a single Python data structure, idempotent on re-run. Documents are synthetic paraphrases (no copyrighted source text) but contain every ground-truth fact each scenario references. XLSX files are flattened by RAGSDK._extract_text_from_xlsx into row-keyed prose that surfaces SKUs, totals, and notes for the chunker. Re-run with: python eval/corpus/gen_real_world.py
CI blockers: - src/gaia/eval/runner.py: guard `import fcntl` and `_acquire_eval_lock` body so Windows degrades to a no-op (the lock guards a Lemonade race that doesn't happen on Windows dev boxes). Unblocks Test Eval Tool (Windows) — was ModuleNotFoundError on every test in tests/test_eval.py. Apr-20 review actionable items: - src/gaia/ui/_chat_helpers.py: drop blanket `except Exception` in _canonical_agent_type — canonical_id is a pure dict lookup; CLAUDE.md fail-loudly. - src/gaia/ui/models.py: memory_available_gb -> Optional[float] = None. Avoids a false memory-warning banner if psutil ever silently falls back. TS type and SettingsModal updated to null-guard. - src/gaia/agents/registry.py: lift the 5 GB gaia-lite floor to a named constant _GAIA_LITE_MIN_MEMORY_GB so the rationale stays adjacent to the value. - src/gaia/ui/routers/system.py: clarify update_settings docstring re Pydantic vs GAIA convention; widen psutil exception handler so an OSError from virtual_memory() (containers/seccomp) doesn't 500 the status endpoint. Tests + lint hygiene: - tests/unit/test_chat_preflight.py: 6 stale tests now exercise the new pre-flight semantics (right model + ctx >= 32K). New test_right_model_wrong_ctx_triggers_reload covers the negative case. - tests/unit/test_lemonade_model_loading.py: assert ctx_size=32768 fallback for unknown models. New test_known_model_uses_registry_ctx_size guards the registry-lookup path. - tests/test_eval.py: new test_acquire_eval_lock_windows_noop with fcntl=None. - tests/unit/chat/ui/test_chat_helpers.py: source-shape regex updated for the new _build_create_kwargs(...) call shape; new TestCanonicalAgentType class ratchets the dropped silent-except. - src/gaia/llm/providers/lemonade.py: unused `is_err` -> `_is_err`. - src/gaia/ui/_chat_helpers.py: use module-level `_re` consistently. - tests/unit/agents/test_registry.py: missing `from unittest.mock import patch`. - src/gaia/apps/webui/package-lock.json: bumped 0.17.3 -> 0.17.4 to match package.json.
…loor, walk __context__ - Delete tunnel-friendly-error.png — debug screenshot that slipped in via upstream commit f0844d0; no references in code/docs (#3 from Apr-29 review). - Restore uv.lock requires-python ">=3.13" to match origin/main (was silently narrowed to ">=3.12" in the same upstream commit). setup.py's python_requires stays >=3.10; the lock now no longer drifts from main (#4 from Apr-29 review). - Restore src/gaia/apps/webui/package-lock.json to origin/main (revert my drive-by 0.17.3 -> 0.17.4 bump). Main itself has the package.json=0.17.4 vs lockfile=0.17.3 drift; the auto-correction triggered the heavy Build Installers workflow on this PR, which then timed out at the workflow's hardcoded 90s state-ready poll while still downloading the ~3 GB Gemma-4-E4B-it-GGUF. Reverting eliminates the unrelated CI noise; the lockfile/package.json drift is its own tech debt. - _classify_chat_exception now walks __context__ as well as __cause__ so implicit exception chains (raise ... inside an except block, no `from`) preserve typed-class metadata like LemonadeContextOverflowError.retryable (#5 from Apr-29 review).
b2a952f to
62d0d2f
Compare
Bring the feature branch back to green by addressing the cluster of CI failures that landed when the memory v2 work merged with main. All fixes are mechanical or scoped to test isolation — no behavioural change to the memory pipeline itself. - Restore lost merge-conflict state in `ChatView.tsx` and `Sidebar.tsx`: the `getSessionHash` import, `hashCopied`/`copied` state, and the `handleCopyHash` callback all dropped during the merge — Vite build was failing on missing identifiers across PyPI Build Check and all three Build Installers jobs. - Lint/Pylint cleanup so the `Code Quality (Lint)` job is green again: remove unused vars/imports, drop dead `if x != x` branches, and promote a few pointless lambdas to method references in `agents/base/discovery.py`. Reorder `routers/memory.py` imports to satisfy isort. - Tighten `_canonical_agent_type` to surface `AttributeError` instead of swallowing it (matches the existing regression test added in #802; was failing locally and in CI Unit Tests). - Add an explicit `GAIA_MEMORY_DISABLED=1` opt-out to `MemoryMixin.init_memory`. The Path Validator security tests, Unit Tests, and Chat Agent Tests jobs all instantiate `ChatAgent`/`CodeAgent` without a Lemonade server available; the memory v2 hard-requirement on the embedding service fails them. This is a deliberate, named opt-out (not a silent fallback) — tests that exercise memory itself clear the variable via the new `tests/unit/conftest.py` autouse fixture and the `_mock_v2_init_context` helper, so memory test coverage is unchanged. CI workflows that don't need memory now set the env var explicitly.
Summary
Ships Chat Lite — a 4B-model sibling of the built-in Chat Agent for Macs and other hardware that cannot host the 35B default — plus the three Settings controls needed to make swapping models practical: an Active Model override, a Context Size picker, and per-agent Memory Warnings. Chat Agent and its defaults are untouched; this is purely additive.
Threads
New
chat-liteagent (registry.py) — reusesChatAgentbut presetsmodel_idtoQwen3-4B-Instruct-2507-GGUF(falls back toQwen3-4B-GGUF). Appears alongside Chat in the picker. Why: 35B won't load on ~8-16GB machines, so users were stuck with no working out-of-the-box option.AgentInfo.min_memory_gb— new optional field on registrations/manifests/API. Chat Lite declares5.0, Chat keepsNone. Settings renders a Memory Warnings block only for agents whose requirement exceedsmemory_available_gb. Why: warn before the user wastes time picking an agent that will OOM.Settings: Active Model — text field bound to the existing
custom_modelsetting, with "Use agent default" placeholder. Empty → agent's registeredmodels[0]wins (unchanged backend logic). Why: users needed a visible way to swap models per-agent without editing settings JSON.Settings: Context Size — preset chips (4K / 8K / 16K / 32K) plus numeric input, Apply reloads the active model via the existing
/api/system/load-model. Why: matches what Lemonade'slemonade load --ctx-sizealready supports but the UI never exposed.expected_model_loadedrespects registered agents (system.py) — used to hardcode the 35B default, so Chat Lite's 4B always tripped "Wrong model loaded". Now accepts any registered agent's preferred model as valid. Why: the old check is wrong the moment you have more than one agent with different model preferences.Pre-flight load handles wrong-model + small-ctx (
_chat_helpers.py) —_maybe_load_expected_modelused to short-circuit if any LLM was active, even the wrong one, and never checked ctx. It now requires the specific expected model with ctx ≥ 32K; otherwise it reloads. Why: Lemonade auto-loads requested models at its default 4096 ctx, silently truncating ChatAgent's >7K-token system prompt, producing an empty stream. This is what blocked Chat Lite from ever returning a response._ensure_model_loadedalso gains a 32K fallback for models not in the built-inMODELSregistry (same rationale).Test plan
python -m pytest tests/unit/agents/ tests/unit/chat/ui/test_agents_router.py— 156 pass, incl. 10 newpython util/lint.py --black --isort— greencd src/gaia/apps/webui && npm run build— clean, 0 new warningschat-litesession, send"Reply with exactly: hello from chat-lite"— model auto-loads, streams back"hello from chat-lite"at ~60 tok/sOut of scope
Two related improvements deferred to follow-ups:
localhost:8000. Users currently needLEMONADE_BASE_URL=http://localhost:13305/api/v1as an env var. A port probe would remove that step._try_reload_with_ctxreloads whatever model is currently loaded, not the target model. Harmless after the pre-flight fix above, but worth cleaning up.