
feat(agents): add Chat Lite + Settings model/ctx/memory controls #802

Merged
kovtcharov-amd merged 32 commits into main from feature/mac-4b-default on Apr 29, 2026

Conversation

@kovtcharov (Collaborator)

Summary

Ships Chat Lite — a 4B-model sibling of the built-in Chat Agent for Macs and other hardware that cannot host the 35B default — plus the three Settings controls needed to make swapping models practical: an Active Model override, a Context Size picker, and per-agent Memory Warnings. Chat Agent and its defaults are untouched; this is purely additive.

Threads

  • New chat-lite agent (registry.py) — reuses ChatAgent but presets model_id to Qwen3-4B-Instruct-2507-GGUF (falls back to Qwen3-4B-GGUF). Appears alongside Chat in the picker. Why: 35B won't load on ~8-16GB machines, so users were stuck with no working out-of-the-box option.

  • AgentInfo.min_memory_gb — new optional field on registrations/manifests/API. Chat Lite declares 5.0, Chat keeps None. Settings renders a Memory Warnings block only for agents whose requirement exceeds memory_available_gb. Why: warn before the user wastes time picking an agent that will OOM.

  • Settings: Active Model — text field bound to the existing custom_model setting, with "Use agent default" placeholder. Empty → agent's registered models[0] wins (unchanged backend logic). Why: users needed a visible way to swap models per-agent without editing settings JSON.

  • Settings: Context Size — preset chips (4K / 8K / 16K / 32K) plus numeric input, Apply reloads the active model via the existing /api/system/load-model. Why: matches what Lemonade's lemonade load --ctx-size already supports but the UI never exposed.

  • expected_model_loaded respects registered agents (system.py) — used to hardcode the 35B default, so Chat Lite's 4B always tripped "Wrong model loaded". Now accepts any registered agent's preferred model as valid. Why: the old check is wrong the moment you have more than one agent with different model preferences.

  • Pre-flight load handles wrong-model + small-ctx (_chat_helpers.py) — _maybe_load_expected_model used to short-circuit if any LLM was active, even the wrong one, and never checked ctx. It now requires the specific expected model with ctx ≥ 32K; otherwise it reloads. Why: Lemonade auto-loads requested models at its default 4096 ctx, silently truncating ChatAgent's >7K-token system prompt, producing an empty stream. This is what blocked Chat Lite from ever returning a response. _ensure_model_loaded also gains a 32K fallback for models not in the built-in MODELS registry (same rationale).
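
A minimal sketch of the new pre-flight decision, assuming a status dict that carries the loaded model id and its context size (field names here are illustrative, not the actual Lemonade status schema):

def needs_reload(status: dict, expected_model: str, min_ctx: int = 32768) -> bool:
    loaded = status.get("model_loaded")  # currently loaded model id, or None
    ctx = status.get("ctx_size") or 0    # missing/zero ctx counts as "needs reload"
    # Old behavior: skip the load whenever any LLM was active, even the wrong
    # model sitting at Lemonade's 4096 default ctx. New behavior:
    return loaded != expected_model or ctx < min_ctx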

Test plan

  • python -m pytest tests/unit/agents/ tests/unit/chat/ui/test_agents_router.py — 156 pass, incl. 10 new
  • python util/lint.py --black --isort — green
  • cd src/gaia/apps/webui && npm run build — clean, 0 new warnings
  • End-to-end on Mac: create a chat-lite session, send "Reply with exactly: hello from chat-lite" — model auto-loads, streams back "hello from chat-lite" at ~60 tok/s
  • Second message in same session — no reload (pre-flight correctly short-circuits when model + ctx already match)
  • Open Settings modal on a running instance: verify Active Model, Context Size, and Memory Warnings sections render and persist round-trips

Out of scope

Two related improvements deferred to follow-ups:

  • Lemonade's admin API lives on port 13305 in v10.x but GAIA still defaults to localhost:8000. Users currently need LEMONADE_BASE_URL=http://localhost:13305/api/v1 as an env var. A port probe would remove that step.
  • _try_reload_with_ctx reloads whatever model is currently loaded, not the target model. Harmless after the pre-flight fix above, but worth cleaning up.

@github-actions bot added the agents, llm (LLM backend changes), tests (Test changes), electron (Electron app changes), and performance (Performance-critical changes) labels on Apr 18, 2026
kovtcharov added a commit that referenced this pull request Apr 18, 2026
… + validate ctx_size

Follows up on the architecture review of PR #802 with three polish
fixes; no behaviour change vs the previous commit on main agents
paths except the `ctx_size` validation.

1. **Single source of truth for the 32K context requirement.**
   ``DEFAULT_CONTEXT_SIZE`` now lives in ``gaia.llm.lemonade_client``
   and is re-exported by ``lemonade_manager`` for backwards compat.
   The router's ``_MIN_CONTEXT_SIZE`` is aliased to it. Eliminates
   the three side-by-side copies (each carrying a "must match ..."
   comment — exactly the smell review flagged as drift-prone).

2. **Extract ``_session_agent_kwargs`` helper** in ``_chat_helpers``.
   The four ChatAgent/registry.create_agent call sites used to
   repeat the same 4-field bundle (rag_documents, library_documents,
   allowed_paths, ui_session_id). Centralising it means adding a
   new field — or forgetting one, which is what bit us last time
   and caused Chat Lite's PermissionError spiral — happens in one
   place. Unknown kwargs are still filtered by the per-agent
   factory, so this remains safe for manifest agents that don't
   recognise all fields.

3. **Validate ``LoadModelRequest.ctx_size > 0``.** Manual testing
   showed ``ctx_size: -1`` and ``ctx_size: 0`` were silently
   accepted by the endpoint and then failed deep in Lemonade with
   no actionable error. ``Field(None, gt=0)`` now surfaces a 422
   at the boundary with a readable message.
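
   A minimal reproduction of that boundary check (the ``Field(None, gt=0)``
   line is quoted from the change; the surrounding model is a sketch):

       from typing import Optional
       from pydantic import BaseModel, Field

       class LoadModelRequest(BaseModel):
           model: str                                   # illustrative field
           ctx_size: Optional[int] = Field(None, gt=0)  # None means "use default"; 0/-1 rejected

       # LoadModelRequest(model="x", ctx_size=0) raises ValidationError,
       # which FastAPI surfaces as a 422 with a readable message.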

Verified with the existing 153-test suite and a live end-to-end
sweep over the chat-lite agent on a Mac: cold auto-load, warm
reuse, agent switch, parallel sessions (3 simultaneous), ctx
reload 32K→64K, long 6K-token input, empty message, malformed
load-model payload — all pass.
@kovtcharov kovtcharov self-assigned this Apr 18, 2026
@kovtcharov kovtcharov requested a review from itomek-amd April 18, 2026 04:47
@github-actions (Contributor)

Summary

Ships the Gaia Lite agent (ChatAgent + 4B model for low-memory hardware) and three Settings controls (Active Model override, Context Size picker, Memory Warnings) — purely additive and well-bounded. The real gem is the _maybe_load_expected_model fix in src/gaia/ui/_chat_helpers.py:407-467: the old short-circuit ("any LLM loaded → skip") silently truncated ChatAgent's >7K-token prompt when Lemonade had auto-loaded a model at 4096 ctx, producing empty streams. The new check (right model and ctx ≥ 32K) is the only reason Gaia Lite can ever return a response. Strong test coverage (10 new tests), clean consolidation of DEFAULT_CONTEXT_SIZE into a single source of truth, thoughtful comments at every non-obvious branch.

Issues Found

🟢 Minor

Silent except Exception in _canonical_agent_type (src/gaia/ui/_chat_helpers.py:93-106)

canonical_id is a pure dict lookup on a string — it physically cannot raise unless agent_type is unhashable, which the type hint already rules out. The blanket except Exception: return agent_type is silent degradation that CLAUDE.md's "fail loudly" rule tries to eliminate. Either drop the try/except entirely or narrow it:

def _canonical_agent_type(agent_type: str) -> str:
    """Resolve legacy agent-type aliases (e.g. ``chat-lite`` → ``gaia-lite``).

    Keeps the per-session agent cache from thrashing when a client mixes the
    old and new IDs within the same session — both resolve to the same
    canonical ID and therefore the same cache entry.
    """
    registry = _agent_registry
    if registry is None:
        return agent_type
    return registry.canonical_id(agent_type)

memory_available_gb defaults to 0.0 in src/gaia/ui/models.py:25

If psutil ever fails to populate memory_available_gb (e.g. an exception path in system_status), the default 0.0 makes a.min_memory_gb > 0 trip the Memory Warnings banner for every agent with a declared requirement — a confusing false-positive. Cheap guard: check a sentinel (status.memory_available_gb > 0) inside the filter in SettingsModal.tsx:549-551, or make the backend field Optional[float] = None and gate rendering on != null.

                        const warnings = status.memory_available_gb > 0
                            ? agents.filter(
                                (a) => a.min_memory_gb != null && a.min_memory_gb > status.memory_available_gb,
                            )
                            : [];

Source-shape regex test is fragile (tests/unit/chat/ui/test_chat_helpers.py:1493-1525)

The regex-over-source assertion for "streaming branch passes rag_file_paths=[]" will break if a future refactor adds a line break inside the call (e.g. a formatter change), even though the behavior is still correct. The comment acknowledges the tradeoff, but you could get the same ratcheting effect by monkeypatching registry.create_agent and asserting on the received kwargs — no mock for the full SSE/Lemonade stack required. Keep as-is if the cost/benefit feels right; flagging because a false CI red in 6 months will not be obvious.
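
A sketch of that alternative; the entry point and import paths are assumptions for illustration:

import pytest
from gaia.agents import registry           # assumed import path
from gaia.ui import _chat_helpers          # assumed module under test

def test_streaming_passes_empty_rag_file_paths(monkeypatch):
    received = {}

    def fake_create_agent(agent_type, **kwargs):
        received.update(kwargs)
        raise RuntimeError("captured")     # stop before the SSE/Lemonade stack runs

    monkeypatch.setattr(registry, "create_agent", fake_create_agent)
    with pytest.raises(RuntimeError):
        _chat_helpers.start_stream("session-1", "hi")  # hypothetical entry point
    assert received.get("rag_file_paths") == []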

Memory threshold comment vs. constant (src/gaia/agents/registry.py:102-109)

The 5 GB floor is justified as "Q4_K_M weights ~2.5 GB + context + runtime headroom". Nice rationale — consider lifting 5.0 to a module constant (_GAIA_LITE_MIN_MEMORY_GB = 5.0) so the number and the justification live together and don't drift if a bigger checkpoint is added to the models list later.

update_settings null-vs-empty-string docstring (src/gaia/ui/routers/system.py:430-435)

The new docstring is correct in its current wire format, but Pydantic absolutely can distinguish null from "unset" if the field uses Optional[str] = Field(default=None) + checks "custom_model" in request.model_fields_set. The comment bakes in a client-side workaround ("send \"\"") that future maintainers might puzzle over. Consider a one-line "see SettingsUpdateRequest for why" pointer, or clarify that this is a GAIA choice, not a Pydantic limitation.
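
What that distinction looks like in Pydantic v2 (the request model here is a sketch, not GAIA's actual SettingsUpdateRequest):

from typing import Optional
from pydantic import BaseModel, Field

class SettingsUpdateRequest(BaseModel):
    custom_model: Optional[str] = Field(default=None)

explicit_null = SettingsUpdateRequest.model_validate({"custom_model": None})
assert "custom_model" in explicit_null.model_fields_set                # sent null: clear the override
assert "custom_model" not in SettingsUpdateRequest().model_fields_set  # never sent: leave as-is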

Strengths

  • Single source of truth for DEFAULT_CONTEXT_SIZE — consolidated in src/gaia/llm/lemonade_client.py:100-108 and re-exported from lemonade_manager and routers/system. Exactly the kind of consolidation that prevents the "ctx ceiling drifted between modules" bug class.
  • _session_agent_kwargs helper in src/gaia/ui/_chat_helpers.py:342-365 dedupes four call sites that were already a foot-gun (the non-streaming-vs-streaming rag_file_paths=[] asymmetry is now documented at the call sites rather than being a silent timebomb).
  • Legacy chat-lite alias (_LEGACY_ID_ALIASES in registry.py:107-113) avoids a DB migration for persisted sessions, is plumbed through get, create_agent, and resolve_model (the last one is easy to miss and has a specific regression test), and the alias map is trivial to extend on the next rename.
  • Manifest factory kwarg filtering (registry.py:428-449) correctly identifies that the dynamic class's **kwargs forwards to BaseAgent.__init__ and introspects that signature rather than the class's — the comment explicitly explains why MRO-walking is a footgun.
  • ctx_size: Optional[int] = Field(None, gt=0) in LoadModelRequest (system.py:468) — boundary-level validation that returns a clean 422 instead of a cryptic Lemonade error. Exactly what CLAUDE.md's "fail loudly at the boundary" rule asks for.
  • MessageBubble "Gaia"-prefix strip (MessageBubble.tsx:242-251) — the word-boundary regex correctly handles both "Gaia Lite" and a hypothetical "Gaiadocs"; the inline comment earns its place.

Verdict

Approve with suggestions — no blocking issues. The except Exception and the memory_available_gb = 0.0 default are real but low-severity, and the other flags are nice-to-haves. Ship it after the author decides which suggestions to apply.

@kovtcharov (Collaborator, Author)

@claude troubleshoot why our CI/CD pipeline is failing.

@github-actions (Contributor)

CI/CD Failure Triage for PR #802

Two of the required checks are red on commit 6fd1e6f (run list):

🟡 1. Code Quality (Lint) — 2 issues

Job log

  1. Black formatting: src/gaia/mcp/mcp_bridge.py (the "error": {…} block around L332). Fix: python util/lint.py --black --fix
  2. Agent conventions: schemas/agent-manifest.schema.json is stale vs. the manifest model (this PR touches AgentInfo.min_memory_gb and resolve_model in registry.py). Fix: python util/gen_manifest_schema.py

Note: the manifest-schema regeneration is the genuine tell here — the schema generator wasn't re-run after min_memory_gb landed, so it drifts from the runtime Pydantic model. That's the same drift the existing linter is designed to catch.

🟡 2. Unit Tests — 6 failures, all caused by this PR

Job log: 6 failed, 1484 passed, 16 skipped

All six are stale tests that still encode the old short-circuit behavior, not bugs in the new code. They directly contradict the "Pre-flight load handles wrong-model + small-ctx" thread in the PR description:

tests/unit/test_chat_preflight.py (4 tests) — pre-flight now requires the right model and ctx ≥ 32K, not just "any LLM is loaded":

  • test_llm_active_skips_load — asserted LemonadeClient not called; actual call: load_model('Qwen3.5-35B-A3B-GGUF', ctx_size=32768, prompt=False)
  • test_vlm_active_skips_load — same
  • test_concurrent_second_thread_skips_load — same
  • test_sse_not_called_on_fast_path — asserted _emit not called; actual calls: "Loading LLM model..." + "Could not auto-load LLM..."

tests/unit/test_lemonade_model_loading.py (2 tests): _ensure_model_loaded now passes ctx_size=32768 for models not in the built-in MODELS registry (the 32K fallback described in the PR):

  • test_calls_load_when_model_not_loaded — expected ctx_size=None, got ctx_size=32768
  • test_model_loaded_when_not_present — same

Suggested fix plan

# 1. Reformat + regen schema
python util/lint.py --black --fix
python util/gen_manifest_schema.py

# 2. Update the 6 stale tests to assert the new behavior:
#    - test_chat_preflight.py: wrong-model or small-ctx case should now expect a reload + SSE emit
#    - test_lemonade_model_loading.py: expected `ctx_size=32768` on models outside MODELS registry
python -m pytest tests/unit/test_chat_preflight.py tests/unit/test_lemonade_model_loading.py -xvs

# 3. Full lint + unit suite before re-pushing
python util/lint.py --all
python -m pytest tests/unit/

All other checks (API Tests, CLI/Linux, CLI/Windows, Agent SDK, RAG, Security, CodeQL, Lemonade Embeddings) are green, so scope is narrow.

@github-actions bot added the mcp (MCP integration changes), cli (CLI changes), and eval (Evaluation framework changes) labels on Apr 26, 2026
@kovtcharov removed the agents label on Apr 26, 2026
@github-actions bot added the devops (DevOps/infrastructure changes), agents, and dependencies (Dependency updates) labels on Apr 26, 2026
@kovtcharov (Collaborator, Author)

@claude review this PR.

@github-actions (Contributor)

PR Review — feat(agents): add Chat Lite + Settings model/ctx/memory controls

I reviewed the current state of this PR (HEAD 2c967a4, 31 commits, +6857/-2219 across 68 files). The Gaia Lite + Settings work itself looks solid — the previous review covered that scope well. But the PR has materially changed since then and I see three blockers plus a serious scope concern.


🔴 Blockers (must fix before merge)

1. eval/runner.py is Windows-incompatible — breaks 50+ tests.

src/gaia/eval/runner.py:20 unconditionally imports fcntl for the new process-wide lock (commit 404ee397). fcntl is POSIX-only, so on Windows the module load itself fails with ModuleNotFoundError: No module named 'fcntl' and every test that imports gaia.eval.runner errors out. The Test Eval Tool (Windows) job shows ~50 failures, all with this exact stack.

The lock comment already acknowledges "better degraded than dead" for the OSError path on /tmp — apply the same logic to platform: if sys.platform == "win32": yield; return, or use msvcrt.locking() for cross-platform support. As written, this PR cannot ship to Windows users.
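
A minimal sketch of that guard, assuming the lock is a contextmanager (a real cross-platform fix might prefer msvcrt.locking):

import sys
from contextlib import contextmanager

if sys.platform == "win32":
    fcntl = None   # POSIX-only module; the unconditional import is what breaks Windows
else:
    import fcntl

@contextmanager
def eval_lock(path="/tmp/gaia-eval-agent.lock"):
    if fcntl is None:   # Windows: degrade to a no-op lock ("better degraded than dead")
        yield
        return
    with open(path, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield           # flock is released when the file closes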

2. schemas/agent-manifest.schema.json is stale — lint job fails.

Code Quality is red on the same issue I flagged on Apr 20: the min_memory_gb field landed in the Pydantic model but the generated schema wasn't regenerated. One-shot fix:

python util/gen_manifest_schema.py
git add schemas/agent-manifest.schema.json && git commit -m "chore: regenerate manifest schema"

3. Merge conflicts against main.

gh pr view reports mergeable: CONFLICTING. The last merge from main was dbf0100a, on day 12 of this branch's life; a fresh rebase will surface the conflicts and let you resolve them deliberately rather than at merge time.


🟡 Important — scope creep is hiding the real review surface

The PR title and description still describe the original 6 threads (Gaia Lite agent, 3 Settings controls, expected_model_loaded, pre-flight load). The actual diff now contains at least 11 additional themes that aren't mentioned:

| Theme | Evidence | LoC |
| --- | --- | --- |
| Synthetic eval corpus generator | eval/corpus/gen_real_world.py + 19 docs + 4 .xlsx | +1303 |
| --agent-type flag for eval | src/gaia/eval/runner.py | ~200 |
| Process-wide eval lock (the fcntl one) | src/gaia/eval/runner.py | ~200 |
| Tool-call envelope recovery + context-overflow retry | src/gaia/agents/base/agent.py | ~540 |
| Typed Lemonade errors + auto-retry | src/gaia/llm/providers/lemonade.py | ~160 |
| UI polish (cards, sidebar, ctx badge, drag-drop, varied greetings) | src/gaia/apps/webui/src/components/* | ~700 |
| Live model-download progress + auto-load + banner priority | src/gaia/apps/webui/src/components/ConnectionBanner.tsx | ~140 |
| Removed PyPI wheel-dist verification gate | .github/workflows/pypi.yml, util/verify_wheel_dist.py | −392 +33 |
| Removed AGENTS.md (168 lines) | repo root | −168 |
| Removed 6 test modules: test_lemonade_manager_preload.py, test_init_ctx_size.py, test_export_import.py, test_code_index_sdk.py, test_ui_extras.py, test_verify_wheel_dist.py | tests/unit/ | −981 |
| Docs edits across agent-ui, code-index, custom-agent, patterns, plugin-registry, setup-wizard | docs/ | ~250 |

The reliability/UI polish work is mostly good in isolation — but I cannot do an honest review of "Chat Lite + Settings controls" when the diff also silently removes the PyPI wheel-dist gate (which was added specifically to prevent the webui-bundling regression that motivated docs/sdk/sdks/agent-ui.mdx) and 6 test modules totaling ~1000 LoC.

Two things would unblock review:

  1. Update the PR description to enumerate every theme with a why, especially the deletions:
    • Why is verify_wheel_dist gone? Has the underlying regression class been ruled out?
    • Why is AGENTS.md deleted while CLAUDE.md still references AI-agent conventions?
    • Why were those 6 test modules removed — were the modules-under-test deleted, or just the tests?
  2. Or split — at minimum, separate the eval corpus + reliability fixes into their own PRs. The original Chat Lite scope is small and reviewable; bundled this way it isn't.

🟢 What's still good

  • The original 6 threads still hold up; suggestions from the Apr 20 review (the silent except Exception in _canonical_agent_type, the memory_available_gb = 0.0 default, the source-shape regex test) are worth checking but none are blockers.
  • Test additions are thorough — tests/unit/agents/test_parse_error_recovery.py (+185), the registry tests (+566), and the chat-helpers expansion (+114) are all well-targeted.
  • The _chat_helpers.py consolidation around _session_agent_kwargs and _maybe_load_expected_model (right model and ctx ≥ 32K) is the real load-bearing fix and it's well-commented.

Suggested path forward

# 1. Fix Windows fcntl
# In src/gaia/eval/runner.py, guard the import + helpers behind sys.platform.

# 2. Regenerate schema
python util/gen_manifest_schema.py

# 3. Rebase and re-run CI
git fetch origin && git rebase origin/main

# 4. Update PR description to match the actual diff (or split)

Once Windows + lint are green and the description matches the diff, this is reviewable for merge. Tagging @kovtcharov-amd for the scope/split decision since the deletions of AGENTS.md, the PyPI gate, and ~1000 LoC of tests are call-it-out-loud changes that need a maintainer eyes-on regardless of how the rest of the review goes.

itomek pushed a commit that referenced this pull request Apr 29, 2026
… + validate ctx_size

@itomek force-pushed the feature/mac-4b-default branch from 2c967a4 to 5f51005 on April 29, 2026 18:09
@github-actions (Contributor)

Summary

Substantively this is a strong, well-reasoned PR — the Chat Lite agent + Settings controls solve a real "35B won't load on this box" UX hole, and the pre-flight model + ctx≥32K fix is the right diagnosis for the silent-truncation bug. Inline comments are unusually good at explaining why (legacy-alias rationale, ctx-size floor reasoning, Gemma-vs-Qwen platform split). Most of my notes are about scope/description hygiene rather than the code itself.

Issues

🟡 Important

1. PR description is stale relative to the actual implementation.

The description claims the agent ID is chat-lite and the primary model is Qwen3-4B-Instruct-2507-GGUF. The code actually ships:

  • Agent ID is gaia-lite (chat-lite is preserved as a legacy alias for stored sessions).
  • Primary models are Qwen3.5-4B-GGUF (Darwin) and Gemma-4-E4B-it-GGUF (Linux/Windows), with platform-conditional ordering.

src/gaia/agents/registry.py:212-258 and the _LEGACY_ID_ALIASES plumbing are the right design, but the PR description is what reviewers and future archaeologists will read first — please bring it in line with reality before merge.

2. Scope is much wider than the description signals.

The description lists ~6 threads centered on Chat Lite + Settings. The diff also contains, none of which are mentioned:

  • Typed Lemonade error hierarchy (LemonadeError/LemonadeModelNotLoadedError/LemonadeContextOverflowError/LemonadeNetworkError) with _classify_chat_exception walker and one-shot retry-after-reload in both streaming and non-streaming paths.
  • Tool-call parse-error recovery loop in agents/base/agent.py (synthetic recovery prompt, 3-attempt cap, friendly fallback) + new tests/unit/agents/test_parse_error_recovery.py.
  • Context-overflow _shrink_messages_for_overflow + retry-after-trim.
  • eval/runner.py fcntl-based single-runner lock with stale-PID reclaim and GAIA_EVAL_NO_LOCK bypass.
  • system.py SSE-consuming download-progress streaming + _auto_load_after_download + new DownloadProgress model.
  • Auto-titling background task in _chat_helpers.py (Jaccard-overlap heuristic, 30 s throttle, fire-and-forget httpx call).
  • WebUI red-color/cursor redesign (~10 CSS files, MessageBubble/Sidebar/ConnectionBanner restyle).
  • New eval/corpus/gen_real_world.py (1,303 lines) + 19 generated documents + 4 xlsx files + 4 baseline/postfix eval/results/*.json snapshots.

Each of those has its own correctness surface. A reviewer reading the description will skip them; a future bisector tracking down (say) an eval-lock or auto-title regression will not realize this PR is the introduction. Either split or — more pragmatically given the size — expand the description to thread-list every distinct change with a one-line why. This is exactly the case CLAUDE.md's "PR Descriptions — Tight and Value-Focused" rule warns about: "If the PR really does bundle many threads, group them — don't list 16 commits."

3. tunnel-friendly-error.png (147 KB) committed at the repo root.

pr-files.txt:71 adds a binary screenshot with no references in code, docs, or .gitignore updates that would explain it. Looks like a debugging artifact that slipped in alongside pr-diff.txt / .claude-pr/. Please drop it before merge.

# (delete tunnel-friendly-error.png from the worktree before push)

🟢 Minor

4. uv.lock quietly drops the Python floor from 3.13 → 3.12 (uv.lock:3) without a corresponding pyproject.toml/setup.py change. Either it's intentional (in which case python_requires in the source-of-truth metadata should match, and the change deserves a one-liner in the description because it broadens supported Python versions) or it slipped in from a local-env bisect (in which case revert).

5. _classify_chat_exception walks __cause__ only, not __context__ (src/gaia/ui/_chat_helpers.py:117-121). Implicit chains from a bare raise ... inside an except block (no from e) won't be matched by the typed-isinstance pass. The substring fallback at :126 covers most of the gap, but if you ever rely on the typed-instance metadata (e.g. LemonadeContextOverflowError.retryable), an implicit chain will skip that branch.

    cur: Optional[BaseException] = exc
    while cur is not None:
        if isinstance(cur, LemonadeError):
            return cur
        cur = cur.__cause__ or cur.__context__

6. Auto-title task can race the next user turn for the Lemonade slot. _maybe_update_session_title fires _generate_session_title against /v1/chat/completions immediately after a stream finishes, without acquiring model_load_lock or any other coordinator. If the user's next message arrives while titling is in-flight and triggers a reload (e.g. they switched agents in the UI), both calls contend for Lemonade's single inference slot. Not a correctness bug — Lemonade serializes — but it can stretch the user-visible "thinking" delay by however long the title generation takes. Worth either gating titling behind the same lock or short-circuiting it when a load is pending.

7. eval/runner.py _acquire_eval_lock writes the PID after stale-lock reclaim, but if the os.write fails the lock fd stays open and the next failure message will print holder=-1. Cosmetic — the lock itself works — but worth a try/except around the PID stamping so the diagnostic stays useful.

8. gaia_lite_factory silently filters unknown kwargs via valid_fields (registry.py:229-235). This is the established pattern across the codebase, so I'm not asking you to change it here, but it does mask kwarg typos at call sites. Consider a follow-up that logs at debug when a kwarg is dropped — would catch a class of silent-config-bug.

Strengths

  • Inline rationale is excellent. The legacy-alias docstring (registry.py:78-96), the platform-split block (:200-215), the _GAIA_LITE_MIN_MEMORY_GB = 5.0 derivation (:217-224), and the 32K-fallback comments in lemonade_client.py all explain why, not what. This is the kind of comment that ages well — preserve this discipline.
  • Single-source-of-truth discipline. canonical_id() resolving aliases through one path; _GAIA_LITE_MODELS[0] driving both UI and factory preset; DEFAULT_CONTEXT_SIZE re-exported instead of re-declared. These are the right moves and they paid off in test cleanup.
  • Test coverage is real, not nominal. test_parse_error_recovery.py, the alias-resolution suite in test_registry.py, the test_right_model_wrong_ctx_triggers_reload case, and the test_known_model_uses_registry_ctx_size regression guard all hit failure modes that would otherwise return as production bug reports. Mocks are at sensible boundaries.
  • "Fail loudly" rule respected. _canonical_agent_type propagates AttributeError rather than swallowing it; typed Lemonade errors carry actionable user_message text; pre-flight reload emits a visible "Model reloaded — retrying…" SSE.

Verdict

Approve with suggestions.

The blockers are documentation, not code: please update the PR description (Issue 1) and drop tunnel-friendly-error.png (Issue 3) before merging. The uv.lock Python-version question (Issue 4) is worth a one-line clarification in the description either way. Everything else is non-blocking — minor robustness and ergonomic notes.

@itomek (Collaborator)

itomek commented Apr 29, 2026

Cross-platform local validation

Every push to this branch was validated on both local Linux (Docker) and local Windows (SSH) before going out — so we don't burn an hour of CI to find platform-specific regressions.

Workflow

  1. Commit the working tree, then bundle the rebased branch + commit: git bundle create /tmp/pr802.bundle origin/main..HEAD.
  2. Transfer the bundle to the test box (docker cp for Linux, scp for Windows).
  3. On the test box: git fetch /tmp/pr802.bundle "HEAD:refs/heads/test/pr802-…" && git checkout test/pr802-… (a throwaway branch — never touches the box's existing checkout).
  4. Run the same gates CI runs:
    • python util/lint.py --black --isort --pylint --flake8
    • python -m pytest tests/unit/ tests/test_eval.py

Results for b2a952f0

| Platform | Lint | tests/test_eval.py | tests/unit/ |
| --- | --- | --- | --- |
| macOS host | ✅ clean | ✅ 100 / 100 | ✅ 1708 passed, 15 skipped, 0 failed |
| Linux container | ✅ clean | ✅ 100 / 100 | ✅ 1695 passed, 23 skipped, 0 failed |
| Windows (real win32) | n/a | 100 / 100 | 1695 passed, 15 skipped, 13 failed* |

*The 13 Windows-only failures are pre-existing Linux-specific tests (os.geteuid mock patches, sys.platform="linux" monkeypatches) in tests/unit/test_init_command.py::TestInstallViaPpa (7 tests, Ubuntu-PPA installer paths) and tests/unit/installer/test_uninstall_command.py (6 tests, POSIX path/permission paths). None have skipif win32 guards. CI's Run Unit Tests job is Linux-only, so it doesn't surface them. These tests would also fail against origin/main on Windows — not a regression from this PR.
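
For reference, the guard those tests lack is a one-line pytest marker (test name below is hypothetical):

import sys
import pytest

@pytest.mark.skipif(sys.platform == "win32", reason="POSIX-only: patches os.geteuid")
def test_install_via_ppa_paths():
    ...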

What this catches

  • Linux-specific import paths and lazy-import side effects. The new if sys.platform == "win32": fcntl = None; else: import fcntl guard in eval/runner.py was verified to import cleanly on Linux and on real Windows (platform=win32, fcntl=None) before pushing — the original blocker fix that made all of tests/test_eval.py go green on Windows.
  • File-system case-sensitivity bugs that pass on macOS but fail on Linux.
  • Lockfile resolution differences between hosts.
  • Real-Windows behaviour the macOS-side monkeypatch.setattr(runner, "fcntl", None) test simulates — verified end-to-end here, not just by the GitHub Actions Windows runner.

One-time setup quirks

  • Linux: project venv was missing pyfakefs (declared in setup.py's dev extras), making tests/unit/installer/test_uninstall_command.py ERROR on collection. uv pip install pyfakefs clears it.
  • Windows: tests/unit/chat/ui/test_*.py needs the [ui] extra (FastAPI) on top of [dev]. uv pip install -e ".[dev,ui]" from a fresh venv is the one-shot recipe.

Chat Lite is a 4B-model sibling of the built-in Chat Agent for
hardware that cannot host the 35B default (Macs, low-memory boxes).
Same tools, same system prompt, just pinned to Qwen3-4B-Instruct-2507
via the ChatAgent config. ChatAgent and its registration are
untouched.

Agent metadata gains `min_memory_gb`, exposed via `/api/agents`.
Settings renders a Memory Warnings section when any registered agent
declares a requirement above the free memory reading.

Two Settings additions:
  * Active Model — text override bound to the existing `custom_model`
    setting. Empty means "use agent default".
  * Context Size — preset chips (4K/8K/16K/32K) + numeric input that
    reloads the active model with the chosen ctx via the existing
    `/api/system/load-model` endpoint.

Two correctness fixes made Chat Lite actually usable end-to-end:

  * `/api/system/status` previously flagged any non-35B model as
    "Wrong model loaded". It now accepts any registered agent's
    preferred model, so Chat Lite's 4B doesn't trip the banner.

  * `_maybe_load_expected_model` short-circuited when any LLM was
    active, even the wrong one. It now requires the specific expected
    model with ctx >= 32K, otherwise reloads. Without this, Lemonade
    auto-loaded requested models at its default 4096 ctx, silently
    truncated ChatAgent's >7K-token system prompt, and returned an
    empty stream. `_ensure_model_loaded` also falls back to 32K when
    the model is not in the built-in MODELS registry.

Tests: 10 new unit tests covering registration, factory presets,
min_memory_gb propagation through manifests and the agents API, plus
a coexistence check that Chat Agent's defaults stay unchanged.
Frontend `npm run build` passes with 0 new warnings.
kovtcharov and others added 22 commits April 29, 2026 15:57
Two related fixes for the same eval failure mode (Qwen 4B getting
``finish_reason=length`` mid-tool-call on the no_sycophancy scenario):

1. **Output budget too tight.** ``AgentConfig.max_tokens`` was 4096,
   which Qwen3.5-4B exhausted while serialising long tool-call argument
   strings (1000+-char ``summary_type`` blob in the case the eval
   surfaced). With our 32K ctx_size and a ~7.7K-token system prompt +
   history, 8K of output budget leaves plenty of room and still keeps
   ~24K for input. Going much higher would steal from input-history
   budget without measured gain.

2. **Misleading error message.** When the cap was hit we raised
   "Increase --ctx-size for model X" — but ``finish_reason=length``
   from the OpenAI completions API specifically signals the *output
   token cap*, not the context window. ctx_size and max_tokens are
   separate limits, and conflating them sent users off chasing the
   wrong knob (the user had already loaded at 32K ctx). The new error
   names ``AgentConfig.max_tokens`` directly.

Both are agent-runtime changes (affect ChatAgent generally), not
gaia-lite specific. Test suite green; full registry + chat-ui suites
pass with no behaviour-coupled assertions to update.
…op session links

Bundles four user-asked tweaks discovered while iterating on gaia-lite:

UI / Agent UI
  * ChatView.tsx — chat header model badge now shows the loaded
    context window inline ("Qwen3.5-4B-GGUF · 32K"). Title attr carries
    the full "context window: NN,NNN tokens" so a hover gives the
    precise number. Mismatched ctx (e.g. eval reload at 4K) is now
    visible at a glance instead of surfacing only as a chat-time error.
  * ChatView.tsx + Sidebar.tsx — dropped the "#abc1234" session-link
    pills from both the chat header and every sidebar row. They were
    clipboard-copy hash badges; nobody used them and they ate the row.
  * MessageBubble.tsx — every message bubble (user + assistant) now
    has a ``title`` tooltip with the full absolute send timestamp.
    Previously only assistant bubbles had this via the stats footer.
  * ChatView.css — small ``.model-ctx-size`` rule so the suffix is
    dimmer than the model name (eye lands on model first).

Chat agent personality
  * src/gaia/agents/chat/agent.py — replaced the 2-example "RIGHT"
    list in the GREETING RULE with 10 varied openers + an explicit
    "VARY YOUR PHRASING" rule. The original prompt's recency bias
    pinned the model on a single canned "Hey! What are you working
    on?" every conversation; the rotation list breaks that pattern
    while keeping the warm/curious/no-feature-pitch invariant.
    Added a new WRONG example calling out the lock-in behaviour
    explicitly so the model can self-detect it.

Eval discipline
  * CLAUDE.md — new rule: "Run agent evals SERIALLY, never in
    parallel." Today's session lost three eval runs to two
    concurrent ``gaia eval agent`` invocations race-evicting each
    other's models out of Lemonade's single-tenant slot, surfacing
    as bogus 4K-ctx errors and INFRA_ERROR. Documents the exact
    failure modes + the ``ps aux | grep "gaia eval" | wc -l`` =
    0 sanity check before kicking off a new run.

Backend changes confined to comments + system-prompt strings —
agent runtime contracts unchanged. 212 unit tests pass; lint clean;
frontend bundle rebuilt.
The runaway-eval failure mode the user kept hitting today: a parent
agent in --fix mode shells out a second `gaia eval agent ...` while the
first is still running, both invocations talk to the same Lemonade
Server, Lemonade has a single-tenant LLM slot, the runs race-evict each
other's models, and the whole thing surfaces as nondeterministic
`n_ctx=4096` overflows or `model_load_error: llama-server failed to
start`. CLAUDE.md now documents the rule; this commit enforces it.

src/gaia/eval/runner.py
  * `_acquire_eval_lock()` — context manager around fcntl.flock on
    /tmp/gaia-eval-agent.lock. LOCK_EX | LOCK_NB so a second invocation
    fails fast (exit 2) with an actionable message naming the holder
    PID and run age, not by hanging.
  * Stale-lock recovery: if the holder PID is gone, the lock file is
    reclaimed automatically (no manual rm needed).
  * Escape hatch: GAIA_EVAL_NO_LOCK=1 skips the guard for unit tests
    or callers that genuinely manage Lemonade out of band.
  * AgentEvalRunner.run() now wraps the per-scenario loop in the
    lock; audit_only mode skips it (no Lemonade contention).
    Body extracted to _run_locked() so the wrapper stays thin.
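
A hedged sketch of the lock's shape (approximates _acquire_eval_lock; the real
message and error handling are richer). Since the kernel releases a flock when
its holder dies, a stale PID file alone never blocks; the stamp exists so the
fail-fast diagnostic can name the holder:

import os
import fcntl
from contextlib import contextmanager

LOCK_PATH = "/tmp/gaia-eval-agent.lock"

@contextmanager
def acquire_eval_lock():
    if os.environ.get("GAIA_EVAL_NO_LOCK") == "1":      # escape hatch
        yield
        return
    fd = os.open(LOCK_PATH, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # fail fast, never hang
    except BlockingIOError:
        holder = os.read(fd, 32).decode() or "unknown"
        os.close(fd)
        print(f"another gaia eval run holds the lock (pid {holder})")
        raise SystemExit(2)                             # exit 2, as described above
    os.ftruncate(fd, 0)
    os.write(fd, str(os.getpid()).encode())             # stamp holder PID for diagnostics
    try:
        yield
    finally:
        os.close(fd)                                    # closing the fd releases the flock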

UI follow-on (separate concern, same touch):

src/gaia/apps/webui/src/components/AgentActivity.css
  * The "N steps · M tools" expand-activity bar used to render as a
    full-width terminal panel with border + uppercase 11px text on
    every assistant turn — visually loud for content the user rarely
    expands. Now it's an inline 10px chip at 0.55 opacity, expanding
    to full opacity + bg + border on hover/focus, and forced to full
    opacity when the run is active or has errors (states the user
    actually needs to notice).

Verification: subprocess test confirms two concurrent invocations
produce exit-2 + clear error in the second one. Lint clean.
Every GAIA reply was rendering with the same vertical AMD-red border
+ red-tinted background as ``.msg-error``, so a normal multi-turn chat
read as a stack of warnings. Reserved that visual language for actual
errors only.

src/gaia/apps/webui/src/components/MessageBubble.css
  * .msg-assistant: dropped the ``border-left: 2px solid var(--amd-red)``
    that was duplicating .msg-error's left rail. Assistant messages now
    distinguish themselves from user messages purely via the elevated
    panel background, the GAIA avatar + name in the header, and the
    left-aligned text direction. Subtle is the point — a long thread
    should read as a calm conversation, not warnings.
  * Removed the ``[data-theme="dark"] .msg-assistant`` override that
    layered ``rgba(237, 28, 36, 0.02)`` (red haze) on top of the
    already-neutral --bg-assistant-msg variable in dark mode.
  * Avatar dark-mode treatment: ``border-color: rgba(237,28,36,0.25)``
    and ``box-shadow: 0 0 8px rgba(237,28,36,0.12)`` (red glow) →
    neutral ``var(--border)`` + a 1-px white-tinted ring. Brand red
    is still present in the small "GAIA" name text + the user-avatar
    fill; just not bleeding into every UI surface.

Error styling unchanged — .msg-error still owns the AMD-red left
border + red background tint, and is now the only place that visual
appears so it actually stands out.

Pure CSS — no behavioural change. Bundle rebuilt; refresh to see.
Three connected polish passes on the assistant bubble:

src/gaia/apps/webui/src/components/MessageBubble.css
  * **Rounded card.** .msg-assistant was a hard-edged horizontal strip;
    now it's a 14-px-radius panel that visually sits *inside* the chat
    column instead of cutting it horizontally. ``width: calc(100% - 16px)``
    inset so the rounding actually shows.
  * **3D depth.** Layered box-shadow (4 stops: inset top-edge highlight,
    crisp 1-px contact, soft mid-distance, wider ambient) for a subtle
    elevation. Hover lifts ~1 px and softens the ambient — small enough
    not to read as a "click me" CTA, big enough to feel responsive.
    Dark-mode variant uses heavier alphas (0.30–0.55) because shadows
    on near-black backgrounds disappear at typical light-mode values.
  * **Softer fade-in.** Replaced the 200 ms slide+scale with a 380 ms
    fade+drift+deblur (assistant: 450 ms). cubic-bezier(0.22, 0.61, 0.36, 1)
    eases out smoothly. Dropped the 0.99 scale step — scaling rounded
    cards causes brief subpixel blur on retina. Added a 2-px filter:blur
    transition that mimics an "image developing" effect on entry —
    very subtle but adds polish without slowing perception.

User messages unchanged: still flat, transparent, right-aligned. Only
the GAIA reply panel is the rounded 3D card. Shadow tokens are pure
CSS — no extra DOM, no JS, ~zero runtime cost.
Same direction as the GAIA message card (b5289a3): rounder, lighter,
less red bleed, subtle elevation on the active state.

src/gaia/apps/webui/src/components/Sidebar.css

  * Session items
    - radius-md (8px) → radius-lg (10px) for softer corners
    - Padding tweaked +1px vertically so rows don't feel crammed
    - Hover now adds a faint border-light ring in addition to the
      bg-hover fill — gives shape to the row before clicking
    - Active row's box-shadow was a 10-px red glow + inset 28-px
      red haze; replaced with a calm multi-layer (top highlight +
      contact + mid) elevation in light mode, deeper black-alpha
      shadows in dark mode. Brand-red is still present in the
      left-edge indicator and a 6%-alpha bg tint, but no longer
      bleeding into the surrounding sidebar.
    - Press-feedback scale tightened from 0.982 → 0.985 (less
      "trampoline", more "tap").

  * Search input
    - radius-md → radius-lg, matching the session items
    - Now sits on a tertiary background with an inset hairline
      shadow, reading as gently recessed (the inverse of the
      session-active which reads as raised — coherent stack metaphor)
    - Focus ring: was border 0.4 alpha + 12-px red glow ("warning"
      vibe); now a soft 3-px 0.08-alpha brand-red halo + 0.35-alpha
      border. Reads as "focused", not "errored".

  * Bottom status bar
    - Tiny 1-px inset top highlight in addition to the existing
      border-top — gives a soft "lip" between the scrolling session
      list and the fixed status row, matching the layered-card feel.

Pure CSS; no DOM changes. Bundle rebuilt.
The textarea had a custom 10×18 px solid-AMD-red block cursor with a
0.5-alpha red glow blinking at 1 Hz on every focus. On a screen the
user stares at all session long, that's an obvious "warning indicator"
treatment for a primitive that should be invisible until interacted with.

src/gaia/apps/webui/src/components/ChatView.tsx
  * Removed ``getCaretXY`` helper, ``_computedStyleCache``, the
    ``caret`` state + ``updateCaret``/``setCaret`` callbacks, the
    inline ``caretColor: 'transparent'`` override, and the
    ``<span className="input-cursor" />`` element. The textarea now
    relies on the browser's native caret like every other text input.
  * onFocus/onBlur/onSelect handlers tied only to caret tracking are
    gone; nothing else cared about that state.

src/gaia/apps/webui/src/components/ChatView.css
  * Removed the ``.input-cursor`` rule. ``@keyframes cursorBlink``
    stays in styles/index.css — it's still used by AgentActivity and
    MessageBubble streaming indicators (those are in-line content, not
    always-on UI chrome, so the blink is appropriate there).

Side benefit: drops a per-keystroke ``requestAnimationFrame`` +
mirror-div DOM creation that was running on every input change.
The previous "softened" active state still carried two pieces of the
error visual vocabulary:

  1. A vertical AMD-red gradient bar on the left edge (::before
     pseudo-element) — same shape and color as .msg-error's
     left rail. Active row read as "this row is broken".
  2. In dark mode: rgba(237,28,36,0.06) background + 0.12-α red border
     — selected row looked like a tinted warning band.

Both gone. Active state is now communicated purely via elevation +
neutral background lift, matching the GAIA message-card vocabulary
exactly (the same 4-stop layered shadow: inset top highlight,
contact, mid-distance, ambient).

src/gaia/apps/webui/src/components/Sidebar.css

  * .session-item::before — entire pseudo-element + the hover/active
    transform-scaleY rules removed. The selected row no longer has
    a red strip; selection is the elevation + bg-active.
  * .session-item.active light-mode shadow stack matched 1:1 to
    .msg-assistant for visual coherence.
  * .session-item.active dark-mode background switched from
    rgba(237,28,36,0.06) → rgba(255,255,255,0.04). Border color
    likewise from rgba(237,28,36,0.12) → rgba(255,255,255,0.08).
    Heavier black shadows for the dark-on-dark elevation read.
  * sessionActivateDarkBg keyframe (which animated a red
    background flood + inset red glow on activation) deleted.
    sessionActivate keyframe simplified — the 1.5-px overshoot
    bounce removed for a calmer settle.

Brand red is now reserved for: the small "GAIA" name text in
message headers, the user-avatar fill, and .msg-error. That's it.
…color

Two threads:

(1) GAIA auto-titling (the running chat agent renames its own session).

src/gaia/ui/_chat_helpers.py
  * New ``_maybe_update_session_title`` helper. Fired fire-and-forget
    after every assistant turn finishes; calls the same Lemonade chat-
    completion endpoint to generate a 3-6 word tab title.
  * Trigger rules:
      - Title is one of the defaults (New Chat / New Task / Untitled
        / Chat / empty) → first-response pass
      - User message has ≤ 0.15 word-overlap with current title AND
        is ≥ 25 chars → topic-shift pass (sketched after this commit message)
  * Skipped when title starts with "Eval:" (eval framework owns those)
    and throttled to one update / 30 s / session so concurrent fires
    don't pile up.
  * LLM call uses temperature 0.3 + max_tokens 24, stripping common
    title artifacts ("Title:", quote-wrapping, trailing punctuation)
    that small models add despite the instruction.
  * Background task reference pinned in _active_sse_handlers under a
    ``_titlebg:<sid>`` key so GC doesn't kill it mid-flight; cleared
    in the done callback.

src/gaia/ui/routers/chat.py
  * Same hook in the non-streaming path immediately after
    db.add_message(assistant). Wrapped in try/except so an auto-title
    failure can never block the user's response.

(2) Neutral selection color.

src/gaia/apps/webui/src/styles/index.css
  * ::selection background: rgba(237, 28, 36, 0.18) (AMD red) →
    rgba(56, 132, 255, 0.22) (cool blue). Selecting any text used to
    paint a red highlight on top of red errors / red focus rings /
    formerly-red active session — every interaction read as a warning
    state. Standard cool-blue selection is what users expect.

Tests: 46 chat-ui unit tests pass; lint clean.
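
A minimal sketch of the trigger rules flagged above, assuming whitespace
tokenization (the real helper's tokenizer and default-title set may differ):

DEFAULT_TITLES = {"new chat", "new task", "untitled", "chat", ""}

def should_retitle(user_msg: str, title: str) -> bool:
    if title.startswith("Eval:"):
        return False                           # eval framework owns these titles
    if title.strip().lower() in DEFAULT_TITLES:
        return True                            # first-response pass
    if len(user_msg) < 25:
        return False                           # too short to signal a topic shift
    msg_words = set(user_msg.lower().split())
    title_words = set(title.lower().split())
    overlap = len(msg_words & title_words) / max(len(msg_words | title_words), 1)
    return overlap <= 0.15                     # topic-shift pass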
…lures

The chat path had no recovery for the three failure modes the user
keeps hitting in the UI:
  * "No model loaded: <model>" — Lemonade evicted between turns
  * "request (N tokens) exceeds the available context size (M)"
    — wrong-ctx model load
  * "Network error: CURL Timeout was reached" — Lemonade busy/hung

Each one previously surfaced as a wall of raw JSON inside the chat
bubble and required the user to hand-restart the model. Now:

src/gaia/llm/providers/lemonade.py
  * Added typed exceptions ``LemonadeError``,
    ``LemonadeModelNotLoadedError``, ``LemonadeContextOverflowError``,
    ``LemonadeNetworkError``. Each carries (a) a short ``.user_message``
    suitable for direct chat-bubble rendering and (b) the raw payload
    on ``.payload`` for diagnostic logging. Class-level ``retryable``
    flag declares whether the chat layer should auto-retry.
  * ``_classify_lemonade_response`` walks the response envelope —
    including the nested ``details.response.error`` shape Lemonade
    uses for backend_error wrappers — and returns the matching typed
    exception or a generic LemonadeError when the shape is novel.
  * The single ``raise ValueError(f"Unexpected response format from
    Lemonade Server: {response}")`` at the dict-of-choices guard now
    routes through the classifier, so callers see typed errors instead
    of a string-blob ValueError.

src/gaia/ui/_chat_helpers.py
  * ``_classify_chat_exception`` walks the cause chain AND the
    stringified message text to detect typed errors even when AgentSDK
    wraps the original LemonadeError in a generic ValueError /
    RuntimeError (which it does in several paths).
  * ``_run_agent`` (the streaming worker thread) now does ONE automatic
    retry on retryable Lemonade errors. On a model-not-loaded /
    network-error first failure: forces a fresh ``_maybe_load_expected_model``
    at our 32K ctx, emits a "Model reloaded — retrying..." status SSE
    so the user sees the recovery (not silent retry), and re-runs
    ``agent.process_query``. If THAT fails too, the friendly
    user_message is surfaced instead of the raw exception.
  * The catch-all ``except Exception`` block now prefers
    ``LemonadeError.user_message`` over ``str(e)``, so even
    non-retryable errors (context overflow) come through as a
    plain-English explanation instead of a JSON blob.

Tests: full chat-ui suite (46 tests) passes; lint clean. The retry
itself is exercised end-to-end the next time you trigger a model
eviction — start a fresh chat session and you'll see the recovery
status rather than the wall-of-JSON failure mode.
…t path

Layer 2 + 3 of the chat-resilience push (continuation of 0795a5d).

src/gaia/ui/_chat_helpers.py

  * **Pre-flight ctx_size guard.** ``_maybe_load_expected_model``
    previously checked ``active_ctx and active_ctx < N`` — short-circuit
    evaluation meant a 0-or-missing ctx slipped through, leaving a
    broken model in place. New guard explicitly treats missing ctx as
    "needs reload" so a model loaded with no recipe_options.ctx_size
    (which is exactly the broken state we want to recover from) gets
    re-loaded at our 32K canonical (one-line comparison after this commit message).

  * **Bounded reload timeout.** Pre-flight now passes ``timeout=120``
    to ``LemonadeClient.load_model`` instead of inheriting the default
    DEFAULT_MODEL_LOAD_TIMEOUT (12000 s / 200 min). Cold load of a 4B
    GGUF on consumer hardware fits in <60 s; if it hasn't completed in
    120 s something is genuinely wrong and we'd rather surface the
    failure than block the chat thread for hours.

  * **Non-streaming chat now retries identically to streaming.** The
    one-shot retry on transient Lemonade errors (model evicted between
    turns, network blip) was only in the streaming worker before; the
    non-streaming path raised straight to the user. Mirrored the same
    classify → reload → retry sequence in ``_get_chat_response`` so
    both paths recover the same way.

  * **Friendly error mapping in non-streaming.** The catch-all
    ``except Exception`` in ``_get_chat_response`` was returning a
    stock "I'm having trouble connecting..." string regardless of the
    actual failure mode. Now it prefers the typed
    ``LemonadeError.user_message`` (e.g. "This conversation got too
    long for the model's context window"), falling back to the stock
    copy only when classification fails.

Together with 0795a5d, this closes the three failure modes the user
keeps hitting in the UI:
  * Wrong-ctx model load (now caught by tighter guard, reloaded at 32K)
  * Mid-conversation eviction (caught by retry on both streaming/non-streaming)
  * Lemonade hang during reload (bounded by 120-s timeout instead of 200 min)

Tests: 60 chat-ui + preflight tests pass.
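
The guard bug from the first bullet, in two lines (values illustrative):

active_ctx = 0                                                # broken state: ctx missing/zero
MIN_CTX = 32768

old_needs_reload = bool(active_ctx and active_ctx < MIN_CTX)  # False: 0 is falsy, the bug
new_needs_reload = (active_ctx or 0) < MIN_CTX                # True: missing ctx reloads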
NameError regression introduced in 1a807d2 (auto-titling commit) —
the streaming-path background task at the bottom of
``_stream_chat_response`` called ``_effective_model(agent, model_id)``,
but ``agent`` lives inside the producer thread's local scope and
isn't visible to the outer generator. Every streaming chat turn was
therefore raising NameError after the response otherwise completed,
which the catch-all then surfaced as "Sorry, something went wrong on
my end".

src/gaia/ui/_chat_helpers.py
  * Capture ``session_model = model_id`` in the outer scope right
    after custom-model resolution; reference that in the auto-title
    task instead of the inaccessible ``agent``.
  * Comment explains why the indirection — same model id the
    pre-flight + agent factory used, just made visible to the
    post-stream cleanup.

Verified by re-running a fresh streaming chat turn end-to-end against
gaia-lite: 3.1 s round-trip, ``done=True``, no error events, response
"4" persisted correctly.

Found via the chaos-test harness from the same reliability push —
which was the point of layer 4.
…bling raw error

Small models (4B-class) occasionally emit malformed native tool_calls
envelopes — e.g. a 1000+ char summary_type argument that gets truncated
mid-string. Previously _parse_llm_response raised ValueError uncaught,
the exception bubbled through the chat helper, and the user saw:

  Agent error: Malformed native tool_calls envelope: Expecting ','
  delimiter: line 1 column 220 (char 219)

The fix wraps the parse call in process_query with a try/except that:

1. Logs the parse error to error_history (type=tool_call_parse_error)
2. Appends a synthetic recovery prompt instructing the model to retry
   with documented enum values (or fall back to plain text)
3. Continues the loop so the next LLM call has clean conversation
4. After 3 consecutive parse failures, gives up gracefully with a
   friendly fallback rather than spamming the user (loop sketched below)

Same handler also catches finish_reason=length (tool-call truncated
mid-arguments) and parallel tool_calls (NotImplementedError), since
both manifest as the same user-facing failure mode.

Surfaced by GAIA eval baseline scenario:
- personality/honest_limitation Turn 2: 1000+ char summary_type arg
- rag_quality/negation_handling Turns 1-2: finish_reason=length

Tests: tests/unit/agents/test_parse_error_recovery.py — covers the
parse-error path, the 3-strikes graceful giveup, and the underlying
ValueError still raises from _parse_llm_response itself.
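
A hedged sketch of the loop's shape (process_query's real control flow is
larger; names here are illustrative):

MAX_PARSE_FAILURES = 3

def run_turn(call_llm, parse_response, messages):
    failures = 0
    while True:
        raw = call_llm(messages)
        try:
            return parse_response(raw)     # _parse_llm_response analogue; still raises
        except ValueError as err:          # malformed native tool_calls envelope
            failures += 1
            if failures >= MAX_PARSE_FAILURES:
                return "I kept hitting tool-call errors; please rephrase the request."
            messages.append({
                "role": "user",
                "content": (
                    f"Your previous tool call was malformed ({err}). Retry using "
                    "the documented enum values, or answer in plain text instead."
                ),
            })                             # synthetic recovery prompt, then loop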
Two failure modes surfaced by GAIA eval baseline against gaia-lite on
Qwen3.5-4B-GGUF:

1. Context-overflow mid-loop (5 of 8 baseline failures)
   When a multi-step ReAct turn accumulates tool results from several
   search_file/index_document calls, the cumulative `messages` array
   eventually exceeds the model's 32K context window. Lemonade returns
   exceed_context_size and the chat helper surfaces "This conversation
   got too long for the model's context window. Start a fresh task..."

   Fix: wrap the LLM-call try/except in process_query (both streaming
   and non-streaming branches) with a retry loop. When we detect a
   context-overflow exception (substring match on the upstream error
   text — typed errors get wrapped by AgentSDK), trim the messages
   array to {first user input + last 4 entries} and retry ONCE. If
   the retry also fails, return a friendly message asking the user to
   start fresh — no more raw exception leaks.

   Surfaced by:
   - tool_selection/smart_discovery T1 (search → index → blow up)
   - error_recovery/file_not_found T1+T2 (failed index → deep search)
   - error_recovery/search_empty_fallback T1
   - captured/captured_eval_smart_discovery

2. list_windows had no macOS branch
   The tool returned "Window listing not available. Install pywinauto
   (Windows) or wmctrl (Linux)." on Mac, and the agent reported "I
   can't list open windows on this Mac" — judged FAIL.

   Fix: add a Darwin branch that uses osascript with System Events to
   return the visible (non-background) processes — equivalent to what
   the user sees in Mission Control. No new dependency: osascript
   ships with macOS.
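
   A sketch of the Darwin probe under the assumption that plain
   subprocess plumbing matches the real tool wiring; the System Events
   script is standard AppleScript:

     import platform
     import subprocess

     def list_windows_darwin():
         script = ('tell application "System Events" to get name of '
                   "(processes where background only is false)")
         out = subprocess.run(
             ["osascript", "-e", script],
             capture_output=True, text=True, check=True,
         )
         # osascript returns a comma-separated list of visible apps
         return [name.strip() for name in out.stdout.split(",")]

     if platform.system() == "Darwin":
         print(list_windows_darwin())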

   Surfaced by web_system/list_windows.

Tests: extended tests/unit/agents/test_parse_error_recovery.py with
TestProcessQueryRecoversOnContextOverflow — covers the trim+retry
success path AND the after-retry-still-fails graceful fallback.
…keep latest

Previous trim strategy (keep first + last 4 messages) didn't help when the
final tool result itself was huge (e.g. RAG query returning many KB of
chunks). After trim we still couldn't fit, so the agent gave up.

New _shrink_messages_for_overflow helper:
- Keep the original user query intact
- Keep the LATEST tool result intact (model needs it to answer)
- Replace older tool results with a tiny stub
  ("[tool result omitted -- context overflow recovery]")
- Truncate verbose assistant chain-of-thought to 800 chars

This preserves the structural shape of the conversation so the model
still understands what tools have been called, but drops the bulk of
the bytes. Same shrink applied in both streaming and non-streaming
branches of process_query.
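
A sketch of the helper, assuming OpenAI-style message dicts with "role"
and "content" keys:

  STUB = "[tool result omitted -- context overflow recovery]"

  def _shrink_messages_for_overflow(messages):
      tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
      last_tool = tool_idx[-1] if tool_idx else None
      shrunk = []
      for i, m in enumerate(messages):
          m = dict(m)  # don't mutate the caller's history
          if m["role"] == "tool" and i != last_tool:
              m["content"] = STUB  # older tool results become stubs
          elif m["role"] == "assistant" and isinstance(m.get("content"), str):
              m["content"] = m["content"][:800]  # cap chain-of-thought
          shrunk.append(m)  # user query + latest tool result stay intact
      return shrunk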

Surfaced by tool_selection/smart_discovery T1: agent indexed handbook
correctly, but the indexed-content tool result + chain-of-thought
combined to push past 32K. Old trim retained both and still failed;
new shrink keeps just the latest result so retry succeeds.

Tests still pass — the existing context-overflow tests now exercise
the new path implicitly via the same "raise then succeed" pattern.

The chat helper has a one-shot retry that calls _maybe_load_expected_model
to reload the model at the canonical 32K ctx — but only fires for errors
where ``retryable=True``. LemonadeContextOverflowError was always non-
retryable, so when a previous load left Qwen3.5-4B at 4096-ctx (e.g.
after an embedding model swap or auto-load default), the agent surfaced
"This conversation got too long" to the user even though the actual
remedy was a model reload.

Now LemonadeContextOverflowError.retryable is set dynamically based on
the reported n_ctx:

- If n_ctx < 32768 → retryable=True. The model was loaded with the
  wrong ctx_size; the chat layer's one-shot retry will reload at 32K
  via _maybe_load_expected_model and the same prompt will fit.
- If n_ctx == 32768 → retryable=False. This is a genuine "conversation
  too big" situation; retry won't help, surface the friendly message.
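
A sketch of the dynamic flag; the class shape is illustrative, not the
exact provider code:

  CANONICAL_CTX = 32768

  class LemonadeContextOverflowError(Exception):
      def __init__(self, message, n_ctx=None):
          super().__init__(message)
          self.n_ctx = n_ctx
          # Wrong-ctx loads are fixable by a reload; a genuine overflow
          # at the canonical 32K (or an unknown n_ctx) is not.
          self.retryable = n_ctx is not None and n_ctx < CANONICAL_CTX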

Two parsing paths updated:
- _classify_lemonade_response (provider-side, structured payload):
  reads n_ctx from nested error.details.response.error.n_ctx
- _classify_chat_exception (chat-layer fallback for when AgentSDK
  re-raises with str(original)): regex-extracts the ctx number from
  the textual message ("context size (4096 tokens)")
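
The textual fallback reduces to one regex over the quoted message
format; a sketch:

  import re

  def _ctx_from_message(text):
      m = re.search(r"context size \((\d+) tokens\)", text)
      return int(m.group(1)) if m else None

  assert _ctx_from_message("... context size (4096 tokens) ...") == 4096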

Surfaced by tool_selection/smart_discovery — backend was loaded with
n_ctx=4096 after a fresh restart sequence, the original handbook+RAG
turn legitimately needed ~18K tokens, and we were stuck refusing the
turn instead of reloading.
…load

The previous commit made LemonadeContextOverflowError retryable when n_ctx
< 32K, but the chat helper's reload-and-retry never fired because the
agent's own try/except in process_query catches the exception first and
runs its in-loop trim-and-retry instead. The trim doesn't help when the
real issue is a 4K-loaded model — the second request goes to the same
broken backend.

Fix: detect the wrong-ctx-loaded sub-case in agent.py (substring match
on "context size (4096|8192|16384" / "n_ctx': 4096|8192|16384") and
RE-RAISE instead of trimming. This bubbles the typed error up to the
chat helper, where _classify_chat_exception now reads it as retryable
and triggers _maybe_load_expected_model to reload at 32K before the
chat layer's own one-shot retry.

Genuine "conversation too big to fit even at 32K" still goes through
the trim+retry+friendly-fallback path as before.

Both streaming and non-streaming branches updated symmetrically.
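
A sketch of the dispatch; the marker list expands the patterns quoted
above, and the except-block usage is shown as comments because it lives
inside process_query's loop:

  WRONG_CTX_MARKERS = (
      "context size (4096", "context size (8192", "context size (16384",
      "n_ctx': 4096", "n_ctx': 8192", "n_ctx': 16384",
  )

  def _is_wrong_ctx_overflow(exc):
      return any(marker in str(exc) for marker in WRONG_CTX_MARKERS)

  # Inside process_query's except block:
  #     if _is_wrong_ctx_overflow(exc):
  #         raise  # bubble up; chat helper reloads at 32K and retries
  #     messages = _shrink_messages_for_overflow(messages)  # in-loop path
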
…lope

The OpenAI tool_call spec says ``function.arguments`` is a JSON string,
but llama.cpp 4B-class models occasionally emit it as a pre-parsed
dict. ``json.loads(dict)`` raised a TypeError that our recovery layer
didn't catch (it only listened for ValueError / NotImplementedError),
so the exception bubbled out as:

  Agent error: the JSON object must be str, bytes or bytearray, not dict

Surfaced by tool_selection/smart_discovery T1 after the previous
context-overflow fix unblocked a path that was previously masked by
that earlier failure.

Now we accept either shape:
- str/bytes → json.loads as before
- dict → use directly
- empty/None → empty dict
- anything else → raise ValueError with a recovery-friendly message
  so the parse-error retry kicks in
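
A sketch of the shape-tolerant decode; the helper name is hypothetical:

  import json

  def _coerce_tool_arguments(raw):
      if isinstance(raw, dict):
          return raw  # pre-parsed dict: use directly
      if raw in (None, "", b""):
          return {}  # empty/None: empty dict
      if isinstance(raw, (str, bytes, bytearray)):
          return json.loads(raw)  # spec-compliant JSON string
      raise ValueError(  # recovery-friendly: the parse retry catches this
          f"Unsupported tool_call arguments type: {type(raw).__name__}"
      )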

The substring detection ("context size (4096", etc.) only fires when the
raw payload is preserved in the exception. When AgentSDK re-raises with
the typed LemonadeContextOverflowError's friendly user_message
("This conversation got too long..."), the n_ctx detail is gone, so
substring matching always missed.

Fix: when a context-overflow fires AND substring match doesn't say
"wrong ctx", probe the Lemonade health endpoint via httpx to read the
LLM's actual ctx_size. If < 32K, treat it as wrong-ctx and re-raise so
the chat helper reload-and-retry kicks in.

The probe times out fast (3s) and returns False on any failure, so the
caller cleanly falls through to in-loop trim if the probe is unreliable.
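
A sketch of the probe; the endpoint path and JSON shape are assumptions
about the Lemonade health payload, not confirmed API:

  import httpx

  def _loaded_ctx_below(threshold=32768,
                        base="http://localhost:8000/api/v1"):
      try:
          resp = httpx.get(f"{base}/health", timeout=3.0)
          resp.raise_for_status()
          ctx = resp.json().get("ctx_size")
          return ctx is not None and ctx < threshold
      except Exception:
          return False  # unreliable probe: fall through to in-loop trim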

Surfaced by another smart_discovery rerun: ctx was 4096, my earlier
substring guards missed because str(exception) was already friendly.

The 19 real_world scenarios were SKIPPED_NO_DOCUMENT because the
referenced corpus files did not exist on disk. Add a deterministic
generator that authors all 19 documents from a single Python data
structure, idempotent on re-run.

Documents are synthetic paraphrases (no copyrighted source text) but
contain every ground-truth fact each scenario references. XLSX files
are flattened by RAGSDK._extract_text_from_xlsx into row-keyed prose
that surfaces SKUs, totals, and notes for the chunker.

Re-run with: python eval/corpus/gen_real_world.py
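
The idempotent core is small; a sketch with an illustrative document
table (the real one carries all 19 scenarios):

  from pathlib import Path

  DOCS = {
      "handbook.md": "Synthetic paraphrase carrying the ground-truth facts...",
      # ...one entry per scenario document
  }

  def generate(corpus_dir="eval/corpus/real_world"):
      root = Path(corpus_dir)
      root.mkdir(parents=True, exist_ok=True)
      for name, body in DOCS.items():
          (root / name).write_text(body)  # same input -> same bytes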

CI blockers:
- src/gaia/eval/runner.py: guard `import fcntl` and the `_acquire_eval_lock`
  body so Windows degrades to a no-op (the lock guards a Lemonade race that
  doesn't happen on Windows dev boxes; see the sketch below). Unblocks
  Test Eval Tool (Windows), which was failing with ModuleNotFoundError on
  every test in tests/test_eval.py.
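
A sketch of the guard; lock-file handling simplified:

  try:
      import fcntl
  except ImportError:  # Windows: module doesn't exist
      fcntl = None

  def _acquire_eval_lock(lock_path):
      if fcntl is None:
          return None  # the guarded race doesn't occur on Windows
      fh = open(lock_path, "w")
      fcntl.flock(fh, fcntl.LOCK_EX)
      return fh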

Apr-20 review actionable items:
- src/gaia/ui/_chat_helpers.py: drop the blanket `except Exception` in
  _canonical_agent_type — canonical_id is a pure dict lookup, and CLAUDE.md
  says fail loudly.
- src/gaia/ui/models.py: memory_available_gb -> Optional[float] = None.
  Avoids a false memory-warning banner if psutil ever silently falls back.
  TS type and SettingsModal updated with null guards.
- src/gaia/agents/registry.py: lift the 5 GB gaia-lite floor to a named
  constant _GAIA_LITE_MIN_MEMORY_GB so the rationale stays adjacent to the value.
- src/gaia/ui/routers/system.py: clarify update_settings docstring re Pydantic
  vs GAIA convention; widen psutil exception handler so an OSError from
  virtual_memory() (containers/seccomp) doesn't 500 the status endpoint.

Tests + lint hygiene:
- tests/unit/test_chat_preflight.py: 6 stale tests now exercise the new
  pre-flight semantics (right model + ctx >= 32K). New
  test_right_model_wrong_ctx_triggers_reload covers the negative case.
- tests/unit/test_lemonade_model_loading.py: assert ctx_size=32768 fallback
  for unknown models. New test_known_model_uses_registry_ctx_size guards
  the registry-lookup path.
- tests/test_eval.py: new test_acquire_eval_lock_windows_noop with fcntl=None.
- tests/unit/chat/ui/test_chat_helpers.py: source-shape regex updated for
  the new _build_create_kwargs(...) call shape; new TestCanonicalAgentType
  class locks in the removal of the silent except.
- src/gaia/llm/providers/lemonade.py: unused `is_err` -> `_is_err`.
- src/gaia/ui/_chat_helpers.py: use module-level `_re` consistently.
- tests/unit/agents/test_registry.py: missing `from unittest.mock import patch`.
- src/gaia/apps/webui/package-lock.json: bumped 0.17.3 -> 0.17.4 to match
  package.json.
…loor, walk __context__

- Delete tunnel-friendly-error.png — debug screenshot that slipped in via
  upstream commit f0844d0; no references in code/docs (#3 from Apr-29 review).
- Restore uv.lock requires-python ">=3.13" to match origin/main (was silently
  narrowed to ">=3.12" in the same upstream commit). setup.py's python_requires
  stays >=3.10; the lock now no longer drifts from main (#4 from Apr-29 review).
- Restore src/gaia/apps/webui/package-lock.json to origin/main (revert my
  drive-by 0.17.3 -> 0.17.4 bump). Main itself has the package.json=0.17.4 vs
  lockfile=0.17.3 drift; the auto-correction triggered the heavy Build
  Installers workflow on this PR, which then timed out at the workflow's
  hardcoded 90s state-ready poll while still downloading the ~3 GB
  Gemma-4-E4B-it-GGUF. Reverting eliminates the unrelated CI noise; the
  lockfile/package.json drift is its own tech debt.
- _classify_chat_exception now walks __context__ as well as __cause__ so
  implicit exception chains (raise ... inside an except block, with no
  `from`) preserve typed-class metadata like
  LemonadeContextOverflowError.retryable (#5 from Apr-29 review; see the
  sketch below).
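
A sketch of the chain walk; the real classifier inspects more than the
retryable attribute:

  def _walk_exception_chain(exc):
      seen = set()
      while exc is not None and id(exc) not in seen:
          seen.add(id(exc))
          yield exc
          # __cause__ covers `raise ... from err`; __context__ covers
          # the implicit chain from raising inside an except block.
          exc = exc.__cause__ or exc.__context__

  def _is_retryable(exc):
      return any(getattr(e, "retryable", False)
                 for e in _walk_exception_chain(exc))
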
@itomek force-pushed the feature/mac-4b-default branch from b2a952f to 62d0d2f on April 29, 2026 at 20:05
@kovtcharov-amd added this pull request to the merge queue on Apr 29, 2026
Merged via the queue into main with commit 37e35eb Apr 29, 2026
45 of 47 checks passed
@kovtcharov-amd deleted the feature/mac-4b-default branch on April 29, 2026 at 22:20
kovtcharov added a commit that referenced this pull request Apr 29, 2026
Bring the feature branch back to green by addressing the cluster of CI
failures that landed when the memory v2 work merged with main.  All fixes
are mechanical or scoped to test isolation — no behavioural change to
the memory pipeline itself.

- Restore lost merge-conflict state in `ChatView.tsx` and `Sidebar.tsx`:
  the `getSessionHash` import, the `hashCopied`/`copied` state, and the
  `handleCopyHash` callback were all dropped during the merge — the Vite
  build was failing on missing identifiers across PyPI Build Check and
  all three Build Installers jobs.

- Lint/Pylint cleanup so the `Code Quality (Lint)` job is green again:
  remove unused vars/imports, drop dead `if x != x` branches, and
  promote a few pointless lambdas to method references in
  `agents/base/discovery.py`.  Reorder `routers/memory.py` imports
  to satisfy isort.

- Tighten `_canonical_agent_type` to surface `AttributeError` instead
  of swallowing it (matches the existing regression test added in #802;
  was failing locally and in CI Unit Tests).

- Add an explicit `GAIA_MEMORY_DISABLED=1` opt-out to `MemoryMixin.init_memory`.
  The Path Validator security tests, Unit Tests, and Chat Agent Tests
  jobs all instantiate `ChatAgent`/`CodeAgent` without a Lemonade server
  available; the memory v2 hard-requirement on the embedding service
  fails them.  This is a deliberate, named opt-out (not a silent
  fallback) — tests that exercise memory itself clear the variable
  via the new `tests/unit/conftest.py` autouse fixture and the
  `_mock_v2_init_context` helper, so memory test coverage is unchanged.
  CI workflows that don't need memory now set the env var explicitly.
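
A sketch of the opt-out; attribute names are simplified stand-ins:

  import os

  class MemoryMixin:
      def init_memory(self):
          if os.environ.get("GAIA_MEMORY_DISABLED") == "1":
              self.memory = None  # deliberate, named opt-out for CI/tests
              return
          # memory v2 hard-requires the embedding service
          self.memory = self._connect_embedding_service()
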
@github-actions bot mentioned this pull request May 1, 2026

Labels

agents, cli, dependencies, devops, electron, eval, llm, mcp, performance, tests
