fix(lemonade): auto-heal corrupt model on non-interactive boot without prompting by itomek · Pull Request #1302 · amd/gaia

itomek · 2026-05-29T19:06:38Z

Stacked on #1300 — review the classifier fix there first; this PR's diff is only the auto-heal behaviour on top of it.

Why this matters

Before: when the Agent UI backend detected a corrupt/incomplete model on first boot, it called input("[y/N]") inside the FastAPI lifespan threadpool — which has no TTY. input() raised EOFError, the boot-init job failed, and users were left with a broken UI requiring a manual force-redownload. After: the corrupt-download repair path is fully non-interactive. When prompt=False (boot context), all three prompt helpers auto-proceed without ever calling input(), recovery attempts a single delete+redownload with INFO-level progress, and only surfaces a loud actionable error if recovery genuinely fails.

What changed

_prompt_user_for_delete — added the same sys.stdin.isatty() / sys.stdout.isatty() guard its two siblings already had. This closes the EOFError gap.
load_model corrupt-download branch — the prompt argument now gates both _prompt_user_for_repair and _prompt_user_for_delete calls (if prompt and <isatty>), so prompt=False callers (the UI server and LemonadeManager._try_preload_with_ctx) never reach input().
Recovery is bounded to one delete+redownload attempt; download progress is logged at INFO so the boot log shows movement during a multi-GB repair; the corrupt-detection "why" is at DEBUG.
Unrecoverable failure raises a single LemonadeClientError naming what failed, what to do (UI Force-redownload / manual delete), and where to look (server.log) — no EOFError, no hang.
Interactive TTY path (prompt=True + real terminal) is preserved unchanged.

Test plan

Unit — 7 new tests in tests/unit/test_lemonade_model_loading.py (17 total, 10 pre-existing): each AC covered by a named test — _prompt_user_for_delete non-TTY returns without EOFError; prompt=False never reaches either prompt helper; recovery bounded to 1 delete+redownload; progress logged at INFO; unrecoverable raises actionable error; interactive TTY still reaches the prompt.
Regression — 60 pass across test_lemonade_model_loading.py + test_lemonade_error_classification.py + test_lemonade_manager_preload.py (includes all of fix(lemonade): don't classify generic "llama-server failed to start" as a corrupt download #1300's tests).
Lint — clean.
Real-world (Strix Halo — Ryzen AI MAX+ 395): cloned branch, 17 unit tests pass; all three _prompt_user_for_* helpers return cleanly under non-TTY stdin (no EOFError); prompt=False on a corrupt-classified load confirms neither repair nor delete prompt is called, boot log shows 📥 Resuming download… then a clean actionable error — no hang.
Real-world (AMD desktop, Radeon RX 7900 XTX): isatty guards and prompt=False path both verified on hardware (pytest not installed in that box's venv so unit suite run on Strix Halo only).

…orrupt download `_is_corrupt_download_error` matched the generic string "llama-server failed to start" as proof of a corrupt/incomplete model download. Lemonade raises that string for many non-corruption failures (resource limits, ctx_size, GPU/backend startup, port conflicts), so an ordinary load failure was routed into a destructive delete + re-download of the model (default ~25GB), dead-ending first-boot. Keep the five specific corruption phrases as unconditional signals; "llama-server failed to start" now only counts as corruption when one of those phrases also corroborates it. A bare load failure falls through to load_model's non-corrupt branch, which raises an actionable LemonadeClientError without entering the repair path. Closes #1294

These two files were committed to main with formatting that does not satisfy Black — surfaced by the CI merge-commit check on this PR. No logic changes.

… validator Commit 905036c introduced a timestamped backup naming convention via the security path validator, but two assertions in test_code_agent.py still expected the old hardcoded .bak suffix. Use result["backup_path"] instead.

On a fresh Agent UI install, first boot loaded the default model from the FastAPI lifespan threadpool (no TTY) with prompt=False. The corrupt-download repair path ignored that flag and fired interactive [y/N] prompts: input() raised EOFError (or hung on an idle pipe), dead-ending boot-init. Two fixes in load_model's corrupt-download branch: - _prompt_user_for_delete now has the same stdin/stdout isatty guard as its siblings _prompt_user_for_download / _prompt_user_for_repair, returning the proceed-default under a non-interactive environment instead of calling input(). - The repair/delete branch honors the prompt argument: with prompt=False it skips both prompts and auto-proceeds through the bounded recovery (resume, then a single delete + re-download). Recovery progress is logged at INFO (percent from the pull stream) so the backend log the UI tails shows movement and boot does not look frozen; the corrupt-detected detail logs at DEBUG. If recovery still fails after the one delete + re-download, it raises a single actionable LemonadeClientError naming the recovery affordance (UI Force-redownload / manual delete) and the Lemonade server.log -- no EOFError, no hang, no silent swallow. A real TTY with prompt=True still prompts as before.

kovtcharov-amd · 2026-05-29T21:39:37Z

@claude review this PR.

The base branch was changed.

github-actions · 2026-05-29T21:42:06Z

Solid fix — the diagnosis is right and the implementation is clean. The two root causes (the missing isatty guard on _prompt_user_for_delete and the corrupt-download branch ignoring prompt) are both addressed, and the prompt and not _prompt_user_for_* gating preserves the interactive TTY path unchanged. Extracting _consume_pull_stream removes the duplicated stream-draining loop, and the unrecoverable-failure error names what failed / what to do / where to look — exactly the fail-loudly shape CLAUDE.md asks for. Test coverage maps cleanly to the ACs.

One thing worth a look (🟢 minor, and pre-existing — not introduced here):

Asymmetric handling when a pull stream ends without complete or error. The fresh-download path now raises explicitly if the stream exhausts without a completion event (lemonade_client.py:2906-2910), but the resume path doesn't. If _consume_pull_stream(model_name, "resume") returns False with no exception (stream just ends), the if at :2860 is falsy, the try block exits without returning, the except at :2871 isn't entered (no exception was raised), and control falls through the whole corrupt branch to re-raise the original corrupt error at :2925 — so the delete+redownload second attempt is never reached.

That contradicts AC#4 ("bounded to ONE delete+redownload") only in the narrow case of a silently-truncated resume stream, which is unusual in practice. The behavior matches the pre-PR code, so it's not a regression. If you want the recovery to be robust to that case, mirroring the fresh-download treatment (treat resume-incomplete as a failure that escalates to delete+redownload) would close it. Optional for this PR.

Couldn't run the suite here (no venv in this environment), but the new tests are fully mocked and the assertions look correct; your Strix Halo run covers the real execution. Nothing blocking from my side.

itomek added 5 commits May 29, 2026 14:42

style: apply black formatting to code quality issues introduced in #1188

44493be

These two files were committed to main with formatting that does not satisfy Black — surfaced by the CI merge-commit check on this PR. No logic changes.

docs(plans): add issue-1293 non-interactive boot auto-heal plan

ac19af1

itomek requested a review from kovtcharov-amd as a code owner May 29, 2026 19:06

github-actions Bot added llm LLM backend changes tests Test changes performance Performance-critical changes labels May 29, 2026

itomek added this to the v0.19 — Test & CI Hardening [OSS] milestone May 29, 2026

kovtcharov-amd previously approved these changes May 29, 2026

View reviewed changes

Base automatically changed from tmi/issue-1294-corrupt-classification to main May 29, 2026 21:40

itomek enabled auto-merge May 30, 2026 15:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(lemonade): auto-heal corrupt model on non-interactive boot without prompting#1302

fix(lemonade): auto-heal corrupt model on non-interactive boot without prompting#1302
itomek wants to merge 5 commits into
mainfrom
tmi/issue-1293-noninteractive-boot

itomek commented May 29, 2026

Uh oh!

kovtcharov-amd commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

itomek commented May 29, 2026

Why this matters

What changed

Test plan

Uh oh!

kovtcharov-amd commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants