Skip to content

fix(lemonade): auto-heal corrupt model on non-interactive boot without prompting#1302

Open
itomek wants to merge 5 commits into
mainfrom
tmi/issue-1293-noninteractive-boot
Open

fix(lemonade): auto-heal corrupt model on non-interactive boot without prompting#1302
itomek wants to merge 5 commits into
mainfrom
tmi/issue-1293-noninteractive-boot

Conversation

@itomek
Copy link
Copy Markdown
Collaborator

@itomek itomek commented May 29, 2026

Closes #1293

Stacked on #1300 — review the classifier fix there first; this PR's diff is only the auto-heal behaviour on top of it.

Why this matters

Before: when the Agent UI backend detected a corrupt/incomplete model on first boot, it called input("[y/N]") inside the FastAPI lifespan threadpool — which has no TTY. input() raised EOFError, the boot-init job failed, and users were left with a broken UI requiring a manual force-redownload. After: the corrupt-download repair path is fully non-interactive. When prompt=False (boot context), all three prompt helpers auto-proceed without ever calling input(), recovery attempts a single delete+redownload with INFO-level progress, and only surfaces a loud actionable error if recovery genuinely fails.

What changed

  • _prompt_user_for_delete — added the same sys.stdin.isatty() / sys.stdout.isatty() guard its two siblings already had. This closes the EOFError gap.
  • load_model corrupt-download branch — the prompt argument now gates both _prompt_user_for_repair and _prompt_user_for_delete calls (if prompt and <isatty>), so prompt=False callers (the UI server and LemonadeManager._try_preload_with_ctx) never reach input().
  • Recovery is bounded to one delete+redownload attempt; download progress is logged at INFO so the boot log shows movement during a multi-GB repair; the corrupt-detection "why" is at DEBUG.
  • Unrecoverable failure raises a single LemonadeClientError naming what failed, what to do (UI Force-redownload / manual delete), and where to look (server.log) — no EOFError, no hang.
  • Interactive TTY path (prompt=True + real terminal) is preserved unchanged.

Test plan

  • Unit — 7 new tests in tests/unit/test_lemonade_model_loading.py (17 total, 10 pre-existing): each AC covered by a named test — _prompt_user_for_delete non-TTY returns without EOFError; prompt=False never reaches either prompt helper; recovery bounded to 1 delete+redownload; progress logged at INFO; unrecoverable raises actionable error; interactive TTY still reaches the prompt.
  • Regression — 60 pass across test_lemonade_model_loading.py + test_lemonade_error_classification.py + test_lemonade_manager_preload.py (includes all of fix(lemonade): don't classify generic "llama-server failed to start" as a corrupt download #1300's tests).
  • Lint — clean.
  • Real-world (Strix Halo — Ryzen AI MAX+ 395): cloned branch, 17 unit tests pass; all three _prompt_user_for_* helpers return cleanly under non-TTY stdin (no EOFError); prompt=False on a corrupt-classified load confirms neither repair nor delete prompt is called, boot log shows 📥 Resuming download… then a clean actionable error — no hang.
  • Real-world (AMD desktop, Radeon RX 7900 XTX): isatty guards and prompt=False path both verified on hardware (pytest not installed in that box's venv so unit suite run on Strix Halo only).

itomek added 5 commits May 29, 2026 14:42
…orrupt download

`_is_corrupt_download_error` matched the generic string
"llama-server failed to start" as proof of a corrupt/incomplete model
download. Lemonade raises that string for many non-corruption failures
(resource limits, ctx_size, GPU/backend startup, port conflicts), so an
ordinary load failure was routed into a destructive delete + re-download
of the model (default ~25GB), dead-ending first-boot.

Keep the five specific corruption phrases as unconditional signals;
"llama-server failed to start" now only counts as corruption when one of
those phrases also corroborates it. A bare load failure falls through to
load_model's non-corrupt branch, which raises an actionable
LemonadeClientError without entering the repair path.

Closes #1294


These two files were committed to main with formatting that does not
satisfy Black — surfaced by the CI merge-commit check on this PR.
No logic changes.
… validator

Commit 905036c introduced a timestamped backup naming convention via the
security path validator, but two assertions in test_code_agent.py still
expected the old hardcoded .bak suffix. Use result["backup_path"] instead.
On a fresh Agent UI install, first boot loaded the default model from the
FastAPI lifespan threadpool (no TTY) with prompt=False. The corrupt-download
repair path ignored that flag and fired interactive [y/N] prompts: input()
raised EOFError (or hung on an idle pipe), dead-ending boot-init.

Two fixes in load_model's corrupt-download branch:
- _prompt_user_for_delete now has the same stdin/stdout isatty guard as its
  siblings _prompt_user_for_download / _prompt_user_for_repair, returning the
  proceed-default under a non-interactive environment instead of calling input().
- The repair/delete branch honors the prompt argument: with prompt=False it
  skips both prompts and auto-proceeds through the bounded recovery (resume,
  then a single delete + re-download). Recovery progress is logged at INFO
  (percent from the pull stream) so the backend log the UI tails shows movement
  and boot does not look frozen; the corrupt-detected detail logs at DEBUG.

If recovery still fails after the one delete + re-download, it raises a single
actionable LemonadeClientError naming the recovery affordance (UI Force-redownload
/ manual delete) and the Lemonade server.log -- no EOFError, no hang, no silent
swallow. A real TTY with prompt=True still prompts as before.
@itomek itomek requested a review from kovtcharov-amd as a code owner May 29, 2026 19:06
@github-actions github-actions Bot added llm LLM backend changes tests Test changes performance Performance-critical changes labels May 29, 2026
@kovtcharov-amd
Copy link
Copy Markdown
Collaborator

@claude review this PR.

kovtcharov-amd
kovtcharov-amd previously approved these changes May 29, 2026
Base automatically changed from tmi/issue-1294-corrupt-classification to main May 29, 2026 21:40
@itomek itomek dismissed kovtcharov-amd’s stale review May 29, 2026 21:40

The base branch was changed.

@github-actions
Copy link
Copy Markdown
Contributor

Solid fix — the diagnosis is right and the implementation is clean. The two root causes (the missing isatty guard on _prompt_user_for_delete and the corrupt-download branch ignoring prompt) are both addressed, and the prompt and not _prompt_user_for_* gating preserves the interactive TTY path unchanged. Extracting _consume_pull_stream removes the duplicated stream-draining loop, and the unrecoverable-failure error names what failed / what to do / where to look — exactly the fail-loudly shape CLAUDE.md asks for. Test coverage maps cleanly to the ACs.

One thing worth a look (🟢 minor, and pre-existing — not introduced here):

Asymmetric handling when a pull stream ends without complete or error. The fresh-download path now raises explicitly if the stream exhausts without a completion event (lemonade_client.py:2906-2910), but the resume path doesn't. If _consume_pull_stream(model_name, "resume") returns False with no exception (stream just ends), the if at :2860 is falsy, the try block exits without returning, the except at :2871 isn't entered (no exception was raised), and control falls through the whole corrupt branch to re-raise the original corrupt error at :2925 — so the delete+redownload second attempt is never reached.

That contradicts AC#4 ("bounded to ONE delete+redownload") only in the narrow case of a silently-truncated resume stream, which is unusual in practice. The behavior matches the pre-PR code, so it's not a regression. If you want the recovery to be robust to that case, mirroring the fresh-download treatment (treat resume-incomplete as a failure that escalates to delete+redownload) would close it. Optional for this PR.

Couldn't run the suite here (no venv in this environment), but the new tests are fully mocked and the assertions look correct; your Strix Halo run covers the real execution. Nothing blocking from my side.

@itomek itomek enabled auto-merge May 30, 2026 15:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

llm LLM backend changes performance Performance-critical changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Agent UI first-boot fails to load model: interactive "Delete and re-download?" prompt dead-ends the non-interactive backend

2 participants