Skip to content

fix(lemonade): don't classify generic "llama-server failed to start" as a corrupt download#1300

Merged
itomek merged 3 commits into
mainfrom
tmi/issue-1294-corrupt-classification
May 29, 2026
Merged

fix(lemonade): don't classify generic "llama-server failed to start" as a corrupt download#1300
itomek merged 3 commits into
mainfrom
tmi/issue-1294-corrupt-classification

Conversation

@itomek
Copy link
Copy Markdown
Collaborator

@itomek itomek commented May 29, 2026

Closes #1294

Why this matters

Before: an ordinary model-load failure (resource limits, ctx_size, GPU/backend startup, port conflicts) all surface from Lemonade as "llama-server failed to start" — and _is_corrupt_download_error treated that generic string as file corruption. On a fresh install that misread sent first-boot into a destructive delete + ~25 GB re-download that couldn't fix the real problem, then dead-ended. After: the bare failure is no longer mistaken for corruption — it surfaces as an actionable LemonadeClientError and the model cache is left intact. Genuine corruption (five specific signals) still triggers the repair flow.

Part of the fresh-install first-boot reliability set (#1293 is the stacked follow-up that fixes the non-interactive auto-heal on top of this accurate classification).

Test plan

  • Unit — tests/unit/test_lemonade_error_classification.py (35 pass): bare llama-server failed to start → not corrupt (incl. the exact {code/type: model_load_error} payload from the user's boot log and an OOM variant); all five specific corruption phrases → corrupt; load_model on a bare failure raises an actionable error and calls neither delete_model nor pull_model_stream; genuine corruption still enters the repair path.
  • Regression — tests/unit/test_lemonade_model_loading.py + test_lemonade_manager_preload.py (18 pass), independently re-run from a clean checkout of this branch.
  • Lint — util/lint.py (black / isort / flake8 / pylint -E) clean.
  • Real-world (Strix Halo — Ryzen AI MAX+ 395) — cloned branch into /tmp on AMD hardware, ran all 35 classifier tests (pass), and spot-checked live: _is_corrupt_download_error("llama-server failed to start")False; _is_corrupt_download_error("files are incomplete")True.

@itomek itomek requested a review from kovtcharov-amd as a code owner May 29, 2026 18:35
@github-actions github-actions Bot added llm LLM backend changes tests Test changes performance Performance-critical changes labels May 29, 2026
@itomek itomek enabled auto-merge May 29, 2026 18:39
itomek added 2 commits May 29, 2026 14:42
…orrupt download

`_is_corrupt_download_error` matched the generic string
"llama-server failed to start" as proof of a corrupt/incomplete model
download. Lemonade raises that string for many non-corruption failures
(resource limits, ctx_size, GPU/backend startup, port conflicts), so an
ordinary load failure was routed into a destructive delete + re-download
of the model (default ~25GB), dead-ending first-boot.

Keep the five specific corruption phrases as unconditional signals;
"llama-server failed to start" now only counts as corruption when one of
those phrases also corroborates it. A bare load failure falls through to
load_model's non-corrupt branch, which raises an actionable
LemonadeClientError without entering the repair path.

Closes #1294


These two files were committed to main with formatting that does not
satisfy Black — surfaced by the CI merge-commit check on this PR.
No logic changes.
@itomek itomek force-pushed the tmi/issue-1294-corrupt-classification branch from cccc34f to 44493be Compare May 29, 2026 18:44
… validator

Commit 905036c introduced a timestamped backup naming convention via the
security path validator, but two assertions in test_code_agent.py still
expected the old hardcoded .bak suffix. Use result["backup_path"] instead.
@kovtcharov-amd
Copy link
Copy Markdown
Collaborator

@claude review this pr.

@itomek itomek added this pull request to the merge queue May 29, 2026
Merged via the queue into main with commit 3ed27c3 May 29, 2026
37 checks passed
@itomek itomek deleted the tmi/issue-1294-corrupt-classification branch May 29, 2026 21:40
@github-actions
Copy link
Copy Markdown
Contributor

Looks correct and well-scoped — approving on the merits, with two minor non-blockers below. The fix does exactly what the issue calls for: dropping "llama-server failed to start" from the corruption phrase list so a bare load failure no longer triggers the destructive delete + ~25 GB re-download.

I traced the runtime path to confirm the behavior holds end-to-end:

  • _is_corrupt_download_error now matches only the five specific phrases (src/gaia/llm/lemonade_client.py:1234).
  • On a bare llama-server failed to start, load_model skips the corrupt block and falls to lemonade_client.py:2873-2881, which re-raises an actionable LemonadeClientError with the original message and never calls delete_model/pull_model_stream. ✅
  • Genuine corruption still enters the resume→delete→fresh-download repair flow (lemonade_client.py:2784-2871). ✅

The test coverage is thorough — the real-world model_load_error payload, the corroboration case (startup failure + corruption phrase → still corrupt), and the load_model routing/decline cases all map cleanly onto that code path. Note: I couldn't execute the suite in this review sandbox (no pytest available), so I'm relying on your reported 35-pass run + Strix Halo spot-check for the green result.

🟢 Minor, non-blocking:

  1. .claude/plans/issue-1294.md embeds personal absolute paths (/Users/tomasz/src/amd/gaia/.venv/... in test_command/lint_command). If plan artifacts are meant to live in the repo, consider scrubbing the machine-specific paths; otherwise it may be better left untracked.
  2. Scope: the diff also carries black-formatting touch-ups in file_io.py / test_file_io_guardrails.py and a backup_path assertion fix in test_code_agent.py — unrelated to the title. They look like legitimate prior-commit cleanup on the branch, just worth a one-line callout in the PR body so reviewers don't wonder why they're here.

Nice surgical fix with a clear corroboration rule. The stacked #1293 follow-up for the non-interactive auto-heal is the right call to keep separate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents llm LLM backend changes performance Performance-critical changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Lemonade: _is_corrupt_download_error misclassifies generic "llama-server failed to start" as corruption → wrong recovery path + wasteful re-downloads

2 participants