fix(lemonade): don't classify generic "llama-server failed to start" as a corrupt download by itomek · Pull Request #1300 · amd/gaia

itomek · 2026-05-29T18:35:42Z

Why this matters

Before: ordinary model-load failures (resource limits, ctx_size, GPU/backend startup, port conflicts) all surface from Lemonade as "llama-server failed to start", and _is_corrupt_download_error treated that generic string as file corruption. On a fresh install, that misread sent first boot into a destructive delete + ~25 GB re-download that couldn't fix the real problem, then dead-ended. After: the bare failure is no longer mistaken for corruption; it surfaces as an actionable LemonadeClientError and the model cache is left intact. Genuine corruption (five specific signals) still triggers the repair flow.

Part of the fresh-install first-boot reliability set (#1293 is the stacked follow-up that fixes the non-interactive auto-heal on top of this accurate classification).

Test plan

Unit — tests/unit/test_lemonade_error_classification.py (35 pass): bare llama-server failed to start → not corrupt (incl. the exact {code/type: model_load_error} payload from the user's boot log and an OOM variant); all five specific corruption phrases → corrupt; load_model on a bare failure raises an actionable error and calls neither delete_model nor pull_model_stream; genuine corruption still enters the repair path.
Regression — tests/unit/test_lemonade_model_loading.py + test_lemonade_manager_preload.py (18 pass), independently re-run from a clean checkout of this branch.
Lint — util/lint.py (black / isort / flake8 / pylint -E) clean.
Real-world (Strix Halo — Ryzen AI MAX+ 395) — cloned branch into /tmp on AMD hardware, ran all 35 classifier tests (pass), and spot-checked live: _is_corrupt_download_error("llama-server failed to start") → False; _is_corrupt_download_error("files are incomplete") → True.

…orrupt download

`_is_corrupt_download_error` matched the generic string "llama-server failed to start" as proof of a corrupt/incomplete model download. Lemonade raises that string for many non-corruption failures (resource limits, ctx_size, GPU/backend startup, port conflicts), so an ordinary load failure was routed into a destructive delete + re-download of the model (default ~25 GB), dead-ending first-boot.

Keep the five specific corruption phrases as unconditional signals; "llama-server failed to start" now only counts as corruption when one of those phrases also corroborates it. A bare load failure falls through to load_model's non-corrupt branch, which raises an actionable LemonadeClientError without entering the repair path.

Closes #1294
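The corroboration rule in the commit message above can be sketched as a plain phrase match. This is a minimal illustration, not the code in lemonade_client.py: the thread names only one of the five corruption phrases ("files are incomplete"), so the tuple below holds just that one, and both the function name without the leading underscore and the case-insensitive comparison are assumptions.

```python
# Hypothetical stand-in for the real phrase list. The PR thread mentions
# five phrases but spells out only this one, so the others are not guessed.
CORRUPTION_PHRASES = ("files are incomplete",)


def is_corrupt_download_error(message: str, phrases=CORRUPTION_PHRASES) -> bool:
    """Return True only when a specific corruption phrase appears.

    "llama-server failed to start" is deliberately absent from the list,
    so it counts as corruption only if a listed phrase accompanies it.
    Lowercasing the message is an assumption made for this sketch.
    """
    text = message.lower()
    return any(phrase in text for phrase in phrases)


# Bare startup failure: not corruption.
print(is_corrupt_download_error("llama-server failed to start"))
# Specific corruption signal: corruption.
print(is_corrupt_download_error("files are incomplete"))
# Startup failure corroborated by a corruption phrase: still corruption.
print(is_corrupt_download_error("llama-server failed to start: files are incomplete"))
```

Because the generic string is simply not in the list, "corroboration" needs no special case: a message is classified as corrupt exactly when a specific phrase is present, whether or not the startup failure text is there too.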

These two files were committed to main with formatting that does not satisfy Black — surfaced by the CI merge-commit check on this PR. No logic changes.

… validator

Commit 905036c introduced a timestamped backup naming convention via the security path validator, but two assertions in test_code_agent.py still expected the old hardcoded .bak suffix. Use result["backup_path"] instead.
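To illustrate why asserting on result["backup_path"] is more robust than rebuilding the name from a fixed suffix, here is a small sketch. The helper `write_with_backup`, its timestamp format, and the result dictionary's other keys are all hypothetical; the actual naming convention from commit 905036c is not shown in this thread.

```python
import shutil
import tempfile
import time
from pathlib import Path


def write_with_backup(path: Path, content: str) -> dict:
    """Hypothetical helper: copy `path` to a timestamped backup, then overwrite it."""
    stamp = time.strftime("%Y%m%d_%H%M%S")
    # Assumed naming scheme, chosen only for this example.
    backup = path.with_name(f"{path.name}.{stamp}.bak")
    shutil.copy2(path, backup)
    path.write_text(content)
    return {"status": "success", "backup_path": str(backup)}


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.py"
    target.write_text("old\n")
    result = write_with_backup(target, "new\n")

    # Brittle: a name rebuilt from the old fixed suffix no longer exists.
    assert not Path(str(target) + ".bak").exists()

    # Robust: trust the path the call reports, whatever the naming scheme is.
    assert Path(result["backup_path"]).read_text() == "old\n"
```

A test written against the returned path keeps passing if the naming convention changes again, which is the point of the assertion fix described above.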

kovtcharov-amd · 2026-05-29T21:38:40Z

@claude review this pr.

github-actions · 2026-05-29T21:40:29Z

Looks correct and well-scoped — approving on the merits, with two minor non-blockers below. The fix does exactly what the issue calls for: dropping "llama-server failed to start" from the corruption phrase list so a bare load failure no longer triggers the destructive delete + ~25 GB re-download.

I traced the runtime path to confirm the behavior holds end-to-end:

_is_corrupt_download_error now matches only the five specific phrases (src/gaia/llm/lemonade_client.py:1234).
On a bare llama-server failed to start, load_model skips the corrupt block and falls to lemonade_client.py:2873-2881, which re-raises an actionable LemonadeClientError with the original message and never calls delete_model/pull_model_stream. ✅
Genuine corruption still enters the resume→delete→fresh-download repair flow (lemonade_client.py:2784-2871). ✅
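The routing traced above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code at the cited line ranges: `LemonadeClientError` is redefined locally, the phrase tuple holds only the one phrase named in this thread, the callables are injected so the example runs without a server, and the real resume→delete→fresh-download flow is collapsed into a single delete-then-pull step.

```python
class LemonadeClientError(Exception):
    """Local stand-in for the exception class named in the review."""


# Only one of the five phrases is spelled out in this thread.
CORRUPTION_PHRASES = ("files are incomplete",)


def looks_corrupt(message: str) -> bool:
    return any(phrase in message.lower() for phrase in CORRUPTION_PHRASES)


def load_model(name, load, delete_model, pull_model_stream):
    """Try to load; repair only when the error names a corruption signal."""
    try:
        return load(name)
    except LemonadeClientError as exc:
        if not looks_corrupt(str(exc)):
            # Non-corrupt branch: surface the original message, leave the cache alone.
            raise LemonadeClientError(f"Could not load {name}: {exc}") from exc
        # Corrupt branch (simplified): drop the model, fetch it again, retry once.
        delete_model(name)
        pull_model_stream(name)
        return load(name)


calls = []


def failing_load(name):
    raise LemonadeClientError("llama-server failed to start")


try:
    load_model(
        "demo-model",
        failing_load,
        delete_model=lambda n: calls.append("delete"),
        pull_model_stream=lambda n: calls.append("pull"),
    )
except LemonadeClientError as exc:
    print(exc)

# No destructive calls were made for the bare startup failure.
print(calls)
```

Injecting `load`, `delete_model`, and `pull_model_stream` is a choice made for the sketch so the branch taken can be observed directly; the model name "demo-model" is likewise a placeholder.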

The test coverage is thorough — the real-world model_load_error payload, the corroboration case (startup failure + corruption phrase → still corrupt), and the load_model routing/decline cases all map cleanly onto that code path. Note: I couldn't execute the suite in this review sandbox (no pytest available), so I'm relying on your reported 35-pass run + Strix Halo spot-check for the green result.

🟢 Minor, non-blocking:

.claude/plans/issue-1294.md embeds personal absolute paths (/Users/tomasz/src/amd/gaia/.venv/... in test_command/lint_command). If plan artifacts are meant to live in the repo, consider scrubbing the machine-specific paths; otherwise it may be better left untracked.
Scope: the diff also carries black-formatting touch-ups in file_io.py / test_file_io_guardrails.py and a backup_path assertion fix in test_code_agent.py — unrelated to the title. They look like legitimate prior-commit cleanup on the branch, just worth a one-line callout in the PR body so reviewers don't wonder why they're here.

Nice surgical fix with a clear corroboration rule. The stacked #1293 follow-up for the non-interactive auto-heal is the right call to keep separate.

itomek requested a review from kovtcharov-amd as a code owner May 29, 2026 18:35

github-actions Bot added the llm, tests, and performance labels May 29, 2026

itomek enabled auto-merge May 29, 2026 18:39

itomek added 2 commits May 29, 2026 14:42

style: apply black formatting to code quality issues introduced in #1188

44493be

itomek force-pushed the tmi/issue-1294-corrupt-classification branch from cccc34f to 44493be Compare May 29, 2026 18:44

github-actions Bot added the agents label May 29, 2026

itomek mentioned this pull request May 29, 2026

fix(lemonade): auto-heal corrupt model on non-interactive boot without prompting #1302

Open

5 tasks

itomek self-assigned this May 29, 2026

itomek added this to the v0.19 — Test & CI Hardening [OSS] milestone May 29, 2026

kovtcharov-amd approved these changes May 29, 2026

itomek added this pull request to the merge queue May 29, 2026

Merged via the queue into main with commit 3ed27c3 May 29, 2026
37 checks passed

itomek deleted the tmi/issue-1294-corrupt-classification branch May 29, 2026 21:40

itomek merged 3 commits into main from tmi/issue-1294-corrupt-classification
