fix(lemonade): don't classify generic "llama-server failed to start" as a corrupt download #1300
…orrupt download

`_is_corrupt_download_error` matched the generic string "llama-server failed to start" as proof of a corrupt/incomplete model download. Lemonade raises that string for many non-corruption failures (resource limits, ctx_size, GPU/backend startup, port conflicts), so an ordinary load failure was routed into a destructive delete + re-download of the model (default ~25GB), dead-ending first-boot.

Keep the five specific corruption phrases as unconditional signals; "llama-server failed to start" now only counts as corruption when one of those phrases also corroborates it. A bare load failure falls through to load_model's non-corrupt branch, which raises an actionable LemonadeClientError without entering the repair path.

Closes #1294
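The corroboration rule in the commit message can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name drops the leading underscore, and only `"files are incomplete"` is a phrase confirmed by this PR; the other entries are hypothetical placeholders standing in for the real list of five.

```python
# Illustrative sketch only; not the actual gaia/lemonade implementation.
GENERIC_START_FAILURE = "llama-server failed to start"

# Assumption: only "files are incomplete" appears in this PR's text.
# The remaining phrases are hypothetical stand-ins for the real list.
SPECIFIC_CORRUPTION_PHRASES = (
    "files are incomplete",
    "hypothetical: checksum mismatch",
    "hypothetical: truncated model file",
)


def is_corrupt_download_error(message: str) -> bool:
    """Return True only when a specific corruption phrase is present."""
    text = message.lower()
    has_specific = any(p in text for p in SPECIFIC_CORRUPTION_PHRASES)
    # The generic start failure is deliberately not consulted on its own:
    # it counts as corruption only when a specific phrase corroborates it,
    # which reduces to checking for the specific phrases.
    return has_specific


print(is_corrupt_download_error(GENERIC_START_FAILURE))
print(is_corrupt_download_error(GENERIC_START_FAILURE + ": files are incomplete"))
```

Because the specific phrases are unconditional signals, the generic string never changes the outcome by itself, which is what keeps an ordinary load failure out of the repair path.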
cccc34f to 44493be
… validator

Commit 905036c introduced a timestamped backup naming convention via the security path validator, but two assertions in `test_code_agent.py` still expected the old hardcoded `.bak` suffix. Use `result["backup_path"]` instead.
@claude review this pr.
Looks correct and well-scoped; approving on the merits, with two minor non-blockers below.

The fix does exactly what the issue calls for: dropping …

I traced the runtime path to confirm the behavior holds end-to-end: …

The test coverage is thorough: the real-world …

🟢 Minor, non-blocking: …
Nice surgical fix with a clear corroboration rule. The stacked #1293 follow-up for the non-interactive auto-heal is the right call to keep separate.
Closes #1294
Why this matters
Before: ordinary model-load failures (resource limits, `ctx_size`, GPU/backend startup, port conflicts) all surface from Lemonade as `"llama-server failed to start"`, and `_is_corrupt_download_error` treated that generic string as file corruption. On a fresh install that misread sent first-boot into a destructive delete + ~25 GB re-download that couldn't fix the real problem, then dead-ended.

After: the bare failure is no longer mistaken for corruption. It surfaces as an actionable `LemonadeClientError` and the model cache is left intact. Genuine corruption (five specific signals) still triggers the repair flow.

Part of the fresh-install first-boot reliability set (#1293 is the stacked follow-up that fixes the non-interactive auto-heal on top of this accurate classification).
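The before/after routing described above might look something like the sketch below. It is written under stated assumptions: `load_model` here is a free function taking injected callables (the real one is presumably a client method), `request_load` and `is_corrupt` are hypothetical parameter names, and `LemonadeClientError` is a local stand-in for the project's exception of the same name.

```python
from typing import Callable


class LemonadeClientError(Exception):
    """Local stand-in for the error type named in the PR."""


def load_model(
    name: str,
    request_load: Callable[[str], None],      # hypothetical: asks the server to load
    is_corrupt: Callable[[str], bool],        # the error classifier
    delete_model: Callable[[str], None],
    pull_model_stream: Callable[[str], None],
) -> None:
    try:
        request_load(name)
    except RuntimeError as exc:
        if not is_corrupt(str(exc)):
            # Non-corrupt branch: report the failure and leave the cache alone.
            raise LemonadeClientError(
                f"Failed to load {name!r}: {exc}. The model cache was left "
                "intact; check memory, ctx_size, GPU/backend and port use."
            ) from exc
        # Repair path: only reached for genuine corruption signals.
        delete_model(name)
        pull_model_stream(name)
        request_load(name)
```

The point of the shape is that the destructive calls sit behind the classifier, so a misclassification is the only way an ordinary failure can reach them.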
Test plan
- `tests/unit/test_lemonade_error_classification.py` (35 pass): bare `llama-server failed to start` → not corrupt (incl. the exact `{code/type: model_load_error}` payload from the user's boot log and an OOM variant); all five specific corruption phrases → corrupt; `load_model` on a bare failure raises an actionable error and calls neither `delete_model` nor `pull_model_stream`; genuine corruption still enters the repair path.
- `tests/unit/test_lemonade_model_loading.py` + `test_lemonade_manager_preload.py` (18 pass), independently re-run from a clean checkout of this branch.
- `util/lint.py` (black / isort / flake8 / pylint -E) clean.
- … `/tmp` on AMD hardware, ran all 35 classifier tests (pass), and spot-checked live: `_is_corrupt_download_error("llama-server failed to start")` → `False`; `_is_corrupt_download_error("files are incomplete")` → `True`.