Lemonade: _is_corrupt_download_error misclassifies generic "llama-server failed to start" as corruption → wrong recovery path + wasteful re-downloads #1294

@itomek

Summary

LemonadeClient._is_corrupt_download_error treats the generic error string "llama-server failed to start" as evidence of a corrupt or incomplete model download. Lemonade raises that string for many non-corruption failures (resource limits, ctx_size issues, GPU/backend startup problems, port conflicts). Misclassifying them routes ordinary load failures into the delete-and-redownload repair path, which wastes a full multi-GB re-download (the default model is ~25 GB) and, combined with the interactive-prompt defect (sibling issue #1293), dead-ends first boot.

Impact

Root cause analysis

src/gaia/llm/lemonade_client.py:1238-1248:

return any(
    phrase in error_message
    for phrase in [
        "download validation failed",
        "files are incomplete",
        "files are missing",
        "incomplete or missing",
        "corrupted download",
        "llama-server failed to start",  # Often indicates corrupt model files
    ]
)

The first five phrases are specific corruption signals. "llama-server failed to start" is a generic startup failure — the comment ("Often indicates corrupt model files") concedes it is a heuristic, not a reliable signal. The user's case is the counter-example: a force delete+redownload at ctx_size=32768 later succeeded, where the boot preload at ctx_size=0 failed with this string — pointing at a load/ctx/resource cause, not file corruption.

Proposed direction

  • Remove "llama-server failed to start" from the corruption phrase list, or only treat it as corruption when corroborated by a specific signal (e.g. a follow-up file-integrity check, a missing/short shard, or an explicit Lemonade corruption code).
  • Classify a bare "llama-server failed to start" as a load failure that surfaces an actionable error (per the repo's "fail loudly" rule) instead of silently re-downloading.
  • Prefer a structured signal from Lemonade (error code/type) over substring matching where available.
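One way the first two bullets could look is sketched below. It is a minimal illustration, not the actual patch. The five phrases are copied from the snippet above. The names `CORRUPTION_PHRASES`, `GENERIC_START_FAILURE`, and `classify_load_error`, and the returned labels, are hypothetical. It also assumes the message is matched as a plain substring, as in the quoted code.

```python
# Sketch only: the phrases come from the issue; all other names are hypothetical.
CORRUPTION_PHRASES = (
    "download validation failed",
    "files are incomplete",
    "files are missing",
    "incomplete or missing",
    "corrupted download",
)

# Generic startup failure: no longer treated as a corruption signal on its own.
GENERIC_START_FAILURE = "llama-server failed to start"


def is_corrupt_download_error(error_message: str) -> bool:
    """Return True only when a specific corruption phrase is present."""
    return any(phrase in error_message for phrase in CORRUPTION_PHRASES)


def classify_load_error(error_message: str) -> str:
    """Map an error message to a coarse category (hypothetical labels)."""
    if is_corrupt_download_error(error_message):
        # Corroborated by a specific phrase: the repair path is appropriate.
        return "corrupt_download"
    if GENERIC_START_FAILURE in error_message:
        # Bare startup failure: surface an actionable error, do not re-download.
        return "load_failure"
    return "unknown"
```

With this split, a message that contains both the generic string and a specific phrase still classifies as corruption, because the specific phrases are checked first.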

Acceptance criteria

Test plan (TDD)

Unit (tests/unit/test_lemonade_error_classification.py — file already exists):

  • Parametrized cases: each of the five specific phrases → True; bare "llama-server failed to start" → False; "llama-server failed to start" plus a corruption phrase → True.
  • load_model with a mocked bare llama-server failed to start response does not call delete_model / pull_model_stream, and raises an actionable error.
  • load_model with a mocked specific corruption error does enter the repair path.
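The unit cases above could be sketched roughly as follows. This is an assumption-heavy illustration: `load_model` here is a hypothetical stand-in function rather than the real `LemonadeClient.load_model`, `request_load` and `ModelLoadError` are invented names, and the client is a `MagicMock`. Only `delete_model`, `pull_model_stream`, and the phrases come from the issue. The tests are plain functions that pytest can collect; the case table could equally be fed to `pytest.mark.parametrize`.

```python
from unittest.mock import MagicMock

# Phrases copied from the issue; every other name here is hypothetical.
SPECIFIC_PHRASES = [
    "download validation failed",
    "files are incomplete",
    "files are missing",
    "incomplete or missing",
    "corrupted download",
]
GENERIC = "llama-server failed to start"


class ModelLoadError(Exception):
    """Hypothetical actionable error for non-corruption load failures."""


def is_corrupt_download_error(message):
    return any(phrase in message for phrase in SPECIFIC_PHRASES)


def load_model(client, model_name):
    """Hypothetical stand-in for the error handling under test."""
    try:
        client.request_load(model_name)
    except RuntimeError as exc:
        if not is_corrupt_download_error(str(exc)):
            # Fail loudly instead of silently re-downloading.
            raise ModelLoadError(
                f"Could not load {model_name}: {exc}. "
                "Check ctx_size, available memory, and the GPU backend."
            ) from exc
        # Specific corruption signal: repair, then retry once.
        client.delete_model(model_name)
        client.pull_model_stream(model_name)
        client.request_load(model_name)


CLASSIFICATION_CASES = (
    [(phrase, True) for phrase in SPECIFIC_PHRASES]
    + [(GENERIC, False), (f"{GENERIC}: corrupted download", True)]
)


def test_classification():
    for message, expected in CLASSIFICATION_CASES:
        assert is_corrupt_download_error(message) is expected, message


def test_bare_start_failure_does_not_redownload():
    client = MagicMock()
    client.request_load.side_effect = RuntimeError(GENERIC)
    try:
        load_model(client, "some-model")
    except ModelLoadError as exc:
        assert "ctx_size" in str(exc)  # message points at likely causes
    else:
        raise AssertionError("expected ModelLoadError")
    client.delete_model.assert_not_called()
    client.pull_model_stream.assert_not_called()


def test_corruption_enters_repair_path():
    client = MagicMock()
    # First load fails with a specific corruption phrase, the retry succeeds.
    client.request_load.side_effect = [RuntimeError("corrupted download"), None]
    load_model(client, "some-model")
    client.delete_model.assert_called_once_with("some-model")
    client.pull_model_stream.assert_called_once_with("some-model")
    assert client.request_load.call_count == 2
```

Asserting on `delete_model` and `pull_model_stream` directly, rather than on log output, keeps the test tied to the side effect the issue cares about: whether the model on disk is touched.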

Integration (@pytest.mark.integration, require_lemonade):

  • Induce a non-corruption load failure against a real Lemonade Server (e.g. an impossible ctx_size / resource condition) and assert no model deletion/re-download occurs and the error is actionable.
  • Induce a genuinely truncated model and assert the repair path engages.

Real-world (manual, AMD Linux hardware; tear down afterward):

  • On an AMD Ryzen AI integrated-GPU box (Strix-class) and a discrete Radeon GPU desktop (RX-class): trigger a load failure that is not corruption (e.g. oversized ctx for available memory) and confirm the model on disk is not deleted/re-downloaded and the surfaced message is actionable.

Related
