Lemonade: `_is_corrupt_download_error` misclassifies generic "llama-server failed to start" as corruption → wrong recovery path + wasteful re-downloads #1294

## Summary

`LemonadeClient._is_corrupt_download_error` treats the **generic** error string `"llama-server failed to start"` as evidence of a corrupt or incomplete model download. Lemonade raises that string for *many* non-corruption failures (resource limits, `ctx_size` issues, GPU/backend startup problems, port conflicts). Misclassifying them routes ordinary load failures into the **delete-and-redownload** repair path, which wastes a full multi-GB re-download (the default model is ~25 GB) and, combined with the interactive-prompt defect in sibling issue #1293, dead-ends first boot.

## Impact

- A transient or environmental load failure can trigger a **silent ~25 GB re-download** of a model that was never corrupt.
- In the observed case, the recovery "resume" completed in seconds, so nothing was actually re-downloaded, and the reload then failed again with the same error. The boot log in #1293 shows ~37 s between "corrupted files" and "Resume failed", far too short to re-fetch gigabytes.
- Makes #1293's auto-heal unsafe: auto-recovery is only correct if "corrupt" actually means corrupt.

## Root cause analysis

`src/gaia/llm/lemonade_client.py:1238`-`1248`:

```python
return any(
    phrase in error_message
    for phrase in [
        "download validation failed",
        "files are incomplete",
        "files are missing",
        "incomplete or missing",
        "corrupted download",
        "llama-server failed to start",  # Often indicates corrupt model files
    ]
)
```

The first five phrases are specific corruption signals. `"llama-server failed to start"` is a **generic startup failure**; the comment ("*Often* indicates corrupt model files") concedes it is a heuristic, not a reliable signal. The user's case is the counter-example: a forced delete and re-download at `ctx_size=32768` later succeeded, whereas the boot preload at `ctx_size=0` failed with this string. That points to a load/ctx/resource cause, not file corruption.

## Proposed direction

- [ ] Remove `"llama-server failed to start"` from the corruption phrase list, **or** only treat it as corruption when corroborated by a specific signal (e.g. a follow-up file-integrity check, a missing/short shard, or an explicit Lemonade corruption code).
- [ ] Classify a bare `"llama-server failed to start"` as a **load failure** that surfaces an actionable error (per the repo's "fail loudly" rule) instead of silently re-downloading.
- [ ] Prefer a structured signal from Lemonade (error `code`/`type`) over substring matching where available.
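
One possible shape for the first two items, as a minimal sketch rather than the intended implementation: classify the message into a small enum so the generic startup string maps to a load failure instead of corruption. `LoadFailureKind` and `classify_load_error` are hypothetical names introduced here, and the sketch assumes the caller passes the message in the same form the existing check receives it (no extra normalization is applied).

```python
from enum import Enum

# The five specific corruption phrases from the existing check.
CORRUPTION_PHRASES = (
    "download validation failed",
    "files are incomplete",
    "files are missing",
    "incomplete or missing",
    "corrupted download",
)

# Generic startup failure: on its own it says nothing about file integrity.
GENERIC_START_FAILURE = "llama-server failed to start"


class LoadFailureKind(Enum):
    """Hypothetical classification result (not part of the current code)."""

    CORRUPT_DOWNLOAD = "corrupt_download"
    LOAD_FAILURE = "load_failure"
    OTHER = "other"


def classify_load_error(error_message: str) -> LoadFailureKind:
    """Classify a load error; corruption requires a specific phrase."""
    if any(phrase in error_message for phrase in CORRUPTION_PHRASES):
        return LoadFailureKind.CORRUPT_DOWNLOAD
    if GENERIC_START_FAILURE in error_message:
        return LoadFailureKind.LOAD_FAILURE
    return LoadFailureKind.OTHER
```

With this split, only `CORRUPT_DOWNLOAD` would be allowed to reach the delete-and-redownload path; a structured `code`/`type` from Lemonade, if one exists, could replace the substring checks without changing callers.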

## Acceptance criteria

- [ ] `_is_corrupt_download_error("...llama-server failed to start...")` returns `False` unless a specific corruption signal is also present.
- [ ] The five existing specific corruption phrases continue to return `True` (no regression).
- [ ] A bare `llama-server failed to start` load failure raises an actionable `LemonadeClientError` (what failed / what to do / where to look) and does **not** enter the delete+redownload path.
- [ ] When corruption *is* correctly detected, the existing repair flow still runs (gated by #1293's non-interactive policy).
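
To make "actionable" concrete, here is a rough sketch of the kind of error the third criterion describes. The `LemonadeClientError` class below is a stand-in defined only so the example runs on its own, `build_load_failure_error` is a hypothetical helper, and the hint wording (including the reference to a server log) is an assumption, not existing message text.

```python
class LemonadeClientError(Exception):
    """Stand-in for the project's exception so this sketch is self-contained."""


def build_load_failure_error(model_name: str, server_message: str) -> LemonadeClientError:
    """Build an error that says what failed, what to do, and where to look."""
    lines = [
        # What failed: keep the server's own message verbatim.
        f"Failed to load model '{model_name}': {server_message}",
        # Make it explicit that the repair path was not taken.
        "The model files were left on disk; no re-download was attempted.",
        # What to do: hints drawn from the non-corruption causes listed in the summary.
        "What to do: retry with a smaller ctx_size, free memory, check the "
        "GPU/backend, and make sure the server port is not already in use.",
        # Where to look: assumed location, adjust to wherever Lemonade actually logs.
        "Where to look: the Lemonade Server log for the llama-server startup output.",
    ]
    return LemonadeClientError("\n".join(lines))


error = build_load_failure_error("demo-model", "llama-server failed to start")
print(error)
```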

## Test plan (TDD)

**Unit** (`tests/unit/test_lemonade_error_classification.py` — file already exists):
- [ ] Parametrized cases: each of the five specific phrases → `True`; bare `"llama-server failed to start"` → `False`; `"llama-server failed to start"` **plus** a corruption phrase → `True`.
- [ ] `load_model` with a mocked bare `llama-server failed to start` response does **not** call `delete_model` / `pull_model_stream`, and raises an actionable error.
- [ ] `load_model` with a mocked specific corruption error **does** enter the repair path.
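
The two `load_model` cases above could be sketched with `unittest.mock` roughly as follows. This is an illustration under assumptions: `load_with_recovery` is a hypothetical stand-in for the recovery branch of `load_model`, `request_load` is an invented method name for the underlying load call, and the single-argument signatures of `delete_model` / `pull_model_stream` are assumed.

```python
from unittest.mock import MagicMock

CORRUPTION_PHRASES = (
    "download validation failed",
    "files are incomplete",
    "files are missing",
    "incomplete or missing",
    "corrupted download",
)


class LemonadeClientError(Exception):
    """Stand-in for the project's exception so this sketch is self-contained."""


def load_with_recovery(client, model_name):
    """Hypothetical recovery branch: repair only on a specific corruption phrase."""
    try:
        client.request_load(model_name)
    except LemonadeClientError as exc:
        if any(phrase in str(exc) for phrase in CORRUPTION_PHRASES):
            client.delete_model(model_name)
            client.pull_model_stream(model_name)
            client.request_load(model_name)
            return
        raise  # not corruption: fail loudly, leave the model on disk


def test_bare_start_failure_does_not_repair():
    client = MagicMock()
    client.request_load.side_effect = LemonadeClientError("llama-server failed to start")
    raised = False
    try:
        load_with_recovery(client, "demo-model")
    except LemonadeClientError:
        raised = True
    assert raised
    client.delete_model.assert_not_called()
    client.pull_model_stream.assert_not_called()


def test_specific_corruption_enters_repair():
    client = MagicMock()
    # First load fails with a specific corruption phrase, the retry succeeds.
    client.request_load.side_effect = [LemonadeClientError("download validation failed"), None]
    load_with_recovery(client, "demo-model")
    client.delete_model.assert_called_once_with("demo-model")
    client.pull_model_stream.assert_called_once_with("demo-model")
    assert client.request_load.call_count == 2
```

In the real test file the same assertions would target `LemonadeClient.load_model` with the HTTP layer mocked; the stand-in only shows which calls each case should and should not make.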

**Integration** (`@pytest.mark.integration`, `require_lemonade`):
- [ ] Induce a non-corruption load failure against a real Lemonade Server (e.g. an impossible `ctx_size` / resource condition) and assert no model deletion/re-download occurs and the error is actionable.
- [ ] Induce a genuinely truncated model and assert the repair path engages.

**Real-world** (manual, AMD Linux hardware; tear down afterward):
- [ ] On an **AMD Ryzen AI integrated-GPU box (Strix-class)** and a **discrete Radeon GPU desktop (RX-class)**: trigger a load failure that is *not* corruption (e.g. oversized ctx for available memory) and confirm the model on disk is **not** deleted/re-downloaded and the surfaced message is actionable.

## Related

- #1293 — interactive-prompt boot crash (the other half of the fresh-install first-boot failure). Safe auto-heal there depends on accurate classification here.

