[bug] LLM judge silently falls back to simple judge when endpoint unreachable — artifact is indistinguishable from a real LLM verdict #27

@dogkeeper886

Context

Surfaced while forensically tracing CI run 24659380207 — the first dual-mode dispatch on the self-hosted runner. The per-test JSON artifacts reported `llmJudge.reason` === `simpleJudge.reason` verbatim for every test, and `summary.json` claimed the LLM judge passed 1 and failed 1 — identical to simpleJudge's counters. It looked like the LLM agreed with simple.

The job's stderr (which the artifact doesn't preserve) told the real story:

```
[JUDGE] Running simple judge...
[JUDGE] Running LLM judge...
[WARN] LLM judge not available, using simple judge results
```

The LLM never ran. The artifact had no signal that this was the case.

Reproduction

Deterministic. Point the framework at an unreachable endpoint:

```
bash cicd/scripts/run-tests.sh --suite smoke --id TC-SMOKE-002 \
  --judge-url http://127.0.0.1:1 --format json > /tmp/out.json 2> /tmp/err.log

grep -E '\[JUDGE\]|\[WARN\]|\[LLM\]' /tmp/err.log
[JUDGE] Running simple judge...
[JUDGE] Running LLM judge...
[WARN] LLM judge not available, using simple judge results

jq '{simple: .simpleJudge.reason, llm: .llmJudge.reason}' \
  cicd/results/$(ls -1t cicd/results/ | head -1)/TC-SMOKE-002.json
{
  "simple": "All steps passed with exit code 0, patterns matched, no errors",
  "llm": "All steps passed with exit code 0, patterns matched, no errors"
}
```

The two reason strings are byte-identical because the fallback literally copies `simpleJudgments`.

Code location

`cicd/tests/src/cli.ts:139-154`:

```ts
let llmJudgments = simpleJudgments.map((j) => ({
...j,
reason: config.noLlm ? 'LLM judge disabled' : j.reason,
}));

if (!config.noLlm) {
process.stderr.write('[JUDGE] Running LLM judge...\n');
const llmJudge = new LLMJudge(config.judgeUrl, config.judgeModel);

const available = await llmJudge.isAvailable();
if (available) {
llmJudgments = await llmJudge.judgeResults(results);
await llmJudge.unloadModel();
} else {
process.stderr.write('[WARN] LLM judge not available, using simple judge results\n');
}
}
```

And `cicd/tests/src/judge/llm-judge.ts:27-36`:

```ts
async isAvailable(): Promise<boolean> {
try {
const response = await axios.get(`${this.ollamaUrl}/api/tags`, { timeout: 5000 });
return response.status === 200;
} catch {
return false;
}
}
```

Two problems compound each other:

  • `isAvailable()` swallows the axios error — we can't tell `ECONNREFUSED` from `ENOTFOUND` from a 5s timeout after the fact. Makes triage blind.
  • `cli.ts` writes `simpleJudgments` as `llmJudgments` when unavailable. The downstream JSON reporter writes them indistinguishably from real LLM output.

Why it matters

The whole point of dual-judge mode is that a disagreement between simple and LLM is a high-signal human-triage flag. When LLM silently clones simple, dual mode becomes no-op mode — but reports success. Regressions the LLM is uniquely positioned to catch get masked as passing.

Also: when this is paired with CLAUDE.md's statement that "LLM judge configuration lives in `cicd/tests/.env`", anyone debugging a mysterious lack of LLM coverage has no signal in the artifact to guide them.

Proposed fix

  1. Log the actual error in `isAvailable()`. Replace `catch {}` with `catch (e) { log with code and message }`. Costs nothing; saves hours.
  2. When unavailable, write a distinct verdict rather than copying simpleJudgments. Options:
    • (2a) `llmJudgments[i] = { testId, pass: false, reason: 'LLM judge unavailable: ${errorCode}', evidence: '' }` — fails the test (pessimistic; forces attention).
    • (2b) `pass: simpleJudgment.pass, reason: '[LLM unavailable — falling back to simple judge] ${simple.reason}', evidenceGrounded: false` — carries simple's verdict but makes the fallback visible.
  3. Annotate `summary.json` with `llm.availability: 'ok' | 'unreachable' | 'disabled'` so downstream consumers can check without parsing reasons.
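A minimal sketch of fix (1), assuming Node-style network errors that carry a `code` property (`ECONNREFUSED`, `ENOTFOUND`, `ETIMEDOUT`); `classifyError` is a hypothetical helper, not part of the framework:

```typescript
// Hypothetical helper: pull a stable error code out of whatever the HTTP
// client throws, falling back to the message for non-network errors.
function classifyError(e: unknown): string {
  if (e && typeof e === 'object' && 'code' in e) {
    return String((e as { code: unknown }).code);
  }
  return e instanceof Error ? e.message : 'unknown error';
}

// In isAvailable(), the bare catch would then become something like:
//   } catch (e) {
//     process.stderr.write(`[WARN] LLM judge unreachable: ${classifyError(e)}\n`);
//     return false;
//   }
```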

Pick (2b) + (3) as the least-surprise default: simple judge's verdict still drives the run, but every artifact and the summary make clear the LLM didn't contribute. Pessimistic-fail mode (2a) is defensible but changes the pass/fail contract of the run.
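What (2b) + (3) could look like, sketched over plain objects; the `Judgment` shape and field names are illustrative, inferred from the artifact fields quoted above rather than the framework's real types:

```typescript
// Illustrative shape, inferred from the artifact's simpleJudge/llmJudge fields.
interface Judgment {
  testId: string;
  pass: boolean;
  reason: string;
  evidenceGrounded?: boolean;
}

type LlmAvailability = 'ok' | 'unreachable' | 'disabled';

// (2b): carry simple's verdict, but stamp every reason string so the fallback
// is visible in the artifact and greppable in CI logs.
function fallbackJudgments(simple: Judgment[], errorCode: string): Judgment[] {
  return simple.map((j) => ({
    ...j,
    reason: `[LLM unavailable (${errorCode}); falling back to simple judge] ${j.reason}`,
    evidenceGrounded: false,
  }));
}

// (3): expose availability as a first-class summary.json field.
function annotateSummary<T extends object>(
  summary: T,
  availability: LlmAvailability,
): T & { llm: { availability: LlmAvailability } } {
  return { ...summary, llm: { availability } };
}
```

With this in place, the repro's jq check would show two different reason strings, and downstream consumers can read `.llm.availability` from the summary instead of parsing reasons.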

Acceptance criteria

  • An artifact from a dual-mode run with an unreachable LLM endpoint is distinguishable from one where the LLM ran and agreed. A string match or jq filter should flag it.
  • `summary.json` exposes LLM availability as a first-class field.
  • `isAvailable()` writes the error code/message to stderr when it fails, so CI logs show why.
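Assuming option (2b)'s `[LLM unavailable ...]` reason prefix, the first criterion reduces to a string check; this detector is a hypothetical consumer-side sketch, not framework code:

```typescript
// True only when the llmJudge verdict came from a live LLM run, i.e. its
// reason does not carry the fallback marker proposed in (2b).
function llmContributed(artifact: { llmJudge: { reason: string } }): boolean {
  return !artifact.llmJudge.reason.startsWith('[LLM unavailable');
}
```

The equivalent jq filter over an artifact file would test the same prefix on `.llmJudge.reason`.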

Out of scope

  • Making `isAvailable()` more aggressive (retry, warmup call, etc.) — separate decision.
  • Changing whether fallback passes or fails the run — this issue is about visibility first; the verdict-policy question is (2a) vs (2b) and can land later.
