[bug] LLM judge silently falls back to simple judge when endpoint unreachable — artifact is indistinguishable from a real LLM verdict #27

@dogkeeper886

Context

Surfaced while forensically tracing CI run 24659380207 — the first dual-mode dispatch on the self-hosted runner. The per-test JSON artifacts reported `llmJudge.reason` === `simpleJudge.reason` verbatim for every test, and `summary.json` claimed the LLM judge passed 1 and failed 1 — identical to simpleJudge's counters. It looked like the LLM agreed with simple.

The job's stderr (which the artifact doesn't preserve) told the real story:

```
[JUDGE] Running simple judge...
[JUDGE] Running LLM judge...
[WARN] LLM judge not available, using simple judge results
```

The LLM never ran. The artifact had no signal that this was the case.

Reproduction

Deterministic. Point the framework at an unreachable endpoint:

```
bash cicd/scripts/run-tests.sh --suite smoke --id TC-SMOKE-002 \
  --judge-url http://127.0.0.1:1 --format json > /tmp/out.json 2> /tmp/err.log

grep -E '\[JUDGE\]|\[WARN\]|\[LLM\]' /tmp/err.log
[JUDGE] Running simple judge...
[JUDGE] Running LLM judge...
[WARN] LLM judge not available, using simple judge results

jq '{simple: .simpleJudge.reason, llm: .llmJudge.reason}' \
  cicd/results/$(ls -1t cicd/results/ | head -1)/TC-SMOKE-002.json
{
  "simple": "All steps passed with exit code 0, patterns matched, no errors",
  "llm": "All steps passed with exit code 0, patterns matched, no errors"
}
```

The two reason strings are byte-identical because the fallback literally copies `simpleJudgments`.

Code location

`cicd/tests/src/cli.ts:139-154`:

```ts
let llmJudgments = simpleJudgments.map((j) => ({
...j,
reason: config.noLlm ? 'LLM judge disabled' : j.reason,
}));

if (!config.noLlm) {
process.stderr.write('[JUDGE] Running LLM judge...\n');
const llmJudge = new LLMJudge(config.judgeUrl, config.judgeModel);

const available = await llmJudge.isAvailable();
if (available) {
llmJudgments = await llmJudge.judgeResults(results);
await llmJudge.unloadModel();
} else {
process.stderr.write('[WARN] LLM judge not available, using simple judge results\n');
}
}
```

And `cicd/tests/src/judge/llm-judge.ts:27-36`:

```ts
async isAvailable(): Promise<boolean> {
try {
const response = await axios.get(`${this.ollamaUrl}/api/tags`, { timeout: 5000 });
return response.status === 200;
} catch {
return false;
}
}
```

Two problems compound each other:

  • `isAvailable()` swallows the axios error — we can't tell `ECONNREFUSED` from `ENOTFOUND` from a 5s timeout after the fact. Makes triage blind.
  • `cli.ts` writes `simpleJudgments` as `llmJudgments` when unavailable. The downstream JSON reporter writes them indistinguishably from real LLM output.

Why it matters

The whole point of dual-judge mode is that a disagreement between simple and LLM is a high-signal human-triage flag. When LLM silently clones simple, dual mode becomes no-op mode — but reports success. Regressions the LLM is uniquely positioned to catch get masked as passing.

Also: when this is paired with CLAUDE.md's statement that "LLM judge configuration lives in `cicd/tests/.env`", anyone debugging a mysterious lack of LLM coverage has no signal in the artifact to guide them.

Proposed fix

  1. Log the actual error in `isAvailable()`. Replace `catch {}` with `catch (e) { log with code and message }`. Costs nothing; saves hours.
  2. When unavailable, write a distinct verdict rather than copying simpleJudgments. Options:
    • (2a) `llmJudgments[i] = { testId, pass: false, reason: 'LLM judge unavailable: ${errorCode}', evidence: '' }` — fails the test (pessimistic; forces attention).
    • (2b) `pass: simpleJudgment.pass, reason: '[LLM unavailable — falling back to simple judge] ${simple.reason}', evidenceGrounded: false` — carries simple's verdict but makes the fallback visible.
  3. Annotate `summary.json` with `llm.availability: 'ok' | 'unreachable' | 'disabled'` so downstream consumers can check without parsing reasons.
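A minimal sketch of fix (1), assuming Node-style network errors that carry a `code` property (`ECONNREFUSED`, `ENOTFOUND`, `ETIMEDOUT`); `classifyError` is a hypothetical helper, not part of the framework:

```typescript
// Hypothetical helper: pull a stable error code out of whatever the HTTP
// client throws, falling back to the message for non-network errors.
function classifyError(e: unknown): string {
  if (e && typeof e === 'object' && 'code' in e) {
    return String((e as { code: unknown }).code);
  }
  return e instanceof Error ? e.message : 'unknown error';
}

// In isAvailable(), the bare catch would then become something like:
//   } catch (e) {
//     process.stderr.write(`[WARN] LLM judge unreachable: ${classifyError(e)}\n`);
//     return false;
//   }
```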

Pick (2b) + (3) as the least-surprise default: simple judge's verdict still drives the run, but every artifact and the summary make clear the LLM didn't contribute. Pessimistic-fail mode (2a) is defensible but changes the pass/fail contract of the run.
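What (2b) + (3) could look like, sketched over plain objects; the `Judgment` shape and field names are illustrative, inferred from the artifact fields quoted above rather than the framework's real types:

```typescript
// Illustrative shape, inferred from the artifact's simpleJudge/llmJudge fields.
interface Judgment {
  testId: string;
  pass: boolean;
  reason: string;
  evidenceGrounded?: boolean;
}

type LlmAvailability = 'ok' | 'unreachable' | 'disabled';

// (2b): carry simple's verdict, but stamp every reason string so the fallback
// is visible in the artifact and greppable in CI logs.
function fallbackJudgments(simple: Judgment[], errorCode: string): Judgment[] {
  return simple.map((j) => ({
    ...j,
    reason: `[LLM unavailable (${errorCode}); falling back to simple judge] ${j.reason}`,
    evidenceGrounded: false,
  }));
}

// (3): expose availability as a first-class summary.json field.
function annotateSummary<T extends object>(
  summary: T,
  availability: LlmAvailability,
): T & { llm: { availability: LlmAvailability } } {
  return { ...summary, llm: { availability } };
}
```

With this in place, the repro's jq check would show two different reason strings, and downstream consumers can read `.llm.availability` from the summary instead of parsing reasons.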

Acceptance criteria

  • An artifact from a dual-mode run with an unreachable LLM endpoint is distinguishable from one where the LLM ran and agreed. A string match or jq filter should flag it.
  • `summary.json` exposes LLM availability as a first-class field.
  • `isAvailable()` writes the error code/message to stderr when it fails, so CI logs show why.
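Assuming option (2b)'s `[LLM unavailable ...]` reason prefix, the first criterion reduces to a string check; this detector is a hypothetical consumer-side sketch, not framework code:

```typescript
// True only when the llmJudge verdict came from a live LLM run, i.e. its
// reason does not carry the fallback marker proposed in (2b).
function llmContributed(artifact: { llmJudge: { reason: string } }): boolean {
  return !artifact.llmJudge.reason.startsWith('[LLM unavailable');
}
```

The equivalent jq filter over an artifact file would test the same prefix on `.llmJudge.reason`.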

Out of scope

  • Making `isAvailable()` more aggressive (retry, warmup call, etc.) — separate decision.
  • Changing whether fallback passes or fails the run — this issue is about visibility first; the verdict-policy question is (2a) vs (2b) and can land later.
