Context
Surfaced while forensically tracing CI run 24659380207 — the first dual-mode dispatch on the self-hosted runner. The per-test JSON artifacts reported `llmJudge.reason` === `simpleJudge.reason` verbatim for every test, and `summary.json` claimed the LLM judge passed 1 and failed 1 — identical to simpleJudge's counters. It looked like the LLM agreed with simple.
The job's stderr (which the artifact doesn't preserve) told the real story:
```
[JUDGE] Running simple judge...
[JUDGE] Running LLM judge...
[WARN] LLM judge not available, using simple judge results
```
The LLM never ran. The artifact had no signal that this was the case.
Reproduction
Deterministic. Point the framework at an unreachable endpoint:
```
$ bash cicd/scripts/run-tests.sh --suite smoke --id TC-SMOKE-002 \
    --judge-url http://127.0.0.1:1 --format json > /tmp/out.json 2> /tmp/err.log

$ grep -E '\[JUDGE\]|\[WARN\]|\[LLM\]' /tmp/err.log
[JUDGE] Running simple judge...
[JUDGE] Running LLM judge...
[WARN] LLM judge not available, using simple judge results

$ jq '{simple: .simpleJudge.reason, llm: .llmJudge.reason}' \
    cicd/results/$(ls -1t cicd/results/ | head -1)/TC-SMOKE-002.json
{
  "simple": "All steps passed with exit code 0, patterns matched, no errors",
  "llm": "All steps passed with exit code 0, patterns matched, no errors"
}
```
The two reason strings are byte-identical because the fallback literally copies `simpleJudgments`.
Code location
`cicd/tests/src/cli.ts:139-154`:
```ts
let llmJudgments = simpleJudgments.map((j) => ({
...j,
reason: config.noLlm ? 'LLM judge disabled' : j.reason,
}));
if (!config.noLlm) {
process.stderr.write('[JUDGE] Running LLM judge...\n');
const llmJudge = new LLMJudge(config.judgeUrl, config.judgeModel);
const available = await llmJudge.isAvailable();
if (available) {
llmJudgments = await llmJudge.judgeResults(results);
await llmJudge.unloadModel();
} else {
process.stderr.write('[WARN] LLM judge not available, using simple judge results\n');
}
}
```
And `cicd/tests/src/judge/llm-judge.ts:27-36`:
```ts
async isAvailable(): Promise<boolean> {
try {
const response = await axios.get(`${this.ollamaUrl}/api/tags`, { timeout: 5000 });
return response.status === 200;
} catch {
return false;
}
}
```
Two problems compound each other:
- `isAvailable()` swallows the axios error — after the fact we can't tell `ECONNREFUSED` from `ENOTFOUND` from a 5-second timeout. This makes triage blind.
- `cli.ts` writes `simpleJudgments` as `llmJudgments` when unavailable. The downstream JSON reporter writes them indistinguishably from real LLM output.
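The swallowed error is cheap to surface. A minimal sketch, assuming axios-style errors whose `code` field carries values like `ECONNREFUSED` or `ENOTFOUND` (the helper name `formatAvailabilityWarning` is hypothetical, not in the codebase):

```typescript
// Hypothetical helper: turn a caught error into a triage-friendly warning line.
// Node/axios network errors expose a `code` (ECONNREFUSED, ENOTFOUND, ETIMEDOUT, ...).
function formatAvailabilityWarning(err: unknown): string {
  const e = err as { code?: string; message?: string };
  const code = e?.code ?? 'UNKNOWN';
  const message = e?.message ?? String(err);
  return `[WARN] LLM judge not available (${code}): ${message}\n`;
}

// Sketch of usage inside isAvailable()'s catch block:
// } catch (e) {
//   process.stderr.write(formatAvailabilityWarning(e));
//   return false;
// }
```

With this, the CI log alone distinguishes a refused connection from a DNS failure from a timeout.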
Why it matters
The whole point of dual-judge mode is that a disagreement between simple and LLM is a high-signal human-triage flag. When LLM silently clones simple, dual mode becomes no-op mode — but reports success. Regressions the LLM is uniquely positioned to catch get masked as passing.
Also: given CLAUDE.md's statement that "LLM judge configuration lives in `cicd/tests/.env`", anyone debugging a mysterious lack of LLM coverage has no signal in the artifact to guide them.
Proposed fix
1. Log the actual error in `isAvailable()`. Replace `catch {}` with `catch (e) { log with code and message }`. Costs nothing; saves hours.
2. When unavailable, write a distinct verdict rather than copying `simpleJudgments`. Options:
   - (2a) `llmJudgments[i] = { testId, pass: false, reason: 'LLM judge unavailable: ${errorCode}', evidence: '' }` — fails the test (pessimistic; forces attention).
   - (2b) `pass: simpleJudgment.pass, reason: '[LLM unavailable — falling back to simple judge] ${simple.reason}', evidenceGrounded: false` — carries simple's verdict but makes the fallback visible.
3. Annotate `summary.json` with `llm.availability: 'ok' | 'unreachable' | 'disabled'` so downstream consumers can check without parsing reasons.
Pick (2b) + (3) as the least-surprise default: simple judge's verdict still drives the run, but every artifact and the summary make clear the LLM didn't contribute. Pessimistic-fail mode (2a) is defensible but changes the pass/fail contract of the run.
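Options (2b) and (3) can be sketched as pure transforms. The `Judgment` shape and field names below are assumptions for illustration, not the framework's actual types:

```typescript
// Assumed judgment shape; the real type lives in the test framework.
interface Judgment {
  testId: string;
  pass: boolean;
  reason: string;
  evidenceGrounded?: boolean;
}

type LlmAvailability = 'ok' | 'unreachable' | 'disabled';

// Option (2b): carry simple's verdict, but make the fallback unmissable
// in every per-test artifact.
function fallbackJudgments(simple: Judgment[]): Judgment[] {
  return simple.map((j) => ({
    ...j,
    reason: `[LLM unavailable — falling back to simple judge] ${j.reason}`,
    evidenceGrounded: false,
  }));
}

// Option (3): a first-class availability field for summary.json.
function summarizeLlm(availability: LlmAvailability): { llm: { availability: LlmAvailability } } {
  return { llm: { availability } };
}
```

Because the prefix is prepended mechanically, a byte-identical `reason` between the two judges can never again mean "maybe the LLM agreed, maybe it never ran."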
Acceptance criteria
- An artifact from a dual-mode run with an unreachable LLM endpoint is distinguishable from one where the LLM ran and agreed. A string match or jq filter should flag it.
- `summary.json` exposes LLM availability as a first-class field.
- `isAvailable()` writes the error code/message to stderr when it fails, so CI logs show why.
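Assuming proposal (3)'s `llm.availability` field lands in `summary.json`, the first two criteria reduce to a trivial check a CI step could run. This gate function is a hypothetical sketch, not existing code:

```typescript
// CI gate sketch: did the LLM judge actually contribute to this run?
// Assumes summary.json gains the proposed `llm.availability` field;
// a missing field (older artifacts) conservatively reads as "did not run".
function llmActuallyRan(summary: { llm?: { availability?: string } }): boolean {
  return summary?.llm?.availability === 'ok';
}
```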
Out of scope
- Making `isAvailable()` more aggressive (retry, warmup call, etc.) — separate decision.
- Changing whether fallback passes or fails the run — this issue is about visibility first; the verdict-policy question is (2a) vs (2b) and can land later.
Related