Commit 9cd8765
fix(judge): tighten creation-review rubric to eliminate false rejects

Production data made the false-reject patterns concrete: in a single
demo session the judge rejected three forge attempts, citing either
hedge language ("cannot confidently verify") or speculation about test
coverage ("provided evaluation does not verify edge cases"), while
only one closely related forge passed. None of the rejections cited
an actual safety, correctness, determinism, or boundedness violation.

Rubric edits (in buildCreationPromptParts):
  1. Frame each criterion as binary pass/fail with a NAMED offending
     construct on fail. PASS unless you can name the violation.
  2. Add explicit APPROVAL RULES (hard) section:
     - All four criteria pass → approved=true with confidence in [0.7, 1.0]
     - "Cannot confidently verify" is NOT a violation; cannot-verify
       means approve.
     - Don't reject because you wish there were different/more tests.
     - Don't reject for stylistic preferences (try/catch, naming, length).
     - Discrepancy between author's expectedOutput and the code's actual
       output is the AUTHOR's problem: the code is the source of truth
       as long as it conforms to the schema.

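Read as a decision procedure, the rules above could be sketched roughly as follows. This is only an illustration: the rubric itself is prompt text for the judge, and the names `Criterion`, `CriterionResult`, `Verdict`, and `decide` are hypothetical, not taken from the codebase. The [0.7, 1.0] confidence band is not modeled here.

```typescript
// Hypothetical types; the commit does not show the judge's real schema.
type Criterion = "safety" | "correctness" | "determinism" | "boundedness";

interface CriterionResult {
  criterion: Criterion;
  // The named offending construct, or null when the judge could not name one.
  offendingConstruct: string | null;
}

interface Verdict {
  approved: boolean;
  violations: CriterionResult[];
}

// A criterion fails only when a concrete offending construct is named.
// "Cannot confidently verify" yields no named construct, so it counts as a pass.
function decide(results: CriterionResult[]): Verdict {
  const violations = results.filter((r) => r.offendingConstruct !== null);
  return { approved: violations.length === 0, violations };
}
```

Under this reading, a judge that can only say it is unsure about determinism produces `offendingConstruct: null` for that criterion, and the forge is approved.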
The author-expectation rule is the biggest single change. The
production "FAIL" reasoning frequently complained that "Test 1 output
does not match the declared expected results" — but in a forge, the
LLM author wrote both the code AND the expectedOutputs, and the code
runs in a deterministic sandbox. Inconsistency between the two is
proof of a hallucinated expectation, not buggy code. Reject the
expectation, not the tool.
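
The author-expectation rule can be illustrated with a small classifier. This is a sketch under stated assumptions, not the project's implementation: `ForgeTestRun`, `Finding`, and `classifyRun` are hypothetical names, and schema conformance is assumed to have been checked elsewhere and passed in as a boolean.

```typescript
// Hypothetical shape of one sandboxed test run in a forge.
interface ForgeTestRun {
  expectedOutput: unknown; // written by the LLM author
  actualOutput: unknown;   // produced by the deterministic sandbox
  conformsToSchema: boolean; // assumed to be computed by a separate check
}

type Finding = "ok" | "hallucinated-expectation" | "schema-violation";

// The code's actual output is treated as the source of truth. A mismatch
// with the author's expectation is blamed on the expectation, provided the
// output still conforms to the schema.
function classifyRun(run: ForgeTestRun): Finding {
  if (!run.conformsToSchema) return "schema-violation";
  // Simple structural comparison; note it is sensitive to object key order.
  const matches =
    JSON.stringify(run.expectedOutput) === JSON.stringify(run.actualOutput);
  return matches ? "ok" : "hallucinated-expectation";
}
```

Only `"schema-violation"` would count against the tool here; `"hallucinated-expectation"` discards the expected value and leaves the tool approved.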

39 existing tests still pass. New 40th test pins the explicit rule
guidance ("cannot confidently verify", "the code is the source of
truth", "try/catch") so future edits can't accidentally remove the
hard approval rules without flagging the regression.

1 parent e0d15ee commit 9cd8765
2 files changed
Lines changed: 34 additions & 5 deletions
File 1 of 2 (path and diff content not captured): original lines 445-449 removed, new lines 445-458 added (5 deletions, 14 additions).
File 2 of 2 (path and diff content not captured): new lines 236-255 added (20 additions).