feat: add tests for judge consistency #3

tmickleydoyle · 2025-10-20T14:09:36Z

Created judge consistency tests to ensure stable evaluation behavior when changing judge instructions.

New Files:

tests/judgeConsistency.test.ts - 6 tests (3 runs each) for the Logic Equivalence and API Signature scores
tests/fixtures/judgeConsistencyFixtures.ts - Static diff pairs (perfect, wrong, ambiguous)
scripts/show-test-outputs.ts - Debug script to view full judge rationales

Modified Files:

scores/logic-equivalence.ts - Exported systemPrompt
scores/api-signature.ts - Exported systemPrompt
package.json - Added test, test:consistency, show-test-outputs scripts

How It Works:

Tests run the same diffs 3 times and verify that the judges score them consistently
Perfect matches must score 1, wrong candidates must score 0, and ambiguous cases must get the same score on every run
All tests passing ✅, build/lint clean ✅
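
As a rough illustration of the rule above, here is a minimal sketch of the consistency check. The names `Judge`, `runRepeated`, and `checkScores` are hypothetical and are not taken from tests/judgeConsistency.test.ts; the sketch also assumes a judge returns a numeric score for a reference/candidate diff pair.

```typescript
// Hypothetical shapes for illustration only; the real test file may differ.
type DiffPair = { reference: string; candidate: string };
type Judge = (pair: DiffPair) => Promise<number>;
type Expectation = "perfect" | "wrong" | "ambiguous";

// Run the same judge on the same diff pair several times and collect the scores.
async function runRepeated(judge: Judge, pair: DiffPair, runs = 3): Promise<number[]> {
  const scores: number[] = [];
  for (let i = 0; i < runs; i++) {
    scores.push(await judge(pair));
  }
  return scores;
}

// Every run must agree; perfect pairs must also score 1 and wrong pairs 0.
// Ambiguous pairs only need to be scored the same way on every run.
function checkScores(scores: number[], expectation: Expectation): boolean {
  if (scores.length === 0) return false;
  const allEqual = scores.every((s) => s === scores[0]);
  if (!allEqual) return false;
  if (expectation === "perfect") return scores[0] === 1;
  if (expectation === "wrong") return scores[0] === 0;
  return true;
}
```

A test could then call `runRepeated` with a fixture pair and assert that `checkScores` returns true for that fixture's expectation.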

pkg-pr-new · 2025-10-20T14:09:57Z


npm i https://pkg.pr.new/sst/opencode-bench@3

commit: 69e28cd

Aslemammad · 2025-10-21T03:20:24Z

tests/fixtures/judgeConsistencyFixtures.ts

+  /**
+   * Clear Mismatch: Parameter name changed.
+   * Reference: process_batch(items, validate)
+   * Candidate: process_batch(data, validate) - parameter name differs


Isn't this too strict? The parameter name is different, but basically everything else is the same and the output is the same.

Aslemammad · 2025-10-21T03:28:44Z

tests/judgeConsistency.test.ts

+  const { reference, candidate } = diffPair;
+
+  if (scoreType === "logic-equivalence") {
+    return `Reference diff:\n${reference}\n\nCandidate diff:\n${candidate}\n\nCompare ONLY the logical behavior (conditions, edge cases, side effects). Ignore code structure and style. Respond with JSON.`;


Can we reuse what we have in the scores files, to keep things consistent?
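
One way to read this suggestion is a single prompt builder shared by the score modules and the tests, so the wording cannot drift between them. The sketch below assumes that shape; `buildJudgeMessages` and the placeholder strings in `systemPrompts` are hypothetical, and in the repository the system prompts would be the `systemPrompt` exports from scores/logic-equivalence.ts and scores/api-signature.ts.

```typescript
type ScoreType = "logic-equivalence" | "api-signature";
type DiffPair = { reference: string; candidate: string };

// Placeholder text only: stand-ins for the systemPrompt values exported by the score files.
const systemPrompts: Record<ScoreType, string> = {
  "logic-equivalence": "<systemPrompt exported from scores/logic-equivalence.ts>",
  "api-signature": "<systemPrompt exported from scores/api-signature.ts>",
};

// Hypothetical helper: build the judge input once, for both production scoring and tests.
function buildJudgeMessages(scoreType: ScoreType, pair: DiffPair): { system: string; user: string } {
  const user = `Reference diff:\n${pair.reference}\n\nCandidate diff:\n${pair.candidate}`;
  return { system: systemPrompts[scoreType], user };
}
```

With a helper like this, the test would import the builder instead of keeping its own copy of the instruction text.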

tmickleydoyle · 2025-10-23T00:19:35Z

@Aslemammad, I think I need a review before I can merge

tmickleydoyle had a problem deploying to production October 20, 2025 14:10 — with GitHub Actions Failure

tmickleydoyle temporarily deployed to production October 20, 2025 14:10 — with GitHub Actions Inactive


Aslemammad requested changes Oct 21, 2025


tmickleydoyle marked this pull request as ready for review October 21, 2025 14:53

tmickleydoyle had a problem deploying to production October 22, 2025 18:56 — with GitHub Actions Failure

tmickleydoyle temporarily deployed to production October 22, 2025 18:56 — with GitHub Actions Inactive


tmickleydoyle temporarily deployed to production October 22, 2025 19:26 — with GitHub Actions Inactive

tmickleydoyle temporarily deployed to production October 22, 2025 19:39 — with GitHub Actions Inactive

tmickleydoyle temporarily deployed to production October 22, 2025 19:40 — with GitHub Actions Inactive

tmickleydoyle requested a review from Aslemammad October 22, 2025 20:03

tmickleydoyle had a problem deploying to production October 22, 2025 20:12 — with GitHub Actions Failure

tmickleydoyle temporarily deployed to production October 22, 2025 20:12 — with GitHub Actions Inactive

tmickleydoyle had a problem deploying to production October 22, 2025 20:12 — with GitHub Actions Error

tmickleydoyle had a problem deploying to production October 22, 2025 20:21 — with GitHub Actions Failure

tmickleydoyle temporarily deployed to production October 22, 2025 23:54 — with GitHub Actions Inactive

tmickleydoyle temporarily deployed to production October 23, 2025 00:23 — with GitHub Actions Inactive

tmickleydoyle temporarily deployed to production October 23, 2025 00:24 — with GitHub Actions Inactive

tmickleydoyle temporarily deployed to production October 23, 2025 00:55 — with GitHub Actions Inactive

Aslemammad approved these changes Oct 23, 2025


tmickleydoyle added 6 commits October 23, 2025 06:35

feat: add tests for judge consistency

0c886ce

feat: add tests for judge consistency

586217f

fix: use same prompts in tests

928f644

test: add judge consistency tests

ae84009

fix: add github token to actions

358d6f0

fix: success metric for agent test

bdfe9d5

Aslemammad force-pushed the tmd/test-diff-scoring branch from 69e28cd to bdfe9d5 October 23, 2025 03:07

Aslemammad temporarily deployed to production October 23, 2025 03:07 — with GitHub Actions Inactive

Aslemammad merged commit edea019 into main Oct 23, 2025
7 checks passed

Aslemammad deleted the tmd/test-diff-scoring branch October 23, 2025 03:07

Aslemammad temporarily deployed to production October 23, 2025 03:07 — with GitHub Actions Inactive

Aslemammad temporarily deployed to production October 23, 2025 03:35 — with GitHub Actions Inactive
