@tmickleydoyle
Collaborator

Created judge consistency tests to ensure stable evaluation behavior when changing judge instructions.

New Files:

  • tests/judgeConsistency.test.ts - 6 tests (3 runs each) for Logic Equivalence & API Signature scores
  • tests/fixtures/judgeConsistencyFixtures.ts - Static diff pairs (perfect, wrong, ambiguous)
  • scripts/show-test-outputs.ts - Debug script to view full judge rationales

Modified Files:

  • scores/logic-equivalence.ts - Exported systemPrompt
  • scores/api-signature.ts - Exported systemPrompt
  • package.json - Added test, test:consistency, show-test-outputs scripts

How It Works:

  • Tests run the same diffs 3 times and verify that the judges return consistent scores
  • Perfect matches must score 1, wrong must score 0, ambiguous must be consistent
  • All tests passing ✅, build/lint clean ✅
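
The consistency rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in tests/judgeConsistency.test.ts: `Expectation`, `scoresAreStable`, and `runRepeated` are hypothetical names, and the judge is assumed to be any async function that returns a numeric score.

```typescript
// Hypothetical helper names; none of these are taken from the PR itself.
type Expectation =
  | { kind: "exact"; score: 0 | 1 } // perfect pairs expect 1, wrong pairs expect 0
  | { kind: "consistent" }; // ambiguous pairs only need to agree with themselves

// Returns true when the repeated judge scores satisfy the expectation.
function scoresAreStable(scores: number[], expectation: Expectation): boolean {
  if (scores.length === 0) return false;
  if (expectation.kind === "exact") {
    const target = expectation.score;
    return scores.every((s) => s === target);
  }
  // "consistent": every run must match the first run.
  const first = scores[0];
  return scores.every((s) => s === first);
}

// Calls the judge `runs` times on the same input and collects the scores.
// The judge is assumed to be an async function returning a number.
async function runRepeated(
  judge: () => Promise<number>,
  runs = 3,
): Promise<number[]> {
  const scores: number[] = [];
  for (let i = 0; i < runs; i++) {
    scores.push(await judge());
  }
  return scores;
}
```

Running the calls sequentially keeps the sketch simple; a real test might run them in parallel to save time, at the cost of harder-to-read failures.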

@pkg-pr-new

pkg-pr-new bot commented Oct 20, 2025

npm i https://pkg.pr.new/sst/opencode-bench@3

commit: 69e28cd

/**
* Clear Mismatch: Parameter name changed.
* Reference: process_batch(items, validate)
* Candidate: process_batch(data, validate) - parameter name differs
Collaborator

Isn't this too strict? The parameter name is different, but basically everything else is the same and the output is the same.

const { reference, candidate } = diffPair;

if (scoreType === "logic-equivalence") {
return `Reference diff:\n${reference}\n\nCandidate diff:\n${candidate}\n\nCompare ONLY the logical behavior (conditions, edge cases, side effects). Ignore code structure and style. Respond with JSON.`;
Collaborator

Can we reuse what we have in the scores files, to keep things consistent?
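
One way to act on this suggestion is to pass the exported `systemPrompt` into the test's message builder instead of writing a separate instruction string in the test. The sketch below is only an illustration under assumptions: `DiffPair`, `JudgeMessage`, and `buildJudgeMessages` are hypothetical names, and the message shape is not taken from the scores files.

```typescript
// Hypothetical types and names; the real scores modules may be shaped differently.
interface DiffPair {
  reference: string;
  candidate: string;
}

interface JudgeMessage {
  role: "system" | "user";
  content: string;
}

// The system prompt is a parameter rather than a string defined in the test,
// so the test can hand over exactly what a scores module exports.
function buildJudgeMessages(
  systemPrompt: string,
  pair: DiffPair,
): JudgeMessage[] {
  return [
    { role: "system", content: systemPrompt },
    {
      role: "user",
      content: `Reference diff:\n${pair.reference}\n\nCandidate diff:\n${pair.candidate}`,
    },
  ];
}
```

In a test, the first argument would be the `systemPrompt` exported from scores/logic-equivalence.ts or scores/api-signature.ts, so that a change to the judge instructions is exercised by the consistency tests automatically.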

@tmickleydoyle tmickleydoyle marked this pull request as ready for review October 21, 2025 14:53
@tmickleydoyle
Collaborator Author

@Aslemammad, I think I need a review before I can merge

@Aslemammad Aslemammad force-pushed the tmd/test-diff-scoring branch from 69e28cd to bdfe9d5 Compare October 23, 2025 03:07
@Aslemammad Aslemammad merged commit edea019 into main Oct 23, 2025
7 checks passed
@Aslemammad Aslemammad deleted the tmd/test-diff-scoring branch October 23, 2025 03:07

3 participants