Add deterministic eval suite (+ README/manifest refresh) #4

Closed
drakulavich wants to merge 16 commits into main from eval-suite

@drakulavich
Owner

Summary

Two strands of work, bundled:

1. Eval suite (evals/) — a deterministic, offline, keyless benchmark for Iago's diagram pipeline, plus published results.

  • Scores three axes over a corpus: diagram-type selection, abstention, and heuristic Mermaid validity (selection + abstain modeled as one 5-class problem).
  • Imports the production classify_diagram_type / heuristic_diagram from action/scripts/run.py — benchmarks shipping code, no reimplementation.
  • Mermaid validity via a parse-only Node helper (mermaid.parse(), no headless browser).
  • python evals/run.py regenerates marker-delimited result blocks in BENCHMARKS.md + README.md and writes badge.json / results.json; --check gates CI on thresholds (accuracy ≥ baseline, validity 100%) and published-artifact drift.
  • Hybrid corpus: 7 hand-authored synthetic cases (one per rubric branch + 3 abstain mechanisms) now; evals/capture.py snapshots real public PRs with provenance as the corpus grows.
  • New keyless evals job in .github/workflows/test.yml.
  • Baseline: 7 cases, 100% selection accuracy, 4/4 heuristic Mermaid valid.
  • BENCHMARKS.md is explicit about what is not measured yet (LLM-output faithfulness / LLM-Mermaid validity — spot-checked only, LLM-judge eval on the roadmap).
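Treating selection and abstention as one 5-class problem means a single accuracy figure covers both "picked the right diagram type" and "correctly declined to draw one". A minimal sketch of that scoring follows; it is not the code in evals/run.py, and the label names are hypothetical (the PR only states that there are four diagram types plus abstain).

```python
from collections import Counter

# Hypothetical label set: four diagram types plus an explicit "abstain" class.
# The real class names used by classify_diagram_type are not shown in this PR.
LABELS = ("flowchart", "sequence", "er", "class", "abstain")


def selection_accuracy(cases):
    """Return the fraction of (expected, predicted) pairs that match.

    Abstention is scored like any other class, so a wrong abstain and a
    wrong diagram type both count as one miss.
    """
    cases = list(cases)
    if not cases:
        raise ValueError("empty corpus")
    for expected, predicted in cases:
        if expected not in LABELS or predicted not in LABELS:
            raise ValueError(f"unknown label in case: {(expected, predicted)}")
    correct = sum(1 for expected, predicted in cases if expected == predicted)
    return correct / len(cases)


def confusion(cases):
    """Count each (expected, predicted) pair, useful for spotting systematic misses."""
    return Counter(cases)
```

A threshold gate then reduces to comparing `selection_accuracy(...)` against a stored baseline value.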

Caught a real bug: fix(action) — the heuristic ER generator emitted 003_ORDERS for a migration named 003_orders.sql; Mermaid rejects identifiers starting with a digit. Fixed by stripping numeric migration prefixes; covered by a regression test.
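The fix can be illustrated with a small sketch. This is not the actual patch in action/scripts/run.py; the function name `entity_name` and the fallback for all-numeric file names are assumptions made for illustration.

```python
import re
from pathlib import PurePosixPath


def entity_name(migration_path: str) -> str:
    """Derive an ER entity name from a migration file name (hypothetical helper).

    Leading digits and separators are stripped before uppercasing so the
    result never starts with a digit.
    """
    stem = PurePosixPath(migration_path).stem          # "003_orders.sql" -> "003_orders"
    stripped = re.sub(r"^[\d_\-]+", "", stem)          # drop numeric prefix and separators
    if not stripped:
        # Assumption: an all-numeric name gets a letter prefix instead of an empty entity.
        stripped = f"table_{stem}"
    # Replace anything outside letters, digits, and underscores, then uppercase.
    return re.sub(r"[^A-Za-z0-9_]", "_", stripped).upper()
```

With this sketch, `entity_name("003_orders.sql")` yields `ORDERS` rather than `003_ORDERS`.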

2. Docs/manifest refresh

  • README restyled in kesha-voice-kit style (centered logo/title, Tests/npm/MIT/Bun + selection-accuracy badges, bold-lead feature bullets, footer flourish).
  • Synced plugin descriptions across plugin.json + marketplace.json; added greptile/squawk keywords.
  • Design spec + implementation plan under docs/superpowers/.

Test Plan

  • python -m pytest evals/tests → 23 passed
  • python evals/run.py --check → exit 0 (100% accuracy, 4/4 validity)
  • Drift gate verified: tampering with a published artifact makes --check fail (exit 1); restoring it clears the failure
  • All 4 Mermaid diagram types parse headlessly on Node
  • CI evals job green on this PR
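The drift gate can be pictured as "regenerate the marker-delimited block and compare it with what is on disk". The sketch below shows one way to do that under stated assumptions: the marker strings and function names are hypothetical, and the real evals/run.py may structure this differently.

```python
import re

# Hypothetical markers; the actual delimiters in BENCHMARKS.md / README.md are not shown in this PR.
BEGIN = "<!-- evals:begin -->"
END = "<!-- evals:end -->"
_BLOCK = re.compile(re.escape(BEGIN) + r".*?" + re.escape(END), re.DOTALL)


def render_block(results: str) -> str:
    """Wrap freshly computed results in the begin/end markers."""
    return f"{BEGIN}\n{results}\n{END}"


def regenerate(document: str, results: str) -> str:
    """Return the document with its first marker block replaced by fresh results."""
    if not _BLOCK.search(document):
        raise ValueError("marker block not found")
    # A function replacement avoids treating backslashes in results as regex escapes.
    return _BLOCK.sub(lambda _match: render_block(results), document, count=1)


def has_drifted(document: str, results: str) -> bool:
    """True when the published block no longer matches the freshly computed results."""
    return regenerate(document, results) != document
```

In a `--check` style mode, a script built this way would exit non-zero whenever `has_drifted` returns `True` for any published file, which matches the tamper/restore behaviour described in the test plan.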

drakulavich and others added 16 commits May 30, 2026 09:33
Deterministic, keyless benchmark of diagram-type selection, abstention,
and heuristic Mermaid validity over a hybrid synthetic+real PR corpus.
Publishes to BENCHMARKS.md + README badge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…k keywords

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Centered logo/title, Tests/npm/MIT/Bun badges, bold-lead feature
bullets, and a footer flourish.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Mermaid rejects identifiers starting with a digit, so a migration named
003_orders.sql produced an unparseable `003_ORDERS` entity. Strip leading
digits and separators before uppercasing. Caught by the new eval suite.
@drakulavich
Owner Author

Closing in favor of a clean skill-first reorientation: dropping the GitHub Action (and the eval suite that tested it), refocusing Iago as a Claude Code / Codex skill. New PR to follow.
