Add deterministic eval suite (+ README/manifest refresh) by drakulavich · Pull Request #4 · drakulavich/iago

drakulavich · 2026-05-30T16:07:03Z

Summary

Two strands of work, bundled:

1. Eval suite (evals/) — a deterministic, offline, keyless benchmark for Iago's diagram pipeline, plus published results.

Scores three axes over a corpus: diagram-type selection, abstention, and heuristic Mermaid validity. Selection and abstention are modeled together as one 5-class problem (the four diagram types plus abstain).
Imports the production classify_diagram_type / heuristic_diagram from action/scripts/run.py — benchmarks shipping code, no reimplementation.
Mermaid validity via a parse-only Node helper (mermaid.parse(), no headless browser).
python evals/run.py regenerates marker-delimited result blocks in BENCHMARKS.md + README.md and writes badge.json / results.json; --check gates CI on thresholds (accuracy ≥ baseline, validity 100%) and published-artifact drift.
Hybrid corpus: 7 hand-authored synthetic cases for now (one per rubric branch, plus 3 abstain mechanisms); evals/capture.py snapshots real public PRs with provenance as the corpus grows.
New keyless evals job in .github/workflows/test.yml.
Baseline: 7 cases, 100% selection accuracy, 4/4 heuristic Mermaid diagrams valid.
BENCHMARKS.md is explicit about what is not measured yet (LLM-output faithfulness / LLM-Mermaid validity — spot-checked only, LLM-judge eval on the roadmap).
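The scoring and threshold gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in evals/: the label names, `score`, `per_class_recall`, and `passes_gate` are all hypothetical, and the sample pairs are toy data rather than results from the corpus.

```python
from collections import Counter

# Hypothetical label set: four diagram types plus an explicit "abstain" class,
# so selection and abstention are scored as one 5-class problem.
LABELS = ["flowchart", "sequence", "er", "state", "abstain"]


def score(pairs):
    """Return (accuracy, confusion) for an iterable of (expected, predicted)."""
    pairs = list(pairs)
    confusion = Counter(pairs)  # keyed by (expected, predicted)
    correct = sum(n for (exp, pred), n in confusion.items() if exp == pred)
    accuracy = correct / len(pairs) if pairs else 0.0
    return accuracy, confusion


def per_class_recall(confusion):
    """Recall per label, skipping labels with no expected cases."""
    recall = {}
    for label in LABELS:
        total = sum(n for (exp, _), n in confusion.items() if exp == label)
        if total:
            recall[label] = confusion[(label, label)] / total
    return recall


def passes_gate(accuracy, valid, total, baseline):
    """Gate as described: accuracy at or above baseline, validity at 100%."""
    return accuracy >= baseline and valid == total


# Toy data: one sequence case wrongly abstained on.
toy = [("er", "er"), ("flowchart", "flowchart"),
       ("abstain", "abstain"), ("sequence", "abstain")]
accuracy, confusion = score(toy)
```

Treating abstain as an ordinary class means a wrong abstention and a wrong diagram type both appear as off-diagonal cells of the same confusion matrix.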

Caught a real bug, fixed in fix(action): the heuristic ER generator emitted 003_ORDERS for a migration named 003_orders.sql, and Mermaid rejects identifiers starting with a digit. Fixed by stripping numeric migration prefixes; covered by a regression test.
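A fix of this shape might look like the sketch below. It is an assumption-laden illustration rather than the code in action/scripts/run.py: the function name `entity_name`, the exact separator set, and the `MIGRATION_` fallback for all-numeric names are hypothetical.

```python
import re
from pathlib import PurePosixPath


def entity_name(migration_path: str) -> str:
    """Derive an ER entity name that does not start with a digit."""
    stem = PurePosixPath(migration_path).stem  # "003_orders.sql" -> "003_orders"
    # Strip leading digits and separators before uppercasing.
    stripped = re.sub(r"^[\d_\-]+", "", stem)
    if not stripped:
        # Hypothetical fallback: an all-numeric name gets a letter-led prefix.
        return f"MIGRATION_{stem}"
    return stripped.upper()
```

With this sketch, `entity_name("003_orders.sql")` returns `"ORDERS"` instead of `"003_ORDERS"`.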

2. Docs/manifest refresh

README restyled in kesha-voice-kit style (centered logo/title, Tests/npm/MIT/Bun + selection-accuracy badges, bold-lead feature bullets, footer flourish).
Synced plugin descriptions across plugin.json + marketplace.json; added greptile/squawk keywords.
Design spec + implementation plan under docs/superpowers/.

Test Plan

python -m pytest evals/tests → 23 passed
python evals/run.py --check → exit 0 (100% accuracy, 4/4 validity)
Drift gate verified: tampering with an artifact makes --check fail (exit 1); restoring it clears the failure
All 4 Mermaid diagram types parse headlessly on Node
CI evals job green on this PR
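The drift gate in the test plan can be pictured as: regenerate the marker-delimited block in memory and compare it with what is published. The sketch below assumes an HTML-comment marker format and the helper names `replace_block` and `has_drifted`; the real markers and functions in evals/run.py may differ.

```python
import re

# Hypothetical marker format; the actual delimiters are not shown in this PR.
START = "<!-- evals:{name}:start -->"
END = "<!-- evals:{name}:end -->"


def replace_block(text: str, name: str, body: str) -> str:
    """Replace the content between the named markers with `body`."""
    start, end = START.format(name=name), END.format(name=name)
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end), re.DOTALL)
    if not pattern.search(text):
        raise ValueError(f"marker block {name!r} not found")
    replacement = f"{start}\n{body}\n{end}"
    # A callable replacement avoids backslash interpretation in `body`.
    return pattern.sub(lambda _match: replacement, text, count=1)


def has_drifted(published: str, name: str, body: str) -> bool:
    """True if regenerating the block would change the published text."""
    return replace_block(published, name, body) != published


doc = "# Benchmarks\n<!-- evals:summary:start -->\nold\n<!-- evals:summary:end -->\n"
fresh = replace_block(doc, "summary", "accuracy: 100%")
tampered = fresh.replace("100%", "99%")
```

Because regeneration is idempotent, a freshly written file never reports drift, while any hand edit inside the markers does; a `--check` style gate could then exit non-zero when `has_drifted` returns true.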

Deterministic, keyless benchmark of diagram-type selection, abstention, and heuristic Mermaid validity over a hybrid synthetic+real PR corpus. Publishes to BENCHMARKS.md + README badge. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


drakulavich · 2026-05-30T18:22:14Z

Closing in favor of a clean skill-first reorientation: dropping the GitHub Action (and the eval suite that tested it), refocusing Iago as a Claude Code / Codex skill. New PR to follow.

drakulavich and others added 16 commits May 30, 2026 09:33

chore(plugin): sync descriptions across manifests, add greptile/squawk keywords

33fd68f

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs(readme): restyle header in kesha-voice-kit style

29c7fb7

Centered logo/title, Tests/npm/MIT/Bun badges, bold-lead feature bullets, and a footer flourish. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs: implementation plan for iago eval suite

d56eacd

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(evals): bootstrap import of production diagram functions

7369a1a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(evals): corpus loader with schema validation

c911611

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(evals): scoring with confusion matrix and per-class metrics

83dcccb

feat(evals): parse-only mermaid validator with python wrapper

1d11f8a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(evals): reporters for tables, markers, badge, results.json

1af8343

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test(evals): synthetic corpus covering each rubric branch + abstain

89d38dc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(evals): gh-backed real-PR capture helper

8b71635

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(evals): driver with --check gate for thresholds and doc drift

960a58b

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs: add BENCHMARKS.md, README benchmarks section and accuracy badge

cf1c9f3

fix(action): strip numeric migration prefix from ER entity names

e64f82a

Mermaid rejects identifiers starting with a digit, so a migration named 003_orders.sql produced an unparseable `003_ORDERS` entity. Strip leading digits and separators before uppercasing. Caught by the new eval suite.

chore(evals): generate baseline results and set accuracy threshold

8c07af6

ci: run eval harness tests and --check gate

19ad342

drakulavich closed this May 30, 2026

drakulavich mentioned this pull request May 30, 2026

Drop the GitHub Action — refocus Iago as a Claude Code / Codex skill #5

Merged

4 tasks

drakulavich deleted the eval-suite branch May 30, 2026 18:44

drakulavich wants to merge 16 commits into main from eval-suite
