
chore(eval): 2026-02-28 eval run across 5 models with v0.1.7 analysis #402

Merged: chaliy merged 3 commits into main from claude/run-evals-update-readme-XlkTR on Mar 1, 2026

@chaliy chaliy commented Mar 1, 2026

Summary

Key findings

Test plan

  • Eval results are valid JSON
  • README updates reflect actual scores
  • No code changes — eval data and docs only
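
The first test-plan item could be checked with a short script along these lines. This is a minimal sketch, not the project's actual check: the function name `find_invalid_json` and the idea of scanning a results directory for `*.json` files are assumptions made for illustration.

```python
import json
from pathlib import Path


def find_invalid_json(results_dir: str) -> list[tuple[str, str]]:
    """Return (path, error message) for every *.json file that fails to parse.

    `results_dir` is a hypothetical location for the eval result files.
    """
    failures = []
    for path in sorted(Path(results_dir).rglob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            # Record the file and the parser's message instead of stopping early.
            failures.append((str(path), str(exc)))
    return failures
```

An empty list means every file parsed; any entries name the files to inspect.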

…h v0.1.7 delta

Run full 58-task eval against Haiku 4.5, Sonnet 4.6, Opus 4.6, GPT-5.2,
and GPT-5.3-Codex. Dataset expanded from 52 to 58 tasks with 3 new
categories (database_operations, config_management, build_simulation).

Results (58 tasks):
- Haiku 4.5: 54/58 (97%) — best overall
- GPT-5.3-Codex: 51/58 (91%)
- Opus 4.6: 50/58 (91%)
- Sonnet 4.6: 48/58 (93%)
- GPT-5.2: 41/58 (77%)
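
The task counts above can be turned into raw pass fractions with a few lines of Python. This is only a sketch: it assumes a plain passed/total ratio, which is not necessarily how the percentages in parentheses were computed (they may come from a different scoring metric). The helper name `pass_rate` is hypothetical.

```python
TOTAL_TASKS = 58

# Tasks passed per model, copied from the results list above.
PASSED = {
    "Haiku 4.5": 54,
    "GPT-5.3-Codex": 51,
    "Opus 4.6": 50,
    "Sonnet 4.6": 48,
    "GPT-5.2": 41,
}


def pass_rate(passed: int, total: int) -> float:
    """Raw pass fraction as a percentage (assumed metric: passed / total)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


# Round to one decimal place for display.
RATES = {model: round(pass_rate(n, TOTAL_TASKS), 1) for model, n in PASSED.items()}

if __name__ == "__main__":
    for model, rate in RATES.items():
        print(f"{model}: {PASSED[model]}/{TOTAL_TASKS} tasks ({rate}%)")
```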

Delta on shared 37 tasks from v0.1.7:
- Haiku: 98% → 100% (perfect score)
- Opus: 93% → 96%
- GPT-5.2: 86% → 86% (+3 tasks)

Interpreter fixes unblocked json_to_csv_export and script_function_lib.

https://claude.ai/code/session_01PoDSCiVJhgGiH2QwfRskwe

Add the first Sonnet 4.6 run (205324), which triggered the awk Unicode boundary panic.
It is kept for debugging reference; the successful re-run (211120) is the
canonical result.

https://claude.ai/code/session_01PoDSCiVJhgGiH2QwfRskwe

Remove partial results from the first Sonnet 4.6 run, which panicked due
to the awk Unicode char boundary bug (#395). The successful re-run (211120)
is the canonical result.

https://claude.ai/code/session_01PoDSCiVJhgGiH2QwfRskwe
@chaliy chaliy merged commit fef0340 into main Mar 1, 2026
10 checks passed
@chaliy chaliy deleted the claude/run-evals-update-readme-XlkTR branch March 1, 2026 00:40