Skip to content

chore(eval): expand eval suite to 52 tasks, add multi-model results#349

Merged
chaliy merged 6 commits intomainfrom
claude/analyze-bashkit-evals-aU9Lh
Feb 27, 2026
Merged

chore(eval): expand eval suite to 52 tasks, add multi-model results#349
chaliy merged 6 commits intomainfrom
claude/analyze-bashkit-evals-aU9Lh

Conversation

@chaliy
Copy link
Contributor

@chaliy chaliy commented Feb 27, 2026

Summary

† Partial runs due to Anthropic API credit exhaustion during parallel execution.

Results

Model Tasks Passed Score
Opus 4.6 23† 23/23 100%
Sonnet 4.6 26† 23/26 94%
Haiku 4.5 52 43/52 92%
GPT-5.2 52 32/52 79%

New interpreter bugs surfaced

Test plan

  • Eval dataset loads and runs (verified against all 4 models)
  • Relaxed expectations don't break existing passing tasks
  • cargo build -p bashkit-eval --release succeeds

https://claude.ai/code/session_0158eammVU6hRXeg9VJZuCJz

…ries)

Gap analysis of agentic bash usage patterns (SWE-bench, Terminal-Bench,
Claude Code, mini-swe-agent) identified major coverage gaps in the eval
suite. Added tasks across 3 new categories and enhanced 5 existing ones:

New categories:
- code_search: recursive grep, find+replace across codebases
- variable_expansion: ${var:-default}, ${var##pat}, ${var//old/new}
- environment: source, export, env propagation

Enhanced categories:
- scripting: trap EXIT cleanup, getopts parsing, declare -A assoc arrays
- pipelines: process substitution diff <(), xargs batch processing
- text_processing: heredoc config generation, comm set operations
- complex_tasks: test output parsing, script debugging
- data_transformation: [[ =~ ]] BASH_REMATCH regex extraction

These cover the top bash patterns used by AI coding agents that had zero
eval representation: code search (grep -rn), parameter expansion, heredocs,
trap/error handling, process substitution, xargs, and associative arrays.

https://claude.ai/code/session_0158eammVU6hRXeg9VJZuCJz
…ries)

Gap analysis of agentic bash usage patterns (SWE-bench, Terminal-Bench,
Claude Code, mini-swe-agent) identified major coverage gaps in the eval
suite. Added tasks across 2 new categories and enhanced 5 existing ones:

New categories:
- code_search: recursive grep, find+replace across codebases
- environment: source/export, env defaults, config propagation

Enhanced categories:
- scripting: trap EXIT cleanup, getopts parsing, declare -A assoc arrays
- pipelines: process substitution diff <(), xargs batch processing
- text_processing: heredoc config generation, comm set operations
- complex_tasks: test output parsing, script debugging
- data_transformation: regex extraction from structured logs
- file_operations: organize files by extension

All tasks model realistic agentic workflows (code search, config setup,
test parsing, deployment scripts) rather than synthetic feature tests.
These cover the top bash patterns used by AI coding agents that had zero
eval representation.

https://claude.ai/code/session_0158eammVU6hRXeg9VJZuCJz
43/52 passed (92%), 93% tool call success rate, 7.3 min total.
New tasks surfaced bashkit bugs: heredoc redirect to file fails,
mv into newly-created directories unreliable. Several near-misses
from output format sensitivity in expectations.

https://claude.ai/code/session_0158eammVU6hRXeg9VJZuCJz
Use stdout_regex instead of stdout_contains for 5 tasks where exact
output format varies by model but the underlying task is done correctly:
- search_find_replace: flexible "Files modified: 3" matching
- script_assoc_array: flexible visitor/country output matching
- pipe_xargs_batch: flexible line count/error count matching
- script_getopts_parser: flexible verbose message matching
- data_regex_extract: removed redundant checks, flexible summary matching

https://claude.ai/code/session_0158eammVU6hRXeg9VJZuCJz
…aset

Run evals against three models on expanded 52-task dataset:
- Opus 4.6: 23/23 (100%) — partial run, API credits exhausted
- Sonnet 4.6: 23/26 (94%) — partial run, first eval for this model
- GPT-5.2: 32/52 (79%) — full run

Update README with new results section including per-category
comparison, failure analysis, and new interpreter bugs surfaced.

https://claude.ai/code/session_0158eammVU6hRXeg9VJZuCJz
Keep both our new tasks (code_search, environment, etc.) and main's
new tasks (database_operations, config_management, build_simulation).
Dataset now has 58 tasks across 15 categories.

https://claude.ai/code/session_0158eammVU6hRXeg9VJZuCJz
@chaliy chaliy merged commit 015876e into main Feb 27, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants