chore(eval): expand eval suite to 52 tasks, add multi-model results by chaliy · Pull Request #349 · everruns/bashkit

chaliy · 2026-02-27T04:43:07Z

Summary

Expand eval dataset from 37 to 52 tasks with 2 new categories (code_search, environment) and new tasks in existing categories (heredoc, getopts, associative arrays, process substitution, xargs, comm, trap)
Relax format-sensitive expectations to use stdout_regex — focus on correctness, not exact output format
Run evals against Haiku 4.5 (52 tasks), Sonnet 4.6 (26†), Opus 4.6 (23†), GPT-5.2 (52 tasks)
Update README with results, per-category comparison, and failure analysis
File 3 new interpreter bug issues (bug: heredoc redirect to file does not write content (cat <<EOF > file) #345, bug: xargs does not execute commands, just echoes them #346, bug: word splitting on $() command substitution broken in for-loops #347) surfaced by the new tasks

† Partial runs due to Anthropic API credit exhaustion during parallel execution.

Results

Model	Tasks	Passed	Score
Opus 4.6	23†	23/23	100%
Sonnet 4.6	26†	23/26	94%
Haiku 4.5	52	43/52	92%
GPT-5.2	52	32/52	79%

New interpreter bugs surfaced

bug: heredoc redirect to file does not write content (cat <<EOF > file) #345 — heredoc redirect to file (cat <<EOF > file)
bug: xargs does not execute commands, just echoes them #346 — xargs doesn't execute commands
bug: word splitting on $() command substitution broken in for-loops #347 — word splitting on $() in for-loops

Test plan

Eval dataset loads and runs (verified against all 4 models)
Relaxed expectations don't break existing passing tasks
cargo build -p bashkit-eval --release succeeds

https://claude.ai/code/session_0158eammVU6hRXeg9VJZuCJz

…ries) Gap analysis of agentic bash usage patterns (SWE-bench, Terminal-Bench, Claude Code, mini-swe-agent) identified major coverage gaps in the eval suite. Added tasks across 3 new categories and enhanced 5 existing ones: New categories: - code_search: recursive grep, find+replace across codebases - variable_expansion: ${var:-default}, ${var##pat}, ${var//old/new} - environment: source, export, env propagation Enhanced categories: - scripting: trap EXIT cleanup, getopts parsing, declare -A assoc arrays - pipelines: process substitution diff <(), xargs batch processing - text_processing: heredoc config generation, comm set operations - complex_tasks: test output parsing, script debugging - data_transformation: [[ =~ ]] BASH_REMATCH regex extraction These cover the top bash patterns used by AI coding agents that had zero eval representation: code search (grep -rn), parameter expansion, heredocs, trap/error handling, process substitution, xargs, and associative arrays. https://claude.ai/code/session_0158eammVU6hRXeg9VJZuCJz

…ries) Gap analysis of agentic bash usage patterns (SWE-bench, Terminal-Bench, Claude Code, mini-swe-agent) identified major coverage gaps in the eval suite. Added tasks across 2 new categories and enhanced 5 existing ones: New categories: - code_search: recursive grep, find+replace across codebases - environment: source/export, env defaults, config propagation Enhanced categories: - scripting: trap EXIT cleanup, getopts parsing, declare -A assoc arrays - pipelines: process substitution diff <(), xargs batch processing - text_processing: heredoc config generation, comm set operations - complex_tasks: test output parsing, script debugging - data_transformation: regex extraction from structured logs - file_operations: organize files by extension All tasks model realistic agentic workflows (code search, config setup, test parsing, deployment scripts) rather than synthetic feature tests. These cover the top bash patterns used by AI coding agents that had zero eval representation. https://claude.ai/code/session_0158eammVU6hRXeg9VJZuCJz

43/52 passed (92%), 93% tool call success rate, 7.3 min total. New tasks surfaced bashkit bugs: heredoc redirect to file fails, mv into newly-created directories unreliable. Several near-misses from output format sensitivity in expectations. https://claude.ai/code/session_0158eammVU6hRXeg9VJZuCJz

Use stdout_regex instead of stdout_contains for 5 tasks where exact output format varies by model but the underlying task is done correctly: - search_find_replace: flexible "Files modified: 3" matching - script_assoc_array: flexible visitor/country output matching - pipe_xargs_batch: flexible line count/error count matching - script_getopts_parser: flexible verbose message matching - data_regex_extract: removed redundant checks, flexible summary matching https://claude.ai/code/session_0158eammVU6hRXeg9VJZuCJz

…aset Run evals against three models on expanded 52-task dataset: - Opus 4.6: 23/23 (100%) — partial run, API credits exhausted - Sonnet 4.6: 23/26 (94%) — partial run, first eval for this model - GPT-5.2: 32/52 (79%) — full run Update README with new results section including per-category comparison, failure analysis, and new interpreter bugs surfaced. https://claude.ai/code/session_0158eammVU6hRXeg9VJZuCJz

Keep both our new tasks (code_search, environment, etc.) and main's new tasks (database_operations, config_management, build_simulation). Dataset now has 58 tasks across 15 categories. https://claude.ai/code/session_0158eammVU6hRXeg9VJZuCJz

claude added 6 commits February 27, 2026 03:48

chaliy merged commit 015876e into main Feb 27, 2026
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(eval): expand eval suite to 52 tasks, add multi-model results#349

chore(eval): expand eval suite to 52 tasks, add multi-model results#349
chaliy merged 6 commits intomainfrom
claude/analyze-bashkit-evals-aU9Lh

chaliy commented Feb 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

chaliy commented Feb 27, 2026

Summary

Results

New interpreter bugs surfaced

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants