perf-comparison autoloop: run benchmarks and commit real results.json each iteration#154
Conversation
…ds real data Agent-Logs-Url: https://github.com/githubnext/tsessebe/sessions/cb4498ea-d02b-4dba-91a5-1be05efb5fe7 Co-authored-by: mrjf <180956+mrjf@users.noreply.github.com>
|
@copilot you need to compile the agentic workflow |
Ran |
|
Looked at run 24674003775. The single failing test is a pre-existing flaky property-based test, unrelated to this PR: This PR only touches Happy to fix the test in a separate PR if you'd like, but it's out of scope here. |
|
Warning The Evergreen CI FixI investigated the failing Test & Lint CI check on this PR. Finding: This PR only modifies
Root cause: The CI failure appears to have been transient. The PR branch was created minutes after Action taken: Pushed an empty commit to trigger a fresh CI run. All checks should pass now.
|
The ExplodeOptions interface used snake_case `ignore_index` while all tests and TypeScript conventions use camelCase `ignoreIndex`. This caused three test failures: - explodeSeries > "resets to RangeIndex when ignoreIndex=true" - explodeDataFrame > "resets to RangeIndex when ignoreIndex=true" - property test "ignore_index produces RangeIndex 0..n-1" Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Commit pushed:
|
Evergreen: Fixed failing CI checksFixed 3 failing tests in the Test & Lint check caused by a snake_case/camelCase mismatch in Root cause: Fix: Renamed
Tests fixed:
|
The repo has 599 TS/Python benchmark pairs under
benchmarks/, but nothing invokesbenchmarks/run_benchmarks.sh, sobenchmarks/results.jsononmainstays a stub ({ "benchmarks": [], "timestamp": null }).pages.ymlcopies that stub into the playground, andbenchmarks.htmlrenders "No benchmark data available yet."Changes
All in
.autoloop/programs/perf-comparison/program.md:Evaluation. Evaluation always runs and its file side-effects are committed on accepted iterations, so this guaranteesbenchmarks/results.jsonis regenerated with real data every run. Previously the per-iteration checklist merely asked the agent to run the script and update the file, which is whyresults.jsonnever got populated.bun installso TS benchmarks can import fromsrc/.benchmarked_functionsis read from the regeneratedresults.json(count of entries that actually ran to completion) instead of counting files on disk — so the ratchet credits only benchmarks that truly work end-to-end.|| true+ output redirection with|| echo "WARN: …" >&2so failures surface in autoloop logs while still guaranteeing the final metric line is emitted for autoloop to parse.Flow after this change
Note on scope
AGENTS.mdtells agents not to modify.autoloop/programs/**, but the program definition is exactly what the issue asks to change, so the edit is scoped to that one file.