One workspace skill (scaffold/see/save); add asta_skills suite as new-evals example #63
Merged
rodneykinney (Member) approved these changes on May 20, 2026, and left a comment:
Love that we're getting quantified measurements of how often the skills are colliding
PR: One `workspace` skill (scaffold/see/save); add `asta_skills` suite as new-evals example

Branch: `add-project-setup-skill` → `main`

`workspace` is the single place the agent manages a research project's repo: scaffold infrastructure (Quarto build tool, GitHub Pages auto-deploy, dev container), show the user the rendered output, and save iterations on the user's behalf. Today the rendering is Quarto-based (`.qmd` / `.md` / `.ipynb`; scaffold defaults to website output, with other Quarto formats available via `_quarto.yml`). It doesn't handle research execution (use `research-step`).

The agent's job in any context is to give the user a web URL for the rendered work. Only the URL source differs: local agents (host, dev container, Codespace) serve it from Quarto on a forwarded port; headless agents push to a PR branch and let GitHub Pages CI deploy the URL. One SKILL.md covers both.

This consolidates the old `preview` (raw `quarto render` / `quarto preview`, CI deploy, with the `_quarto.yml` and `docs.yml` assets) and the old `workspace` (just `devcontainer.json`), adds a Makefile wrapping the Quarto / deploy / dev-container entry points, and adds the new save workflow (commit + push + PR + merge with explicit user approval). The old `preview` is deleted.

Skill: `skills/workspace/SKILL.md`. Scaffolded assets: `skills/workspace/assets/` (Makefile, devcontainer.json, _quarto.yml, docs.yml, README.md, DEVELOPER.md; the last two are user-owned). The `Dockerfile` bundles `make`, so it is dropped from the devcontainer asset's `postCreateCommand`.

The companion PR (allenai/asta-bench-private#226) adds a new `asta_skills` eval suite, the first per-skill suite under the Benchmarking workflow (#60). The three cases for `workspace` (see Validation) are also the worked example for adding new per-skill cases, referenced from the README.

Side fix: the `check-plugins` Makefile target now flags untracked files in `plugins/`. It previously used `git diff --quiet`, which only detects modifications to tracked files; that gap let PR #58 land with `prepare-submission.sh` in `skills/` but not in `plugins/`. This PR's rebuild picks it up.

Outcomes

Each maps to an eval case. Quote is the eval prompt. Per-flow behavior in `SKILL.md`.

- Set up a working project. (`start_workspace`)
- Review the agent's work in a browser. (`view_agent_output`)
- Save the current iteration. (`save_iteration`)

Validation

Cases in `asta_skills`, scored on routing today (see allenai/asta-bench-private#226). Agent runtime: `claude_code` + `sonnet-4-6` in an Inspect Docker sandbox. Three arms compared to isolate what's actually responsible for the routing improvement:

- Arm A: the old separate skills (`preview`, `workspace`) with their existing descriptions.
- Arm C: the old separate skills, with `workspace`'s description appended to mention show / save (description fix only, no consolidation).
- Arm B (this PR): one consolidated `workspace` skill with a user-first description.

Score = `workspace_skill_activated` (1 if the agent invoked `workspace` for this user prompt; 0 otherwise).

[Results table: one row per case (`start_workspace`, `view_agent_output`, `save_iteration`), scored per arm; cell values not recoverable. Surviving routing notes: `start_workspace`: `research-step`/init; `view_agent_output`: `artifacts`, an A2A-task-output skill whose triggers bleed into "show me what you've put together".]

Arm B (this PR) wins all three. Arm C (description-fix-only on the old separate `workspace`) wins 2 of 3; it fails on `view_agent_output` because `artifacts` already claims "what did the agent produce" / "show me the artifacts" triggers that bleed into research-project view prompts. So consolidation's unique value is routing reliability when sibling skills have overlapping natural-language triggers: with one consolidated workspace, you don't have to carefully scope every other skill's description to avoid bleed. For the other two cases (start, save), a description fix on the old skills would have sufficed.

Coverage gaps: only the headless branch of SKILL.md is exercised. The user-reaches-port branch (host / dev container / Codespace agents using `make preview` → localhost or Codespaces URL) isn't simulated. Other agent runtimes (`codex_cli`, `gemini_cli`) and other models aren't tested either.

(Working-limit 600s; routing activation happens within seconds, but `start_workspace` runs longer because the agent does real scaffolding after routing.)
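
For readers adding new per-skill cases, the routing score defined above can be sketched in a few lines of Python. This is only an illustration, not the suite's actual scorer: the function names `workspace_skill_activated` and `score_cases`, and the idea that each case yields a plain list of invoked skill names, are assumptions made for the example. The sample invocation lists are hypothetical and simply mirror the Arm C outcome described above (the view prompt routed to `artifacts`).

```python
from typing import Dict, Iterable, List

TARGET_SKILL = "workspace"  # the skill whose activation is being scored


def workspace_skill_activated(invoked_skills: Iterable[str],
                              target: str = TARGET_SKILL) -> int:
    """Return 1 if the agent invoked the target skill for this prompt, else 0."""
    return int(any(name == target for name in invoked_skills))


def score_cases(invocations_by_case: Dict[str, List[str]]) -> Dict[str, int]:
    """Score every case; keys are case names, values are invoked skill names."""
    return {
        case: workspace_skill_activated(skills)
        for case, skills in invocations_by_case.items()
    }


# Hypothetical invocation lists (not real logs), shaped like the Arm C result:
# the view prompt is claimed by `artifacts`, the other two reach `workspace`.
example_arm = {
    "start_workspace": ["workspace"],
    "view_agent_output": ["artifacts"],
    "save_iteration": ["workspace"],
}

scores = score_cases(example_arm)
wins = sum(scores.values())
print(scores, f"wins {wins} of {len(scores)}")
```

Because the score is binary per case, an arm's result is just the count of cases scoring 1, which is how "wins 2 of 3" reads above.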