One `workspace` skill (scaffold/see/save); add `asta_skills` suite as new-evals example by jbragg · Pull Request #63 · allenai/asta-plugins

jbragg · 2026-05-20T00:29:42Z

PR: One `workspace` skill (scaffold/see/save); add `asta_skills` suite as new-evals example

Branch: add-project-setup-skill → main

workspace is the single place where the agent manages a research project's repo: it scaffolds infrastructure (Quarto build tool, GitHub Pages auto-deploy, dev container), shows the user the rendered output, and saves iterations on the user's behalf. Today the rendering is Quarto-based (.qmd / .md / .ipynb; scaffold defaults to website output, with other Quarto formats available via _quarto.yml). It does not handle research execution; use research-step for that.

The agent's job in any context is to give the user a web URL for the rendered work. Only the URL source differs: local agents (host, dev container, Codespace) serve it from Quarto on a forwarded port; headless agents push to a PR branch and let GitHub Pages CI deploy the URL. One SKILL.md covers both.

This consolidates the old preview (raw quarto render / quarto preview, CI deploy, with the _quarto.yml and docs.yml assets) and the old workspace (just devcontainer.json), adds a Makefile wrapping the Quarto / deploy / dev-container entry points, and adds the new save workflow (commit + push + PR + merge with explicit user approval). The old preview is deleted.
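The save workflow's ordering can be sketched as a plan of commands. This is only an illustration, not the skill's actual implementation: the function name `plan_save`, its parameters, and the specific flags are hypothetical, and it assumes the explicit user approval gates the merge step.

```python
def plan_save(message: str, branch: str, user_approved_merge: bool) -> list[list[str]]:
    """Build the ordered command plan for one save: commit, push, PR, merge.

    Hypothetical sketch: the merge command is only appended when the user
    has explicitly approved it. Nothing is executed here; the caller decides
    how (and whether) to run each step.
    """
    steps = [
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],
        ["git", "push", "--set-upstream", "origin", branch],
        ["gh", "pr", "create", "--fill"],
    ]
    if user_approved_merge:
        # Assumed gate: no approval, no merge.
        steps.append(["gh", "pr", "merge", "--squash"])
    return steps
```

Returning a plan rather than running commands keeps the approval gate easy to inspect before anything touches the repo.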

Skill: skills/workspace/SKILL.md. Scaffolded assets: skills/workspace/assets/ (Makefile, devcontainer.json, _quarto.yml, docs.yml, README.md, DEVELOPER.md; the last two are user-owned). The Dockerfile now bundles make, so the make install step was dropped from the devcontainer asset's postCreateCommand.

The companion PR (allenai/asta-bench-private#226) adds a new asta_skills eval suite — the first per-skill suite under the Benchmarking workflow (#60). The three cases for workspace (see Validation) are also the worked example for adding new per-skill cases, referenced from the README.

Side fix: the check-plugins Makefile target now flags untracked files in plugins/. It previously ran git diff --quiet, which only detects modifications to tracked files; that gap let PR #58 land with prepare-submission.sh in skills/ but not in plugins/. This PR's rebuild picks it up.
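A minimal sketch of why the old check missed the file: `git diff --quiet` compares tracked files only, whereas `git status --porcelain` also lists untracked paths with a `??` status code. The helper name `find_unsynced` and the sample paths are hypothetical, and this is not the actual Makefile target.

```python
def find_unsynced(porcelain_output: str, prefix: str = "plugins/") -> list[str]:
    """Return paths under `prefix` that are modified or untracked.

    `porcelain_output` is the text printed by `git status --porcelain`:
    each line is a two-character status code, a space, then the path.
    Untracked files carry the code `??`, which `git diff --quiet` never reports.
    Renames and quoted paths are not handled in this sketch.
    """
    flagged = []
    for line in porcelain_output.splitlines():
        if len(line) < 4:
            continue  # too short to hold a status code and a path
        path = line[3:]
        if path.startswith(prefix):
            flagged.append(path)
    return flagged


# Hypothetical status output: one modified file, two untracked files.
sample = (
    " M plugins/a/SKILL.md\n"
    "?? plugins/b/prepare-submission.sh\n"
    "?? notes.txt\n"
)
unsynced = find_unsynced(sample)
```

A check built this way would fail whenever `unsynced` is non-empty, covering both modified and untracked files under plugins/.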

Outcomes

Each outcome maps to an eval case; the quoted text is the eval prompt. Per-flow behavior is documented in SKILL.md.

Set up a working project. (start_workspace)

"Let's start a new research project on retrieval-augmented generation. Set things up so I can write notes and drafts as I go."

Review the agent's work in a browser. (view_agent_output)

"Show me what you've put together so far."

Save the current iteration. (save_iteration)

"Save what you've done."

Validation

Cases in asta_skills, scored on routing today (see allenai/asta-bench-private#226). Agent runtime: claude_code + sonnet-4-6 in an Inspect Docker sandbox. Three arms compared to isolate what's actually responsible for the routing improvement:

A: original — two separate skills (preview, workspace) with their existing descriptions.
C: separate skills, with workspace's description appended to mention show / save (description fix only, no consolidation).
B: this PR — one consolidated workspace skill with a user-first description.

Score = workspace_skill_activated (1 if the agent invoked workspace for this user prompt; 0 otherwise).
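The metric is a simple binary indicator. A minimal sketch, assuming the list of skills the agent invoked has already been extracted from the run transcript (that extraction step and the `invoked_skills` parameter are assumptions, not part of this PR):

```python
def workspace_skill_activated(invoked_skills: list[str]) -> int:
    """Return 1 if the agent invoked the workspace skill for the prompt, else 0."""
    return 1 if "workspace" in invoked_skills else 0
```

For example, a run that routed only to a sibling skill such as artifacts would score 0 under this definition.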

| case | arm A | arm C | arm B |
| --- | --- | --- | --- |
| `start_workspace` | 0 (lost to `research-step/init`) | 1 (description fix sufficient; no competing sibling) | 1 |
| `view_agent_output` | 0 (inferred) | 0 (lost to `artifacts`, an A2A-task-output skill whose triggers bleed into "show me what you've put together") | 1 |
| `save_iteration` | 0 (no skill mentions save) | 1 (description fix sufficient) | 1 |

Arm B (this PR) wins all three. Arm C — description-fix-only on the old separate workspace — wins 2 of 3; it fails on view_agent_output because artifacts already claims "what did the agent produce" / "show me the artifacts" triggers that bleed into research-project view prompts. So consolidation's unique value is routing reliability when sibling skills have overlapping natural-language triggers: with one consolidated workspace, you don't have to carefully scope every other skill's description to avoid bleed. For the other two cases (start, save), a description fix on the old skills would have sufficed.
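The per-arm tallies quoted above follow directly from the case scores. A short sketch that recomputes them; the scores are copied from the Validation results, while the dictionary layout and function name are illustrative:

```python
# Scores per case and arm, as reported in the Validation results.
scores = {
    "start_workspace": {"A": 0, "C": 1, "B": 1},
    "view_agent_output": {"A": 0, "C": 0, "B": 1},
    "save_iteration": {"A": 0, "C": 1, "B": 1},
}


def wins_per_arm(case_scores: dict[str, dict[str, int]]) -> dict[str, int]:
    """Sum the binary routing scores for each arm across all cases."""
    totals: dict[str, int] = {}
    for per_arm in case_scores.values():
        for arm, score in per_arm.items():
            totals[arm] = totals.get(arm, 0) + score
    return totals


totals = wins_per_arm(scores)

# Cases where consolidation (arm B) succeeds but the description fix (arm C) does not.
only_consolidation = [case for case, s in scores.items() if s["B"] == 1 and s["C"] == 0]
```

Here `only_consolidation` isolates the single case where the description fix alone was not enough.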

Coverage gaps: only the headless branch of SKILL.md is exercised. The user-reaches-port branch (host / dev container / Codespace agents using make preview → localhost or Codespaces URL) isn't simulated. Other agent runtimes (codex_cli, gemini_cli) and other models aren't tested either.

(Working-limit 600s; routing activation happens within seconds, but start_workspace runs longer because the agent does real scaffolding after routing.)

rodneykinney

Love that we're getting quantified measurements of how often the skills are colliding

Consolidate preview into workspace; add save workflow + DEVELOPER.md …

82bda01

…template

jbragg force-pushed the add-project-setup-skill branch from 63c156f to 82bda01 Compare May 20, 2026 00:30

jbragg requested a review from rodneykinney May 20, 2026 01:14

rodneykinney approved these changes May 20, 2026

jbragg merged commit 8405ac8 into main May 20, 2026
6 checks passed

jbragg deleted the add-project-setup-skill branch May 20, 2026 16:57
