One workspace skill (scaffold/see/save); add asta_skills suite as new-evals example #63

Merged
jbragg merged 1 commit into main from add-project-setup-skill on May 20, 2026

@jbragg jbragg commented May 20, 2026

PR: One workspace skill (scaffold/see/save); add asta_skills suite as new-evals example

Branch: add-project-setup-skill → main

workspace is the single place the agent manages a research project's repo: scaffold infrastructure (Quarto build tool, GitHub Pages auto-deploy, dev container), show the user the rendered output, save iterations on the user's behalf. Today the rendering is Quarto-based (.qmd / .md / .ipynb; scaffold defaults to website output, other Quarto formats via _quarto.yml). Doesn't handle research execution (use research-step).

The agent's job in any context is to give the user a web URL for the rendered work. Only the URL source differs: local agents (host, dev container, Codespace) serve it from Quarto on a forwarded port; headless agents push to a PR branch and let GitHub Pages CI deploy the URL. One SKILL.md covers both.
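The split described above can be sketched as a small lookup. This is only an illustration of the decision, not code from SKILL.md: the `AgentContext` enum, the `url_source` function, and the returned strings are all hypothetical names chosen for this sketch.

```python
from enum import Enum


class AgentContext(Enum):
    """Where the agent is running (hypothetical names for this sketch)."""
    HOST = "host"
    DEV_CONTAINER = "dev container"
    CODESPACE = "Codespace"
    HEADLESS = "headless"


# Local agents can reach a forwarded port; headless agents cannot.
LOCAL_CONTEXTS = {
    AgentContext.HOST,
    AgentContext.DEV_CONTAINER,
    AgentContext.CODESPACE,
}


def url_source(context: AgentContext) -> str:
    """Return a description of where the rendered-work URL comes from."""
    if context in LOCAL_CONTEXTS:
        return "Quarto served on a forwarded port"
    return "GitHub Pages CI deploy from a PR branch"
```

In every case the user ends up with a web URL; only the branch that produces it changes.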

This consolidates the old preview (raw quarto render / quarto preview, CI deploy, with the _quarto.yml and docs.yml assets) and the old workspace (just devcontainer.json), adds a Makefile wrapping the Quarto / deploy / dev-container entry points, and adds the new save workflow (commit + push + PR + merge with explicit user approval). The old preview is deleted.

Skill: skills/workspace/SKILL.md. Scaffolded assets: skills/workspace/assets/ (Makefile, devcontainer.json, _quarto.yml, docs.yml, README.md, DEVELOPER.md; the last two are user-owned). The Dockerfile now bundles make, so the make install was dropped from the devcontainer asset's postCreateCommand.

The companion PR (allenai/asta-bench-private#226) adds a new asta_skills eval suite — the first per-skill suite under the Benchmarking workflow (#60). The three cases for workspace (see Validation) are also the worked example for adding new per-skill cases, referenced from the README.

Side fix: the check-plugins Makefile target now flags untracked files in plugins/. It previously used git diff --quiet, which only detects modifications to tracked files. That gap let PR #58 land with prepare-submission.sh in skills/ but not in plugins/; this PR's rebuild picks it up.
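To illustrate the gap, here is a minimal Python sketch of a check that also catches untracked files. It is not the Makefile target itself (the PR description does not show the new recipe); it assumes the check is built on `git status --porcelain`, whose short format marks untracked paths with `??`. The function names and the sample path are hypothetical.

```python
import subprocess


def untracked_paths(porcelain_output: str) -> list[str]:
    """Return paths marked untracked ('??') in `git status --porcelain` output."""
    return [
        line[3:]
        for line in porcelain_output.splitlines()
        if line.startswith("?? ")
    ]


def plugins_are_clean(repo_dir: str = ".") -> bool:
    """True if plugins/ has no modified, staged, or untracked files.

    Unlike `git diff --quiet`, `git status --porcelain` also lists
    untracked files, so a generated file that was never added shows up.
    """
    result = subprocess.run(
        ["git", "status", "--porcelain", "--", "plugins/"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == ""


# Illustrative output: one modified tracked file, one untracked file.
sample = " M plugins/a.md\n?? plugins/prepare-submission.sh\n"
print(untracked_paths(sample))
```

With the sample above, `git diff --quiet` would report only the modified file; the untracked one is visible only through the status output.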

Outcomes

Each outcome maps to an eval case; the quoted text is the eval prompt. Per-flow behavior is described in SKILL.md.

  • Set up a working project. (start_workspace)

    "Let's start a new research project on retrieval-augmented generation. Set things up so I can write notes and drafts as I go."

  • Review the agent's work in a browser. (view_agent_output)

    "Show me what you've put together so far."

  • Save the current iteration. (save_iteration)

    "Save what you've done."

Validation

Cases in asta_skills, scored on routing today (see allenai/asta-bench-private#226). Agent runtime: claude_code + sonnet-4-6 in an Inspect Docker sandbox. Three arms compared to isolate what's actually responsible for the routing improvement:

  • A: original — two separate skills (preview, workspace) with their existing descriptions.
  • C: separate skills, with workspace's description appended to mention show / save (description fix only, no consolidation).
  • B: this PR — one consolidated workspace skill with a user-first description.

Score = workspace_skill_activated (1 if the agent invoked workspace for this user prompt; 0 otherwise).

| case | arm A | arm C | arm B |
|---|---|---|---|
| start_workspace | 0 (lost to research-step/init) | 1 (description fix sufficient; no competing sibling) | 1 |
| view_agent_output | 0 (inferred) | 0 (lost to artifacts, an A2A-task-output skill whose triggers bleed into "show me what you've put together") | 1 |
| save_iteration | 0 (no skill mentions save) | 1 (description fix sufficient) | 1 |

Arm B (this PR) wins all three. Arm C — description-fix-only on the old separate workspace — wins 2 of 3; it fails on view_agent_output because artifacts already claims "what did the agent produce" / "show me the artifacts" triggers that bleed into research-project view prompts. So consolidation's unique value is routing reliability when sibling skills have overlapping natural-language triggers: with one consolidated workspace, you don't have to carefully scope every other skill's description to avoid bleed. For the other two cases (start, save), a description fix on the old skills would have sufficed.
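The tally behind this summary can be reproduced with a short sketch. The scores are copied from the routing table above (arm A's view_agent_output value is the inferred 0); the helper names `wins_per_arm` and `cases_only_fixed_by` are hypothetical and not part of the eval suite.

```python
# workspace_skill_activated per case and arm, copied from the table above.
ROUTING_SCORES = {
    "start_workspace":   {"A": 0, "C": 1, "B": 1},
    "view_agent_output": {"A": 0, "C": 0, "B": 1},  # arm A value is inferred
    "save_iteration":    {"A": 0, "C": 1, "B": 1},
}


def wins_per_arm(scores: dict) -> dict:
    """Sum the 0/1 activation scores for each arm across all cases."""
    totals = {}
    for case_scores in scores.values():
        for arm, score in case_scores.items():
            totals[arm] = totals.get(arm, 0) + score
    return totals


def cases_only_fixed_by(scores: dict, arm: str, baseline: str) -> list:
    """Cases where `arm` routes correctly but `baseline` does not."""
    return [
        case
        for case, by_arm in scores.items()
        if by_arm[arm] == 1 and by_arm[baseline] == 0
    ]


print(wins_per_arm(ROUTING_SCORES))                   # {'A': 0, 'C': 2, 'B': 3}
print(cases_only_fixed_by(ROUTING_SCORES, "B", "C"))  # ['view_agent_output']
```

The second call isolates what consolidation adds over the description fix alone: only view_agent_output.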

Coverage gaps: only the headless branch of SKILL.md is exercised. The user-reaches-port branch (host / dev container / Codespace agents using make preview → localhost or Codespaces URL) isn't simulated. Other agent runtimes (codex_cli, gemini_cli) and other models aren't tested either.

(Working-limit 600s; routing activation happens within seconds, but start_workspace runs longer because the agent does real scaffolding after routing.)

@jbragg jbragg force-pushed the add-project-setup-skill branch from 63c156f to 82bda01 Compare May 20, 2026 00:30
@jbragg jbragg requested a review from rodneykinney May 20, 2026 01:14

@rodneykinney rodneykinney left a comment

Love that we're getting quantified measurements of how often the skills are colliding

@jbragg jbragg merged commit 8405ac8 into main May 20, 2026
6 checks passed
@jbragg jbragg deleted the add-project-setup-skill branch May 20, 2026 16:57