Releases: fubak/ultraswarm
v3.3.0 — SmallHarness worker
What's new
Added: small-harness built-in worker
SmallHarness is now a first-class ultraswarm worker — a terminal-first Rust coding agent with native MCP integration, multi-backend support (OpenAI, OpenRouter, Ollama, LM Studio, MLX, llama.cpp), and real-time cost tracking.
Default tier mapping:
| Tier | Backend | Model |
|---|---|---|
| simple | OpenAI | gpt-4o-mini |
| moderate | OpenRouter | claude-sonnet-4-6 |
| complex | OpenRouter | claude-opus-4-8 |
| expert | OpenRouter | claude-opus-4-8 |
Backend and model are injected via BACKEND / AGENT_MODEL environment variables. Add OPENAI_API_KEY and OPENROUTER_API_KEY to workerEnvAllowlist to pass credentials through. Override any tier via the standard overrides config key.
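As a rough illustration of the default mapping above, the sketch below resolves a tier to the two injected environment variables. The table copies the defaults from these notes; the function name `envForTier`, the lowercase backend strings, and the shape of the `overrides` argument are assumptions made for illustration, not ultraswarm's actual internals.

```javascript
// Hypothetical sketch: resolve a tier to the BACKEND / AGENT_MODEL pair.
// The table mirrors the defaults listed above; backend spellings are assumed.
const DEFAULT_TIERS = {
  simple:   { backend: "openai",     model: "gpt-4o-mini" },
  moderate: { backend: "openrouter", model: "claude-sonnet-4-6" },
  complex:  { backend: "openrouter", model: "claude-opus-4-8" },
  expert:   { backend: "openrouter", model: "claude-opus-4-8" },
};

// `overrides` is an assumed shape: { [tier]: { backend?, model? } }.
function envForTier(tier, overrides = {}) {
  const base = DEFAULT_TIERS[tier];
  if (!base) throw new Error(`unknown tier: ${tier}`);
  const merged = { ...base, ...(overrides[tier] || {}) };
  return { BACKEND: merged.backend, AGENT_MODEL: merged.model };
}
```

Under these assumptions, a per-tier override replaces only the fields it names and leaves the other tiers on their defaults.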
Also includes a SmallHarness host skill — a SKILL.md enabling SmallHarness sessions to invoke ultraswarm as an orchestration tool in the other direction.
Bug fixed
The previous invocation template (from the in-development registry entry) used unrecognised CLI flags (--backend, --model, --approval never) and omitted --allow-tools, causing all tool calls to be silently denied. Fixed in this release.
v3.2.1 — forbiddenPaths + alias-pin fixes, e2e harness, coverage
Patch release: two bug fixes surfaced by a new end-to-end test harness, plus a comprehensive test-coverage lift.
Fixed
- `forbiddenPaths` bypass via new directories (security). The implement step listed worker output with `git status --porcelain`, which collapses a brand-new untracked directory to `dir/`. A worker writing a forbidden file into a new subdirectory (e.g. `vault/leak.secret`) was reported as `vault/`, slipped past the `forbiddenPaths` glob, and was integrated. The step now uses `-uall` so files are listed individually and enforced correctly. (Regression test proven to fail without the fix.)
- Aliases could not be pinned in a plan. `validatePlan` validated `task.cli` against the built-in registry only, rejecting user-defined alias names even though routing supports explicit alias selection. It now validates against the effective registry (built-ins + configured aliases).
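The `forbiddenPaths` bypass is easier to see with the two listings side by side. The sketch below is a minimal illustration, not the project's enforcement code: the glob matcher, the helper names, and the `**/*.secret` pattern are all hypothetical, and the two input strings imitate what `git status --porcelain` prints without and with `-uall`.

```javascript
// Hypothetical glob matcher: supports "**/" (any directory depth) and "*" (within one segment).
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+^${}()|[\]\\?]/g, "\\$&"); // escape everything except '*'
  const body = escaped
    .replace(/\*\*\//g, "\u0000")     // placeholder so the next step does not touch "**/"
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, "(?:.*/)?");
  return new RegExp(`^${body}$`);
}

// Porcelain v1 lines look like "XY path"; the path starts at column 3.
function changedPaths(porcelain) {
  return porcelain.split("\n").filter(Boolean).map((line) => line.slice(3));
}

// Returns the changed paths that match any forbidden pattern.
function violations(porcelain, forbiddenPaths) {
  const matchers = forbiddenPaths.map(globToRegExp);
  return changedPaths(porcelain).filter((p) => matchers.some((re) => re.test(p)));
}

const collapsed = "?? vault/\n";            // default listing: new directory collapsed
const expanded = "?? vault/leak.secret\n";  // with -uall: each file listed
```

With the collapsed listing, `violations(collapsed, ["**/*.secret"])` finds nothing because only `vault/` is visible; with the expanded listing the same call reports `vault/leak.secret`.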
Added
- Deterministic, network-free end-to-end test harness driving the real runner in-process (`ULTRASWARM_BRAIN=mock` seam + fake worker fixtures): complex multi-wave run → status → logs → export → merge, plus approval-gate and failure/retry paths.
- Broad unit-coverage lift: suite 178 → 309 tests; overall line coverage 90.6% → 96.8%, branch coverage 77.3% → 87.3%.
Full diff: v3.2.0...v3.2.1
v3.2.0 — User-defined harness aliases
Register your own CLI entries in config under a new top-level aliases key — generalizing the previously hardcoded pi-local. Configure the same CLI binary with different models, run multiple local LLMs each tuned for a job, and keep local-model harnesses lean.
{
  "enabled": ["codex", "pi-qwen-coder"],
  "aliases": {
    "pi-qwen-coder": {
      "extends": "pi",
      "specialty": "local coding, small refactors, unit tests",
      "maxTier": "moderate",
      "models": {
        "simple": { "model": "qwen3-coder:7b", "invocation": "pi -p --provider ollama --model qwen3-coder:7b --config ~/.pi/lean.json \"$(cat .ultraswarm-prompt.txt)\"" }
      }
    }
  }
}
- `extends` inherits the base CLI's binary, timeout, effort flags, and capabilities; override only what differs.
- Lean harness lives in the invocation string (`--config`, fewer flags): local models often do better with less wrapping.
- `maxTier` caps the tier an alias accepts; higher-tier tasks are clamped down so a small local model is never handed expert work.
- Strictly opt-in: with no `aliases`, behavior is byte-identical to before.
Built on a new buildRegistry(config) seam (frozen built-ins + resolved aliases); resolveRoute, the worker manager, runner routing, and the decomposition roster all consult it. Full validation of alias entries. 173 tests, all 15 release checks green.
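One way to picture the "frozen built-ins + resolved aliases" seam is sketched below. It reuses the name `buildRegistry` from these notes, but the body is an assumption: the built-in entries, the `timeoutMs` field, the error message, and the `clampTier` helper are invented for illustration. The tier ordering follows the simple → expert order used elsewhere in these notes.

```javascript
// Hypothetical sketch of an effective registry. Entry fields here are illustrative only.
const TIERS = ["simple", "moderate", "complex", "expert"];

const BUILT_INS = Object.freeze({
  pi: Object.freeze({ binary: "pi", timeoutMs: 600000 }),
  codex: Object.freeze({ binary: "codex", timeoutMs: 600000 }),
});

// Built-ins are copied as-is; each alias inherits from its base and overrides what differs.
function buildRegistry(config = {}) {
  const registry = { ...BUILT_INS };
  for (const [name, alias] of Object.entries(config.aliases || {})) {
    const base = BUILT_INS[alias.extends];
    if (!base) throw new Error(`alias ${name} extends unknown CLI: ${alias.extends}`);
    const { extends: _base, ...own } = alias;
    registry[name] = Object.freeze({ ...base, ...own });
  }
  return Object.freeze(registry);
}

// Clamp a requested tier down to the entry's maxTier, if it has one.
function clampTier(entry, tier) {
  if (!entry.maxTier) return tier;
  const cap = TIERS.indexOf(entry.maxTier);
  return TIERS.indexOf(tier) > cap ? entry.maxTier : tier;
}
```

In this sketch, calling `buildRegistry()` with no `aliases` key returns only the built-ins, which matches the opt-in behavior described above.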
Full diff: v3.1.0...v3.2.0
v3.1.0 — Pi workers + per-task effort levels
New worker integrations and a per-task reasoning-effort axis.
Added
- `pi` worker: the provider-agnostic `pi` coding CLI (Anthropic Claude spread: Haiku → Sonnet → Opus → Opus `--thinking high`). Headless via `pi -p`.
- `pi-local` worker: always-on local/private worker driving Ollama models through the same `pi` binary, for fully offline-capable runs. (Requires a configured `ollama` provider and a local model that emits structured tool-calls; see README.)
- Optional registry `binary` field so a logical worker can map to a different executable (`pi-local` → `pi`).
- Per-task effort levels: the decomposition brain assigns `effort` (`off`/`low`/`medium`/`high`/`xhigh`) per task, independent of model tier, defaulting to `low`. Injected per-CLI for `codex`/`droid`/`pi` via an `effortFlags` map + `{{EFFORT}}` slot.
- Effort-first escalation: on QA failure the attempt loop climbs effort (low → medium → high) before stepping up the model tier. Routine tasks climb effort within their tier; high-risk/complex tasks use the full effort-then-tier ladder.
Behavior change
Because effort defaults to low and is decoupled from tier, expert-tier tasks now run the expert model at low effort and escalate on QA failure — no longer pinned to high. Pin a task with effort: "high" for maximum reasoning up front.
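The effort-first ladder can be sketched as a small step function. This is a simplified model under stated assumptions, not the actual attempt loop: `nextAttempt` and `allowTierStep` are hypothetical names, effort values outside low/medium/high skip straight to the tier step, and keeping the same effort after a tier step is a guess.

```javascript
// Hypothetical sketch of effort-first escalation after a QA failure.
const EFFORTS = ["low", "medium", "high"]; // the escalation rungs named above
const TIERS = ["simple", "moderate", "complex", "expert"];

// Returns the next { tier, effort } to try, or null when the ladder is exhausted.
// allowTierStep models the difference between routine tasks (false) and
// high-risk/complex tasks (true).
function nextAttempt({ tier, effort }, { allowTierStep }) {
  const e = EFFORTS.indexOf(effort);
  if (e >= 0 && e < EFFORTS.length - 1) {
    return { tier, effort: EFFORTS[e + 1] }; // climb effort first
  }
  const t = TIERS.indexOf(tier);
  if (allowTierStep && t >= 0 && t < TIERS.length - 1) {
    return { tier: TIERS[t + 1], effort }; // then step up the model tier
  }
  return null;
}
```

Under this model an expert-tier task that starts at `low` gets two effort climbs before the ladder runs out, which is why pinning `effort: "high"` up front skips those retries.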
Validation
146 tests pass; repository validation and host-skill provenance lock green. Verified with live end-to-end runs: pi worker end-to-end, effort injection (codex/pi), and effort-first escalation on both the routine and intelligent paths.
v2.4.3 - Enhanced Codex Integration & Native Skill Architecture
This release introduces a proper skill-based integration for Codex CLI and improves cross-platform compatibility with enhanced documentation and validation.
🎯 Key Improvements
Native Codex Integration:
- ✅ Proper installable skill for `~/.agents/skills/ultraswarm`
- ✅ Dedicated installation script with symlink-based auto-updating
- ✅ Deprecated legacy AGENTS.md approach for better maintainability
Enhanced Validation & Documentation:
- ✅ Added comprehensive validation checks for Codex skill contract and installer
- ✅ Updated README with clear distinction between Claude Code and Codex usage
- ✅ Cross-platform compatibility improvements with robust error handling
🔧 Installation
For Codex CLI (NEW):
git clone https://github.com/fubak/ultraswarm.git ~/projects/ultraswarm
cd ~/projects/ultraswarm && npm install
bash scripts/install-codex-skill.sh
Restart Codex, then invoke: $ultraswarm <task>
For Claude Code:
/plugin marketplace add fubak/ultraswarm
/plugin install ultraswarm@ultraswarm
Invoke: /ultraswarm <task>
For Standalone CLI:
node ~/projects/ultraswarm/bin/ultraswarm.mjs --decompose "<task>" --yes
🧪 Quality Assurance
- ✅ 99/99 tests passing - Full test suite coverage
- ✅ 15/15 validation checks green - Comprehensive validation
- ✅ End-to-end compatibility verified - Live testing across platforms
- ✅ Robust error handling - Graceful failure modes with actionable guidance
📋 Architecture Improvements
- Skill-based integration: Cleaner, more maintainable Codex integration
- Enhanced documentation: Clear usage patterns for different platforms
- Improved validation: Comprehensive checks for cross-platform compatibility
- Better error handling: Clear guidance for common installation issues
This release makes ultraswarm more accessible and reliable across all supported platforms while maintaining the same powerful orchestration capabilities.
ultraswarm v2.4.2 — High-Risk Path Hardening (closes #13, #14)
The high-risk competition/escalation path now works under the documented config shape and fails cleanly. Verified with two live end-to-end runs.
Fixed
- #13: high-risk tasks no longer crash with "CLI name must be a non-empty string" when a worker fails early with no alternate, and retries no longer die with "a branch named … already exists". The competition/fallback paths gate on cli usability (a known worker resolvable via `DEFAULT_REGISTRY`/`overrides`, or an explicit `registry` entry) instead of `cfg.registry` alone, so high-risk tasks actually run under the documented `enabled`/`overrides` config (they previously always tombstoned). A missing/self alternate tombstones cleanly; stale worktree branches are pruned before re-creation.
- #14: a dependent of a failed high-risk task is blocked across waves and every task appears in the final report.
Added
- High-risk integration tests + two live runs through `bin` (this path never ran live before): a failing high-risk task with a blocked dependent (no crash, complete report), and the full happy path: `competing on codex vs grok` → live Sonnet judge → 3-lens Opus adversarial QA → `merged ✓`. 99 tests.
ultraswarm v2.4.1 — Runner Hardening (closes #6–#12)
The standalone runner now works end-to-end through its CLI entry path, with every runner issue (#6–#12) closed and the bin seam under test. Started from a grok-CLI WIP branch (that made the runner executable); this finishes the job.
Fixed
- #6: `--decompose` produces valid plans (`model_tier`/`risk` enums + CLI roster in the prompt, plus normalization so `model_tier:"haiku"` → `simple`, `risk:"low"` → `routine`). The documented `enabled` + `overrides` config shape resolves worker commands; no hand-crafted `registry` needed.
- #7: external workers get the clean task prompt, not the orchestration wrapper.
- #8: worker launch failures classified (auth/transport/not-installed/timeout) with actionable hints (worker grok failed (auth) — run `grok login`); worktree-auth limitation documented.
- #9: no-op / scaffolding-only worker output can no longer pass review or merge.
- #10: dependents of a failed task are reported `blocked (dependency X did not merge)` and never run blind; cascades across waves.
- #11: reports show per-task attempts, a merged/failed/blocked summary with success rate, and token-capture coverage.
- #12: host scaffolding (`.ultraswarm-plan.json`, config, `.ultraswarm/`, `.grok/`) no longer leaks into feature commits (`mergeWave` drops the redundant `git add -A`); `.gitignore` updated.
- Silent-task-loss guard: an unknown CLI returns a loud `cli_failed` instead of throwing; `bin` prints a clean error + exit 1 on an invalid plan instead of a stack trace.
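The cascading block behavior from #10 can be modeled as a fixed-point pass over the plan. The sketch below is illustrative only: the `dependsOn` field name and the `blockedTasks` helper are assumptions, and the real runner may track this state wave by wave rather than in one pass.

```javascript
// Hypothetical sketch: find every task that must be reported as blocked.
// tasks:  [{ id, dependsOn?: string[] }]
// failed: Set of task ids that did not merge
function blockedTasks(tasks, failed) {
  const blocked = new Map(); // task id -> the dependency that did not merge
  let changed = true;
  while (changed) {
    changed = false;
    for (const t of tasks) {
      if (failed.has(t.id) || blocked.has(t.id)) continue;
      const bad = (t.dependsOn || []).find((d) => failed.has(d) || blocked.has(d));
      if (bad !== undefined) {
        blocked.set(t.id, bad); // transitively blocked tasks are picked up on later passes
        changed = true;
      }
    }
  }
  return blocked;
}
```

For a chain a → b → c where `a` fails, this marks `b` as blocked on `a` and `c` as blocked on `b`, while unrelated tasks are left runnable.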
Added
- End-to-end-through-`bin` seam tests (the coverage the v2.4.0 break slipped through), +13 tests overall (96 total).
Verified live: a real task runs worktree → worker → gates → live claude QA review → merge, the report shows the new metrics, and host scaffolding stays out of the commit.
ultraswarm v2.4.0 — Portable Host Runner
Portability release: ultraswarm now runs two co-equal ways — as the Claude Code /ultraswarm skill, or as a standalone CLI hosted from Codex, Grok, or any shell (no Claude Code required). Same orchestration core, identical behaviour; the standalone runner just trades the live /workflows UI for portability.
Added
- Standalone host runner (`bin/ultraswarm.mjs` + `lib/`). A host-supplied (or fallback-decomposed) plan JSON runs through dependency waves → implement → adaptive QA → merge → report. Shares a host-agnostic pure core with the skill (`router.mjs` reused; QA cascade/competition lifted from `SKILL.md`, proven byte-for-byte by a parity harness). Impl wrappers are plain subprocesses; only the brain roles call an LLM.
  - Flags: `--plan-file <json>` · `--decompose "<task>"` (fallback) · `--yes` · `--resume <id>` (journaled).
  - Plan contract rejects unknown CLIs, bad tiers, dependency cycles, and unsafe task ids.
  - `hosts/codex/AGENTS.md` + `hosts/grok/ultraswarm.md` launchers.
- `claude -p` brain adapter. The runner's brain defaults to your local authenticated `claude` CLI: no `ANTHROPIC_API_KEY`, no separate API billing, reusing your Claude Code auth. Falls back to the raw Anthropic API when `claude` isn't on `PATH`. Override with `ULTRASWARM_BRAIN=claude-cli | anthropic-api`. Live-smoked against claude 2.1.175.
- `package.json` + deps (`@anthropic-ai/sdk`, `ajv`); CI runs `npm ci`; `validate.sh` check [12] parses `bin/` + `lib/`.
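Dependency waves and cycle rejection, both mentioned above, fit together naturally: a plan whose remaining tasks never become ready must contain a cycle. The following is a minimal sketch of that idea, assuming a `dependsOn` array per task; the field name, function name, and error messages are hypothetical rather than taken from the runner.

```javascript
// Hypothetical sketch: group tasks into waves so each task runs after its dependencies.
function dependencyWaves(tasks) {
  const ids = new Set(tasks.map((t) => t.id));
  for (const t of tasks) {
    for (const d of t.dependsOn || []) {
      if (!ids.has(d)) throw new Error(`task ${t.id} depends on unknown task ${d}`);
    }
  }

  const done = new Set();
  const waves = [];
  let remaining = tasks.slice();
  while (remaining.length > 0) {
    // A task is ready once every dependency landed in an earlier wave.
    const ready = remaining.filter((t) => (t.dependsOn || []).every((d) => done.has(d)));
    if (ready.length === 0) throw new Error("dependency cycle detected");
    waves.push(ready.map((t) => t.id));
    for (const t of ready) done.add(t.id);
    remaining = remaining.filter((t) => !done.has(t.id));
  }
  return waves;
}
```

Tasks in the same wave have no dependencies on each other, so in this model they are the ones that could be handed to workers in parallel.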
Fixed
- Command-injection hardening (two security reviews): git plumbing on plan-derived values uses `execFileSync` + argv + `--`; task ids charset-validated at the boundary.
- Brain tier→model-id resolution (caught by the final review): QA/judge/lens calls resolve tier labels to real model ids before hitting the brain.
- README accuracy pass + concrete Codex/Grok/shell run instructions.
Built TDD via subagent-driven development (18 tasks + hardening + 2 review-caught fixes). 83 tests, validate.sh 12/12, proof-of-life verified end-to-end.
ultraswarm v2.3.0 — Claude-Model Token Optimization
Token-optimizes ultraswarm's internal Claude-model usage — the part you actually pay for — without losing quality. Informed by a deep analysis of the skill + router against the state of the art in LLM model routing (RouteLLM, NotDiamond, FrugalGPT cascades, GPT-5 router, Claude effort); the design already matched the dominant patterns, and this release sharpens it.
Changed
- Per-phase routing is now real, not aspirational. Phases 3 (merge) and 4 (report) delegate mechanical work to `Agent({ model: 'haiku' })` subagents (merge escalates to `sonnet` only on conflict). The old "Use Haiku for merge/report" note was inert: inline phases run on the session model (typically Opus) and a skill can't downshift its own main loop, so mechanical work was billed at Opus rates. This is the dominant share of a routine run's ~70–80k tokens.
- High-risk adversarial QA → cost-aware cascade (FrugalGPT-style). Security lens always Opus (asymmetric risk); correctness/regression run Sonnet-first and escalate to Opus only on refute/borderline (`<75`). Quorum (≥2), score (≥60), and zero-critical-refutation guarantees unchanged. Cuts most of the ~250–550k high-risk path on clean work.
- Trimmed `enhancedImplPrompt` roughly in half; the Bash-only wrapper never needed the intelligence scaffolding.
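The cascade rules above can be written down as three small predicates. This is a sketch under assumptions, not the logic embedded in `SKILL.md`: the verdict shape (`model`, `score`, `refuted`, `critical`) is invented, and treating the ≥60 score rule as a mean across lenses is a guess.

```javascript
// Hypothetical sketch of the cost-aware QA cascade for high-risk tasks.
const LENSES = ["security", "correctness", "regression"];

// The security lens always starts on Opus; the other lenses try Sonnet first.
function firstModel(lens) {
  return lens === "security" ? "opus" : "sonnet";
}

// A Sonnet verdict is re-run on Opus when it refutes the work or scores below 75.
function shouldEscalate(verdict) {
  return verdict.model === "sonnet" && (verdict.refuted || verdict.score < 75);
}

// Final gate over one verdict per lens: quorum of 2, score of at least 60
// (assumed here to be the mean), and no critical refutation.
function passesGate(verdicts) {
  const approvals = verdicts.filter((v) => !v.refuted).length;
  const meanScore = verdicts.reduce((sum, v) => sum + v.score, 0) / verdicts.length;
  const hasCritical = verdicts.some((v) => v.refuted && v.critical);
  return approvals >= 2 && meanScore >= 60 && !hasCritical;
}
```

The saving comes from the middle function: on clean work, the correctness and regression lenses clear the 75 threshold on Sonnet and never reach Opus.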
Added
- Fable 5 as an opt-in ceiling via `intelligence.maxIntelligence` (default off). Flips only the security lens + expert-escalation Opus→Fable. Out of the hot path by default (Fable ≈ +30% tokens + premium price). `fable` is now a valid `claudeModels` value.
Fixed
- `router.mjs`: clarified that `complexityThresholds.expert` is a validation ordering anchor only; `getTier` never reads it. Validation message now lists `fable`.
Verification: router 18/18, harness 17/17, validate.sh 11/11.
ultraswarm v2.2.0 — Behavioral CI + Machine-Readable Gates
A small, sharp release: the orchestration logic is now behaviorally tested in CI, and the validator speaks JSON. Both additions were produced or hardened by the swarm itself.
What's new
🧪 Workflow behavior harness (CI check [11])
scripts/workflow-harness.test.mjs — 16 node:test cases that extract the actual Workflow JS from SKILL.md and run it with mocked agent primitives, covering model-tier routing, adaptive QA depths, quorum and critical-refutation rules, tier escalation, exhaustion/tombstones, task immutability, and the dependency-wave guard. The embedded orchestration logic is now behaviorally tested on every push, not just parse-checked — a QA-gate regression breaks CI before it can burn tokens in a live run.
📋 validate.sh --json
Emits per-check results as a JSON array of {check, name, pass, detail} for CI dashboards and tooling; default output and exit codes are unchanged. Built by the swarm (grok, 2 attempts): the routine-tier QA review rejected attempt 1 for unescaped node -e interpolation and newline-unsafe JSON escaping — both real bugs — and attempt 2 fixed them with JSON.stringify-based escaping.
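A CI dashboard or script could consume that array along the lines of the sketch below. Only the four field names come from these notes; the `summarize` helper and the sample records are made up for illustration, and real check names and details will differ.

```javascript
// Hypothetical consumer for `validate.sh --json` output.
// Expects a JSON array of { check, name, pass, detail } objects.
function summarize(jsonText) {
  const results = JSON.parse(jsonText);
  const failed = results.filter((r) => !r.pass);
  return {
    total: results.length,
    passed: results.length - failed.length,
    failures: failed.map((r) => `[${r.check}] ${r.name}: ${r.detail}`),
  };
}

// Invented sample data; JSON.stringify handles quotes and newlines in `detail`.
const sample = JSON.stringify([
  { check: 1, name: "example check A", pass: true, detail: "" },
  { check: 2, name: "example check B", pass: false, detail: "line one\nline \"two\"" },
]);
```

Building the sample with `JSON.stringify` mirrors the escaping fix described above: details containing quotes or newlines survive the round trip through `JSON.parse`.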
📚 README rewritten for v2.1+ reality
Every claim now traces to something measured or exercised in the live validation: dependency waves, both config override forms (flat + tiered), adaptive QA with the quorum/critical rules, the verified model-tier table with the model-ID-drift warning, measured cost calibration (the unmeasured "40–70% savings" claim is gone), the analyze mode, and a new troubleshooting entry for the hangs-on-bad-model-ID failure mode.
Upgrade
/plugin marketplace update ultraswarm
Then /reload-plugins or a new session. Full details in CHANGELOG.md.