Releases: fubak/ultraswarm

v3.3.0 — SmallHarness worker

15 Jun 10:56
ee75498

What's new

Added: small-harness built-in worker

SmallHarness is now a first-class ultraswarm worker — a terminal-first Rust coding agent with native MCP integration, multi-backend support (OpenAI, OpenRouter, Ollama, LM Studio, MLX, llama.cpp), and real-time cost tracking.

Default tier mapping:

Tier      Backend     Model
simple    OpenAI      gpt-4o-mini
moderate  OpenRouter  claude-sonnet-4-6
complex   OpenRouter  claude-opus-4-8
expert    OpenRouter  claude-opus-4-8

Backend and model are injected via BACKEND / AGENT_MODEL environment variables. Add OPENAI_API_KEY and OPENROUTER_API_KEY to workerEnvAllowlist to pass credentials through. Override any tier via the standard overrides config key.
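
The injection described above could look roughly like the following sketch. The names DEFAULT_TIERS and buildWorkerEnv, the lowercase backend identifiers, and the per-tier shape of overrides are assumptions made for illustration; only the BACKEND / AGENT_MODEL variables, the allowlist idea, and the tier table come from these notes.

```javascript
// Hypothetical tier table mirroring the defaults listed above.
// The lowercase backend strings are an assumption.
const DEFAULT_TIERS = {
  simple: { backend: 'openai', model: 'gpt-4o-mini' },
  moderate: { backend: 'openrouter', model: 'claude-sonnet-4-6' },
  complex: { backend: 'openrouter', model: 'claude-opus-4-8' },
  expert: { backend: 'openrouter', model: 'claude-opus-4-8' },
};

// Build the environment for one worker launch. Only allowlisted host
// variables are forwarded; BACKEND and AGENT_MODEL come from the tier.
function buildWorkerEnv(tier, hostEnv, allowlist, overrides = {}) {
  const route = { ...DEFAULT_TIERS[tier], ...(overrides[tier] || {}) };
  if (!route.backend || !route.model) {
    throw new Error(`unknown tier: ${tier}`);
  }
  const env = {};
  for (const name of allowlist) {
    if (hostEnv[name] !== undefined) env[name] = hostEnv[name];
  }
  env.BACKEND = route.backend;
  env.AGENT_MODEL = route.model;
  return env;
}

// Example: credentials pass through only because they are allowlisted.
const exampleEnv = buildWorkerEnv(
  'moderate',
  { OPENROUTER_API_KEY: 'sk-example', UNRELATED_SECRET: 'x' },
  ['OPENAI_API_KEY', 'OPENROUTER_API_KEY'],
);
console.log(Object.keys(exampleEnv).sort());
```

Building the worker environment from an empty object, rather than copying the host environment and deleting keys, keeps anything not explicitly allowlisted out of the worker by construction.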

Also includes a SmallHarness host skill — a SKILL.md enabling SmallHarness sessions to invoke ultraswarm as an orchestration tool in the other direction.

Bug fixed

The previous invocation template (from the in-development registry entry) used unrecognised CLI flags (--backend, --model, --approval never) and omitted --allow-tools, causing all tool calls to be silently denied. Fixed in this release.

v3.2.1 — forbiddenPaths + alias-pin fixes, e2e harness, coverage

14 Jun 00:38
a9e89f0

Patch release: two bug fixes surfaced by a new end-to-end test harness, plus a comprehensive test-coverage lift.

Fixed

  • forbiddenPaths bypass via new directories (security). The implement step listed worker output with git status --porcelain, which collapses a brand-new untracked directory to dir/. A worker writing a forbidden file into a new subdirectory (e.g. vault/leak.secret) was reported as vault/, slipped past the forbiddenPaths glob, and integrated. Now uses -uall so files are listed individually and enforced correctly. (Regression test proven to fail without the fix.)
  • Aliases could not be pinned in a plan. validatePlan validated task.cli against the built-in registry only, rejecting user-defined alias names even though routing supports explicit alias selection. It now validates against the effective registry (built-ins + configured aliases).
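
To see why the collapsed directory entry slipped past the forbiddenPaths glob, here is a minimal sketch that checks git status --porcelain output against a forbidden glob. The helper names (globToRegExp, changedPaths, forbiddenHits) and the tiny glob matcher, which handles only * and a **/ prefix, are illustrative assumptions, not the project's code.

```javascript
// Minimal glob matcher: supports "*" within a segment and "**/" for any
// number of leading directories. Illustrative only, not a full glob engine.
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+^${}()|[\]\\?]/g, '\\$&');
  const pattern = escaped
    .replace(/\*\*\//g, '\u0000') // placeholder so the next step skips it
    .replace(/\*/g, '[^/]*')
    .replace(/\u0000/g, '(?:.*/)?');
  return new RegExp(`^${pattern}$`);
}

// Porcelain v1 lines look like "XY path"; renames look like "R  old -> new".
function changedPaths(porcelain) {
  return porcelain
    .split('\n')
    .filter((line) => line.length > 3)
    .map((line) => line.slice(3).split(' -> ').pop());
}

function forbiddenHits(porcelain, forbiddenGlobs) {
  const patterns = forbiddenGlobs.map(globToRegExp);
  return changedPaths(porcelain).filter((p) => patterns.some((re) => re.test(p)));
}

// Default output collapses a new untracked directory to "vault/".
const collapsed = '?? vault/\n';
// With -uall (--untracked-files=all) each file is listed individually.
const expanded = '?? vault/leak.secret\n';

console.log(forbiddenHits(collapsed, ['**/*.secret'])); // []
console.log(forbiddenHits(expanded, ['**/*.secret'])); // [ 'vault/leak.secret' ]
```

The glob never sees the file name in the collapsed form, so no pattern on file names can match it. Listing files individually restores the input the check needs.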

Added

  • Deterministic, network-free end-to-end test harness driving the real runner in-process (ULTRASWARM_BRAIN=mock seam + fake worker fixtures): complex multi-wave run → status → logs → export → merge, plus approval-gate and failure/retry paths.
  • Broad unit-coverage lift: suite 178 → 309 tests; overall line 90.6% → 96.8%, branch 77.3% → 87.3%.

Full diff: v3.2.0...v3.2.1

v3.2.0 — User-defined harness aliases

13 Jun 21:54
b87dbd5

User-defined harness aliases

Register your own CLI entries in config under a new top-level aliases key — generalizing the previously hardcoded pi-local. Configure the same CLI binary with different models, run multiple local LLMs each tuned for a job, and keep local-model harnesses lean.

{
  "enabled": ["codex", "pi-qwen-coder"],
  "aliases": {
    "pi-qwen-coder": {
      "extends": "pi",
      "specialty": "local coding, small refactors, unit tests",
      "maxTier": "moderate",
      "models": {
        "simple": { "model": "qwen3-coder:7b", "invocation": "pi -p --provider ollama --model qwen3-coder:7b --config ~/.pi/lean.json \"$(cat .ultraswarm-prompt.txt)\"" }
      }
    }
  }
}
  • extends inherits the base CLI's binary, timeout, effort flags, and capabilities; override only what differs.
  • Lean harness lives in the invocation string (--config, fewer flags) — local models often do better with less wrapping.
  • maxTier caps the tier an alias accepts; higher-tier tasks are clamped down so a small local model is never handed expert work.
  • Strictly opt-in — with no aliases, behavior is byte-identical to before.

Built on a new buildRegistry(config) seam (frozen built-ins + resolved aliases); resolveRoute, the worker manager, runner routing, and the decomposition roster all consult it. Full validation of alias entries. 173 tests, all 15 release checks green.
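
As a rough illustration of the seam, the sketch below merges frozen built-ins with aliases resolved through extends, and clamps a requested tier to an alias's maxTier. The built-in entries, the timeoutMs field, and the clampTier helper are assumptions; the real buildRegistry(config) is not shown in these notes and may differ.

```javascript
const TIERS = ['simple', 'moderate', 'complex', 'expert'];

// Stand-in built-ins; the real registry entries are not reproduced here.
const BUILT_INS = Object.freeze({
  pi: Object.freeze({ binary: 'pi', timeoutMs: 600000, models: {} }),
  codex: Object.freeze({ binary: 'codex', timeoutMs: 600000, models: {} }),
});

// Effective registry = frozen built-ins + aliases resolved against their base.
function buildRegistry(config = {}) {
  const registry = { ...BUILT_INS };
  for (const [name, alias] of Object.entries(config.aliases || {})) {
    const base = BUILT_INS[alias.extends];
    if (!base) {
      throw new Error(`alias "${name}" extends unknown CLI "${alias.extends}"`);
    }
    const { extends: _base, ...own } = alias;
    // Inherit everything from the base, then override only what differs.
    registry[name] = Object.freeze({ ...base, ...own });
  }
  return Object.freeze(registry);
}

// Higher-tier work is clamped down to the alias's cap.
function clampTier(requested, maxTier) {
  if (!maxTier) return requested;
  return TIERS.indexOf(requested) > TIERS.indexOf(maxTier) ? maxTier : requested;
}

const registry = buildRegistry({
  aliases: {
    'pi-qwen-coder': {
      extends: 'pi',
      maxTier: 'moderate',
      models: { simple: { model: 'qwen3-coder:7b' } },
    },
  },
});
console.log(registry['pi-qwen-coder'].binary); // pi
console.log(clampTier('expert', registry['pi-qwen-coder'].maxTier)); // moderate
```

With no aliases key the loop body never runs, so the result contains exactly the built-ins, which matches the opt-in behavior described above.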

Full diff: v3.1.0...v3.2.0

v3.1.0 — Pi workers + per-task effort levels

13 Jun 19:18

New worker integrations and a per-task reasoning-effort axis.

Added

  • pi worker — the provider-agnostic pi coding CLI (Anthropic Claude spread: Haiku → Sonnet → Opus → Opus --thinking high). Headless via pi -p.
  • pi-local worker — always-on local/private worker driving Ollama models through the same pi binary, for fully offline-capable runs. (Requires a configured ollama provider and a local model that emits structured tool-calls — see README.)
  • Optional registry binary field so a logical worker can map to a different executable (pi-local → pi).
  • Per-task effort levels — the decomposition brain assigns effort (off/low/medium/high/xhigh) per task, independent of model tier, defaulting to low. Injected per-CLI for codex/droid/pi via an effortFlags map + {{EFFORT}} slot.
  • Effort-first escalation — on QA failure the attempt loop climbs effort (low → medium → high) before stepping up the model tier. Routine tasks climb effort within their tier; high-risk/complex tasks use the full effort-then-tier ladder.
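
The effort-first ladder can be pictured with a small sketch. The function name nextAttempt, the allowTierStep flag, and the choice to keep effort at its current level after a tier step are all assumptions; the notes only state that effort climbs low → medium → high before the tier does, and that routine tasks stay within their tier.

```javascript
const EFFORTS = ['low', 'medium', 'high'];
const TIERS = ['simple', 'moderate', 'complex', 'expert'];

// Returns the next { tier, effort } to try after a QA failure, or null when
// the ladder is exhausted. Pass allowTierStep = false for routine tasks.
function nextAttempt({ tier, effort }, allowTierStep) {
  const e = EFFORTS.indexOf(effort);
  // First climb effort within the current tier.
  if (e >= 0 && e < EFFORTS.length - 1) {
    return { tier, effort: EFFORTS[e + 1] };
  }
  // Effort is exhausted (or pinned outside the ladder): step up the tier.
  const t = TIERS.indexOf(tier);
  if (allowTierStep && t >= 0 && t < TIERS.length - 1) {
    // Assumption: effort is carried over rather than reset on a tier step.
    return { tier: TIERS[t + 1], effort };
  }
  return null;
}

// Walk the full ladder for a high-risk task starting at complex/low.
let attempt = { tier: 'complex', effort: 'low' };
while (attempt) {
  console.log(`${attempt.tier} @ ${attempt.effort}`);
  attempt = nextAttempt(attempt, true);
}
```

In this sketch a task pinned to an effort outside the low/medium/high ladder (off or xhigh) skips the effort climb and goes straight to the tier step, which is one possible reading rather than documented behavior.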

Behavior change

Because effort defaults to low and is decoupled from tier, expert-tier tasks now run the expert model at low effort and escalate on QA failure — no longer pinned to high. Pin a task with effort: "high" for maximum reasoning up front.

Validation

146 tests pass; repository validation and host-skill provenance lock green. Verified with live end-to-end runs: pi worker end-to-end, effort injection (codex/pi), and effort-first escalation on both the routine and intelligent paths.

v2.4.3 - Enhanced Codex Integration & Native Skill Architecture

13 Jun 03:32

Enhanced Codex Integration & Native Skill Architecture

This release introduces a proper skill-based integration for Codex CLI and improves cross-platform compatibility with enhanced documentation and validation.

🎯 Key Improvements

Native Codex Integration:

  • ✅ Proper installable skill for ~/.agents/skills/ultraswarm
  • ✅ Dedicated installation script with symlink-based auto-updating
  • ✅ Deprecated legacy AGENTS.md approach for better maintainability

Enhanced Validation & Documentation:

  • ✅ Added comprehensive validation checks for Codex skill contract and installer
  • ✅ Updated README with clear distinction between Claude Code and Codex usage
  • ✅ Cross-platform compatibility improvements with robust error handling

🔧 Installation

For Codex CLI (NEW):

git clone https://github.com/fubak/ultraswarm.git ~/projects/ultraswarm
cd ~/projects/ultraswarm && npm install
bash scripts/install-codex-skill.sh

Restart Codex, then invoke: $ultraswarm <task>

For Claude Code:

/plugin marketplace add fubak/ultraswarm  
/plugin install ultraswarm@ultraswarm

Invoke: /ultraswarm <task>

For Standalone CLI:

node ~/projects/ultraswarm/bin/ultraswarm.mjs --decompose "<task>" --yes

🧪 Quality Assurance

  • 99/99 tests passing - Full test suite coverage
  • 15/15 validation checks green - Comprehensive validation
  • End-to-end compatibility verified - Live testing across platforms
  • Robust error handling - Graceful failure modes with actionable guidance

📋 Architecture Improvements

  • Skill-based integration: Cleaner, more maintainable Codex integration
  • Enhanced documentation: Clear usage patterns for different platforms
  • Improved validation: Comprehensive checks for cross-platform compatibility
  • Better error handling: Clear guidance for common installation issues

This release makes ultraswarm more accessible and reliable across all supported platforms while maintaining the same powerful orchestration capabilities.

ultraswarm v2.4.2 — High-Risk Path Hardening (closes #13, #14)

12 Jun 23:11
ed7a6d3

The high-risk competition/escalation path now works under the documented config shape and fails cleanly. Verified with two live end-to-end runs.

Fixed

  • #13 — high-risk tasks no longer crash with "CLI name must be a non-empty string" when a worker fails early with no alternate, and retries no longer die with "a branch named … already exists". The competition/fallback paths gate on cli usability (a known worker resolvable via DEFAULT_REGISTRY/overrides, or an explicit registry entry) instead of cfg.registry alone — so high-risk tasks actually run under the documented enabled/overrides config (they previously always tombstoned). A missing/self alternate tombstones cleanly; stale worktree branches are pruned before re-creation.
  • #14 — a dependent of a failed high-risk task is blocked across waves and every task appears in the final report.

Added

  • High-risk integration tests + two live runs through bin (this path never ran live before): a failing high-risk task with a blocked dependent (no crash, complete report), and the full happy path — competing on codex vs grok → live Sonnet judge → 3-lens Opus adversarial QA → merged ✓. 99 tests.

ultraswarm v2.4.1 — Runner Hardening (closes #6–#12)

12 Jun 20:05
ec22036

The standalone runner now works end-to-end through its CLI entry path, with every runner issue (#6–#12) closed and the bin seam under test. Started from a grok-CLI WIP branch (that made the runner executable); this finishes the job.

Fixed

  • #6 — --decompose produces valid plans (model_tier/risk enums + CLI roster in the prompt, plus normalization so model_tier:"haiku" → simple, risk:"low" → routine). The documented enabled+overrides config shape resolves worker commands, so no hand-crafted registry is needed.
  • #7 — external workers get the clean task prompt, not the orchestration wrapper.
  • #8 — worker launch failures classified (auth/transport/not-installed/timeout) with actionable hints (worker grok failed (auth) — run `grok login` ); worktree-auth limitation documented.
  • #9 — no-op / scaffolding-only worker output can no longer pass review or merge.
  • #10 — dependents of a failed task are reported blocked (dependency X did not merge) and never run blind; cascades across waves.
  • #11 — reports show per-task attempts, a merged/failed/blocked summary with success rate, and token-capture coverage.
  • #12 — host scaffolding (.ultraswarm-plan.json, config, .ultraswarm/, .grok/) no longer leaks into feature commits (mergeWave drops the redundant git add -A); .gitignore updated.
  • Silent-task-loss guard — an unknown CLI returns a loud cli_failed instead of throwing; bin prints a clean error + exit 1 on an invalid plan instead of a stack trace.
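
The blocked-dependent cascade from #10 can be sketched as a fixed-point pass over the task graph. The function name findBlocked and the dependsOn field are hypothetical; only the "dependency X did not merge" wording and the cross-wave cascade come from these notes.

```javascript
// tasks: [{ id, dependsOn: [...] }] in any order.
// failed: ids of tasks that did not merge.
// Returns a Map of blocked task id -> reason.
function findBlocked(tasks, failed) {
  const unmerged = new Set(failed);
  const blocked = new Map();
  let changed = true;
  // Repeat until no new task becomes blocked, so the block cascades
  // through any number of waves regardless of task order.
  while (changed) {
    changed = false;
    for (const task of tasks) {
      if (unmerged.has(task.id)) continue;
      const culprit = (task.dependsOn || []).find((dep) => unmerged.has(dep));
      if (culprit !== undefined) {
        unmerged.add(task.id);
        blocked.set(task.id, `dependency ${culprit} did not merge`);
        changed = true;
      }
    }
  }
  return blocked;
}

const blockedTasks = findBlocked(
  [
    { id: 'c', dependsOn: ['b'] },
    { id: 'a', dependsOn: [] },
    { id: 'b', dependsOn: ['a'] },
    { id: 'd', dependsOn: [] },
  ],
  ['a'],
);
console.log([...blockedTasks.keys()].sort()); // [ 'b', 'c' ]
```

Treating a blocked task as unmerged is what makes the cascade transitive: c never depended on a directly, yet it is blocked because b could not merge.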

Added

  • End-to-end-through-bin seam tests (the coverage the v2.4.0 break slipped through), +13 tests overall (96 total).

Verified live: a real task runs worktree → worker → gates → live claude QA review → merge, the report shows the new metrics, and host scaffolding stays out of the commit.

ultraswarm v2.4.0 — Portable Host Runner

12 Jun 18:12

Portability release: ultraswarm now runs two co-equal ways — as the Claude Code /ultraswarm skill, or as a standalone CLI hosted from Codex, Grok, or any shell (no Claude Code required). Same orchestration core, identical behaviour; the standalone runner just trades the live /workflows UI for portability.

Added

  • Standalone host runner (bin/ultraswarm.mjs + lib/). A host-supplied (or fallback-decomposed) plan JSON runs through dependency waves → implement → adaptive QA → merge → report. Shares a host-agnostic pure core with the skill (router.mjs reused; QA cascade/competition lifted from SKILL.md, proven byte-for-byte by a parity harness). Impl wrappers are plain subprocesses — only the brain roles call an LLM.
    • Flags: --plan-file <json> · --decompose "<task>" (fallback) · --yes · --resume <id> (journaled).
    • Plan contract rejects unknown CLIs, bad tiers, dependency cycles, and unsafe task ids.
    • hosts/codex/AGENTS.md + hosts/grok/ultraswarm.md launchers.
  • claude -p brain adapter — the runner's brain defaults to your local authenticated claude CLI: no ANTHROPIC_API_KEY, no separate API billing, reusing your Claude Code auth. Falls back to the raw Anthropic API when claude isn't on PATH. Override with ULTRASWARM_BRAIN=claude-cli | anthropic-api. Live-smoked against claude 2.1.175.
  • package.json + deps (@anthropic-ai/sdk, ajv); CI runs npm ci; validate.sh check [12] parses bin/+lib/.
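
The four rejections in the plan contract (unknown CLIs, bad tiers, dependency cycles, unsafe task ids) could be checked along these lines. This is a sketch under assumptions: the dependsOn field name, the id charset, and the error strings are illustrative, and the actual validator in the runner may be structured differently.

```javascript
const VALID_TIERS = new Set(['simple', 'moderate', 'complex', 'expert']);
const SAFE_ID = /^[A-Za-z0-9][A-Za-z0-9_-]*$/; // assumed charset

// Returns a list of human-readable errors; an empty list means the plan passes.
function validatePlan(plan, knownClis) {
  const errors = [];
  const byId = new Map();
  for (const task of plan.tasks) {
    if (typeof task.id !== 'string' || !SAFE_ID.test(task.id)) {
      errors.push(`unsafe task id: ${JSON.stringify(task.id)}`);
    }
    if (!knownClis.includes(task.cli)) errors.push(`unknown CLI: ${task.cli}`);
    if (!VALID_TIERS.has(task.model_tier)) errors.push(`bad tier: ${task.model_tier}`);
    byId.set(task.id, task);
  }

  // Depth-first search; reaching a node that is still "visiting" means a cycle.
  const state = new Map();
  const hasCycle = (id) => {
    if (state.get(id) === 'done') return false;
    if (state.get(id) === 'visiting') return true;
    state.set(id, 'visiting');
    const deps = byId.get(id)?.dependsOn || [];
    const cyclic = deps.some((dep) => hasCycle(dep));
    state.set(id, 'done');
    return cyclic;
  };
  for (const id of byId.keys()) {
    if (hasCycle(id)) {
      errors.push(`dependency cycle through ${id}`);
      break;
    }
  }
  return errors;
}

const planErrors = validatePlan(
  {
    tasks: [
      { id: 'a', cli: 'codex', model_tier: 'simple', dependsOn: ['b'] },
      { id: 'b', cli: 'codex', model_tier: 'simple', dependsOn: ['a'] },
    ],
  },
  ['codex', 'grok'],
);
console.log(planErrors);
```

Collecting every error instead of throwing on the first lets a host fix a bad plan in one pass.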

Fixed

  • Command-injection hardening (two security reviews): git plumbing on plan-derived values uses execFileSync + argv + --; task ids charset-validated at the boundary.
  • Brain tier→model-id resolution (caught by the final review): QA/judge/lens calls resolve tier labels to real model ids before hitting the brain.
  • README accuracy pass + concrete Codex/Grok/shell run instructions.
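
The hardening pattern named above (execFileSync with an argv array and a -- separator, plus charset validation at the boundary) can be sketched as follows. The helper names, the id charset, and the choice of git add as the example command are assumptions for illustration. The dynamic import is used only so the snippet runs as either a CommonJS or an ES module.

```javascript
const SAFE_TASK_ID = /^[A-Za-z0-9][A-Za-z0-9_-]*$/; // assumed charset

// Validate once, at the boundary where plan data enters the runner.
function assertSafeTaskId(id) {
  if (typeof id !== 'string' || !SAFE_TASK_ID.test(id)) {
    throw new Error(`unsafe task id: ${JSON.stringify(id)}`);
  }
  return id;
}

// Build argv for `git add`. Everything after "--" is treated as a path,
// so a value such as "--force" cannot be read as an option.
function gitAddArgs(paths) {
  return ['add', '--', ...paths];
}

// execFileSync passes argv directly to the process without a shell, so
// metacharacters like ";" or "$(...)" in a path are never interpreted.
async function gitAdd(cwd, paths) {
  const { execFileSync } = await import('node:child_process');
  return execFileSync('git', gitAddArgs(paths), { cwd, encoding: 'utf8' });
}

console.log(gitAddArgs(['--force; rm -rf ~']));
console.log(assertSafeTaskId('task-01'));
```

The two defenses are independent: argv plus -- protects values that must remain free-form (paths), while charset validation covers values that end up in branch or worktree names.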

Built TDD via subagent-driven development (18 tasks + hardening + 2 review-caught fixes). 83 tests, validate.sh 12/12, proof-of-life verified end-to-end.

ultraswarm v2.3.0 — Claude-Model Token Optimization

12 Jun 14:59
fc65762

Token-optimizes ultraswarm's internal Claude-model usage — the part you actually pay for — without losing quality. Informed by a deep analysis of the skill + router against the state of the art in LLM model routing (RouteLLM, NotDiamond, FrugalGPT cascades, GPT-5 router, Claude effort); the design already matched the dominant patterns, and this release sharpens it.

Changed

  • Per-phase routing is now real, not aspirational. Phases 3 (merge) and 4 (report) delegate mechanical work to Agent({ model: 'haiku' }) subagents (merge escalates to sonnet only on conflict). The old "Use Haiku for merge/report" note was inert — inline phases run on the session model (typically Opus) and a skill can't downshift its own main loop, so mechanical work was billed at Opus rates. This is the dominant share of a routine run's ~70–80k tokens.
  • High-risk adversarial QA → cost-aware cascade (FrugalGPT-style). Security lens always Opus (asymmetric risk); correctness/regression run Sonnet-first and escalate to Opus only on refute/borderline (<75). Quorum (≥2), score (≥60), and zero-critical-refutation guarantees unchanged. Cuts most of the ~250–550k high-risk path on clean work.
  • Trimmed enhancedImplPrompt roughly in half: the Bash-only wrapper never needed the intelligence scaffolding.
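
The cost-aware cascade for high-risk QA might be sketched as below. The review callback and its result shape ({ verdict, score, critical }) are hypothetical, and the way quorum and the score floor combine (a lens counts toward quorum only when it passes with a score of at least 60) is one possible reading of the thresholds quoted above, not the skill's confirmed logic.

```javascript
// review(lens, model) is a caller-supplied async function (hypothetical)
// resolving to { verdict: 'pass' | 'refute', score: number, critical: boolean }.
async function runLens(lens, review) {
  // Security is never downshifted: a miss there is the costly error.
  if (lens === 'security') return review(lens, 'opus');
  const first = await review(lens, 'sonnet');
  const borderline = first.score < 75;
  // Pay for the stronger model only when the cheap verdict is in doubt.
  if (first.verdict === 'refute' || borderline) return review(lens, 'opus');
  return first;
}

async function adversarialQa(review) {
  const lenses = ['security', 'correctness', 'regression'];
  const results = await Promise.all(lenses.map((lens) => runLens(lens, review)));
  const passes = results.filter((r) => r.verdict === 'pass' && r.score >= 60).length;
  const criticalRefutation = results.some((r) => r.verdict === 'refute' && r.critical);
  return { approved: passes >= 2 && !criticalRefutation, results };
}

// Stub reviewer: regression is borderline on the first pass, so it escalates.
const calls = [];
async function stubReview(lens, model) {
  calls.push(`${lens}:${model}`);
  const score = lens === 'regression' && model === 'sonnet' ? 70 : 90;
  return { verdict: 'pass', score, critical: false };
}

adversarialQa(stubReview).then((outcome) => {
  console.log(outcome.approved, calls);
});
```

On clean work only the security lens and any borderline lens reach the expensive model, which is where the savings on the high-risk path would come from.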

Added

  • Fable 5 as an opt-in ceiling via intelligence.maxIntelligence (default off). Flips only the security lens + expert-escalation Opus→Fable. Out of the hot path by default (Fable ≈ +30% tokens + premium price). fable is now a valid claudeModels value.

Fixed

  • router.mjs: clarified that complexityThresholds.expert is a validation ordering anchor only — getTier never reads it. Validation message now lists fable.

Verification: router 18/18, harness 17/17, validate.sh 11/11.

ultraswarm v2.2.0 — Behavioral CI + Machine-Readable Gates

11 Jun 12:15

A small, sharp release: the orchestration logic is now behaviorally tested in CI, and the validator speaks JSON. Both additions were produced or hardened by the swarm itself.

What's new

🧪 Workflow behavior harness (CI check [11])

scripts/workflow-harness.test.mjs — 16 node:test cases that extract the actual Workflow JS from SKILL.md and run it with mocked agent primitives, covering model-tier routing, adaptive QA depths, quorum and critical-refutation rules, tier escalation, exhaustion/tombstones, task immutability, and the dependency-wave guard. The embedded orchestration logic is now behaviorally tested on every push, not just parse-checked — a QA-gate regression breaks CI before it can burn tokens in a live run.

📋 validate.sh --json

Emits per-check results as a JSON array of {check, name, pass, detail} for CI dashboards and tooling; default output and exit codes are unchanged. Built by the swarm (grok, 2 attempts): the routine-tier QA review rejected attempt 1 for unescaped node -e interpolation and newline-unsafe JSON escaping — both real bugs — and attempt 2 fixed them with JSON.stringify-based escaping.
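
A consumer of the JSON output could be as small as the sketch below. Only the {check, name, pass, detail} record shape comes from these notes; the summarizeChecks helper and the sample records are made up for illustration. The sample is produced with JSON.stringify, the same escaping approach the notes credit with fixing the newline-unsafe output.

```javascript
// Summarize the array that `validate.sh --json` is described as emitting.
function summarizeChecks(jsonText) {
  const checks = JSON.parse(jsonText);
  const failed = checks.filter((c) => !c.pass);
  return {
    total: checks.length,
    passed: checks.length - failed.length,
    failures: failed.map((c) => `[${c.check}] ${c.name}: ${c.detail}`),
  };
}

// Hypothetical sample; JSON.stringify keeps the embedded newline valid JSON.
const sample = JSON.stringify([
  { check: 1, name: 'manifest parses', pass: true, detail: '' },
  { check: 11, name: 'workflow harness', pass: false, detail: 'line one\nline two' },
]);

const summary = summarizeChecks(sample);
console.log(`${summary.passed}/${summary.total} checks passed`);
for (const failure of summary.failures) console.log(failure);
```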

📚 README rewritten for v2.1+ reality

Every claim now traces to something measured or exercised in the live validation: dependency waves, both config override forms (flat + tiered), adaptive QA with the quorum/critical rules, the verified model-tier table with the model-ID-drift warning, measured cost calibration (the unmeasured "40–70% savings" claim is gone), the analyze mode, and a new troubleshooting entry for the hangs-on-bad-model-ID failure mode.

Upgrade

/plugin marketplace update ultraswarm

Then /reload-plugins or a new session. Full details in CHANGELOG.md.