consensus-loop


AI writes code. A different AI reviews it. Nothing ships without consensus.

A Claude Code plugin that enforces a cross-model audit gate on every code change. Claude implements, GPT/Codex reviews, and a human-in-the-loop retrospective ensures the team learns from each cycle.

claude plugin marketplace add berrzebb/claude-plugins
claude plugin install consensus-loop@berrzebb-plugins

That's it. All hooks, skills, agents, and MCP tools are auto-registered.


The Problem

AI coding tools generate code fast. They also generate bugs fast, skip tests, drift from requirements, and self-validate their own blind spots. Instruction-based corrections ("always write tests") fade across sessions. The model cannot reliably catch its own mistakes through self-review.

The Solution

Structure beats instruction. consensus-loop makes it structurally impossible to ship unreviewed code:

  1. You write → Claude implements in an isolated git worktree
  2. A different model reviews → GPT/Codex independently audits the evidence
  3. Nothing merges without consensus → [APPROVED] requires auditor sign-off, not self-promotion
  4. The team learns → Mandatory retrospective after each cycle, session-gate enforced

planner → scout (RTM) → orchestrator → implementer (worktree) → verify → audit → retro → merge → loop

Quick Start

1. Install

claude plugin marketplace add berrzebb/claude-plugins
claude plugin install consensus-loop@berrzebb-plugins

2. Configure

# Copy example config to your project
cp ~/.claude/plugins/cache/berrzebb-plugins/consensus-loop/*/examples/config.example.json \
   .claude/consensus-loop/config.json

# Copy prompt templates
cp -r ~/.claude/plugins/cache/berrzebb-plugins/consensus-loop/*/examples/templates/ \
      .claude/consensus-loop/templates/

Edit config.json — set your tags and paths:

{
  "consensus": {
    "watch_file": "docs/feedback/claude.md",
    "trigger_tag": "[REVIEW_NEEDED]",
    "agree_tag": "[APPROVED]",
    "pending_tag": "[CHANGES_REQUESTED]"
  }
}

3. Use

/consensus-loop:orchestrator     # Start a work session
/consensus-loop:planner          # Design new tracks interactively
/consensus-loop:verify           # Check done-criteria before submission
/consensus-audit                 # Trigger manual audit
/consensus-status                # Show current loop state

Real-World Reference: SoulFlow Orchestrator

consensus-loop was built to manage SoulFlow Orchestrator — a 32MB TypeScript codebase with 141 workflow nodes, 9 AI providers, and 188 deterministic tools.

Results from production use:

| Metric | Value |
| --- | --- |
| Tracks planned | 17 (+ 2 parallel support tracks) |
| Tracks RTM-scanned | 13 in 3 scout runs |
| Broken cross-track links found | 8 (automatically, in one pass) |
| Orphan tests identified | 7 |
| Parallel workers per session | Up to 3 (background, worktree-isolated) |
| Test suite | 104 tests across 21 suites |

What RTM looks like in practice:

A single scout run on 5 foundation tracks produced 3-way traceability matrices revealing:

  • Backend code: ~90% verified across all 5 tracks
  • Frontend: consistently wip (intentionally deferred to Track 15)
  • Concrete next steps: PA-5 (ArtifactStore extraction) and PAR-4 (workflow fanout) identified as the only true open items

The scout eliminated redundant exploration — implementers received pre-verified RTM rows and skipped straight to coding.

In action — orchestrator analyzing RTM state and proposing parallel distribution:

Orchestrator identifies unblocked tracks from RTM, checks scope overlap between candidates, and proposes 3 parallel agents

The orchestrator reads RTM state across all tracks, identifies 4 unblocked tracks (14, 17, P1, P2), checks file scope overlap between every pair (only P1 vs P2 has a dependency warning), and proposes 3 parallel agents with non-conflicting scopes.

Orchestrator distributing RTM-based work to parallel agents:

Orchestrator analyzes scope overlap, splits tasks into non-conflicting agents, and distributes RTM rows

The orchestrator detects that PA-7 and RP-4+SO-6 touch different directories, assigns them to separate agents, and each agent receives only its RTM open rows.

Parallel worktree agents executing in the background:

Two worktree-isolated agents running simultaneously with real-time status tracking

Agent A (PA-7 import boundary) and Agent B (RP-4+SO-6 binding tests) execute in isolated worktrees. The orchestrator tracks completion status and waits for both to finish before proceeding to merge.

Full cycle completion — done-criteria verification + evidence integration:

Implementer passes all 8 done-criteria categories, 105 tests pass, two parallel workers complete and proceed to evidence integration

Both parallel workers pass all done-criteria (CQ, T, CC, CL, CV — all PASS, 105 tests including 27 new + 78 regression). The orchestrator integrates evidence from both worktrees and proceeds to audit → retrospective → squash merge.

Audit trigger + retrospective gate enforcement:

Orchestrator triggers manual audit, agent recognizes retro-marker gate and defers retrospective until after APPROVED verdict

The orchestrator triggers /consensus-audit. The agent recognizes that retrospective must wait for [APPROVED] verdict (retro-marker.json → session-gate.mjs). Structural guardrails enforce protocol order — the agent cannot skip ahead.

Cross-model audit verdict — [CHANGES_REQUESTED] with specific evidence:

GPT/Codex auditor issues CHANGES_REQUESTED verdict citing missing file and scope mismatch, while independently verifying RTM rows

The independent auditor (GPT/Codex) issues [CHANGES_REQUESTED] citing a missing test file and scope mismatch. The second audit independently verifies RTM rows — "The files and tests cited by the RTM do exist." The agent then performs retrospective on the rejection, identifying what went wrong and what to improve.

Emergent double verification — main-branch audit catches what worktree verification missed:

Second audit classifies 5 rejections into infrastructure issues (CC-2/CV stale) vs substantive code issues (CC-1 claim-code mismatch), with correction plan

The main-branch audit discovers 3 substantive CC-1 issues that passed worktree-local verification: has_role gating mismatch, BMS25 score initialization, ordinal rank seed. This is the emergent double verification in action — two structurally independent verification passes catch different failure classes.

Correction cycle resolution — all CC-1 issues fixed, remaining issues classified:

Third audit shows all 3 CC-1 bugs resolved, remaining issues classified as infrastructure (CC-2 diff baseline) vs substantive (T-2 write path test)

After correction: all CC-1 claim-code mismatches resolved (has_role ✅, lexical_scores ✅, _last_scores ✅). Remaining issues cleanly classified — CC-2 is infrastructure (git diff baseline), T-2 is substantive (write path assertion missing). The protocol's correction cycle converges.

Final audit pass + full retrospective — protocol cycle complete:

Fourth audit passes all substantive criteria, CC-2 remains as known infrastructure gap, followed by structured 4-phase retrospective

All 5 RTM rows pass CQ, T, CC-1, CL, CV. Only CC-2 (infrastructure diff baseline) remains. The orchestrator proceeds to retrospective: what went well (parallel distribution, double verification), what went wrong (CC-2 gap, WIP commit missing, audit hook trigger), memory cleanup, and bidirectional feedback. The full protocol cycle — plan → scout → distribute → implement → verify → audit → correct → re-audit → retrospective → merge — is complete.

Session gate release + handoff — cycle complete, next session prepared:

Session-gate released via session-self-improvement-complete, session summary table showing outputs across code/audit/paper/discovery/memory, next session handoff to K2→K3→Track 15

echo "session-self-improvement-complete" releases the gate. Session summary: 8 files + 155 tests produced, 4 audit rounds completed, paper advanced v0.3→v0.4 with 8 Figures, emergent double verification discovered. Handoff specifies next tasks: K2 (Retriever Vector Closure) → K3 (Multimodal Reference) → Track 15 FE.

Handoff file update — session state persisted for next session:

Handoff file written with completed task states (K1/K4 done with agent_id, worktree paths, results), next tasks (K2 not-started), paper status, and commit summary table

The orchestrator writes session-handoff.md with full state: completed tasks (K1 4 files/48 tests, K4 4 files/105 tests), agent IDs, worktree branches, correction history, protocol changes, paper status (v0.4), and next session targets. This enables any future session to resume without re-exploration.


Full Cycle Walkthrough (Test Harness)

The test harness is a standalone TypeScript project (3 tracks, 9 work-breakdowns, 44 tests) built to validate every stage of the protocol in isolation. Each screenshot below shows a real execution — not a mockup.

Phase 1: Plan — Requirements + Track Design

The planner defines tracks with dependency ordering, work-breakdown items per track, verification scenarios, and intentionally planted defects for audit rejection testing.

Requirements definition showing 3 tracks (data-layer → service-layer → api-layer), 10 verification scenarios, and 3 planted defects mapped to specific WBs

3 tracks with sequential dependency (data → service → api), 9 work-breakdown items, 10 scenarios covering the full cycle. 3 planted defects (test-gap, security-drift, scope-mismatch) are assigned to specific WBs — the auditor must catch all three.

Phase 2: Build — Project Scaffold + Quality Gates

The implementer creates the project structure, implements source code, and passes all quality gates (tsc, eslint, vitest) before entering the consensus cycle.

Project structure showing src/data, src/service, src/api with 34 passing tests, file tree, and planted defect table

The project is a real TypeScript codebase — not stubs. 34 tests pass across 3 test files. The defect table maps each planted issue to its WB, expected rejection code, and exact file location.

Phase 3: Scout — Deterministic RTM Generation

The scout uses MCP tools (code_map, dependency_graph) to analyze the codebase and generate 3-way Requirements Traceability Matrices — Forward, Backward, and Bidirectional.

Scout executes code_map (17 symbols across 9 files) and dependency_graph (9 components), analyzing the actual codebase via MCP tools

No LLM inference at this stage — only deterministic tools. code_map extracts 17 symbols (functions, classes, interfaces, types) with exact line ranges. dependency_graph maps import chains and connected components. These facts feed the RTM.

Forward RTM with 4 rows for data-layer showing Exists/Impl/Test Case/Connected columns, Backward RTM tracing 3 test files to requirements, Bidirectional summary

Forward RTM maps each Req ID to its implementation file, verification status, test case, and downstream consumer. Backward RTM traces each test file back to its requirement — detecting orphan tests. The bidirectional summary reveals gaps: SL-2 has no direct test (the planted defect).
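Conceptually, the bidirectional summary reduces to a set comparison over the two matrices. A minimal sketch, assuming a simplified row shape (the plugin's actual RTM schema may differ):

```javascript
// Sketch: bidirectional RTM gap detection.
// Forward rows map requirements to tests; backward rows map test files to requirements.
function findGaps(forwardRows, backwardRows) {
  // Requirements with no direct test (e.g. the planted SL-2 defect)
  const untested = forwardRows
    .filter((r) => !r.testCase)
    .map((r) => r.reqId);

  // Tests that trace back to no known requirement → orphan tests
  const linkedReqs = new Set(forwardRows.map((r) => r.reqId));
  const orphanTests = backwardRows
    .filter((r) => !linkedReqs.has(r.reqId))
    .map((r) => r.testFile);

  return { untested, orphanTests };
}
```

Everything here is plain set logic over deterministic facts, which is why the scout needs no LLM inference for this stage.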

Phase 4: Audit — Cross-Model Rejection

The auditor (GPT/Codex) independently verifies each RTM row. When evidence claims don't match the codebase, specific rejection codes are issued with file:line evidence.

Auditor issues CHANGES_REQUESTED for SL-2 with rejection code test-gap, while SL-1 and SL-3 pass independently

SL-2 claimed fixed status but tests/service/validator.test.ts does not exist — T-1 violation. The auditor issues test-gap with a Completion Criteria Reset specifying exactly what to fix. SL-1 and SL-3 are judged independently and pass.

Phase 5: Correct — SendMessage Reuse + Re-Audit

The orchestrator sends corrections to the existing implementer agent via SendMessage (no new spawn). After correction, evidence is resubmitted and re-audited.

claude.md tag promoted from REVIEW_NEEDED to APPROVED after correction round 2, audit-history.jsonl records both rejection and approval entries

The correction cycle is visible in the diff: [REVIEW_NEEDED] → [APPROVED] tag promotion. The audit-history.jsonl shows the full trail — round 1 rejected (test-gap), round 2 approved. The tag in claude.md is promoted by respond.mjs, not by the implementer (no self-promotion).
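The promotion step itself is a small, auditable string transform. A minimal sketch of what respond.mjs's tag promotion could look like (function name and shape are illustrative; the real script also appends to audit-history.jsonl):

```javascript
// Sketch: verdict-driven tag promotion — only this script rewrites the tag,
// so the implementer can never self-promote.
function promoteTag(content, verdict, config) {
  const { trigger_tag, agree_tag, pending_tag } = config.consensus;
  const target = verdict === "approved" ? agree_tag : pending_tag;
  return content.replace(trigger_tag, target); // first occurrence only
}
```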

Phase 6: Enforce — Scope Validation + Upstream Delay

Structural enforcement runs automatically — not as guidelines but as code. The orchestrator validates scope overlap before parallel distribution, and enforcement.mjs auto-blocks downstream tracks when upstream rejection count exceeds threshold.

dependency_graph + Grep reveals error-handler.ts imports Response/RouteHandler from routes.ts — scope overlap detected, parallel spawn blocked

AL-1 (routes.ts) and AL-2 (error-handler.ts) share types via import. The orchestrator detects this overlap and falls back to sequential execution — preventing merge conflicts that parallel worktrees would cause.
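At its core, the overlap check is a pairwise intersection over each candidate's file scope. A simplified sketch (the real check also follows import edges via dependency_graph, as the AL-1/AL-2 case shows):

```javascript
// Sketch: pairwise file-scope overlap check before parallel spawn.
function scopesOverlap(scopeA, scopeB) {
  const a = new Set(scopeA);
  return scopeB.some((f) => a.has(f));
}

function planParallel(tasks) {
  // tasks: [{ id, files: [...] }] — parallel only if no pair shares a file
  for (let i = 0; i < tasks.length; i++)
    for (let j = i + 1; j < tasks.length; j++)
      if (scopesOverlap(tasks[i].files, tasks[j].files))
        return { mode: "sequential", conflict: [tasks[i].id, tasks[j].id] };
  return { mode: "parallel" };
}
```

On any detected conflict the planner falls back to sequential execution rather than risking a worktree merge conflict.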

3 consecutive security rejections on AL-1 trigger blockDownstreamTasks(), AL-2 status updated to "blocked (upstream delay: AL-1 security rejected 3x)"

After 3 consecutive security rejections on AL-1, enforcement.mjs automatically blocks AL-2 (which depends on AL-1). The handoff is updated with the reason string. This prevents wasted work — downstream agents won't start until the upstream issue is resolved.
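The blocking rule is a simple threshold walk over the dependency list. A sketch of the behavior described above (names are hypothetical; the actual enforcement.mjs implementation may differ):

```javascript
// Sketch: upstream-delay enforcement — block tasks whose upstream
// dependency has accumulated too many consecutive rejections.
const REJECTION_THRESHOLD = 3;

function blockDownstreamTasks(tasks, rejections) {
  // rejections: { taskId: consecutiveRejectionCount }
  for (const task of tasks) {
    const blockedBy = (task.dependsOn || []).find(
      (up) => (rejections[up] || 0) >= REJECTION_THRESHOLD
    );
    if (blockedBy) {
      task.status = `blocked (upstream delay: ${blockedBy} rejected ${rejections[blockedBy]}x)`;
    }
  }
  return tasks;
}
```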

Results

| Metric | Value |
| --- | --- |
| Scenarios executed | 10/10 pass |
| Planted defects caught | 3/3 (test-gap, security, scope-mismatch) |
| Correction cycles | 2 (SL-2 test-gap, AL-1 security) |
| Downstream auto-blocks | 1 (AL-2 blocked by AL-1 upstream delay) |
| Tech debt auto-captured | 4 items → work-catalog.md |
| Final test count | 44 pass (4 files) |

# Run the test harness yourself
cd test-harness && npm install && npm run quality

Lightweight Entry: Audit Gate Only

Don't need the full orchestration? Use just the audit gate:

What you get:

  • Every file edit → cross-model audit (async, non-blocking)
  • [trigger_tag] → [agree_tag] or [pending_tag] with specific file:line rejection codes
  • Quality rules (ESLint, npm audit) run inline on matching edits
  • Session gate blocks commits until retrospective completes

What you skip:

  • Orchestrator/implementer multi-agent workflow
  • Scout + RTM traceability
  • Work breakdown planning

How: Install the plugin normally, then disable the skills you don't need. The hook cycle (index.mjs → audit.mjs → respond.mjs → session-gate.mjs) works independently of the orchestration layer.
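The heart of the gate is a cheap tag check against the watch file. A minimal sketch of that core decision (simplified; the real hook also debounces via audit.lock and handles file sync):

```javascript
// Sketch: the trigger decision inside the PostToolUse hook.
// A fresh trigger tag with no verdict tag yet means an audit is due.
function shouldTriggerAudit(watchFileContent, config) {
  const { trigger_tag, agree_tag, pending_tag } = config.consensus;
  return (
    watchFileContent.includes(trigger_tag) &&
    !watchFileContent.includes(agree_tag) &&
    !watchFileContent.includes(pending_tag)
  );
}
```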


How It Works

Full Development Cycle

planner ─── Interactive 6-phase requirement definition
    ↓
scout ─── dependency_graph + code_map → 3-way RTM (Forward/Backward/Bidirectional)
    ↓
orchestrator ─── Distribute Forward RTM rows → scope validation → parallel background spawn
    ↓
┌─── Track A (worktree) ──────┐  ┌─── Track B (worktree) ──────┐
│  implementer: RTM rows only  │  │  implementer: RTM rows only  │
│  → verify (8 categories)     │  │  → verify (8 categories)     │
│  → submit RTM-based evidence │  │  → submit RTM-based evidence │
│  → audit (async, background) │  │  → audit (async, background) │
│  [pending] → fix failed rows │  │  [approved] → WIP commit     │
│  [approved] → WIP commit     │  │                              │
└──────────────────────────────┘  └──────────────────────────────┘
    ↓
retrospective (session-gate enforced) → merge (squash) → handoff → next RTM row

Verification Categories (8)

| # | Category | What it checks |
| --- | --- | --- |
| 1 | Code Quality (CQ) | Per-file eslint + tsc + forbidden patterns |
| 2 | Test (T) | Test execution + direct test per claim + no regressions |
| 3 | Claim-Code (CC) | Evidence matches git diff |
| 4 | Cross-Layer (CL) | BE→FE contracts documented |
| 5 | Security (S) | OWASP Top 10 + input validation + auth guards |
| 6 | i18n (I) | Locale keys in all supported locales |
| 7 | Frontend (FV) | Page loads, DOM, console errors, build |
| 8 | Coverage (CV) | Statement ≥ 85%, Branch ≥ 75% per changed file |
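The Coverage (CV) gate is the most mechanical of the eight: compare each changed file against the two thresholds. A sketch, assuming a simplified per-file shape (the plugin actually reads vitest JSON via coverage_map):

```javascript
// Sketch: the CV gate — 85% statements, 75% branches, per changed file.
const STATEMENT_MIN = 85;
const BRANCH_MIN = 75;

function coverageFailures(coverage) {
  // coverage: { "path/to/file.ts": { statements: %, branches: % }, ... }
  return Object.entries(coverage)
    .filter(([, c]) => c.statements < STATEMENT_MIN || c.branches < BRANCH_MIN)
    .map(([file]) => file);
}
```

An empty result means CV passes; any entries become the failed rows the implementer must fix before resubmitting.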

Deterministic MCP Tools (7)

These tools provide facts, not inference — used by all roles:

| Tool | What it does |
| --- | --- |
| code_map | Cached symbol index with line ranges |
| dependency_graph | Import/export DAG, connected components, topological sort, cycle detection |
| audit_scan | Pattern scan (type-safety, hardcoded strings, console.log) |
| coverage_map | Per-file coverage percentages from vitest JSON |
| rtm_parse | Parse RTM markdown → structured rows, filter by req_id/status |
| rtm_merge | Row-level merge of worktree RTMs with conflict detection |
| audit_history | Query persistent audit history — verdicts, rejection patterns, risk detection |
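To make the "facts, not inference" point concrete, here is what rtm_parse does conceptually: turn a markdown RTM table into structured rows. The column layout is an assumption based on the Forward RTM described earlier; the tool's actual parser is more robust:

```javascript
// Sketch: markdown RTM table → structured rows, with optional status filter.
// Assumes columns: | Req ID | File | Status | Test Case |
function parseRtm(markdown, { status } = {}) {
  const rows = markdown
    .split("\n")
    .filter((l) =>
      l.startsWith("|") &&
      !/^\|[\s:-]+\|/.test(l) &&     // skip the |---|---| separator row
      !l.includes("Req ID")          // skip the header row
    )
    .map((l) => {
      const [reqId, file, st, testCase] = l.split("|").slice(1, -1).map((c) => c.trim());
      return { reqId, file, status: st, testCase };
    });
  return status ? rows.filter((r) => r.status === status) : rows;
}
```

Because the output is plain data, every role (scout, orchestrator, auditor) can act on the same rows without re-deriving them.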

Hook Cycle

Code Edit → PostToolUse (index.mjs)
    ├─ watch_file + trigger_tag → spawn audit (detached, async)
    ├─ gpt.md newer → auto-sync (promote/demote tags)
    ├─ planning file → normalize
    └─ quality rule match → run check inline

Audit runs in background. Hook returns immediately. No blocking.
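The non-blocking spawn is standard Node detachment. A sketch of how a hook can fire the audit and return immediately (script path and function name are illustrative):

```javascript
// Sketch: fire-and-forget audit spawn from a hook.
import { spawn } from "node:child_process";

function spawnAuditDetached(scriptPath, args = []) {
  const child = spawn(process.execPath, [scriptPath, ...args], {
    detached: true,   // audit survives the hook process exiting
    stdio: "ignore",  // no pipes keeping the parent's event loop alive
  });
  child.unref();      // hook returns immediately; audit runs in background
  return child.pid;
}
```

With `detached: true`, `stdio: "ignore"`, and `unref()`, the parent hook exits without waiting, which is what keeps every code edit non-blocking.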


Key Design Decisions

1. Structure over instruction. Behavioral constraints enforced by code (session-gate, audit.lock) are more reliable than behavioral constraints enforced by prompts. You can't instruct Claude to consistently catch test-gap across sessions. But you can build a gate that makes it structurally impossible to proceed until a peer model confirms.

2. Facts over inference. The 7 MCP tools provide deterministic data — file existence, import chains, coverage percentages, symbol indices. Models judge; tools measure. This makes results stable across model changes.

3. Policy as data. All audit criteria, rejection codes, and evidence formats are in editable markdown files (templates/references/). To change audit standards, edit a file. No code changes.

4. Fail-open safety. Every hook fails open — errors pass through silently. The system never locks you out. session-gate.mjs errors → pass. Audit failures → pass. Config missing → graceful defaults.
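The fail-open pattern is the same in every hook: wrap the real check and convert any error into a pass. A minimal sketch (wrapper name is illustrative):

```javascript
// Sketch: fail-open wrapper — an erroring gate never locks the user out.
function failOpen(check) {
  try {
    return check();                            // normal path: the gate's real verdict
  } catch {
    return { pass: true, failedOpen: true };   // any error → pass silently
  }
}
```

The trade-off is deliberate: a broken gate degrades to "no gate" rather than to "no work possible".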

5. Scout once, implement many. The scout generates a Requirements Traceability Matrix (RTM) once per track. All subsequent agents work from those facts, not from re-exploration. Cost: ~8K tokens (one-time). Savings: ~5K tokens per worker per round.


Architecture

Roles

| Role | What it does | Model |
| --- | --- | --- |
| Planner | Interactive 6-phase requirement definition | Opus |
| Scout | Read-only 3-way RTM generation using deterministic tools | Opus |
| Orchestrator | Task distribution, agent tracking, correction cycles | Inherited |
| Implementer | Code in worktree, test, submit evidence, handle corrections | Sonnet |
| Auditor | Independent per-row verification of RTM evidence | GPT/Codex |

Skills (5)

| Skill | Purpose |
| --- | --- |
| consensus-loop:orchestrator | Session orchestration — scout, distribute, track, correct |
| consensus-loop:verify | Done-criteria verification (8 categories) |
| consensus-loop:merge | Squash-merge worktree with structured commit |
| consensus-loop:planner | Interactive track definition + work breakdown |
| consensus-loop:guide | Evidence package writing guide |

Agents (2)

| Agent | Purpose |
| --- | --- |
| consensus-loop:implementer | Headless worker in worktree — code, test, evidence |
| consensus-loop:scout | Read-only RTM generator — 3-way traceability |

Porting to Another Project

# 1. Install
claude plugin marketplace add berrzebb/claude-plugins
claude plugin install consensus-loop@berrzebb-plugins

# 2. Configure (edit tags + paths)
# 3. Edit templates/references/ for your team's policies

Minimal config for English projects:

{
  "plugin": { "locale": "en" },
  "consensus": {
    "watch_file": "docs/review/author.md",
    "trigger_tag": "[REVIEW_NEEDED]",
    "agree_tag": "[APPROVED]",
    "pending_tag": "[CHANGES_REQUESTED]"
  }
}

Contributing

| Contributor | Contributions |
| --- | --- |
| @berrzebb | Core architecture, RTM system, MCP tools, multi-agent orchestration |
| @dandacompany | Security fixes (#1 shell injection, #2 plugin support), locale path traversal + ESM require fix |

License

MIT

About

Claude Code plugin — structural guardrails for multi-agent software development. RTM-based evidence, cross-model adversarial audit, 7 MCP tools, worktree isolation, HITL retrospective. DOI: 10.5281/zenodo.19108370
