AI writes code. A different AI reviews it. Nothing ships without consensus.
A Claude Code plugin that enforces a cross-model audit gate on every code change. Claude implements, GPT/Codex reviews, and a human-in-the-loop retrospective ensures the team learns from each cycle.
```
claude plugin marketplace add berrzebb/claude-plugins
claude plugin install consensus-loop@berrzebb-plugins
```

That's it. All hooks, skills, agents, and MCP tools are auto-registered.
AI coding tools generate code fast. They also generate bugs fast, skip tests, drift from requirements, and self-validate their own blind spots. Instruction-based corrections ("always write tests") fade across sessions. The model cannot reliably catch its own mistakes through self-review.
Structure beats instruction. consensus-loop makes it structurally impossible to ship unreviewed code:
- You write → Claude implements in an isolated git worktree
- A different model reviews → GPT/Codex independently audits the evidence
- Nothing merges without consensus → `[APPROVED]` requires auditor sign-off, not self-promotion
- The team learns → Mandatory retrospective after each cycle, session-gate enforced
planner → scout (RTM) → orchestrator → implementer (worktree) → verify → audit → retro → merge → loop
```
claude plugin marketplace add berrzebb/claude-plugins
claude plugin install consensus-loop@berrzebb-plugins

# Copy example config to your project
cp ~/.claude/plugins/cache/berrzebb-plugins/consensus-loop/*/examples/config.example.json \
   .claude/consensus-loop/config.json

# Copy prompt templates
cp -r ~/.claude/plugins/cache/berrzebb-plugins/consensus-loop/*/examples/templates/ \
   .claude/consensus-loop/templates/
```

Edit `config.json` — set your tags and paths:
```json
{
  "consensus": {
    "watch_file": "docs/feedback/claude.md",
    "trigger_tag": "[REVIEW_NEEDED]",
    "agree_tag": "[APPROVED]",
    "pending_tag": "[CHANGES_REQUESTED]"
  }
}
```

```
/consensus-loop:orchestrator   # Start a work session
/consensus-loop:planner        # Design new tracks interactively
/consensus-loop:verify         # Check done-criteria before submission
/consensus-audit               # Trigger manual audit
/consensus-status              # Show current loop state
```
consensus-loop was built to manage SoulFlow Orchestrator — a 32MB TypeScript codebase with 141 workflow nodes, 9 AI providers, and 188 deterministic tools.
Results from production use:
| Metric | Value |
|---|---|
| Tracks planned | 17 (+ 2 parallel support tracks) |
| Tracks RTM-scanned | 13 in 3 scout runs |
| Broken cross-track links found | 8 (automatically, in one pass) |
| Orphan tests identified | 7 |
| Parallel workers per session | Up to 3 (background, worktree-isolated) |
| Test suite | 104 tests across 21 suites |
What RTM looks like in practice:
A single scout run on 5 foundation tracks produced 3-way traceability matrices revealing:
- Backend code: ~90% verified across all 5 tracks
- Frontend: consistently `wip` (intentionally deferred to Track 15)
- Concrete next steps: PA-5 (ArtifactStore extraction) and PAR-4 (workflow fanout) identified as the only true `open` items
The scout eliminated redundant exploration — implementers received pre-verified RTM rows and skipped straight to coding.
In action — orchestrator analyzing RTM state and proposing parallel distribution:
The orchestrator reads RTM state across all tracks, identifies 4 unblocked tracks (14, 17, P1, P2), checks file scope overlap between every pair (only P1 vs P2 has a dependency warning), and proposes 3 parallel agents with non-conflicting scopes.
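The pairwise scope check described above can be sketched as follows. This is a minimal illustration, not the plugin's actual implementation — the function names and the track/scope shape are assumptions; the real logic lives in the orchestrator skill.

```javascript
// A track's "scope" here is the set of file/directory paths its RTM rows touch
// (illustrative shape, not the plugin's schema).
function scopesOverlap(scopeA, scopeB) {
  // Two scopes conflict if any path in one is a prefix of a path in the other
  // (a file inside a directory counts as overlap).
  return scopeA.some((a) =>
    scopeB.some((b) => a.startsWith(b) || b.startsWith(a))
  );
}

// Greedily admit tracks whose scopes are disjoint from every already-admitted
// track; conflicting tracks fall back to sequential execution.
function parallelizable(tracks) {
  const safe = [];
  for (const track of tracks) {
    const conflict = safe.some((t) => scopesOverlap(t.scope, track.scope));
    if (!conflict) safe.push(track);
  }
  return safe.map((t) => t.id);
}
```

Two tracks touching `src/pa7/` and `src/rp4/` would be admitted together; a third track editing a file under `src/pa7/` would be held back.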
Orchestrator distributing RTM-based work to parallel agents:
The orchestrator detects that PA-7 and RP-4+SO-6 touch different directories, assigns them to separate agents, and each agent receives only its RTM open rows.
Parallel worktree agents executing in the background:
Agent A (PA-7 import boundary) and Agent B (RP-4+SO-6 binding tests) execute in isolated worktrees. The orchestrator tracks completion status and waits for both to finish before proceeding to merge.
Full cycle completion — done-criteria verification + evidence integration:
Both parallel workers pass all done-criteria (CQ, T, CC, CL, CV — all PASS, 105 tests including 27 new + 78 regression). The orchestrator integrates evidence from both worktrees and proceeds to audit → retrospective → squash merge.
Audit trigger + retrospective gate enforcement:
The orchestrator triggers /consensus-audit. The agent recognizes that retrospective must wait for [APPROVED] verdict (retro-marker.json → session-gate.mjs). Structural guardrails enforce protocol order — the agent cannot skip ahead.
Cross-model audit verdict — [CHANGES_REQUESTED] with specific evidence:
The independent auditor (GPT/Codex) issues [CHANGES_REQUESTED] citing a missing test file and scope mismatch. The second audit independently verifies RTM rows — "The files and tests cited by the RTM do exist." The agent then performs retrospective on the rejection, identifying what went wrong and what to improve.
Emergent double verification — main-branch audit catches what worktree verification missed:
The main-branch audit discovers 3 substantive CC-1 issues that passed worktree-local verification: has_role gating mismatch, BMS25 score initialization, ordinal rank seed. This is the emergent double verification in action — two structurally independent verification passes catch different failure classes.
Correction cycle resolution — all CC-1 issues fixed, remaining issues classified:
After correction: all CC-1 claim-code mismatches resolved (has_role ✅, lexical_scores ✅, _last_scores ✅). Remaining issues cleanly classified — CC-2 is infrastructure (git diff baseline), T-2 is substantive (write path assertion missing). The protocol's correction cycle converges.
Final audit pass + full retrospective — protocol cycle complete:
All 5 RTM rows pass CQ, T, CC-1, CL, CV. Only CC-2 (infrastructure diff baseline) remains. The orchestrator proceeds to retrospective: what went well (parallel distribution, double verification), what went wrong (CC-2 gap, WIP commit missing, audit hook trigger), memory cleanup, and bidirectional feedback. The full protocol cycle — plan → scout → distribute → implement → verify → audit → correct → re-audit → retrospective → merge — is complete.
Session gate release + handoff — cycle complete, next session prepared:
`echo "session-self-improvement-complete"` releases the gate. Session summary: 8 files + 155 tests produced, 4 audit rounds completed, paper advanced v0.3→v0.4 with 8 Figures, emergent double verification discovered. Handoff specifies next tasks: K2 (Retriever Vector Closure) → K3 (Multimodal Reference) → Track 15 FE.
Handoff file update — session state persisted for next session:
The orchestrator writes session-handoff.md with full state: completed tasks (K1 4 files/48 tests, K4 4 files/105 tests), agent IDs, worktree branches, correction history, protocol changes, paper status (v0.4), and next session targets. This enables any future session to resume without re-exploration.
The test harness is a standalone TypeScript project (3 tracks, 9 work-breakdowns, 44 tests) built to validate every stage of the protocol in isolation. Each screenshot below shows a real execution — not a mockup.
The planner defines tracks with dependency ordering, work-breakdown items per track, verification scenarios, and intentionally planted defects for audit rejection testing.
3 tracks with sequential dependency (data → service → api), 9 work-breakdown items, 10 scenarios covering the full cycle. 3 planted defects (test-gap, security-drift, scope-mismatch) are assigned to specific WBs — the auditor must catch all three.
The implementer creates the project structure, implements source code, and passes all quality gates (tsc, eslint, vitest) before entering the consensus cycle.
The project is a real TypeScript codebase — not stubs. 34 tests pass across 3 test files. The defect table maps each planted issue to its WB, expected rejection code, and exact file location.
The scout uses MCP tools (code_map, dependency_graph) to analyze the codebase and generate 3-way Requirements Traceability Matrices — Forward, Backward, and Bidirectional.
No LLM inference at this stage — only deterministic tools. code_map extracts 17 symbols (functions, classes, interfaces, types) with exact line ranges. dependency_graph maps import chains and connected components. These facts feed the RTM.
Forward RTM maps each Req ID to its implementation file, verification status, test case, and downstream consumer. Backward RTM traces each test file back to its requirement — detecting orphan tests. The bidirectional summary reveals gaps: SL-2 has no direct test (the planted defect).
The auditor (GPT/Codex) independently verifies each RTM row. When evidence claims don't match the codebase, specific rejection codes are issued with file:line evidence.
SL-2 claimed fixed status but tests/service/validator.test.ts does not exist — T-1 violation. The auditor issues test-gap with a Completion Criteria Reset specifying exactly what to fix. SL-1 and SL-3 are judged independently and pass.
The orchestrator sends corrections to the existing implementer agent via SendMessage (no new spawn). After correction, evidence is resubmitted and re-audited.
The correction cycle is visible in the diff: [REVIEW_NEEDED] → [APPROVED] tag promotion. The audit-history.jsonl shows the full trail — round 1 rejected (test-gap), round 2 approved. The tag in claude.md is promoted by respond.mjs, not by the implementer (no self-promotion).
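The tag promotion step can be sketched as a pure string rewrite. This is a simplified illustration — `promoteTag` is a hypothetical name, and the real `respond.mjs` does more (history logging, sync with gpt.md) — but it shows the key property: only the responder rewrites the tag, never the implementer.

```javascript
// Simplified sketch of respond.mjs-style tag promotion. The trigger tag in
// the watched file is replaced with the auditor's verdict tag.
function promoteTag(content, verdict, cfg = {
  trigger_tag: '[REVIEW_NEEDED]',
  agree_tag: '[APPROVED]',
  pending_tag: '[CHANGES_REQUESTED]',
}) {
  const target = verdict === 'approve' ? cfg.agree_tag : cfg.pending_tag;
  return content.replaceAll(cfg.trigger_tag, target);
}
```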
Structural enforcement runs automatically — not as guidelines but as code. The orchestrator validates scope overlap before parallel distribution, and enforcement.mjs auto-blocks downstream tracks when upstream rejection count exceeds threshold.
AL-1 (routes.ts) and AL-2 (error-handler.ts) share types via import. The orchestrator detects this overlap and falls back to sequential execution — preventing merge conflicts that parallel worktrees would cause.
After 3 consecutive security rejections on AL-1, enforcement.mjs automatically blocks AL-2 (which depends on AL-1). The handoff is updated with the reason string. This prevents wasted work — downstream agents won't start until the upstream issue is resolved.
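The auto-block rule might look like the following sketch. The threshold, field names, and function names are illustrative assumptions, not `enforcement.mjs`'s actual API — the point is that blocking is computed from audit history and the dependency graph, not from an agent's judgment.

```javascript
const REJECTION_THRESHOLD = 3; // illustrative; the real threshold is configurable

// history: audit verdicts for one track, newest last.
// Counts rejections at the tail of the history (a pass resets the streak).
function consecutiveRejections(history) {
  let n = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    if (history[i] === '[CHANGES_REQUESTED]') n++;
    else break;
  }
  return n;
}

// Returns ids of tracks that depend on an over-threshold upstream track.
function blockedDownstream(tracks, histories) {
  const overThreshold = tracks
    .filter((t) => consecutiveRejections(histories[t.id] ?? []) >= REJECTION_THRESHOLD)
    .map((t) => t.id);
  return tracks
    .filter((t) => t.deps.some((d) => overThreshold.includes(d)))
    .map((t) => t.id);
}
```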
| Metric | Value |
|---|---|
| Scenarios executed | 10/10 pass |
| Planted defects caught | 3/3 (test-gap, security, scope-mismatch) |
| Correction cycles | 2 (SL-2 test-gap, AL-1 security) |
| Downstream auto-blocks | 1 (AL-2 blocked by AL-1 upstream delay) |
| Tech debt auto-captured | 4 items → work-catalog.md |
| Final test count | 44 pass (4 files) |
```
# Run the test harness yourself
cd test-harness && npm install && npm run quality
```

Don't need the full orchestration? Use just the audit gate:
What you get:
- Every file edit → cross-model audit (async, non-blocking)
- `[trigger_tag]` → `[agree_tag]` or `[pending_tag]` with specific file:line rejection codes
- Quality rules (ESLint, npm audit) run inline on matching edits
- Session gate blocks commits until retrospective completes
What you skip:
- Orchestrator/implementer multi-agent workflow
- Scout + RTM traceability
- Work breakdown planning
How: Install the plugin normally, then disable the skills you don't need. The hook cycle (index.mjs → audit.mjs → respond.mjs → session-gate.mjs) works independently of the orchestration layer.
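The audit-gate trigger check is conceptually small. The sketch below is a simplification under assumed names (`shouldAudit` is not the plugin's API); the real `index.mjs` also handles sync, normalization, and quality rules.

```javascript
// Mirrors the shape of .claude/consensus-loop/config.json shown above.
const config = {
  consensus: {
    watch_file: 'docs/feedback/claude.md',
    trigger_tag: '[REVIEW_NEEDED]',
  },
};

// An audit is spawned only when the watched file was edited AND its new
// content carries the trigger tag; any other edit passes straight through.
function shouldAudit(editedPath, fileContent, cfg = config.consensus) {
  return editedPath.endsWith(cfg.watch_file) && fileContent.includes(cfg.trigger_tag);
}
```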
```
planner ─── Interactive 6-phase requirement definition
    ↓
scout ─── dependency_graph + code_map → 3-way RTM (Forward/Backward/Bidirectional)
    ↓
orchestrator ─── Distribute Forward RTM rows → scope validation → parallel background spawn
    ↓
┌─── Track A (worktree) ──────┐  ┌─── Track B (worktree) ──────┐
│ implementer: RTM rows only  │  │ implementer: RTM rows only  │
│ → verify (8 categories)     │  │ → verify (8 categories)     │
│ → submit RTM-based evidence │  │ → submit RTM-based evidence │
│ → audit (async, background) │  │ → audit (async, background) │
│ [pending] → fix failed rows │  │ [approved] → WIP commit     │
│ [approved] → WIP commit     │  │                             │
└─────────────────────────────┘  └─────────────────────────────┘
    ↓
retrospective (session-gate enforced) → merge (squash) → handoff → next RTM row
```
| # | Category | What it checks |
|---|---|---|
| 1 | Code Quality (CQ) | Per-file eslint + tsc + forbidden patterns |
| 2 | Test (T) | Test execution + direct test per claim + no regressions |
| 3 | Claim-Code (CC) | Evidence matches git diff |
| 4 | Cross-Layer (CL) | BE→FE contracts documented |
| 5 | Security (S) | OWASP TOP 10 + input validation + auth guards |
| 6 | i18n (I) | Locale keys in all supported locales |
| 7 | Frontend (FV) | Page loads, DOM, console errors, build |
| 8 | Coverage (CV) | Statement ≥ 85%, Branch ≥ 75% per changed file |
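The Coverage (CV) criterion from the table above reduces to a per-file threshold check. The sketch below assumes a coverage map keyed by file path with percentage fields, roughly the shape of a vitest JSON coverage summary — the function name and exact schema are illustrative.

```javascript
// CV thresholds from the done-criteria table: statements >= 85%, branches >= 75%.
const THRESHOLDS = { statements: 85, branches: 75 };

// Returns the changed files that fail the CV criterion.
function coverageFailures(changedFiles, coverage) {
  return changedFiles.filter((f) => {
    const c = coverage[f];
    if (!c) return true; // no coverage data counts as a failure
    return c.statements < THRESHOLDS.statements || c.branches < THRESHOLDS.branches;
  });
}
```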
These tools provide facts, not inference — used by all roles:
| Tool | What it does |
|---|---|
| `code_map` | Cached symbol index with line ranges |
| `dependency_graph` | Import/export DAG, connected components, topological sort, cycle detection |
| `audit_scan` | Pattern scan (type-safety, hardcoded strings, console.log) |
| `coverage_map` | Per-file coverage percentages from vitest JSON |
| `rtm_parse` | Parse RTM markdown → structured rows, filter by req_id/status |
| `rtm_merge` | Row-level merge of worktree RTMs with conflict detection |
| `audit_history` | Query persistent audit history — verdicts, rejection patterns, risk detection |
```
Code Edit → PostToolUse (index.mjs)
  ├─ watch_file + trigger_tag → spawn audit (detached, async)
  ├─ gpt.md newer → auto-sync (promote/demote tags)
  ├─ planning file → normalize
  └─ quality rule match → run check inline
```
Audit runs in background. Hook returns immediately. No blocking.
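The non-blocking spawn pattern is standard Node.js: fire the audit as a detached child and let the hook exit immediately. This is a generic sketch of the technique, not the plugin's actual code; the script path and function name are placeholders.

```javascript
import { spawn } from 'node:child_process';

// Launch an audit script as a detached child so the PostToolUse hook can
// return without waiting for the audit to finish.
function spawnAuditDetached(scriptPath, evidenceFile) {
  const child = spawn(process.execPath, [scriptPath, evidenceFile], {
    detached: true,  // child survives the hook process exiting
    stdio: 'ignore', // no shared pipes, so the parent never blocks on output
  });
  child.unref(); // drop the child from the parent's event-loop ref count
  return child.pid;
}
```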
1. Structure over instruction. Behavioral constraints enforced by code (session-gate, audit.lock) are more reliable than behavioral constraints enforced by prompts. You can't instruct Claude to consistently catch test-gap across sessions. But you can build a gate that makes it structurally impossible to proceed until a peer model confirms.
2. Facts over inference. The MCP tools provide deterministic data — file existence, import chains, coverage percentages, symbol indices. Models judge; tools measure. This makes results stable across model changes.
3. Policy as data. All audit criteria, rejection codes, and evidence formats are in editable markdown files (templates/references/). To change audit standards, edit a file. No code changes.
4. Fail-open safety. Every hook fails open — errors pass through silently. The system never locks you out. session-gate.mjs errors → pass. Audit failures → pass. Config missing → graceful defaults.
5. Scout once, implement many. The scout generates a Requirements Traceability Matrix (RTM) once per track. All subsequent agents work from those facts, not from re-exploration. Cost: ~8K tokens (one-time). Savings: ~5K tokens per worker per round.
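The fail-open pattern from principle 4 above is a one-function idea, sketched here with illustrative names (the real hooks wrap their entry points rather than using a generic helper):

```javascript
// Wrap any hook so that errors never block the user: on failure the edit
// passes through and the system degrades to "no gate".
function failOpen(hookFn) {
  return (...args) => {
    try {
      return hookFn(...args);
    } catch {
      return { pass: true, degraded: true };
    }
  };
}

// Example: a gate whose config is missing still lets the edit through.
const gate = failOpen(() => {
  throw new Error('config missing');
});
```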
| Role | What it does | Model |
|---|---|---|
| Planner | Interactive 6-phase requirement definition | Opus |
| Scout | Read-only 3-way RTM generation using deterministic tools | Opus |
| Orchestrator | Task distribution, agent tracking, correction cycles | Inherited |
| Implementer | Code in worktree, test, submit evidence, handle corrections | Sonnet |
| Auditor | Independent per-row verification of RTM evidence | GPT/Codex |
| Skill | Purpose |
|---|---|
| `consensus-loop:orchestrator` | Session orchestration — scout, distribute, track, correct |
| `consensus-loop:verify` | Done-criteria verification (8 categories) |
| `consensus-loop:merge` | Squash-merge worktree with structured commit |
| `consensus-loop:planner` | Interactive track definition + work breakdown |
| `consensus-loop:guide` | Evidence package writing guide |

| Agent | Purpose |
|---|---|
| `consensus-loop:implementer` | Headless worker in worktree — code, test, evidence |
| `consensus-loop:scout` | Read-only RTM generator — 3-way traceability |
```
# 1. Install
claude plugin marketplace add berrzebb/claude-plugins
claude plugin install consensus-loop@berrzebb-plugins

# 2. Configure (edit tags + paths)
# 3. Edit templates/references/ for your team's policies
```

Minimal config for English projects:

```json
{
  "plugin": { "locale": "en" },
  "consensus": {
    "watch_file": "docs/review/author.md",
    "trigger_tag": "[REVIEW_NEEDED]",
    "agree_tag": "[APPROVED]",
    "pending_tag": "[CHANGES_REQUESTED]"
  }
}
```

| Contributor | Contributions |
|---|---|
| @berrzebb | Core architecture, RTM system, MCP tools, multi-agent orchestration |
| @dandacompany | Security fixes (#1 shell injection, #2 plugin support), locale path traversal + ESM require fix |
MIT