experiment: normative layer interoperability test — Tower of Babel vs formal semantics #189

@mdproctor

Hypothesis

Without a normative layer, independently implemented LLM agents will express the same coordination concepts (decline, delegation, failure, handoff) using incompatible semantics — producing a Tower of Babel effect where the system cannot distinguish between fundamentally different situations. With CaseHub's normative layer, the same agents produce semantically consistent, auditable, interoperable outcomes regardless of which LLM powers them.

This experiment tests that hypothesis empirically, using LangChain4j as the baseline (no normative layer) and CaseHub + Qhorus as the treatment (normative layer enforced at protocol level).


Why this scenario — and not a simpler one

A code review with one handoff only exercises DECLINE and HANDOFF — two of nine speech act types and a small fraction of the commitment lifecycle. A skeptic can fairly say that is just two extra status values. The scenario below is chosen because every meaningful speech act type becomes load-bearing, and every distinction in the commitment lifecycle has immediate operational consequence. It is also a scenario every engineer has lived through.


Scenario: Production Database Corruption During Deployment

A migration script deployed to production has a bug that corrupts a subset of user records. Five agents must coordinate to diagnose, recover, and communicate — under time pressure, with a 30-minute P0 SLA, and with the deployment window closing.

Agents:

  • Detection (Claude): monitors error rates, creates initial alert, opens the case
  • Diagnosis (Gemini): investigates root cause — needs DB access it may not have
  • Rollback (Claude, different instance): attempts to reverse the deployment
  • Recovery (GPT-4): recovers corrupted records from backup if rollback fails
  • Communication (Claude, third instance): notifies affected users — cannot act until scope is known

Supervisor (Claude, fourth instance): the incident commander. Holds the overarching obligation and must know at any point what is being worked, what has failed, what is stalled, and who is accountable.

Why each speech act type is exercised

| Speech act | Who sends it | Why it matters here |
|---|---|---|
| QUERY | Diagnosis → all agents | "Who has production DB read access?" Must ask before committing |
| RESPONSE | DBA-capable agent replies | Confirms capability before obligation is created |
| COMMAND | Supervisor → Diagnosis | Formal obligation created: you are accountable for diagnosing this |
| ACKNOWLEDGED | Diagnosis → Supervisor | Confirms active acceptance, not just queued |
| STATUS | Rollback → all | "Migration is partially reversible, attempting schema rollback" (30-second updates) |
| DECLINE | Communication → Supervisor | Cannot notify users yet because scope is still unknown. Deliberately chosen, not a failure |
| FAILURE | Rollback → Supervisor | Attempted rollback; the data change is not reversible via schema rollback. Tried and could not complete, which is distinct from DECLINE |
| HANDOFF | Rollback → Recovery | Formal transfer of obligation to Recovery agent with full context |
| DONE | Recovery → Supervisor | Records recovered, integrity verified |
| EVENT | All agents | Telemetry: tool call durations, DB query times, records checked |
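The table can be read as a closed vocabulary. A minimal Java sketch, assuming the speech act names listed in the table map directly onto an enum. The two classification methods are this sketch's reading of the "Why it matters here" column; they are hypothetical and not taken from the Qhorus type model.

```java
/** Hypothetical enum mirroring the speech act names in the table above. */
enum SpeechAct {
    QUERY, RESPONSE, COMMAND, ACKNOWLEDGED, STATUS, DECLINE, FAILURE, HANDOFF, DONE, EVENT;

    /** Assumption: these acts end the sender's own obligation. */
    boolean endsSenderObligation() {
        return switch (this) {
            case DECLINE, FAILURE, HANDOFF, DONE -> true;
            default -> false;
        };
    }

    /** Assumption: these acts create an obligation for the recipient. */
    boolean createsObligation() {
        return this == COMMAND || this == HANDOFF;
    }
}
```

In this reading, STATUS and EVENT carry information only, while DECLINE and FAILURE both end an obligation but remain distinguishable by type.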

Why the FAILURE/DECLINE distinction is the critical test

When Rollback fails and Communication declines, both look identical without a normative layer — the system sees two agents that did not complete their work. The correct response is completely different:

  • FAILURE (Rollback): the rollback strategy is exhausted. Escalate to Recovery immediately. Do not retry.
  • DECLINE (Communication): the agent is fine and waiting. When Recovery completes and scope is known, re-issue the COMMAND.

During a live P0, getting this wrong costs minutes, and those minutes are measured in affected users and lost revenue.
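The two responses above can be sketched as a mapping from typed outcome to supervisor action. This is an illustrative sketch only: `Outcome`, `NextAction`, and `SupervisorPolicy` are hypothetical names, not part of CaseHub or Qhorus.

```java
/** Hypothetical terminal outcomes, named after the commitment states used in this issue. */
enum Outcome { FULFILLED, DECLINED, FAILED }

/** Hypothetical follow-up actions for the supervisor (illustrative names). */
enum NextAction { NONE, REISSUE_WHEN_SCOPE_KNOWN, ESCALATE_NO_RETRY }

final class SupervisorPolicy {
    /** With typed outcomes, each state maps to exactly one follow-up action. */
    static NextAction react(Outcome outcome) {
        return switch (outcome) {
            case FULFILLED -> NextAction.NONE;
            // DECLINE: the agent is fine and waiting; re-issue the COMMAND later.
            case DECLINED -> NextAction.REISSUE_WHEN_SCOPE_KNOWN;
            // FAILURE: the strategy is exhausted; escalate and do not retry.
            case FAILED -> NextAction.ESCALATE_NO_RETRY;
        };
    }
}
```

If both outcomes were collapsed into a single "did not complete" flag, `react` would have no basis for choosing between re-issuing and escalating.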


Variant A — LangChain4j (no normative layer)

Setup

Five LangChain4j @AiService agents. Communication via natural language and free-form tool calls. No protocol enforced. Each agent has a reportStatus(message) tool and a requestHelp(agentId, reason, context) tool.
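To make the baseline concrete, here is a dependency-free sketch of the two tools. The method names and parameters come from the setup above; the `ToolCall` record, the transcript list, and the return values are assumptions. In a real LangChain4j setup these methods would presumably be exposed through the library's `@Tool` annotation, which is omitted here so the sketch runs on its own.

```java
import java.util.List;

/** Hypothetical Variant A tool holder: every payload is free text. */
final class BaselineTools {
    /** One logged tool call; nothing in the type says what kind of outcome it reports. */
    record ToolCall(String agent, String tool, String text) {}

    private final String agent;
    private final List<ToolCall> transcript;

    BaselineTools(String agent, List<ToolCall> transcript) {
        this.agent = agent;
        this.transcript = transcript;
    }

    /** Free-form status report; success, decline, and failure all arrive this way. */
    String reportStatus(String message) {
        transcript.add(new ToolCall(agent, "reportStatus", message));
        return "ok";
    }

    /** Free-form request to another agent; not a formal transfer of obligation. */
    String requestHelp(String agentId, String reason, String context) {
        String text = "to=" + agentId + "; reason=" + reason + "; context=" + context;
        transcript.add(new ToolCall(agent, "requestHelp", text));
        return "ok";
    }
}
```

A failed rollback and a deliberate decline both land in the transcript as `reportStatus` calls, so any distinction has to be inferred from wording.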

System prompts (neutral — no CaseHub concepts)

Rollback agent:

You are a deployment rollback specialist. When asked to reverse a deployment:
- Attempt the rollback
- Report your progress and outcome
- If you cannot complete the rollback, explain why and suggest next steps

Communication agent:

You are responsible for notifying affected users.
- Only send notifications when you know the scope of the incident
- If you cannot act yet, say so and wait

Supervisor agent (measurement instrument):

You are observing a production incident response. Five agents are coordinating.
You will receive a transcript of their communications.

Answer these questions from the transcript alone:
1. Did the Rollback agent complete the rollback, deliberately choose not to, or try and fail?
2. Did the Communication agent refuse to act, or fail to act? What is the difference?
3. Who is currently accountable for recovering the corrupted records?
4. At what point did the 30-minute P0 SLA breach? Who held an open obligation at that moment?
5. If this incident is reviewed in 6 months, what does the record tell you about each agent's conduct?
6. Which agent's future assignments should be reviewed based on this incident?

What to record:

  • Raw tool call transcripts from all five agents
  • All natural language messages exchanged
  • Supervisor answers to the 6 questions above
  • Run 5 times with identical inputs. Record variance across runs.

Variant B — CaseHub + Qhorus (normative layer enforced)

Setup

The same five LLM agents interact exclusively via Qhorus MCP tools. Speech act types are enforced by the protocol: an agent cannot express a failure as free text; it must call fail_commitment(commitmentId, reason). The commitment store records the full obligation state of all five agents at all times.
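The enforcement idea can be illustrated with a small in-memory state machine. This is a hypothetical stand-in, not the Qhorus commitment store: the class, its transition rules, and the use of correlation-style strings as commitment ids are all assumptions made for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a commitment store. */
final class CommitmentStoreSketch {
    enum State { OPEN, ACKNOWLEDGED, FULFILLED, DECLINED, FAILED }

    private final Map<String, State> states = new HashMap<>();
    private final Map<String, String> reasons = new HashMap<>();

    /** A COMMAND opens a commitment on the receiving agent. */
    void open(String commitmentId) { states.put(commitmentId, State.OPEN); }

    void acknowledgeCommitment(String id) { transition(id, State.ACKNOWLEDGED, null); }
    void fulfillCommitment(String id) { transition(id, State.FULFILLED, null); }
    void declineCommitment(String id, String reason) { transition(id, State.DECLINED, reason); }
    void failCommitment(String id, String reason) { transition(id, State.FAILED, reason); }

    State stateOf(String id) { return states.get(id); }
    String reasonFor(String id) { return reasons.get(id); }

    /** Assumption: only OPEN or ACKNOWLEDGED commitments may change state. */
    private void transition(String id, State target, String reason) {
        State current = states.get(id);
        if (current == null) {
            throw new IllegalArgumentException("unknown commitment: " + id);
        }
        if (current != State.OPEN && current != State.ACKNOWLEDGED) {
            throw new IllegalStateException(id + " is already " + current);
        }
        states.put(id, target);
        if (reason != null) {
            reasons.put(id, reason);
        }
    }
}
```

Because the outcome is chosen by which method is called, the store never has to parse the reason text to tell FAILED from DECLINED.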

Qhorus channel setup

Channel: case-p0-db-corruption-2026042801/coordination
Agents registered: detection, diagnosis, rollback, recovery, communication
Case opened by: casehub-assisteddev (Detection agent triggered)

Initial COMMAND chain

COMMAND: casehub-assisteddev → diagnosis
  "Investigate database corruption. Scope and root cause required within 15 minutes."
  correlationId: "p0-diagnosis"
  → OPEN commitment on diagnosis agent

COMMAND: casehub-assisteddev → rollback
  "Attempt deployment rollback while diagnosis proceeds in parallel."
  correlationId: "p0-rollback"
  → OPEN commitment on rollback agent

COMMAND: casehub-assisteddev → communication
  "Prepare user notification. Do not send until scope confirmed."
  correlationId: "p0-comms"
  → OPEN commitment on communication agent

Agent system prompts (MCP-aware)

All agents:

You are connected to a Qhorus coordination channel for a P0 production incident.
Use the typed MCP tools for all coordination:

- acknowledge_commitment: when you begin active work
- send_message(type=STATUS): every 30 seconds during active work
- send_message(type=QUERY): before taking any action requiring capabilities you may not have
- fulfill_commitment + send_message(type=DONE): when your obligation is complete
- decline_commitment: if you deliberately cannot act (scope unknown, access denied by policy)
- fail_commitment: if you attempted work and could not complete it (technical failure)
- send_message(type=HANDOFF, target=X): if transferring your obligation to another agent

Do not use free text to express coordination decisions. Use the typed tools.

What the supervisor reads (the ledger, not the chat)

list_ledger_entries(channelName: "case-p0-db-corruption-2026042801/coordination")
get_obligation_chain(correlationId: "p0-rollback")
list_stalled_obligations(channelName: "...")
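As a rough illustration of what reading "the ledger, not the chat" means, the sketch below reconstructs an obligation chain by filtering typed entries on a correlation id. The `Entry` record, its fields, and the `obligationChain` helper are hypothetical; the actual shape of what `get_obligation_chain` returns is not specified in this issue.

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical ledger model; not the Qhorus data model. */
final class LedgerSketch {
    /** One typed ledger entry, stamped with simulated minutes since incident open. */
    record Entry(int minute, String correlationId, String sender, String type) {}

    /** Returns the entries for one obligation, ordered by time. */
    static List<Entry> obligationChain(List<Entry> ledger, String correlationId) {
        return ledger.stream()
                .filter(e -> e.correlationId().equals(correlationId))
                .sorted(Comparator.comparingInt(Entry::minute))
                .toList();
    }
}
```

The supervisor's questions then become lookups on `type` and `minute` rather than interpretations of prose.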

Supervisor system prompt:

You are reviewing a production incident response. You have access to the Qhorus
commitment ledger for the incident — not the chat transcript.

Answer these questions from the ledger alone:
1. Did the Rollback agent complete the rollback, deliberately choose not to, or try and fail?
2. Did the Communication agent refuse to act, or fail to act? What is the difference?
3. Who is currently accountable for recovering the corrupted records?
4. At what point did the 30-minute P0 SLA breach? Who held an open obligation at that moment?
5. If this incident is reviewed in 6 months, what does the record tell you about each agent's conduct?
6. Which agent's future assignments should be reviewed based on this incident?

Measurement Criteria

Primary metrics

| Metric | Variant A (LangChain4j) | Variant B (CaseHub) | How measured |
|---|---|---|---|
| FAILURE vs DECLINE distinction | Can the supervisor distinguish Rollback's failure from Communication's decline? | Always: FAILED vs DECLINED are distinct commitment states | Supervisor answer to Q1+Q2 |
| Accountability clarity | Who holds the recovery obligation? Consistent across 5 runs? | Always: HANDOFF creates explicit new commitment | Supervisor answer to Q3 |
| SLA breach moment | Can the supervisor identify precisely when and who? | Always: EXPIRED state + timestamp | Supervisor answer to Q4 |
| Supervisor consistency across runs | % of 5 runs where supervisor gives identical answers to all 6 questions | Expected: 100% (ledger is identical) | Count differing answers |
| Trust signal quality | What signal exists for future routing decisions? | FAILED rollback → review rollback agent's reliability; FULFILLED recovery → reinforce | Inspect what trust model receives |
| Post-incident forensics | What can a 6-month-later reviewer reconstruct? | Full obligation chain, what each agent knew, when they acted | Supervisor answer to Q5 |
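The "supervisor consistency across runs" metric needs a precise definition before it can be counted. One possible reading, sketched below, is the share of runs whose full six-answer vector matches the most common vector. That reading, and the assumption that free-text answers have already been normalized to short labels, are choices made for this sketch rather than something the issue specifies.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for the run-consistency metric. */
final class ConsistencyMetric {
    /**
     * Each inner list holds one run's normalized answers to Q1..Q6.
     * Returns the percentage of runs that match the most common answer vector.
     */
    static double consistencyPercent(List<List<String>> answersPerRun) {
        if (answersPerRun.isEmpty()) {
            throw new IllegalArgumentException("at least one run is required");
        }
        Map<List<String>, Integer> counts = new HashMap<>();
        for (List<String> run : answersPerRun) {
            counts.merge(run, 1, Integer::sum);
        }
        int modal = Collections.max(counts.values());
        return 100.0 * modal / answersPerRun.size();
    }
}
```

With five runs the metric can only take the values 20, 40, 60, 80, or 100, so small differences between variants should be read with that granularity in mind.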

Secondary metrics

  • Supervisor confidence (hedged vs definitive answers)
  • Time for supervisor to answer (longer = record is less clear)
  • Whether supervisor's Q6 recommendation (who to review) is correct and consistent

The Fixed Incident (identical across all 10 runs: 5 per variant)

Trigger: Error rate spikes to 340% at 14:03:22. Detection agent observes.

The corruption: A migration that adds a user_preferences JSONB column runs an UPDATE statement with a flawed WHERE clause, corrupting preferences for users created between 2024-01-01 and 2024-06-30 (~180,000 records).

The rollback constraint: The migration added the column (reversible) but the UPDATE statement modified data (not reversible via schema rollback alone). The schema can be rolled back; the data cannot.

The recovery path: A daily backup from 02:00 contains clean records. Recovery must extract the affected rows, validate them, and re-apply. Estimated time: 45 minutes.

The SLA: P0 SLA requires initial diagnosis within 15 minutes and user notification within 30 minutes of incident open. The experiment runs for 60 simulated minutes.

What should happen:

  1. Diagnosis identifies root cause within 15 minutes ✓
  2. Rollback attempts schema reversal, discovers data is not recoverable via rollback → FAILED at ~20 min
  3. Rollback HANDOFFs to Recovery
  4. Communication DECLINES initial notification command (scope not yet confirmed)
  5. At 30 minutes: notification SLA breaches. Communication commitment → EXPIRED
  6. Recovery completes at ~45 minutes → FULFILLED
  7. Communication re-receives COMMAND with confirmed scope → FULFILLED
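A run can be checked against this script mechanically. The sketch below encodes steps 2 to 7 per agent and tests whether each agent's expected events appear in order in an observed event list, allowing other events in between. Step 1 is left out because the issue does not name the state Diagnosis ends in. The `Event` record and the per-agent ordering rule are assumptions of this sketch.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical script check for the fixed incident. */
final class ExpectedSequence {
    record Event(String agent, String label) {}

    // Per-agent expected order, taken from steps 2 to 7 of the list above.
    static final Map<String, List<String>> EXPECTED = Map.of(
            "rollback", List.of("FAILED", "HANDOFF"),
            "communication", List.of("DECLINED", "EXPIRED", "FULFILLED"),
            "recovery", List.of("FULFILLED"));

    /** True if every agent's expected labels occur in order within the observed events. */
    static boolean followsScript(List<Event> observed) {
        for (Map.Entry<String, List<String>> entry : EXPECTED.entrySet()) {
            List<String> wanted = entry.getValue();
            int next = 0;
            for (Event event : observed) {
                if (next < wanted.size()
                        && event.agent().equals(entry.getKey())
                        && event.label().equals(wanted.get(next))) {
                    next++;
                }
            }
            if (next < wanted.size()) {
                return false;
            }
        }
        return true;
    }
}
```

Checking order per agent rather than globally avoids assuming when, relative to the rollback, the Communication agent issues its DECLINE.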

Expected Results

| Question | Variant A expected | Variant B expected |
|---|---|---|
| Q1: Rollback failure vs decline | Variable across runs (natural language varies) | Consistent: FAILED state, always |
| Q2: Communication decline vs failure | Often confused: both look like "didn't do it" | Consistent: DECLINED state, always |
| Q3: Recovery accountability | Often unclear: who owns it after handoff? | Consistent: HANDOFF created explicit new commitment |
| Q4: SLA breach moment | Often missed or incorrect | Consistent: EXPIRED timestamp is precise |
| Q5: 6-month forensics | "Rollback agent said it couldn't do it" | Full obligation chain with named parties, timestamps, what each agent knew |
| Q6: Who to review | Inconsistent: may blame wrong agent | Consistent: Rollback FAILED (investigate), Communication DECLINED correctly (no review needed) |

The key expected result: Variant A conflates Rollback's FAILURE with Communication's DECLINE because, without formal semantics, both are just "didn't complete." In a post-incident review, this means the wrong agent gets blamed. In a live incident, it means the wrong recovery action is taken.


Implementation Notes

LangChain4j setup (Variant A)

  • langchain4j-anthropic (Claude agents), langchain4j-open-ai (GPT-4 Recovery agent)
  • Simple IncidentOrchestrator that routes messages between agents
  • All tool calls and message content logged to structured JSON for supervisor to read
  • Version: LangChain4j 0.38+

CaseHub setup (Variant B)

  • Qhorus with H2 in-memory (test isolation, deterministic replay)
  • Five agent sessions connected via MCP (quarkus-qhorus MCP endpoint)
  • quarkus-qhorus-testing InMemory stores for fast reset between runs
  • Commitment ledger captured via list_ledger_entries after each run
  • Obligation chain via get_obligation_chain(correlationId) for supervisor

Timing simulation

  • Real LLM calls will not take 45 minutes, so time must be compressed: have agents report simulated elapsed time in their STATUS messages and DONE content
  • The important thing is the sequence and speech act types, not wall-clock duration
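One way to implement the compressed timeline is a shared simulated clock that the harness advances and agents use to stamp their messages. The class below is a hypothetical sketch; its name, the `[T+Nm]` stamp format, and the deadline check are assumptions, not part of the experiment's tooling.

```java
import java.time.Duration;

/** Hypothetical simulated clock for compressing the 60-minute incident. */
final class SimulatedClock {
    private Duration elapsed = Duration.ZERO;

    /** Moves simulated time forward; the harness calls this between agent turns. */
    void advance(Duration step) {
        if (step.isNegative()) {
            throw new IllegalArgumentException("time cannot move backwards");
        }
        elapsed = elapsed.plus(step);
    }

    Duration elapsed() { return elapsed; }

    /** True once simulated time has reached the given deadline, e.g. the 30-minute SLA. */
    boolean hasReached(Duration deadline) {
        return elapsed.compareTo(deadline) >= 0;
    }

    /** Prefixes a STATUS or DONE text with simulated minutes since incident open. */
    String stamp(String message) {
        return "[T+" + elapsed.toMinutes() + "m] " + message;
    }
}
```

Keeping the clock in the harness rather than in each agent's prompt means both variants see the same simulated times for the same steps.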

Success Criteria

The hypothesis is confirmed if:

  1. Variant B supervisor answers are consistent across all 5 runs on all 6 questions
  2. Variant A supervisor answers vary on at least one of Q1, Q2, or Q4 across the 5 runs
  3. Variant A cannot distinguish FAILURE from DECLINE; Variant B always can

A strong result would additionally show that the Variant B supervisor, reading the ledger cold with no prior CaseHub knowledge, produces the same accountability assessment as a domain expert reading the same ledger. This would validate not just that the normative layer is consistent, but that it is interpretable by any LLM that understands the underlying formal semantics.
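The three numbered criteria can be written down as a single predicate over the recorded runs. The sketch below is one possible encoding: the `Run` record is hypothetical, answers are assumed to be normalized labels, and criterion 3 is read as "at least one Variant A run fails to distinguish FAILURE from DECLINE while every Variant B run does", which is an interpretation rather than the issue's wording.

```java
import java.util.List;

/** Hypothetical encoding of the success criteria. */
final class HypothesisCheck {
    /** One run: normalized answers to Q1..Q6, plus whether FAILURE and DECLINE were told apart. */
    record Run(List<String> answers, boolean distinguishedFailureFromDecline) {}

    /** True if every run gives the same answer to question q (1-based). Requires at least one run. */
    static boolean consistentOn(List<Run> runs, int q) {
        String first = runs.get(0).answers().get(q - 1);
        return runs.stream().allMatch(r -> r.answers().get(q - 1).equals(first));
    }

    static boolean confirmed(List<Run> variantA, List<Run> variantB) {
        // Criterion 1: Variant B is consistent on all six questions.
        boolean c1 = true;
        for (int q = 1; q <= 6; q++) {
            c1 &= consistentOn(variantB, q);
        }
        // Criterion 2: Variant A varies on at least one of Q1, Q2, Q4.
        boolean c2 = !consistentOn(variantA, 1) || !consistentOn(variantA, 2) || !consistentOn(variantA, 4);
        // Criterion 3 (this sketch's reading): A misses the distinction at least once, B never does.
        boolean c3 = variantA.stream().anyMatch(r -> !r.distinguishedFailureFromDecline())
                && variantB.stream().allMatch(Run::distinguishedFailureFromDecline);
        return c1 && c2 && c3;
    }
}
```

Writing the criteria this way forces the ambiguous parts, such as what counts as "cannot distinguish", to be decided before the runs are collected.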


Related issues

  • engine#101 — LLM Supervisor Mode (this experiment validates the semantic foundation supervisor mode relies on)
  • engine#102 — Use case patterns (empirical proof-of-concept for interoperability claims)
  • quarkus-qhorus#123 — commitment outcomes → LedgerAttestation (required for trust signal measurement in Variant B)
