Hypothesis
Without a normative layer, independently implemented LLM agents will express the same coordination concepts (decline, delegation, failure, handoff) using incompatible semantics — producing a Tower of Babel effect where the system cannot distinguish between fundamentally different situations. With CaseHub's normative layer, the same agents produce semantically consistent, auditable, interoperable outcomes regardless of which LLM powers them.
This experiment tests that hypothesis empirically, using LangChain4j as the baseline (no normative layer) and CaseHub + Qhorus as the treatment (normative layer enforced at protocol level).
Why this scenario — and not a simpler one
A code review with one handoff only exercises DECLINE and HANDOFF — two of the speech act types and a small fraction of the commitment lifecycle. A skeptic can fairly say that is just two extra status values. The scenario below is chosen because every meaningful speech act type becomes load-bearing, and every distinction in the commitment lifecycle has immediate operational consequence. It is also a scenario every engineer has lived through.
Scenario: Production Database Corruption During Deployment
A migration script deployed to production has a bug that corrupts a subset of user records. Five agents must coordinate to diagnose, recover, and communicate — under time pressure, with a 30-minute P0 SLA, and with the deployment window closing.
Agents:
- Detection (Claude): monitors error rates, creates initial alert, opens the case
- Diagnosis (Gemini): investigates root cause — needs DB access it may not have
- Rollback (Claude, different instance): attempts to reverse the deployment
- Recovery (GPT-4): recovers corrupted records from backup if rollback fails
- Communication (Claude, third instance): notifies affected users — cannot act until scope is known
- Supervisor (Claude, fourth instance): the incident commander. Holds the overarching obligation and must know at any point what is being worked, what has failed, what is stalled, and who is accountable.
Why each speech act type is exercised
| Speech act | Who sends it | Why it matters here |
|---|---|---|
| QUERY | Diagnosis → all agents | "Who has production DB read access?" — must ask before committing |
| RESPONSE | DBA-capable agent replies | Confirms capability before obligation is created |
| COMMAND | Supervisor → Diagnosis | Formal obligation created: you are accountable for diagnosing this |
| ACKNOWLEDGED | Diagnosis → Supervisor | Confirms active acceptance — not just queued |
| STATUS | Rollback → all | "Migration is partially reversible, attempting schema rollback" — 30-second updates |
| DECLINE | Communication → Supervisor | Cannot notify users yet — scope still unknown. Deliberately chosen, not a failure |
| FAILURE | Rollback → Supervisor | Attempted rollback, schema change is irreversible. Tried and could not complete — distinct from DECLINE |
| HANDOFF | Rollback → Recovery | Formal transfer of obligation to Recovery agent with full context |
| DONE | Recovery → Supervisor | Records recovered, integrity verified |
| EVENT | All agents | Telemetry: tool call durations, DB query times, records checked |
Why the FAILURE/DECLINE distinction is the critical test
When Rollback fails and Communication declines, both look identical without a normative layer — the system sees two agents that did not complete their work. The correct response is completely different:
- FAILURE (Rollback): the rollback strategy is exhausted. Escalate to Recovery immediately. Do not retry.
- DECLINE (Communication): the agent is fine and waiting. When Recovery completes and scope is known, re-issue the COMMAND.
During a live P0, getting this wrong costs minutes. Minutes measured in affected users and lost revenue.
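The routing difference argued above can be made concrete as a small sketch. This is illustrative pseudologic, not CaseHub's API: the state names mirror this document, and the EXPIRED rule (page a human on SLA breach) is an assumption added for completeness.

```java
// Hypothetical sketch: how a supervisor routes on terminal commitment states.
// State names follow this document; the routing strings are illustrative.
enum CommitmentState { OPEN, ACKNOWLEDGED, FULFILLED, DECLINED, FAILED, EXPIRED }

class SupervisorRouting {
    /** What the incident commander should do next, given an agent's commitment state. */
    static String nextAction(CommitmentState state) {
        return switch (state) {
            // Strategy exhausted: do not retry, hand the obligation to another agent.
            case FAILED    -> "ESCALATE_TO_ALTERNATE_AGENT";
            // Deliberate refusal: the agent is healthy, re-issue once preconditions hold.
            case DECLINED  -> "REISSUE_WHEN_PRECONDITIONS_MET";
            // Assumption: deadline passed with the obligation still open, page a human.
            case EXPIRED   -> "PAGE_HUMAN_SLA_BREACH";
            case FULFILLED -> "CLOSE";
            default        -> "WAIT";
        };
    }
}
```

Without the FAILED/DECLINED distinction, both branches collapse into one, and the supervisor cannot choose between escalation and re-issue.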
Variant A — LangChain4j (no normative layer)
Setup
Five LangChain4j `@AiService` agents. Communication via natural language and free-form tool calls. No protocol enforced. Each agent has a `reportStatus(message)` tool and a `requestHelp(agentId, reason, context)` tool.
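For concreteness, the two free-form tools can be sketched in plain Java (hypothetical shapes; in the actual setup they would be LangChain4j tool methods bound to each agent). The point of the sketch is what the transcript looks like: everything lands as strings, so nothing structural separates a failure from a decline.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of Variant A's tool surface: all coordination is free text
// appended to a transcript, with no typed outcome states.
class FreeFormTools {
    record Entry(String agent, String tool, String content) {}

    final List<Entry> transcript = new ArrayList<>();

    void reportStatus(String agent, String message) {
        transcript.add(new Entry(agent, "reportStatus", message));
    }

    void requestHelp(String agent, String targetAgentId, String reason, String context) {
        transcript.add(new Entry(agent, "requestHelp",
                targetAgentId + ": " + reason + " / " + context));
    }
}
```

A supervisor reading this transcript must infer from wording alone whether "couldn't complete the rollback" and "can't notify users yet" mean the same thing.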
System prompts (neutral — no CaseHub concepts)
Rollback agent:
You are a deployment rollback specialist. When asked to reverse a deployment:
- Attempt the rollback
- Report your progress and outcome
- If you cannot complete the rollback, explain why and suggest next steps
Communication agent:
You are responsible for notifying affected users.
- Only send notifications when you know the scope of the incident
- If you cannot act yet, say so and wait
Supervisor agent (measurement instrument):
You are observing a production incident response. Five agents are coordinating.
You will receive a transcript of their communications.
Answer these questions from the transcript alone:
1. Did the Rollback agent complete the rollback, deliberately choose not to, or try and fail?
2. Did the Communication agent refuse to act, or fail to act? What is the difference?
3. Who is currently accountable for recovering the corrupted records?
4. At what point did the 30-minute P0 SLA breach? Who held an open obligation at that moment?
5. If this incident is reviewed in 6 months, what does the record tell you about each agent's conduct?
6. Which agent's future assignments should be reviewed based on this incident?
What to record:
- Raw tool call transcripts from all five agents
- All natural language messages exchanged
- Supervisor answers to the 6 questions above
- Run 5 times with identical inputs. Record variance across runs.
Variant B — CaseHub + Qhorus (normative layer enforced)
Setup
The same five LLM agents interacting exclusively via Qhorus MCP tools. Speech act types are enforced by the protocol — an agent cannot express a failure as free text; it must call `fail_commitment(commitmentId, reason)`. The commitment store records the full obligation state of all five agents at all times.
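A minimal sketch of what "enforced at protocol level" means. The tool names follow this document (`fail_commitment`, `decline_commitment`); the Java shapes are assumptions for illustration, not Qhorus's actual API. Terminal outcomes are typed state transitions on a stored commitment, not strings.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical commitment store with enforced transitions. An agent cannot
// "say" it failed; it must move a named commitment into the FAILED state.
class CommitmentStore {
    enum State { OPEN, ACKNOWLEDGED, FULFILLED, DECLINED, FAILED }
    record Commitment(String id, String agent, State state, String reason) {}

    private final Map<String, Commitment> store = new HashMap<>();

    void open(String id, String agent) {
        store.put(id, new Commitment(id, agent, State.OPEN, null));
    }

    void failCommitment(String id, String reason) {
        transition(id, State.FAILED, reason);   // attempted and could not complete
    }

    void declineCommitment(String id, String reason) {
        transition(id, State.DECLINED, reason); // deliberate, preconditions unmet
    }

    private void transition(String id, State target, String reason) {
        Commitment c = store.get(id);
        if (c == null || c.state() == State.FULFILLED) {
            throw new IllegalStateException("no open commitment: " + id);
        }
        store.put(id, new Commitment(id, c.agent(), target, reason));
    }

    State stateOf(String id) { return store.get(id).state(); }
}
```

After the scripted incident, the store holds exactly the distinction Variant A loses: `p0-rollback` is FAILED and `p0-comms` is DECLINED, as queryable state rather than prose.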
Qhorus channel setup
Channel: case-p0-db-corruption-2026042801/coordination
Agents registered: detection, diagnosis, rollback, recovery, communication
Case opened by: casehub-assisteddev (triggered by the Detection agent)
Initial COMMAND chain
COMMAND: casehub-assisteddev → diagnosis
"Investigate database corruption. Scope and root cause required within 15 minutes."
correlationId: "p0-diagnosis"
→ OPEN commitment on diagnosis agent
COMMAND: casehub-assisteddev → rollback
"Attempt deployment rollback while diagnosis proceeds in parallel."
correlationId: "p0-rollback"
→ OPEN commitment on rollback agent
COMMAND: casehub-assisteddev → communication
"Prepare user notification. Do not send until scope confirmed."
correlationId: "p0-comms"
→ OPEN commitment on communication agent
Agent system prompts (MCP-aware)
All agents:
You are connected to a Qhorus coordination channel for a P0 production incident.
Use the typed MCP tools for all coordination:
- acknowledge_commitment: when you begin active work
- send_message(type=STATUS): every 30 seconds during active work
- send_message(type=QUERY): before taking any action requiring capabilities you may not have
- fulfill_commitment + send_message(type=DONE): when your obligation is complete
- decline_commitment: if you deliberately cannot act (scope unknown, access denied by policy)
- fail_commitment: if you attempted work and could not complete it (technical failure)
- send_message(type=HANDOFF, target=X): if transferring your obligation to another agent
Do not use free text to express coordination decisions. Use the typed tools.
What the supervisor reads (the ledger, not the chat)
`list_ledger_entries(channelName: "case-p0-db-corruption-2026042801/coordination")`
`get_obligation_chain(correlationId: "p0-rollback")`
`list_stalled_obligations(channelName: "...")`
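To fix intuitions about what these reads return, here is a sketch of a ledger entry and the stalled-obligation query. The field names are assumptions for illustration, not the real quarkus-qhorus schema.

```java
import java.time.Instant;
import java.util.List;

// Hypothetical ledger entry shape plus the SLA-breach query the supervisor
// relies on: open obligations whose deadline has passed.
class LedgerSketch {
    enum State { OPEN, ACKNOWLEDGED, FULFILLED, DECLINED, FAILED, EXPIRED }
    record LedgerEntry(String correlationId, String agent, State state,
                       Instant at, Instant deadline) {}

    /** Obligations still open or acknowledged past their deadline. */
    static List<LedgerEntry> stalled(List<LedgerEntry> ledger, Instant now) {
        return ledger.stream()
                .filter(e -> e.state() == State.OPEN || e.state() == State.ACKNOWLEDGED)
                .filter(e -> e.deadline().isBefore(now))
                .toList();
    }
}
```

With this shape, question Q4 ("when did the SLA breach, and who held an open obligation?") is a mechanical query rather than a judgment call over chat logs.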
Supervisor system prompt:
You are reviewing a production incident response. You have access to the Qhorus
commitment ledger for the incident — not the chat transcript.
Answer these questions from the ledger alone:
1. Did the Rollback agent complete the rollback, deliberately choose not to, or try and fail?
2. Did the Communication agent refuse to act, or fail to act? What is the difference?
3. Who is currently accountable for recovering the corrupted records?
4. At what point did the 30-minute P0 SLA breach? Who held an open obligation at that moment?
5. If this incident is reviewed in 6 months, what does the record tell you about each agent's conduct?
6. Which agent's future assignments should be reviewed based on this incident?
Measurement Criteria
Primary metrics
| Metric | Variant A (LangChain4j) | Variant B (CaseHub) | How measured |
|---|---|---|---|
| FAILURE vs DECLINE distinction | Can the supervisor distinguish Rollback's failure from Communication's decline? | Always — FAILED vs DECLINED are distinct commitment states | Supervisor answer to Q1+Q2 |
| Accountability clarity | Who holds the recovery obligation? Consistent across 5 runs? | Always — HANDOFF creates explicit new commitment | Supervisor answer to Q3 |
| SLA breach moment | Can the supervisor identify precisely when and who? | Always — EXPIRED state + timestamp | Supervisor answer to Q4 |
| Supervisor consistency across runs | % of 5 runs where supervisor gives identical answers to all 6 questions | Expected: 100% — ledger is identical | Count differing answers |
| Trust signal quality | What signal exists for future routing decisions? | FAILED rollback → review rollback agent's reliability; FULFILLED recovery → reinforce | Inspect what trust model receives |
| Post-incident forensics | What can a 6-month-later reviewer reconstruct? | Full obligation chain, what each agent knew, when they acted | Supervisor answer to Q5 |
Secondary metrics
- Supervisor confidence (hedged vs definitive answers)
- Time for supervisor to answer (longer = record is less clear)
- Whether supervisor's Q6 recommendation (who to review) is correct and consistent
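The consistency metric can be computed mechanically rather than judged. A sketch, assuming each run yields one ordered list of six normalized answers:

```java
import java.util.List;

// Sketch of the supervisor-consistency metric: the fraction of runs whose six
// answers exactly match the first run's answers.
class ConsistencyMetric {
    static double consistency(List<List<String>> answersPerRun) {
        List<String> reference = answersPerRun.get(0);
        long matching = answersPerRun.stream().filter(reference::equals).count();
        return (double) matching / answersPerRun.size();
    }
}
```

The hypothesis predicts 1.0 for Variant B (the ledger is identical across runs) and strictly less than 1.0 for Variant A.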
The Fixed Incident (identical across all 10 runs)
Trigger: Error rate spikes to 340% at 14:03:22. Detection agent observes.
The corruption: A migration that adds a `user_preferences` JSONB column runs an UPDATE statement with a flawed WHERE clause, corrupting preferences for users created between 2024-01-01 and 2024-06-30 (~180,000 records).
The rollback constraint: The migration added the column (reversible) but the UPDATE statement modified data (not reversible via schema rollback alone). The schema can be rolled back; the data cannot.
The recovery path: A daily backup from 02:00 contains clean records. Recovery must extract the affected rows, validate them, and re-apply. Estimated time: 45 minutes.
The SLA: P0 SLA requires initial diagnosis within 15 minutes and user notification within 30 minutes of incident open. The experiment runs for 60 simulated minutes.
What should happen:
- Diagnosis identifies root cause within 15 minutes ✓
- Rollback attempts schema reversal, discovers data is not recoverable via rollback → FAILED at ~20 min
- Rollback HANDOFFs to Recovery
- Communication DECLINES initial notification command (scope not yet confirmed)
- At 30 minutes: notification SLA breaches. Communication commitment → EXPIRED
- Recovery completes at ~45 minutes → FULFILLED
- Communication re-receives COMMAND with confirmed scope → FULFILLED
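The scripted run above can be pinned down as a golden trace that each Variant B ledger is checked against. The minute offsets for the early decline (5) and the re-issued notification (50) are assumptions; the rest follow the incident script.

```java
import java.util.List;

// Golden trace for the scripted incident: (simulated minute, agent, event).
// A Variant B run passes if its ledger contains these transitions in order.
class GoldenTrace {
    record Step(int minute, String agent, String event) {}

    static final List<Step> EXPECTED = List.of(
        new Step(5,  "communication", "DECLINED"),          // assumed minute: scope unknown
        new Step(14, "diagnosis",     "FULFILLED"),         // within the 15-minute diagnosis SLA
        new Step(20, "rollback",      "FAILED"),            // schema reversal insufficient
        new Step(20, "rollback",      "HANDOFF->recovery"), // explicit obligation transfer
        new Step(30, "communication", "EXPIRED"),           // notification SLA breaches
        new Step(45, "recovery",      "FULFILLED"),
        new Step(50, "communication", "FULFILLED"));        // assumed minute: re-issued COMMAND

    /** Index of the first step matching agent+event, or -1 if absent. */
    static int indexOf(String agent, String event) {
        for (int i = 0; i < EXPECTED.size(); i++) {
            Step s = EXPECTED.get(i);
            if (s.agent().equals(agent) && s.event().equals(event)) return i;
        }
        return -1;
    }
}
```

Ordering checks fall out directly, e.g. the rollback FAILURE must precede the recovery FULFILLED, and the decline must appear as its own event rather than being folded into the failure.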
Expected Results
| Question | Variant A expected | Variant B expected |
|---|---|---|
| Q1: Rollback failure vs decline | Variable across runs — natural language varies | Consistent — FAILED state, always |
| Q2: Communication decline vs failure | Often confused — both look like "didn't do it" | Consistent — DECLINED state, always |
| Q3: Recovery accountability | Often unclear — who owns it after handoff? | Consistent — HANDOFF created explicit new commitment |
| Q4: SLA breach moment | Often missed or incorrect | Consistent — EXPIRED timestamp is precise |
| Q5: 6-month forensics | "Rollback agent said it couldn't do it" | Full obligation chain with named parties, timestamps, what each agent knew |
| Q6: Who to review | Inconsistent — may blame wrong agent | Consistent — Rollback FAILED (investigate), Communication DECLINED correctly (no review needed) |
The key result: Variant A will conflate Rollback's FAILURE with Communication's DECLINE — because without formal semantics, both are just "didn't complete." In a post-incident review, this means the wrong agent gets blamed. In a live incident, this means the wrong recovery action is taken.
Implementation Notes
LangChain4j setup (Variant A)
- `langchain4j-anthropic` (Claude agents), `langchain4j-open-ai` (GPT-4 Recovery agent)
- Simple `IncidentOrchestrator` that routes messages between agents
- All tool calls and message content logged to structured JSON for supervisor to read
- Version: LangChain4j 0.38+
CaseHub setup (Variant B)
- Qhorus with H2 in-memory (test isolation, deterministic replay)
- Five agent sessions connected via MCP (
quarkus-qhorus MCP endpoint)
quarkus-qhorus-testing InMemory stores for fast reset between runs
- Commitment ledger captured via
list_ledger_entries after each run
- Obligation chain via
get_obligation_chain(correlationId) for supervisor
Timing simulation
- Real LLM calls will not simulate 45-minute recovery — compress time by having agents report simulated elapsed time in their STATUS messages and DONE content
- The important thing is the sequence and speech act types, not wall-clock duration
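One way to compress time as described, sketched here as an assumption rather than a feature of either framework: agents share a simulated clock they advance explicitly instead of sleeping, and stamp STATUS and DONE content with it.

```java
// Sketch of a simulated clock for compressing the 60-minute incident: agents
// advance it instead of waiting, and stamp every message with simulated time.
class SimulatedClock {
    private int elapsedSeconds = 0;

    void advance(int seconds) { elapsedSeconds += seconds; }

    /** e.g. "T+20:00" for a STATUS message sent 20 simulated minutes in. */
    String stamp() {
        return String.format("T+%02d:%02d", elapsedSeconds / 60, elapsedSeconds % 60);
    }
}
```

This keeps the sequence and speech act types faithful to the script while the whole run completes in real minutes.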
Success Criteria
The hypothesis is confirmed if:
- Variant B supervisor answers are consistent across all 5 runs on all 6 questions
- Variant A supervisor answers vary on at least Q1, Q2, or Q4 across the 5 runs
- The Variant A supervisor fails to reliably distinguish FAILURE from DECLINE; the Variant B supervisor always can
A strong result would additionally show that Variant B supervisor — reading the ledger cold, with no prior CaseHub knowledge — produces the same accountability assessment as a domain expert reading the same ledger. This validates not just that the normative layer is consistent, but that it is interpretable by any LLM that understands the underlying formal semantics.
Related issues
- engine#101 — LLM Supervisor Mode (this experiment validates the semantic foundation supervisor mode relies on)
- engine#102 — Use case patterns (empirical proof-of-concept for interoperability claims)
- quarkus-qhorus#123 — commitment outcomes → LedgerAttestation (required for trust signal measurement in Variant B)