experiment: normative layer interoperability test — Tower of Babel vs formal semantics #189

## Hypothesis

Without a normative layer, independently implemented LLM agents will express the same coordination concepts (decline, delegation, failure, handoff) using incompatible semantics — producing a Tower of Babel effect where the system cannot distinguish between fundamentally different situations. With CaseHub's normative layer, the same agents produce semantically consistent, auditable, interoperable outcomes regardless of which LLM powers them.

This experiment tests that hypothesis empirically, using LangChain4j as the baseline (no normative layer) and CaseHub + Qhorus as the treatment (normative layer enforced at protocol level).

---

## Why this scenario — and not a simpler one

A code review with one handoff only exercises DECLINE and HANDOFF — two of nine speech act types and a small fraction of the commitment lifecycle. A skeptic can fairly say that is just two extra status values. The scenario below is chosen because every meaningful speech act type becomes load-bearing, and every distinction in the commitment lifecycle has immediate operational consequence. It is also a scenario every engineer has lived through.

---

## Scenario: Production Database Corruption During Deployment

A migration script deployed to production has a bug that corrupts a subset of user records. Five agents must coordinate to diagnose, recover, and communicate — under time pressure, with a 30-minute P0 SLA, and with the deployment window closing.

**Agents:**
- **Detection** (Claude): monitors error rates, creates initial alert, opens the case
- **Diagnosis** (Gemini): investigates root cause — needs DB access it may not have
- **Rollback** (Claude, different instance): attempts to reverse the deployment
- **Recovery** (GPT-4): recovers corrupted records from backup if rollback fails
- **Communication** (Claude, third instance): notifies affected users — cannot act until scope is known

**Supervisor** (Claude, fourth instance): the incident commander. Holds the overarching obligation and must know at any point what is being worked, what has failed, what is stalled, and who is accountable.

### Why each speech act type is exercised

| Speech act | Who sends it | Why it matters here |
|------------|-------------|---------------------|
| **QUERY** | Diagnosis → all agents | "Who has production DB read access?" — must ask before committing |
| **RESPONSE** | DBA-capable agent replies | Confirms capability before obligation is created |
| **COMMAND** | Supervisor → Diagnosis | Formal obligation created: you are accountable for diagnosing this |
| **ACKNOWLEDGED** | Diagnosis → Supervisor | Confirms active acceptance — not just queued |
| **STATUS** | Rollback → all | "Migration is partially reversible, attempting schema rollback" — 30-second updates |
| **DECLINE** | Communication → Supervisor | Cannot notify users yet — scope still unknown. Deliberately chosen, not a failure |
| **FAILURE** | Rollback → Supervisor | Attempted rollback; the schema can be reverted but the data change cannot. Tried and could not complete, which is distinct from DECLINE |
| **HANDOFF** | Rollback → Recovery | Formal transfer of obligation to Recovery agent with full context |
| **DONE** | Recovery → Supervisor | Records recovered, integrity verified |
| **EVENT** | All agents | Telemetry: tool call durations, DB query times, records checked |

### Why the FAILURE/DECLINE distinction is the critical test

When Rollback fails and Communication declines, both look identical without a normative layer — the system sees two agents that did not complete their work. The correct response is completely different:

- **FAILURE** (Rollback): the rollback strategy is exhausted. Escalate to Recovery immediately. Do not retry.
- **DECLINE** (Communication): the agent is fine and waiting. When Recovery completes and scope is known, re-issue the COMMAND.

During a live P0, getting this wrong costs minutes, and those minutes are measured in affected users and lost revenue.
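
The two responses above can be sketched as a dispatch on the recorded outcome. This is a minimal illustration only: `Outcome`, `SupervisorAction`, and `OutcomeRouter` are hypothetical names and are not part of the Qhorus API.

```java
/** Outcomes a supervisor must tell apart (illustrative names, not the Qhorus API). */
enum Outcome { FAILED, DECLINED }

/** What the supervisor does next (hypothetical action names). */
enum SupervisorAction { ESCALATE_TO_RECOVERY, REISSUE_COMMAND_WHEN_SCOPE_KNOWN }

final class OutcomeRouter {
    private OutcomeRouter() {}

    /** Maps a recorded outcome to the response described in the text. */
    static SupervisorAction nextAction(Outcome outcome) {
        switch (outcome) {
            case FAILED:
                // The strategy is exhausted: escalate, do not retry.
                return SupervisorAction.ESCALATE_TO_RECOVERY;
            case DECLINED:
                // The agent is healthy and waiting: re-issue once the blocker clears.
                return SupervisorAction.REISSUE_COMMAND_WHEN_SCOPE_KNOWN;
            default:
                throw new IllegalArgumentException("Unhandled outcome: " + outcome);
        }
    }
}
```

Without a typed outcome there is nothing to switch on, so the supervisor has to infer the branch from prose.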

---

## Variant A — LangChain4j (no normative layer)

### Setup

Five LangChain4j `@AiService` agents. Communication via natural language and free-form tool calls. No protocol enforced. Each agent has a `reportStatus(message)` tool and a `requestHelp(agentId, reason, context)` tool.
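
A rough sketch of the two tools, written as plain Java so it runs without any framework. The `ToolCall` record and `FreeFormTools` class are assumptions for illustration; in an actual LangChain4j setup the methods would be exposed to the model as tools rather than called directly.

```java
import java.util.List;

/** One logged tool call; the field layout is an assumption for illustration. */
record ToolCall(String agentId, String tool, String payload) {}

/** Free-form coordination tools for Variant A. Nothing constrains what the text means. */
final class FreeFormTools {
    private final String agentId;
    private final List<ToolCall> transcript;

    FreeFormTools(String agentId, List<ToolCall> transcript) {
        this.agentId = agentId;
        this.transcript = transcript;
    }

    /** Records any progress, refusal, or failure as the same kind of entry. */
    String reportStatus(String message) {
        transcript.add(new ToolCall(agentId, "reportStatus", message));
        return "status recorded";
    }

    /** Asks another agent for help; the reason and context are free text. */
    String requestHelp(String targetAgentId, String reason, String context) {
        transcript.add(new ToolCall(agentId, "requestHelp",
                targetAgentId + " | " + reason + " | " + context));
        return "help requested from " + targetAgentId;
    }
}
```

Note that a failed rollback and a deliberate decline both arrive as `reportStatus` entries; only the wording of `message` separates them.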

### System prompts (neutral — no CaseHub concepts)

**Rollback agent:**
```
You are a deployment rollback specialist. When asked to reverse a deployment:
- Attempt the rollback
- Report your progress and outcome
- If you cannot complete the rollback, explain why and suggest next steps
```

**Communication agent:**
```
You are responsible for notifying affected users.
- Only send notifications when you know the scope of the incident
- If you cannot act yet, say so and wait
```

**Supervisor agent (measurement instrument):**
```
You are observing a production incident response. Five agents are coordinating.
You will receive a transcript of their communications.

Answer these questions from the transcript alone:
1. Did the Rollback agent complete the rollback, deliberately choose not to, or try and fail?
2. Did the Communication agent refuse to act, or fail to act? What is the difference?
3. Who is currently accountable for recovering the corrupted records?
4. At what point did the 30-minute P0 SLA breach? Who held an open obligation at that moment?
5. If this incident is reviewed in 6 months, what does the record tell you about each agent's conduct?
6. Which agent's future assignments should be reviewed based on this incident?
```

**What to record:**
- Raw tool call transcripts from all five agents
- All natural language messages exchanged
- Supervisor answers to the 6 questions above
- Run 5 times with identical inputs. Record variance across runs.

---

## Variant B — CaseHub + Qhorus (normative layer enforced)

### Setup

The same five LLM agents interacting exclusively via Qhorus MCP tools. Speech act types are enforced by the protocol — an agent cannot express a failure as free text; it must call `fail_commitment(commitmentId, reason)`. The commitment store records the full obligation state of all five agents at all times.
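
To illustrate what protocol-level enforcement means, here is a minimal sketch of a message type check that rejects free text. The enum lists the speech act types from the table in this issue; the `parse` helper is hypothetical and not taken from Qhorus.

```java
import java.util.Locale;

/** The speech act types used in this experiment. */
enum SpeechAct {
    QUERY, RESPONSE, COMMAND, ACKNOWLEDGED, STATUS, DECLINE, FAILURE, HANDOFF, DONE, EVENT;

    /** Parses a declared type, rejecting anything that is not a known speech act. */
    static SpeechAct parse(String raw) {
        if (raw == null || raw.isBlank()) {
            throw new IllegalArgumentException("Speech act type is required");
        }
        try {
            return SpeechAct.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // Free text such as "I could not finish" ends up here.
            throw new IllegalArgumentException("Not a speech act type: " + raw, e);
        }
    }
}
```

For example, `SpeechAct.parse("failure")` yields `FAILURE`, while a sentence describing the failure is rejected before it reaches the channel.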

### Qhorus channel setup

```
Channel: case-p0-db-corruption-2026042801/coordination
Agents registered: detection, diagnosis, rollback, recovery, communication
Case opened by: casehub-assisteddev (Detection agent triggered)
```

### Initial COMMAND chain
```
COMMAND: casehub-assisteddev → diagnosis
  "Investigate database corruption. Scope and root cause required within 15 minutes."
  correlationId: "p0-diagnosis"
  → OPEN commitment on diagnosis agent

COMMAND: casehub-assisteddev → rollback
  "Attempt deployment rollback while diagnosis proceeds in parallel."
  correlationId: "p0-rollback"
  → OPEN commitment on rollback agent

COMMAND: casehub-assisteddev → communication
  "Prepare user notification. Do not send until scope confirmed."
  correlationId: "p0-comms"
  → OPEN commitment on communication agent
```
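
The effect of the COMMAND chain above can be sketched as a small in-memory store in which each COMMAND opens one commitment keyed by its `correlationId`. The `CommitmentStore`, `Commitment`, and `State` types are assumptions for illustration and do not reflect the actual Qhorus data model; the state names are the ones used elsewhere in this issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Commitment states named in this issue. */
enum State { OPEN, FULFILLED, DECLINED, FAILED, EXPIRED }

/** One obligation: who asked, who owes, and what for (layout is assumed). */
record Commitment(String correlationId, String requester, String obligor,
                  String instruction, State state) {}

final class CommitmentStore {
    private final Map<String, Commitment> byCorrelationId = new LinkedHashMap<>();

    /** A COMMAND creates an OPEN commitment on the receiving agent. */
    Commitment command(String requester, String obligor, String instruction, String correlationId) {
        if (byCorrelationId.containsKey(correlationId)) {
            throw new IllegalArgumentException("Duplicate correlationId: " + correlationId);
        }
        Commitment c = new Commitment(correlationId, requester, obligor, instruction, State.OPEN);
        byCorrelationId.put(correlationId, c);
        return c;
    }

    /** Returns the commitment for a correlationId, or null if none exists. */
    Commitment get(String correlationId) {
        return byCorrelationId.get(correlationId);
    }

    int size() {
        return byCorrelationId.size();
    }
}
```

After the three COMMANDs the store holds three OPEN commitments, so accountability is explicit before any agent has replied.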

### Agent system prompts (MCP-aware)

**All agents:**
```
You are connected to a Qhorus coordination channel for a P0 production incident.
Use the typed MCP tools for all coordination:

- acknowledge_commitment: when you begin active work
- send_message(type=STATUS): every 30 seconds during active work
- send_message(type=QUERY): before taking any action requiring capabilities you may not have
- fulfill_commitment + send_message(type=DONE): when your obligation is complete
- decline_commitment: if you deliberately cannot act (scope unknown, access denied by policy)
- fail_commitment: if you attempted work and could not complete it (technical failure)
- send_message(type=HANDOFF, target=X): if transferring your obligation to another agent

Do not use free text to express coordination decisions. Use the typed tools.
```

### What the supervisor reads (the ledger, not the chat)

```
list_ledger_entries(channelName: "case-p0-db-corruption-2026042801/coordination")
get_obligation_chain(correlationId: "p0-rollback")
list_stalled_obligations(channelName: "...")
```
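
Conceptually, an obligation chain is the ordered slice of the ledger that shares one `correlationId`. The sketch below shows that idea over an in-memory list; `LedgerEntry` and `LedgerQueries` are hypothetical, and the real `get_obligation_chain` tool may return a different shape.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

/** A ledger row; the field layout is assumed for illustration. */
record LedgerEntry(int sequence, String correlationId, String actor, String speechAct) {}

final class LedgerQueries {
    private LedgerQueries() {}

    /** All entries for one obligation, in the order they were recorded. */
    static List<LedgerEntry> obligationChain(List<LedgerEntry> ledger, String correlationId) {
        return ledger.stream()
                .filter(e -> e.correlationId().equals(correlationId))
                .sorted(Comparator.comparingInt(LedgerEntry::sequence))
                .collect(Collectors.toList());
    }

    /** The most recent speech act recorded for an obligation, if any. */
    static Optional<String> lastAct(List<LedgerEntry> ledger, String correlationId) {
        List<LedgerEntry> chain = obligationChain(ledger, correlationId);
        return chain.isEmpty()
                ? Optional.empty()
                : Optional.of(chain.get(chain.size() - 1).speechAct());
    }
}
```

The supervisor's questions then become lookups over typed entries rather than interpretation of a chat log.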

**Supervisor system prompt:**
```
You are reviewing a production incident response. You have access to the Qhorus
commitment ledger for the incident — not the chat transcript.

Answer these questions from the ledger alone:
1. Did the Rollback agent complete the rollback, deliberately choose not to, or try and fail?
2. Did the Communication agent refuse to act, or fail to act? What is the difference?
3. Who is currently accountable for recovering the corrupted records?
4. At what point did the 30-minute P0 SLA breach? Who held an open obligation at that moment?
5. If this incident is reviewed in 6 months, what does the record tell you about each agent's conduct?
6. Which agent's future assignments should be reviewed based on this incident?
```

---

## Measurement Criteria

### Primary metrics

| Metric | Variant A (LangChain4j) | Variant B (CaseHub) | How measured |
|--------|------------------------|---------------------|--------------|
| **FAILURE vs DECLINE distinction** | Can the supervisor distinguish Rollback's failure from Communication's decline? | Always — FAILED vs DECLINED are distinct commitment states | Supervisor answer to Q1+Q2 |
| **Accountability clarity** | Who holds the recovery obligation? Consistent across 5 runs? | Always — HANDOFF creates explicit new commitment | Supervisor answer to Q3 |
| **SLA breach moment** | Can the supervisor identify precisely when and who? | Always — EXPIRED state + timestamp | Supervisor answer to Q4 |
| **Supervisor consistency across runs** | % of 5 runs where supervisor gives identical answers to all 6 questions | Expected: 100% — ledger is identical | Count differing answers |
| **Trust signal quality** | What signal exists for future routing decisions? | FAILED rollback → review rollback agent's reliability; FULFILLED recovery → reinforce | Inspect what trust model receives |
| **Post-incident forensics** | What can a 6-month-later reviewer reconstruct? | Full obligation chain, what each agent knew, when they acted | Supervisor answer to Q5 |
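
One possible way to compute the consistency metric is sketched below. It assumes each supervisor answer has already been coded into a categorical label (for example `FAILED` or `DECLINED`), and it reads "identical answers" as agreement with the most common answer vector across runs. Both are assumptions of this sketch, and `RunConsistency` is a hypothetical helper.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RunConsistency {
    private RunConsistency() {}

    /** True if every run gave the same coded answer to the question (0-based index). */
    static boolean questionConsistent(List<List<String>> runs, int question) {
        Set<String> distinct = new HashSet<>();
        for (List<String> run : runs) {
            distinct.add(run.get(question));
        }
        return distinct.size() == 1;
    }

    /** Percentage of runs whose full answer vector equals the most common one. */
    static double modalAgreementPercent(List<List<String>> runs) {
        if (runs.isEmpty()) {
            throw new IllegalArgumentException("No runs");
        }
        Map<List<String>, Integer> counts = new HashMap<>();
        int best = 0;
        for (List<String> run : runs) {
            int seen = counts.merge(run, 1, Integer::sum);
            best = Math.max(best, seen);
        }
        return 100.0 * best / runs.size();
    }
}
```

Coding free-text answers into labels is itself a judgment step for Variant A, so the coding scheme should be fixed before the runs start.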

### Secondary metrics
- Supervisor confidence (hedged vs definitive answers)
- Time for supervisor to answer (longer = record is less clear)
- Whether supervisor's Q6 recommendation (who to review) is correct and consistent

---

## The Fixed Incident (identical across all 10 runs, 5 per variant)

**Trigger:** Error rate spikes to 340% at 14:03:22. Detection agent observes.

**The corruption:** A migration that adds a `user_preferences` JSONB column runs an UPDATE statement with a flawed WHERE clause, corrupting preferences for users created between 2024-01-01 and 2024-06-30 (~180,000 records).

**The rollback constraint:** The migration added the column (reversible) but the UPDATE statement modified data (not reversible via schema rollback alone). The schema can be rolled back; the data cannot.

**The recovery path:** A daily backup from 02:00 contains clean records. Recovery must extract the affected rows, validate them, and re-apply. Estimated time: 45 minutes.

**The SLA:** P0 SLA requires initial diagnosis within 15 minutes and user notification within 30 minutes of incident open. The experiment runs for 60 simulated minutes.
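
The SLA deadlines follow from simple clock arithmetic. The sketch below assumes the incident is opened at the trigger time of 14:03:22, which this issue does not state explicitly; `SlaDeadlines` is a hypothetical helper.

```java
import java.time.Duration;
import java.time.LocalTime;

final class SlaDeadlines {
    private SlaDeadlines() {}

    // Assumption: the case opens at the moment the error spike is observed.
    static final LocalTime INCIDENT_OPEN = LocalTime.of(14, 3, 22);
    static final Duration DIAGNOSIS_SLA = Duration.ofMinutes(15);
    static final Duration NOTIFICATION_SLA = Duration.ofMinutes(30);

    /** Wall-clock deadline for an SLA measured from incident open. */
    static LocalTime deadline(Duration sla) {
        return INCIDENT_OPEN.plus(sla);
    }

    /** True if the work finished strictly after the deadline. */
    static boolean breached(Duration sla, LocalTime completedAt) {
        return completedAt.isAfter(deadline(sla));
    }
}
```

Under that assumption the diagnosis deadline is 14:18:22 and the notification deadline is 14:33:22.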

**What should happen:**
1. Diagnosis identifies root cause within 15 minutes ✓
2. Rollback attempts schema reversal, discovers data is not recoverable via rollback → FAILED at ~20 min
3. Rollback HANDOFFs to Recovery
4. Communication DECLINES initial notification command (scope not yet confirmed)
5. At 30 minutes: notification SLA breaches. Communication commitment → EXPIRED
6. Recovery completes at ~45 minutes → FULFILLED
7. Communication re-receives COMMAND with confirmed scope → FULFILLED

---

## Expected Results

| Question | Variant A expected | Variant B expected |
|----------|-------------------|--------------------|
| Q1: Rollback failure vs decline | Variable across runs — natural language varies | Consistent — FAILED state, always |
| Q2: Communication decline vs failure | Often confused — both look like "didn't do it" | Consistent — DECLINED state, always |
| Q3: Recovery accountability | Often unclear — who owns it after handoff? | Consistent — HANDOFF created explicit new commitment |
| Q4: SLA breach moment | Often missed or incorrect | Consistent — EXPIRED timestamp is precise |
| Q5: 6-month forensics | "Rollback agent said it couldn't do it" | Full obligation chain with named parties, timestamps, what each agent knew |
| Q6: Who to review | Inconsistent — may blame wrong agent | Consistent — Rollback FAILED (investigate), Communication DECLINED correctly (no review needed) |

**The key result:** Variant A will conflate Rollback's FAILURE with Communication's DECLINE — because without formal semantics, both are just "didn't complete." In a post-incident review, this means the wrong agent gets blamed. In a live incident, this means the wrong recovery action is taken.

---

## Implementation Notes

### LangChain4j setup (Variant A)
- `langchain4j-anthropic` (Claude agents), `langchain4j-open-ai` (GPT-4 Recovery agent), plus a Gemini integration for the Diagnosis agent (for example `langchain4j-google-ai-gemini`)
- Simple `IncidentOrchestrator` that routes messages between agents
- All tool calls and message content logged to structured JSON for supervisor to read
- Version: LangChain4j 0.38+

### CaseHub setup (Variant B)
- Qhorus with H2 in-memory (test isolation, deterministic replay)
- Five agent sessions connected via MCP (`quarkus-qhorus` MCP endpoint)
- `quarkus-qhorus-testing` InMemory stores for fast reset between runs
- Commitment ledger captured via `list_ledger_entries` after each run
- Obligation chain via `get_obligation_chain(correlationId)` for supervisor

### Timing simulation
- Real LLM calls will not take 45 minutes, so compress time by having agents report simulated elapsed time in their STATUS messages and DONE content
- The important thing is the sequence and speech act types, not wall-clock duration
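
A minimal sketch of the time compression described above: the harness keeps a simulated clock and advances it by whatever duration an agent reports, with no real waiting. `SimulatedClock` is a hypothetical helper, not part of either framework.

```java
import java.time.Duration;

final class SimulatedClock {
    private Duration elapsed = Duration.ZERO;

    /** Advances simulated time by the duration an agent reports, without sleeping. */
    Duration advance(Duration reported) {
        if (reported.isNegative()) {
            throw new IllegalArgumentException("Time cannot go backwards");
        }
        elapsed = elapsed.plus(reported);
        return elapsed;
    }

    /** Simulated time since the incident opened. */
    Duration elapsed() {
        return elapsed;
    }

    /** True once simulated time has moved strictly past the given deadline. */
    boolean past(Duration deadline) {
        return elapsed.compareTo(deadline) > 0;
    }
}
```

Both variants would need to share the same clock rules so that the SLA breach falls at the same simulated moment in every run.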

---

## Success Criteria

The hypothesis is confirmed if:
1. Variant B supervisor answers are consistent across all 5 runs on all 6 questions
2. Variant A supervisor answers vary on at least one of Q1, Q2, or Q4 across the 5 runs
3. Variant A cannot distinguish FAILURE from DECLINE; Variant B always can
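
The three criteria can be written down as a single decision rule. This sketch reads criterion 2 as "at least one of Q1, Q2, Q4 varies" and takes the FAILURE/DECLINE judgments as inputs scored by the experimenter; `HypothesisCheck` and its parameter names are hypothetical.

```java
import java.util.Set;

final class HypothesisCheck {
    private HypothesisCheck() {}

    /**
     * @param variantBVarying       question numbers (1 to 6) on which Variant B answers differed across runs
     * @param variantAVarying       question numbers on which Variant A answers differed across runs
     * @param variantADistinguishes whether Variant A separated FAILURE from DECLINE
     * @param variantBDistinguishes whether Variant B separated FAILURE from DECLINE in every run
     */
    static boolean confirmed(Set<Integer> variantBVarying, Set<Integer> variantAVarying,
                             boolean variantADistinguishes, boolean variantBDistinguishes) {
        // Criterion 1: Variant B is consistent on all six questions.
        boolean criterion1 = variantBVarying.isEmpty();
        // Criterion 2: Variant A varies on at least one of Q1, Q2, Q4.
        boolean criterion2 = variantAVarying.contains(1)
                || variantAVarying.contains(2)
                || variantAVarying.contains(4);
        // Criterion 3: only Variant B distinguishes FAILURE from DECLINE.
        boolean criterion3 = !variantADistinguishes && variantBDistinguishes;
        return criterion1 && criterion2 && criterion3;
    }
}
```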

A strong result would additionally show that Variant B supervisor — reading the ledger cold, with no prior CaseHub knowledge — produces the same accountability assessment as a domain expert reading the same ledger. This validates not just that the normative layer is consistent, but that it is interpretable by any LLM that understands the underlying formal semantics.

---

## Related issues

- [engine#101](https://github.com/casehubio/engine/issues/101) — LLM Supervisor Mode (this experiment validates the semantic foundation supervisor mode relies on)
- [engine#102](https://github.com/casehubio/engine/issues/102) — Use case patterns (empirical proof-of-concept for interoperability claims)
- [quarkus-qhorus#123](https://github.com/casehubio/quarkus-qhorus/issues/123) — commitment outcomes → LedgerAttestation (required for trust signal measurement in Variant B)
