Pipelined 2PC with durability barrier#4728

Merged
aasoni merged 5 commits into jdetter/tpcc from tyler/2pc-pipelined-v2
Mar 30, 2026


@cloutiertyler
Contributor

Summary

  • Split 2PC into two consecutive rounds: Round 1 (memory commit, lock released) and Round 2 (persistence, no locks held)
  • Durability barrier prevents tainted transactions from reaching disk while 2PC is pending
  • PREPARE PERSIST commitlog entry contains only reducer inputs (st_2pc_state); COMMIT PERSIST contains actual row changes
  • Confirmed-read clients naturally wait for COMMIT PERSIST durability via the durable offset mechanism
  • TLA+ spec (in private repo) verified with 6 invariants including BarrierSafety, 60 distinct states
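
The confirmed-read bullet can be illustrated with a minimal sketch. It assumes the durable offset is the highest commitlog offset known to be fsynced; `visible_to_confirmed_read` and its parameters are hypothetical names for illustration, not the actual API.

```rust
/// Hypothetical helper: may a confirmed-read client observe the transaction
/// written at `tx_offset`? `durable_offset` is `None` until something is fsynced.
pub fn visible_to_confirmed_read(tx_offset: u64, durable_offset: Option<u64>) -> bool {
    match durable_offset {
        // Visible only once the durable offset has caught up with the transaction.
        Some(durable) => tx_offset <= durable,
        None => false,
    }
}
```

Under this assumption, while the barrier holds COMMIT PERSIST back, the durable offset cannot advance past PREPARE PERSIST, so a confirmed-read client keeps waiting without any 2PC-specific logic.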

TODOs

  • Trigger module restart on persistence abort to flush tainted in-memory state
  • Retry limit for send_prepared_to_persist_to_coordinator
  • Handle prepare_id mismatch in recovery (new vs original ID)

Test plan

  • Existing 2PC smoke tests (cross_db_2pc, cross_db_2pc_recovery)
  • TLA+ verification
  • Verify barrier prevents COMMIT PERSIST from reaching disk prematurely
  • Verify confirmed-read clients don't see changes until COMMIT PERSIST is durable

Split the 2PC protocol into two consecutive rounds:
  Round 1 (Memory): commit to in-memory datastore, release lock immediately.
  Round 2 (Persist): persist to disk without holding locks.

Key changes:

Participant (B):
- Signal PREPARED immediately (no disk I/O before responding)
- On COMMIT: flush st_2pc_state marker (reducer inputs only) as PREPARE
  PERSIST, set durability barrier, then commit reducer changes to memory.
  The barrier defers the reducer's TxData from reaching disk.
- After PREPARE PERSIST is durable: signal PREPARED_TO_PERSIST
- After COMMIT_PERSIST: clear barrier (reducer changes flush as COMMIT
  PERSIST), wait for durability, delete marker.
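
The participant steps above can be read as a small state machine. The following is only a sketch of that ordering: the state and event names are hypothetical, and the abort transitions are an assumption based on the coordinator being the only party that decides.

```rust
/// Hypothetical participant states, named after the steps listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ParticipantState {
    /// PREPARED was signalled; nothing has been written to disk.
    Prepared,
    /// Marker flushed as PREPARE PERSIST, barrier set, changes committed in memory.
    MemoryCommitted,
    /// PREPARE PERSIST is durable and PREPARED_TO_PERSIST was signalled.
    PreparedToPersist,
    /// Barrier cleared, COMMIT PERSIST durable, marker deleted.
    Persisted,
    Aborted,
}

/// Hypothetical events that drive the participant forward.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ParticipantEvent {
    Commit,
    PreparePersistDurable,
    CommitPersist,
    Abort,
}

/// Returns the next state, or `None` if the event is not valid in `state`.
pub fn step(state: ParticipantState, event: ParticipantEvent) -> Option<ParticipantState> {
    match (state, event) {
        (ParticipantState::Prepared, ParticipantEvent::Commit) => {
            Some(ParticipantState::MemoryCommitted)
        }
        (ParticipantState::MemoryCommitted, ParticipantEvent::PreparePersistDurable) => {
            Some(ParticipantState::PreparedToPersist)
        }
        (ParticipantState::PreparedToPersist, ParticipantEvent::CommitPersist) => {
            Some(ParticipantState::Persisted)
        }
        // Assumption: an abort is accepted at any point before COMMIT PERSIST is durable.
        (
            ParticipantState::Prepared
            | ParticipantState::MemoryCommitted
            | ParticipantState::PreparedToPersist,
            ParticipantEvent::Abort,
        ) => Some(ParticipantState::Aborted),
        _ => None,
    }
}
```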

Coordinator (A):
- Set durability barrier before commit (coordinator log deferred from disk)
- Send COMMIT without waiting for durability
- Wait for PREPARED_TO_PERSIST from all participants
- Clear barrier (coordinator log flushes), wait for durability
- Send COMMIT_PERSIST to participants
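
A minimal sketch of the "wait for PREPARED_TO_PERSIST from all participants" step, assuming participants are identified by a string. `PersistWaiters` and its methods are hypothetical names, not the types added in this PR.

```rust
use std::collections::HashSet;

/// Hypothetical coordinator-side tracker for Round 2 acknowledgements.
pub struct PersistWaiters {
    /// Participants that have not yet sent PREPARED_TO_PERSIST.
    outstanding: HashSet<String>,
}

impl PersistWaiters {
    pub fn new<I, S>(participants: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        Self {
            outstanding: participants.into_iter().map(Into::into).collect(),
        }
    }

    /// Records a PREPARED_TO_PERSIST from `participant`. Returns `true` once
    /// every participant has reported, i.e. when the coordinator may clear its
    /// barrier. Duplicate notifications are harmless.
    pub fn record(&mut self, participant: &str) -> bool {
        self.outstanding.remove(participant);
        self.outstanding.is_empty()
    }
}
```

Only after `record` returns `true` would the coordinator clear its barrier, wait for its own log to become durable, and send COMMIT_PERSIST.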

Durability barrier:
- Transactions at or below the barrier offset pass through to the durability
  worker; transactions above are accumulated in a pending list.
- Supports multiple concurrent barriers (BTreeSet of active offsets).
- clear_durability_barrier flushes pending up to the new minimum.
- abort_durability_barrier discards all pending (pipeline flush on abort).
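
The barrier rules above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `BarrierSketch`, `Tx`, `submit`, and the `flushed` vector (a stand-in for the durability worker) are hypothetical, and transactions are assumed to be submitted in offset order.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a transaction headed to the commitlog.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Tx {
    pub offset: u64,
}

#[derive(Debug, Default)]
pub struct BarrierSketch {
    /// Active barrier offsets; the smallest one gates the pipeline.
    active: BTreeSet<u64>,
    /// Transactions held back behind the barrier.
    pending: Vec<Tx>,
    /// Stand-in for the durability worker: everything handed off so far.
    pub flushed: Vec<Tx>,
}

impl BarrierSketch {
    fn min_barrier(&self) -> Option<u64> {
        self.active.iter().next().copied()
    }

    pub fn set_durability_barrier(&mut self, offset: u64) {
        self.active.insert(offset);
    }

    /// At or below the lowest barrier: pass through. Above it: defer.
    pub fn submit(&mut self, tx: Tx) {
        match self.min_barrier() {
            Some(min) if tx.offset > min => self.pending.push(tx),
            _ => self.flushed.push(tx),
        }
    }

    /// Removes one barrier and flushes pending up to the new minimum
    /// (everything, if no barrier remains).
    pub fn clear_durability_barrier(&mut self, offset: u64) {
        self.active.remove(&offset);
        let limit = self.min_barrier();
        let (ready, held): (Vec<Tx>, Vec<Tx>) = std::mem::take(&mut self.pending)
            .into_iter()
            .partition(|tx| limit.map_or(true, |min| tx.offset <= min));
        self.flushed.extend(ready);
        self.pending = held;
    }

    /// Removes one barrier and discards all pending transactions.
    pub fn abort_durability_barrier(&mut self, offset: u64) {
        self.active.remove(&offset);
        self.pending.clear();
    }

    pub fn pending_len(&self) -> usize {
        self.pending.len()
    }
}
```

With a barrier at offset N, the PREPARE PERSIST entry at N passes through while the reducer's TxData at N+1 waits in `pending` until the barrier is cleared or aborted.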

New HTTP endpoints:
- POST /2pc/prepared-to-persist/:prepare_id (B notifies A)
- POST /2pc/commit-persist/:prepare_id (A tells B to finalize)
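
For illustration, the two routes could be recognized as below. This is a standalone sketch using only string matching; `Round2Route` and `parse_round2_route` are hypothetical, and the real handlers may sit behind a longer path prefix and use the server's own router.

```rust
/// Hypothetical representation of the two Round 2 endpoints.
#[derive(Debug, PartialEq, Eq)]
pub enum Round2Route<'a> {
    /// B notifies A that PREPARE PERSIST is durable.
    PreparedToPersist { prepare_id: &'a str },
    /// A tells B to finalize.
    CommitPersist { prepare_id: &'a str },
}

/// Parses `POST /2pc/<kind>/<prepare_id>`; returns `None` for anything else.
pub fn parse_round2_route<'a>(method: &str, path: &'a str) -> Option<Round2Route<'a>> {
    if method != "POST" {
        return None;
    }
    let rest = path.strip_prefix("/2pc/")?;
    let (kind, prepare_id) = rest.split_once('/')?;
    // Reject a missing ID or extra path segments.
    if prepare_id.is_empty() || prepare_id.contains('/') {
        return None;
    }
    match kind {
        "prepared-to-persist" => Some(Round2Route::PreparedToPersist { prepare_id }),
        "commit-persist" => Some(Round2Route::CommitPersist { prepare_id }),
        _ => None,
    }
}
```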

PreparedTransactions moved from ModuleHost to ReplicaContext so both actor
code and HTTP handlers can access it. Added Round 2 channels
(commit_persist_sender) and coordinator-side persist waiters.

Recovery uses Shub's approach: re-run reducer from stored inputs in
st_2pc_state, query coordinator for decision.
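
A rough sketch of that recovery flow, with the reducer and the coordinator query passed in as closures so the example stays self-contained. The `StoredPrepare` fields are assumptions about what an st_2pc_state row holds, and `recover`, `Decision`, and `Recovered` are hypothetical names.

```rust
/// Hypothetical view of one st_2pc_state row: reducer inputs only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StoredPrepare {
    pub prepare_id: String,
    pub reducer: String,
    pub args: Vec<u8>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Commit,
    Abort,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Recovered<T> {
    /// The re-run changes are kept.
    Committed(T),
    /// The re-run changes are dropped.
    RolledBack,
}

/// Re-runs the reducer from its stored inputs, then asks the coordinator
/// whether the resulting changes should be kept.
pub fn recover<T>(
    marker: &StoredPrepare,
    rerun_reducer: impl FnOnce(&str, &[u8]) -> T,
    query_coordinator: impl FnOnce(&str) -> Decision,
) -> Recovered<T> {
    let pending_changes = rerun_reducer(&marker.reducer, &marker.args);
    match query_coordinator(&marker.prepare_id) {
        Decision::Commit => Recovered::Committed(pending_changes),
        Decision::Abort => Recovered::RolledBack,
    }
}
```

Because only the inputs are stored, this approach relies on the re-run producing the same changes as the original execution.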

TODOs:
- Trigger module restart on persistence abort to flush tainted in-memory state
- Retry limit for send_prepared_to_persist_to_coordinator
- Handle prepare_id mismatch in recovery (new vs original ID)

When Round 2 of pipelined 2PC aborts (on the participant or the coordinator),
abort_durability_barrier discards all deferred transactions, then a panic
triggers a module restart via the existing on_panic/defer_on_unwind
mechanism. On next access, the module is re-created from the commitlog,
which does not contain the tainted data.
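
The abort path can be illustrated with a toy model. It uses `std::panic::catch_unwind` as a stand-in for the on_panic/defer_on_unwind mechanism, whose details are not shown here; `ModuleSketch` and its fields are hypothetical, with plain offsets standing in for transactions.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical toy module: offsets stand in for transactions.
#[derive(Debug, PartialEq, Eq)]
pub struct ModuleSketch {
    /// Durable commitlog contents.
    pub committed_log: Vec<u64>,
    /// In-memory state; may contain tainted transactions.
    pub in_memory: Vec<u64>,
    /// Transactions deferred behind the durability barrier.
    pub deferred: Vec<u64>,
}

impl ModuleSketch {
    /// Discards everything deferred, then panics to force a restart.
    pub fn abort_round2(&mut self) {
        self.deferred.clear();
        panic!("2PC persistence aborted; module must restart");
    }

    /// Re-creates the module from the commitlog alone.
    pub fn rebuild_from_log(log: &[u64]) -> Self {
        Self {
            committed_log: log.to_vec(),
            in_memory: log.to_vec(),
            deferred: Vec::new(),
        }
    }
}

/// Runs the abort and, if it unwinds, replaces the module with a fresh
/// instance rebuilt from the durable log.
pub fn abort_and_restart(mut module: ModuleSketch) -> ModuleSketch {
    let outcome = panic::catch_unwind(AssertUnwindSafe(|| module.abort_round2()));
    match outcome {
        Ok(()) => module,
        Err(_) => ModuleSketch::rebuild_from_log(&module.committed_log),
    }
}
```

The tainted transaction never reached `committed_log`, so the rebuilt instance cannot see it.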
let auth_token = replica_ctx.call_reducer_auth_token.clone();
{
let handle = tokio::runtime::Handle::current();
block_on_scoped(&handle, send_prepared_to_persist_to_coordinator(
Contributor Author

Shouldn't be blocking here, because I'm going to lock up the whole thread.

// Step 4: wait for coordinator's decision (B never aborts on its own).
let commit = Self::wait_for_2pc_decision(decision_rx, &prepare_id, coordinator_identity, &replica_ctx);
// Step 10: wait for COMMIT_PERSIST from coordinator.
let persist_commit = Self::wait_for_commit_persist(
Contributor Author

Also should not be blocking.

// Without this, A could delete its coordinator log entry while B's commit
// is still in-memory — a B crash at that point would leave the tx uncommitted
// with no way to recover (A has already forgotten it committed).
// Step 12: wait for COMMIT PERSIST durability (offset N+1 fsynced).
Contributor Author

This also should not be blocking.

// ═══ WRITE LOCK RELEASED ═══════════════════════════════
// ── Round 2: Persistence Commit ────────────────────────

// Step 8: wait for PREPARE PERSIST durability (offset N fsynced).
Contributor Author

This should not be blocking.

@aasoni changed the base branch from shub/2pc-regular to jdetter/tpcc March 30, 2026 18:14
@aasoni merged commit 39ac0e2 into jdetter/tpcc Mar 30, 2026
25 of 27 checks passed
3 participants