Skip to content

[Safety Bug] State Divergence Due to Unprotected EXECUTED Instances in handleCommit #27

@SherlockShemol

Description

@SherlockShemol

Summary

A safety violation can occur when a leader executes a command on the Fast Path before broadcasting the Commit message, then crashes. Recovery may assign different dependencies to the command, causing different replicas to execute commands in different orders, leading to state machine divergence.

Bug Type

Attribute Value
Type Safety Violation (State Divergence)
Severity 🔴 Critical
Threat Model CFT (Compliant)

Problematic Code

Problem 1: Race Condition Between Execution and Commit Broadcast

File: src/epaxos/epaxos.go, Lines 1050-1071

// handlePreAcceptReply function
if inst.lb.preAcceptOKs >= r.N/2 && inst.lb.allEqual && allCommitted && isInitialBallot(inst.ballot) {
    // ① Status set to COMMITTED first
    r.InstanceSpace[pareply.Replica][pareply.Instance].Status = epaxosproto.COMMITTED  // Line 1053
    r.updateCommitted(pareply.Replica)
    
    // ... reply to clients ...
    
    r.recordInstanceMetadata(inst)
    r.sync()
    
    // ② Commit message sent AFTER status change
    r.bcastCommit(pareply.Replica, pareply.Instance, inst.Cmds, inst.Seq, inst.Deps)  // Line 1071
}

The executeCommands goroutine (line 429) may see COMMITTED status and execute the command before bcastCommit is called.

Problem 2: Recovery Can Recalculate Dependencies

File: src/epaxos/epaxos.go, Lines 1518-1522

// handlePrepareReply function
if conf, q, i := r.findPreAcceptConflicts(ir.cmds, preply.Replica, preply.Instance, ir.seq, ir.deps); conf {
    if r.InstanceSpace[q][i].Status >= epaxosproto.COMMITTED {
        // Found committed conflict! Restart Phase1 with NEW deps
        r.startPhase1(preply.Replica, preply.Instance, inst.ballot, inst.lb.clientProposals, ir.cmds, len(ir.cmds))
        return
    }
}

Problem 3: startPhase1 Recalculates Dependencies

File: src/epaxos/epaxos.go, Lines 831-837

func (r *Replica) startPhase1(...) {
    seq := int32(0)
    var deps [DS]int32
    for q := 0; q < r.N; q++ {
        deps[q] = -1
    }
    
    // ⚠️ Recalculates deps based on CURRENT conflict map!
    seq, deps, _ = r.updateAttributes(cmds, seq, deps, replica, instance)  // Line 837
}

Problem 4: handleCommit Does Not Protect EXECUTED Instances

File: src/epaxos/epaxos.go, Lines 1279-1281

// handleCommit function
if inst != nil {
    // ⚠️ NO CHECK for EXECUTED status!
    inst.Seq = commit.Seq       // Overwrites Seq
    inst.Deps = commit.Deps     // Overwrites Deps
    inst.Status = epaxosproto.COMMITTED  // Downgrades from EXECUTED!
}

Trigger Conditions

All of the following must occur:

  1. Leader (R0) achieves Fast Path for command X with deps=[]
  2. executeCommands goroutine executes X before bcastCommit (race condition)
  3. R0 crashes before or during bcastCommit
  4. A conflicting command Y is proposed and committed on other replicas
  5. Recovery for X discovers Y (COMMITTED) via findPreAcceptConflicts
  6. startPhase1 recalculates X's deps to include Y: deps=[Y]
  7. X is committed with deps=[Y]
  8. R0 recovers and receives Commit(X, deps=[Y])
  9. handleCommit overwrites X's deps (X was already EXECUTED with deps=[])

Attack Scenario Timeline

T0: R0 proposes X (key=K, value=100)
    Fast Path: all responders return deps=[]

T1: R0 sets X.Status = COMMITTED (line 1053)

T2: executeCommands goroutine sees COMMITTED
    Executes X: state[K] = 100
    X.Status = EXECUTED

T3: R0 crashes (before bcastCommit at line 1071)
    Commit(X, deps=[]) message NOT sent

T4: R1 proposes Y (key=K, value=200)
    Y sees X as PREACCEPTED
    Y.deps = [X]
    Y commits and executes

T5: R2 initiates recovery for X
    TryPreAccept finds Y (COMMITTED)
    startPhase1 recalculates: deps=[Y]
    X committed with deps=[Y]

T6: R0 recovers
    Receives Commit(X, deps=[Y])
    handleCommit overwrites X.Deps = [Y]
    X.Status = COMMITTED (downgraded from EXECUTED!)

State Divergence Result

R0 (Early Executor):
  Execution order: X → Y
  Final state: K = 200

R1 (Normal Executor):
  X.deps = [Y], Y.deps = [X]
  Cycle detected: X ↔ Y
  Order by Seq: Y.Seq < X.Seq
  Execution order: Y → X
  Final state: K = 100

❌ SAFETY VIOLATION: State machines diverged!
   R0: K = 200
   R1: K = 100

Impact

  • Different replicas may execute the same commands in different orders
  • State machines diverge permanently
  • Violates the fundamental EPaxos guarantee of consistent execution order

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions