
Barrier implementation improve reliability#149

Merged
anhthii merged 11 commits into master from barrier-implementation-improve-reliability
Mar 19, 2026

Conversation


@anhthii anhthii commented Mar 16, 2026

Fix: NATS "No Responders" & Node Unresponsiveness After Signing Failure

Every Change Explained

1. pkg/mpc/session.go — The core fixes

Added doneCh and doneOnce to session struct:

doneCh   chan struct{}
doneOnce sync.Once

Why: Before this, there was no way to tell a running Sign() or Reshare() goroutine to stop. When an error occurred, the error handler goroutine exited, but Sign() was stuck forever in select { case <-outCh: ... case <-endCh: ... } — neither channel would ever receive again after a failure. doneCh gives us a broadcast signal: closing it wakes up every goroutine selecting on it simultaneously.

Added sendErr() method:

func (s *session) sendErr(err error) {
    select {
    case s.ErrCh <- err:
    case <-s.doneCh:
    }
}

Why: The old code did s.ErrCh <- err directly. ErrCh was unbuffered, so this blocks until someone reads. The error handler goroutine was the only reader, and it exited after the first error. So when handleTssMessage tried to send a SECOND error (e.g., failing to reach peer B after already failing to reach peer C), it blocked forever. This blocked the Sign() goroutine (since handleTssMessage is called from Sign's select case), which meant the goroutine leaked permanently with its NATS subscriptions still active. sendErr uses a select so that if the session is stopped (doneCh closed), the send is abandoned instead of blocking.

Replaced every s.ErrCh <- ... with s.sendErr(...):
Why: Every error producer in the session had the same blocking problem. handleTssMessage, receiveP2PTssMessage, receiveBroadcastTssMessage, receiveTssMessage, subscribeFromPeersAsync, subscribeBroadcastAsync — all of them. Any one of them could deadlock the session.

Added Stop() and Done() methods:

func (s *session) Stop() {
    s.doneOnce.Do(func() { close(s.doneCh) })
}
func (s *session) Done() <-chan struct{} {
    return s.doneCh
}

Why: Stop() is the trigger to terminate a session. sync.Once ensures closing doneCh twice doesn't panic. Multiple callers can safely call Stop() — the error handler, the Sign method on success, or Close.

Added WaitForPeersReady() — the readiness barrier:

func (s *session) WaitForPeersReady() error {
    // 1. Subscribe to our own barrier topic
    // 2. For each peer, send NATS request to their barrier topic
    // 3. If peer responds, they're subscribed. If not, retry until timeout.
}

Why: This is the fix for the "no responders" bug. The old code used time.Sleep(200ms) hoping all peers would finish subscribing in time. This is fundamentally unreliable because:

  • Network latency varies
  • Under load, goroutine scheduling delays increase
  • ECDSA session init is faster than EDDSA, hitting the race more often
  • Concurrent sessions (reshare ECDSA + EDDSA) make timing even worse

The barrier works by verification, not assumption. Each node subscribes to a barrier topic (proving its direct subscriptions are also active — they're set up before the barrier). Then each node pings every peer's barrier topic using nats.Request(). If the peer hasn't subscribed yet, "no responders" is returned and we retry. Only when ALL peers respond do we proceed. This is deterministic — no timing assumptions.

Fixed Close() with nil checks:

if s.broadcastSub != nil {
    s.broadcastSub.Unsubscribe()
}

Why: subscribeBroadcastAsync() runs in a goroutine. If Close() was called before that goroutine completed (e.g., due to an early error), s.broadcastSub was nil and Unsubscribe() would panic, crashing the node. The old Close() also returned early on the first unsubscribe error, leaking remaining subscriptions. Now it logs errors but continues cleaning up all subscriptions.

Added barrierSub field:
Why: The barrier subscription needs to be cleaned up in Close() like all other subscriptions, otherwise it leaks.


2. pkg/mpc/ecdsa_signing_session.go

Buffered ErrCh:

ErrCh: make(chan error, 1),

Why: With buffer size 1, the first error send never blocks regardless of whether the error handler has started reading yet. This prevents a subtle race where party.Start() errors before the error handler goroutine is scheduled.

Initialized doneCh:

doneCh: make(chan struct{}),

Why: Without initialization, doneCh is nil. A select on a nil channel blocks forever, so sendErr and the case <-s.doneCh in Sign() would never work.

Added case <-s.doneCh to Sign():

select {
case <-s.doneCh:
    logger.Info("ECDSA signing session stopped", ...)
    return
case msg := <-s.outCh:
    ...
case sig := <-s.endCh:
    ...
}

Why: This is the goroutine leak fix. Before, Sign() only selected on outCh and endCh. After an error killed the TSS protocol, neither channel would ever produce again. Sign() was stuck forever, holding NATS subscriptions open, consuming memory, and (critically) holding a reference to the session that could never be garbage collected. With doneCh, when the error handler calls session.Stop(), doneCh closes, Sign() returns, and the goroutine is freed.

Changed s.ErrCh <- err to s.sendErr(err) in party.Start and verify:
Why: Same deadlock prevention as in session.go. If Sign() already exited (session stopped), writing to ErrCh would block forever.


3. pkg/mpc/eddsa_signing_session.go

Exact same changes as ECDSA. Both session types had identical bugs.


4. pkg/mpc/ecdsa_keygen_session.go & pkg/mpc/eddsa_keygen_session.go

Buffered ErrCh + doneCh:
Why: Keygen sessions embed session and use the same sendErr and Close methods. Without doneCh initialized, any sendErr call in the keygen path would block on the nil channel select. Without the buffered ErrCh, keygen errors could deadlock too (same pattern, just less likely because keygen is less error-prone than signing).

Added WaitForPeersReady() to KeyGenSession interface:
Why: The event consumer calls WaitForPeersReady() on keygen sessions before starting GenerateKey(). The method is implemented on the embedded session struct, but Go interfaces need it declared.


5. pkg/mpc/ecdsa_resharing_session.go & pkg/mpc/eddsa_resharing_session.go

Same buffered ErrCh + doneCh + interface additions:
Why: The reshare bug reported by treeforest was identical to the signing bug. The reshare protocol sends direct messages between peers, and if a peer hasn't subscribed yet (warmup race), "no responders" errors cascade. After the error, the Reshare goroutine leaks, and the node stops processing future reshare events. The WaitForPeersReady() and Stop() additions to the ReshareSession interface let the event consumer use the barrier and clean shutdown.


6. pkg/eventconsumer/event_consumer.go

Replaced warmUpSession() with WaitForPeersReady() in signing:

// Old:
ec.warmUpSession()  // time.Sleep(200ms)

// New:
if err := session.WaitForPeersReady(); err != nil {
    ec.handleSigningSessionError(...)
    return
}

Why: The 200ms sleep was the root cause of the "no responders" errors. It assumed all peers would finish subscribing within 200ms. Under any load, network latency, or scheduling pressure, this fails. The barrier verifies instead of assuming.

Same replacement in keygen and reshare paths:
Why: All three MPC operations (keygen, sign, reshare) had the same warmup race condition. The reshare concurrent ECDSA+EDDSA failure reported by treeforest was this exact bug.

Error handler now calls session.Stop() and done():

case err := <-session.ErrChan():
    ec.handleSigningSessionError(...)
    session.Stop()  // closes doneCh → Sign() exits
    done()          // cancels context → error goroutine exits
    return

Why: Before, the error handler just returned without stopping anything. Sign() kept running (stuck forever). The context was never canceled. Now: session.Stop() wakes up Sign() via doneCh, and done() cancels the context so any future select on ctx.Done() also exits. Clean teardown of the entire session on error.


7. pkg/mpc/registry.go — Health check resilience

Added consecutive failure counter:

failureCount := make(map[string]int)
const maxFailures = 5

// On failure:
failureCount[peerID]++
if failureCount[peerID] >= maxFailures {
    // evict from Consul
}
// On success:
delete(failureCount, peerID)

Why: The old health check deleted a peer's Consul key after just 2 failed NATS requests (~6 seconds). A single network blip would evict a peer, trigger ECDH re-exchange, and cascade into "not ready" state across the cluster. With 5 consecutive failures required (~25 seconds of continuous unresponsiveness), transient issues are tolerated while genuinely dead peers are still detected.

Increased retry attempts from 2 to 3 with 1s delay:
Why: Gives the NATS connection more time to recover from brief interruptions before counting a failure.


8. cmd/mpcium/main.go — NATS connection hardening

TCP keepalive via custom dialer:

nats.Dialer(&net.Dialer{
    Timeout:   5 * time.Second,
    KeepAlive: 30 * time.Second,
}),

Why: AWS NAT Gateway/NLB silently drops TCP connections idle for ~350 seconds. TCP keepalive probes count as wire activity, preventing the idle timeout from triggering. Without this, the NATS connection could be silently killed, and neither the client nor server would know until the next message attempt.

NATS application-level pings:

nats.PingInterval(20 * time.Second),
nats.MaxPingsOutstanding(3),

Why: Even with TCP keepalive, the NATS client needs to detect dead connections at the application level. Default is 2-minute ping interval. With 20s pings and 3 outstanding max, a dead connection is detected within 60 seconds — well before the AWS 350s timeout.

Reconnect buffer:

nats.ReconnectBufSize(16 * 1024 * 1024),

Why: During NATS reconnection, published messages are buffered. The default buffer is small. If a signing operation produces messages during a brief reconnect, a larger buffer prevents message loss.

DisconnectErrHandler replaces DisconnectHandler:
Why: The old handler didn't capture the disconnect reason. The new one logs the error, which is critical for diagnosing whether disconnects are caused by network issues, server shutdown, or idle timeouts.

anhthii added 7 commits March 15, 2026 22:16
Add custom dialer with TCP keepalive (30s), NATS ping interval (20s/3
outstanding), 16MB reconnect buffer, and custom inbox prefix to prevent
AWS NAT Gateway idle timeout from silently killing connections.
Require 5 consecutive health check failures (~25s) before removing a
peer from Consul, preventing transient NATS reconnections from
cascading into unnecessary peer eviction and ECDH re-exchange.
…fecycle improvements

- Route MPC results to specific clients via clientID in topic subjects,
  replacing shared consumers with per-client ephemeral consumers
- Replace warmUpSession() sleep with WaitForPeersReady() barrier that
  verifies all peers have active subscriptions before starting protocol
- Add graceful session shutdown with doneCh/sendErr to prevent goroutine
  leaks and blocking on error channels
- Add JetStream MaxAckPending for backpressure and InProgress() to
  prevent premature redelivery during long MPC operations
- Use atomic tryAddSession/removeSession for duplicate session detection
…bytes to 100MB

- Remove per-broker WithMaxAge overrides in favor of the new default
- Add WithMaxBytes broker option for configurable stream size limits
- Increase message queue max bytes from 10MB to 100MB
The warmUpSession sleep was replaced by the proper WaitForPeersReady
barrier handshake but the old code and config were never cleaned up.
@socket-security

socket-security bot commented Mar 16, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Added github.com/nats-io/nats.go@v1.49.0 (Supply Chain Security 74, Vulnerability 100, Quality 100, Maintenance 100, License 100)

@anhthii anhthii force-pushed the barrier-implementation-improve-reliability branch from 072630f to b78706b on March 18, 2026 16:53
@anhthii anhthii force-pushed the barrier-implementation-improve-reliability branch from ce4d8b6 to 7fd26b9 on March 18, 2026 17:07
@anhthii anhthii changed the title [Draft] Barrier implementation improve reliability Barrier implementation improve reliability Mar 19, 2026
@anhthii anhthii merged commit bfe405b into master Mar 19, 2026
26 checks passed
@anhthii anhthii deleted the barrier-implementation-improve-reliability branch March 19, 2026 01:36