
Barrier implementation improve reliability#149

Merged
anhthii merged 11 commits into master from barrier-implementation-improve-reliability
Mar 19, 2026

Conversation


@anhthii anhthii commented Mar 16, 2026

Fix: NATS "No Responders" & Node Unresponsiveness After Signing Failure

Every Change Explained

1. pkg/mpc/session.go — The core fixes

Added doneCh and doneOnce to session struct:

doneCh   chan struct{}
doneOnce sync.Once

Why: Before this, there was no way to tell a running Sign() or Reshare() goroutine to stop. When an error occurred, the error handler goroutine exited, but Sign() was stuck forever in select { case <-outCh: ... case <-endCh: ... } — neither channel would ever receive again after a failure. doneCh gives us a broadcast signal: closing it wakes up every goroutine selecting on it simultaneously.

Added sendErr() method:

func (s *session) sendErr(err error) {
    select {
    case s.ErrCh <- err:
    case <-s.doneCh:
    }
}

Why: The old code did s.ErrCh <- err directly. ErrCh was unbuffered, so this blocks until someone reads. The error handler goroutine was the only reader, and it exited after the first error. So when handleTssMessage tried to send a SECOND error (e.g., failing to reach peer B after already failing to reach peer C), it blocked forever. This blocked the Sign() goroutine (since handleTssMessage is called from Sign's select case), which meant the goroutine leaked permanently with its NATS subscriptions still active. sendErr uses a select so that if the session is stopped (doneCh closed), the send is abandoned instead of blocking.

Replaced every s.ErrCh <- ... with s.sendErr(...):
Why: Every error producer in the session had the same blocking problem. handleTssMessage, receiveP2PTssMessage, receiveBroadcastTssMessage, receiveTssMessage, subscribeFromPeersAsync, subscribeBroadcastAsync — all of them. Any one of them could deadlock the session.

Added Stop() and Done() methods:

func (s *session) Stop() {
    s.doneOnce.Do(func() { close(s.doneCh) })
}
func (s *session) Done() <-chan struct{} {
    return s.doneCh
}

Why: Stop() is the trigger to terminate a session. sync.Once ensures closing doneCh twice doesn't panic. Multiple callers can safely call Stop() — the error handler, the Sign method on success, or Close.

Added WaitForPeersReady() — the readiness barrier:

func (s *session) WaitForPeersReady() error {
    // 1. Subscribe to our own barrier topic
    // 2. For each peer, send NATS request to their barrier topic
    // 3. If peer responds, they're subscribed. If not, retry until timeout.
}

Why: This is the fix for the "no responders" bug. The old code used time.Sleep(200ms) hoping all peers would finish subscribing in time. This is fundamentally unreliable because:

  • Network latency varies
  • Under load, goroutine scheduling delays increase
  • ECDSA session init is faster than EDDSA, hitting the race more often
  • Concurrent sessions (reshare ECDSA + EDDSA) make timing even worse

The barrier works by verification, not assumption. Each node subscribes to a barrier topic (proving its direct subscriptions are also active — they're set up before the barrier). Then each node pings every peer's barrier topic using nats.Request(). If the peer hasn't subscribed yet, "no responders" is returned and we retry. Only when ALL peers respond do we proceed. This is deterministic — no timing assumptions.

Fixed Close() with nil checks:

if s.broadcastSub != nil {
    s.broadcastSub.Unsubscribe()
}

Why: subscribeBroadcastAsync() runs in a goroutine. If Close() was called before that goroutine completed (e.g., due to an early error), s.broadcastSub was nil and Unsubscribe() would panic, crashing the node. The old Close() also returned early on the first unsubscribe error, leaking remaining subscriptions. Now it logs errors but continues cleaning up all subscriptions.

Added barrierSub field:
Why: The barrier subscription needs to be cleaned up in Close() like all other subscriptions, otherwise it leaks.


2. pkg/mpc/ecdsa_signing_session.go

Buffered ErrCh:

ErrCh: make(chan error, 1),

Why: With buffer size 1, the first error send never blocks regardless of whether the error handler has started reading yet. This prevents a subtle race where party.Start() errors before the error handler goroutine is scheduled.

Initialized doneCh:

doneCh: make(chan struct{}),

Why: Without initialization, doneCh is nil. A select on a nil channel blocks forever, so sendErr and the case <-s.doneCh in Sign() would never work.

Added case <-s.doneCh to Sign():

select {
case <-s.doneCh:
    logger.Info("ECDSA signing session stopped", ...)
    return
case msg := <-s.outCh:
    ...
case sig := <-s.endCh:
    ...
}

Why: This is the goroutine leak fix. Before, Sign() only selected on outCh and endCh. After an error killed the TSS protocol, neither channel would ever produce again. Sign() was stuck forever, holding NATS subscriptions open, consuming memory, and (critically) holding a reference to the session that could never be garbage collected. With doneCh, when the error handler calls session.Stop(), doneCh closes, Sign() returns, and the goroutine is freed.

Changed s.ErrCh <- err to s.sendErr(err) in party.Start and verify:
Why: Same deadlock prevention as in session.go. If Sign() already exited (session stopped), writing to ErrCh would block forever.


3. pkg/mpc/eddsa_signing_session.go

Exact same changes as ECDSA. Both session types had identical bugs.


4. pkg/mpc/ecdsa_keygen_session.go & pkg/mpc/eddsa_keygen_session.go

Buffered ErrCh + doneCh:
Why: Keygen sessions embed session and use the same sendErr and Close methods. Without doneCh initialized, any sendErr call in the keygen path would block on the nil channel select. Without the buffered ErrCh, keygen errors could deadlock too (same pattern, just less likely because keygen is less error-prone than signing).

Added WaitForPeersReady() to KeyGenSession interface:
Why: The event consumer calls WaitForPeersReady() on keygen sessions before starting GenerateKey(). The method is implemented on the embedded session struct, but Go interfaces need it declared.


5. pkg/mpc/ecdsa_resharing_session.go & pkg/mpc/eddsa_resharing_session.go

Same buffered ErrCh + doneCh + interface additions:
Why: The reshare bug reported by treeforest was identical to the signing bug. The reshare protocol sends direct messages between peers, and if a peer hasn't subscribed yet (warmup race), "no responders" errors cascade. After the error, the Reshare goroutine leaks, and the node stops processing future reshare events. The WaitForPeersReady() and Stop() additions to the ReshareSession interface let the event consumer use the barrier and clean shutdown.


6. pkg/eventconsumer/event_consumer.go

Replaced warmUpSession() with WaitForPeersReady() in signing:

// Old:
ec.warmUpSession()  // time.Sleep(200ms)

// New:
if err := session.WaitForPeersReady(); err != nil {
    ec.handleSigningSessionError(...)
    return
}

Why: The 200ms sleep was the root cause of the "no responders" errors. It assumed all peers would finish subscribing within 200ms. Under any load, network latency, or scheduling pressure, this fails. The barrier verifies instead of assuming.

Same replacement in keygen and reshare paths:
Why: All three MPC operations (keygen, sign, reshare) had the same warmup race condition. The reshare concurrent ECDSA+EDDSA failure reported by treeforest was this exact bug.

Error handler now calls session.Stop() and done():

case err := <-session.ErrChan():
    ec.handleSigningSessionError(...)
    session.Stop()  // closes doneCh → Sign() exits
    done()          // cancels context → error goroutine exits
    return

Why: Before, the error handler just returned without stopping anything. Sign() kept running (stuck forever). The context was never canceled. Now: session.Stop() wakes up Sign() via doneCh, and done() cancels the context so any future select on ctx.Done() also exits. Clean teardown of the entire session on error.


7. pkg/mpc/registry.go — Health check resilience

Added consecutive failure counter:

failureCount := make(map[string]int)
const maxFailures = 5

// On failure:
failureCount[peerID]++
if failureCount[peerID] >= maxFailures {
    // evict from Consul
}
// On success:
delete(failureCount, peerID)

Why: The old health check deleted a peer's Consul key after just 2 failed NATS requests (~6 seconds). A single network blip would evict a peer, trigger ECDH re-exchange, and cascade into "not ready" state across the cluster. With 5 consecutive failures required (~25 seconds of continuous unresponsiveness), transient issues are tolerated while genuinely dead peers are still detected.

Increased retry attempts from 2 to 3 with 1s delay:
Why: Gives the NATS connection more time to recover from brief interruptions before counting a failure.


8. cmd/mpcium/main.go — NATS connection hardening

TCP keepalive via custom dialer:

nats.Dialer(&net.Dialer{
    Timeout:   5 * time.Second,
    KeepAlive: 30 * time.Second,
}),

Why: AWS NAT Gateway/NLB silently drops TCP connections idle for ~350 seconds. TCP keepalive probes count as wire activity, preventing the idle timeout from triggering. Without this, the NATS connection could be silently killed, and neither the client nor server would know until the next message attempt.

NATS application-level pings:

nats.PingInterval(20 * time.Second),
nats.MaxPingsOutstanding(3),

Why: Even with TCP keepalive, the NATS client needs to detect dead connections at the application level. Default is 2-minute ping interval. With 20s pings and 3 outstanding max, a dead connection is detected within 60 seconds — well before the AWS 350s timeout.

Reconnect buffer:

nats.ReconnectBufSize(16 * 1024 * 1024),

Why: During NATS reconnection, published messages are buffered. The default buffer is small. If a signing operation produces messages during a brief reconnect, a larger buffer prevents message loss.

DisconnectErrHandler replaces DisconnectHandler:
Why: The old handler didn't capture the disconnect reason. The new one logs the error, which is critical for diagnosing whether disconnects are caused by network issues, server shutdown, or idle timeouts.

anhthii added 7 commits March 15, 2026 22:16
Add custom dialer with TCP keepalive (30s), NATS ping interval (20s/3
outstanding), 16MB reconnect buffer, and custom inbox prefix to prevent
AWS NAT Gateway idle timeout from silently killing connections.
Require 5 consecutive health check failures (~25s) before removing a
peer from Consul, preventing transient NATS reconnections from
cascading into unnecessary peer eviction and ECDH re-exchange.
…fecycle improvements

- Route MPC results to specific clients via clientID in topic subjects,
  replacing shared consumers with per-client ephemeral consumers
- Replace warmUpSession() sleep with WaitForPeersReady() barrier that
  verifies all peers have active subscriptions before starting protocol
- Add graceful session shutdown with doneCh/sendErr to prevent goroutine
  leaks and blocking on error channels
- Add JetStream MaxAckPending for backpressure and InProgress() to
  prevent premature redelivery during long MPC operations
- Use atomic tryAddSession/removeSession for duplicate session detection
…bytes to 100MB

- Remove per-broker WithMaxAge overrides in favor of the new default
- Add WithMaxBytes broker option for configurable stream size limits
- Increase message queue max bytes from 10MB to 100MB
The warmUpSession sleep was replaced by the proper WaitForPeersReady
barrier handshake but the old code and config were never cleaned up.
@socket-security

socket-security bot commented Mar 16, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Added github.com/nats-io/nats.go@v1.49.0 (Supply Chain Security 74, Vulnerability 100, Quality 100, Maintenance 100, License 100)

@anhthii anhthii force-pushed the barrier-implementation-improve-reliability branch from 072630f to b78706b on March 18, 2026 16:53
@anhthii anhthii force-pushed the barrier-implementation-improve-reliability branch from ce4d8b6 to 7fd26b9 on March 18, 2026 17:07
@anhthii anhthii changed the title [Draft] Barrier implementation improve reliability Barrier implementation improve reliability Mar 19, 2026
@anhthii anhthii merged commit bfe405b into master Mar 19, 2026
26 checks passed
@anhthii anhthii deleted the barrier-implementation-improve-reliability branch March 19, 2026 01:36