Summary
mpcium v0.3.5 nodes panic with a crypto/ecdh.(*PublicKey).Bytes nil pointer dereference during ECDH session bootstrap when one peer restarts mid-handshake. The node auto-recovers (the Docker --restart policy and the ECDH retrigger logic eventually converge), so the issue is self-healing, but it's worth fixing because:
- Each panic adds ~10-15s before the node rejoins quorum.
- During a planned rolling restart, the panic+recover cycle effectively forces a slower roll than the operator anticipates.
- Each panic dumps panic: runtime error plus the goroutine trace to stderr, which gets noisy in log shipping.
Reproduction
3-node cluster on mpcium:v0.3.5 (built from upstream Dockerfile, distroless image). Sequential docker restart mpcium-nodeN for N=0,1,2 with ~30s gap between each.
Repro hit ~50% of restarts in a 6-restart window (3 nodes × 2 cycles during a config change).
Stack trace
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xcd5ebf]
goroutine 29 [running]:
crypto/ecdh.(*PublicKey).Bytes(...)
/usr/local/go/src/crypto/ecdh/ecdh.go:72
github.com/fystack/mpcium/pkg/mpc.(*ecdhSession).BroadcastPublicKey(0xc00001c800)
/src/pkg/mpc/key_exchange_session.go:155 +0x3f
github.com/fystack/mpcium/pkg/mpc.(*registry).triggerECDHExchange(0xc00017c340)
/src/pkg/mpc/registry.go:164 +0x44
created by github.com/fystack/mpcium/pkg/mpc.(*registry).registerReadyPairs in goroutine 52
/src/pkg/mpc/registry.go:123 +0x278
Likely cause (speculative)
registerReadyPairs spawns goroutines that call triggerECDHExchange → BroadcastPublicKey. If a peer restarts after registerReadyPairs schedules the broadcast goroutine but before BroadcastPublicKey reads the peer's PublicKey from whatever shared state holds it, the PublicKey is nil. crypto/ecdh.(*PublicKey).Bytes doesn't tolerate a nil receiver.
A nil check at key_exchange_session.go:155 (or wherever the receiver is dereferenced) would convert the panic into a clean error returned to the registry, letting it retry the broadcast on the next ECDH retrigger pass instead of crashing the process.
Environment
- mpcium v0.3.5 (confirmed via mpcium-cli version)
- distroless image built from upstream Dockerfile
- 3 nodes co-located on a single VPS (will move to separate hosts in a future op — not relevant to this bug)
- NATS v2.x, Consul 1.15.4
Happy to dig further if useful.