
v0.3.5 nil-pointer panic in ecdhSession.BroadcastPublicKey on rolling restart #158

@nicleerocks

Description

Summary

mpcium v0.3.5 nodes panic with a nil pointer dereference in crypto/ecdh.(*PublicKey).Bytes during ECDH session bootstrap when one peer restarts mid-handshake. The cluster auto-recovers (the Docker --restart policy and the ECDH retrigger logic eventually converge), so the failure is self-healing, but it's worth fixing because:

  • Each panic adds ~10-15s before the node rejoins quorum.
  • During a planned rolling restart, the panic+recover cycle effectively forces a slower roll than the operator anticipates.
  • The panics reach stderr as panic: runtime error, which gets noisy in log shipping.

Reproduction

3-node cluster on mpcium:v0.3.5 (built from the upstream Dockerfile, distroless image). Sequential docker restart mpcium-nodeN for N=0,1,2 with a ~30s gap between each.

Repro hit ~50% of restarts in a 6-restart window (3 nodes × 2 cycles during a config change).
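The restart sequence above, as a runnable sketch (container names are taken from the description; the actual docker restart and sleep calls are commented out so the script can be dry-run):

```shell
#!/bin/sh
# Rolling-restart repro sketch: 2 cycles x 3 nodes, ~30s apart.
restarts=0
for cycle in 1 2; do
  for n in 0 1 2; do
    echo "cycle $cycle: restart mpcium-node$n"
    # docker restart "mpcium-node$n"   # the actual command used in the repro
    # sleep 30                         # ~30s gap between restarts
    restarts=$((restarts + 1))
  done
done
echo "total restarts: $restarts"
```

With both commented lines enabled this reproduces the 6-restart window in which ~50% of restarts hit the panic.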

Stack trace

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xcd5ebf]

goroutine 29 [running]:
crypto/ecdh.(*PublicKey).Bytes(...)
	/usr/local/go/src/crypto/ecdh/ecdh.go:72
github.com/fystack/mpcium/pkg/mpc.(*ecdhSession).BroadcastPublicKey(0xc00001c800)
	/src/pkg/mpc/key_exchange_session.go:155 +0x3f
github.com/fystack/mpcium/pkg/mpc.(*registry).triggerECDHExchange(0xc00017c340)
	/src/pkg/mpc/registry.go:164 +0x44
created by github.com/fystack/mpcium/pkg/mpc.(*registry).registerReadyPairs in goroutine 52
	/src/pkg/mpc/registry.go:123 +0x278

Likely cause (speculative)

registerReadyPairs spawns goroutines that call triggerECDHExchange, which in turn calls BroadcastPublicKey. If a peer restarts after registerReadyPairs schedules the broadcast goroutine but before BroadcastPublicKey reads the peer's PublicKey from whatever shared state holds it, the PublicKey is nil, and crypto/ecdh.(*PublicKey).Bytes doesn't tolerate a nil receiver.

A nil check at key_exchange_session.go:155 (or wherever the receiver is dereferenced) would convert the panic into a clean error returned to the registry, which could then retry the broadcast on the next ECDH retrigger pass instead of crashing the process.
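A minimal sketch of that guard, assuming the session reads the peer key as a *crypto/ecdh.PublicKey (broadcastPublicKey and errPeerKeyUnavailable are hypothetical stand-ins, not mpcium's actual internals):

```go
package main

import (
	"crypto/ecdh"
	"errors"
	"fmt"
)

// errPeerKeyUnavailable signals the caller to retry on the next ECDH
// retrigger pass instead of crashing the process.
var errPeerKeyUnavailable = errors.New("peer public key not yet available")

// broadcastPublicKey is a hypothetical stand-in for the guarded part of
// (*ecdhSession).BroadcastPublicKey: it checks the receiver that
// crypto/ecdh.(*PublicKey).Bytes would otherwise nil-dereference.
func broadcastPublicKey(peerKey *ecdh.PublicKey) ([]byte, error) {
	if peerKey == nil {
		return nil, errPeerKeyUnavailable
	}
	return peerKey.Bytes(), nil
}

func main() {
	// Simulate the race: the peer restarted before republishing its key,
	// so the shared state still yields nil.
	var peerKey *ecdh.PublicKey
	if _, err := broadcastPublicKey(peerKey); err != nil {
		fmt.Println("recoverable:", err) // instead of a SIGSEGV panic
	}
}
```

The same pattern applies anywhere else the registry dereferences peer state that a restart can invalidate.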

Environment

  • mpcium v0.3.5 (confirmed with mpcium-cli version)
  • distroless image built from upstream Dockerfile
  • 3 nodes co-located on a single VPS (will move to separate hosts in a future op — not relevant to this bug)
  • NATS v2.x, Consul 1.15.4

Happy to dig further if useful.
