Skip to content

v0.9.0-rc.36

Choose a tag to compare

@github-actions github-actions released this 07 Jun 04:30
· 19 commits to main since this release
v0.9.0-rc.36
5bee2e1

Determinism fix: stop the CRDT gossip-amplification storm + sync-channel idle-timeout wedge (#239). Under sustained writes the mesh could wedge and stop converging (timeouts; multi-second dead time) — a violation of the "every transaction deterministic, ≤ a couple seconds" SLA.

Fixed

  • Gossip-amplification storm (dominant) — automerge_sync.rs. On sync convergence the receive path (Issue #435) reset the peer SyncState to a fresh state keeping only shared_heads, discarding the negotiated quiescence (their_heads/last_sent_heads/have_responded). A converged doc then re-emitted a sync message on the next generate_sync_message; with the transitive gossip pusher (peat#891) that no-op send re-fired gossip_tx on every peer → re-broadcast → an unbounded gossip-amplification storm (~285 initiate/s on one unchanged doc) that starved new writes after a couple transactions. Fix: retain the converged (quiescent) SyncState and prune only the field that actually accumulates over a document's edit history — sent_hashes (prune_converged_sync_state). Quiescence is governed by the heads fields, not sent_hashes, so this bounds memory (the real #435 concern) without re-negotiating. On a 7-node failover lab this collapsed per-converged-doc traffic from 735 (≤5,700 worst) to ~10–12 delta-sends, and restored deterministic convergence (direct + failover-relay + 80 sustained writes + idle: zero timeouts).
  • Sync-channel receive-idle timeout — sync_channel.rs. The receiver wrapped the wait for the next frame in a 30 s idle timeout, tearing down healthy idle channels whose receiver was then never re-armed → delivery wedged after any quiet period. Removed it; the loop blocks for the next frame and QUIC keepalive (1 s) / max_idle_timeout (5 s) handles genuine connection death (surfacing as a stream error → reconnect).

Both fixes ship with deterministic guard tests (converged_resync_is_quiescent_after_issue_435_reset, receive_loop_delivers_frame_after_idle_beyond_old_recv_timeout). The bandwidth-cycled convergence sweep is tracked separately (#240 — bandwidth-bound transfers + a realistic per-cell tolerance), independent of this fix.