v0.9.0-rc.34
CRDT convergence under failover: QUIC keepalive on the sync endpoint + configurable blob-stall timeout
Two fixes plus CI hardening. The headline is #226: the AutomergeBackend::with_iroh sync/blob endpoint was missing the tactical QUIC keepalive that every IrohTransport endpoint already applies, so a half-open connection under dual-C2 failover stalled convergence to reachable peers in multiples of the 30 s app-level read timeout (in peat-sim 7n-dual-c2, the ~30/60/90 s outliers collapsed to a ~5–7 s worst case). Ships alongside the #227 configurable stall-timeout knob (consumed by peat-node) and CI serialization for the #222 P2P flake (#229).
Fixed
- fetch_blob peer-health demotion is now Stalled-only (#227, follows #137). A peer was demoted on both Stalled and Errored fetch outcomes, but iroh-blobs surfaces "peer unreachable" and "peer reachable but doesn't have the blob yet" identically as Error("Unable to download …"). Demoting on Errored punished a healthy relay still staging the content (e.g. the surviving C2 in a dual-C2 failover) and, with a low stall timeout, drove a demote/oscillate thrash. Now only Stalled (no response within the threshold) and stream-open failure demote a peer; an errored attempt is treated as a transient content-availability condition.
- CRDT sync now detects half-open/dead connections in ~5s instead of ~30s (#226).
AutomergeBackend::with_iroh, the endpoint backing the persistent CRDT sync channels and the blob store on every peat-node/peat-protocol consumer, was built without the tactical QUIC keepalive + idle-timeout (1s keepalive / 5s max_idle_timeout, Issue #315) that every IrohTransport endpoint already applies. With no transport-level liveness, a half-open connection (a peer that became unreachable, or a transient path disruption at a topology failover) was detected only by the 30s app-level SyncChannel::RECV_TIMEOUT. Because the on-change sync loop pushes a local write by enqueuing onto each peer's persistent channel, the write then sat undelivered on the half-open stream until that 30s timeout forced a reconnect and re-sync. As a result, convergence to reachable peers stalled in multiples of 30s under dual-C2 failover (observed end-to-end in the peat-sim 7n-dual-c2 experiment: ~30/60/90s outliers on otherwise sub-100ms document sync, blob-independent and size-independent). Applying the same create_tactical_transport_config() closes the gap; the 1s keepalive additionally keeps a healthy-but-idle channel verified so it is never mistaken for stale. The same create_tactical_transport_config() is also applied to NetworkedIrohBlobStore::build_endpoint_with_hooks (the peat-mesh-node binary's endpoint), so the keepalive guarantee is uniform across every peat-mesh-built iroh endpoint.
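
For illustration only, the Stalled-only demotion rule from the first fix can be sketched as a small predicate. The `FetchOutcome` enum and `should_demote` function below are hypothetical stand-ins, not the crate's actual types; the sketch assumes each fetch attempt resolves to exactly one of these outcomes.

```rust
/// Hypothetical outcome of one blob-fetch attempt against a peer.
/// These names are illustrative and are not taken from the peat-mesh API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FetchOutcome {
    /// The blob was downloaded.
    Completed,
    /// No response within the stall threshold.
    Stalled,
    /// The stream to the peer could not be opened.
    StreamOpenFailed,
    /// The attempt returned an error; the peer may be unreachable or may
    /// simply not have the blob yet, and the two cases look identical.
    Errored,
}

/// Returns true when the outcome should demote the peer's health.
/// Only a stall or a stream-open failure demotes; an error is treated as a
/// transient content-availability condition.
fn should_demote(outcome: FetchOutcome) -> bool {
    matches!(
        outcome,
        FetchOutcome::Stalled | FetchOutcome::StreamOpenFailed
    )
}

fn main() {
    for outcome in [
        FetchOutcome::Completed,
        FetchOutcome::Stalled,
        FetchOutcome::StreamOpenFailed,
        FetchOutcome::Errored,
    ] {
        println!("{:?} -> demote: {}", outcome, should_demote(outcome));
    }
}
```

Under this rule, a relay that answers with an error while still staging content keeps its standing, which is what avoids the demote/oscillate thrash described above.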
Added
AutomergeBackendConfig::download_stall_timeout: Option<Duration> (#227). Threads a configurable per-attempt blob-download stall threshold through to the blob store (was the IrohConfig default only). None preserves the 30 s default. Lets a consumer (e.g. peat-node) shrink the first-fetch dead-peer wait for redundant-peer deployments. Additive-but-source-breaking for AutomergeBackendConfig struct-literal callers (like the earlier cipher field).
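
A minimal sketch of how an `Option<Duration>` knob of this kind typically resolves to an effective value. `BackendConfig`, its `other` field, `effective_stall_timeout`, and the 5 s override are hypothetical stand-ins; whether the real `AutomergeBackendConfig` implements `Default` is an assumption here, not something these notes confirm.

```rust
use std::time::Duration;

/// The 30 s default that `None` preserves, per the notes above.
const DEFAULT_STALL_TIMEOUT: Duration = Duration::from_secs(30);

/// Hypothetical stand-in for a backend config struct; not the real type.
#[derive(Debug, Clone, Default)]
struct BackendConfig {
    /// Per-attempt blob-download stall threshold; `None` means "use the default".
    download_stall_timeout: Option<Duration>,
    /// Placeholder for the struct's other fields (illustrative only).
    other: u32,
}

/// Resolves the configured threshold, falling back to the 30 s default.
fn effective_stall_timeout(cfg: &BackendConfig) -> Duration {
    cfg.download_stall_timeout.unwrap_or(DEFAULT_STALL_TIMEOUT)
}

fn main() {
    // Leaving the field unset keeps the default behaviour.
    let default_cfg = BackendConfig::default();

    // A redundant-peer deployment might shrink the wait; 5 s is an
    // illustrative value, not a recommendation from the release notes.
    let tuned_cfg = BackendConfig {
        download_stall_timeout: Some(Duration::from_secs(5)),
        ..Default::default()
    };

    println!("default: {:?}", effective_stall_timeout(&default_cfg));
    println!("tuned:   {:?} (other = {})", effective_stall_timeout(&tuned_cfg), tuned_cfg.other);
}
```

A caller that builds the struct with a full literal (every field listed, no `..` update) stops compiling when a field is added, which is the source-breaking case noted above; struct-update syntax, where the type supports it, avoids that.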