v0.9.0-rc.34

@github-actions github-actions released this 06 Jun 00:38
· 23 commits to main since this release
v0.9.0-rc.34
abacabf

CRDT convergence under failover: QUIC keepalive on the sync endpoint + configurable blob-stall timeout.

Two fixes plus CI hardening. The headline is #226: the AutomergeBackend::with_iroh sync/blob endpoint was missing the tactical QUIC keepalive that every IrohTransport endpoint already applies, so a half-open connection under dual-C2 failover stalled convergence to reachable peers in multiples of the 30 s app-level read timeout. In the peat-sim 7n-dual-c2 experiment, the ~30/60/90 s outliers collapsed to a ~5–7 s worst case. This release also ships the #227 configurable stall-timeout knob (consumed by peat-node) and CI serialization for the #222 P2P flake (#229).
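
As a rough illustration of why the detection time drops, the sketch below models a connection's liveness timers and returns whichever fires first. The `Liveness` type and its field names are hypothetical and are not peat-mesh or iroh types; the 5 s and 30 s values are the ones quoted in the #226 entry under Fixed.

```rust
use std::time::Duration;

/// Hypothetical model of the liveness timers on one sync connection.
/// This is an illustration only, not a peat-mesh or iroh type.
struct Liveness {
    /// Transport-level idle timeout, if the endpoint configures one.
    max_idle_timeout: Option<Duration>,
    /// App-level read timeout (the role SyncChannel::RECV_TIMEOUT plays).
    recv_timeout: Duration,
}

impl Liveness {
    /// Upper bound on how long a half-open connection goes unnoticed:
    /// the earliest timer that can fire.
    fn detection_bound(&self) -> Duration {
        match self.max_idle_timeout {
            Some(idle) => idle.min(self.recv_timeout),
            None => self.recv_timeout,
        }
    }
}

/// Endpoint without transport-level liveness (the pre-fix situation).
fn without_keepalive() -> Liveness {
    Liveness {
        max_idle_timeout: None,
        recv_timeout: Duration::from_secs(30),
    }
}

/// Endpoint with the 5 s idle timeout quoted in the notes.
fn with_keepalive() -> Liveness {
    Liveness {
        max_idle_timeout: Some(Duration::from_secs(5)),
        recv_timeout: Duration::from_secs(30),
    }
}
```

Under this model the bound is 30 s without a transport-level idle timeout and 5 s with one. The measured ~5–7 s worst case also includes the reconnect and re-sync, which the sketch does not model.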

Fixed

  • fetch_blob peer-health demotion is now Stalled-only (#227, follows #137). Previously a peer was demoted on both Stalled and Errored fetch outcomes, but iroh-blobs surfaces "peer unreachable" and "peer reachable but doesn't have the blob yet" identically as Error("Unable to download …"). Demoting on Errored punished a healthy relay that was still staging the content (e.g. the surviving C2 in a dual-C2 failover) and, with a low stall timeout, caused peers to thrash between demoted and healthy. Now only Stalled (no response within the threshold) and stream-open failure demote a peer; an errored attempt is treated as a transient content-availability condition.
  • CRDT sync now detects half-open/dead connections in ~5 s instead of ~30 s (#226). AutomergeBackend::with_iroh, the endpoint backing the persistent CRDT sync channels and the blob store on every peat-node / peat-protocol consumer, was built without the tactical QUIC keepalive + idle-timeout (1 s keepalive / 5 s max_idle_timeout, Issue #315) that every IrohTransport endpoint already applies. With no transport-level liveness, a half-open connection (a peer that became unreachable, or a transient path disruption at a topology failover) was detected only by the 30 s app-level SyncChannel::RECV_TIMEOUT. Because the on-change sync loop pushes a local write by enqueuing it onto each peer's persistent channel, the write sat undelivered on the half-open stream until that 30 s timeout forced a reconnect and re-sync. Convergence to reachable peers therefore stalled in multiples of 30 s under dual-C2 failover (observed end-to-end in the peat-sim 7n-dual-c2 experiment: ~30/60/90 s outliers on otherwise sub-100 ms document sync, blob-independent and size-independent). Applying the same create_tactical_transport_config() closes the gap; the 1 s keepalive additionally keeps a healthy-but-idle channel verified so it is never mistaken for stale. The same create_tactical_transport_config() is also applied to NetworkedIrohBlobStore::build_endpoint_with_hooks (the peat-mesh-node binary's endpoint), so the keepalive guarantee is uniform across every peat-mesh-built iroh endpoint.
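
The Stalled-only demotion rule from the fetch_blob entry above can be pictured as a small decision function. This is a minimal sketch, assuming a four-way outcome enum. `FetchOutcome` and `should_demote` are hypothetical names chosen for illustration, not the crate's actual types.

```rust
/// Hypothetical outcome of one blob-fetch attempt against a peer.
/// The variant names are illustrative, not the actual peat-mesh API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FetchOutcome {
    /// The blob was downloaded.
    Completed,
    /// No response within the stall threshold.
    Stalled,
    /// The stream to the peer could not be opened.
    StreamOpenFailed,
    /// The download reported an error. This covers both "peer unreachable"
    /// and "peer reachable but blob not staged yet", which look identical.
    Errored,
}

/// Returns true when the outcome should demote the peer's health.
/// Errored is deliberately excluded because it is ambiguous.
fn should_demote(outcome: FetchOutcome) -> bool {
    matches!(
        outcome,
        FetchOutcome::Stalled | FetchOutcome::StreamOpenFailed
    )
}
```

Leaving Errored out of the demoting set is the design point: a relay that is reachable but still staging the content stays eligible for the next attempt rather than being penalized for an error that says nothing about its liveness.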

Added

  • AutomergeBackendConfig::download_stall_timeout: Option<Duration> (#227). Threads a configurable per-attempt blob-download stall threshold through to the blob store (previously only the IrohConfig default applied). None preserves the 30 s default. Lets a consumer (e.g. peat-node) shrink the first-fetch dead-peer wait for redundant-peer deployments. Additive, but source-breaking for callers that construct AutomergeBackendConfig with a struct literal (like the earlier cipher field).
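
The sketch below shows how an Option<Duration> knob of this shape typically resolves to an effective value. `BackendConfigSketch`, `effective_stall_timeout`, and the constant are hypothetical stand-ins. Only the field name, its type, and the 30 s default come from the entry above.

```rust
use std::time::Duration;

/// The 30 s default stated in the release notes. The constant name is hypothetical.
const DEFAULT_DOWNLOAD_STALL_TIMEOUT: Duration = Duration::from_secs(30);

/// Hypothetical stand-in for AutomergeBackendConfig, reduced to the new field.
#[derive(Debug, Clone, Default)]
struct BackendConfigSketch {
    /// Per-attempt blob-download stall threshold. None keeps the default.
    download_stall_timeout: Option<Duration>,
}

impl BackendConfigSketch {
    /// Resolve the configured value, falling back to the default when unset.
    fn effective_stall_timeout(&self) -> Duration {
        self.download_stall_timeout
            .unwrap_or(DEFAULT_DOWNLOAD_STALL_TIMEOUT)
    }
}

/// Example of a consumer shrinking the wait for a redundant-peer deployment.
/// The 5 s value is an arbitrary illustration, not a recommended setting.
fn redundant_peer_config() -> BackendConfigSketch {
    BackendConfigSketch {
        download_stall_timeout: Some(Duration::from_secs(5)),
    }
}
```

The source break arises because a Rust struct literal must name every field, so existing literals stop compiling until they add download_stall_timeout: None. If the real type implements Default (an assumption, not confirmed by these notes), struct update syntax with ..Default::default() would avoid similar breaks in future.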