A pure performance release — nothing on the wire moves
v0.27.1 ships no new systems, no new SDK surface, and no protocol changes. Every change either replaces an O(shards) operation with an O(1) atomic, swaps an O(n) full-scan for an index read, deletes an allocation, or corrects a benchmark fixture that was reporting fiction. The work is recorded in full in docs/misc/PERF_AUDIT_2026_06_08_BENCHMARK_WINS.md; this log is the operator-facing summary.
The organizing observation, the same shape as v0.27's: the substrate was answering cheap questions expensively. len(), node_count(), and stats() are called on admission gates and per-selection hot paths, and the default DashMap shards to 4 × num_cpus (128 on a 32-thread host), so every one of those calls locked and summed 128 shards regardless of how few entries the map held — an ~950 ns fixed cost to read a number the code could have maintained as it went. v0.27.1 maintains it as it goes.
DashMap::len() was a 128-shard walk on hot paths
The cross-cutting fix. Five subsystems now carry AtomicUsize (and AtomicU64) counters, maintained exactly on every insert / remove / eviction, that replace the per-shard walk:
- LocalGraph (swarm.rs): num_nodes / num_edges / num_seen. The hot one: the seen_pingwaves soft-cap gate ran on every accepted pingwave, paying the shard walk per admission. local_graph/on_pingwave_duplicate drops from 974 ns → 16 ns (~60×).
- ProximityGraph (behavior/proximity.rs): num_nodes / num_edges / num_seen.
- MetadataStore (behavior/metadata.rs): node_count, and stats() now reads its inverted indexes (status / tier / continent) instead of full-scanning every node with a String allocation per entry.
- FailureDetector (failure.rs): num_nodes, plus check_all() now reads the monotonic clock once per sweep instead of once per node.
- RoutingTable (route.rs): num_routes / num_streams, including the per-novel-stream admission gate.
node_count() / len() / stats() reads collapse from ~950 ns to a sub-nanosecond atomic load. The FailureDetector per-status (healthy / suspected / failed) tally is deliberately kept as a scan — it's observability-only and node status is mutated in place, so a maintained per-status counter would silently drift. The scan is always exact.
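The maintained-counter pattern can be sketched with the standard library alone. This is a minimal illustration, not the project's code: CountedMap, its u64 keys and values, and the modulo shard choice are hypothetical stand-ins for a DashMap-backed store, and len_by_walk exists only to show the walk the counter replaces.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical stand-in for a sharded concurrent map (not DashMap itself).
pub struct CountedMap {
    shards: Vec<Mutex<HashMap<u64, u64>>>,
    /// Maintained on every insert/remove so `len()` never touches a shard.
    len: AtomicUsize,
}

impl CountedMap {
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0);
        Self {
            shards: (0..shard_count).map(|_| Mutex::new(HashMap::new())).collect(),
            len: AtomicUsize::new(0),
        }
    }

    fn shard(&self, key: u64) -> &Mutex<HashMap<u64, u64>> {
        &self.shards[(key as usize) % self.shards.len()]
    }

    pub fn insert(&self, key: u64, value: u64) {
        let mut shard = self.shard(key).lock().unwrap();
        // Count only genuinely new keys; an overwrite leaves the size unchanged.
        if shard.insert(key, value).is_none() {
            self.len.fetch_add(1, Ordering::Relaxed);
        }
    }

    pub fn remove(&self, key: u64) -> Option<u64> {
        let mut shard = self.shard(key).lock().unwrap();
        let removed = shard.remove(&key);
        // Decrement only when an entry actually left the map.
        if removed.is_some() {
            self.len.fetch_sub(1, Ordering::Relaxed);
        }
        removed
    }

    /// O(1): a single atomic load.
    pub fn len(&self) -> usize {
        self.len.load(Ordering::Relaxed)
    }

    /// O(shards): locks and sums every shard, however few entries exist.
    pub fn len_by_walk(&self) -> usize {
        self.shards.iter().map(|s| s.lock().unwrap().len()).sum()
    }
}
```

Because the counter is updated while the shard lock is held, it cannot disagree with the map once all writers have finished; the cost is one extra atomic operation per mutation in exchange for a lock-free read.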
Capability serialize — a one-word fix
sorted_tag_vec sorted capability tags with sort_by_key(|t| t.to_string()), which re-renders each Tag to a String on every comparison (~N log N allocations). Switched to sort_by_cached_key, which renders each tag exactly once (N allocations). Output order is byte-identical, so signed CapabilityAnnouncement bytes stay stable across peers — pinned by a regression test. capability_set/serialize drops 65.3 µs → 9.6 µs (~6.8×); capability_announcement/serialize 71.7 µs → 11.8 µs (~6.1×).
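The difference between the two sort calls can be observed by counting how often each tag is rendered. The sketch below is illustrative only: this Tag type, its Display impl, and the thread-local RENDERS counter are hypothetical and exist purely to make the render count visible.

```rust
use std::cell::Cell;
use std::fmt;

thread_local! {
    /// Counts how many times a Tag has been rendered on this thread.
    static RENDERS: Cell<usize> = Cell::new(0);
}

/// Hypothetical tag type; the real Tag is not shown in these notes.
#[derive(Clone, Debug, PartialEq)]
pub struct Tag(pub &'static str);

impl fmt::Display for Tag {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        RENDERS.with(|c| c.set(c.get() + 1));
        f.write_str(self.0)
    }
}

/// Returns the render count so far and resets it to zero.
pub fn take_render_count() -> usize {
    RENDERS.with(|c| c.replace(0))
}

/// Renders a key on every comparison, so each tag is rendered many times.
pub fn sorted_by_key(mut tags: Vec<Tag>) -> Vec<Tag> {
    tags.sort_by_key(|t| t.to_string());
    tags
}

/// Renders each key at most once, then sorts using the cached strings.
pub fn sorted_by_cached_key(mut tags: Vec<Tag>) -> Vec<Tag> {
    tags.sort_by_cached_key(|t| t.to_string());
    tags
}
```

Both functions produce the same order, which is the property that keeps signed bytes stable; only the number of String allocations differs.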
API registry — O(1) counts, index-derived stats, allocation-free path match
ApiRegistry (behavior/api.rs) got the same treatment plus an allocation fix:
- len() / is_empty() / stats().total_nodes and the register capacity gate now read node_count / total_endpoints atomics. api_registry_basic/len: 1.42 µs → 0.20 ns.
- stats() reads apis_by_name from the by_api_name inverted index (provider count per name, skipping empty buckets) rather than full-scanning every node and schema with a String clone per schema. api_registry_basic/stats: ~201 ms → ~7 µs.
- find_by_endpoint called matches_path(..).is_some(), allocating two Vecs + a HashMap + a String per endpoint per node just to extract a bool. A new allocation-free ApiEndpoint::path_matches() -> bool replaces it at the three params-discarding call sites (the full scan is retained; it is correct for endpoints whose first path segment is a parameter, which a prefix index would miss). api_registry_query/find_by_endpoint: 6.98 ms → 1.88 ms (~3.7×), all from dropped allocation.
stats()'s apis_by_name is now distinct provider nodes per API name (the index is a provider set); this differs from the old per-schema-instance count only when one node advertises the same API name in two schemas — a degenerate case, documented and pinned by a test.
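An allocation-free path match of the kind described for find_by_endpoint might look like the following. This is a sketch under assumptions: the free function signature and the brace syntax for parameter segments (for example {id}) are hypothetical, since the notes do not show how ApiEndpoint encodes its path templates.

```rust
/// Returns true when `path` matches `pattern`. A pattern segment wrapped in
/// braces (e.g. `{id}`, an assumed syntax) matches any single non-empty
/// segment. Both strings are walked as borrowed slices, so nothing is allocated.
pub fn path_matches(pattern: &str, path: &str) -> bool {
    let mut pattern_segments = pattern.trim_matches('/').split('/');
    let mut path_segments = path.trim_matches('/').split('/');
    loop {
        match (pattern_segments.next(), path_segments.next()) {
            // Both exhausted together: every segment matched.
            (None, None) => return true,
            (Some(p), Some(s)) => {
                let is_param = p.len() >= 2 && p.starts_with('{') && p.ends_with('}');
                if is_param {
                    if s.is_empty() {
                        return false;
                    }
                } else if p != s {
                    return false;
                }
            }
            // One side has more segments than the other.
            _ => return false,
        }
    }
}
```

A pattern such as /{tenant}/health begins with a parameter segment, which is the case a prefix index would miss and the reason a full scan remains correct.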
Load balancer — snapshot selection, right-sized hash ring
LoadBalancer::select (behavior/loadbalance.rs) is a per-dispatch hot path in GroupCoordinator, and get_available_endpoints iterated the endpoints DashMap via DashMap::iter — a 128-shard walk regardless of endpoint count.
- Endpoint snapshot. The authoritative DashMap is kept for point lookups (reservation, health/metric updates); select / stats / endpoints / endpoint_count now iterate a flat ArcSwap<Vec<Arc<EndpointState>>> snapshot rebuilt only when the endpoint set changes. Per-endpoint atomic state (health, connections, circuit) stays live through the shared Arcs. lb_strategies/round_robin: 8.24 µs → ~340 ns (~24×); lb_scaling/select/10: 5.59 µs → ~370 ns (~15×).
- Right-sized hash ring. consistent_hash selection walks the separate hash_ring DashMap, which the snapshot doesn't cover; it was over-sharded the same way. Pinning it to 8 shards (HASH_RING_SHARDS) cut lb_strategies/consistent_hash ~20% (49.1 µs → 39.8 µs), no new invariants.
A documented experiment (in the audit, "Snapshot vs. right-sized DashMap") confirmed the snapshot is not over-engineering: replacing it with a merely right-sized endpoints DashMap regressed select ~2× (a wait-free ArcSwap load over a contiguous Vec beats locking even 8 shards over scattered HashMap buckets on the iterate-heavy path). The snapshot stays; only the ring — which it doesn't cover — was right-sized.
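The dual-store shape (authoritative map plus flat snapshot) can be sketched with std types only. Everything here is an assumption made for illustration: Balancer and its fields are hypothetical, a Mutex<HashMap> stands in for the DashMap, and an RwLock<Arc<Vec<..>>> stands in for ArcSwap, so this version is not wait-free the way the real snapshot load is.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, RwLock};

pub struct EndpointState {
    pub id: String,
    /// Live per-endpoint state, shared by the map and every snapshot.
    pub connections: AtomicUsize,
}

pub struct Balancer {
    /// Authoritative store for point lookups. Holding this lock during a
    /// rebuild also serializes membership changes.
    endpoints: Mutex<HashMap<String, Arc<EndpointState>>>,
    /// Flat snapshot used for iteration (std stand-in for ArcSwap).
    snapshot: RwLock<Arc<Vec<Arc<EndpointState>>>>,
    cursor: AtomicUsize,
}

impl Balancer {
    pub fn new() -> Self {
        Self {
            endpoints: Mutex::new(HashMap::new()),
            snapshot: RwLock::new(Arc::new(Vec::new())),
            cursor: AtomicUsize::new(0),
        }
    }

    pub fn add_endpoint(&self, id: &str) {
        let mut map = self.endpoints.lock().unwrap();
        map.entry(id.to_string()).or_insert_with(|| {
            Arc::new(EndpointState { id: id.to_string(), connections: AtomicUsize::new(0) })
        });
        self.rebuild(&map);
    }

    pub fn remove_endpoint(&self, id: &str) {
        let mut map = self.endpoints.lock().unwrap();
        if map.remove(id).is_some() {
            self.rebuild(&map);
        }
    }

    /// Rebuilt only when membership changes, never on the select path.
    fn rebuild(&self, map: &HashMap<String, Arc<EndpointState>>) {
        let mut flat: Vec<_> = map.values().cloned().collect();
        flat.sort_by(|a, b| a.id.cmp(&b.id)); // deterministic rotation order
        *self.snapshot.write().unwrap() = Arc::new(flat);
    }

    /// Iterates the contiguous snapshot; the map is not touched.
    pub fn select_round_robin(&self) -> Option<Arc<EndpointState>> {
        let snap = Arc::clone(&*self.snapshot.read().unwrap());
        if snap.is_empty() {
            return None;
        }
        let i = self.cursor.fetch_add(1, Ordering::Relaxed) % snap.len();
        Some(Arc::clone(&snap[i]))
    }
}
```

Because the snapshot holds the same Arc values as the map, a connection count bumped through one is visible through the other; only membership is copied, not state.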
Concurrency hardening (correctness, shipped with the perf work)
The dual-store and counter changes drew a review pass that closed five latent races before they could ship:
- LoadBalancer membership lock: add_endpoint / remove_endpoint now serialize the map mutation + snapshot rebuild under a Mutex, so concurrent membership changes can't store a stale snapshot last (which would silently drop a just-added endpoint from rotation). Off the hot path; select only reads.
- Removed-endpoint flag: an EndpointState.removed bit, set on removal and checked in is_available(), so a selector reading a snapshot taken just before a concurrent removal filters the gone endpoint out instead of burning a reservation retry into a transient false NoEndpointsAvailable.
- ApiRegistry::register made atomic per node: the read-old / re-index / insert sequence now runs under a single nodes entry lock (mirroring MetadataStore::upsert), so concurrent re-registration of the same node can't drift total_endpoints (which, decremented with fetch_sub, could otherwise underflow to a huge value).
- ApiRegistry::clear drains instead of store(0): per-key decrement through the same chokepoints the live paths use, so a concurrent unregister racing clear can't underflow the counters.
- RoutingTable::get_stream_stats gated on the cap: it created a stream_stats entry for any id unconditionally, bypassing the MAX_STREAM_STATS soft cap the record_* paths enforce; now gated, returning Option.
All five carry regression tests (including multi-thread stress tests for the counter races).
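The removed-endpoint flag is the easiest of these to illustrate in isolation. The following is a minimal sketch, assuming a simplified EndpointState with only a health bit and a removed bit; select_from and the field layout are hypothetical, and the real selection logic is certainly richer.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

pub struct EndpointState {
    pub id: &'static str,
    pub healthy: AtomicBool,
    /// Set once when the endpoint is removed; never cleared.
    pub removed: AtomicBool,
}

impl EndpointState {
    pub fn new(id: &'static str) -> Arc<Self> {
        Arc::new(Self {
            id,
            healthy: AtomicBool::new(true),
            removed: AtomicBool::new(false),
        })
    }

    /// A removed endpoint is never available, even if it is still healthy
    /// and still present in an older snapshot.
    pub fn is_available(&self) -> bool {
        !self.removed.load(Ordering::Acquire) && self.healthy.load(Ordering::Acquire)
    }
}

/// Picks the first available endpoint from a (possibly stale) snapshot.
pub fn select_from(snapshot: &[Arc<EndpointState>]) -> Option<&'static str> {
    snapshot.iter().find(|e| e.is_available()).map(|e| e.id)
}
```

A selector that loaded its snapshot just before a removal still sees the removed endpoint in the Vec, but the flag lets it skip that entry on the first pass rather than discovering the removal through a failed reservation.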
Benchmark fixtures — corrections, not wins
Three of the largest "before" numbers were never real production costs — they were shared, growing Criterion fixtures bleeding into each other. The audit's §7 records them so nobody chases the wrong number, and the O(1)/fixture work makes them moot:
failure_detector/check_all (670 ms), failure_detector/stats (198 ms), and metadata_store_basic/stats (169 ms) were inflated by the heartbeat_new / register_new benches ballooning a shared detector/store that the later stats / check_all closures reused. check_all is genuinely O(n), so its bench got a dedicated growth_detector; the stats / len numbers are moot post-rework because those methods are now O(1) regardless of map size. Post-fix: check_all 16.7 µs, stats 16 µs, metadata stats 15.9 µs.
Measured results
Full table in the audit doc. Headline figures (Intel i9-14900K, Criterion defaults):
| Benchmark | Before | After | Change |
|---|---|---|---|
| local_graph/node_count | 958 ns | 0.20 ns | ~4770× |
| local_graph/stats | 2.89 µs | 0.33 ns | ~8850× |
| local_graph/on_pingwave_duplicate | 974 ns | 16 ns | ~60× |
| metadata_store_basic/len | 956 ns | 0.20 ns | ~4750× |
| routing_table/aggregate_stats | 13.1 µs | 6.07 µs | ~2.2× |
| capability_set/serialize | 65.3 µs | 9.63 µs | ~6.8× |
| api_registry_basic/len | 1.42 µs | 0.20 ns | ~6970× |
| api_registry_query/find_by_endpoint | 6.98 ms | 1.88 ms | ~3.7× |
| lb_strategies/round_robin | 8.24 µs | ~340 ns | ~24× |
| lb_scaling/select/10 | 5.59 µs | ~370 ns | ~15× |
| lb_strategies/consistent_hash | 50.6 µs | 39.8 µs | ~1.27× |
Absolute "after" figures on the sub-µs select/lb rows carry ±40–50% run-to-run variance on the dev box; they're representative, not precise, and the audit's re-verification note documents the spread. The multipliers and the order-of-magnitude wins are stable.
SIMD crypto (documented, opt-in). The audit's highest-leverage item — the ChaCha20-Poly1305 AEAD running on the software backend rather than AVX2 — is documented but deliberately not enforced in committed config: a baked-in +avx2 floor would SIGILL on pre-AVX2 x86-64 and is meaningless on ARM. Operators opt in per target class via RUSTFLAGS="-C target-feature=+avx2" (or target-cpu=native); default builds keep the software path, so nothing regresses and the ~5–10× data-path win is unlocked per deploy. See §1 of the audit.
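An operator who opts in may want to confirm that a given binary was actually built with the AVX2 floor and that the host supports it. The helpers below are a hypothetical sketch using only standard-library facilities; they are not part of the crate, and they say nothing about which backend the AEAD implementation selects internally.

```rust
/// True when the binary was compiled with AVX2 as a baseline feature,
/// for example via RUSTFLAGS="-C target-feature=+avx2".
pub fn avx2_compiled_in() -> bool {
    cfg!(target_feature = "avx2")
}

/// True when the CPU running this process supports AVX2.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn avx2_available_at_runtime() -> bool {
    std::is_x86_feature_detected!("avx2")
}

/// On non-x86 targets (ARM, for instance) AVX2 does not exist.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn avx2_available_at_runtime() -> bool {
    false
}
```

A default build reports false from avx2_compiled_in even on an AVX2-capable host, which matches the note that default builds keep the software path.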
Breaking changes
None on the wire, and none to behavior. v0.27.1 interoperates with v0.27.0 peers freely.
One minor source-level API refinement: RoutingTable::get_stream_stats now returns Option<Ref<…>> instead of Ref<…> (it returns None for a novel stream id once MAX_STREAM_STATS is reached, closing an unbounded-growth path). The type is re-exported, so an external caller would need to handle the Option; there are no in-tree callers outside tests.
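The shape of the change, and what a downstream caller would need to handle, can be sketched as follows. This is illustrative only: StreamTable, StreamStats, and record_packet are hypothetical, a plain HashMap with &mut access stands in for the concurrent map and its Ref guard, and the cap value of 2 is chosen for the example rather than taken from the source.

```rust
use std::collections::HashMap;

/// Illustrative cap; the real MAX_STREAM_STATS value is not stated here.
pub const MAX_STREAM_STATS: usize = 2;

#[derive(Default, Debug)]
pub struct StreamStats {
    pub packets: u64,
}

#[derive(Default)]
pub struct StreamTable {
    stats: HashMap<u64, StreamStats>,
}

impl StreamTable {
    /// Returns the stats for a known stream, or creates them for a novel
    /// stream only while the table is under the cap.
    pub fn get_stream_stats(&mut self, id: u64) -> Option<&mut StreamStats> {
        if !self.stats.contains_key(&id) && self.stats.len() >= MAX_STREAM_STATS {
            return None;
        }
        Some(self.stats.entry(id).or_default())
    }
}

/// Caller side: the None branch is the only new thing to handle.
pub fn record_packet(table: &mut StreamTable, id: u64) -> bool {
    match table.get_stream_stats(id) {
        Some(stats) => {
            stats.packets += 1;
            true
        }
        None => false,
    }
}
```

Existing streams keep working once the cap is reached; only a novel id is refused, which is what closes the unbounded-growth path.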
How to upgrade
Drop-in. Bump the dependency to 0.27.1 — no source changes required for the common case, no atomic peer roll, no config changes. The performance wins apply automatically. Two optional levers:
- SIMD crypto: rebuild the x86-64 target class with RUSTFLAGS="-C target-feature=+avx2" to unlock the AEAD fast path. Default builds are unchanged.
- get_stream_stats callers (if any exist downstream) add an Option match / expect.
Dependency updates
Routine patch bumps only — no major or minor version changes, no behavioral surface change. The wasm-bindgen family and the js-sys/web-sys pair move together as usual:
http (1.4.1 → 1.4.2), js-sys (0.3.99 → 0.3.100), uuid (1.23.2 → 1.23.3), wasm-bindgen / wasm-bindgen-macro / wasm-bindgen-macro-support / wasm-bindgen-shared (all 0.2.122 → 0.2.123), web-sys (0.3.99 → 0.3.100). Cargo.lock carries the exact pinned versions.