feat(cluster): implement replica bootstrap #3163
Conversation
Codecov Report
❌ The patch check failed because the patch coverage (9.07%) is below the target coverage (50.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files:
@@ Coverage Diff @@
## master #3163 +/- ##
============================================
- Coverage 74.46% 71.09% -3.37%
Complexity 943 943
============================================
Files 1188 1190 +2
Lines 106543 102840 -3703
Branches 83560 79874 -3686
============================================
- Hits 79332 73111 -6221
- Misses 24459 26776 +2317
- Partials 2752 2953 +201
Force-pushed from 39ce6cb to a1d20d7
hubcio
left a comment
out-of-diff findings:
core/partitions/src/messages_writer.rs:60-63 — let _ = file.sync_all().await.map_err(...) swallows the IO error on the file_exists open path. EIO at boot indicates dying media, yet bootstrap proceeds with a stale view of the disk. this is already on master HEAD, but server-ng makes the path live for every partition. propagate the error.
core/partitions/src/iggy_index_writer.rs:53 — let _ = file.sync_all().await; same pattern, without even a map_err. propagate (sketch below).
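a minimal sketch of the propagation fix for both spots, assuming the enclosing functions return io::Result (or something io::Error converts into); the function name is illustrative:

```rust
use std::io;

// Illustrative sketch only: this stands in for the open paths in
// messages_writer.rs / iggy_index_writer.rs. The point is to bubble the
// sync error up instead of `let _ = ...` dropping EIO at boot.
async fn sync_existing_file(file: &compio::fs::File) -> io::Result<()> {
    file.sync_all().await?;
    Ok(())
}
```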
core/partitions/src/messages_writer.rs:137 — let chunk_vec: Vec<_> = chunk.to_vec(); allocates a Vec per chunk on every save_frozen_batches call, a steady-state per-batch allocation on what is now the primary write path. cache a reusable Vec in the writer or refactor the compio iov call to take a borrowed slice.
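a sketch of the cached-buffer option; the struct fields and `write_owned` are hypothetical stand-ins for the real writer and the compio call that takes ownership of the buffer and hands it back:

```rust
// Sketch of the reuse pattern only; not the real messages_writer.rs API.
struct MessagesWriter {
    chunk_buf: Vec<u8>, // cached across save_frozen_batches calls
}

impl MessagesWriter {
    async fn save_chunk(&mut self, chunk: &[u8]) -> std::io::Result<()> {
        // Refill the cached Vec instead of `chunk.to_vec()` allocating per chunk.
        self.chunk_buf.clear();
        self.chunk_buf.extend_from_slice(chunk);
        // Owned-buffer IO: hand the Vec over, get it back afterwards.
        let buf = std::mem::take(&mut self.chunk_buf);
        let buf = write_owned(buf).await?; // hypothetical owned-buffer write
        self.chunk_buf = buf; // keep the allocation for the next batch
        Ok(())
    }
}

// Placeholder for the real write; returns the buffer so the writer can reuse it.
async fn write_owned(buf: Vec<u8>) -> std::io::Result<Vec<u8>> {
    Ok(buf)
}
```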
core/shard/src/lib.rs:622 — let namespaces: Vec<_> = planes.1.0.namespaces().copied().collect(); allocates a fresh Vec per inbox-frame iteration, and core/shard/src/router.rs:270 calls process_loopback after every dispatched frame in steady state. hoist a namespaces_buf: Vec<u64> like the existing loopback_buf, or short-circuit when the drained partitions / loopback are empty.
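a sketch of hoisting the allocation the same way as the existing loopback_buf; the struct and method are stand-ins, not the real shard types:

```rust
// Stand-in types; only the buffer-hoisting shape matters here.
struct Shard {
    namespaces_buf: Vec<u64>, // reused across inbox-frame iterations
}

impl Shard {
    fn process_loopback(&mut self, namespaces: impl Iterator<Item = u64>) {
        // Refill the cached buffer instead of collecting a fresh Vec per frame.
        self.namespaces_buf.clear();
        self.namespaces_buf.extend(namespaces);
        // Short-circuit when there is nothing to drain.
        if self.namespaces_buf.is_empty() {
            return;
        }
        // ... drain partitions / loopback using self.namespaces_buf ...
    }
}
```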
core/shard/src/router.rs:94-103 and 199-208 — shards_table.shard_for(...).unwrap_or_else(|| { warn!(...); 0 }) silently falls back to shard 0. single-shard server-ng masks this today; with multiple shards it becomes a silent routing-bug hider. drop the fallback or fail loudly.
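a sketch of the fail-loud option; the types below are illustrative stand-ins for the router structures, not the real API:

```rust
#[derive(Debug)]
enum RoutingError {
    NoShardForKey(u64),
}

// Stand-in for the real shards table.
struct ShardsTable;

impl ShardsTable {
    fn shard_for(&self, _key: u64) -> Option<u16> {
        None // the real table resolves key -> shard id
    }
}

fn route(table: &ShardsTable, key: u64) -> Result<u16, RoutingError> {
    // Previously: `.unwrap_or_else(|| { warn!(...); 0 })` silently hid routing bugs.
    table.shard_for(key).ok_or(RoutingError::NoShardForKey(key))
}
```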
core/binary_protocol/src/consensus/message.rs:347 — transmute_header does a 256B header copy + 256B memset + Message::try_from(owned) re-validation after the closure has already written a known-valid header. that is a third validation per request (first at transport recv, second in try_into_typed, third here), reached per client message via the new make_deferred_*_handler closures at core/server-ng/src/bootstrap.rs:1086,1116. add a transmute_header_unchecked that skips the trailing TryFrom validation and relies on the closure invariant.
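a sketch of the unchecked variant; Header and Message below are stand-in types, not the binary_protocol ones, and the only point shown is dropping the trailing re-validation in favour of the closure invariant:

```rust
// Stand-in types; only the "skip the trailing re-validation" shape is shown.
struct Header([u8; 256]);
struct Message {
    header: Header,
}

impl Message {
    /// Caller contract: `write_header` must leave a protocol-valid header in
    /// place. Unlike the checked variant, no `TryFrom` re-validation runs here,
    /// so the third per-request validation disappears.
    fn transmute_header_unchecked(mut self, write_header: impl FnOnce(&mut Header)) -> Message {
        write_header(&mut self.header);
        self
    }
}
```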
core/metadata/src/stm/stream.rs:615-676 — Snapshotable::to_snapshot reads round_robin_counter with Ordering::Relaxed during snapshot fill, so concurrent producer increments would race the snapshot read. not exploitable on a single shard today. document the invariant.
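a sketch of the suggested doc comment; the struct, accessor, and atomic width are stand-ins, only the invariant wording matters:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for the stream STM state that `Snapshotable::to_snapshot` reads.
struct StreamState {
    /// INVARIANT: read with `Ordering::Relaxed` during snapshot fill. This is
    /// only safe while the STM runs on a single shard; a concurrent producer
    /// incrementing the counter would race the snapshot read.
    round_robin_counter: AtomicU64,
}

impl StreamState {
    fn snapshot_round_robin(&self) -> u64 {
        self.round_robin_counter.load(Ordering::Relaxed)
    }
}
```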
core/common/src/types/streaming_stats.rs counter ordering — counter fetch_add / fetch_sub use Ordering::AcqRel even though the counters synchronize no other data. Relaxed is correct; AcqRel forces unnecessary ldar / stlr fences on aarch64 on the per-batch produce path. method names like _inconsistent and load_for_snapshot confirm that no consumer relies on the AcqRel half.
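a sketch of the ordering change for standalone counters; the struct and method names are illustrative, not the real streaming_stats API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct StreamingStats {
    messages_count: AtomicU64, // illustrative counter field
}

impl StreamingStats {
    fn increment_messages(&self, n: u64) {
        // Relaxed is enough: the counter synchronizes no other data, and the
        // `_inconsistent` / `load_for_snapshot` style readers tolerate staleness.
        // AcqRel only adds needless ldar/stlr fences on aarch64 per batch.
        self.messages_count.fetch_add(n, Ordering::Relaxed);
    }

    fn load_for_snapshot(&self) -> u64 {
        self.messages_count.load(Ordering::Relaxed)
    }
}
```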
Addressed all of the correctness issues. I've skipped the optimization nits, as this code will change quite a few times before it's used by the main server binary.
hubcio
left a comment
additional findings, all citing code outside this PR's diff hunks:
core/journal/src/prepare_journal.rs:143 - open() does not detect WAL exceeding slot capacity
the file is not in this PR, but the recovery rewrite makes the gap live. live append cannot wrap the 1024-slot ring (panic at prepare_journal.rs:411-443:422-431; SnapshotCoordinator::checkpoint_if_needed at core/metadata/src/impls/metadata.rs:243-260 with CHECKPOINT_MARGIN=64 forces a snapshot near capacity), so the live runtime cannot lose ops. the rebuild loop at lines 143-184, however, silently overwrites slot N % SLOT_COUNT when it reads a WAL file containing more entries than the ring can hold (manual recovery, copy from a peer, mismatched binary). iter_headers_from(0) then returns only the last 1024 slots, and recovery.rs:142 (the recovery caller in this PR) does not detect this. fix: at open, bail when the first slot's op is greater than the expected replay_from, or when the slot count exceeds SLOT_COUNT. defensive and cheap; sketch below.
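a sketch of the open-time guard; SLOT_COUNT and the two bail conditions come from the finding above, everything else (names, error type, how the counts are obtained) is a stand-in:

```rust
const SLOT_COUNT: u64 = 1024; // ring capacity, per the panic guard cited above

#[derive(Debug)]
enum WalOpenError {
    // More WAL entries than the ring can hold: rebuild would wrap and overwrite.
    ExceedsRingCapacity { entries: u64 },
    // Oldest retained op is already past what recovery expects to replay from.
    StartsPastExpectedReplay { first_op: u64, replay_from: u64 },
}

fn validate_wal_at_open(entry_count: u64, first_op: u64, replay_from: u64) -> Result<(), WalOpenError> {
    if entry_count > SLOT_COUNT {
        return Err(WalOpenError::ExceedsRingCapacity { entries: entry_count });
    }
    if first_op > replay_from {
        return Err(WalOpenError::StartsPastExpectedReplay { first_op, replay_from });
    }
    Ok(())
}
```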
core/consensus/src/impls.rs:732 - set_view should take &self
view: Cell<u32> (line 504). set_last_prepare_checksum(&self) (line 779) and set_log_view(&self) (line 788) take &self to mutate identical Cell fields. set_view is the only odd one out at &mut self. Cell allows mutation through shared refs; &mut is not load-bearing. caller bootstrap.rs:251 has let mut consensus = VsrConsensus::new(...) solely because of this method, then later setters work on &self. fix: pub fn set_view(&self, view: u32). one-line.
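the one-line change, sketched against a stand-in struct that keeps only the view: Cell<u32> field:

```rust
use std::cell::Cell;

struct VsrConsensus {
    view: Cell<u32>,
}

impl VsrConsensus {
    // `Cell` already gives interior mutability, so `&self` is enough and matches
    // `set_last_prepare_checksum` / `set_log_view`; the caller at bootstrap.rs:251
    // can then drop `let mut consensus`.
    pub fn set_view(&self, view: u32) {
        self.view.set(view);
    }
}
```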
core/consensus/src/impls.rs:750 - pipeline_mut is dead code
pub const fn pipeline_mut(&mut self) -> &mut RefCell<P>. rg pipeline_mut returns 1 match, the definition itself, with no callers anywhere in core/. unused / dead code should be removed. returning &mut RefCell is doubly redundant: RefCell already provides interior mutability via borrow_mut. the author TODO at line 741 already flags the borrow-across-await footgun on the related pipeline() accessor; same risk applies here, plus the &mut requirement. fix: delete.
Implement Replica bootstrap logic. Currently it does not handle cases where replicas are out of sync (needs State Transfer to be implemented).