feat(cluster): implement replica bootstrap #3163
Conversation
Codecov Report
❌ The patch check failed because the patch coverage (9.07%) is below the target coverage (50.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files:
@@ Coverage Diff @@
## master #3163 +/- ##
============================================
- Coverage 74.46% 71.09% -3.37%
Complexity 943 943
============================================
Files 1188 1190 +2
Lines 106543 102840 -3703
Branches 83560 79874 -3686
============================================
- Hits 79332 73111 -6221
- Misses 24459 26776 +2317
- Partials 2752 2953 +201
Force-pushed from 39ce6cb to a1d20d7
hubcio
left a comment
out-of-diff findings:
core/partitions/src/messages_writer.rs:60-63 — let _ = file.sync_all().await.map_err(...) swallows the IO error on the file_exists open path. EIO at boot indicates dying media, yet bootstrap proceeds with a stale view of the disk. this is already on master HEAD, but server-ng makes the path live for every partition. propagate the error.
core/partitions/src/iggy_index_writer.rs:53 — let _ = file.sync_all().await; same pattern, without even a map_err. propagate (sketch below).
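a minimal sketch of the propagation fix for both spots, assuming the enclosing functions return io::Result (or something io::Error converts into); the function name is illustrative:

```rust
use std::io;

// Illustrative sketch only: this stands in for the open paths in
// messages_writer.rs / iggy_index_writer.rs. The point is to bubble the
// sync error up instead of `let _ = ...` dropping EIO at boot.
async fn sync_existing_file(file: &compio::fs::File) -> io::Result<()> {
    file.sync_all().await?;
    Ok(())
}
```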
core/partitions/src/messages_writer.rs:137 — let chunk_vec: Vec<_> = chunk.to_vec(); allocates a Vec per chunk on every save_frozen_batches call, a steady-state per-batch allocation on what is now the primary write path. cache a reusable Vec in the writer or refactor the compio iov call to take a borrowed slice.
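a sketch of the cached-buffer option; the struct fields and `write_owned` are hypothetical stand-ins for the real writer and the compio call that takes ownership of the buffer and hands it back:

```rust
// Sketch of the reuse pattern only; not the real messages_writer.rs API.
struct MessagesWriter {
    chunk_buf: Vec<u8>, // cached across save_frozen_batches calls
}

impl MessagesWriter {
    async fn save_chunk(&mut self, chunk: &[u8]) -> std::io::Result<()> {
        // Refill the cached Vec instead of `chunk.to_vec()` allocating per chunk.
        self.chunk_buf.clear();
        self.chunk_buf.extend_from_slice(chunk);
        // Owned-buffer IO: hand the Vec over, get it back afterwards.
        let buf = std::mem::take(&mut self.chunk_buf);
        let buf = write_owned(buf).await?; // hypothetical owned-buffer write
        self.chunk_buf = buf; // keep the allocation for the next batch
        Ok(())
    }
}

// Placeholder for the real write; returns the buffer so the writer can reuse it.
async fn write_owned(buf: Vec<u8>) -> std::io::Result<Vec<u8>> {
    Ok(buf)
}
```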
core/shard/src/lib.rs:622 — let namespaces: Vec<_> = planes.1.0.namespaces().copied().collect(); allocates a fresh Vec per inbox-frame iteration, and core/shard/src/router.rs:270 calls process_loopback after every dispatched frame in steady state. hoist a namespaces_buf: Vec<u64> like the existing loopback_buf, or short-circuit when the drained partitions / loopback are empty.
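a sketch of hoisting the allocation the same way as the existing loopback_buf; the struct and method are stand-ins, not the real shard types:

```rust
// Stand-in types; only the buffer-hoisting shape matters here.
struct Shard {
    namespaces_buf: Vec<u64>, // reused across inbox-frame iterations
}

impl Shard {
    fn process_loopback(&mut self, namespaces: impl Iterator<Item = u64>) {
        // Refill the cached buffer instead of collecting a fresh Vec per frame.
        self.namespaces_buf.clear();
        self.namespaces_buf.extend(namespaces);
        // Short-circuit when there is nothing to drain.
        if self.namespaces_buf.is_empty() {
            return;
        }
        // ... drain partitions / loopback using self.namespaces_buf ...
    }
}
```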
core/shard/src/router.rs:94-103 and 199-208 — shards_table.shard_for(...).unwrap_or_else(|| { warn!(...); 0 }) silently falls back to shard 0. single-shard server-ng masks this today; with multiple shards it becomes a silent routing-bug hider. drop the fallback or fail loudly.
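a sketch of the fail-loud option; the types below are illustrative stand-ins for the router structures, not the real API:

```rust
#[derive(Debug)]
enum RoutingError {
    NoShardForKey(u64),
}

// Stand-in for the real shards table.
struct ShardsTable;

impl ShardsTable {
    fn shard_for(&self, _key: u64) -> Option<u16> {
        None // the real table resolves key -> shard id
    }
}

fn route(table: &ShardsTable, key: u64) -> Result<u16, RoutingError> {
    // Previously: `.unwrap_or_else(|| { warn!(...); 0 })` silently hid routing bugs.
    table.shard_for(key).ok_or(RoutingError::NoShardForKey(key))
}
```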
core/binary_protocol/src/consensus/message.rs:347 — transmute_header does a 256B header copy + 256B memset + Message::try_from(owned) re-validation after the closure has already written a known-valid header. that is a third validation per request (first at transport recv, second in try_into_typed, third here), reached per client message via the new make_deferred_*_handler closures at core/server-ng/src/bootstrap.rs:1086,1116. add a transmute_header_unchecked that skips the trailing TryFrom validation and relies on the closure invariant.
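a sketch of the unchecked variant; Header and Message below are stand-in types, not the binary_protocol ones, and the only point shown is dropping the trailing re-validation in favour of the closure invariant:

```rust
// Stand-in types; only the "skip the trailing re-validation" shape is shown.
struct Header([u8; 256]);
struct Message {
    header: Header,
}

impl Message {
    /// Caller contract: `write_header` must leave a protocol-valid header in
    /// place. Unlike the checked variant, no `TryFrom` re-validation runs here,
    /// so the third per-request validation disappears.
    fn transmute_header_unchecked(mut self, write_header: impl FnOnce(&mut Header)) -> Message {
        write_header(&mut self.header);
        self
    }
}
```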
core/metadata/src/stm/stream.rs:615-676 — Snapshotable::to_snapshot reads round_robin_counter with Ordering::Relaxed during snapshot fill, so concurrent producer increments would race the snapshot read. not exploitable on a single shard today. document the invariant.
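a sketch of the suggested doc comment; the struct, accessor, and atomic width are stand-ins, only the invariant wording matters:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for the stream STM state that `Snapshotable::to_snapshot` reads.
struct StreamState {
    /// INVARIANT: read with `Ordering::Relaxed` during snapshot fill. This is
    /// only safe while the STM runs on a single shard; a concurrent producer
    /// incrementing the counter would race the snapshot read.
    round_robin_counter: AtomicU64,
}

impl StreamState {
    fn snapshot_round_robin(&self) -> u64 {
        self.round_robin_counter.load(Ordering::Relaxed)
    }
}
```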
core/common/src/types/streaming_stats.rs counter ordering — counter fetch_add / fetch_sub use Ordering::AcqRel even though the counters synchronize no other data. Relaxed is correct; AcqRel forces unnecessary ldar / stlr fences on aarch64 on the per-batch produce path. method names like _inconsistent and load_for_snapshot confirm that no consumer relies on the AcqRel half.
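a sketch of the ordering change for standalone counters; the struct and method names are illustrative, not the real streaming_stats API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct StreamingStats {
    messages_count: AtomicU64, // illustrative counter field
}

impl StreamingStats {
    fn increment_messages(&self, n: u64) {
        // Relaxed is enough: the counter synchronizes no other data, and the
        // `_inconsistent` / `load_for_snapshot` style readers tolerate staleness.
        // AcqRel only adds needless ldar/stlr fences on aarch64 per batch.
        self.messages_count.fetch_add(n, Ordering::Relaxed);
    }

    fn load_for_snapshot(&self) -> u64 {
        self.messages_count.load(Ordering::Relaxed)
    }
}
```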
Addressed all of the correctness issues. I've skipped the optimization nits, as this code will change quite a few times before it's used by the main server binary.
hubcio
left a comment
additional findings, all citing code outside this PR's diff hunks:
core/journal/src/prepare_journal.rs:143 - open() does not detect WAL exceeding slot capacity
the file is not in this PR, but the recovery rewrite makes the gap live. live append cannot wrap the 1024-slot ring (panic at prepare_journal.rs:411-443:422-431; SnapshotCoordinator::checkpoint_if_needed at core/metadata/src/impls/metadata.rs:243-260 with CHECKPOINT_MARGIN=64 forces a snapshot near capacity), so the live runtime cannot lose ops. the rebuild loop at lines 143-184, however, silently overwrites slot N % SLOT_COUNT when it reads a WAL file containing more entries than the ring can hold (manual recovery, copy from a peer, mismatched binary). iter_headers_from(0) then returns only the last 1024 slots, and recovery.rs:142 (the recovery caller in this PR) does not detect this. fix: at open, bail when the first slot's op is greater than the expected replay_from, or when the slot count exceeds SLOT_COUNT. defensive and cheap; sketch below.
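a sketch of the open-time guard; SLOT_COUNT and the two bail conditions come from the finding above, everything else (names, error type, how the counts are obtained) is a stand-in:

```rust
const SLOT_COUNT: u64 = 1024; // ring capacity, per the panic guard cited above

#[derive(Debug)]
enum WalOpenError {
    // More WAL entries than the ring can hold: rebuild would wrap and overwrite.
    ExceedsRingCapacity { entries: u64 },
    // Oldest retained op is already past what recovery expects to replay from.
    StartsPastExpectedReplay { first_op: u64, replay_from: u64 },
}

fn validate_wal_at_open(entry_count: u64, first_op: u64, replay_from: u64) -> Result<(), WalOpenError> {
    if entry_count > SLOT_COUNT {
        return Err(WalOpenError::ExceedsRingCapacity { entries: entry_count });
    }
    if first_op > replay_from {
        return Err(WalOpenError::StartsPastExpectedReplay { first_op, replay_from });
    }
    Ok(())
}
```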
core/consensus/src/impls.rs:732 - set_view should take &self
view: Cell<u32> (line 504). set_last_prepare_checksum(&self) (line 779) and set_log_view(&self) (line 788) take &self to mutate identical Cell fields. set_view is the only odd one out at &mut self. Cell allows mutation through shared refs; &mut is not load-bearing. caller bootstrap.rs:251 has let mut consensus = VsrConsensus::new(...) solely because of this method, then later setters work on &self. fix: pub fn set_view(&self, view: u32). one-line.
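the one-line change, sketched against a stand-in struct that keeps only the view: Cell<u32> field:

```rust
use std::cell::Cell;

struct VsrConsensus {
    view: Cell<u32>,
}

impl VsrConsensus {
    // `Cell` already gives interior mutability, so `&self` is enough and matches
    // `set_last_prepare_checksum` / `set_log_view`; the caller at bootstrap.rs:251
    // can then drop `let mut consensus`.
    pub fn set_view(&self, view: u32) {
        self.view.set(view);
    }
}
```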
core/consensus/src/impls.rs:750 - pipeline_mut is dead code
pub const fn pipeline_mut(&mut self) -> &mut RefCell<P>. rg pipeline_mut returns 1 match, the definition itself, with no callers anywhere in core/. unused / dead code should be removed. returning &mut RefCell is doubly redundant: RefCell already provides interior mutability via borrow_mut. the author TODO at line 741 already flags the borrow-across-await footgun on the related pipeline() accessor; same risk applies here, plus the &mut requirement. fix: delete.
Implement Replica bootstrap logic. Currently it does not handle cases where replicas are out of sync (needs State Transfer to be implemented).