fix(store): scope FSM_SYNC_MODE=nosync to raft-apply callers only#600
fix(store): scope FSM_SYNC_MODE=nosync to raft-apply callers only#600
Conversation
Previously ApplyMutations/DeletePrefixAt used s.fsmApplyWriteOpts unconditionally, which made ELASTICKV_FSM_SYNC_MODE=nosync affect direct (non-raft) callers that do NOT have raft-log replay as a durability backstop. Concrete production path: distribution.EnsureCatalogSnapshot -> CatalogStore.Save -> store.ApplyMutations. If the process crashed before Pebble flushed, those acknowledged writes could be lost with no raft entry to re-apply. This change splits the API: - ApplyMutations / DeletePrefixAt: always pebble.Sync. Safe for any caller, including those that bypass raft (catalog bootstrap, admin snapshots, migrations, tests). - ApplyMutationsRaft / DeletePrefixAtRaft: governed by s.fsmApplyWriteOpts (ELASTICKV_FSM_SYNC_MODE). Intended solely for the FSM apply loop; only the raft WAL fsync makes pebble.NoSync safe. kv/fsm.go is the only caller of the new *Raft methods. CatalogStore and every adapter test retain the always-sync ApplyMutations, so the nosync opt-in can no longer silently weaken durability outside the raft-apply path. MVCCStore / ShardStore / LeaderRoutedStore all implement both variants; the in-memory mvccStore has no WAL so both delegate to the same body. Tests: - Added TestDirectApplyWriteOpts_AlwaysSync asserting the direct path always resolves to pebble.Sync even when the FSM-apply mode is nosync. - Added TestDirectApplyMutations_NoSyncConfigured_StillWritesDurably for functional coverage of the public entry points under a NoSync-configured store. - Existing sync-mode tests + benchmark renamed to *Raft to reflect that the knob now only governs the raft-apply path. Addresses codex P2 on #592.
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 34 minutes and 39 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (8)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Codex Review: Didn't find any major issues. Can't wait for the next one! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Summary
Follow-up to #592 addressing the codex P2 correctness finding:
ELASTICKV_FSM_SYNC_MODE=nosyncpreviously affected direct (non-raft) callers ofstore.ApplyMutations/store.DeletePrefixAt, which have no raft-log replay as a durability backstop. Splits the API so the knob only relaxes fsync on the raft-apply path.ApplyMutations/DeletePrefixAtare now unconditionallypebble.Sync. Safe for any direct caller (catalog bootstrap, admin snapshots, migrations, tests).ApplyMutationsRaft/DeletePrefixAtRaftare the new raft-apply entry points; they observes.fsmApplyWriteOpts. Called only fromkv/fsm.go.Code evidence for the bug
distribution/catalog.go:630callss.store.ApplyMutationsdirectly fromCatalogStore.applySaveMutations.main.go:749wiresdistribution.NewCatalogStorearound the rawraftGroupRuntime.store(astore.MVCCStore), soEnsureCatalogSnapshot -> CatalogStore.Save -> store.ApplyMutationsis a production path that never goes through the raft apply loop. Undernosync, a crash before Pebble flush would lose acknowledged catalog writes with no raft entry to replay — exactly the failure mode codex P2 described.Before / after per call site
kv/fsm.go(raft apply: raw / 1PC / prepare / commit / abort / DEL_PREFIX)s.fsmApplyWriteOptss.fsmApplyWriteOpts(via*Raft) — unchangeddistribution/catalog.goCatalogStore.applySaveMutationss.fsmApplyWriteOpts(UNSAFE)pebble.Sync(always)adapter/*_test.godirectApplyMutations/DeletePrefixAtcallss.fsmApplyWriteOptspebble.Sync(always)kv/shard_store.go,kv/leader_routed_store.gowrappersTest plan
TestDirectApplyWriteOpts_AlwaysSyncasserts the direct path resolves topebble.Synceven when the FSM-apply mode isnosync.TestDirectApplyMutations_NoSyncConfigured_StillWritesDurablyexercises the publicApplyMutations/DeletePrefixAtentry points under a NoSync-configured store and confirms data is visible after a clean reopen.TestApplyMutations_NoSync*andBenchmarkApplyMutations_SyncModerenamed to*Raftto reflect that the knob now only governs the raft-apply path.go test -race -count=1 ./...— all packages pass.golangci-lint run ./...— 0 issues./gemini review
@codex review