release-23.1: kvserver: sync before removing sideloaded files #117299
Conversation
Thanks for opening a backport. Please check the backport criteria before merging.

If your backport adds new functionality, please ensure that the additional criteria for new functionality are also satisfied.

Also, please add a brief release justification to the body of your PR to justify this backport.
Force-pushed from 84ed82d to 50b733a
Hm, the test failed in Extended CI with an interesting error:
A Pebble file is missing after a crash+restart. I'll investigate; I wonder if this could be a Pebble bug.
Are these any of the sideloaded files, or just a "regular" LSM SST? Since we don't sync state application, I suppose it's plausible that we could lose a file that was recently flushed from the memtable, but the data is still there in the WAL (and maybe also in the memtable after restart, but not yet flushed). Have we seen any failures like these in the nightlies?
Ah, never mind, this is an actual Pebble error, not a failed test assertion. Yeah, I'd pull in storage for a look.
Hasn't been caught in nightlies, but has occurred twice on 23.1, on this PR.
The strict MemFS can be used to emulate crashes, and to test for the data durability loss that crashes can cause when fsync is not used correctly.

Epic: none
Release note: none
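For context, here is a minimal sketch of how a strict MemFS can be exercised, assuming the pebble vfs API of this era (vfs.NewStrictMem, ResetToSyncedState, and a single-argument Create); it is an illustration, not a test from the PR:

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/pebble/vfs"
)

func main() {
	// Strict MemFS tracks which writes were synced; anything unsynced is
	// dropped when we emulate a crash below.
	fs := vfs.NewStrictMem()

	write := func(name string, sync bool) {
		f, err := fs.Create(name)
		if err != nil {
			panic(err)
		}
		if _, err := f.Write([]byte("payload")); err != nil {
			panic(err)
		}
		if sync {
			if err := f.Sync(); err != nil {
				panic(err)
			}
		}
		_ = f.Close()
	}
	write("synced", true)
	write("unsynced", false)

	// Sync the directory too: in strict mode, unsynced directory entries
	// may also be lost on "crash".
	if dir, err := fs.OpenDir("/"); err == nil {
		_ = dir.Sync()
		_ = dir.Close()
	}

	// Emulate a crash+restart: roll the filesystem back to its last
	// synced state.
	fs.ResetToSyncedState()

	// Expect the synced file to keep its data, and the unsynced write to
	// be gone (an empty or missing file, depending on what was synced).
	for _, name := range []string{"synced", "unsynced"} {
		fi, err := fs.Stat(name)
		if err != nil {
			fmt.Printf("%s: gone (%v)\n", name, err)
			continue
		}
		fmt.Printf("%s: %d bytes survived\n", name, fi.Size())
	}
}
```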
The StickyVFSRegistry is currently only used with the in-memory vfs.MemFS. Return the concrete type from the Get method directly, to avoid unnecessary casts at call sites.

Epic: none
Release note: none
Epic: none
Release note: none
This commit adds a couple of interceptors into the raft proposal's lifecycle: one allows blocking the command application flow (without blocking the entry commit flow), and another allows intercepting commands after they have been applied to storage and their in-memory side effects have been applied. Both interceptors are used in the next commits, which reproduce a problem in the raft log's sideloaded storage.

Epic: none
Release note: none
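The interceptor shape described above can be sketched as a pair of optional test hooks. This is a hypothetical, simplified rendering; the actual kvserver testing knobs have different names and signatures:

```go
package main

import "fmt"

// applyKnobs is a hypothetical stand-in for kvserver testing knobs.
type applyKnobs struct {
	// beforeApply runs before a committed command is applied to the state
	// machine; a test that blocks in it stalls application without
	// blocking the entry commit (log append) flow.
	beforeApply func(cmdID string)
	// afterApply runs after a command's writes are applied to storage and
	// its in-memory side effects have run.
	afterApply func(cmdID string)
}

type replica struct{ knobs applyKnobs }

func (r *replica) applyCommand(cmdID string) {
	if fn := r.knobs.beforeApply; fn != nil {
		fn(cmdID)
	}
	// ... apply writes to storage, then in-memory side effects ...
	if fn := r.knobs.afterApply; fn != nil {
		fn(cmdID)
	}
}

func main() {
	r := &replica{knobs: applyKnobs{
		beforeApply: func(id string) { fmt.Println("about to apply", id) },
		afterApply:  func(id string) { fmt.Println("applied", id) },
	}}
	r.applyCommand("log-truncation")
}
```

Separating the two hooks lets a test hold a command in the window between commit and application, which is the kind of window the later commits use to reproduce the sideloaded-storage problem.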
This commit adds a knob that allows tests to disable the randomization which sometimes makes asynchronous log appends synchronous. This is necessary for tests that rely on the async log appends.

Epic: none
Release note: none
To decide whether a log truncation requires a sync, we need to check whether there are any sideloaded entries in the removed range of log entries. This commit adds the corresponding helper method to the sideloaded storage. The existing BytesIfTruncatedFromTo method could be used instead, but it is more expensive: it fetches stats for all the files it scans, whereas we only need an existence check and can terminate the scan early.

Epic: none
Release note: none
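As an illustration of the early-terminating existence check this commit describes, here is a hedged sketch. The helper name, its signature, and the directory-listing approach are assumptions for illustration; it assumes a file-per-entry layout named i&lt;index&gt;.t&lt;term&gt;, which matches the conventional sideload layout:

```go
package sideload

import (
	"fmt"

	"github.com/cockroachdb/errors/oserror"
	"github.com/cockroachdb/pebble/vfs"
)

// hasSideloadedEntry reports whether any sideloaded entry file exists in
// the log index span [from, to). Unlike a stats-collecting scan, it stops
// at the first match.
func hasSideloadedEntry(fs vfs.FS, dir string, from, to uint64) (bool, error) {
	names, err := fs.List(dir)
	if err != nil {
		if oserror.IsNotExist(err) {
			return false, nil // no sideloaded dir: no entries to find
		}
		return false, err
	}
	for _, name := range names {
		var index, term uint64
		// Assumed layout: one file per entry, named by index and term.
		if _, err := fmt.Sscanf(name, "i%d.t%d", &index, &term); err != nil {
			continue // not a sideloaded entry file
		}
		if index >= from && index < to {
			return true, nil // early exit: existence is all we need
		}
	}
	return false, nil
}
```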
A typical raft log truncation command removes log entries and updates the state machine in a single Pebble batch. This ensures that the state machine's TruncatedState is always up to date with the log. However, if some of the truncated entries are stored in the sideloaded storage (typically AddSST commands), the corresponding files are removed separately from the Pebble batch, when applying side effects. If the batch is not synced before the files are removed, a process crash can leave an inconsistent state: the TruncatedState in the state machine does not reflect the removal of the entries, but the entry files are already gone. After a restart, raft may try to load these entries in order to apply them or send them to other replicas; it will find them missing and crash loop.

This commit syncs the state machine upon applying truncation commands that remove at least one sideloaded entry, to prevent the issue above.

Epic: none

Release note (bug fix): Fixed a bug in raft log truncation that could lead to crash loops, and, in the unlikely worst case that all replicas enter the crash loop, to unrecoverable loss of quorum. The bug could manifest when a few things coincided: the cluster was running a bulk write workload (e.g. a schema change, import, or restore), a log truncation command was in flight, and the CRDB process crashed at an unfortunate moment (e.g. it was killed, or killed itself, for instance upon detecting a disk stall).
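To make the ordering concrete, here is a minimal sketch of the sync-before-remove discipline the fix enforces. It uses the real pebble batch-commit API, but the function, its parameters, and the surrounding structure are illustrative assumptions, not the actual kvserver code:

```go
package sideload

import (
	"github.com/cockroachdb/errors/oserror"
	"github.com/cockroachdb/pebble"
	"github.com/cockroachdb/pebble/vfs"
)

// commitTruncation sketches the ordering the fix enforces. The batch
// carries the state machine update (including TruncatedState); the
// sideloaded entry files live outside Pebble and are removed separately.
func commitTruncation(batch *pebble.Batch, fs vfs.FS, sideloadedFiles []string) error {
	// Sync the batch iff files will be removed outside of it. Without the
	// sync, a crash could make the file removals durable while losing the
	// TruncatedState update, leaving raft referencing missing entries.
	opts := pebble.NoSync
	if len(sideloadedFiles) > 0 {
		opts = pebble.Sync
	}
	if err := batch.Commit(opts); err != nil {
		return err
	}
	// Only once the state update is durable is it safe to delete files.
	for _, name := range sideloadedFiles {
		if err := fs.Remove(name); err != nil && !oserror.IsNotExist(err) {
			return err
		}
	}
	return nil
}
```

Syncing only when sideloaded files are about to be removed keeps the common case, truncations with no sideloaded entries, free of the extra fsync.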
This commit adds a regression test for the issue fixed in the previous commit. Without that fix, the test panics because the sideloaded storage loses entries that the follower Replica tries to apply after a restart. For a detailed explanation of the scenario, see the previous commit and the comment in the test.

Epic: none
Release note: none
Force-pushed from 50b733a to 1b1da17
Fixed the flakiness (required backporting #117523 first).
Now there is a legitimate flake. Looks similar to #114191 (comment). @erikgrinaker Do you know: when we trip a circuit breaker, does it try to reconnect and untrip? I'd like to permanently trip it.
That's probably true; the circuit breakers use a different implementation in 23.2.
The test is flaky, possibly because the circuit breakers auto-recover after being tripped, whereas we need to permanently trip them.

Epic: none
Release note: none
Skipping the test under stress to land this fix. Filed #117785 to fix the flakiness (likely has to do with the circuit breaker behaviours).
Backport:
Please see individual PRs for details.
/cc @cockroachdb/release
Release justification: critical bug fix