
release-23.1: kvserver: sync before removing sideloaded files #117299

Merged
9 commits merged into cockroachdb:release-23.1 on Jan 15, 2024

Conversation

pav-kv (Collaborator) commented Jan 3, 2024

Backport:

Please see individual PRs for details.

/cc @cockroachdb/release

Release justification: critical bug fix

blathers-crl bot commented Jan 3, 2024

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious
    issues
    or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined
    here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning area's TL and one additional
    TL. For more information on how that review should be conducted, please
    consult the backport policy.
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack
    channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this
backport.

blathers-crl bot added the backport label (PRs that are backports to older release branches) on Jan 3, 2024
cockroach-teamcity (Member)

This change is Reviewable

pav-kv (Collaborator, Author) commented Jan 3, 2024

Depends on #117295

pav-kv marked this pull request as ready for review on January 5, 2024 18:14
pav-kv requested review from a team as code owners on January 5, 2024 18:14
pav-kv (Collaborator, Author) commented Jan 8, 2024

Hm, the test failed in Extended CI with an interesting error:

=== RUN   TestCrashWhileTruncatingSideloadedEntries
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestCrashWhileTruncatingSideloadedEntries412629092
    test_log_scope.go:79: use -show-logs to present logs inline
    client_raft_log_queue_test.go:234: leader replica: [n1,s1,r64/1:/{Table/Max-Max}]
    client_raft_log_queue_test.go:239: follower replica: [n2,s2,r64/2:/{Table/Max-Max}]
    client_raft_log_queue_test.go:245: leader: log indices: [11..20]
    client_raft_log_queue_test.go:246: leader: applied to: 20
    client_raft_log_queue_test.go:245: follower: log indices: [17..20]
    client_raft_log_queue_test.go:246: follower: applied to: 20
    client_raft_log_queue_test.go:271: committed AddSSTs
    client_raft_log_queue_test.go:245: leader: log indices: [11..40]
    client_raft_log_queue_test.go:246: leader: applied to: 40
    client_raft_log_queue_test.go:245: follower: log indices: [17..40]
    client_raft_log_queue_test.go:246: follower: applied to: 20
    client_raft_log_queue_test.go:299: committed truncation for indices < 39
    client_raft_log_queue_test.go:245: leader: log indices: [39..41]
    client_raft_log_queue_test.go:246: leader: applied to: 41
    client_raft_log_queue_test.go:245: follower: log indices: [17..41]
    client_raft_log_queue_test.go:246: follower: applied to: 20
    client_raft_log_queue_test.go:319: unblocked follower application flow
    client_raft_log_queue_test.go:321: follower applied the truncation
    client_raft_log_queue_test.go:245: follower: log indices: [39..41]
    client_raft_log_queue_test.go:246: follower: applied to: 41
    client_raft_log_queue_test.go:345: CRASH!
    client_raft_log_queue_test.go:349: restarting follower
    client_raft_log_queue_test.go:352: FS after restart:
                  /
             892    000002.log
           10853    000010.log
            1816    000012.sst
            1163    000013.sst
            1018    000014.sst
            1120    i21.t6
            1120    i22.t6
            1120    i23.t6
            1120    i24.t6
            1120    i25.t6
            1120    i26.t6
            1120    i27.t6
            1120    i28.t6
            1120    i29.t6
            1120    i30.t6
            1120    i31.t6
            1120    i32.t6
            1120    i33.t6
            1120    i34.t6
            1120    i35.t6
            1120    i36.t6
            1120    i37.t6
            1120    i38.t6
            1120    i39.t6
            1120    i40.t6
              16    CURRENT
               0    LOCK
            2226    MANIFEST-000001
            2538    OPTIONS-000003
               8    STORAGE_MIN_VERSION
                    auxiliary/
                      sideloading/
                        r0XXXX/
                          r64/
            1120            i39.t6
            1120            i40.t6
                      sstsnapshot/
               0    marker.format-version.000012.013
               0    marker.manifest.000001.MANIFEST-000001
F240105 18:43:53.285591 4473 3@pebble/internal/base/filenames.go:147  [n2,s2,pebble] 1  000035.sst:
F240105 18:43:53.285591 4473 3@pebble/internal/base/filenames.go:147  [n2,s2,pebble] 1 +orig err: open 000035.sst: file does not exist

A Pebble file is missing after a crash and restart. I'll investigate; I wonder if this could be a Pebble bug.

erikgrinaker (Contributor) commented Jan 8, 2024

> A Pebble file is missing after a crash and restart. I'll investigate; I wonder if this could be a Pebble bug.

Is this one of the sideloaded files, or just a "regular" LSM SST? Since we don't sync state application, I suppose it's plausible that we could lose a file that was recently flushed from the memtable, but the data is still there in the WAL (and maybe also in the memtable after restart, but not yet flushed).

Have we seen any failures like these in the nightlies?

erikgrinaker (Contributor) commented Jan 8, 2024

Ah, never mind: this is an actual Pebble error, not a failed test assertion. Yeah, I'd pull in storage for a look.

pav-kv (Collaborator, Author) commented Jan 8, 2024

It hasn't been caught in nightlies, but it has occurred twice on 23.1, on this PR.

The strict MemFS can be used to emulate crashes, and to test for the data
durability loss they may cause when fsync is not used correctly.

Epic: none
Release note: none
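As background, here is a minimal illustrative sketch of how Pebble's strict in-memory filesystem can emulate a crash. This is not the test's actual code, and exact vfs method signatures vary between pebble versions; the file names are made up.

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/pebble/vfs"
)

func main() {
	// In strict mode, only state that was explicitly synced (file data and
	// directory entries) survives ResetToSyncedState.
	fs := vfs.NewStrictMem()

	// Write a file, sync it, and sync the parent directory so that both the
	// data and the directory entry are durable.
	f, _ := fs.Create("synced.txt")
	f.Write([]byte("durable"))
	f.Sync()
	f.Close()
	dir, _ := fs.OpenDir("")
	dir.Sync()
	dir.Close()

	// Write another file, but never sync it (nor the directory afterwards).
	g, _ := fs.Create("unsynced.txt")
	g.Write([]byte("volatile"))
	g.Close()

	// Emulate a crash: all state that was not made durable is dropped.
	fs.ResetToSyncedState()

	_, errSynced := fs.Stat("synced.txt")
	_, errUnsynced := fs.Stat("unsynced.txt")
	fmt.Println("synced.txt survived:  ", errSynced == nil)   // true
	fmt.Println("unsynced.txt survived:", errUnsynced == nil) // false
}
```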
The StickyVFSRegistry is only used with the in-memory vfs.MemFS for now.
Return the concrete type from the Get method directly, to avoid
unnecessary casts at the call site.

Epic: none
Release note: none
Epic: none
Release note: none
This commit adds a couple of interceptors into the raft proposal's
lifecycle: one allows blocking the command application flow (without
blocking the entry commit flow), and another allows intercepting commands
after they have been applied to storage and their in-memory side effects
have been applied.

Both interceptors are used in the next commits, which reproduce a
problem in the raft log's sideloaded storage.

Epic: none
Release note: none
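A rough sketch of what such interceptors can look like as testing knobs; the names and shapes below are hypothetical, not CockroachDB's actual knobs.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical testing knobs in the spirit of the commit above; these are not
// the actual CockroachDB knob names.
type testingKnobs struct {
	beforeApply func(cmdID string) // can block command application
	afterApply  func(cmdID string) // observes a command once it is applied
}

func main() {
	unblock := make(chan struct{})
	applied := make(chan string, 1)

	knobs := testingKnobs{
		beforeApply: func(cmdID string) { <-unblock }, // stall application
		afterApply:  func(cmdID string) { applied <- cmdID },
	}

	// Stand-in for the apply loop: the entry is already committed, and its
	// application is gated by the interceptor.
	go func() {
		knobs.beforeApply("cmd-1")
		// ... apply to storage and run in-memory side effects here ...
		knobs.afterApply("cmd-1")
	}()

	time.Sleep(10 * time.Millisecond) // application is blocked at this point
	close(unblock)                    // unblock the application flow
	fmt.Println("applied:", <-applied)
}
```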
This commit adds a knob that allows tests to disable the randomization
which sometimes makes asynchronous log appends synchronous. This is
necessary for tests that rely on the async log appends.

Epic: none
Release note: none
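Purely as an illustration, such a knob could be shaped roughly like this; the field and function names are hypothetical, not the actual CockroachDB knob.

```go
package kvtest

import "math/rand"

// storeTestingKnobs is a hypothetical knob container; the field name is
// illustrative only.
type storeTestingKnobs struct {
	// DisableSyncLogAppendRandomization disables the test-only randomization
	// that sometimes turns asynchronous log appends into synchronous ones.
	DisableSyncLogAppendRandomization bool
}

// useAsyncAppend decides whether a log append may stay asynchronous.
func useAsyncAppend(knobs storeTestingKnobs) bool {
	if knobs.DisableSyncLogAppendRandomization {
		return true // deterministic for tests that rely on async appends
	}
	// Otherwise, randomly force a synchronous append to exercise both paths.
	return rand.Intn(2) == 0
}
```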
To decide whether a log truncation requires a sync, we need to check whether
there are any sideloaded entries in the range of removed log entries. This
commit adds the corresponding helper method to the sideloaded storage.

The existing BytesIfTruncatedFromTo method could be used instead, but it
is more expensive: it fetches stats for all the files it scans, whereas
we only need an existence check and can terminate the scan early.

Epic: none
Release note: none
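For intuition, a sketch of such an early-terminating existence check over the sideloaded directory. The name, signature, and file-name parsing below are assumptions based on this description and on the directory listing in the test output above, not the actual implementation.

```go
package sideload

import (
	"fmt"

	"github.com/cockroachdb/pebble/vfs"
)

// hasAnySideloaded reports whether dir contains at least one sideloaded entry
// file with a log index in [from, to). Hypothetical sketch only; the real
// helper's name and signature may differ.
func hasAnySideloaded(fsys vfs.FS, dir string, from, to uint64) (bool, error) {
	files, err := fsys.List(dir)
	if err != nil {
		return false, err
	}
	for _, name := range files {
		// Sideloaded entry files are named "i<index>.t<term>", e.g. "i39.t6"
		// (see the directory listing in the test failure above).
		var index, term uint64
		if _, err := fmt.Sscanf(name, "i%d.t%d", &index, &term); err != nil {
			continue // not a sideloaded entry file
		}
		if index >= from && index < to {
			// Found an entry in range; unlike BytesIfTruncatedFromTo, we can
			// stop here without stat-ing every file.
			return true, nil
		}
	}
	return false, nil
}
```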
A typical raft log truncation command removes log entries and updates
the state machine in a single Pebble batch. This ensures that the state
machine's TruncatedState is always up-to-date with the log.

However, if some of the truncated entries are stored in the sideloaded
storage (typically AddSST commands), the corresponding files are removed
separately from the Pebble batch (when applying side effects). If the
batch is not synced before the files are removed, a process crash may
cause an inconsistent state: the TruncatedState in the state machine
does not reflect the removal of the entries, but the entry files were
removed. After a restart, raft may try to load these entries in order to
apply them or send them to other replicas. It will find them missing and
enter a crash loop.

This commit syncs the state machine upon applying truncation commands
that remove at least one sideloaded entry, to prevent the issue above.

Epic: none

Release note (bug fix): this commit fixes a bug in raft log truncation
that could lead to crash loops, and to an unrecoverable loss of quorum in
the unlikely worst case that all replicas enter this crash loop. The bug
manifests when a few things coincide: the cluster is running a bulk
write workload (e.g. a schema change, import, or restore), a log truncation
command is in flight, and the CRDB process crashes at an unfortunate moment
(e.g. the process is killed, or kills itself, such as when it detects a
disk stall).
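To make the ordering concrete, here is a minimal sketch of the invariant the fix enforces: the state machine is synced before the sideloaded files are deleted. All type and method names are illustrative, not CockroachDB's actual APIs.

```go
package truncation

import "context"

// stateEngine and sideloadedStorage are illustrative stand-ins.
type stateEngine interface {
	// Sync makes previously written state-machine updates (including the new
	// TruncatedState) durable.
	Sync() error
}

type sideloadedStorage interface {
	// HasAnyEntryBelow reports whether any sideloaded entry file would be
	// removed by truncating entries below index.
	HasAnyEntryBelow(ctx context.Context, index uint64) (bool, error)
	// TruncateTo removes sideloaded entry files with indexes below index.
	TruncateTo(ctx context.Context, index uint64) error
}

// applyTruncationSideEffects removes sideloaded files for a truncation, but
// only after the state machine (with its updated TruncatedState) is durable.
func applyTruncationSideEffects(
	ctx context.Context, eng stateEngine, sl sideloadedStorage, truncatedIndex uint64,
) error {
	has, err := sl.HasAnyEntryBelow(ctx, truncatedIndex)
	if err != nil {
		return err
	}
	if has {
		// Sync first: after a crash, the durable TruncatedState must never
		// claim entries whose files the sideloaded storage already deleted.
		if err := eng.Sync(); err != nil {
			return err
		}
	}
	// Only now is it safe to remove the sideloaded entry files.
	return sl.TruncateTo(ctx, truncatedIndex)
}
```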
This commit adds a regression test for the issue fixed in the previous
commit. Without that fix, the test panics because the sideloaded
storage loses entries that the follower Replica will try to apply after
a restart.

For a detailed explanation of the scenario, see the previous commit and
the comment in the test.

Epic: none
Release note: none
pav-kv (Collaborator, Author) commented Jan 10, 2024

Fixed the flakiness (required backporting #117523 first).

pav-kv (Collaborator, Author) commented Jan 10, 2024

Now there is a legitimate flake. Looks similar to #114191 (comment).

Maybe the 23.1 circuit breakers don't do as good a job of isolating the node as the new ones in master.

@erikgrinaker Do you know: when we trip a circuit breaker, does it try to reconnect and untrip? I'd like to permanently trip it.

erikgrinaker (Contributor)

> Maybe the 23.1 circuit breakers don't do as good a job of isolating the node as the new ones in master.

That's probably true; the circuit breakers use a different implementation in 23.2.

The test is flaky, possibly due to the circuit breakers auto-recovering
after being tripped, whereas we need to permanently trip them.

Epic: none
Release note: none
pav-kv (Collaborator, Author) commented Jan 15, 2024

Skipping the test under stress to land this fix. Filed #117785 to fix the flakiness (likely related to the circuit breaker behaviour).

pav-kv merged commit fccf4ff into cockroachdb:release-23.1 on Jan 15, 2024
5 of 6 checks passed
pav-kv deleted the backport23.1-115595-114191 branch on January 15, 2024 15:13