release-25.4: kvserver: stop treating split/merge trigger errors as replica corruption #168013
Merged
trunk-io[bot] merged 1 commit into cockroachdb:release-25.4 on Apr 10, 2026
Previously, `maybeWrapReplicaCorruptionError` in `RunCommitTrigger` escalated any unrecognized error from split/merge trigger evaluation to a `ReplicaCorruptionError`, which crashes the process via `setCorruptRaftMuLocked`. This meant that transient I/O errors (e.g. cloud storage network timeouts during `MVCCIsSpanEmpty`) would fatal the node despite not indicating actual data corruption.

Remove the corruption wrapping so that these errors simply fail the split or merge, which will be retried.

Informs: cockroachdb#165558
Epic: CRDB-61447

Release note (bug fix): Fixed a bug where transient I/O errors (such as cloud storage network timeouts) during split or merge trigger evaluation were misidentified as replica corruption, causing the node to crash. These errors now correctly fail the operation, which is retried automatically.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Thanks for opening a backport. Before merging, please confirm that it falls into one of the following categories (select one):
Add a brief release justification to the PR description explaining your selection. Also, confirm that the change does not break backward compatibility and complies with all aspects of the backport policy. All backports must be reviewed by the TL and EM for the owning area.
arulajmani (Member) approved these changes on Apr 10, 2026
Backport 1/1 commits from #167289.
/cc @cockroachdb/release
Previously, `maybeWrapReplicaCorruptionError` in `RunCommitTrigger`
escalated any unrecognized error from split/merge trigger evaluation to
a `ReplicaCorruptionError`, which crashes the process via
`setCorruptRaftMuLocked`. This meant that transient I/O errors (e.g.
cloud storage network timeouts during `MVCCIsSpanEmpty`) would fatal
the node despite not indicating actual data corruption.

Remove the corruption wrapping so that these errors simply fail the
split or merge, which will be retried.

This is a minimal fix suitable for backporting. Follow-up work can
remove the now-no-op `maybeWrapReplicaCorruptionError` wrapper entirely.

Fixes-26.2: #165558
Epic: CRDB-61447
Release note (bug fix): Fixed a bug where transient I/O errors (such
as cloud storage network timeouts) during split or merge trigger
evaluation were misidentified as replica corruption, causing the node
to crash. These errors now correctly fail the operation, which is
retried automatically.
Release justification: bug fix: transient I/O errors during split/merge incorrectly crash the node