release-26.2: kvserver: stop treating split/merge trigger errors as replica corruption #167377
Merged
trunk-io[bot] merged 1 commit into cockroachdb:release-26.2 on Apr 7, 2026
Previously, `maybeWrapReplicaCorruptionError` in `RunCommitTrigger` escalated any unrecognized error from split/merge trigger evaluation to a `ReplicaCorruptionError`, which crashes the process via `setCorruptRaftMuLocked`. This meant that transient I/O errors (e.g. cloud storage network timeouts during `MVCCIsSpanEmpty`) would fatal the node despite not indicating actual data corruption.

Remove the corruption wrapping so that these errors simply fail the split or merge, which will be retried.

Informs: cockroachdb#165558

Epic: CRDB-61447

Release note (bug fix): Fixed a bug where transient I/O errors (such as cloud storage network timeouts) during split or merge trigger evaluation were misidentified as replica corruption, causing the node to crash. These errors now correctly fail the operation, which is retried automatically.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Thanks for opening a backport. Before merging, please confirm that the change does not break backwards compatibility and otherwise complies with the backport policy. Include a brief release justification in the PR description explaining why the backport is appropriate. All backports must be reviewed by the TL for the owning area. While the stricter LTS policy does not yet apply, please exercise judgment and consider gating non-critical changes behind a disabled-by-default feature flag when appropriate.
dt (Member) approved these changes on Apr 2, 2026.

Author (Member):
/trunk merge

Contributor:
😎 Merged successfully - details.

Author (Member):
/trunk merge
Backport 1/1 commits from #167289 on behalf of @tbg.

Previously, `maybeWrapReplicaCorruptionError` in `RunCommitTrigger` escalated any unrecognized error from split/merge trigger evaluation to a `ReplicaCorruptionError`, which crashes the process via `setCorruptRaftMuLocked`. This meant that transient I/O errors (e.g. cloud storage network timeouts during `MVCCIsSpanEmpty`) would fatal the node despite not indicating actual data corruption.

Remove the corruption wrapping so that these errors simply fail the split or merge, which will be retried.

This is a minimal fix suitable for backporting. Follow-up work can remove the now-no-op `maybeWrapReplicaCorruptionError` wrapper entirely.

Fixes #165558

Epic: CRDB-61447

Release note (bug fix): Fixed a bug where transient I/O errors (such as cloud storage network timeouts) during split or merge trigger evaluation were misidentified as replica corruption, causing the node to crash. These errors now correctly fail the operation, which is retried automatically.

Release justification: Bug fix that prevents spurious replica corruption errors on split/merge trigger failures.