release-26.1: kvserver: stop wrapping AbortSpan errors as ReplicaCorruptionError #168016
|
😎 Merged successfully - details. |
|
Thanks for opening a backport. Before merging, please confirm that the change does not break backwards compatibility and otherwise complies with the backport policy. Include a brief release justification in the PR description explaining why the backport is appropriate. All backports must be reviewed by the TL for the owning area. While the stricter LTS policy does not yet apply, please exercise judgment and consider gating non-critical changes behind a disabled-by-default feature flag when appropriate. |
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This is the last remaining production call site that wraps errors as `ReplicaCorruptionError`. A failure to read from the AbortSpan is not indicative of replica corruption and should not crash the node.
This is a minimal fix suitable for backporting. A follow-up commit removes the now-dead `ReplicaCorruptionError` infrastructure entirely.
Informs: cockroachdb#165558
Release note (bug fix): Fixed a bug where transient I/O errors reading from the AbortSpan were misidentified as replica corruption, causing the node to crash. These errors are now returned to the caller as regular errors.
Epic: none
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Detected infrastructure failure (matched: self-hosted runner lost communication with the server). Automatically rerunning failed jobs. (run link) |
b89c12a into cockroachdb:release-26.1
Backport 1/2 commits from #167295.
/cc @cockroachdb/release
AbortSpan read errors were wrapped as `ReplicaCorruptionError`, causing the node to fatal via `setCorruptRaftMuLocked`. This was overly aggressive: a failure to read from the AbortSpan is not indicative of replica corruption. Transient I/O errors would crash the node instead of being returned to the caller.
This is the last remaining production call site that produces `ReplicaCorruptionError` (the split/merge trigger wrapping was removed in #167289).
Informs: #165558
Epic: none
Release justification: Low-risk bug fix. One-line change that stops wrapping a non-corruption error as `ReplicaCorruptionError`, preventing unnecessary node crashes on transient I/O errors during AbortSpan reads.