Skip to content

branch-4.0: [fix](cloud) fix segment footer CORRUPTION not triggering file cache retry#61387

Closed
Hastyshell wants to merge 1 commit intoapache:branch-4.0from
Hastyshell:fix/CORE-5965-segment-footer-corruption-retry-4.0
Closed

branch-4.0: [fix](cloud) fix segment footer CORRUPTION not triggering file cache retry#61387
Hastyshell wants to merge 1 commit intoapache:branch-4.0from
Hastyshell:fix/CORE-5965-segment-footer-corruption-retry-4.0

Conversation

@Hastyshell
Copy link
Collaborator

Proposed changes

Backport of #61386 to branch-4.0.

The three-tier retry logic in Segment::_open() (static method) was structured as if-else-if, so when open_file() succeeded but _parse_footer() returned CORRUPTION, the retry branch was unreachable.

Root cause

// Before: if-else-if structure
auto st = fs->open_file(path, &file_reader, &reader_options);
if (st) {                                  // open_file succeeded (almost always)
    segment->_file_reader = ...;
    st = segment->_open(stats);            // _parse_footer() → CORRUPTION stored in st
} else if (st.is<CORRUPTION>() && ...) {   // UNREACHABLE: already entered `if` branch
    // Tier 1: clear cache, retry
    // Tier 2: bypass cache, read remote directly
}
RETURN_IF_ERROR(st);  // CORRUPTION returned without any retry

open_file() only opens a file handle and rarely returns CORRUPTION. The actual footer checksum validation happens inside _parse_footer() (called via segment->_open()). Because the retry was in an else if guarded by the same st from open_file(), it was never reachable for the common _parse_footer() corruption case.

Fix

Change else if to a separate if block, so CORRUPTION from either open_file() or _parse_footer() triggers the three-tier retry.

Issue

Observed in cloud (S3) deployments (branch-4.0): schema change fails with CORRUPTION: Bad segment file footer checksum not match. Log analysis confirmed that no retry log messages were ever emitted, consistent with this code bug.

Tests

  • Added TestFooterCorruptionTriggersRetry to segment_corruption_test.cpp
  • Uses the existing Segment::parse_footer:magic_number_corruption sync point to corrupt the footer magic number on the first _parse_footer() call only

Checklist

…retry

The three-tier retry logic in Segment::_open() was structured as if-else-if,
so when open_file() succeeded but _parse_footer() returned CORRUPTION,
the retry branch was unreachable. Change to independent if blocks so that
CORRUPTION from _parse_footer() also triggers cache eviction and retry.

Add TestFooterCorruptionTriggersRetry unit test to cover this path.
@Hastyshell Hastyshell requested a review from yiguolei as a code owner March 16, 2026 10:39
@Thearas
Copy link
Contributor

Thearas commented Mar 16, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Hastyshell Hastyshell closed this Mar 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants