[opt](segment) Ignore not-found segments in query and load paths#61844
[opt](segment) Ignore not-found segments in query and load paths#61844dataroaring wants to merge 10 commits intoapache:masterfrom
Conversation
When a segment file is missing (e.g., removed by GC or external cause), queries and loads now skip the missing segment instead of failing with IO error. Controlled by mutable config `ignore_not_found_segment` (default true), togglable at runtime via BE HTTP API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
Add regression test that verifies ignore_not_found_segment behavior end-to-end using debug point injection on a real BE cluster. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR adds a BE runtime config to tolerate missing native OLAP segment files by skipping NOT_FOUND segments in several segment-loading paths, aiming to avoid user-visible query/load failures when segment files are missing.
Changes:
- Add mutable BE config
ignore_not_found_segment(defaulttrue) to control skipping behavior. - Skip NOT_FOUND segments in
SegmentLoader::load_segments,BetaRowset::load_segments, andLazyInitSegmentIterator::init/next_batch. - Add UT coverage via
IgnoreNotFoundSegmentTestusing a debug-point injection inBetaRowset::load_segment.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| be/src/common/config.h | Declares new mutable config ignore_not_found_segment. |
| be/src/common/config.cpp | Defines new mutable config with default true. |
| be/src/storage/segment/segment_loader.cpp | Skips NOT_FOUND segments during bulk segment loading. |
| be/src/storage/segment/lazy_init_segment_iterator.h | Makes next_batch()/current_block_row_locations() return EOF when inner iterator is null. |
| be/src/storage/segment/lazy_init_segment_iterator.cpp | Ignores NOT_FOUND in init() when config enabled (leaves inner iterator null). |
| be/src/storage/rowset/beta_rowset.cpp | Skips NOT_FOUND in load_segments(); adds debug-point injection for NOT_FOUND in load_segment(). |
| be/test/storage/segment/ignore_not_found_segment_test.cpp | New UTs covering config on/off behaviors for the three paths. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) { | ||
| LOG(WARNING) << "segment not found, skip it. seg_id=" << _segment_id; | ||
| // _inner_iterator remains nullptr, next_batch() will return EOF | ||
| return Status::OK(); |
There was a problem hiding this comment.
LazyInitSegmentIterator::init() logs only seg_id when skipping NOT_FOUND, which makes correlating the warning to a specific tablet/rowset difficult in production. Please include at least the rowset id (and ideally tablet id / segment path if available) in the warning, and consider rate-limiting if this can be hit repeatedly.
| if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) { | ||
| LOG(WARNING) << "segment not found, skip it. rowset_id=" << rowset_id() | ||
| << ", seg_id=" << seg_id; | ||
| seg_id++; | ||
| continue; |
There was a problem hiding this comment.
Like SegmentLoader, BetaRowset::load_segments() now logs a WARNING per missing segment. If a rowset has many missing segments (or the scan is retried), this can generate a large volume of logs. Consider rate limiting and/or logging a summary once per rowset (e.g., number of skipped segments) to reduce operational noise.
| RowsetMetaPB pb; | ||
| json2pb::JsonToProtoMessage(json, &pb); | ||
| pb.set_start_version(0); | ||
| pb.set_end_version(1); | ||
| pb.set_num_segments(num_segments); | ||
| rsm->init_from_pb(pb); |
There was a problem hiding this comment.
The return value from json2pb::JsonToProtoMessage() is ignored here. Other tests in the repo treat this as a bool and assert success; if the JSON format changes, silently proceeding can make failures harder to diagnose. Please capture the return value and ASSERT_TRUE/EXPECT_TRUE it (or use a helper that returns Status).
| void TearDown() override { | ||
| DebugPoints::instance()->clear(); | ||
| config::ignore_not_found_segment = _saved_ignore; | ||
| config::enable_debug_points = _saved_debug_points; | ||
|
|
||
| ExecEnv::GetInstance()->set_segment_loader(_saved_segment_loader); | ||
| delete _segment_loader; | ||
| _segment_loader = nullptr; |
There was a problem hiding this comment.
The test fixture uses DebugPoints::instance()->clear() in TearDown(), which removes all debug points for the entire process. This can create test-order coupling if other tests in the same binary rely on debug points. Prefer removing only the points added by this test (e.g., remove("BetaRowset::load_segment.return_not_found")) or using an RAII helper that adds/removes a named debug point and restores config::enable_debug_points.
| ASSERT_EQ(0, handle.get_segments().size()); | ||
| ASSERT_TRUE(handle.is_inited()); | ||
| } | ||
|
|
There was a problem hiding this comment.
The new UTs validate that SegmentLoader::load_segments() returns OK when segments are missing, but they don't cover any real caller that relies on SegmentCacheHandle::get_segments() being indexable by seg_id (e.g., code paths that do segments[seg_id]). Given the change in semantics, please add a regression test exercising such a caller (or explicitly validate the seg_id->vector-index contract you intend to provide).
| // Regression test: validate that SegmentCacheHandle::get_segments() can be indexed by seg_id | |
| // after a successful SegmentLoader::load_segments() call. This mimics real callers that do | |
| // `segments[seg_id]` and relies on the seg_id -> vector-index contract. | |
| TEST_F(IgnoreNotFoundSegmentTest, SegmentLoaderSegmentsIndexableBySegId) { | |
| config::ignore_not_found_segment = true; | |
| auto rowset = create_rowset(3); | |
| // Do not inject NOT_FOUND for this test; we want all segments to load successfully | |
| SegmentLoader loader(1024 * 1024, 100); | |
| SegmentCacheHandle handle; | |
| auto st = loader.load_segments(rowset, &handle, false); | |
| ASSERT_TRUE(st.ok()) << st; | |
| ASSERT_TRUE(handle.is_inited()); | |
| const auto& segments = handle.get_segments(); | |
| // Expect that we can index by seg_id in [0, 3) | |
| ASSERT_GE(segments.size(), 3); | |
| for (int seg_id = 0; seg_id < 3; ++seg_id) { | |
| // Real callers rely on segments[seg_id] being valid for each seg_id | |
| ASSERT_NE(nullptr, segments[seg_id]); | |
| } | |
| } |
| if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) { | ||
| LOG(WARNING) << "segment not found, skip it. rowset_id=" << rowset->rowset_id() | ||
| << ", seg_id=" << i; | ||
| continue; | ||
| } |
There was a problem hiding this comment.
SegmentLoader::load_segments() now skips NOT_FOUND segments and still returns OK, but many call sites implicitly assume cache_handle->get_segments() is indexed by seg_id and has size()==rowset->num_segments(). For example, BaseTablet::lookup_row_key() indexes segments[id] where id is a seg_id; with skipped entries this can become out-of-bounds or dereference the wrong segment in release builds. To make skipping safe, either (a) preserve the seg_id->index contract by resizing the segment vector to num_segments and storing loaded segments at segments[seg_id] (leaving nullptr for missing) and update callers to handle nullptr, or (b) change the API to return a mapping and update all callers to lookup by Segment::id() instead of positional indexing.
| if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) { | |
| LOG(WARNING) << "segment not found, skip it. rowset_id=" << rowset->rowset_id() | |
| << ", seg_id=" << i; | |
| continue; | |
| } |
| if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) { | ||
| LOG(WARNING) << "segment not found, skip it. rowset_id=" << rowset->rowset_id() | ||
| << ", seg_id=" << i; | ||
| continue; |
There was a problem hiding this comment.
The warning log inside the segment-load loop can spam logs when multiple segments are missing (and load_segments() is called frequently during query execution). Consider rate limiting (e.g., LOG_EVERY_N / LOG_EVERY_N_SECONDS) and/or aggregating counts per rowset to reduce operational noise while still preserving debuggability.
|
|
||
| #pragma once | ||
|
|
||
| #include "common/config.h" |
There was a problem hiding this comment.
lazy_init_segment_iterator.h adds #include "common/config.h", but this header doesn't reference config:: anywhere. Keeping this include in a widely-used header increases compile-time dependencies; it should be removed (the .cpp already includes what it needs transitively).
| #include "common/config.h" |
- Add rowset_id to LazyInitSegmentIterator skip log for better debuggability - Remove unused #include "common/config.h" from lazy_init_segment_iterator.h - Use specific DebugPoints::remove() instead of clear() in test TearDown - Assert json2pb::JsonToProtoMessage return value in test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
run buildall |
The forward declaration of BetaRowset is insufficient for calling rowset_id() in the log message. Add the full include. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
run buildall |
TPC-H: Total hot run time: 26832 ms |
TPC-DS: Total hot run time: 169589 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
The segment cache serves cached segments from baseline query, bypassing BetaRowset::load_segment entirely. Disable it so the debug point is hit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend the skip condition to cover both NOT_FOUND and IO_ERROR, so EIO errors from damaged/inaccessible segment files are also tolerated when ignore_not_found_segment is enabled. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
run buildall |
TPC-H: Total hot run time: 26765 ms |
TPC-DS: Total hot run time: 168132 ms |
The baseline query populated the segment LRU cache before fault injection. Since disable_segment_cache only prevents new insertions (not lookups), subsequent queries were served from cache and never hit the debug point. Remove the baseline query and verify data integrity in the recovery test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
run buildall |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Whether to ignore IO errors (NOT_FOUND, EIO) when loading segment files in native olap tables. | ||
| // Default is true. When a segment file is missing or has IO errors, | ||
| // the query/load will skip the failing segment instead of reporting error to users. | ||
| DECLARE_mBool(ignore_not_found_segment); |
There was a problem hiding this comment.
The new config name ignore_not_found_segment is misleading because the behavior (and comment) also includes IO_ERROR, not just missing segments. Since this is a newly introduced config, consider renaming it to something like ignore_segment_io_errors (or otherwise aligning the name strictly to NOT_FOUND-only behavior) to avoid operator confusion when toggling at runtime.
| // Whether to ignore IO errors (NOT_FOUND, EIO) when loading segment files in native olap tables. | |
| // Default is true. When a segment file is missing or has IO errors, | |
| // the query/load will skip the failing segment instead of reporting error to users. | |
| DECLARE_mBool(ignore_not_found_segment); | |
| // Whether to ignore IO errors (NOT_FOUND, EIO) when loading segment files in native OLAP tables. | |
| // Default is true. When a segment file is missing or has IO errors, | |
| // the query/load will skip the failing segment instead of reporting error to users. | |
| DECLARE_mBool(ignore_segment_io_errors); |
| auto st = SegmentLoader::instance()->load_segment( | ||
| _rowset, _segment_id, &segment_cache_handle, _should_use_cache, false, opts.stats); | ||
| if ((st.is<ErrorCode::NOT_FOUND>() || st.is<ErrorCode::IO_ERROR>()) && | ||
| config::ignore_not_found_segment) { | ||
| LOG(WARNING) << "segment io error, skip it. rowset_id=" << _rowset->rowset_id() | ||
| << ", seg_id=" << _segment_id << ", status=" << st; | ||
| // _inner_iterator remains nullptr, next_batch() will return EOF | ||
| return Status::OK(); | ||
| } | ||
| RETURN_IF_ERROR(st); |
There was a problem hiding this comment.
The ignore+log logic for segment load failures is now duplicated across SegmentLoader::load_segments, LazyInitSegmentIterator::init, and BetaRowset::load_segments. To avoid future drift (e.g., one path handling a new error code differently), consider extracting a small helper (e.g., should_ignore_segment_load_error(Status)) or a shared utility for the condition + standardized log message.
TPC-H: Total hot run time: 26638 ms |
TPC-DS: Total hot run time: 168484 ms |
…n tests The per-test enable/disable of disable_segment_cache left a window between tests where segments could be loaded and cached by background tasks. Moving disable_segment_cache=true to a single outer scope eliminates this race. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
run buildall |
TPC-H: Total hot run time: 26925 ms |
TPC-DS: Total hot run time: 169603 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
Summary
ignore_not_found_segment(defaulttrue), togglable at runtime via HTTP API.SegmentLoader::load_segments,LazyInitSegmentIterator::init, andBetaRowset::load_segments.Changes
config.h/cppignore_not_found_segmentconfig (mutable bool, default true)segment_loader.cppload_segments()catches NOT_FOUND and skips with warninglazy_init_segment_iterator.cpp/hinit()catches NOT_FOUND, returns OK with null iterator;next_batch()/current_block_row_locations()return EOF on nullbeta_rowset.cppload_segments()catches NOT_FOUND and skips;load_segment()gets DBUG injection pointignore_not_found_segment_test.cppTest plan
IgnoreNotFoundSegmentTest(9 cases) covering BetaRowset, SegmentLoader, and LazyInitSegmentIterator paths🤖 Generated with Claude Code