[opt](segment) Ignore not-found segments in query and load paths #61844

Open
dataroaring wants to merge 10 commits into apache:master from freemandealer:opt/ignore-not-found-segment

Conversation

@dataroaring
Contributor

Summary

  • When a segment file is missing (e.g., removed by GC or by an external cause), queries and loads now skip the missing segment instead of failing with an IO error reported to users.
  • Controlled by mutable BE config ignore_not_found_segment (default true), togglable at runtime via HTTP API.
  • Covers all three segment-loading paths: SegmentLoader::load_segments, LazyInitSegmentIterator::init, and BetaRowset::load_segments.
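
For readers skimming the summary, the skip-on-missing behavior can be pictured with a minimal, self-contained sketch. This is not the code in the PR: `Status`, `Code`, `load_all`, and the plain global `ignore_not_found_segment` are simplified stand-ins invented for illustration.

```cpp
#include <functional>
#include <vector>

// Hypothetical, simplified stand-ins; not the real Doris types.
enum class Code { OK, NOT_FOUND, CORRUPTION };

struct Status {
    Code code;
    Status(Code c = Code::OK) : code(c) {}
    bool ok() const { return code == Code::OK; }
};

// Stand-in for the mutable BE config described in the summary.
bool ignore_not_found_segment = true;

// Loads segments [0, num_segments). A NOT_FOUND segment is skipped when the
// flag is on; any other error (or NOT_FOUND with the flag off) aborts the load.
Status load_all(int num_segments, const std::function<Status(int)>& load_one,
                std::vector<int>* loaded_ids) {
    for (int seg_id = 0; seg_id < num_segments; ++seg_id) {
        Status st = load_one(seg_id);
        if (st.code == Code::NOT_FOUND && ignore_not_found_segment) {
            continue;  // skip the missing segment; real code would log a warning
        }
        if (!st.ok()) {
            return st;
        }
        loaded_ids->push_back(seg_id);
    }
    return Status();
}
```

With the flag on, a loader that reports NOT_FOUND for one segment still yields an OK status and the ids of the remaining segments. With the flag off, the first NOT_FOUND is returned to the caller.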

Changes

| File | Change |
| --- | --- |
| config.h/cpp | New ignore_not_found_segment config (mutable bool, default true) |
| segment_loader.cpp | load_segments() catches NOT_FOUND and skips with warning |
| lazy_init_segment_iterator.cpp/h | init() catches NOT_FOUND, returns OK with null iterator; next_batch()/current_block_row_locations() return EOF on null |
| beta_rowset.cpp | load_segments() catches NOT_FOUND and skips; load_segment() gets DBUG injection point |
| ignore_not_found_segment_test.cpp | 9 test cases covering all paths with config on/off |
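
The lazy-iterator row above (init() returns OK with a null iterator, next_batch() returns EOF) might look roughly like the sketch below. All class and member names here are hypothetical stand-ins, and the real iterator has a much larger interface.

```cpp
#include <memory>
#include <vector>

// Hypothetical, simplified stand-ins; not the real Doris classes.
enum class Code { OK, NOT_FOUND, END_OF_FILE };

struct Batch {
    std::vector<int> rows;
};

// Produces exactly one batch, then reports end of file.
struct InnerIterator {
    int remaining = 1;
    Code next_batch(Batch* out) {
        if (remaining == 0) return Code::END_OF_FILE;
        --remaining;
        out->rows.push_back(42);
        return Code::OK;
    }
};

class LazyIterator {
public:
    LazyIterator(bool segment_exists, bool ignore_missing)
            : _segment_exists(segment_exists), _ignore_missing(ignore_missing) {}

    // A missing segment is tolerated when _ignore_missing is set:
    // init() reports OK and _inner stays null.
    Code init() {
        if (!_segment_exists) {
            return _ignore_missing ? Code::OK : Code::NOT_FOUND;
        }
        _inner = std::make_unique<InnerIterator>();
        return Code::OK;
    }

    // A null inner iterator behaves like an empty segment.
    Code next_batch(Batch* out) {
        if (_inner == nullptr) return Code::END_OF_FILE;
        return _inner->next_batch(out);
    }

private:
    bool _segment_exists;
    bool _ignore_missing;
    std::unique_ptr<InnerIterator> _inner;
};
```

The point of the null check in next_batch() is that callers need no special case: a skipped segment simply looks exhausted on the first read.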

Test plan

  • New UT: IgnoreNotFoundSegmentTest (9 cases) covering BetaRowset, SegmentLoader, and LazyInitSegmentIterator paths
  • Verify config toggle works at runtime via BE HTTP API
  • Regression: existing segment_cache_test still passes

🤖 Generated with Claude Code

When a segment file is missing (e.g., removed by GC or external cause),
queries and loads now skip the missing segment instead of failing with
IO error. Controlled by mutable config `ignore_not_found_segment`
(default true), togglable at runtime via BE HTTP API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 28, 2026 07:12
@Thearas
Contributor

Thearas commented Mar 28, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Add regression test that verifies ignore_not_found_segment behavior
end-to-end using debug point injection on a real BE cluster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI left a comment

Pull request overview

This PR adds a BE runtime config to tolerate missing native OLAP segment files by skipping NOT_FOUND segments in several segment-loading paths, aiming to avoid user-visible query/load failures when segment files are missing.

Changes:

  • Add mutable BE config ignore_not_found_segment (default true) to control skipping behavior.
  • Skip NOT_FOUND segments in SegmentLoader::load_segments, BetaRowset::load_segments, and LazyInitSegmentIterator::init/next_batch.
  • Add UT coverage via IgnoreNotFoundSegmentTest using a debug-point injection in BetaRowset::load_segment.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.

Show a summary per file
| File | Description |
| --- | --- |
| be/src/common/config.h | Declares new mutable config ignore_not_found_segment. |
| be/src/common/config.cpp | Defines new mutable config with default true. |
| be/src/storage/segment/segment_loader.cpp | Skips NOT_FOUND segments during bulk segment loading. |
| be/src/storage/segment/lazy_init_segment_iterator.h | Makes next_batch()/current_block_row_locations() return EOF when inner iterator is null. |
| be/src/storage/segment/lazy_init_segment_iterator.cpp | Ignores NOT_FOUND in init() when config enabled (leaves inner iterator null). |
| be/src/storage/rowset/beta_rowset.cpp | Skips NOT_FOUND in load_segments(); adds debug-point injection for NOT_FOUND in load_segment(). |
| be/test/storage/segment/ignore_not_found_segment_test.cpp | New UTs covering config on/off behaviors for the three paths. |

Comment on lines +45 to +48

    if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) {
        LOG(WARNING) << "segment not found, skip it. seg_id=" << _segment_id;
        // _inner_iterator remains nullptr, next_batch() will return EOF
        return Status::OK();

Copilot AI Mar 28, 2026

LazyInitSegmentIterator::init() logs only seg_id when skipping NOT_FOUND, which makes correlating the warning to a specific tablet/rowset difficult in production. Please include at least the rowset id (and ideally tablet id / segment path if available) in the warning, and consider rate-limiting if this can be hit repeatedly.

Comment on lines +255 to +259

    if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) {
        LOG(WARNING) << "segment not found, skip it. rowset_id=" << rowset_id()
                     << ", seg_id=" << seg_id;
        seg_id++;
        continue;

Copilot AI Mar 28, 2026

Like SegmentLoader, BetaRowset::load_segments() now logs a WARNING per missing segment. If a rowset has many missing segments (or the scan is retried), this can generate a large volume of logs. Consider rate limiting and/or logging a summary once per rowset (e.g., number of skipped segments) to reduce operational noise.
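
One way to picture the "summary once per rowset" idea from this comment is a small collector that records skipped segment ids and renders a single message. This is only a sketch: `SkipSummary` is a hypothetical helper, not something in the PR, and the message format is made up.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collects skipped segment ids for one rowset and
// renders a single summary line instead of one warning per segment.
class SkipSummary {
public:
    explicit SkipSummary(std::string rowset_id) : _rowset_id(std::move(rowset_id)) {}

    void record(int seg_id) { _skipped.push_back(seg_id); }
    bool empty() const { return _skipped.empty(); }

    std::string message() const {
        std::ostringstream os;
        os << "skipped " << _skipped.size()
           << " missing segment(s). rowset_id=" << _rowset_id << ", seg_ids=[";
        for (std::size_t i = 0; i < _skipped.size(); ++i) {
            if (i > 0) os << ",";
            os << _skipped[i];
        }
        os << "]";
        return os.str();
    }

private:
    std::string _rowset_id;
    std::vector<int> _skipped;
};
```

The load loop would call record() in the skip branch and emit message() once after the loop, only when empty() is false.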

Comment on lines +78 to +83

    RowsetMetaPB pb;
    json2pb::JsonToProtoMessage(json, &pb);
    pb.set_start_version(0);
    pb.set_end_version(1);
    pb.set_num_segments(num_segments);
    rsm->init_from_pb(pb);

Copilot AI Mar 28, 2026

The return value from json2pb::JsonToProtoMessage() is ignored here. Other tests in the repo treat this as a bool and assert success; if the JSON format changes, silently proceeding can make failures harder to diagnose. Please capture the return value and ASSERT_TRUE/EXPECT_TRUE it (or use a helper that returns Status).

Comment on lines +46 to +53

    void TearDown() override {
        DebugPoints::instance()->clear();
        config::ignore_not_found_segment = _saved_ignore;
        config::enable_debug_points = _saved_debug_points;

        ExecEnv::GetInstance()->set_segment_loader(_saved_segment_loader);
        delete _segment_loader;
        _segment_loader = nullptr;

Copilot AI Mar 28, 2026

The test fixture uses DebugPoints::instance()->clear() in TearDown(), which removes all debug points for the entire process. This can create test-order coupling if other tests in the same binary rely on debug points. Prefer removing only the points added by this test (e.g., remove("BetaRowset::load_segment.return_not_found")) or using an RAII helper that adds/removes a named debug point and restores config::enable_debug_points.
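
The RAII helper mentioned in this comment could be sketched as follows. `DebugPointRegistry` is a hypothetical stand-in written for this example, not the real `DebugPoints` class, whose interface is not shown in this thread.

```cpp
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for a process-wide debug-point registry.
class DebugPointRegistry {
public:
    static DebugPointRegistry& instance() {
        static DebugPointRegistry registry;
        return registry;
    }
    void add(const std::string& name) { _points.insert(name); }
    void remove(const std::string& name) { _points.erase(name); }
    bool is_enabled(const std::string& name) const { return _points.count(name) > 0; }

private:
    std::set<std::string> _points;
};

// Adds one named debug point for the lifetime of the guard and removes only
// that point on destruction, leaving points owned by other tests untouched.
class ScopedDebugPoint {
public:
    explicit ScopedDebugPoint(std::string name) : _name(std::move(name)) {
        DebugPointRegistry::instance().add(_name);
    }
    ~ScopedDebugPoint() { DebugPointRegistry::instance().remove(_name); }

    ScopedDebugPoint(const ScopedDebugPoint&) = delete;
    ScopedDebugPoint& operator=(const ScopedDebugPoint&) = delete;

private:
    std::string _name;
};
```

A test body would declare the guard at the top of the scope that needs the fault injected. Note that this simple version unconditionally removes the point on exit, so it assumes no other test enabled the same name beforehand.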

        ASSERT_EQ(0, handle.get_segments().size());
        ASSERT_TRUE(handle.is_inited());
    }

Copilot AI Mar 28, 2026

The new UTs validate that SegmentLoader::load_segments() returns OK when segments are missing, but they don't cover any real caller that relies on SegmentCacheHandle::get_segments() being indexable by seg_id (e.g., code paths that do segments[seg_id]). Given the change in semantics, please add a regression test exercising such a caller (or explicitly validate the seg_id->vector-index contract you intend to provide).

Suggested change
    // Regression test: validate that SegmentCacheHandle::get_segments() can be indexed by seg_id
    // after a successful SegmentLoader::load_segments() call. This mimics real callers that do
    // `segments[seg_id]` and relies on the seg_id -> vector-index contract.
    TEST_F(IgnoreNotFoundSegmentTest, SegmentLoaderSegmentsIndexableBySegId) {
        config::ignore_not_found_segment = true;
        auto rowset = create_rowset(3);
        // Do not inject NOT_FOUND for this test; we want all segments to load successfully
        SegmentLoader loader(1024 * 1024, 100);
        SegmentCacheHandle handle;
        auto st = loader.load_segments(rowset, &handle, false);
        ASSERT_TRUE(st.ok()) << st;
        ASSERT_TRUE(handle.is_inited());
        const auto& segments = handle.get_segments();
        // Expect that we can index by seg_id in [0, 3)
        ASSERT_GE(segments.size(), 3);
        for (int seg_id = 0; seg_id < 3; ++seg_id) {
            // Real callers rely on segments[seg_id] being valid for each seg_id
            ASSERT_NE(nullptr, segments[seg_id]);
        }
    }

Comment on lines +96 to +100

    if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) {
        LOG(WARNING) << "segment not found, skip it. rowset_id=" << rowset->rowset_id()
                     << ", seg_id=" << i;
        continue;
    }

Copilot AI Mar 28, 2026

SegmentLoader::load_segments() now skips NOT_FOUND segments and still returns OK, but many call sites implicitly assume cache_handle->get_segments() is indexed by seg_id and has size()==rowset->num_segments(). For example, BaseTablet::lookup_row_key() indexes segments[id] where id is a seg_id; with skipped entries this can become out-of-bounds or dereference the wrong segment in release builds. To make skipping safe, either (a) preserve the seg_id->index contract by resizing the segment vector to num_segments and storing loaded segments at segments[seg_id] (leaving nullptr for missing) and update callers to handle nullptr, or (b) change the API to return a mapping and update all callers to lookup by Segment::id() instead of positional indexing.

Suggested change
    if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) {
        LOG(WARNING) << "segment not found, skip it. rowset_id=" << rowset->rowset_id()
                     << ", seg_id=" << i;
        continue;
    }
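
Option (a) from the comment above, keeping the seg_id -> index contract by leaving nullptr holes, might be sketched like this. `Segment`, `load_indexed`, and `count_present` are hypothetical names for illustration only; in this sketch a loader signals "not found" by returning nullptr.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

struct Segment {  // hypothetical stand-in
    int id;
};
using SegmentPtr = std::shared_ptr<Segment>;

// Keeps the result vector sized to num_segments so that segments[seg_id]
// is always in bounds; a missing segment leaves a nullptr entry.
std::vector<SegmentPtr> load_indexed(int num_segments,
                                     const std::function<SegmentPtr(int)>& try_load) {
    std::vector<SegmentPtr> segments(static_cast<std::size_t>(num_segments));
    for (int seg_id = 0; seg_id < num_segments; ++seg_id) {
        segments[static_cast<std::size_t>(seg_id)] = try_load(seg_id);
    }
    return segments;
}

// Callers must check for nullptr before dereferencing an entry.
int count_present(const std::vector<SegmentPtr>& segments) {
    int present = 0;
    for (const auto& seg : segments) {
        if (seg != nullptr) ++present;
    }
    return present;
}
```

The trade-off is that every positional caller has to handle nullptr, whereas option (b) changes the API so callers look segments up by id instead.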

Comment on lines +96 to +99

    if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) {
        LOG(WARNING) << "segment not found, skip it. rowset_id=" << rowset->rowset_id()
                     << ", seg_id=" << i;
        continue;

Copilot AI Mar 28, 2026

The warning log inside the segment-load loop can spam logs when multiple segments are missing (and load_segments() is called frequently during query execution). Consider rate limiting (e.g., LOG_EVERY_N / LOG_EVERY_N_SECONDS) and/or aggregating counts per rowset to reduce operational noise while still preserving debuggability.

    #pragma once

    #include "common/config.h"

Copilot AI Mar 28, 2026

lazy_init_segment_iterator.h adds #include "common/config.h", but this header doesn't reference config:: anywhere. Keeping this include in a widely-used header increases compile-time dependencies; it should be removed (the .cpp already includes what it needs transitively).

Suggested change
#include "common/config.h"

- Add rowset_id to LazyInitSegmentIterator skip log for better debuggability
- Remove unused #include "common/config.h" from lazy_init_segment_iterator.h
- Use specific DebugPoints::remove() instead of clear() in test TearDown
- Assert json2pb::JsonToProtoMessage return value in test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
gavinchou previously approved these changes Mar 28, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

github-actions bot added the approved label (Indicates a PR has been approved by one committer.) Mar 28, 2026
@github-actions
Contributor

PR approved by anyone and no changes requested.

@dataroaring
Contributor Author

run buildall

The forward declaration of BetaRowset is insufficient for calling
rowset_id() in the log message. Add the full include.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
github-actions bot removed the approved label (Indicates a PR has been approved by one committer.) Mar 30, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dataroaring
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 26832 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 47f21911665268c40cbbd2b9d5ef4087bb831f35, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17637	4553	4290	4290
q2	q3	10639	835	533	533
q4	4682	362	258	258
q5	7566	1215	1030	1030
q6	177	174	148	148
q7	789	840	677	677
q8	9307	1483	1376	1376
q9	4913	4774	4792	4774
q10	6248	1950	1669	1669
q11	459	260	251	251
q12	739	593	462	462
q13	18019	2737	1941	1941
q14	229	224	215	215
q15	q16	742	736	674	674
q17	727	859	422	422
q18	6020	5452	5357	5357
q19	1124	993	636	636
q20	559	497	389	389
q21	4444	1861	1420	1420
q22	438	453	310	310
Total cold run time: 95458 ms
Total hot run time: 26832 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4783	4577	4649	4577
q2	q3	3903	4399	3862	3862
q4	899	1214	769	769
q5	4128	4390	4364	4364
q6	187	179	143	143
q7	1750	1679	1529	1529
q8	2498	2719	2619	2619
q9	7788	7426	7541	7426
q10	3819	4035	3591	3591
q11	511	430	441	430
q12	507	594	470	470
q13	2537	2884	2069	2069
q14	296	318	283	283
q15	q16	741	761	708	708
q17	1179	1396	1366	1366
q18	7142	7008	6694	6694
q19	916	931	1005	931
q20	2084	2177	1997	1997
q21	4027	3577	3282	3282
q22	463	432	463	432
Total cold run time: 50158 ms
Total hot run time: 47542 ms

@doris-robot

TPC-DS: Total hot run time: 169589 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 47f21911665268c40cbbd2b9d5ef4087bb831f35, data reload: false

query5	4369	632	524	524
query6	351	230	208	208
query7	4222	473	265	265
query8	344	242	253	242
query9	8695	2659	2711	2659
query10	507	427	340	340
query11	6988	5080	4842	4842
query12	177	127	129	127
query13	1279	490	350	350
query14	5844	3683	3526	3526
query14_1	2852	2852	2826	2826
query15	209	203	177	177
query16	1011	512	432	432
query17	876	733	598	598
query18	2437	439	342	342
query19	214	204	179	179
query20	134	124	124	124
query21	214	145	109	109
query22	13480	14958	14529	14529
query23	16529	16422	15672	15672
query23_1	15680	15621	15682	15621
query24	7222	1633	1228	1228
query24_1	1239	1267	1239	1239
query25	606	507	438	438
query26	1241	268	153	153
query27	2785	487	301	301
query28	4493	1855	1834	1834
query29	841	607	514	514
query30	300	227	195	195
query31	1014	969	885	885
query32	94	79	76	76
query33	534	367	302	302
query34	897	884	535	535
query35	657	692	610	610
query36	1097	1138	1071	1071
query37	143	103	91	91
query38	2943	2883	2871	2871
query39	850	848	815	815
query39_1	816	806	791	791
query40	241	160	143	143
query41	68	67	65	65
query42	261	260	258	258
query43	241	246	227	227
query44	
query45	206	193	188	188
query46	961	999	613	613
query47	2138	2135	2023	2023
query48	308	318	233	233
query49	653	498	407	407
query50	703	287	227	227
query51	4142	4163	4041	4041
query52	264	278	258	258
query53	297	335	287	287
query54	338	313	291	291
query55	104	89	89	89
query56	348	351	335	335
query57	1938	1734	1704	1704
query58	293	282	276	276
query59	2808	2926	2772	2772
query60	335	332	331	331
query61	162	163	156	156
query62	641	602	533	533
query63	310	283	277	277
query64	4997	1319	1061	1061
query65	
query66	1464	459	363	363
query67	24335	24324	24240	24240
query68	
query69	401	322	284	284
query70	992	969	964	964
query71	339	310	297	297
query72	2844	2744	2485	2485
query73	545	559	311	311
query74	9573	9533	9428	9428
query75	2867	2767	2471	2471
query76	2307	1032	681	681
query77	359	404	316	316
query78	10993	11181	10497	10497
query79	1083	849	578	578
query80	1446	648	556	556
query81	545	263	232	232
query82	1364	150	126	126
query83	389	265	259	259
query84	304	125	110	110
query85	1085	526	488	488
query86	428	313	302	302
query87	3178	3179	2992	2992
query88	3537	2660	2632	2632
query89	429	383	346	346
query90	1867	185	179	179
query91	180	174	141	141
query92	83	78	73	73
query93	916	862	498	498
query94	596	324	305	305
query95	598	408	324	324
query96	651	531	231	231
query97	2477	2516	2457	2457
query98	241	220	215	215
query99	1042	1005	920	920
Total cold run time: 250436 ms
Total hot run time: 169589 ms

@doris-robot

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.90% (19959/37729)
Line Coverage 36.44% (187227/513755)
Region Coverage 32.70% (145248/444177)
Branch Coverage 33.86% (63634/187927)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.78% (26523/36949)
Line Coverage 54.63% (279801/512208)
Region Coverage 51.64% (231484/448297)
Branch Coverage 53.11% (100111/188493)

dataroaring and others added 3 commits March 30, 2026 20:10
The segment cache serves cached segments from baseline query, bypassing
BetaRowset::load_segment entirely. Disable it so the debug point is hit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend the skip condition to cover both NOT_FOUND and IO_ERROR,
so EIO errors from damaged/inaccessible segment files are also
tolerated when ignore_not_found_segment is enabled.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dataroaring
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 26765 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 9f587cb24f187effff53e943a5e90f3ccbdfaf75, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17621	4486	4306	4306
q2	q3	10636	793	525	525
q4	4685	367	250	250
q5	7586	1233	1020	1020
q6	176	177	150	150
q7	819	861	697	697
q8	9340	1527	1398	1398
q9	5006	4769	4736	4736
q10	6297	1949	1652	1652
q11	467	266	243	243
q12	739	585	468	468
q13	18042	2697	1938	1938
q14	239	237	218	218
q15	q16	743	751	668	668
q17	751	874	443	443
q18	6231	5526	5316	5316
q19	1128	997	600	600
q20	548	507	371	371
q21	4481	1865	1481	1481
q22	506	410	285	285
Total cold run time: 96041 ms
Total hot run time: 26765 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4745	4576	4673	4576
q2	q3	3935	4361	3841	3841
q4	900	1197	792	792
q5	4153	4444	4361	4361
q6	186	177	144	144
q7	1801	1657	1538	1538
q8	2537	2727	2786	2727
q9	7687	7353	7347	7347
q10	3781	4020	3617	3617
q11	496	432	408	408
q12	474	579	449	449
q13	2472	2932	2104	2104
q14	293	338	281	281
q15	q16	723	769	705	705
q17	1160	1325	1355	1325
q18	7225	6795	6666	6666
q19	918	954	994	954
q20	2097	2205	2001	2001
q21	4173	3535	3325	3325
q22	474	445	368	368
Total cold run time: 50230 ms
Total hot run time: 47529 ms

@doris-robot

TPC-DS: Total hot run time: 168132 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 9f587cb24f187effff53e943a5e90f3ccbdfaf75, data reload: false

query5	4339	670	502	502
query6	339	225	205	205
query7	4222	467	266	266
query8	356	252	234	234
query9	8727	2724	2722	2722
query10	513	408	332	332
query11	6984	5096	4869	4869
query12	182	125	126	125
query13	1279	462	351	351
query14	5673	3728	3479	3479
query14_1	2818	2837	2783	2783
query15	204	193	176	176
query16	1026	494	472	472
query17	1127	731	622	622
query18	2465	461	358	358
query19	232	210	192	192
query20	140	133	126	126
query21	219	140	106	106
query22	13318	13202	13138	13138
query23	16288	15909	16347	15909
query23_1	16012	16187	16102	16102
query24	8136	1757	1277	1277
query24_1	1330	1258	1302	1258
query25	630	459	402	402
query26	1206	261	150	150
query27	2797	477	293	293
query28	4523	1830	1815	1815
query29	833	560	471	471
query30	298	226	188	188
query31	1004	934	884	884
query32	87	69	73	69
query33	520	338	300	300
query34	904	884	523	523
query35	632	662	604	604
query36	1056	1134	970	970
query37	129	96	86	86
query38	2991	2922	2894	2894
query39	850	833	813	813
query39_1	798	780	781	780
query40	235	150	137	137
query41	68	61	59	59
query42	266	254	263	254
query43	241	261	216	216
query44	
query45	196	192	180	180
query46	873	978	589	589
query47	2117	2586	2088	2088
query48	327	318	224	224
query49	631	453	385	385
query50	687	279	209	209
query51	4113	4128	4037	4037
query52	264	263	261	261
query53	285	336	281	281
query54	296	283	273	273
query55	94	94	86	86
query56	343	320	315	315
query57	1923	1671	1603	1603
query58	319	273	269	269
query59	2790	2940	2713	2713
query60	347	341	330	330
query61	156	142	151	142
query62	649	597	547	547
query63	311	278	270	270
query64	5041	1276	995	995
query65	
query66	1455	460	353	353
query67	24197	24278	24161	24161
query68	
query69	436	321	294	294
query70	988	976	867	867
query71	346	321	296	296
query72	2892	2792	2512	2512
query73	536	534	324	324
query74	9590	9546	9403	9403
query75	2867	2823	2462	2462
query76	2373	1031	667	667
query77	363	389	314	314
query78	10860	11151	10434	10434
query79	1130	778	571	571
query80	1317	616	532	532
query81	553	270	223	223
query82	1015	156	120	120
query83	344	269	240	240
query84	255	119	101	101
query85	908	494	440	440
query86	418	336	316	316
query87	3163	3133	3054	3054
query88	3542	2653	2640	2640
query89	426	379	340	340
query90	2027	198	184	184
query91	168	170	134	134
query92	82	77	68	68
query93	923	850	507	507
query94	629	321	304	304
query95	579	351	389	351
query96	645	520	227	227
query97	2478	2498	2408	2408
query98	237	224	216	216
query99	1012	968	914	914
Total cold run time: 251010 ms
Total hot run time: 168132 ms

The baseline query populated the segment LRU cache before fault injection.
Since disable_segment_cache only prevents new insertions (not lookups),
subsequent queries were served from cache and never hit the debug point.
Remove the baseline query and verify data integrity in the recovery test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
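
The cache behavior described in the commit message above (the disable flag stops insertions but not lookups) can be modeled with a toy cache. This is an illustrative sketch only: `SegmentCache`, `disable_insert`, and `loader_calls` are hypothetical names, and values are plain ints rather than segments.

```cpp
#include <map>

// Hypothetical stand-in for a segment cache; values are ints for brevity.
class SegmentCache {
public:
    bool disable_insert = false;  // models a flag that only blocks insertions
    int loader_calls = 0;         // how often the real load path ran

    int get(int seg_id) {
        auto it = _cache.find(seg_id);
        if (it != _cache.end()) {
            return it->second;  // cache hit: the loader is never reached
        }
        ++loader_calls;  // cache miss: this is where a debug point would fire
        int value = seg_id * 10;
        if (!disable_insert) {
            _cache[seg_id] = value;
        }
        return value;
    }

private:
    std::map<int, int> _cache;
};
```

If a baseline read populates the entry before the flag is set, later reads of the same segment are served from the cache and the load path, along with any fault injected there, is skipped. Segments first read after the flag is set go through the loader every time.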
@dataroaring dataroaring requested a review from Copilot March 31, 2026 09:45
@dataroaring
Contributor Author

run buildall

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.


Comment on lines +1707 to +1710

    // Whether to ignore IO errors (NOT_FOUND, EIO) when loading segment files in native olap tables.
    // Default is true. When a segment file is missing or has IO errors,
    // the query/load will skip the failing segment instead of reporting error to users.
    DECLARE_mBool(ignore_not_found_segment);

Copilot AI Mar 31, 2026

The new config name ignore_not_found_segment is misleading because the behavior (and comment) also includes IO_ERROR, not just missing segments. Since this is a newly introduced config, consider renaming it to something like ignore_segment_io_errors (or otherwise aligning the name strictly to NOT_FOUND-only behavior) to avoid operator confusion when toggling at runtime.

Suggested change
    - // Whether to ignore IO errors (NOT_FOUND, EIO) when loading segment files in native olap tables.
    - // Default is true. When a segment file is missing or has IO errors,
    - // the query/load will skip the failing segment instead of reporting error to users.
    - DECLARE_mBool(ignore_not_found_segment);
    + // Whether to ignore IO errors (NOT_FOUND, EIO) when loading segment files in native OLAP tables.
    + // Default is true. When a segment file is missing or has IO errors,
    + // the query/load will skip the failing segment instead of reporting error to users.
    + DECLARE_mBool(ignore_segment_io_errors);

Comment on lines +44 to +53

    auto st = SegmentLoader::instance()->load_segment(
            _rowset, _segment_id, &segment_cache_handle, _should_use_cache, false, opts.stats);
    if ((st.is<ErrorCode::NOT_FOUND>() || st.is<ErrorCode::IO_ERROR>()) &&
        config::ignore_not_found_segment) {
        LOG(WARNING) << "segment io error, skip it. rowset_id=" << _rowset->rowset_id()
                     << ", seg_id=" << _segment_id << ", status=" << st;
        // _inner_iterator remains nullptr, next_batch() will return EOF
        return Status::OK();
    }
    RETURN_IF_ERROR(st);

Copilot AI Mar 31, 2026

The ignore+log logic for segment load failures is now duplicated across SegmentLoader::load_segments, LazyInitSegmentIterator::init, and BetaRowset::load_segments. To avoid future drift (e.g., one path handling a new error code differently), consider extracting a small helper (e.g., should_ignore_segment_load_error(Status)) or a shared utility for the condition + standardized log message.
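
A shared predicate along the lines suggested here could be as small as the sketch below. The `Code` enum and the explicit `ignore_enabled` parameter are assumptions made to keep the example self-contained; the actual code would presumably take a Status and read the config directly.

```cpp
// Hypothetical, simplified stand-in for the error codes involved.
enum class Code { OK, NOT_FOUND, IO_ERROR, CORRUPTION };

// Single place that decides whether a segment load failure may be skipped,
// so the three loading paths cannot drift apart.
bool should_ignore_segment_load_error(Code code, bool ignore_enabled) {
    if (!ignore_enabled) {
        return false;
    }
    return code == Code::NOT_FOUND || code == Code::IO_ERROR;
}
```

Each call site would then reduce to one condition plus a standardized log line, and adding or removing a tolerated error code becomes a one-line change.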

@doris-robot

TPC-H: Total hot run time: 26638 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 93578cd28925056e98de62aa986bbdcdfb387589, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17611	4472	4306	4306
q2	q3	10644	807	522	522
q4	4677	361	253	253
q5	7580	1210	1030	1030
q6	172	178	145	145
q7	787	846	667	667
q8	9303	1496	1381	1381
q9	4968	4723	4714	4714
q10	6310	1917	1656	1656
q11	445	255	233	233
q12	735	588	472	472
q13	18043	2688	1933	1933
q14	225	236	215	215
q15	q16	729	725	666	666
q17	719	823	488	488
q18	5878	5449	5242	5242
q19	1123	982	618	618
q20	568	502	380	380
q21	4405	1859	1445	1445
q22	342	308	272	272
Total cold run time: 95264 ms
Total hot run time: 26638 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4872	4685	4750	4685
q2	q3	3941	4405	3804	3804
q4	877	1203	789	789
q5	4060	4405	4399	4399
q6	194	186	140	140
q7	1775	1627	1547	1547
q8	2537	2776	2623	2623
q9	7452	7175	7342	7175
q10	3859	4043	3614	3614
q11	512	441	418	418
q12	512	600	476	476
q13	2535	2844	2062	2062
q14	402	312	276	276
q15	q16	760	796	726	726
q17	1150	1318	1386	1318
q18	7256	7117	6652	6652
q19	953	963	974	963
q20	2143	2183	2015	2015
q21	3955	3472	3315	3315
q22	465	445	381	381
Total cold run time: 50210 ms
Total hot run time: 47378 ms

@doris-robot

TPC-DS: Total hot run time: 168484 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 93578cd28925056e98de62aa986bbdcdfb387589, data reload: false

query5	4327	621	493	493
query6	339	228	208	208
query7	4211	461	264	264
query8	341	239	221	221
query9	8759	2736	2752	2736
query10	514	393	333	333
query11	6980	5100	4890	4890
query12	178	125	121	121
query13	1268	466	370	370
query14	5678	3703	3465	3465
query14_1	2875	2851	2854	2851
query15	212	193	181	181
query16	978	476	379	379
query17	906	736	629	629
query18	2443	475	357	357
query19	224	214	188	188
query20	134	127	123	123
query21	217	138	111	111
query22	13155	14033	14432	14033
query23	16691	16512	15954	15954
query23_1	16144	15758	15678	15678
query24	7161	1653	1230	1230
query24_1	1239	1228	1245	1228
query25	628	468	419	419
query26	1237	257	142	142
query27	2809	490	300	300
query28	4528	1854	1829	1829
query29	842	556	475	475
query30	300	228	190	190
query31	1001	936	884	884
query32	84	70	69	69
query33	512	347	293	293
query34	892	863	532	532
query35	644	699	597	597
query36	1083	1120	943	943
query37	137	91	84	84
query38	2956	2889	2853	2853
query39	860	829	821	821
query39_1	795	782	785	782
query40	233	153	133	133
query41	66	61	57	57
query42	261	257	259	257
query43	234	243	223	223
query44	
query45	193	192	183	183
query46	876	978	605	605
query47	2109	2143	2018	2018
query48	309	310	225	225
query49	638	466	375	375
query50	690	272	214	214
query51	4221	4003	3987	3987
query52	258	262	256	256
query53	284	337	284	284
query54	294	266	264	264
query55	95	90	87	87
query56	325	314	304	304
query57	1906	1691	1706	1691
query58	279	304	266	266
query59	2776	2987	2722	2722
query60	342	335	349	335
query61	162	156	156	156
query62	630	590	541	541
query63	316	277	276	276
query64	5112	1296	1022	1022
query65	
query66	1459	455	357	357
query67	24124	24257	24178	24178
query68	
query69	408	315	287	287
query70	978	909	972	909
query71	332	313	294	294
query72	2869	2781	2495	2495
query73	529	536	319	319
query74	9586	9591	9412	9412
query75	2871	2749	2457	2457
query76	2300	1025	656	656
query77	366	377	300	300
query78	10945	11103	10418	10418
query79	2975	768	583	583
query80	1735	615	546	546
query81	558	266	221	221
query82	982	157	117	117
query83	332	265	245	245
query84	304	122	105	105
query85	962	499	455	455
query86	426	329	304	304
query87	3122	3153	2983	2983
query88	3514	2660	2637	2637
query89	424	383	339	339
query90	1972	179	174	174
query91	169	174	141	141
query92	80	73	69	69
query93	1185	851	505	505
query94	671	321	298	298
query95	599	403	319	319
query96	637	507	226	226
query97	2480	2477	2385	2385
query98	232	236	231	231
query99	965	1006	909	909
Total cold run time: 252355 ms
Total hot run time: 168484 ms

…n tests

The per-test enable/disable of disable_segment_cache left a window between
tests where segments could be loaded and cached by background tasks. Moving
disable_segment_cache=true to a single outer scope eliminates this race.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
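
The race described in the commit message above can be illustrated with a minimal, self-contained sketch. It is not the PR's test code: `sketch_config::disable_segment_cache` is a hypothetical stand-in for the BE config flag, and `ScopedFlag` and both `run_with_*` helpers are invented here for illustration. The sketch records the flag value inside each test and in the gap between tests, which is where a background task could load and cache a segment.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the mutable BE config flag named in the commit message.
namespace sketch_config {
static bool disable_segment_cache = false;
}

// RAII guard (illustrative): sets the flag on construction, restores the old value on destruction.
class ScopedFlag {
public:
    ScopedFlag(bool& flag, bool value) : _flag(flag), _saved(flag) { _flag = value; }
    ~ScopedFlag() { _flag = _saved; }
    ScopedFlag(const ScopedFlag&) = delete;
    ScopedFlag& operator=(const ScopedFlag&) = delete;

private:
    bool& _flag;
    bool _saved;
};

// Per-test toggling: the flag falls back to false in the gap after every test.
std::vector<bool> run_with_per_test_toggle(int num_tests) {
    std::vector<bool> observed;
    for (int i = 0; i < num_tests; ++i) {
        {
            ScopedFlag guard(sketch_config::disable_segment_cache, true);
            observed.push_back(sketch_config::disable_segment_cache); // inside the test
        }
        observed.push_back(sketch_config::disable_segment_cache); // gap between tests
    }
    return observed;
}

// Single outer scope: the flag stays true across all tests and the gaps between them.
std::vector<bool> run_with_outer_scope(int num_tests) {
    std::vector<bool> observed;
    ScopedFlag guard(sketch_config::disable_segment_cache, true);
    assert(sketch_config::disable_segment_cache);
    for (int i = 0; i < num_tests; ++i) {
        observed.push_back(sketch_config::disable_segment_cache); // inside the test
        observed.push_back(sketch_config::disable_segment_cache); // gap between tests
    }
    return observed;
}
```

With per-test toggling, every second recorded value is `false`, which marks the window the commit message refers to. With the outer scope, every recorded value is `true`, and the flag is restored only once, after the last test.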
@dataroaring
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 26925 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 60bb3c5c46a4b9bb0c0892b1615597529096eb5e, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17593	4607	4292	4292
q2	
q3	10641	784	521	521
q4	4675	365	250	250
q5	7571	1198	1021	1021
q6	178	173	146	146
q7	788	876	663	663
q8	9860	1484	1381	1381
q9	5421	4787	4726	4726
q10	6321	1944	1677	1677
q11	494	248	249	248
q12	767	582	468	468
q13	18031	2690	1973	1973
q14	228	231	210	210
q15	
q16	760	753	657	657
q17	758	873	431	431
q18	6010	5475	5247	5247
q19	1524	972	613	613
q20	571	502	378	378
q21	4490	1873	1731	1731
q22	406	379	292	292
Total cold run time: 97087 ms
Total hot run time: 26925 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4822	4638	4601	4601
q2	
q3	3890	4335	3820	3820
q4	911	1219	813	813
q5	4051	4381	4348	4348
q6	187	181	142	142
q7	1789	1635	1578	1578
q8	2490	2729	2629	2629
q9	7536	7278	7371	7278
q10	3913	4011	3621	3621
q11	528	431	411	411
q12	472	603	446	446
q13	2558	3081	2030	2030
q14	295	300	297	297
q15	
q16	734	774	704	704
q17	1173	1322	1555	1322
q18	7332	6892	6992	6892
q19	950	961	962	961
q20	2083	2163	1995	1995
q21	3907	3603	3344	3344
q22	456	445	411	411
Total cold run time: 50077 ms
Total hot run time: 47643 ms

@doris-robot

TPC-DS: Total hot run time: 169603 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 60bb3c5c46a4b9bb0c0892b1615597529096eb5e, data reload: false

query5	4364	644	505	505
query6	341	218	206	206
query7	4214	457	264	264
query8	341	238	228	228
query9	8726	2734	2729	2729
query10	548	413	336	336
query11	7005	5112	4856	4856
query12	194	123	123	123
query13	1260	477	337	337
query14	5719	3690	3435	3435
query14_1	3079	2783	2902	2783
query15	209	190	172	172
query16	977	467	447	447
query17	868	746	595	595
query18	2438	445	350	350
query19	212	226	196	196
query20	136	125	126	125
query21	217	141	107	107
query22	13139	15110	14464	14464
query23	16642	16380	16146	16146
query23_1	15849	15665	15612	15612
query24	7152	1630	1218	1218
query24_1	1225	1236	1232	1232
query25	584	493	451	451
query26	1263	268	160	160
query27	2757	487	296	296
query28	4524	1873	1879	1873
query29	859	609	507	507
query30	310	231	194	194
query31	1005	961	884	884
query32	85	73	73	73
query33	528	370	306	306
query34	939	872	532	532
query35	650	679	596	596
query36	1092	1121	999	999
query37	132	101	87	87
query38	2928	2956	2871	2871
query39	861	864	818	818
query39_1	794	798	801	798
query40	235	161	143	143
query41	68	66	65	65
query42	268	261	257	257
query43	243	269	225	225
query44	
query45	201	239	178	178
query46	888	990	614	614
query47	2124	2145	2084	2084
query48	308	313	228	228
query49	660	455	386	386
query50	684	268	232	232
query51	4073	4009	4024	4009
query52	258	264	253	253
query53	300	342	275	275
query54	298	277	267	267
query55	90	90	85	85
query56	312	308	319	308
query57	1935	1757	1719	1719
query58	286	284	260	260
query59	2790	2951	2742	2742
query60	343	332	320	320
query61	161	150	161	150
query62	620	571	537	537
query63	306	278	275	275
query64	5069	1273	1030	1030
query65	
query66	1462	457	356	356
query67	24355	24208	24094	24094
query68	
query69	403	325	290	290
query70	923	975	963	963
query71	330	304	298	298
query72	2804	2681	2477	2477
query73	546	558	323	323
query74	9622	9567	9421	9421
query75	2871	2756	2456	2456
query76	2275	1041	699	699
query77	369	391	314	314
query78	10988	11136	10460	10460
query79	2484	755	574	574
query80	1773	630	540	540
query81	562	258	219	219
query82	989	159	129	129
query83	343	266	247	247
query84	301	122	114	114
query85	918	488	449	449
query86	433	333	288	288
query87	3128	3095	3088	3088
query88	3572	2701	2662	2662
query89	434	371	338	338
query90	2016	181	179	179
query91	171	172	149	149
query92	80	76	67	67
query93	1020	835	509	509
query94	637	313	317	313
query95	607	405	327	327
query96	639	521	232	232
query97	2493	2548	2471	2471
query98	231	221	219	219
query99	1017	993	916	916
Total cold run time: 251800 ms
Total hot run time: 169603 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.93% (19991/37768)
Line Coverage 36.47% (187618/514469)
Region Coverage 32.69% (145345/444583)
Branch Coverage 33.91% (63830/188247)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.57% (27209/36986)
Line Coverage 57.13% (293020/512924)
Region Coverage 54.25% (243426/448700)
Branch Coverage 56.03% (105789/188813)
