[fix](executor) prevent BE crash when split process throws unexpectedly by eldenmoon · Pull Request #62044 · apache/doris

eldenmoon · 2026-04-02T07:30:28Z

Catch exceptions around split->process() in TimeSharingTaskExecutor and
convert them to split failure status, while keeping MEM_ALLOC_FAILED mapped to
MemoryLimitExceeded.

This avoids worker thread termination and a BE crash in cases such as:

terminate called after throwing an instance of 'doris::Exception'
what(): [E-7412] assert cast err:[E-7412] Bad cast from
...
doris::vectorized::ScannerSplitRunner::process_for(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/scan/scanner_scheduler.cpp:420
10# doris::vectorized::PrioritizedSplitRunner::process() at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/executor/time_sharing/prioritized_split_runner.cpp:104
11# doris::vectorized::TimeSharingTaskExecutor::_dispatch_thread() at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/executor/time_sharing/time_sharing_task_executor.cpp:568
12#
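
For readers unfamiliar with the failure mode, here is a minimal sketch of the idea, not the Doris code. An exception that escapes a thread's entry function ends in std::terminate, so the dispatch loop has to turn a throwing split into a recorded failure and keep going. `SplitOutcome` and `dispatch_all` are hypothetical names, and the loop runs on the calling thread to keep the example small.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a split's result; not the Doris Status class.
struct SplitOutcome {
    bool ok;
    std::string message;
};

// Simplified dispatch loop: each split is run inside try/catch so that one
// throwing split is recorded as a failure instead of unwinding out of the loop.
std::vector<SplitOutcome> dispatch_all(const std::vector<std::function<void()>>& splits) {
    std::vector<SplitOutcome> outcomes;
    outcomes.reserve(splits.size());
    for (const auto& split : splits) {
        try {
            split();
            outcomes.push_back({true, ""});
        } catch (const std::exception& e) {
            outcomes.push_back({false, e.what()});
        } catch (...) {
            outcomes.push_back({false, "unknown exception"});
        }
    }
    return outcomes;
}
```

Passing a list in which the second split throws std::runtime_error still runs the remaining splits; only the second outcome is marked as failed and carries the exception text.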

hello-stephen · 2026-04-02T07:30:33Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

Copilot

Pull request overview

Prevents backend worker-thread termination (and potential BE crash) when split->process() throws unexpectedly in the time-sharing scan executor by converting thrown exceptions into split failure statuses.

Changes:

Wrap PrioritizedSplitRunner::process() invocation in a try/catch to prevent exceptions from escaping the dispatch thread.
Map doris::Exception (including special-casing MEM_ALLOC_FAILED) and other exceptions to appropriate Status errors returned via Result.
Keep enable_thread_catch_bad_alloc scoped around split->process() to preserve existing memory-exception behavior.
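
The scoping in the last item can be illustrated with a small RAII guard. This is only a sketch: `catch_bad_alloc_depth` is a hypothetical stand-in for doris::enable_thread_catch_bad_alloc (assumed here to be a thread-local integer), and `ScopedCatchBadAlloc` plays the role that `Defer` plays in the patch.

```cpp
#include <stdexcept>

// Hypothetical stand-in for doris::enable_thread_catch_bad_alloc.
thread_local int catch_bad_alloc_depth = 0;

// Increments the counter on construction and decrements it on destruction,
// so the counter is restored on normal return and during stack unwinding.
class ScopedCatchBadAlloc {
public:
    ScopedCatchBadAlloc() { ++catch_bad_alloc_depth; }
    ~ScopedCatchBadAlloc() { --catch_bad_alloc_depth; }
    ScopedCatchBadAlloc(const ScopedCatchBadAlloc&) = delete;
    ScopedCatchBadAlloc& operator=(const ScopedCatchBadAlloc&) = delete;
};

// Returns the depth observed while the guard is alive, or throws on request.
int process_with_guard(bool should_throw) {
    ScopedCatchBadAlloc guard;
    if (should_throw) {
        throw std::runtime_error("split failed");
    }
    return catch_bad_alloc_depth;
}
```

Whether `process_with_guard` returns or throws, the counter is back at its previous value once the call has finished, which is the property the patch relies on when the guard sits inside the try block.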

Copilot · 2026-04-02T07:34:12Z

be/src/exec/scan/task_executor/time_sharing/time_sharing_task_executor.cpp

+        auto blocked_future_result = [&]() -> Result<SharedListenableFuture<Void>> {
+            try {
+                doris::enable_thread_catch_bad_alloc++;
+                Defer defer {[&]() { doris::enable_thread_catch_bad_alloc--; }};
+                return split->process();
+            } catch (const doris::Exception& e) {
+                if (e.code() == doris::ErrorCode::MEM_ALLOC_FAILED) {
+                    return unexpected(Status::MemoryLimitExceeded(
+                            "PreCatch error code:{}, {}, __FILE__:{}, __LINE__:{}, "
+                            "__FUNCTION__:{}",
+                            e.code(), e.to_string(), __FILE__, __LINE__, __PRETTY_FUNCTION__));
+                }
+                return unexpected(e.to_status());
+            } catch (const std::exception& e) {


This try/catch block duplicates the exception-to-Status mapping logic already implemented in common/exception.h (including the enable_thread_catch_bad_alloc guard and the "PreCatch" message). To avoid future divergence, consider factoring the conversion into a shared helper (e.g., a function that converts an Exception/std::exception to Status) and reuse it here for the Result<...> error path.
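
One possible shape for such a shared helper is a rethrow-based ("Lippincott") function, sketched below under loud assumptions: `Code`, `Status`, `Exception`, `current_exception_to_status` and `run_guarded` are all hypothetical stand-ins written for this example, and the real types in common/exception.h differ.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-ins; the real doris::Status and doris::Exception differ.
enum class Code { OK, INTERNAL_ERROR, MEM_ALLOC_FAILED, MEM_LIMIT_EXCEEDED };

struct Status {
    Code code;
    std::string msg;
    bool ok() const { return code == Code::OK; }
};

class Exception : public std::exception {
public:
    Exception(Code code, std::string msg) : _code(code), _msg(std::move(msg)) {}
    Code code() const { return _code; }
    const char* what() const noexcept override { return _msg.c_str(); }

private:
    Code _code;
    std::string _msg;
};

// Must be called from inside a catch block: rethrows the in-flight exception
// and maps it to a Status in one place. Handlers go from most to least specific.
inline Status current_exception_to_status() {
    try {
        throw;
    } catch (const Exception& e) {
        if (e.code() == Code::MEM_ALLOC_FAILED) {
            return {Code::MEM_LIMIT_EXCEEDED, std::string("PreCatch: ") + e.what()};
        }
        return {e.code(), e.what()};
    } catch (const std::exception& e) {
        return {Code::INTERNAL_ERROR, e.what()};
    } catch (...) {
        return {Code::INTERNAL_ERROR, "unknown exception"};
    }
}

// Example call site: any callable is run and its failure is returned as a Status.
template <typename Fn>
Status run_guarded(Fn&& fn) {
    try {
        fn();
        return {Code::OK, ""};
    } catch (...) {
        return current_exception_to_status();
    }
}
```

With a helper of this kind, both the existing macro path and the new Result<...> path could call the same mapping, so the MEM_ALLOC_FAILED special case would live in a single place.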

eldenmoon · 2026-04-02T07:52:06Z

/review

eldenmoon · 2026-04-02T07:54:28Z

run buildall

doris-robot · 2026-04-02T08:23:13Z

TPC-H: Total hot run time: 29125 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 8f13f3a212046cbc7af127306d470223b5a72f7e, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17628	3663	3658	3658
q2	q3	10655	926	601	601
q4	4682	470	355	355
q5	7485	1353	1141	1141
q6	188	164	137	137
q7	911	955	753	753
q8	9301	1380	1274	1274
q9	5402	5324	5252	5252
q10	6239	2021	1779	1779
q11	480	278	270	270
q12	841	681	509	509
q13	18046	2795	2169	2169
q14	284	287	262	262
q15	q16	861	861	788	788
q17	1028	1072	747	747
q18	6460	5683	5537	5537
q19	1150	1286	1065	1065
q20	595	546	412	412
q21	4376	2586	2063	2063
q22	483	401	353	353
Total cold run time: 97095 ms
Total hot run time: 29125 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4603	4588	4570	4570
q2	q3	4721	4775	4166	4166
q4	2049	2075	1324	1324
q5	4897	4896	5152	4896
q6	210	178	141	141
q7	2004	1795	1586	1586
q8	3341	3093	3039	3039
q9	8345	8286	8365	8286
q10	4601	4520	4262	4262
q11	574	399	382	382
q12	654	728	510	510
q13	2649	3050	2426	2426
q14	302	329	309	309
q15	q16	768	770	685	685
q17	1344	1261	1235	1235
q18	7982	7385	7044	7044
q19	1127	1156	1127	1127
q20	2272	2263	2078	2078
q21	5985	5351	4846	4846
q22	517	492	414	414
Total cold run time: 58945 ms
Total hot run time: 53326 ms

doris-robot · 2026-04-02T08:34:33Z

TPC-DS: Total hot run time: 179999 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 8f13f3a212046cbc7af127306d470223b5a72f7e, data reload: false

query5	4349	641	506	506
query6	339	230	204	204
query7	4216	621	326	326
query8	327	268	222	222
query9	8742	3878	3893	3878
query10	472	407	341	341
query11	6642	5481	5148	5148
query12	185	136	127	127
query13	1278	625	441	441
query14	5628	5191	4703	4703
query14_1	4170	4086	4094	4086
query15	205	211	181	181
query16	979	480	453	453
query17	938	752	628	628
query18	2458	498	367	367
query19	241	223	189	189
query20	135	130	131	130
query21	229	144	113	113
query22	13957	14870	14631	14631
query23	18353	17093	16682	16682
query23_1	16769	16867	16587	16587
query24	7468	1735	1321	1321
query24_1	1361	1340	1320	1320
query25	576	521	441	441
query26	1260	333	178	178
query27	2668	611	371	371
query28	4492	1879	1887	1879
query29	960	685	582	582
query30	300	234	192	192
query31	1083	1043	931	931
query32	97	70	72	70
query33	551	348	293	293
query34	1176	1147	700	700
query35	727	769	655	655
query36	1251	1184	1062	1062
query37	157	101	84	84
query38	3082	3012	2976	2976
query39	915	890	872	872
query39_1	832	840	831	831
query40	238	158	136	136
query41	61	59	58	58
query42	274	275	274	274
query43	311	314	277	277
query44	
query45	210	194	191	191
query46	1155	1286	800	800
query47	2342	2337	2246	2246
query48	386	400	294	294
query49	637	532	439	439
query50	710	285	217	217
query51	4331	4337	4269	4269
query52	278	283	263	263
query53	329	337	272	272
query54	334	285	263	263
query55	99	96	88	88
query56	325	322	328	322
query57	1746	1725	1485	1485
query58	305	281	278	278
query59	2872	2991	2733	2733
query60	339	334	326	326
query61	161	158	155	155
query62	703	622	573	573
query63	311	266	273	266
query64	5321	1452	1141	1141
query65	
query66	1499	484	411	411
query67	24280	24262	24163	24163
query68	
query69	485	347	330	330
query70	1006	1006	994	994
query71	375	337	323	323
query72	3231	2882	2594	2594
query73	813	778	456	456
query74	9824	9691	9618	9618
query75	3535	3366	3009	3009
query76	2297	1139	775	775
query77	396	398	334	334
query78	11381	11360	10802	10802
query79	1472	1144	827	827
query80	817	773	668	668
query81	465	273	243	243
query82	1399	150	119	119
query83	378	286	263	263
query84	306	144	116	116
query85	873	528	453	453
query86	388	344	279	279
query87	3318	3199	3103	3103
query88	3594	2714	2721	2714
query89	477	412	376	376
query90	1991	182	176	176
query91	181	169	141	141
query92	78	78	68	68
query93	904	881	501	501
query94	553	348	304	304
query95	646	358	419	358
query96	984	763	350	350
query97	2670	2679	2557	2557
query98	240	238	224	224
query99	1063	1074	966	966
Total cold run time: 258776 ms
Total hot run time: 179999 ms

doris-robot · 2026-04-02T09:01:42Z

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	52.93% (20046/37875)
Line Coverage	36.52% (188197/515326)
Region Coverage	32.80% (146182/445658)
Branch Coverage	33.94% (63969/188496)

hello-stephen · 2026-04-02T10:17:10Z

BE Regression && UT Coverage Report

Increment line coverage 43.48% (10/23) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	71.71% (26599/37094)
Line Coverage	54.67% (280857/513777)
Region Coverage	51.79% (232944/449771)
Branch Coverage	53.19% (100567/189062)

eldenmoon · 2026-04-02T12:08:25Z

skip check_coverage

morningman · 2026-04-02T17:52:02Z

/review

github-actions

Found 1 blocking issue.

[high] TimeSharingTaskExecutor now swallows exceptions from split->process() and routes them to _split_finished(split, status), but for scan splits that bypasses the existing ScannerScheduler error path that records the failure on ScanTask and pushes it back into ScannerContext. ScannerContext::submit_scan_task() increments _in_flight_tasks_num, and that counter is decremented only in ScannerContext::push_back_scan_task(). ScannerSplitRunner::close() is empty, so when _scanner_scan() (or anything inside _scan_func) throws, the worker no longer crashes, but the scan task is never pushed back, the error never reaches the consumer, and the context can hang waiting for an in-flight task that will never return.

Critical checkpoint conclusions:

Goal of the task: Partially accomplished. The patch prevents worker-thread termination, but the scanner failure path is no longer completed correctly, and the added test does not cover the affected scan path.
Modification size/focus: Small and focused.
Concurrency: No new lock-order issue found in the catch block itself, but split/task lifecycle accounting becomes inconsistent on the exception path.
Lifecycle management: Failed for ScannerSplitRunner on the new path because the task never re-enters ScannerContext::push_back_scan_task().
Configuration: Not applicable.
Compatibility: Not applicable.
Parallel code paths: Problematic. Generic ThrowingSplitRunner in the new test behaves differently from ScannerSplitRunner, whose close() is empty and whose failure propagation relies on the scheduler path.
Special conditional checks: Catch ordering is correct (doris::Exception -> std::exception -> ...).
Test coverage: Insufficient. The new test covers generic exception-to-status conversion only; it does not cover scanner splits or the promised MEM_ALLOC_FAILED remapping branch.
Observability: Warning log exists, but the real scanner error is no longer surfaced to the consumer on this path.
Transaction/persistence/data writes/FE-BE variable passing: Not applicable.
Performance: No material issue found in this small change.
Other issues: None beyond the blocking lifecycle/error-propagation issue above.

github-actions · 2026-04-02T18:05:12Z

be/src/exec/scan/task_executor/time_sharing/time_sharing_task_executor.cpp

        }

-        Result<SharedListenableFuture<Void>> blocked_future_result = split->process();
+        auto blocked_future_result = [&]() -> Result<SharedListenableFuture<Void>> {


Catching split->process() here avoids the crash, but it also changes the scanner control flow in a way that loses the failure signal. For ScannerSplitRunner, the normal error path is inside ScannerScheduler::submit(): the work_func lambda catches _scanner_scan() failures, calls scanner_ref->set_status(status), and then ctx->push_back_scan_task(scanner_ref). That push-back is what decrements ScannerContext::_in_flight_tasks_num and wakes the consumer.

On this new path we skip that entire layer and go straight to _split_finished(split, status). ScannerSplitRunner::close() is empty, so nothing completes its _completion_future, nothing pushes the ScanTask back into the context, and the consumer never sees the error. In practice this can trade the BE crash for a hung scan, because submit_scan_task() increments _in_flight_tasks_num and push_back_scan_task() is the only place that decrements it.

The new unit test does not catch this because ThrowingSplitRunner::close() marks itself finished, which is not how ScannerSplitRunner behaves.
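
The accounting problem can be reproduced with a toy model. The sketch below is not the Doris implementation: `ToyScanContext` and `run_split` are hypothetical, and they only borrow the method names from the description above to show how a swallowed exception leaves the in-flight counter stuck.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Toy model of the accounting described above; not the Doris implementation.
struct ToyScanContext {
    int in_flight_tasks = 0;
    std::vector<std::string> returned_statuses; // "" means the task returned OK

    void submit_scan_task() { ++in_flight_tasks; }

    // The only place where the in-flight counter goes down.
    void push_back_scan_task(const std::string& status) {
        --in_flight_tasks;
        returned_statuses.push_back(status);
    }
};

// Executor-level catch. With push_back_on_failure == false the exception is
// swallowed: no crash, but the task is never handed back to the context.
void run_split(ToyScanContext& ctx, const std::function<void()>& process,
               bool push_back_on_failure) {
    ctx.submit_scan_task();
    try {
        process();
        ctx.push_back_scan_task(""); // normal completion path
    } catch (const std::exception& e) {
        if (push_back_on_failure) {
            ctx.push_back_scan_task(e.what()); // failure reaches the consumer
        }
    }
}
```

Running a throwing split with `push_back_on_failure` set to false leaves `in_flight_tasks` at 1 with nothing returned, which is the state a consumer would wait on indefinitely. Setting it to true brings the counter back to 0 and surfaces the error text.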

github-actions · 2026-04-03T01:17:32Z

PR approved by at least one committer and no changes requested.

github-actions · 2026-04-03T01:17:34Z

PR approved by anyone and no changes requested.

kaka11chen

Why wasn't this exception caught here, by RETURN_IF_CATCH_EXCEPTION?

auto sumbit_task = [&]() {
    auto work_func = [scanner_ref = scan_task, ctx]() {
        auto status = [&] {
            RETURN_IF_CATCH_EXCEPTION(_scanner_scan(ctx, scanner_ref));
            return Status::OK();
        }();

        if (!status.ok()) {
            scanner_ref->set_status(status);
            ctx->push_back_scan_task(scanner_ref);
            return true;
        }
        return scanner_ref->is_eos();
    };
    SimplifiedScanTask simple_scan_task = {work_func, ctx, scan_task};
    return this->submit_scan_task(simple_scan_task);
};

Copilot AI review requested due to automatic review settings April 2, 2026 07:30

Copilot started reviewing on behalf of eldenmoon April 2, 2026 07:31

Copilot AI reviewed Apr 2, 2026

eldenmoon force-pushed the exp-safe-executor branch from c1429c7 to a67425a April 2, 2026 07:44

eldenmoon added the dev/4.1.x, dev/4.0.x, and usercase (Important user case type label) labels Apr 2, 2026

Update be/src/exec/scan/task_executor/time_sharing/time_sharing_task_…

8f13f3a

…executor.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

github-actions bot reviewed Apr 2, 2026

yiguolei approved these changes Apr 3, 2026

github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 3, 2026

github-actions bot added the reviewed label Apr 3, 2026

kaka11chen reviewed Apr 3, 2026

7 participants