@liaoxin01
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merges this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Copilot AI review requested due to automatic review settings December 28, 2025 09:55
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@liaoxin01 liaoxin01 marked this pull request as draft December 28, 2025 09:55
@liaoxin01
Contributor Author

run buildall

Copilot AI left a comment

Pull request overview

This PR implements asynchronous future handling for stream load operations to prevent blocking libevent threads. The implementation uses event_base_once to defer callback execution to the libevent thread, avoiding long-running operations in the fragment manager thread.

Key Changes:

  • Introduced async callback mechanism using libevent's event_base_once to handle stream load completion without blocking
  • Added event_base, http_request, and stream_load_action fields to StreamLoadContext for async processing coordination
  • Refactored completion logic to support both sync (fallback) and async modes based on event_base availability
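
The hand-off described above can be sketched without libevent. This is a minimal illustration only: `MiniLoop` and its `once` method are hypothetical stand-ins for an event loop and for scheduling a one-shot callback on it (the role `event_base_once` plays in this PR), not the PR's actual code. The point is that the producer thread returns immediately and the completion logic runs on the loop thread.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical stand-in for an event loop: one thread drains a queue of
// one-shot tasks, similar in spirit to scheduling with event_base_once.
class MiniLoop {
public:
    MiniLoop() : _worker([this] { run(); }) {}
    ~MiniLoop() { stop(); }

    // Queue a one-shot task; returns false if the loop is already stopped.
    bool once(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(_mu);
            if (_stopped) return false;
            _tasks.push(std::move(task));
        }
        _cv.notify_one();
        return true;
    }

    // Let queued tasks finish, then join the loop thread. Safe to call twice.
    void stop() {
        {
            std::lock_guard<std::mutex> lock(_mu);
            _stopped = true;
        }
        _cv.notify_one();
        if (_worker.joinable()) _worker.join();
    }

    std::thread::id thread_id() const { return _worker.get_id(); }

private:
    void run() {
        std::unique_lock<std::mutex> lock(_mu);
        for (;;) {
            _cv.wait(lock, [this] { return _stopped || !_tasks.empty(); });
            if (_tasks.empty()) return; // stopped and fully drained
            auto task = std::move(_tasks.front());
            _tasks.pop();
            lock.unlock();
            task(); // run outside the lock so tasks may queue more work
            lock.lock();
        }
    }

    std::mutex _mu;
    std::condition_variable _cv;
    std::queue<std::function<void()>> _tasks;
    bool _stopped = false;
    std::thread _worker; // declared last so the other members exist before it starts
};
```

A caller on the fragment side would call `loop.once(...)` with the completion logic and return without waiting. The task then runs on the thread reported by `loop.thread_id()`.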

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 13 comments.

File: Description

  • be/src/runtime/stream_load/stream_load_executor.cpp: Implements async callback infrastructure with StreamLoadAsyncCallbackData and event_base_once integration for non-blocking fragment completion handling
  • be/src/runtime/stream_load/stream_load_context.h: Adds event_base, http_request, and stream_load_action pointers to support async callback coordination
  • be/src/http/action/stream_load.h: Adds continue_handle_after_future public method and _finalize_request private method to support async completion flow
  • be/src/http/action/stream_load.cpp: Refactors _handle to support async mode, implements continue_handle_after_future callback, and extracts common finalization logic into _finalize_request


Comment on lines +1045 to +1076
if (ctx->need_rollback) {
    _exec_env->stream_load_executor()->rollback_txn(ctx.get());
    ctx->need_rollback = false;
}
Copilot AI Dec 28, 2025

Duplicated rollback logic: the rollback logic appears in both continue_handle_after_future (lines 995 and 1010) and _finalize_request (line 1046). This duplication makes the code harder to maintain if the rollback behavior needs to change. There is also a risk of a double rollback if both code paths execute, although the need_rollback flag should prevent this in most cases.

Suggested change
-if (ctx->need_rollback) {
-    _exec_env->stream_load_executor()->rollback_txn(ctx.get());
-    ctx->need_rollback = false;
-}
+auto rollback_if_needed = [this](const std::shared_ptr<StreamLoadContext>& c) {
+    if (c->need_rollback) {
+        _exec_env->stream_load_executor()->rollback_txn(c.get());
+        c->need_rollback = false;
+    }
+};
+rollback_if_needed(ctx);
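
As a self-contained illustration of the idea behind this suggestion, the sketch below pulls the check-and-clear into one helper so that a second call from another code path is a no-op. `FakeContext` and `FakeExecutor` are hypothetical stand-ins for StreamLoadContext and the stream load executor. The helper is not thread-safe as written, so it assumes all callers run on the same thread.

```cpp
#include <cassert>
#include <memory>

// Hypothetical, simplified stand-ins for StreamLoadContext and the executor.
struct FakeContext {
    bool need_rollback = false;
};

struct FakeExecutor {
    int rollback_calls = 0;
    void rollback_txn(FakeContext* /*ctx*/) { ++rollback_calls; }
};

// Roll back at most once per context: the flag is cleared after the first
// rollback, so later calls from other code paths do nothing.
// Returns true if a rollback was actually issued.
inline bool rollback_if_needed(FakeExecutor& executor,
                               const std::shared_ptr<FakeContext>& ctx) {
    if (!ctx || !ctx->need_rollback) return false;
    executor.rollback_txn(ctx.get());
    ctx->need_rollback = false;
    return true;
}
```

Calling the helper from every completion path keeps the rollback policy in one place. If two threads could reach it concurrently, the flag would need to be an atomic exchange or be guarded by a lock.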

Copilot uses AI. Check for mistakes.
// Note: In async mode, no code is waiting for promise (_handle() has returned),
// so no need to set promise
LOG(ERROR) << "event_base_once failed, cannot send async callback, ctx="
<< ctx->id.to_string() << ", errno=" << errno;
Copilot AI Dec 28, 2025

Memory leak: If event_base_once fails (line 195), the callback_data allocated on lines 181-186 is never deleted, causing a memory leak. The callback_data should be deleted in the error path.

Suggested change
-            << ctx->id.to_string() << ", errno=" << errno;
+            << ctx->id.to_string() << ", errno=" << errno;
+// Clean up callback_data since callback will never be invoked on failure
+delete callback_data;
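
An alternative to a manual delete on the error path is to keep the payload in a std::unique_ptr until scheduling succeeds. The sketch below is an illustration under stated assumptions: `FakeScheduler::schedule_once` is a hypothetical stand-in for a C-style one-shot scheduling call that returns 0 on success and -1 on failure, and `CallbackData` stands in for StreamLoadAsyncCallbackData.

```cpp
#include <cassert>
#include <memory>

// Hypothetical payload, standing in for StreamLoadAsyncCallbackData.
struct CallbackData {
    int ctx_id = 0;
};

using callback_fn = void (*)(void* arg);

// Hypothetical scheduler with a C-style contract: returns 0 on success (the
// callback will run later and owns `arg`), or -1 on failure (the callback
// will never run).
struct FakeScheduler {
    bool fail = false;
    callback_fn pending_cb = nullptr;
    void* pending_arg = nullptr;

    int schedule_once(callback_fn cb, void* arg) {
        if (fail) return -1;
        pending_cb = cb;
        pending_arg = arg;
        return 0;
    }
    void run_pending() {
        if (pending_cb != nullptr) {
            callback_fn cb = pending_cb;
            pending_cb = nullptr;
            cb(pending_arg);
        }
    }
};

static int g_callbacks_run = 0;

// The callback takes ownership back and frees the payload when it returns.
static void on_complete(void* arg) {
    std::unique_ptr<CallbackData> data(static_cast<CallbackData*>(arg));
    ++g_callbacks_run;
}

// Ownership stays in the unique_ptr until scheduling succeeds. On failure the
// unique_ptr frees the payload when it goes out of scope, so nothing leaks.
static bool submit(FakeScheduler& scheduler, int ctx_id) {
    auto data = std::make_unique<CallbackData>();
    data->ctx_id = ctx_id;
    if (scheduler.schedule_once(&on_complete, data.get()) != 0) {
        return false; // `data` is destroyed here
    }
    (void)data.release(); // the pending callback now owns the payload
    return true;
}
```

With this shape there is no error path that can forget the cleanup, because the release happens only after the scheduler has accepted the callback.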

Comment on lines +267 to +270
struct event_base* event_base = nullptr; // libevent event loop
HttpRequest* http_request = nullptr; // HTTP request reference
StreamLoadAction* stream_load_action = nullptr; // StreamLoadAction instance pointer
Copilot AI Dec 28, 2025

Raw pointer members without ownership semantics: The StreamLoadContext class now contains raw pointers (event_base, http_request, stream_load_action) that don't express ownership. These pointers are set to nullptr in free_handler_ctx but there's no clear lifetime management. If the HttpRequest or StreamLoadAction is destroyed while the async callback is pending or executing, accessing these pointers will cause undefined behavior. Consider using weak_ptr or another mechanism to safely check if these objects are still valid.
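
The weak_ptr idea can be sketched as follows. This assumes the target object is owned through std::shared_ptr, which the PR does not state for HttpRequest or StreamLoadAction; `FakeRequest` and `make_completion` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for an object whose lifetime the callback does not
// control (for example the request or the action).
struct FakeRequest {
    std::string last_reply;
    void send_reply(const std::string& body) { last_reply = body; }
};

// Build a deferred completion callback that holds only a weak reference.
// The callback returns true if the reply was sent and false if the request
// had already been destroyed.
inline std::function<bool()> make_completion(const std::shared_ptr<FakeRequest>& req,
                                             std::string body) {
    std::weak_ptr<FakeRequest> weak_req = req;
    return [weak_req, body = std::move(body)]() {
        if (auto alive = weak_req.lock()) { // keeps the request alive for this call
            alive->send_reply(body);
            return true;
        }
        return false; // target destroyed: skip instead of dereferencing
    };
}
```

If the objects are not shared_ptr-managed, a comparable effect would need another validity mechanism, such as a generation counter or an explicit cancel of the pending callback when the object is torn down.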

Comment on lines 1041 to 1056
void StreamLoadAction::_finalize_request(HttpRequest* req, std::shared_ptr<StreamLoadContext> ctx) {
    ctx->load_cost_millis = UnixMillis() - ctx->start_millis;

    if (!ctx->status.ok() && !ctx->status.is<PUBLISH_TIMEOUT>()) {
        if (ctx->need_rollback) {
            _exec_env->stream_load_executor()->rollback_txn(ctx.get());
            ctx->need_rollback = false;
        }
        if (ctx->body_sink != nullptr) {
            ctx->body_sink->cancel(ctx->status.to_string());
        }
    }

    auto str = ctx->to_json();
    str = str + '\n';
    HttpChannel::send_reply(req, str);
Copilot AI Dec 28, 2025

Potential use-after-free: In _finalize_request, ctx->http_request is passed as a parameter and used on line 1056 to send a reply. However, the http_request pointer in ctx may have been set to nullptr by free_handler_ctx (line 441) which could be called if the client disconnects. While the parameter is passed by value, if free_handler_ctx is called concurrently or the underlying HttpRequest object is destroyed, this will cause a crash.

Comment on lines 196 to 202
// event_base_once failed, log error
// Note: In async mode, no code is waiting for promise (_handle() has returned),
// so no need to set promise
LOG(ERROR) << "event_base_once failed, cannot send async callback, ctx="
<< ctx->id.to_string() << ", errno=" << errno;
// Note: Cannot send HTTP response in this case, request may be invalid
// No need to set promise, as no code is waiting in async mode
Copilot AI Dec 28, 2025

Inconsistent error handling: When event_base_once fails (line 195), the error is logged but no corrective action is taken. The request will be left in an incomplete state - the fragment execution has finished but no response will be sent to the client, and the transaction state is unclear. At minimum, this should attempt to finalize the request or set an error status that will be handled elsewhere.

Suggested change
-// event_base_once failed, log error
-// Note: In async mode, no code is waiting for promise (_handle() has returned),
-// so no need to set promise
-LOG(ERROR) << "event_base_once failed, cannot send async callback, ctx="
-           << ctx->id.to_string() << ", errno=" << errno;
-// Note: Cannot send HTTP response in this case, request may be invalid
-// No need to set promise, as no code is waiting in async mode
+// event_base_once failed, log error and fall back to executing callback
+// synchronously in the current thread to avoid leaving the request incomplete.
+LOG(ERROR) << "event_base_once failed, cannot send async callback, ctx="
+           << ctx->id.to_string() << ", errno=" << errno;
+// Best-effort fallback: directly invoke the callback so that the normal
+// completion logic (including HTTP response / txn handling) still runs.
+// The callback will delete callback_data.
+stream_load_async_callback(-1, EV_TIMEOUT, callback_data);
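
The fallback proposed here amounts to "defer if possible, otherwise finish inline". A minimal sketch of that control flow follows, with `FakeLoop` as a hypothetical scheduler that returns 0 on success and -1 on failure. It deliberately ignores the question of whether the completion logic is safe to run on the calling thread, which would need to be confirmed for the real callback.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical scheduler: stores the task and returns 0 on success, or
// returns -1 without storing anything on failure.
struct FakeLoop {
    bool fail = false;
    std::function<void()> pending;
    int schedule_once(std::function<void()> task) {
        if (fail) return -1;
        pending = std::move(task);
        return 0;
    }
};

enum class Completion { kDeferred, kInline };

// Try to defer `finish` to the loop. If scheduling fails, run it on the
// calling thread so the request is still finalized exactly once.
inline Completion finish_deferred_or_inline(FakeLoop& loop,
                                            const std::function<void()>& finish) {
    if (loop.schedule_once(finish) == 0) {
        return Completion::kDeferred;
    }
    finish(); // best-effort fallback
    return Completion::kInline;
}
```

Either branch runs the completion logic once, so the client is not left without a response when scheduling fails.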

-Status StreamLoadAction::_handle(std::shared_ptr<StreamLoadContext> ctx) {
+Status StreamLoadAction::_handle(std::shared_ptr<StreamLoadContext> ctx, HttpRequest* req) {
     if (ctx->body_bytes > 0 && ctx->receive_bytes != ctx->body_bytes) {
         LOG(WARNING) << "recevie body don't equal with body bytes, body_bytes=" << ctx->body_bytes
Copilot AI Dec 28, 2025

Spelling error: "recevie" should be spelled "receive".

Suggested change
-LOG(WARNING) << "recevie body don't equal with body bytes, body_bytes=" << ctx->body_bytes
+LOG(WARNING) << "receive body don't equal with body bytes, body_bytes=" << ctx->body_bytes

// Save other state needed in callback
bool need_rollback = ctx->need_rollback;
bool need_commit_self = ctx->need_commit_self;
bool body_sink_cancelled = (ctx->body_sink != nullptr && ctx->body_sink->cancelled());
Copilot AI Dec 28, 2025

Double access to body_sink: The code reads ctx->body_sink and checks if it's cancelled on line 178, but this happens in the fragment manager thread. Later, the callback may access body_sink in a different thread (libevent thread). If body_sink is not thread-safe or can be modified/destroyed between these accesses, this could lead to undefined behavior. The body_sink_cancelled flag is captured but the sink itself may still be accessed in continue_handle_after_future.
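
One way to make this kind of cross-thread access safe is sketched below: the sink guards its own cancel state, and every thread holds a shared_ptr copy so the sink cannot be destroyed between accesses. `FakeSink` is a hypothetical class; whether the real body_sink already provides equivalent guarantees is not stated in this review.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical sink whose cancel state may be touched from two threads.
class FakeSink {
public:
    // The first caller wins; later calls keep the original reason.
    void cancel(const std::string& reason) {
        std::lock_guard<std::mutex> lock(_mu);
        if (!_cancelled.load(std::memory_order_relaxed)) {
            _reason = reason;
            _cancelled.store(true, std::memory_order_release);
        }
    }
    bool cancelled() const { return _cancelled.load(std::memory_order_acquire); }
    std::string reason() const {
        std::lock_guard<std::mutex> lock(_mu);
        return _reason;
    }

private:
    mutable std::mutex _mu;
    std::atomic<bool> _cancelled{false};
    std::string _reason;
};

// Cancel the same sink from two threads. The shared_ptr copies captured by
// the lambdas keep the sink alive until both threads are done with it.
inline std::string cancel_from_two_threads(const std::shared_ptr<FakeSink>& sink) {
    std::thread a([sink] { sink->cancel("fragment failed"); });
    std::thread b([sink] { sink->cancel("client disconnected"); });
    a.join();
    b.join();
    return sink->reason();
}
```

Which reason is recorded depends on thread scheduling, but the state is always consistent and the sink outlives both accesses.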

@liaoxin01
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 34928 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit 73fd4a0ea7795eede985750322ae18773547daae, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17621	4257	4047	4047
q2	2020	343	234	234
q3	10180	1321	760	760
q4	10225	907	321	321
q5	7526	2117	1922	1922
q6	183	167	136	136
q7	1002	850	713	713
q8	9349	1456	1078	1078
q9	6967	5326	5270	5270
q10	6840	2412	1985	1985
q11	527	335	308	308
q12	646	763	576	576
q13	17778	3655	3043	3043
q14	291	295	279	279
q15	605	499	509	499
q16	707	670	650	650
q17	692	738	591	591
q18	7624	7065	7115	7065
q19	1117	970	607	607
q20	416	359	252	252
q21	4240	3918	3620	3620
q22	1049	987	972	972
Total cold run time: 107605 ms
Total hot run time: 34928 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	4081	4006	4045	4006
q2	322	392	324	324
q3	2170	2729	2272	2272
q4	1344	1756	1316	1316
q5	4216	4476	4648	4476
q6	235	181	139	139
q7	2067	1996	1851	1851
q8	2612	2580	2540	2540
q9	7576	7562	7641	7562
q10	3048	3238	2856	2856
q11	640	545	512	512
q12	695	866	669	669
q13	3501	3989	3349	3349
q14	328	294	288	288
q15	550	501	505	501
q16	634	692	632	632
q17	1192	1443	1399	1399
q18	7993	7694	7521	7521
q19	906	922	930	922
q20	1941	1959	1790	1790
q21	4602	4348	4220	4220
q22	1041	1036	1002	1002
Total cold run time: 51694 ms
Total hot run time: 50147 ms

@doris-robot

TPC-DS: Total hot run time: 179035 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 73fd4a0ea7795eede985750322ae18773547daae, data reload: false

query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
query5	4510	593	448	448
query6	324	237	233	233
query7	4221	468	275	275
query8	311	258	240	240
query9	8782	2558	2550	2550
query10	485	369	353	353
query11	15337	14898	14758	14758
query12	186	120	119	119
query13	1297	510	391	391
query14	5927	2984	2746	2746
query14_1	2659	2629	2639	2629
query15	216	200	175	175
query16	844	473	477	473
query17	1143	710	616	616
query18	2437	443	350	350
query19	242	241	214	214
query20	118	120	118	118
query21	219	152	119	119
query22	4065	3922	4042	3922
query23	16830	16176	15836	15836
query23_1	16093	16152	16224	16152
query24	7458	1672	1218	1218
query24_1	1249	1223	1263	1223
query25	591	503	454	454
query26	829	269	164	164
query27	2753	467	306	306
query28	4459	2133	2129	2129
query29	817	576	465	465
query30	319	248	213	213
query31	814	741	615	615
query32	85	69	71	69
query33	547	354	304	304
query34	897	922	540	540
query35	761	810	715	715
query36	917	895	802	802
query37	133	98	77	77
query38	3017	3095	3011	3011
query39	757	750	726	726
query39_1	707	699	702	699
query40	219	138	118	118
query41	68	63	64	63
query42	110	107	104	104
query43	435	437	400	400
query44	1313	750	738	738
query45	189	196	183	183
query46	880	969	602	602
query47	1669	1750	1633	1633
query48	327	319	256	256
query49	612	460	351	351
query50	660	293	222	222
query51	3806	3810	3890	3810
query52	106	109	99	99
query53	323	357	291	291
query54	296	263	253	253
query55	78	75	70	70
query56	293	289	302	289
query57	1189	1125	1080	1080
query58	277	255	255	255
query59	2382	2494	2400	2400
query60	311	307	333	307
query61	165	159	164	159
query62	731	714	728	714
query63	332	292	308	292
query64	4443	1321	1010	1010
query65	4007	3939	3967	3939
query66	1401	447	316	316
query67	15163	15036	14770	14770
query68	8474	986	708	708
query69	504	355	312	312
query70	1083	965	963	963
query71	371	309	286	286
query72	6123	4814	4699	4699
query73	664	546	304	304
query74	9015	8952	8639	8639
query75	3206	3172	2824	2824
query76	3808	1137	761	761
query77	568	399	306	306
query78	9470	9602	8844	8844
query79	1622	866	630	630
query80	739	668	551	551
query81	524	269	240	240
query82	203	127	98	98
query83	263	259	239	239
query84	258	120	103	103
query85	905	522	472	472
query86	392	292	275	275
query87	3292	3244	3105	3105
query88	4429	2297	2312	2297
query89	478	425	397	397
query90	2216	169	162	162
query91	171	170	145	145
query92	90	72	62	62
query93	1650	902	559	559
query94	479	307	281	281
query95	572	332	309	309
query96	598	497	209	209
query97	2258	2300	2233	2233
query98	214	197	195	195
query99	1297	1428	1363	1363
Total cold run time: 261235 ms
Total hot run time: 179035 ms

@doris-robot

ClickBench: Total hot run time: 27.27 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 73fd4a0ea7795eede985750322ae18773547daae, data reload: false

query	cold (s)	hot run 1 (s)	hot run 2 (s)
query1	0.06	0.05	0.06
query2	0.12	0.06	0.05
query3	0.26	0.09	0.09
query4	1.61	0.12	0.11
query5	0.27	0.27	0.26
query6	1.17	0.64	0.64
query7	0.04	0.04	0.03
query8	0.06	0.04	0.04
query9	0.57	0.52	0.51
query10	0.55	0.56	0.56
query11	0.16	0.11	0.11
query12	0.14	0.11	0.12
query13	0.62	0.61	0.61
query14	0.97	1.00	0.98
query15	0.81	0.80	0.80
query16	0.39	0.40	0.42
query17	1.03	0.98	1.06
query18	0.23	0.22	0.21
query19	1.88	1.76	1.80
query20	0.02	0.01	0.02
query21	15.47	0.30	0.13
query22	4.95	0.05	0.04
query23	15.80	0.29	0.11
query24	2.00	0.49	0.30
query25	0.09	0.08	0.06
query26	0.15	0.13	0.13
query27	0.06	0.05	0.05
query28	3.99	1.21	1.03
query29	12.61	3.94	3.18
query30	0.32	0.17	0.13
query31	2.82	0.61	0.39
query32	3.24	0.54	0.45
query33	3.06	3.04	3.03
query34	16.96	5.18	4.55
query35	4.55	4.54	4.52
query36	0.66	0.51	0.50
query37	0.11	0.08	0.07
query38	0.08	0.04	0.04
query39	0.05	0.04	0.03
query40	0.18	0.15	0.13
query41	0.09	0.04	0.03
query42	0.04	0.03	0.03
query43	0.05	0.03	0.03
Total cold run time: 98.29 s
Total hot run time: 27.27 s

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/148) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.39% (18952/35494)
Line Coverage 39.25% (175777/447817)
Region Coverage 33.81% (135982/402208)
Branch Coverage 34.75% (58723/169006)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 80.60% (162/201) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.87% (25632/34700)
Line Coverage 61.27% (273709/446700)
Region Coverage 56.27% (228684/406436)
Branch Coverage 58.09% (98520/169598)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 80.60% (162/201) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.86% (25631/34700)
Line Coverage 61.27% (273690/446700)
Region Coverage 56.25% (228623/406436)
Branch Coverage 58.08% (98503/169598)

