
[fix](group commit) support forward group commit stream load #63594

Open
mymeiyi wants to merge 2 commits into apache:master from mymeiyi:forward-group-commit-stream-load

Contributor

@mymeiyi mymeiyi commented May 25, 2026

No description provided.

Copilot AI review requested due to automatic review settings May 25, 2026 07:09
Contributor

Copilot AI left a comment


Pull request overview

This PR adds support for forwarding group-commit stream-load planning requests from the master FE to follower FEs (round-robin) to better utilize follower capacity, and introduces/adjusts related FE configuration plus a regression test for multi-FE follower mode.

Changes:

  • Add master-side forwarding logic in FrontendServiceImpl.streamLoadPut() for non-off_mode group commit requests (guarded by a new config).
  • Introduce enable_forward_group_commit_stream_load FE config toggle.
  • Enable enable_group_commit_streamload_be_forward by default and add a regression test that exercises group-commit stream load against multiple FEs in follower mode.
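As a rough illustration of the first two bullets, the sketch below shows one way a config-guarded round-robin choice could look. It is not the PR's code: `FollowerPicker`, its parameters, and the address strings are hypothetical, and the real logic lives in `FrontendServiceImpl.streamLoadPut()`.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper, not part of Doris: picks a follower FE in round-robin order. */
final class FollowerPicker {
    // Monotonic counter shared by all request threads.
    private final AtomicLong next = new AtomicLong();

    /**
     * Returns the follower address to forward to, or empty when the request
     * should be planned locally (forwarding disabled, not the master,
     * group commit in off_mode, or no follower available).
     */
    Optional<String> pick(boolean forwardEnabled, boolean isMaster,
                          boolean groupCommitOff, List<String> followers) {
        if (!forwardEnabled || !isMaster || groupCommitOff || followers.isEmpty()) {
            return Optional.empty();
        }
        // floorMod keeps the index non-negative even if the counter overflows.
        int idx = (int) Math.floorMod(next.getAndIncrement(), (long) followers.size());
        return Optional.of(followers.get(idx));
    }
}
```

Here `forwardEnabled` stands in for the value of `enable_forward_group_commit_stream_load`; the counter only advances when a request is actually forwarded.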

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| regression-test/suites/load_p0/stream_load/test_group_commit_stream_load_multi_follower.groovy | New docker regression test for group-commit stream load with 3 FEs in follower mode. |
| fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java | Forward group-commit streamLoadPut requests from master to follower FE via Thrift client pool. |
| fe/fe-common/src/main/java/org/apache/doris/common/Config.java | Add FE forwarding config and change default for BE-forwarding config. |


Comment thread fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java Outdated
Comment thread fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java Outdated
Comment thread fe/fe-common/src/main/java/org/apache/doris/common/Config.java
Contributor Author

mymeiyi commented May 25, 2026

/review

Contributor

@github-actions github-actions Bot left a comment


I found one additional blocking issue beyond the existing inline threads. The master now synchronously forwards group-commit streamLoadPut requests to a follower, but the follower's group-commit planning path calls back to the master to select the BE, so concurrent requests can occupy all master RPC workers waiting for followers while those followers are waiting for master RPC workers. Existing review threads already cover the config default change, table type/catch behavior, and test style issues, so I did not duplicate them.

Critical checkpoints:

  • Goal: forwarding group-commit planning to followers is partially implemented and has a docker regression test, but the nested RPC path can fail under concurrency.
  • Scope: change is small, but it introduces a cross-FE RPC cycle on a hot load path.
  • Concurrency: relevant; master RPC worker threads can block waiting on follower RPCs while followers need master RPC capacity for getGroupCommitLoadBeId.
  • Lifecycle/static init: no additional issue found.
  • Config: new mutable master-only config is present; existing thread covers the separate default flip.
  • Compatibility/persistence/data versioning: no incompatible storage/editlog change found.
  • Parallel paths: follower-direct group-commit path exists and is affected by the new master-forward path.
  • Tests: added test verifies success with a small number of serial loads, but does not cover the concurrent worker-exhaustion case.
  • Observability: logs exist, no additional blocker.
  • User focus: no additional user-provided focus points.
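The worker-exhaustion concern can be reproduced in miniature with two bounded thread pools. The sketch below is only a model under stated assumptions: the pools stand in for FE RPC worker pools, `RpcCycleDemo` and its task names are hypothetical, and it assumes the follower really does call back to the master, which is the point under dispute in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Toy model of a master -> follower -> master RPC cycle; not Doris code. */
final class RpcCycleDemo {
    /** Returns how many of {@code requests} (>= 1) loads finish within the timeout. */
    static int completed(int masterWorkers, int requests, long timeoutMs)
            throws InterruptedException {
        ExecutorService master = Executors.newFixedThreadPool(masterWorkers);
        ExecutorService follower = Executors.newFixedThreadPool(requests);
        CountDownLatch gate = new CountDownLatch(1);

        // Innermost step: runs on a master worker (assumed follower callback).
        Callable<Long> selectBackendOnMaster = () -> 10001L;
        // Follower worker blocks until a master worker serves the callback.
        Callable<Long> planOnFollower = () -> master.submit(selectBackendOnMaster).get();
        // Master worker blocks until the follower returns the plan.
        Callable<Long> forwardFromMaster = () -> {
            gate.await(); // hold until every request has been submitted
            return follower.submit(planOnFollower).get();
        };

        List<Future<Long>> loads = new ArrayList<>();
        try {
            for (int i = 0; i < requests; i++) {
                loads.add(master.submit(forwardFromMaster));
            }
            gate.countDown();
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            int done = 0;
            for (Future<Long> load : loads) {
                try {
                    long left = Math.max(0L, deadline - System.nanoTime());
                    load.get(left, TimeUnit.NANOSECONDS);
                    done++;
                } catch (ExecutionException | TimeoutException e) {
                    // Counted as stalled: no free master worker served the callback.
                }
            }
            return done;
        } finally {
            master.shutdownNow();   // interrupt any workers still blocked
            follower.shutdownNow();
        }
    }
}
```

In this model, `completed(4, 2, 2000)` finishes both loads because two master workers stay free for the callbacks, while `completed(2, 2, 300)` finishes none: both master workers wait on followers, and both followers wait on a master worker that never frees up.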

try {
client = ClientPool.frontendPool.borrowObject(address);
TStreamLoadPutResult streamLoadPutResult = client.streamLoadPut(request);
ok = true;
Contributor


This synchronous forward creates a nested FE RPC cycle for every forwarded group-commit load: the master RPC worker blocks here waiting for the follower, and the follower's StreamLoadHandler.generatePlan() calls GroupCommitManager.selectBackendForGroupCommit(), whose non-master branch calls back to the master via MasterOpExecutor.getGroupCommitLoadBeId(). With enough concurrent group-commit stream loads, all master RPC workers can be occupied waiting on followers while the followers are waiting for free master RPC workers, causing the loads to stall until RPC timeout. The previous follower-direct path only used one follower-to-master RPC and did not hold a master worker while waiting for that callback. Please avoid the master->follower->master cycle, or make the forwarded request carry the already-selected group-commit backend/table routing decision so the follower does not need to call back to the master.
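One possible shape for the second suggestion, with the forwarded request carrying the decision the master already made, is sketched below. Every name here (`ForwardedLoadRequest`, `preselectedBackendId`, `MasterCallback`, `FollowerPlanner`) is hypothetical; the actual Thrift request type and any field it would need are not shown in this PR.

```java
import java.util.OptionalLong;

/** Hypothetical forwarded request: carries the backend the master chose, if any. */
final class ForwardedLoadRequest {
    final String table;
    final OptionalLong preselectedBackendId;

    ForwardedLoadRequest(String table, OptionalLong preselectedBackendId) {
        this.table = table;
        this.preselectedBackendId = preselectedBackendId;
    }
}

/** Stand-in for the follower-to-master RPC; not a real Doris interface. */
interface MasterCallback {
    long selectBackend(String table);
}

/** Hypothetical follower-side planning step. */
final class FollowerPlanner {
    static long resolveBackend(ForwardedLoadRequest request, MasterCallback master) {
        // Use the master's earlier decision when present, so no callback is needed.
        if (request.preselectedBackendId.isPresent()) {
            return request.preselectedBackendId.getAsLong();
        }
        // Follower-direct path: fall back to asking the master.
        return master.selectBackend(request.table);
    }
}
```

With a preselected backend the follower never touches `MasterCallback`, so a forwarded load would hold one master worker and one follower worker but would not need a second master worker.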

Contributor Author


StreamLoadHandler.generatePlan() does not call GroupCommitManager.selectBackendForGroupCommit()

Contributor Author

mymeiyi commented May 25, 2026

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31418 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit e5b7daa5eb1cae8f2f154b483b24b41071d68bd7, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	18004	4120	4009	4009
q2	q3	10794	1356	837	837
q4	4690	471	342	342
q5	7569	2241	2078	2078
q6	370	178	141	141
q7	937	776	621	621
q8	9419	1802	1580	1580
q9	7059	4953	4933	4933
q10	6451	2204	1879	1879
q11	429	271	244	244
q12	688	422	301	301
q13	18200	3394	2789	2789
q14	277	261	234	234
q15	q16	817	774	719	719
q17	1007	948	880	880
q18	6950	5863	5682	5682
q19	1188	1433	1084	1084
q20	519	416	279	279
q21	6100	2835	2472	2472
q22	446	376	314	314
Total cold run time: 101914 ms
Total hot run time: 31418 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4785	4853	5011	4853
q2	q3	4853	5351	4839	4839
q4	2227	2273	1464	1464
q5	5133	4885	4867	4867
q6	237	185	132	132
q7	1894	1885	1616	1616
q8	2439	1983	1977	1977
q9	7517	7498	7499	7498
q10	4754	4714	4233	4233
q11	557	389	370	370
q12	740	780	550	550
q13	3012	3340	2774	2774
q14	283	293	259	259
q15	q16	684	709	634	634
q17	1338	1302	1299	1299
q18	7274	6945	6884	6884
q19	1101	1082	1131	1082
q20	2240	2238	1967	1967
q21	5354	4661	4529	4529
q22	518	487	398	398
Total cold run time: 56940 ms
Total hot run time: 52225 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 172430 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e5b7daa5eb1cae8f2f154b483b24b41071d68bd7, data reload: false

query5	4333	681	534	534
query6	346	230	198	198
query7	4234	554	305	305
query8	325	228	214	214
query9	8813	4143	4139	4139
query10	457	349	296	296
query11	5817	2536	2240	2240
query12	182	126	125	125
query13	1261	602	431	431
query14	6188	5568	5214	5214
query14_1	4522	4524	4585	4524
query15	212	202	184	184
query16	996	470	421	421
query17	1131	747	582	582
query18	2442	476	351	351
query19	213	204	164	164
query20	140	127	127	127
query21	210	141	116	116
query22	13557	13584	13385	13385
query23	17329	16460	16339	16339
query23_1	16356	16477	16435	16435
query24	7619	1800	1352	1352
query24_1	1331	1339	1357	1339
query25	607	503	443	443
query26	1310	323	177	177
query27	2685	588	349	349
query28	4513	2047	2053	2047
query29	988	667	514	514
query30	305	240	207	207
query31	1172	1083	951	951
query32	86	82	80	80
query33	543	368	316	316
query34	1196	1167	672	672
query35	782	798	709	709
query36	1450	1403	1287	1287
query37	152	107	92	92
query38	3238	3204	3069	3069
query39	946	902	916	902
query39_1	901	865	895	865
query40	243	150	128	128
query41	71	69	69	69
query42	114	115	112	112
query43	352	344	299	299
query44	
query45	213	210	204	204
query46	1102	1212	737	737
query47	2387	2399	2218	2218
query48	417	420	302	302
query49	656	523	418	418
query50	999	360	255	255
query51	4373	4374	4274	4274
query52	109	110	97	97
query53	274	294	214	214
query54	340	298	284	284
query55	97	96	96	96
query56	334	313	309	309
query57	1432	1441	1313	1313
query58	303	288	287	287
query59	1654	1711	1499	1499
query60	353	339	320	320
query61	180	210	155	155
query62	692	640	589	589
query63	248	204	204	204
query64	2404	829	647	647
query65	
query66	1694	487	356	356
query67	30282	30067	29921	29921
query68	
query69	468	347	306	306
query70	1016	1017	1012	1012
query71	307	274	269	269
query72	2997	2687	2455	2455
query73	874	793	432	432
query74	5135	4957	4828	4828
query75	2694	2599	2289	2289
query76	2298	1160	780	780
query77	407	413	331	331
query78	12457	12308	11900	11900
query79	1464	1012	742	742
query80	645	520	440	440
query81	449	280	240	240
query82	1349	163	124	124
query83	371	290	252	252
query84	266	149	113	113
query85	875	545	442	442
query86	397	325	297	297
query87	3436	3406	3232	3232
query88	3671	2763	2774	2763
query89	481	399	345	345
query90	2003	187	189	187
query91	180	165	140	140
query92	75	82	76	76
query93	1511	1506	804	804
query94	534	348	322	322
query95	686	479	361	361
query96	1039	768	350	350
query97	2713	2763	2604	2604
query98	238	229	229	229
query99	1153	1163	1030	1030
Total cold run time: 255425 ms
Total hot run time: 172430 ms

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 3.92% (2/51) 🎉
Increment coverage report
Complete coverage report


3 participants