[fix](group commit) fix can not get a block queue #63722

Open
mymeiyi wants to merge 2 commits into apache:master from mymeiyi:fix-group-commit-get-queue

@mymeiyi
Contributor

@mymeiyi mymeiyi commented May 27, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Copilot AI review requested due to automatic review settings May 27, 2026 06:42
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Contributor

Copilot AI left a comment

Pull request overview

This PR addresses failures in group-commit stream load where the BE could not obtain a usable LoadBlockQueue under concurrency, by improving error handling around FE RPCs and adding resubmission/timeout logic for creating group-commit plan fragments. It also adds a new high-concurrency regression test intended to reproduce/guard the issue.

Changes:

  • Improve RPC-status handling when aborting/committing group-commit transactions (avoid masking transport errors as FE result statuses).
  • Add a background “create plan” worker + resubmission/timeout mechanism to reduce “can not get a block queue” occurrences during plan creation.
  • Add a high-concurrency async group-commit stream load regression test.
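
The first change above (not masking transport errors as FE result statuses) can be illustrated with a minimal sketch. This is not the code from the PR: `SketchStatus`, `FeCallResult`, `make_ok`, `make_error`, and `resolve` are hypothetical names invented for illustration, and the real Doris status and Thrift result types are not reproduced here. The idea is only that the transport outcome is checked first, and the status carried in the FE reply is read only when the call actually completed.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a status value; not the Doris Status class.
struct SketchStatus {
    bool ok;
    std::string msg;
};

inline SketchStatus make_ok() { return SketchStatus{true, ""}; }
inline SketchStatus make_error(std::string m) { return SketchStatus{false, std::move(m)}; }

// Hypothetical outcome of one FE call. The two statuses are kept apart:
// rpc_status says whether the request completed at the transport level,
// result_status is what the FE answered and is only meaningful when
// rpc_status.ok is true.
struct FeCallResult {
    SketchStatus rpc_status;
    SketchStatus result_status;
};

// Check the transport status first so that a failed call is never
// reported through the (possibly unset) FE result status.
SketchStatus resolve(const FeCallResult& r) {
    if (!r.rpc_status.ok) {
        return make_error("rpc failed: " + r.rpc_status.msg);
    }
    if (!r.result_status.ok) {
        return make_error("fe rejected: " + r.result_status.msg);
    }
    return make_ok();
}
```

With this split, a caller can log or retry a transport failure differently from a rejection that the FE returned on purpose.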

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| regression-test/suites/load_p0/stream_load/test_group_commit_stream_load_high_concurrency_async.groovy | New regression test stressing async group-commit under high concurrency. |
| be/src/load/group_commit/wal/wal_table.cpp | Return/LOG transport (RPC) errors distinctly from FE rollback result status. |
| be/src/load/group_commit/group_commit_mgr.h | Add new create-plan worker plumbing and make _need_commit atomic. |
| be/src/load/group_commit/group_commit_mgr.cpp | Implement create-plan resubmission worker, timeout handling, and improved commit/rollback RPC error handling. |
| be/src/common/config.h | Declare new config group_commit_create_plan_timeout_ms. |
| be/src/common/config.cpp | Define default for group_commit_create_plan_timeout_ms. |
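
The summary mentions timeout handling bounded by `group_commit_create_plan_timeout_ms`. The default value and the exact retry policy are not shown in this thread, so the following is only a generic sketch of retrying an operation until a deadline. `retry_until_deadline` and its `backoff` parameter are hypothetical and are not taken from the PR.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: call `attempt` until it returns true or until
// another backoff sleep would reach the deadline. The timeout would
// come from a config value in a real system; here it is a parameter.
bool retry_until_deadline(const std::function<bool()>& attempt,
                          std::chrono::milliseconds timeout,
                          std::chrono::milliseconds backoff) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (true) {
        if (attempt()) {
            return true;
        }
        // Give up rather than sleep past the deadline.
        if (std::chrono::steady_clock::now() + backoff >= deadline) {
            return false;
        }
        std::this_thread::sleep_for(backoff);
    }
}
```

A monotonic clock (`std::chrono::steady_clock`) is used so that wall-clock adjustments cannot shorten or extend the wait.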

Comment thread be/src/load/group_commit/group_commit_mgr.cpp Outdated
Comment thread be/src/common/config.cpp Outdated
Comment thread be/src/common/config.h Outdated
mymeiyi changed the title from "[fix](group commit) fix error: can not get a block queue" to "[fix](group commit) fix can not get a block queue" on May 27, 2026
@mymeiyi
Contributor Author

mymeiyi commented May 27, 2026

/review

Contributor

@github-actions github-actions Bot left a comment

I found one blocking issue in the new group-commit plan creation worker. The new background thread does not initialize Doris thread context before executing BE code, which violates the BE runtime memory-tracking convention. Existing Copilot comments already covered the timeout comment mismatch, the add-load-id error message, and the p0 test size concern, so I did not duplicate those.

Checklist conclusions:

  • Goal/test: The PR targets high-concurrency async group commit plan creation and adds a regression case, but the new worker thread setup is incomplete.
  • Scope: The implementation is focused on group commit queue creation/retry, though it introduces a new manager thread.
  • Concurrency: The new worker/condition-variable path is relevant; lock ordering looks consistent, but the worker thread lacks required thread-context initialization.
  • Lifecycle: GroupCommitMgr::stop() joins the new worker before shutting down the pool; no additional lifecycle blocker found.
  • Config: Adds a mutable timeout config with default/comment now aligned.
  • Compatibility: No protocol/storage-format incompatibility found.
  • Parallel paths: WAL rollback RPC handling was updated consistently in the touched paths.
  • Tests: A stress regression test was added; existing review already noted the p0 size/flakiness risk.
  • Observability: Existing logs are mostly sufficient for retries/failures.
  • Transaction/persistence/data correctness: No additional transaction visibility or WAL recovery issue found in the changed logic.
  • Performance: No additional performance blocker beyond the existing high-concurrency p0 test concern.
  • User focus: No additional user-provided review focus was specified.
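
The concurrency and lifecycle points above (a condition-variable worker that `stop()` joins before shutdown) follow a common pattern, sketched below with the standard library only. `PlanWorkerSketch` is a hypothetical class, not the `GroupCommitMgr` code, and it deliberately omits the Doris thread-context initialization that the review flags as missing, since that API is not shown in this thread.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical single-thread worker; illustrates submit/stop/join only.
class PlanWorkerSketch {
public:
    PlanWorkerSketch() : _thread([this] { run(); }) {}
    ~PlanWorkerSketch() { stop(); }

    // Queue a task. Returns false once stop() has been called.
    bool submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            if (_stopped) {
                return false;
            }
            _tasks.push_back(std::move(task));
        }
        _cv.notify_one();
        return true;
    }

    // Ask the worker to exit and wait for it. Tasks that were already
    // queued are still executed before the thread returns.
    void stop() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _stopped = true;
        }
        _cv.notify_all();
        if (_thread.joinable()) {
            _thread.join();
        }
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(_mutex);
        while (true) {
            _cv.wait(lock, [this] { return _stopped || !_tasks.empty(); });
            if (_tasks.empty()) {
                return;  // only reachable after stop() was requested
            }
            auto task = std::move(_tasks.front());
            _tasks.pop_front();
            lock.unlock();
            task();  // run the task without holding the mutex
            lock.lock();
        }
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    std::deque<std::function<void()>> _tasks;
    bool _stopped = false;
    std::thread _thread;  // declared last so the other members exist before run() starts
};
```

Joining the worker before tearing down anything it submits to (for example a thread pool) avoids tasks being handed to an executor that has already shut down, which matches the ordering the review describes for `GroupCommitMgr::stop()`.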

Comment thread be/src/load/group_commit/group_commit_mgr.cpp
mymeiyi force-pushed the fix-group-commit-get-queue branch 2 times, most recently from 9bbc402 to 47e1354, on May 27, 2026 08:32
mymeiyi force-pushed the fix-group-commit-get-queue branch from 47e1354 to 250bbcb on May 27, 2026 08:42
@mymeiyi
Contributor Author

mymeiyi commented May 27, 2026

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31323 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 250bbcbb34580f8acafaa87e43bc633533c2c37c, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17833	4105	4072	4072
q2	q3	10769	1426	824	824
q4	4723	476	351	351
q5	8563	2357	2098	2098
q6	372	180	136	136
q7	962	774	668	668
q8	9593	1628	1494	1494
q9	7018	4959	4928	4928
q10	6484	2221	1904	1904
q11	442	291	243	243
q12	690	422	294	294
q13	18225	3444	2805	2805
q14	266	262	239	239
q15	q16	812	783	712	712
q17	1007	966	1028	966
q18	6843	5853	5521	5521
q19	1180	1183	1103	1103
q20	515	405	257	257
q21	5692	2668	2402	2402
q22	447	352	306	306
Total cold run time: 102436 ms
Total hot run time: 31323 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4356	4258	4251	4251
q2	q3	4552	4964	4324	4324
q4	2106	2231	1408	1408
q5	4441	4343	5024	4343
q6	267	191	144	144
q7	1926	1889	1589	1589
q8	2528	2147	2116	2116
q9	8138	7954	8069	7954
q10	4893	4761	4311	4311
q11	559	423	381	381
q12	761	762	551	551
q13	3256	3655	3029	3029
q14	314	319	265	265
q15	q16	719	755	703	703
q17	1390	1333	1326	1326
q18	7900	7369	6742	6742
q19	1125	1107	1100	1100
q20	2238	2250	1951	1951
q21	5289	4597	4417	4417
q22	536	481	407	407
Total cold run time: 57294 ms
Total hot run time: 51312 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 172487 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 250bbcbb34580f8acafaa87e43bc633533c2c37c, data reload: false

query5	4337	671	525	525
query6	334	219	199	199
query7	4239	544	308	308
query8	324	229	220	220
query9	8793	4082	4091	4082
query10	467	334	302	302
query11	5781	2828	2256	2256
query12	187	137	149	137
query13	1290	593	439	439
query14	6091	5452	5102	5102
query14_1	4456	4821	4457	4457
query15	218	206	185	185
query16	1022	460	434	434
query17	1159	717	590	590
query18	2513	481	357	357
query19	217	204	166	166
query20	140	140	137	137
query21	217	136	118	118
query22	13744	13526	13493	13493
query23	17465	16552	16349	16349
query23_1	16301	16371	16362	16362
query24	7438	1816	1346	1346
query24_1	1292	1356	1308	1308
query25	585	507	455	455
query26	1303	338	188	188
query27	2643	588	347	347
query28	4461	2019	2020	2019
query29	1010	661	523	523
query30	309	245	199	199
query31	1160	1090	958	958
query32	90	79	80	79
query33	568	396	310	310
query34	1196	1170	652	652
query35	806	807	703	703
query36	1441	1398	1213	1213
query37	159	112	93	93
query38	3223	3154	3100	3100
query39	945	922	892	892
query39_1	878	887	907	887
query40	233	156	129	129
query41	71	69	68	68
query42	113	113	113	113
query43	332	335	300	300
query44	
query45	236	208	200	200
query46	1088	1249	763	763
query47	2380	2369	2261	2261
query48	410	431	294	294
query49	641	510	406	406
query50	980	363	270	270
query51	4346	4283	4258	4258
query52	114	114	97	97
query53	261	292	211	211
query54	333	289	269	269
query55	95	99	89	89
query56	326	313	319	313
query57	1432	1434	1353	1353
query58	318	286	288	286
query59	1625	1674	1486	1486
query60	372	338	323	323
query61	161	163	156	156
query62	697	649	597	597
query63	246	203	217	203
query64	2414	803	641	641
query65	
query66	1709	484	351	351
query67	29772	29088	29612	29088
query68	
query69	459	350	306	306
query70	1081	1022	970	970
query71	300	274	275	274
query72	3068	2767	2433	2433
query73	891	741	444	444
query74	5139	4976	4791	4791
query75	2697	2585	2278	2278
query76	2289	1175	794	794
query77	413	422	335	335
query78	12556	12478	11942	11942
query79	1501	1074	788	788
query80	1221	545	500	500
query81	506	281	240	240
query82	1401	160	124	124
query83	363	299	258	258
query84	270	142	113	113
query85	946	530	471	471
query86	449	340	338	338
query87	3415	3395	3228	3228
query88	3660	2736	2745	2736
query89	458	391	348	348
query90	1827	190	194	190
query91	184	172	145	145
query92	85	79	76	76
query93	1534	1415	849	849
query94	620	367	292	292
query95	693	381	363	363
query96	1075	798	355	355
query97	2783	2760	2606	2606
query98	251	228	226	226
query99	1179	1134	1027	1027
Total cold run time: 255883 ms
Total hot run time: 172487 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/251) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 53.90% (20945/38861) |
| Line Coverage | 37.45% (198483/530005) |
| Region Coverage | 33.73% (155482/460967) |
| Branch Coverage | 34.74% (67741/194967) |

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 58.17% (146/251) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 73.92% (28117/38035) |
| Line Coverage | 57.84% (305456/528150) |
| Region Coverage | 54.98% (255639/464993) |
| Branch Coverage | 56.49% (110406/195446) |

mymeiyi force-pushed the fix-group-commit-get-queue branch from 5c6c86a to 3e7b817 on May 28, 2026 10:32
@mymeiyi
Contributor Author

mymeiyi commented May 28, 2026

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31954 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3e7b8175a0b8ad3b8cbb2e7e5e4d0c63be602c45, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17893	4091	4057	4057
q2	q3	10812	1400	818	818
q4	4699	488	345	345
q5	7805	2258	2185	2185
q6	381	177	137	137
q7	951	809	647	647
q8	9608	1878	1672	1672
q9	7079	4944	4944	4944
q10	6448	2258	1960	1960
q11	436	282	251	251
q12	689	439	307	307
q13	18230	3348	2827	2827
q14	271	257	238	238
q15	q16	829	778	717	717
q17	997	941	966	941
q18	7125	5790	5621	5621
q19	1292	1413	1098	1098
q20	530	433	263	263
q21	5949	2707	2603	2603
q22	456	369	323	323
Total cold run time: 102480 ms
Total hot run time: 31954 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4969	4975	5003	4975
q2	q3	4991	5315	4617	4617
q4	2151	2267	1451	1451
q5	4971	4686	4705	4686
q6	239	180	127	127
q7	1871	1800	1448	1448
q8	2275	1985	1950	1950
q9	7412	7354	7439	7354
q10	4792	4702	4218	4218
q11	530	390	359	359
q12	733	747	534	534
q13	3057	3429	2747	2747
q14	288	278	251	251
q15	q16	681	711	610	610
q17	1296	1284	1279	1279
q18	7358	6869	6885	6869
q19	1084	1159	1101	1101
q20	2236	2243	1953	1953
q21	5329	4632	4512	4512
q22	527	449	410	410
Total cold run time: 56790 ms
Total hot run time: 51451 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 172648 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 3e7b8175a0b8ad3b8cbb2e7e5e4d0c63be602c45, data reload: false

query5	4351	662	541	541
query6	348	210	203	203
query7	4221	606	301	301
query8	326	234	224	224
query9	8817	4169	4109	4109
query10	454	349	310	310
query11	5815	2935	2272	2272
query12	187	132	125	125
query13	1276	637	437	437
query14	6132	5497	5181	5181
query14_1	4495	4539	4494	4494
query15	210	201	181	181
query16	1036	475	446	446
query17	1125	712	597	597
query18	2717	491	355	355
query19	227	199	155	155
query20	139	132	135	132
query21	214	138	119	119
query22	13789	13702	13529	13529
query23	17535	16694	16196	16196
query23_1	16374	16275	16342	16275
query24	7567	1783	1332	1332
query24_1	1348	1333	1339	1333
query25	561	466	414	414
query26	1305	317	167	167
query27	2752	557	342	342
query28	4434	2044	1998	1998
query29	996	628	494	494
query30	311	247	197	197
query31	1161	1112	959	959
query32	89	73	73	73
query33	531	346	301	301
query34	1205	1165	628	628
query35	776	815	688	688
query36	1442	1412	1273	1273
query37	166	107	90	90
query38	3237	3165	3097	3097
query39	943	912	903	903
query39_1	875	871	871	871
query40	233	151	129	129
query41	67	67	63	63
query42	113	106	107	106
query43	336	335	299	299
query44	
query45	215	211	198	198
query46	1128	1225	745	745
query47	2391	2345	2213	2213
query48	413	404	306	306
query49	622	494	388	388
query50	1003	354	252	252
query51	4414	4313	4253	4253
query52	105	116	104	104
query53	266	288	209	209
query54	308	264	254	254
query55	94	91	88	88
query56	295	301	322	301
query57	1439	1397	1328	1328
query58	305	275	278	275
query59	1602	1695	1478	1478
query60	318	324	313	313
query61	155	146	152	146
query62	689	667	565	565
query63	253	201	211	201
query64	2426	853	700	700
query65	
query66	1686	503	386	386
query67	29887	29852	29057	29057
query68	
query69	480	366	321	321
query70	1064	1000	1029	1000
query71	320	282	269	269
query72	3199	2988	2586	2586
query73	853	753	468	468
query74	5151	4992	4822	4822
query75	2696	2617	2284	2284
query76	2274	1166	805	805
query77	406	431	346	346
query78	12603	12560	12056	12056
query79	1520	1023	759	759
query80	1320	544	465	465
query81	530	286	241	241
query82	1056	165	123	123
query83	355	276	249	249
query84	261	143	113	113
query85	957	541	448	448
query86	452	342	334	334
query87	3486	3402	3267	3267
query88	3695	2781	2760	2760
query89	466	389	344	344
query90	1941	186	196	186
query91	180	213	136	136
query92	82	80	74	74
query93	1643	1443	835	835
query94	754	365	327	327
query95	679	395	461	395
query96	1135	816	327	327
query97	2764	2745	2592	2592
query98	240	230	232	230
query99	1165	1148	1039	1039
Total cold run time: 256996 ms
Total hot run time: 172648 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/256) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 53.94% (20963/38863) |
| Line Coverage | 37.50% (198761/530072) |
| Region Coverage | 33.77% (155718/461057) |
| Branch Coverage | 34.78% (67823/194997) |

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 57.03% (146/256) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 73.86% (28112/38062) |
| Line Coverage | 57.76% (305411/528716) |
| Region Coverage | 55.04% (256218/465472) |
| Branch Coverage | 56.46% (110514/195723) |
