[Enhancement](compaction) Try get global lock when execute compaction #49882

Merged: gavinchou merged 3 commits into apache:master on Apr 23, 2025

Collaborator

@Yukang-Lian Yukang-Lian commented Apr 8, 2025

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Background:
In cloud mode, compaction tasks for the same tablet may be scheduled across multiple BEs. To ensure that only one BE can execute a compaction task for a given tablet at a time, a global locking mechanism is used.

During compaction preparation, tablet and compaction information is written as key-value pairs to the metadata service. A background thread periodically renews the lease. Other BEs can only perform compaction on a tablet when the KV entry has expired or doesn't exist, ensuring that a tablet's compaction occurs on only one BE at a time.
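
A minimal sketch of the lease-based lock described above, assuming an in-memory stand-in for the metadata service. The `MetaService` class, its method names, and the 10-second lease are hypothetical and chosen only for illustration; they are not the actual Doris API.

```python
import time
from dataclasses import dataclass


@dataclass
class Lease:
    owner: str         # BE that currently holds the lock
    expires_at: float  # absolute expiration time, in seconds


class MetaService:
    """In-memory stand-in for the metadata service (hypothetical API)."""

    def __init__(self, lease_seconds=10.0, clock=time.monotonic):
        self._leases = {}                 # tablet_id -> Lease
        self._lease_seconds = lease_seconds
        self._clock = clock               # injectable so expiry can be tested

    def try_acquire(self, tablet_id, be_id):
        """Succeed only if no unexpired entry is held by another BE."""
        now = self._clock()
        lease = self._leases.get(tablet_id)
        if lease is not None and lease.expires_at > now and lease.owner != be_id:
            return False
        self._leases[tablet_id] = Lease(be_id, now + self._lease_seconds)
        return True

    def renew(self, tablet_id, be_id):
        """Extend the lease; in the PR's description a background thread does this."""
        now = self._clock()
        lease = self._leases.get(tablet_id)
        if lease is None or lease.owner != be_id or lease.expires_at <= now:
            return False
        lease.expires_at = now + self._lease_seconds
        return True

    def release(self, tablet_id, be_id):
        """Drop the entry when the owning BE finishes its compaction."""
        lease = self._leases.get(tablet_id)
        if lease is not None and lease.owner == be_id:
            del self._leases[tablet_id]
```

While the lease is renewed, a second BE's `try_acquire` for the same tablet fails; once the owner stops renewing and the entry expires, the next caller wins the lock.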

Problem:
Compaction tasks are processed through a thread pool. Currently, we first prepare compaction and acquire the global lock before queueing the task. If a BE is under heavy compaction pressure with all threads occupied, tablets may wait in the queue for extended periods. Meanwhile, other idle BEs cannot perform compaction on these tablets because they cannot acquire the global lock, leading to resource imbalance with some BEs starved and others overloaded.

Solution:
To address this issue, we'll modify the workflow to queue tasks first, then attempt to acquire the lock only when the task is about to be executed. This ensures that even if a tablet's compaction task is queued on one BE, another idle BE can still perform compaction on that tablet, resulting in better resource utilization across the cluster.
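
The difference between the two workflows can be sketched with a toy simulation. Everything here is hypothetical (`GlobalLocks`, `Backend`, `simulate`, the tablet id 42), leases are omitted for brevity, and skipping a task whose lock cannot be acquired at execution time is an assumption of this sketch rather than something the description states.

```python
from collections import deque


class GlobalLocks:
    """Toy global lock table keyed by tablet id (no leases, for brevity)."""

    def __init__(self):
        self._owner = {}

    def try_acquire(self, tablet_id, be_id):
        # setdefault stores be_id only when no owner is recorded yet.
        return self._owner.setdefault(tablet_id, be_id) == be_id

    def release(self, tablet_id, be_id):
        if self._owner.get(tablet_id) == be_id:
            del self._owner[tablet_id]


class Backend:
    def __init__(self, name, locks, lock_before_queue):
        self.name = name
        self.locks = locks
        self.lock_before_queue = lock_before_queue
        self.queue = deque()
        self.compacted = []

    def submit(self, tablet_id):
        # Old workflow: take the global lock first, then queue the task.
        if self.lock_before_queue and not self.locks.try_acquire(tablet_id, self.name):
            return False
        self.queue.append(tablet_id)
        return True

    def run_one(self):
        """Run the next queued task; stands in for a free worker thread."""
        if not self.queue:
            return None
        tablet_id = self.queue.popleft()
        # New workflow: take the lock only when the task is about to execute.
        if not self.lock_before_queue and not self.locks.try_acquire(tablet_id, self.name):
            return None  # assumption: skip if another BE holds the lock
        self.compacted.append(tablet_id)
        self.locks.release(tablet_id, self.name)
        return tablet_id


def simulate(lock_before_queue):
    """Return the tablets the idle BE manages to compact."""
    locks = GlobalLocks()
    busy = Backend("be-busy", locks, lock_before_queue)
    idle = Backend("be-idle", locks, lock_before_queue)
    busy.submit(42)  # queued on the busy BE, which never gets a free thread
    idle.submit(42)
    idle.run_one()
    return idle.compacted
```

With `simulate(True)` the busy BE holds the lock while the task sits in its queue, so the idle BE compacts nothing; with `simulate(False)` the idle BE acquires the lock at execution time and compacts the tablet.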

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Yukang-Lian Yukang-Lian marked this pull request as draft April 8, 2025 12:05
@Yukang-Lian Yukang-Lian changed the title from "1" to "just test" Apr 8, 2025
@Yukang-Lian Yukang-Lian marked this pull request as ready for review April 8, 2025 12:54
@Yukang-Lian
Collaborator Author

run buildall

gavinchou previously approved these changes Apr 8, 2025
@github-actions github-actions bot added the "approved" label (indicates a PR has been approved by one committer) Apr 8, 2025
Contributor

github-actions bot commented Apr 8, 2025

PR approved by at least one committer and no changes requested.

Contributor

github-actions bot commented Apr 8, 2025

PR approved by anyone and no changes requested.

@github-actions github-actions bot removed the "approved" label (indicates a PR has been approved by one committer) Apr 9, 2025
@Yukang-Lian
Collaborator Author

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/69) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.33% (14069/26883)
Line Coverage 41.12% (121547/295561)
Region Coverage 39.85% (61835/155174)
Branch Coverage 34.51% (30956/89706)

@Yukang-Lian Yukang-Lian requested a review from w41ter as a code owner April 9, 2025 18:17
@Yukang-Lian
Collaborator Author

run buildall

@doris-robot

TPC-H: Total hot run time: 34203 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit 6541104964d3abdc5a19992381c9ef4623820aed, data reload: false

------ Round 1 ----------------------------------
q1	26336	5044	4969	4969
q2	2082	276	184	184
q3	10399	1233	670	670
q4	10238	1018	590	590
q5	7693	2416	2388	2388
q6	191	162	134	134
q7	915	765	610	610
q8	9322	1301	1110	1110
q9	6998	5177	5086	5086
q10	6810	2305	1876	1876
q11	494	298	277	277
q12	357	347	217	217
q13	17792	3618	3069	3069
q14	233	238	215	215
q15	532	478	485	478
q16	622	613	596	596
q17	626	797	422	422
q18	7527	7324	7121	7121
q19	1229	945	541	541
q20	340	328	229	229
q21	4041	3456	2466	2466
q22	1060	995	955	955
Total cold run time: 115837 ms
Total hot run time: 34203 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5091	5069	5043	5043
q2	234	323	232	232
q3	2143	2622	2230	2230
q4	1446	1784	1465	1465
q5	4465	4420	4376	4376
q6	207	172	125	125
q7	1993	1913	1767	1767
q8	2587	2577	2552	2552
q9	7255	7134	7082	7082
q10	2987	3136	2729	2729
q11	576	510	502	502
q12	714	756	634	634
q13	3522	3853	3376	3376
q14	303	303	270	270
q15	519	482	468	468
q16	663	692	651	651
q17	1215	1508	1406	1406
q18	7762	7451	7346	7346
q19	838	808	915	808
q20	1935	1941	1841	1841
q21	5340	4974	4801	4801
q22	1100	1034	1010	1010
Total cold run time: 52895 ms
Total hot run time: 50714 ms
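
The report does not label its columns. In the Round 1 rows above (commit 6541104964d3abdc5a19992381c9ef4623820aed), the figures are consistent with the reading "query, cold run, hot run 1, hot run 2, minimum of the two hot runs", with each total being a column sum. The sketch below checks that reading against the copied rows; the column names and the `parse`/`totals` helpers are assumptions made for illustration, not part of the benchmark tooling.

```python
# Round 1 rows copied from the report above; all times in milliseconds.
ROUND_1 = """\
q1 26336 5044 4969 4969
q2 2082 276 184 184
q3 10399 1233 670 670
q4 10238 1018 590 590
q5 7693 2416 2388 2388
q6 191 162 134 134
q7 915 765 610 610
q8 9322 1301 1110 1110
q9 6998 5177 5086 5086
q10 6810 2305 1876 1876
q11 494 298 277 277
q12 357 347 217 217
q13 17792 3618 3069 3069
q14 233 238 215 215
q15 532 478 485 478
q16 622 613 596 596
q17 626 797 422 422
q18 7527 7324 7121 7121
q19 1229 945 541 541
q20 340 328 229 229
q21 4041 3456 2466 2466
q22 1060 995 955 955
"""


def parse(text):
    """Return a list of (query, cold, hot1, hot2, reported_best) tuples."""
    rows = []
    for line in text.splitlines():
        name, *values = line.split()
        cold, hot1, hot2, best = (int(v) for v in values)
        rows.append((name, cold, hot1, hot2, best))
    return rows


def totals(rows):
    """Sum the cold runs and the per-query best (minimum) hot runs."""
    cold_total = sum(cold for _, cold, _, _, _ in rows)
    hot_total = sum(min(hot1, hot2) for _, _, hot1, hot2, _ in rows)
    return cold_total, hot_total


rows = parse(ROUND_1)
cold_total, hot_total = totals(rows)
print(f"Total cold run time: {cold_total} ms")
print(f"Total hot run time: {hot_total} ms")
```

Under this reading the computed totals match the reported 115837 ms (cold) and 34203 ms (hot) for Round 1.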

@doris-robot

TPC-DS: Total hot run time: 193350 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 6541104964d3abdc5a19992381c9ef4623820aed, data reload: false

query1	1393	1057	1056	1056
query2	6265	1897	1919	1897
query3	11030	4496	4467	4467
query4	52983	24934	23452	23452
query5	5026	529	463	463
query6	332	203	192	192
query7	4905	508	277	277
query8	318	273	240	240
query9	5420	2601	2598	2598
query10	422	323	270	270
query11	15355	14988	14728	14728
query12	161	109	104	104
query13	1033	509	404	404
query14	10089	6261	6397	6261
query15	201	205	179	179
query16	7071	654	508	508
query17	1102	762	603	603
query18	1549	471	326	326
query19	201	200	179	179
query20	131	130	125	125
query21	214	128	107	107
query22	4343	4710	4388	4388
query23	33993	33315	33418	33315
query24	6615	2429	2417	2417
query25	457	451	431	431
query26	836	268	159	159
query27	2780	508	341	341
query28	3041	2487	2439	2439
query29	590	562	485	485
query30	282	221	202	202
query31	896	880	822	822
query32	73	69	59	59
query33	451	382	317	317
query34	795	853	513	513
query35	792	846	742	742
query36	936	1047	907	907
query37	120	97	81	81
query38	4234	4125	4191	4125
query39	1512	1444	1458	1444
query40	211	125	108	108
query41	52	52	51	51
query42	135	109	107	107
query43	498	522	460	460
query44	1342	836	831	831
query45	188	175	169	169
query46	868	1028	646	646
query47	1854	1919	1844	1844
query48	391	422	307	307
query49	732	546	431	431
query50	653	729	401	401
query51	4233	4335	4224	4224
query52	112	111	104	104
query53	225	259	184	184
query54	609	613	511	511
query55	94	85	83	83
query56	320	293	310	293
query57	1208	1203	1143	1143
query58	272	267	257	257
query59	2713	2738	2801	2738
query60	327	315	307	307
query61	131	130	138	130
query62	731	753	693	693
query63	233	191	188	188
query64	2116	1058	732	732
query65	4470	4376	4378	4376
query66	749	407	309	309
query67	16153	15656	15284	15284
query68	8132	888	522	522
query69	569	304	274	274
query70	1203	1078	1101	1078
query71	525	320	285	285
query72	6043	4918	5011	4918
query73	1155	745	346	346
query74	9027	8873	8884	8873
query75	4077	3212	2669	2669
query76	4317	1174	741	741
query77	679	366	277	277
query78	10061	10216	9323	9323
query79	2103	808	567	567
query80	590	511	435	435
query81	493	261	220	220
query82	445	125	97	97
query83	324	253	232	232
query84	289	109	82	82
query85	798	352	336	336
query86	370	304	292	292
query87	4304	4386	4390	4386
query88	3479	2204	2196	2196
query89	412	318	281	281
query90	1892	208	209	208
query91	149	221	118	118
query92	76	60	52	52
query93	1738	971	591	591
query94	688	446	307	307
query95	373	290	276	276
query96	480	571	274	274
query97	3186	3272	3094	3094
query98	237	213	202	202
query99	1429	1403	1264	1264
Total cold run time: 299210 ms
Total hot run time: 193350 ms

@doris-robot

ClickBench: Total hot run time: 31.3 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 6541104964d3abdc5a19992381c9ef4623820aed, data reload: false

query1	0.04	0.03	0.04
query2	0.13	0.10	0.11
query3	0.26	0.19	0.19
query4	1.60	0.19	0.19
query5	0.62	0.59	0.59
query6	1.20	0.72	0.71
query7	0.02	0.02	0.01
query8	0.04	0.03	0.03
query9	0.58	0.52	0.53
query10	0.58	0.59	0.56
query11	0.16	0.10	0.11
query12	0.14	0.11	0.11
query13	0.61	0.59	0.60
query14	2.78	2.72	2.78
query15	0.94	0.85	0.85
query16	0.39	0.37	0.37
query17	1.06	1.06	1.04
query18	0.22	0.19	0.19
query19	1.90	1.94	1.80
query20	0.02	0.01	0.01
query21	15.36	0.90	0.54
query22	0.75	1.30	0.77
query23	14.70	1.40	0.63
query24	7.29	1.12	0.96
query25	0.47	0.20	0.06
query26	0.58	0.16	0.15
query27	0.06	0.05	0.05
query28	9.40	0.86	0.43
query29	12.53	3.88	3.27
query30	0.24	0.09	0.06
query31	2.83	0.58	0.38
query32	3.22	0.54	0.46
query33	3.04	3.08	3.09
query34	15.80	5.16	4.49
query35	4.50	4.47	4.51
query36	0.66	0.51	0.48
query37	0.08	0.06	0.06
query38	0.05	0.04	0.03
query39	0.02	0.02	0.03
query40	0.17	0.13	0.12
query41	0.08	0.03	0.03
query42	0.03	0.02	0.02
query43	0.04	0.03	0.03
Total cold run time: 105.19 s
Total hot run time: 31.3 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/71) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.33% (14068/26882)
Line Coverage 41.10% (121495/295573)
Region Coverage 39.84% (61824/155200)
Branch Coverage 34.49% (30949/89724)

@Yukang-Lian Yukang-Lian changed the title from "just test" to "[Enhancement](compaction) Try get galobal lock when execute compaction" Apr 10, 2025
@Yukang-Lian
Collaborator Author

run cloud_p0

@Yukang-Lian
Collaborator Author

run buildall

@doris-robot

TPC-H: Total hot run time: 34845 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit db4adce4281035815cd5ca27b67293ab51163add, data reload: false

------ Round 1 ----------------------------------
q1	26663	5242	5086	5086
q2	2096	283	197	197
q3	11593	1256	698	698
q4	10221	1031	541	541
q5	7665	2405	2390	2390
q6	186	168	135	135
q7	939	793	645	645
q8	9751	1389	1170	1170
q9	8978	5441	5302	5302
q10	6887	2335	1895	1895
q11	485	290	274	274
q12	379	372	229	229
q13	20405	3729	3157	3157
q14	240	231	218	218
q15	535	483	493	483
q16	628	633	585	585
q17	591	854	381	381
q18	7400	7149	7186	7149
q19	1652	962	576	576
q20	349	333	225	225
q21	4569	3404	2544	2544
q22	1071	1013	965	965
Total cold run time: 123283 ms
Total hot run time: 34845 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5168	5064	5111	5064
q2	244	325	226	226
q3	2156	2888	2418	2418
q4	1575	1959	1516	1516
q5	4471	4348	4358	4348
q6	223	169	124	124
q7	2005	1882	1797	1797
q8	2588	2562	2576	2562
q9	7053	7135	7103	7103
q10	2992	3192	2757	2757
q11	561	524	490	490
q12	674	772	647	647
q13	3662	3900	3259	3259
q14	300	309	285	285
q15	520	489	479	479
q16	646	693	649	649
q17	1140	1591	1384	1384
q18	7649	7475	7114	7114
q19	824	803	947	803
q20	1945	2036	1875	1875
q21	5316	4958	4837	4837
q22	1112	1066	1025	1025
Total cold run time: 52824 ms
Total hot run time: 50762 ms

@doris-robot

TPC-DS: Total hot run time: 193259 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit db4adce4281035815cd5ca27b67293ab51163add, data reload: false

query1	1397	1076	1086	1076
query2	6098	1959	1941	1941
query3	11160	4737	4550	4550
query4	25506	23721	22907	22907
query5	4870	596	454	454
query6	291	206	199	199
query7	3995	491	292	292
query8	336	257	240	240
query9	8533	2622	2595	2595
query10	489	312	263	263
query11	15707	15095	14861	14861
query12	162	110	120	110
query13	1561	544	398	398
query14	9472	6151	6401	6151
query15	203	185	174	174
query16	7628	670	466	466
query17	1172	806	608	608
query18	2024	427	342	342
query19	218	195	176	176
query20	131	131	125	125
query21	206	125	112	112
query22	4378	4416	4363	4363
query23	34578	33336	33548	33336
query24	8421	2497	2463	2463
query25	508	446	418	418
query26	1180	276	151	151
query27	2666	508	336	336
query28	4401	2477	2425	2425
query29	690	586	441	441
query30	275	224	219	219
query31	929	846	810	810
query32	73	62	66	62
query33	564	369	324	324
query34	803	872	569	569
query35	798	878	771	771
query36	939	989	909	909
query37	125	99	83	83
query38	4298	4221	4235	4221
query39	1496	1459	1503	1459
query40	211	119	105	105
query41	53	55	52	52
query42	119	104	107	104
query43	511	519	491	491
query44	1355	821	840	821
query45	190	176	166	166
query46	845	1016	679	679
query47	1852	1880	1780	1780
query48	371	411	328	328
query49	804	508	440	440
query50	664	704	422	422
query51	4315	4257	4192	4192
query52	109	104	97	97
query53	242	262	185	185
query54	617	574	512	512
query55	89	83	86	83
query56	313	316	316	316
query57	1203	1183	1107	1107
query58	268	268	260	260
query59	2847	2965	2809	2809
query60	346	317	303	303
query61	133	138	131	131
query62	782	758	666	666
query63	237	196	209	196
query64	4013	1018	695	695
query65	4329	4215	4242	4215
query66	1000	404	317	317
query67	16044	15463	15305	15305
query68	8884	899	517	517
query69	485	312	268	268
query70	1238	1094	1103	1094
query71	459	323	297	297
query72	5893	4743	4771	4743
query73	717	596	353	353
query74	8937	9277	9034	9034
query75	4181	3255	2719	2719
query76	3855	1199	766	766
query77	805	384	296	296
query78	10035	10144	9300	9300
query79	2509	816	587	587
query80	754	507	486	486
query81	478	264	225	225
query82	453	127	94	94
query83	282	258	240	240
query84	290	98	90	90
query85	793	351	317	317
query86	337	329	262	262
query87	4540	4619	4420	4420
query88	2888	2283	2256	2256
query89	427	308	289	289
query90	1983	219	221	219
query91	149	143	114	114
query92	142	65	57	57
query93	1651	958	574	574
query94	686	425	308	308
query95	371	292	290	290
query96	490	568	277	277
query97	3160	3210	3142	3142
query98	222	199	195	195
query99	1417	1413	1253	1253
Total cold run time: 281654 ms
Total hot run time: 193259 ms

@Yukang-Lian
Collaborator Author

run buildall

@doris-robot

TPC-H: Total hot run time: 33970 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit ab4d208fc2033ca172060bf854dba4288e6a7ee4, data reload: false

------ Round 1 ----------------------------------
q1	25741	5062	4981	4981
q2	2057	285	178	178
q3	10399	1246	719	719
q4	10220	1008	532	532
q5	7455	2452	2355	2355
q6	181	164	132	132
q7	914	758	598	598
q8	9310	1278	1092	1092
q9	7356	5207	5209	5207
q10	6884	2282	1886	1886
q11	471	282	269	269
q12	341	359	224	224
q13	18093	3692	3074	3074
q14	237	232	213	213
q15	546	492	490	490
q16	449	453	407	407
q17	584	851	351	351
q18	7587	7105	7048	7048
q19	1505	933	566	566
q20	340	335	222	222
q21	4003	2728	2417	2417
q22	1104	1028	1009	1009
Total cold run time: 115777 ms
Total hot run time: 33970 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5113	5041	5067	5041
q2	237	324	230	230
q3	2147	2638	2249	2249
q4	1461	1845	1413	1413
q5	4444	4378	4414	4378
q6	217	175	124	124
q7	2001	1952	1781	1781
q8	2554	2570	2572	2570
q9	7293	7219	7003	7003
q10	3038	3183	2741	2741
q11	581	513	482	482
q12	668	772	603	603
q13	3480	3936	3244	3244
q14	294	300	261	261
q15	530	486	477	477
q16	480	520	483	483
q17	1137	1526	1387	1387
q18	7699	7576	7481	7481
q19	793	821	989	821
q20	1989	2033	1888	1888
q21	5204	4887	4812	4812
q22	1174	1071	1050	1050
Total cold run time: 52534 ms
Total hot run time: 50519 ms

@doris-robot

TPC-DS: Total hot run time: 193289 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ab4d208fc2033ca172060bf854dba4288e6a7ee4, data reload: false

query1	1440	1070	1066	1066
query2	6337	1840	1868	1840
query3	11132	4656	4615	4615
query4	25421	24140	23538	23538
query5	5512	662	457	457
query6	316	210	199	199
query7	3985	484	274	274
query8	292	239	235	235
query9	8561	2548	2555	2548
query10	516	310	273	273
query11	15198	15011	14761	14761
query12	166	109	108	108
query13	1550	509	380	380
query14	8799	6010	6102	6010
query15	198	186	165	165
query16	7208	643	476	476
query17	1169	772	648	648
query18	2003	424	314	314
query19	194	196	164	164
query20	123	125	124	124
query21	202	135	119	119
query22	4717	4610	4437	4437
query23	34524	33804	33639	33639
query24	8788	2469	2432	2432
query25	525	548	395	395
query26	1187	269	149	149
query27	3166	509	331	331
query28	4714	2159	2148	2148
query29	735	568	454	454
query30	281	226	195	195
query31	874	873	801	801
query32	76	67	64	64
query33	530	398	315	315
query34	783	863	519	519
query35	793	844	779	779
query36	981	1010	920	920
query37	109	97	76	76
query38	4188	4221	4179	4179
query39	1482	1418	1450	1418
query40	217	128	109	109
query41	55	51	53	51
query42	127	105	112	105
query43	508	492	480	480
query44	1310	818	807	807
query45	188	173	164	164
query46	831	1047	655	655
query47	1878	1921	1798	1798
query48	385	419	319	319
query49	755	520	405	405
query50	674	710	424	424
query51	4168	4233	4348	4233
query52	108	104	101	101
query53	227	253	181	181
query54	581	600	526	526
query55	86	81	80	80
query56	319	317	297	297
query57	1214	1198	1173	1173
query58	289	289	255	255
query59	2737	2850	2785	2785
query60	330	329	300	300
query61	130	135	122	122
query62	777	726	666	666
query63	225	192	188	188
query64	4135	1128	711	711
query65	4415	4370	4390	4370
query66	1097	401	300	300
query67	15944	15593	15418	15418
query68	9570	896	504	504
query69	516	297	259	259
query70	1229	1147	1104	1104
query71	481	308	279	279
query72	5587	4732	4675	4675
query73	718	591	345	345
query74	9258	9057	9027	9027
query75	4452	3191	2688	2688
query76	3809	1181	751	751
query77	1003	370	285	285
query78	9974	9996	9314	9314
query79	6786	781	556	556
query80	635	505	434	434
query81	487	255	212	212
query82	722	126	104	104
query83	280	253	235	235
query84	368	111	90	90
query85	750	344	307	307
query86	341	320	306	306
query87	4379	4465	4337	4337
query88	3697	2189	2188	2188
query89	464	323	271	271
query90	2024	209	211	209
query91	139	140	112	112
query92	75	61	58	58
query93	3969	961	573	573
query94	671	410	302	302
query95	368	294	285	285
query96	486	579	270	270
query97	3181	3246	3116	3116
query98	269	209	206	206
query99	1460	1400	1299	1299
Total cold run time: 290633 ms
Total hot run time: 193289 ms

@doris-robot

ClickBench: Total hot run time: 29.48 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ab4d208fc2033ca172060bf854dba4288e6a7ee4, data reload: false

query1	0.04	0.04	0.03
query2	0.14	0.10	0.12
query3	0.25	0.19	0.19
query4	1.59	0.20	0.19
query5	0.60	0.59	0.60
query6	1.18	0.72	0.72
query7	0.03	0.01	0.01
query8	0.04	0.04	0.03
query9	0.58	0.54	0.51
query10	0.57	0.56	0.57
query11	0.15	0.11	0.11
query12	0.14	0.12	0.11
query13	0.61	0.60	0.60
query14	1.21	1.16	1.16
query15	0.87	0.83	0.86
query16	0.38	0.37	0.38
query17	1.03	1.04	1.01
query18	0.21	0.20	0.19
query19	1.85	1.79	1.77
query20	0.01	0.01	0.01
query21	15.40	0.90	0.55
query22	0.77	1.27	0.74
query23	14.76	1.39	0.66
query24	7.03	2.19	0.64
query25	0.50	0.19	0.08
query26	0.58	0.15	0.13
query27	0.06	0.04	0.05
query28	9.95	0.89	0.44
query29	12.52	4.02	3.30
query30	0.26	0.10	0.07
query31	2.82	0.60	0.38
query32	3.22	0.54	0.47
query33	3.03	3.09	3.05
query34	15.79	5.17	4.49
query35	4.54	4.59	4.53
query36	0.66	0.49	0.49
query37	0.09	0.06	0.06
query38	0.05	0.04	0.04
query39	0.03	0.02	0.03
query40	0.16	0.14	0.13
query41	0.09	0.03	0.03
query42	0.03	0.02	0.02
query43	0.04	0.04	0.03
Total cold run time: 103.86 s
Total hot run time: 29.48 s

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/178) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.53% (14535/27155)
Line Coverage 42.35% (125979/297485)
Region Coverage 41.18% (64397/156390)
Branch Coverage 35.75% (32369/90530)

@Yukang-Lian
Collaborator Author

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/180) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.53% (14537/27155)
Line Coverage 42.36% (126008/297493)
Region Coverage 41.18% (64402/156388)
Branch Coverage 35.76% (32373/90530)

@hello-stephen
Contributor

BE Regression P0 && UT Coverage Report

Increment line coverage 0.00% (0/180) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage
Line Coverage
Region Coverage
Branch Coverage

@Yukang-Lian
Collaborator Author

run performance

@doris-robot

TPC-H: Total hot run time: 34096 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit ad89e2cb7e6106b66cdfd98d55598fd78c5cf599, data reload: false

------ Round 1 ----------------------------------
q1	26130	5161	5074	5074
q2	2081	279	182	182
q3	10402	1256	687	687
q4	10221	1017	527	527
q5	7633	2427	2375	2375
q6	183	159	130	130
q7	920	758	608	608
q8	9315	1224	1067	1067
q9	6812	5128	5153	5128
q10	6888	2313	1876	1876
q11	493	293	267	267
q12	362	355	219	219
q13	17800	3717	3125	3125
q14	229	232	207	207
q15	542	482	497	482
q16	451	448	408	408
q17	602	879	367	367
q18	7562	7303	7266	7266
q19	1581	947	551	551
q20	329	327	220	220
q21	4022	3388	2409	2409
q22	1098	1023	921	921
Total cold run time: 115656 ms
Total hot run time: 34096 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5206	5132	5102	5102
q2	237	320	228	228
q3	2154	2670	2270	2270
q4	1388	1791	1475	1475
q5	4494	4405	4387	4387
q6	214	167	129	129
q7	1963	1945	1726	1726
q8	2583	2496	2492	2492
q9	7299	7190	6994	6994
q10	2989	3245	2717	2717
q11	562	483	507	483
q12	695	766	607	607
q13	3606	3877	3319	3319
q14	286	305	292	292
q15	546	482	483	482
q16	485	513	458	458
q17	1206	1528	1397	1397
q18	7782	7659	7450	7450
q19	834	807	901	807
q20	2039	1964	1862	1862
q21	5327	4831	4845	4831
q22	1111	1057	1027	1027
Total cold run time: 53006 ms
Total hot run time: 50535 ms

@doris-robot

TPC-DS: Total hot run time: 192187 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ad89e2cb7e6106b66cdfd98d55598fd78c5cf599, data reload: false

query1	1408	1110	1063	1063
query2	6106	1759	1771	1759
query3	11160	4798	4790	4790
query4	53976	25685	23028	23028
query5	5029	521	451	451
query6	358	227	200	200
query7	4894	499	288	288
query8	340	256	239	239
query9	5652	2592	2612	2592
query10	422	337	261	261
query11	14988	14970	14698	14698
query12	160	122	115	115
query13	1040	525	403	403
query14	10092	6524	6347	6347
query15	207	196	183	183
query16	7218	661	502	502
query17	1111	749	613	613
query18	1630	416	343	343
query19	208	210	199	199
query20	126	126	128	126
query21	251	120	113	113
query22	4321	4565	4277	4277
query23	34094	33440	33410	33410
query24	6577	2410	2481	2410
query25	474	473	414	414
query26	722	272	157	157
query27	2246	508	343	343
query28	3437	2140	2124	2124
query29	599	557	429	429
query30	275	221	193	193
query31	869	865	774	774
query32	77	62	69	62
query33	448	372	317	317
query34	784	881	531	531
query35	801	834	790	790
query36	966	999	924	924
query37	118	100	78	78
query38	4240	4164	4200	4164
query39	1495	1454	1454	1454
query40	213	124	108	108
query41	53	60	49	49
query42	127	113	114	113
query43	532	511	487	487
query44	1370	821	824	821
query45	191	177	167	167
query46	865	1039	665	665
query47	1866	1863	1770	1770
query48	394	419	307	307
query49	692	527	445	445
query50	687	696	411	411
query51	4233	4250	4138	4138
query52	121	114	106	106
query53	242	265	181	181
query54	583	586	520	520
query55	81	85	79	79
query56	377	297	313	297
query57	1187	1210	1114	1114
query58	276	264	256	256
query59	2616	2747	2667	2667
query60	361	321	324	321
query61	135	136	128	128
query62	732	784	657	657
query63	222	188	190	188
query64	1958	1079	743	743
query65	4324	4216	4232	4216
query66	716	402	304	304
query67	15894	15712	15370	15370
query68	7594	904	516	516
query69	541	303	261	261
query70	1201	1076	1064	1064
query71	512	317	325	317
query72	5880	4811	4664	4664
query73	1452	597	350	350
query74	9135	9051	8837	8837
query75	3823	3193	2664	2664
query76	4376	1212	780	780
query77	612	376	287	287
query78	10049	10143	9199	9199
query79	2328	814	569	569
query80	817	521	426	426
query81	485	256	228	228
query82	442	128	96	96
query83	355	244	231	231
query84	298	106	87	87
query85	779	349	313	313
query86	390	308	312	308
query87	4352	4393	4271	4271
query88	3350	2239	2315	2239
query89	397	318	277	277
query90	1801	220	215	215
query91	143	145	110	110
query92	74	71	57	57
query93	2000	943	582	582
query94	658	420	305	305
query95	367	312	280	280
query96	498	567	274	274
query97	3195	3237	3108	3108
query98	237	211	205	205
query99	1438	1441	1275	1275
Total cold run time: 299546 ms
Total hot run time: 192187 ms

@doris-robot

ClickBench: Total hot run time: 29.87 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ad89e2cb7e6106b66cdfd98d55598fd78c5cf599, data reload: false

query1	0.05	0.04	0.03
query2	0.12	0.11	0.12
query3	0.25	0.20	0.19
query4	1.59	0.20	0.20
query5	0.60	0.59	0.59
query6	1.18	0.71	0.72
query7	0.03	0.01	0.02
query8	0.04	0.03	0.03
query9	0.58	0.52	0.51
query10	0.57	0.59	0.56
query11	0.15	0.11	0.11
query12	0.16	0.11	0.12
query13	0.62	0.59	0.60
query14	1.20	1.23	1.19
query15	0.89	0.86	0.85
query16	0.40	0.38	0.38
query17	1.06	1.04	1.01
query18	0.21	0.20	0.19
query19	1.96	1.81	1.80
query20	0.02	0.01	0.01
query21	15.39	0.91	0.55
query22	0.77	1.39	0.74
query23	14.73	1.40	0.64
query24	7.06	1.25	0.90
query25	0.50	0.17	0.08
query26	0.65	0.17	0.15
query27	0.06	0.05	0.04
query28	10.52	0.90	0.44
query29	12.58	3.99	3.31
query30	0.26	0.09	0.07
query31	2.82	0.60	0.38
query32	3.23	0.54	0.46
query33	3.00	3.07	3.11
query34	15.99	5.12	4.54
query35	4.57	4.60	4.51
query36	0.67	0.50	0.49
query37	0.08	0.06	0.06
query38	0.06	0.04	0.04
query39	0.02	0.02	0.02
query40	0.17	0.13	0.13
query41	0.08	0.02	0.02
query42	0.04	0.02	0.02
query43	0.04	0.03	0.03
Total cold run time: 104.97 s
Total hot run time: 29.87 s

@github-actions github-actions bot added the "approved" label (indicates a PR has been approved by one committer) Apr 23, 2025
Contributor

PR approved by at least one committer and no changes requested.

@gavinchou gavinchou merged commit 34fb45b into apache:master Apr 23, 2025
23 of 24 checks passed
Yukang-Lian added a commit to Yukang-Lian/doris that referenced this pull request Apr 25, 2025
…apache#49882)

dataroaring pushed a commit that referenced this pull request Apr 27, 2025
…bal lock when execute compaction (#49882)" (#50432)

Pick #49882 

dataroaring pushed a commit that referenced this pull request May 14, 2025
… access to compaction maps (#50819)

Related PR: #49882 

Problem Summary:

*** Query id: 0-0 ***
*** is nereids: 0 ***
*** tablet id: 0 ***
*** Aborted at 1746727905 (unix time) try "date -d @1746727905" if you are using GNU date ***
*** Current BE git commitID: ace825a ***
*** SIGSEGV address not mapped to object (@0x8) received by PID 3151893 (TID 3152363 OR 0x7f1186c00640) from PID 8; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/common/signal_handler.h:421
1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
2# JVM_handle_linux_signal in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
3# 0x00007F12D9FEE520 in /lib/x86_64-linux-gnu/libc.so.6
4# std::_Hashtable<long, std::pair<long const, std::shared_ptr<doris::CloudBaseCompaction> >, std::allocator<std::pair<long const, std::shared_ptr<doris::CloudBaseCompaction> > >, std::__detail::_Select1st, std::equal_to<long>, std::hash<long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_find_before_node(unsigned long, long const&, unsigned long) const at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/hashtable.h:1817
5# std::pair<std::__detail::_Node_iterator<std::pair<long const, std::shared_ptr<doris::CloudBaseCompaction> >, false, false>, bool> std::_Hashtable<long, std::pair<long const, std::shared_ptr<doris::CloudBaseCompaction> >, std::allocator<std::pair<long const, std::shared_ptr<doris::CloudBaseCompaction> > >, std::__detail::_Select1st, std::equal_to<long>, std::hash<long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_emplace<long, decltype(nullptr)>(std::integral_constant<bool, true>, long&&, decltype(nullptr)&&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/hashtable.h:1947
6# doris::CloudStorageEngine::_submit_base_compaction_task(std::shared_ptr<doris::CloudTablet> const&) in /mnt/hdd01/PERFORMANCE_ENV/be/lib/doris_be
7# doris::CloudStorageEngine::submit_compaction_task(std::shared_ptr<doris::CloudTablet> const&, doris::CompactionType) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/cloud/cloud_storage_engine.cpp:917
8# doris::CloudStorageEngine::_compaction_tasks_producer_callback() at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/cloud/cloud_storage_engine.cpp:494
9# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_branch-3.0/doris/be/src/util/thread.cpp:499
10# start_thread at ./nptl/pthread_create.c:442
11# 0x00007F12DA0D2850 at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:83
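
The trace above faults inside `std::unordered_map::emplace`, called from `_submit_base_compaction_task`, and the commit title refers to access to compaction maps. A minimal sketch follows, assuming the crash comes from unsynchronized concurrent access to such a map; the trace alone does not confirm that cause. `CompactionRegistry` and `CompactionTask` are hypothetical names, not Doris classes. Every read and write of the map goes through one mutex.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <utility>

// Hypothetical placeholder for a compaction object.
struct CompactionTask {
    int64_t tablet_id;
};

// Map from tablet id to its in-flight compaction, guarded by a single mutex.
class CompactionRegistry {
public:
    // Reserves a slot with a null task, mirroring the emplace(tablet_id, nullptr)
    // call visible in the trace. Returns false if the tablet already has an entry.
    bool try_reserve(int64_t tablet_id) {
        std::lock_guard<std::mutex> guard(_mu);
        return _tasks.emplace(tablet_id, nullptr).second;
    }

    // Attaches the task object to the tablet's slot, creating the slot if needed.
    void set_task(int64_t tablet_id, std::shared_ptr<CompactionTask> task) {
        std::lock_guard<std::mutex> guard(_mu);
        _tasks[tablet_id] = std::move(task);
    }

    // Removes the entry once the compaction finishes or is abandoned.
    void erase(int64_t tablet_id) {
        std::lock_guard<std::mutex> guard(_mu);
        _tasks.erase(tablet_id);
    }

    std::size_t size() {
        std::lock_guard<std::mutex> guard(_mu);
        return _tasks.size();
    }

private:
    std::mutex _mu;
    std::unordered_map<int64_t, std::shared_ptr<CompactionTask>> _tasks;
};
```

Under this assumption, a producer thread calling `try_reserve` and a pool thread calling `erase` serialize on the same mutex, so neither can observe the hash table in the middle of a rehash.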
github-actions bot pushed a commit that referenced this pull request May 14, 2025
… access to compaction maps (#50819)
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…apache#49882)
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
… access to compaction maps (apache#50819)
Labels: approved, dev/3.0.6-merged, reviewed, usercase