[improvement](cloud) Accelerate cloud rebalance by batch editlog #37787

Merged (7 commits) on Jul 19, 2024

@deardeng deardeng commented Jul 15, 2024

Proposed changes

Issue Number: close #xxx

  1. Use JournalBatch to write edit logs in batches.
  2. Different tablets in the same partition share a single edit log.
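
The two changes above can be pictured with a minimal sketch. It does not use Doris's real JournalBatch class or the actual CloudTabletRebalancer code; `TabletMove`, `PartitionEditLog`, `groupByPartition`, and `toBatches` are hypothetical names introduced only for illustration. The idea is to first collapse per-tablet records into one record per partition, then hand the resulting records to the journal in fixed-size batches instead of one write per record.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: none of these types exist in Doris under these names.
public class EditLogBatchSketch {

    /** Hypothetical record of one tablet being reassigned to another backend. */
    public static final class TabletMove {
        public final long partitionId;
        public final long tabletId;
        public final long destBackendId;

        public TabletMove(long partitionId, long tabletId, long destBackendId) {
            this.partitionId = partitionId;
            this.tabletId = tabletId;
            this.destBackendId = destBackendId;
        }
    }

    /** Hypothetical edit-log entry that covers every moved tablet of one partition. */
    public static final class PartitionEditLog {
        public final long partitionId;
        public final List<TabletMove> moves;

        public PartitionEditLog(long partitionId, List<TabletMove> moves) {
            this.partitionId = partitionId;
            this.moves = moves;
        }
    }

    /** Collapse per-tablet moves into one entry per partition, keeping first-seen order. */
    public static List<PartitionEditLog> groupByPartition(List<TabletMove> moves) {
        Map<Long, List<TabletMove>> byPartition = new LinkedHashMap<>();
        for (TabletMove move : moves) {
            byPartition.computeIfAbsent(move.partitionId, id -> new ArrayList<>()).add(move);
        }
        List<PartitionEditLog> logs = new ArrayList<>();
        for (Map.Entry<Long, List<TabletMove>> e : byPartition.entrySet()) {
            logs.add(new PartitionEditLog(e.getKey(), e.getValue()));
        }
        return logs;
    }

    /** Split entries into batches of at most maxBatchSize, one journal write per batch. */
    public static <T> List<List<T>> toBatches(List<T> entries, int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < entries.size(); i += maxBatchSize) {
            int end = Math.min(i + maxBatchSize, entries.size());
            batches.add(new ArrayList<>(entries.subList(i, end)));
        }
        return batches;
    }
}
```

In this sketch the number of journal writes drops from one per moved tablet to one per batch of partition-level entries; the batch size is an arbitrary parameter here, not a value taken from the PR.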

env:
Docker, cloud mode, 3 FE and 3 BE.
Expand from 3 BE to 4 BE to trigger a cloud rebalance.
One table with 1860 partitions and 48 buckets; min balance is 12 in every rebalance loop, and pre-cache is disabled.

result:

before improvement
2024-07-16 16:51:01,371 INFO (cloud tablet rebalancer|77) [CloudTabletRebalancer.runAfterCatalogReady():228] 
finished to rebalancer. cost: 58471 ms


after improvement
2024-07-16 17:10:20,699 INFO (cloud tablet rebalancer|77) [CloudTabletRebalancer.runAfterCatalogReady():235]
finished to rebalancer. cost: 28687 ms
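
For reference, the two logged costs work out to roughly a 2.04x speedup, or about a 50.9% reduction in rebalance time. A small sketch of that arithmetic follows; the class and method names are hypothetical, and the regex only assumes the `cost: N ms` suffix seen in the log lines above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper, not part of Doris.
public class RebalanceCostSketch {

    // Matches the "cost: <digits> ms" field of the rebalancer log line.
    private static final Pattern COST = Pattern.compile("cost: (\\d+) ms");

    /** Extract the cost in milliseconds from a rebalancer log line. */
    public static long parseCostMs(String logLine) {
        Matcher m = COST.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("no cost field in: " + logLine);
        }
        return Long.parseLong(m.group(1));
    }

    /** Ratio of the old cost to the new cost. */
    public static double speedup(long beforeMs, long afterMs) {
        return (double) beforeMs / afterMs;
    }

    /** Percentage of the old cost that was saved. */
    public static double reductionPercent(long beforeMs, long afterMs) {
        return 100.0 * (beforeMs - afterMs) / beforeMs;
    }
}
```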

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@deardeng deardeng force-pushed the fix-reb-slow branch 2 times, most recently from 6348ac9 to 14b7d99 Compare July 15, 2024 04:11
@deardeng
Contributor Author

run buildall

@gavinchou gavinchou requested a review from w41ter July 17, 2024 11:15
@doris-robot

TPC-H: Total hot run time: 40092 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 58922dde2e4c966eb12a54ad1e40391ebf395e27, data reload: false

------ Round 1 ----------------------------------
q1	17642	5164	4277	4277
q2	2006	191	184	184
q3	10552	1206	1149	1149
q4	10286	747	789	747
q5	7610	2689	2685	2685
q6	225	137	135	135
q7	965	587	601	587
q8	9218	2075	2105	2075
q9	8819	6568	6601	6568
q10	8792	3814	3790	3790
q11	449	234	232	232
q12	400	216	226	216
q13	18899	2973	2993	2973
q14	278	239	241	239
q15	527	482	495	482
q16	477	398	391	391
q17	990	642	701	642
q18	8182	7466	7463	7463
q19	6001	1456	1409	1409
q20	758	321	318	318
q21	5094	3247	3238	3238
q22	356	292	292	292
Total cold run time: 118526 ms
Total hot run time: 40092 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4501	4341	4242	4242
q2	371	299	258	258
q3	3099	2922	2987	2922
q4	1967	1734	1765	1734
q5	5550	5498	5416	5416
q6	224	137	128	128
q7	2260	1828	1834	1828
q8	3297	3471	3431	3431
q9	8778	8891	8821	8821
q10	4191	3741	3846	3741
q11	593	493	499	493
q12	802	633	632	632
q13	15822	3186	3134	3134
q14	317	310	287	287
q15	546	487	495	487
q16	492	436	431	431
q17	1821	1534	1517	1517
q18	8173	7928	7906	7906
q19	1747	1440	1514	1440
q20	2076	1887	1900	1887
q21	9570	4942	4775	4775
q22	625	525	540	525
Total cold run time: 76822 ms
Total hot run time: 56035 ms

@doris-robot

TPC-DS: Total hot run time: 172500 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 58922dde2e4c966eb12a54ad1e40391ebf395e27, data reload: false

query1	911	374	365	365
query2	6493	1966	1814	1814
query3	6653	207	218	207
query4	28589	17490	17380	17380
query5	3687	478	470	470
query6	262	171	159	159
query7	4569	294	288	288
query8	248	195	185	185
query9	8758	2397	2392	2392
query10	444	296	276	276
query11	12215	9970	10046	9970
query12	122	89	87	87
query13	1644	361	361	361
query14	8732	6954	7838	6954
query15	223	169	170	169
query16	7845	335	331	331
query17	1764	566	552	552
query18	1978	282	286	282
query19	201	159	155	155
query20	97	89	84	84
query21	217	134	130	130
query22	4289	4159	4052	4052
query23	34347	33708	33303	33303
query24	10712	2906	2891	2891
query25	575	396	374	374
query26	695	150	147	147
query27	2287	275	280	275
query28	6099	2047	2070	2047
query29	888	658	653	653
query30	259	152	151	151
query31	972	749	751	749
query32	89	53	54	53
query33	680	321	303	303
query34	905	504	510	504
query35	699	571	573	571
query36	1130	994	970	970
query37	147	87	85	85
query38	2994	2928	2813	2813
query39	897	852	819	819
query40	211	144	121	121
query41	46	45	43	43
query42	120	95	96	95
query43	489	467	463	463
query44	1086	723	722	722
query45	195	160	160	160
query46	1071	730	706	706
query47	1880	1803	1777	1777
query48	358	286	283	283
query49	825	399	410	399
query50	788	392	395	392
query51	6791	6774	6699	6699
query52	102	94	94	94
query53	367	282	289	282
query54	853	448	434	434
query55	75	73	76	73
query56	310	268	269	268
query57	1159	1064	1039	1039
query58	233	264	254	254
query59	2908	2689	2639	2639
query60	310	281	278	278
query61	97	130	93	93
query62	792	655	629	629
query63	318	278	282	278
query64	9148	2230	1678	1678
query65	3155	3134	3126	3126
query66	762	362	324	324
query67	15458	15029	15121	15029
query68	6178	546	538	538
query69	669	462	362	362
query70	1202	1162	1142	1142
query71	457	280	269	269
query72	8062	5817	5455	5455
query73	838	327	322	322
query74	6103	5663	5695	5663
query75	3773	2658	2702	2658
query76	3916	936	903	903
query77	678	298	310	298
query78	9671	9055	9041	9041
query79	2633	517	516	516
query80	1919	470	460	460
query81	570	218	221	218
query82	763	137	132	132
query83	263	167	164	164
query84	281	87	92	87
query85	1234	316	378	316
query86	452	303	326	303
query87	3256	3097	3103	3097
query88	4096	2346	2360	2346
query89	468	389	374	374
query90	1752	192	189	189
query91	125	101	97	97
query92	59	50	49	49
query93	3580	508	503	503
query94	1040	216	209	209
query95	405	314	308	308
query96	610	278	271	271
query97	3222	3024	3042	3024
query98	213	205	193	193
query99	1678	1244	1266	1244
Total cold run time: 284389 ms
Total hot run time: 172500 ms

@doris-robot

ClickBench: Total hot run time: 30.63 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 58922dde2e4c966eb12a54ad1e40391ebf395e27, data reload: false

query1	0.04	0.03	0.03
query2	0.08	0.04	0.05
query3	0.23	0.04	0.04
query4	1.69	0.08	0.07
query5	0.50	0.50	0.50
query6	1.13	0.72	0.72
query7	0.02	0.02	0.01
query8	0.05	0.04	0.04
query9	0.57	0.49	0.49
query10	0.55	0.54	0.53
query11	0.16	0.11	0.12
query12	0.15	0.12	0.14
query13	0.59	0.59	0.58
query14	0.74	0.79	0.78
query15	0.86	0.81	0.82
query16	0.37	0.37	0.35
query17	0.95	1.04	1.03
query18	0.22	0.21	0.21
query19	1.76	1.79	1.68
query20	0.02	0.01	0.01
query21	15.39	0.73	0.66
query22	4.02	7.59	1.66
query23	18.31	1.44	1.40
query24	2.09	0.22	0.22
query25	0.16	0.09	0.08
query26	0.29	0.22	0.22
query27	0.45	0.24	0.23
query28	13.31	1.02	1.01
query29	12.56	3.30	3.30
query30	0.26	0.05	0.05
query31	2.88	0.39	0.38
query32	3.31	0.49	0.47
query33	2.88	2.95	2.96
query34	17.04	4.36	4.36
query35	4.42	4.44	4.40
query36	0.66	0.49	0.50
query37	0.19	0.16	0.16
query38	0.16	0.15	0.15
query39	0.04	0.04	0.04
query40	0.15	0.12	0.13
query41	0.09	0.05	0.05
query42	0.06	0.04	0.05
query43	0.05	0.04	0.04
Total cold run time: 109.45 s
Total hot run time: 30.63 s

@deardeng
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 39996 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit d4341279fc438cd8d2ee21886c3260fe9e3031f1, data reload: false

------ Round 1 ----------------------------------
q1	17952	4582	4393	4393
q2	2943	201	192	192
q3	10881	1171	1089	1089
q4	10710	872	851	851
q5	7630	2738	2651	2651
q6	223	142	139	139
q7	974	612	609	609
q8	9287	2099	2068	2068
q9	8586	6543	6600	6543
q10	8808	3823	3825	3823
q11	469	237	238	237
q12	391	228	221	221
q13	18702	2966	2966	2966
q14	279	248	243	243
q15	531	484	466	466
q16	512	378	378	378
q17	965	572	616	572
q18	8137	7495	7324	7324
q19	3567	1408	1428	1408
q20	668	318	335	318
q21	4912	3223	3239	3223
q22	345	290	282	282
Total cold run time: 117472 ms
Total hot run time: 39996 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4394	4288	4236	4236
q2	385	284	262	262
q3	2987	2793	2740	2740
q4	1914	1666	1571	1571
q5	5327	5352	5320	5320
q6	219	132	132	132
q7	2101	1710	1743	1710
q8	3197	3339	3352	3339
q9	8494	8439	8387	8387
q10	3892	3748	3776	3748
q11	581	494	503	494
q12	760	600	613	600
q13	17472	2941	2975	2941
q14	303	273	263	263
q15	516	483	472	472
q16	475	414	422	414
q17	1776	1486	1460	1460
q18	7567	7680	7425	7425
q19	1663	1562	1536	1536
q20	2002	1814	1813	1813
q21	5443	4672	4799	4672
q22	578	520	520	520
Total cold run time: 72046 ms
Total hot run time: 54055 ms

@doris-robot

TPC-DS: Total hot run time: 173956 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit d4341279fc438cd8d2ee21886c3260fe9e3031f1, data reload: false

query1	917	371	367	367
query2	6464	1919	1811	1811
query3	6667	204	213	204
query4	28405	17433	17472	17433
query5	4189	492	484	484
query6	266	171	169	169
query7	4558	307	285	285
query8	241	203	208	203
query9	8466	2486	2461	2461
query10	436	296	288	288
query11	11532	10123	10004	10004
query12	141	87	89	87
query13	1594	368	373	368
query14	10315	7200	8178	7200
query15	206	169	170	169
query16	7460	328	323	323
query17	1797	565	554	554
query18	1298	286	286	286
query19	203	153	162	153
query20	90	86	88	86
query21	214	128	130	128
query22	4332	4246	4060	4060
query23	33975	33368	33246	33246
query24	12217	2853	2940	2853
query25	681	395	398	395
query26	1669	155	154	154
query27	2767	274	277	274
query28	7436	2020	2010	2010
query29	1032	661	637	637
query30	303	153	152	152
query31	946	739	767	739
query32	98	57	58	57
query33	781	323	322	322
query34	897	490	483	483
query35	706	592	592	592
query36	1106	944	989	944
query37	278	81	84	81
query38	2943	2748	2780	2748
query39	900	820	821	820
query40	278	123	123	123
query41	52	49	48	48
query42	128	100	100	100
query43	513	485	481	481
query44	1186	728	731	728
query45	196	169	166	166
query46	1094	762	708	708
query47	1873	1751	1809	1751
query48	380	298	292	292
query49	1229	447	435	435
query50	790	413	404	404
query51	6903	6854	6781	6781
query52	110	92	100	92
query53	361	293	286	286
query54	1004	469	456	456
query55	74	78	73	73
query56	290	281	282	281
query57	1144	1043	1055	1043
query58	242	250	261	250
query59	3027	2606	2586	2586
query60	301	289	276	276
query61	100	94	96	94
query62	820	652	673	652
query63	322	291	289	289
query64	10501	2265	1672	1672
query65	3189	3113	3144	3113
query66	1233	339	335	335
query67	15802	15153	15044	15044
query68	6710	546	539	539
query69	762	422	371	371
query70	1207	1146	1121	1121
query71	530	288	274	274
query72	9386	5806	5965	5806
query73	834	331	330	330
query74	6152	5719	5696	5696
query75	4782	2695	2698	2695
query76	4775	936	958	936
query77	770	318	315	315
query78	11761	11501	9343	9343
query79	11855	536	536	536
query80	1253	488	497	488
query81	591	225	223	223
query82	544	141	141	141
query83	356	166	176	166
query84	273	86	90	86
query85	739	316	297	297
query86	449	322	305	305
query87	3356	3172	3097	3097
query88	5875	2456	2449	2449
query89	497	400	389	389
query90	2255	206	201	201
query91	135	103	99	99
query92	64	54	51	51
query93	4888	521	509	509
query94	1561	220	219	219
query95	408	323	326	323
query96	623	273	276	273
query97	3195	3021	3038	3021
query98	219	201	203	201
query99	1519	1267	1285	1267
Total cold run time: 309992 ms
Total hot run time: 173956 ms

@doris-robot

ClickBench: Total hot run time: 31.14 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit d4341279fc438cd8d2ee21886c3260fe9e3031f1, data reload: false

query1	0.04	0.03	0.03
query2	0.08	0.04	0.04
query3	0.22	0.05	0.05
query4	1.66	0.08	0.09
query5	0.50	0.49	0.48
query6	1.14	0.72	0.73
query7	0.02	0.01	0.02
query8	0.05	0.05	0.04
query9	0.56	0.49	0.49
query10	0.55	0.54	0.54
query11	0.16	0.12	0.12
query12	0.15	0.12	0.13
query13	0.59	0.59	0.58
query14	0.76	0.76	0.80
query15	0.85	0.82	0.82
query16	0.36	0.36	0.37
query17	1.00	1.00	0.97
query18	0.23	0.21	0.21
query19	1.89	1.74	1.73
query20	0.01	0.01	0.01
query21	15.40	0.76	0.66
query22	3.87	6.93	2.55
query23	18.31	1.43	1.17
query24	2.06	0.24	0.22
query25	0.16	0.08	0.08
query26	0.30	0.22	0.21
query27	0.46	0.23	0.23
query28	13.29	1.03	1.00
query29	12.65	3.37	3.30
query30	0.25	0.06	0.07
query31	2.86	0.39	0.39
query32	3.28	0.47	0.48
query33	2.83	2.96	2.91
query34	17.17	4.30	4.36
query35	4.44	4.43	4.38
query36	0.66	0.46	0.48
query37	0.19	0.15	0.16
query38	0.16	0.15	0.15
query39	0.05	0.03	0.04
query40	0.15	0.12	0.12
query41	0.10	0.06	0.05
query42	0.06	0.05	0.05
query43	0.04	0.04	0.04
Total cold run time: 109.56 s
Total hot run time: 31.14 s


@yujun777 yujun777 left a comment


LGTM


PR approved by anyone and no changes requested.


PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jul 19, 2024
@w41ter w41ter merged commit 40b3d58 into apache:master Jul 19, 2024
31 of 33 checks passed
dataroaring pushed a commit that referenced this pull request Jul 22, 2024
Labels: approved (Indicates a PR has been approved by one committer.), dev/3.0.1-merged, meta-change, reviewed
5 participants