
[fix](cloud) Periodically cleaning secondary be in cloud used by redundant tablets #50200

Merged 1 commit on May 14, 2025

Conversation

deardeng
Contributor

@deardeng deardeng commented Apr 20, 2025

What problem does this PR solve?

In cloud mode, the secondary BE record temporarily stores the tablet-to-BE mapping produced when tablets are rehashed because a BE is abnormal. If this mapping is never cleaned up, the diff logic that cleans up redundant tablets on tablet report does not work as expected.
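
To make the failure mode concrete, here is a minimal Python sketch of the idea, not the code from this PR. It assumes the redundant-tablet diff treats a tablet as expected on a BE if it is either a primary assignment or has a secondary-BE entry. Every name in it (`SecondaryBackendMap`, `tablets_on`, `clean_expired`, `redundant_tablets`, the TTL value) is hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Dict, Iterable, Set, Tuple


class SecondaryBackendMap:
    """Hypothetical stand-in for the temporary tablet -> secondary BE mapping."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # tablet_id -> (backend_id, time the mapping was recorded)
        self._entries: Dict[int, Tuple[int, float]] = {}

    def record(self, tablet_id: int, backend_id: int) -> None:
        """Remember that a tablet was rehashed to a secondary BE."""
        self._entries[tablet_id] = (backend_id, self._clock())

    def tablets_on(self, backend_id: int) -> Set[int]:
        """Tablets that currently have a secondary entry on this BE."""
        return {t for t, (be, _) in self._entries.items() if be == backend_id}

    def clean_expired(self) -> int:
        """Drop entries older than the TTL; meant to be called periodically."""
        now = self._clock()
        expired = [t for t, (_, ts) in self._entries.items()
                   if now - ts >= self._ttl]
        for tablet_id in expired:
            del self._entries[tablet_id]
        return len(expired)


def redundant_tablets(reported: Iterable[int], primary: Iterable[int],
                      secondary: SecondaryBackendMap, backend_id: int) -> Set[int]:
    """Assumed diff: reported tablets that the BE is not expected to hold."""
    expected = set(primary) | secondary.tablets_on(backend_id)
    return set(reported) - expected


if __name__ == "__main__":
    fake_now = [0.0]
    mapping = SecondaryBackendMap(ttl_seconds=300, clock=lambda: fake_now[0])
    mapping.record(tablet_id=1001, backend_id=7)

    # The stale entry hides tablet 1001 from the diff.
    print(redundant_tablets({1001, 1002}, {1002}, mapping, backend_id=7))

    # After the periodic cleanup runs past the TTL, the diff sees it again.
    fake_now[0] = 301.0
    mapping.clean_expired()
    print(redundant_tablets({1001, 1002}, {1002}, mapping, backend_id=7))
```

Under these assumptions, a secondary entry that outlives the abnormal period keeps its tablet in the expected set indefinitely, so the tablet is never reported as redundant. Expiring entries on a timer is one way to bound that window; the actual trigger and interval used by the PR are not stated in this conversation.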

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@deardeng
Contributor Author

wait regression

@gavinchou gavinchou changed the title [fix](cloud) Periodic cleaning secondary be in cloud used by redundan… [fix](cloud) Periodically cleaning secondary be in cloud used by redundan… Apr 21, 2025
@gavinchou gavinchou changed the title [fix](cloud) Periodically cleaning secondary be in cloud used by redundan… [fix](cloud) Periodically cleaning secondary be in cloud used by redundant tablets Apr 21, 2025
@deardeng
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 33912 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 441326b75e5ca348144262168016fcc9409f48ba, data reload: false

------ Round 1 ----------------------------------
q1	26115	5011	5199	5011
q2	2076	286	198	198
q3	10372	1253	704	704
q4	10229	1008	527	527
q5	7547	2369	2371	2369
q6	186	168	130	130
q7	902	729	625	625
q8	9319	1326	1094	1094
q9	6869	5103	5128	5103
q10	6793	2335	1885	1885
q11	473	287	260	260
q12	346	348	216	216
q13	17761	4061	3096	3096
q14	236	225	223	223
q15	536	491	473	473
q16	440	447	400	400
q17	578	859	370	370
q18	7576	7253	7116	7116
q19	1119	946	537	537
q20	335	323	225	225
q21	4287	2619	2403	2403
q22	1027	1031	947	947
Total cold run time: 115122 ms
Total hot run time: 33912 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5074	5031	5045	5031
q2	243	324	230	230
q3	2178	2618	2269	2269
q4	1386	1789	1419	1419
q5	4536	4456	4440	4440
q6	208	166	128	128
q7	1915	1853	1726	1726
q8	2631	2595	2454	2454
q9	7186	7163	7146	7146
q10	2947	3168	2714	2714
q11	570	504	481	481
q12	704	757	622	622
q13	3478	3969	3302	3302
q14	294	284	269	269
q15	512	480	461	461
q16	457	502	461	461
q17	1129	1589	1414	1414
q18	7565	7534	7428	7428
q19	775	844	980	844
q20	2009	2021	1885	1885
q21	5209	4688	4581	4581
q22	1070	1005	964	964
Total cold run time: 52076 ms
Total hot run time: 50269 ms

@doris-robot

TPC-DS: Total hot run time: 185227 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 441326b75e5ca348144262168016fcc9409f48ba, data reload: false

query1	1023	489	479	479
query2	6571	1778	1786	1778
query3	6749	221	230	221
query4	26016	23931	23619	23619
query5	5822	604	464	464
query6	299	196	197	196
query7	4628	490	290	290
query8	287	231	224	224
query9	8613	2542	2553	2542
query10	529	312	265	265
query11	15624	15087	14859	14859
query12	158	111	113	111
query13	1648	508	378	378
query14	10604	6034	6061	6034
query15	206	204	163	163
query16	7626	623	488	488
query17	1165	689	545	545
query18	2003	386	296	296
query19	185	186	153	153
query20	127	119	117	117
query21	209	123	102	102
query22	4120	4198	3940	3940
query23	33807	32885	32756	32756
query24	7364	2356	2325	2325
query25	528	460	384	384
query26	1243	266	150	150
query27	2574	461	329	329
query28	4443	2099	2084	2084
query29	735	551	423	423
query30	278	217	193	193
query31	927	864	767	767
query32	72	69	66	66
query33	546	374	313	313
query34	808	843	500	500
query35	799	839	725	725
query36	953	975	886	886
query37	105	95	77	77
query38	4171	4077	4216	4077
query39	1460	1410	1415	1410
query40	218	119	114	114
query41	77	70	68	68
query42	121	111	104	104
query43	479	518	469	469
query44	1261	816	777	777
query45	177	177	167	167
query46	806	1002	610	610
query47	1755	1779	1713	1713
query48	372	404	296	296
query49	782	490	440	440
query50	624	677	400	400
query51	4108	4141	4065	4065
query52	112	105	100	100
query53	222	247	180	180
query54	590	564	494	494
query55	81	80	78	78
query56	303	316	312	312
query57	1111	1151	1084	1084
query58	293	249	267	249
query59	2561	2643	2493	2493
query60	327	318	297	297
query61	139	131	127	127
query62	775	737	691	691
query63	236	181	183	181
query64	4276	1002	857	857
query65	4288	4233	4241	4233
query66	986	418	308	308
query67	15799	15382	15157	15157
query68	6281	868	507	507
query69	471	300	262	262
query70	1157	1116	1096	1096
query71	406	309	317	309
query72	5744	4807	4734	4734
query73	659	608	329	329
query74	8811	9132	8660	8660
query75	3161	3177	2714	2714
query76	3051	1172	749	749
query77	464	380	282	282
query78	9870	10114	9352	9352
query79	1931	814	555	555
query80	646	563	444	444
query81	492	263	216	216
query82	350	123	96	96
query83	254	246	233	233
query84	247	102	100	100
query85	756	360	366	360
query86	332	294	292	292
query87	4413	4529	4280	4280
query88	2747	2173	2189	2173
query89	376	312	282	282
query90	1893	214	210	210
query91	137	149	107	107
query92	62	58	54	54
query93	1090	926	591	591
query94	616	413	287	287
query95	375	295	287	287
query96	483	557	274	274
query97	3075	3251	3121	3121
query98	234	220	211	211
query99	1297	1375	1267	1267
Total cold run time: 269785 ms
Total hot run time: 185227 ms

@doris-robot

ClickBench: Total hot run time: 30.11 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 441326b75e5ca348144262168016fcc9409f48ba, data reload: false

query1	0.04	0.04	0.03
query2	0.13	0.10	0.12
query3	0.27	0.20	0.19
query4	1.59	0.19	0.20
query5	0.61	0.60	0.59
query6	1.17	0.72	0.72
query7	0.02	0.02	0.02
query8	0.04	0.04	0.03
query9	0.58	0.54	0.51
query10	0.55	0.57	0.56
query11	0.16	0.10	0.10
query12	0.15	0.12	0.11
query13	0.61	0.59	0.60
query14	1.20	1.20	1.17
query15	0.88	0.84	0.87
query16	0.40	0.40	0.38
query17	1.02	1.06	1.02
query18	0.22	0.20	0.19
query19	1.93	1.78	1.74
query20	0.01	0.01	0.01
query21	15.40	0.94	0.57
query22	0.74	1.10	0.71
query23	15.00	1.41	0.65
query24	6.85	1.23	1.35
query25	0.47	0.23	0.15
query26	0.56	0.16	0.13
query27	0.05	0.05	0.06
query28	9.76	0.90	0.46
query29	12.56	3.97	3.35
query30	0.25	0.09	0.06
query31	2.83	0.60	0.39
query32	3.23	0.55	0.47
query33	3.00	3.04	3.04
query34	16.29	5.14	4.47
query35	4.51	4.55	4.48
query36	0.67	0.51	0.49
query37	0.08	0.06	0.07
query38	0.06	0.04	0.04
query39	0.03	0.03	0.02
query40	0.17	0.13	0.14
query41	0.08	0.02	0.02
query42	0.03	0.03	0.02
query43	0.04	0.03	0.03
Total cold run time: 104.24 s
Total hot run time: 30.11 s

@dataroaring dataroaring added the dev/3.0.x and usercase (Important user case type label) labels Apr 23, 2025
@deardeng
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 33888 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 441326b75e5ca348144262168016fcc9409f48ba, data reload: false

------ Round 1 ----------------------------------
q1	25646	5140	5114	5114
q2	2058	279	176	176
q3	10393	1251	701	701
q4	10239	989	525	525
q5	7521	2350	2317	2317
q6	179	163	139	139
q7	930	736	629	629
q8	9314	1288	1011	1011
q9	6931	5193	5113	5113
q10	6811	2313	1875	1875
q11	469	273	260	260
q12	352	352	213	213
q13	17765	3683	3065	3065
q14	234	221	207	207
q15	534	496	490	490
q16	440	447	396	396
q17	585	856	356	356
q18	7510	7263	7091	7091
q19	1327	950	564	564
q20	342	318	222	222
q21	3960	2623	2408	2408
q22	1059	1047	1016	1016
Total cold run time: 114599 ms
Total hot run time: 33888 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5124	5150	5148	5148
q2	235	323	224	224
q3	2149	2644	2319	2319
q4	1371	1776	1347	1347
q5	4443	4345	4403	4345
q6	220	175	131	131
q7	2008	1929	1757	1757
q8	2597	2569	2611	2569
q9	7397	7331	7226	7226
q10	3016	3195	2801	2801
q11	613	543	496	496
q12	692	812	648	648
q13	3588	3900	3306	3306
q14	294	320	279	279
q15	527	468	485	468
q16	463	523	461	461
q17	1153	1502	1388	1388
q18	7820	7576	7398	7398
q19	816	878	1065	878
q20	2025	1937	1802	1802
q21	5049	4772	4700	4700
q22	1122	1072	1028	1028
Total cold run time: 52722 ms
Total hot run time: 50719 ms

@doris-robot

TPC-DS: Total hot run time: 192346 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 441326b75e5ca348144262168016fcc9409f48ba, data reload: false

query1	1402	1059	1066	1059
query2	6116	1874	1837	1837
query3	11096	4658	4454	4454
query4	54370	25511	23256	23256
query5	5304	539	468	468
query6	354	211	205	205
query7	4924	496	288	288
query8	317	248	243	243
query9	5866	2562	2574	2562
query10	440	312	266	266
query11	15111	14933	14824	14824
query12	161	106	102	102
query13	1058	517	407	407
query14	10190	6384	6440	6384
query15	204	198	185	185
query16	7140	654	523	523
query17	1112	760	615	615
query18	1576	411	330	330
query19	224	243	179	179
query20	126	129	118	118
query21	205	126	106	106
query22	4443	4335	4339	4335
query23	34019	33626	33482	33482
query24	6656	2430	2418	2418
query25	473	476	410	410
query26	664	269	155	155
query27	2257	502	353	353
query28	2968	2155	2160	2155
query29	587	554	431	431
query30	277	231	199	199
query31	884	836	803	803
query32	74	63	62	62
query33	463	381	341	341
query34	769	886	532	532
query35	789	834	757	757
query36	973	1043	931	931
query37	123	104	77	77
query38	4241	4203	4085	4085
query39	1490	1470	1421	1421
query40	217	117	107	107
query41	57	56	51	51
query42	129	110	105	105
query43	490	531	489	489
query44	1368	829	846	829
query45	183	177	168	168
query46	859	1040	627	627
query47	1866	1905	1807	1807
query48	392	413	306	306
query49	699	506	428	428
query50	688	702	414	414
query51	4226	4402	4186	4186
query52	107	110	95	95
query53	235	264	196	196
query54	598	577	506	506
query55	89	84	85	84
query56	320	301	305	301
query57	1136	1191	1132	1132
query58	273	269	252	252
query59	2736	2868	2619	2619
query60	346	332	303	303
query61	128	128	133	128
query62	765	706	653	653
query63	229	190	196	190
query64	1721	1022	717	717
query65	4323	4240	4251	4240
query66	718	386	300	300
query67	15771	15489	15273	15273
query68	6971	885	509	509
query69	538	302	259	259
query70	1188	1108	1090	1090
query71	495	311	300	300
query72	5744	4740	4914	4740
query73	1439	677	350	350
query74	8924	9032	8606	8606
query75	3759	3184	2671	2671
query76	4190	1193	778	778
query77	644	361	281	281
query78	9998	10052	9233	9233
query79	3811	824	551	551
query80	637	555	453	453
query81	492	254	217	217
query82	498	128	93	93
query83	332	253	235	235
query84	291	105	85	85
query85	799	348	320	320
query86	393	318	276	276
query87	4455	4401	4375	4375
query88	3457	2244	2241	2241
query89	448	321	286	286
query90	1920	221	222	221
query91	141	141	110	110
query92	75	62	57	57
query93	2968	964	582	582
query94	669	402	305	305
query95	370	293	285	285
query96	495	571	276	276
query97	3168	3284	3173	3173
query98	226	201	207	201
query99	1411	1398	1323	1323
Total cold run time: 301061 ms
Total hot run time: 192346 ms

@doris-robot

ClickBench: Total hot run time: 29.43 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 441326b75e5ca348144262168016fcc9409f48ba, data reload: false

query1	0.04	0.04	0.03
query2	0.12	0.10	0.12
query3	0.25	0.19	0.20
query4	1.59	0.19	0.19
query5	0.60	0.57	0.58
query6	1.21	0.70	0.73
query7	0.02	0.02	0.02
query8	0.04	0.03	0.03
query9	0.58	0.53	0.52
query10	0.58	0.58	0.58
query11	0.15	0.11	0.11
query12	0.15	0.11	0.12
query13	0.60	0.60	0.59
query14	1.18	1.16	1.20
query15	0.87	0.86	0.85
query16	0.40	0.39	0.39
query17	1.04	0.99	1.04
query18	0.22	0.19	0.20
query19	1.87	1.82	1.89
query20	0.01	0.01	0.01
query21	15.40	0.88	0.53
query22	0.76	1.26	0.59
query23	14.96	1.41	0.63
query24	7.15	2.00	0.57
query25	0.45	0.17	0.16
query26	0.65	0.15	0.14
query27	0.05	0.05	0.04
query28	9.58	0.85	0.44
query29	12.52	4.06	3.36
query30	0.26	0.09	0.07
query31	2.82	0.57	0.38
query32	3.22	0.54	0.45
query33	3.01	3.08	3.08
query34	15.75	5.11	4.51
query35	4.53	4.56	4.54
query36	0.66	0.50	0.48
query37	0.08	0.06	0.06
query38	0.06	0.04	0.04
query39	0.03	0.03	0.02
query40	0.17	0.14	0.13
query41	0.08	0.02	0.02
query42	0.03	0.02	0.02
query43	0.03	0.03	0.02
Total cold run time: 103.77 s
Total hot run time: 29.43 s

@github-actions github-actions bot added the approved (Indicates a PR has been approved by one committer.) label May 7, 2025
Contributor

github-actions bot commented May 7, 2025

PR approved by at least one committer and no changes requested.

Contributor

github-actions bot commented May 7, 2025

PR approved by anyone and no changes requested.

Contributor

@dataroaring dataroaring left a comment

LGTM

@gavinchou gavinchou merged commit 32f5251 into apache:master May 14, 2025
29 of 30 checks passed
github-actions bot pushed a commit that referenced this pull request May 14, 2025
…ndant tablets (#50200)

In cloud mode, the secondary BE record temporarily stores the tablet-to-BE
mapping produced when tablets are rehashed because a BE is abnormal. If this
mapping is never cleaned up, the diff logic that cleans up redundant tablets
on tablet report does not work as expected.
dataroaring pushed a commit that referenced this pull request May 15, 2025
…used by redundant tablets #50200 (#50919)

Cherry-picked from #50200

Co-authored-by: deardeng <dengxin@selectdb.com>
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…ndant tablets (apache#50200)

In cloud mode, the secondary BE record temporarily stores the tablet-to-BE
mapping produced when tablets are rehashed because a BE is abnormal. If this
mapping is never cleaned up, the diff logic that cleans up redundant tablets
on tablet report does not work as expected.
@gavinchou gavinchou added the cloud label Jun 5, 2025
Labels: approved (Indicates a PR has been approved by one committer.), cloud, dev/3.0.6-merged, reviewed, usercase (Important user case type label)
6 participants