
[fix](routine load) fix incorrect auto-resume interval caused by excessive auto-resume attempts #47528

Merged
dataroaring merged 1 commit into apache:master from sollhui:fix_auto_resume_interval
Feb 12, 2025

@sollhui
Contributor

@sollhui sollhui commented Feb 6, 2025

What problem does this PR solve?

The auto-resume interval is incorrect when autoResumeCount grows too large. Logs show that once autoResumeCount reaches high values (around 41555 in the excerpt below), the auto-resume interval shrinks to approximately 20 to 25 seconds (consecutive attempts in the log are about 25 seconds apart) instead of the expected 5 minutes.

2025-02-05 14:58:46,830 INFO (Routine load scheduler|187) [ScheduleRule.isNeedAutoSchedule():83] try to auto reschedule routine load 10103, latestResumeTimestamp: 1738738701821,  autoResumeCount: 41555, pause reason: PARTITIONS_ERR
2025-02-05 14:59:11,837 INFO (Routine load scheduler|187) [ScheduleRule.isNeedAutoSchedule():83] try to auto reschedule routine load 10103, latestResumeTimestamp: 1738738726830,  autoResumeCount: 41556, pause reason: PARTITIONS_ERR
2025-02-05 14:59:36,844 INFO (Routine load scheduler|187) [ScheduleRule.isNeedAutoSchedule():83] try to auto reschedule routine load 10103, latestResumeTimestamp: 1738738751837,  autoResumeCount: 41557, pause reason: PARTITIONS_ERR
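
The interval can be checked directly from the three log entries above. The following is a minimal sketch in Python, not code from this PR: the helper names `log_gaps_sec` and `resume_gaps_sec` are hypothetical, and the values are copied by hand from the excerpt. It measures the gap between consecutive log lines and between consecutive `latestResumeTimestamp` values, then compares them with the 5-minute interval the description expects.

```python
from datetime import datetime

# Log-line timestamps and latestResumeTimestamp values (epoch milliseconds),
# copied by hand from the three log entries quoted above.
LOG_TIMES = [
    "2025-02-05 14:58:46,830",
    "2025-02-05 14:59:11,837",
    "2025-02-05 14:59:36,844",
]
LATEST_RESUME_MS = [1738738701821, 1738738726830, 1738738751837]

# Expected interval taken from the PR description (5 minutes).
EXPECTED_INTERVAL_SEC = 5 * 60


def log_gaps_sec(stamps):
    """Seconds between consecutive log lines (hypothetical helper)."""
    parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S,%f") for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]


def resume_gaps_sec(millis):
    """Seconds between consecutive latestResumeTimestamp values (hypothetical helper)."""
    return [(b - a) / 1000 for a, b in zip(millis, millis[1:])]


if __name__ == "__main__":
    for gap in log_gaps_sec(LOG_TIMES):
        print(f"log gap: {gap:.3f} s (expected {EXPECTED_INTERVAL_SEC} s)")
    for gap in resume_gaps_sec(LATEST_RESUME_MS):
        print(f"resume gap: {gap:.3f} s (expected {EXPECTED_INTERVAL_SEC} s)")
```

Both series come out at about 25 seconds per attempt, far below the expected 300 seconds. The sketch only confirms the symptom and makes no claim about its cause.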

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@sollhui
Contributor Author

sollhui commented Feb 6, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 32695 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 883dcf395d7af241b894d45dde9d403eb9d690ed, data reload: false

------ Round 1 ----------------------------------
q1	17577	5521	5388	5388
q2	2050	294	163	163
q3	10497	1228	771	771
q4	10234	984	516	516
q5	7825	2431	2182	2182
q6	201	171	137	137
q7	909	764	612	612
q8	9229	1393	1259	1259
q9	5250	4947	4919	4919
q10	6850	2319	1867	1867
q11	479	286	260	260
q12	353	378	225	225
q13	17786	3699	3143	3143
q14	233	223	205	205
q15	515	476	482	476
q16	645	627	582	582
q17	568	869	312	312
q18	6930	6519	6657	6519
q19	3988	950	557	557
q20	316	333	199	199
q21	2914	2301	2070	2070
q22	374	364	333	333
Total cold run time: 105723 ms
Total hot run time: 32695 ms
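
The bot output does not label the four timing columns. The sketch below tests one reading of them against the reported totals. That reading is an assumption, not something the bot states: column 1 is the cold run, columns 2 and 3 are hot runs, and column 4 is the faster of the two hot runs. The helper names `parse`, `totals` and `last_column_is_best_hot` are hypothetical.

```python
# Round 1 rows copied from the table above: query name, then four timings in ms.
ROUND_1 = """
q1 17577 5521 5388 5388
q2 2050 294 163 163
q3 10497 1228 771 771
q4 10234 984 516 516
q5 7825 2431 2182 2182
q6 201 171 137 137
q7 909 764 612 612
q8 9229 1393 1259 1259
q9 5250 4947 4919 4919
q10 6850 2319 1867 1867
q11 479 286 260 260
q12 353 378 225 225
q13 17786 3699 3143 3143
q14 233 223 205 205
q15 515 476 482 476
q16 645 627 582 582
q17 568 869 312 312
q18 6930 6519 6657 6519
q19 3988 950 557 557
q20 316 333 199 199
q21 2914 2301 2070 2070
q22 374 364 333 333
"""


def parse(table):
    """Turn the whitespace-separated rows into (name, [timings]) pairs."""
    rows = []
    for line in table.strip().splitlines():
        name, *values = line.split()
        rows.append((name, [int(v) for v in values]))
    return rows


def totals(rows):
    """Sum the first column (assumed cold run) and the last column (assumed best hot run)."""
    cold = sum(values[0] for _, values in rows)
    hot = sum(values[3] for _, values in rows)
    return cold, hot


def last_column_is_best_hot(rows):
    """Check the assumption that column 4 is the minimum of columns 2 and 3."""
    return all(values[3] == min(values[1], values[2]) for _, values in rows)


if __name__ == "__main__":
    rows = parse(ROUND_1)
    print(totals(rows), last_column_is_best_hot(rows))
```

Under this reading the sums reproduce the reported totals of 105723 ms (cold) and 32695 ms (hot), which supports the assumed column layout.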

----- Round 2, with runtime_filter_mode=off -----
q1	5592	5469	5490	5469
q2	243	335	240	240
q3	2335	2603	2283	2283
q4	1455	1840	1405	1405
q5	4346	4753	4780	4753
q6	178	164	128	128
q7	2088	2016	1798	1798
q8	2645	2808	2797	2797
q9	7261	7153	7276	7153
q10	3014	3332	2772	2772
q11	566	509	477	477
q12	643	750	631	631
q13	3622	4040	3353	3353
q14	278	301	281	281
q15	523	476	464	464
q16	659	688	642	642
q17	1238	1730	1270	1270
q18	7742	7408	7264	7264
q19	857	1173	1124	1124
q20	2067	2048	1905	1905
q21	5868	5092	4860	4860
q22	619	618	589	589
Total cold run time: 53839 ms
Total hot run time: 51658 ms

@doris-robot

TPC-DS: Total hot run time: 191995 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 883dcf395d7af241b894d45dde9d403eb9d690ed, data reload: false

query1	1327	953	921	921
query2	6389	2094	2077	2077
query3	11347	4544	4718	4544
query4	32577	23358	22992	22992
query5	3634	611	444	444
query6	288	194	192	192
query7	3989	501	322	322
query8	304	244	229	229
query9	9243	2687	2688	2687
query10	485	308	261	261
query11	17962	15270	14788	14788
query12	156	117	107	107
query13	1585	552	431	431
query14	9172	7053	7347	7053
query15	242	207	204	204
query16	7698	656	507	507
query17	1592	774	611	611
query18	2072	410	337	337
query19	213	187	172	172
query20	119	128	114	114
query21	210	126	110	110
query22	4355	4451	4410	4410
query23	34279	33516	33481	33481
query24	6685	2302	2308	2302
query25	470	459	389	389
query26	792	272	161	161
query27	2008	456	333	333
query28	5747	2528	2552	2528
query29	612	597	433	433
query30	215	187	152	152
query31	950	912	836	836
query32	68	65	63	63
query33	513	366	319	319
query34	772	893	524	524
query35	821	869	763	763
query36	1017	1045	963	963
query37	115	101	77	77
query38	4369	4433	4246	4246
query39	1505	1464	1433	1433
query40	203	112	106	106
query41	62	55	50	50
query42	124	100	103	100
query43	528	530	487	487
query44	1359	813	806	806
query45	188	183	172	172
query46	888	1068	665	665
query47	1908	1949	1839	1839
query48	387	439	334	334
query49	732	486	407	407
query50	634	656	399	399
query51	4389	4316	4203	4203
query52	109	104	93	93
query53	239	256	193	193
query54	506	508	452	452
query55	80	84	79	79
query56	274	289	264	264
query57	1183	1216	1148	1148
query58	261	237	240	237
query59	3147	3248	3074	3074
query60	288	275	270	270
query61	120	121	117	117
query62	798	706	661	661
query63	237	198	193	193
query64	3332	1101	681	681
query65	3322	3299	3268	3268
query66	889	395	296	296
query67	15879	15654	15506	15506
query68	5516	835	550	550
query69	488	307	253	253
query70	1205	1155	1124	1124
query71	389	297	260	260
query72	5930	3798	3805	3798
query73	654	759	371	371
query74	9199	9204	8936	8936
query75	3183	3132	2642	2642
query76	3127	1182	769	769
query77	461	369	278	278
query78	9999	9923	9253	9253
query79	2837	878	600	600
query80	654	525	449	449
query81	480	267	242	242
query82	434	149	118	118
query83	173	172	152	152
query84	240	89	73	73
query85	778	347	315	315
query86	401	322	299	299
query87	4524	4668	4489	4489
query88	4767	2272	2189	2189
query89	390	323	292	292
query90	1808	188	197	188
query91	129	137	108	108
query92	65	57	53	53
query93	1970	899	550	550
query94	690	402	295	295
query95	337	268	259	259
query96	500	607	289	289
query97	2791	2866	2709	2709
query98	226	192	201	192
query99	1261	1359	1284	1284
Total cold run time: 283569 ms
Total hot run time: 191995 ms

@doris-robot

ClickBench: Total hot run time: 31.1 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 883dcf395d7af241b894d45dde9d403eb9d690ed, data reload: false

query1	0.04	0.06	0.03
query2	0.07	0.04	0.03
query3	0.24	0.06	0.07
query4	1.63	0.10	0.10
query5	0.42	0.41	0.40
query6	1.17	0.65	0.65
query7	0.03	0.02	0.02
query8	0.04	0.03	0.04
query9	0.58	0.51	0.50
query10	0.57	0.55	0.56
query11	0.16	0.11	0.11
query12	0.13	0.11	0.11
query13	0.60	0.59	0.59
query14	2.83	2.91	2.74
query15	0.90	0.82	0.82
query16	0.39	0.40	0.38
query17	1.06	1.01	1.03
query18	0.23	0.20	0.20
query19	1.92	1.89	2.08
query20	0.01	0.01	0.01
query21	15.37	0.92	0.56
query22	0.76	0.86	0.71
query23	15.19	1.42	0.52
query24	2.69	1.40	1.24
query25	0.18	0.17	0.13
query26	0.25	0.14	0.14
query27	0.04	0.04	0.07
query28	13.93	0.99	0.43
query29	12.60	3.96	3.29
query30	0.24	0.08	0.06
query31	2.83	0.59	0.38
query32	3.22	0.54	0.46
query33	2.99	3.07	2.99
query34	16.75	5.16	4.55
query35	4.53	4.61	4.57
query36	0.64	0.48	0.47
query37	0.10	0.06	0.06
query38	0.05	0.04	0.04
query39	0.03	0.02	0.02
query40	0.17	0.14	0.14
query41	0.08	0.03	0.02
query42	0.04	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 105.73 s
Total hot run time: 31.1 s

dataroaring previously approved these changes Feb 10, 2025
Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the approved label Feb 10, 2025
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@sollhui
Contributor Author

sollhui commented Feb 11, 2025

run buildall

@github-actions github-actions bot removed the approved label Feb 11, 2025
@sollhui sollhui force-pushed the fix_auto_resume_interval branch from eda8d63 to 06fa316 on February 11, 2025 08:45
@sollhui
Contributor Author

sollhui commented Feb 11, 2025

run buildall

@sollhui sollhui force-pushed the fix_auto_resume_interval branch from 06fa316 to 58494fb on February 11, 2025 09:00
@sollhui
Contributor Author

sollhui commented Feb 11, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 31866 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 58494fbdb87e0d6388a510d8bd400764e1acc036, data reload: false

------ Round 1 ----------------------------------
q1	17615	5284	5288	5284
q2	2045	293	162	162
q3	10447	1252	736	736
q4	10269	1021	559	559
q5	8147	2368	2370	2368
q6	193	167	134	134
q7	903	744	614	614
q8	9332	1315	1059	1059
q9	4857	4711	4774	4711
q10	6903	2334	1909	1909
q11	493	286	254	254
q12	339	347	223	223
q13	17999	3621	3083	3083
q14	239	236	207	207
q15	509	460	456	456
q16	640	598	583	583
q17	559	860	336	336
q18	6717	6220	6263	6220
q19	1239	950	550	550
q20	323	335	195	195
q21	2855	2147	1912	1912
q22	364	326	311	311
Total cold run time: 102987 ms
Total hot run time: 31866 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5151	5178	5269	5178
q2	230	326	235	235
q3	2130	2650	2301	2301
q4	1498	1854	1394	1394
q5	4198	4133	4110	4110
q6	204	166	127	127
q7	1865	1893	1776	1776
q8	2581	2628	2626	2626
q9	7246	7117	7051	7051
q10	3056	3262	2832	2832
q11	574	513	480	480
q12	689	819	634	634
q13	3425	3951	3233	3233
q14	289	308	268	268
q15	486	479	462	462
q16	654	687	628	628
q17	1139	1590	1345	1345
q18	7644	7499	7323	7323
q19	848	827	919	827
q20	2030	1998	1905	1905
q21	5746	4884	4651	4651
q22	662	577	548	548
Total cold run time: 52345 ms
Total hot run time: 49934 ms

@doris-robot

TPC-DS: Total hot run time: 190581 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 58494fbdb87e0d6388a510d8bd400764e1acc036, data reload: false

query1	1294	953	951	951
query2	6203	1803	1824	1803
query3	11112	4598	4530	4530
query4	55014	25805	23020	23020
query5	4895	536	479	479
query6	333	199	184	184
query7	4862	500	297	297
query8	305	245	234	234
query9	5284	2498	2506	2498
query10	434	311	251	251
query11	15152	15104	15246	15104
query12	165	110	108	108
query13	1036	521	393	393
query14	10614	7008	6733	6733
query15	205	203	175	175
query16	7107	679	489	489
query17	1088	738	595	595
query18	1604	438	322	322
query19	203	206	171	171
query20	129	129	125	125
query21	214	130	104	104
query22	4403	4423	4470	4423
query23	34241	33678	33507	33507
query24	5695	2414	2445	2414
query25	445	459	391	391
query26	718	272	157	157
query27	1838	479	332	332
query28	2742	2429	2435	2429
query29	541	543	427	427
query30	207	192	156	156
query31	903	896	829	829
query32	70	63	59	59
query33	445	349	302	302
query34	800	872	489	489
query35	822	831	791	791
query36	981	1006	914	914
query37	128	97	78	78
query38	4394	4229	4422	4229
query39	1486	1434	1420	1420
query40	215	110	104	104
query41	51	47	47	47
query42	125	107	101	101
query43	518	512	480	480
query44	1321	827	813	813
query45	177	174	161	161
query46	890	1066	678	678
query47	1897	1913	1849	1849
query48	383	413	307	307
query49	713	519	418	418
query50	708	761	426	426
query51	4224	4321	4288	4288
query52	103	108	97	97
query53	231	260	184	184
query54	483	489	442	442
query55	80	80	82	80
query56	258	273	272	272
query57	1151	1197	1145	1145
query58	232	257	240	240
query59	2684	2852	2670	2670
query60	295	291	260	260
query61	129	149	124	124
query62	773	760	670	670
query63	224	185	184	184
query64	1931	1010	661	661
query65	3307	3112	3116	3112
query66	714	441	301	301
query67	15713	15667	15575	15575
query68	5576	774	502	502
query69	506	292	264	264
query70	1198	1126	1155	1126
query71	431	310	271	271
query72	6273	3742	3743	3742
query73	1418	747	355	355
query74	9211	9079	8704	8704
query75	3256	3147	2693	2693
query76	3710	1178	728	728
query77	536	371	265	265
query78	10089	10156	9296	9296
query79	2041	814	579	579
query80	844	520	503	503
query81	544	268	235	235
query82	440	154	121	121
query83	184	176	149	149
query84	299	97	72	72
query85	761	361	302	302
query86	343	289	254	254
query87	4480	4470	4341	4341
query88	2805	2225	2193	2193
query89	389	317	288	288
query90	1672	187	186	186
query91	136	134	108	108
query92	62	57	57	57
query93	1443	1035	576	576
query94	680	404	295	295
query95	332	259	256	256
query96	478	556	267	267
query97	2773	2860	2774	2774
query98	222	207	202	202
query99	1306	1430	1275	1275
Total cold run time: 292080 ms
Total hot run time: 190581 ms

@doris-robot

ClickBench: Total hot run time: 31.42 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 58494fbdb87e0d6388a510d8bd400764e1acc036, data reload: false

query1	0.04	0.03	0.03
query2	0.07	0.03	0.04
query3	0.24	0.07	0.06
query4	1.62	0.11	0.10
query5	0.44	0.40	0.38
query6	1.17	0.66	0.66
query7	0.02	0.01	0.01
query8	0.04	0.03	0.03
query9	0.59	0.52	0.51
query10	0.59	0.57	0.57
query11	0.15	0.11	0.11
query12	0.15	0.12	0.11
query13	0.63	0.60	0.60
query14	2.67	2.70	2.81
query15	0.93	0.87	0.85
query16	0.38	0.39	0.37
query17	1.00	1.07	1.03
query18	0.21	0.19	0.19
query19	1.92	1.77	1.98
query20	0.01	0.01	0.01
query21	15.74	0.88	0.54
query22	0.74	1.20	0.76
query23	14.79	1.41	0.64
query24	6.67	1.75	1.63
query25	0.47	0.20	0.08
query26	0.62	0.16	0.13
query27	0.05	0.06	0.05
query28	10.10	0.83	0.42
query29	12.58	3.90	3.25
query30	0.25	0.09	0.06
query31	2.84	0.59	0.38
query32	3.23	0.56	0.47
query33	3.08	3.00	2.99
query34	15.87	5.18	4.56
query35	4.56	4.54	4.55
query36	0.67	0.50	0.48
query37	0.09	0.06	0.06
query38	0.05	0.04	0.04
query39	0.03	0.02	0.02
query40	0.17	0.14	0.13
query41	0.08	0.02	0.02
query42	0.03	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 105.61 s
Total hot run time: 31.42 s

Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved label Feb 12, 2025
@dataroaring dataroaring merged commit 21e0d60 into apache:master Feb 12, 2025
24 of 26 checks passed
github-actions bot pushed a commit that referenced this pull request Feb 12, 2025
…ssive auto-resume attempts (#47528)

github-actions bot pushed a commit that referenced this pull request Feb 12, 2025
…ssive auto-resume attempts (#47528)

dataroaring pushed a commit that referenced this pull request Feb 14, 2025
…used by excessive auto-resume attempts #47528 (#47811)

Cherry-picked from #47528

Co-authored-by: hui lai <laihui@selectdb.com>
lzyy2024 pushed a commit to lzyy2024/doris that referenced this pull request Feb 21, 2025
…ssive auto-resume attempts (apache#47528)

dataroaring pushed a commit that referenced this pull request Feb 24, 2025
…used by excessive auto-resume attempts #47528 (#47810)

Cherry-picked from #47528

Co-authored-by: hui lai <laihui@selectdb.com>
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…ssive auto-resume attempts (apache#47528)


Labels

approved, dev/2.1.9-merged, dev/3.0.5-merged, reviewed
