[fix](s3) Avoid retrying object storage SlowDown errors #63776

Open
wyxxxcat wants to merge 1 commit into apache:master from wyxxxcat:client_retry_strategy

@wyxxxcat (Collaborator) commented May 28, 2026

What problem does this PR solve?

The SDK retry policy can retry object storage throttling errors. When requests are already rate limited, these retries add extra sleep time and delay the caller from moving on to its next processing step.
This change disables retries for throttling responses in object storage clients:

  • S3 SlowDown errors are not retried.
  • Azure 429 TooManyRequests is not added to the retryable status codes.

Other retryable errors keep the existing retry behavior.
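
The behavior described above can be sketched roughly as follows. This is a minimal Python illustration, not the C++ code changed by this PR. `StorageError`, `should_retry_s3`, `azure_retryable_status_codes`, the `max_retries` default, and the `include_throttling` flag are hypothetical names chosen for the example. Only the `SlowDown` error name, the 429 status, and the `retry_slow_down` flag come from this conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageError:
    """Hypothetical stand-in for an SDK error object."""
    name: str         # error name, e.g. "SlowDown" or "InternalError"
    http_status: int  # HTTP status code of the response
    retryable: bool   # what the SDK's default policy would decide


def should_retry_s3(error, attempted_retries, max_retries=3, retry_slow_down=False):
    """Decide whether an S3 request should be retried (illustrative only)."""
    if attempted_retries >= max_retries:
        return False
    # Throttling: the service is already rate limiting us, so skip the
    # retry sleep and hand control back to the caller immediately.
    if error.name == "SlowDown" and not retry_slow_down:
        return False
    # Everything else keeps the existing behavior.
    return error.retryable


def azure_retryable_status_codes(base_codes, include_throttling=False):
    """Return the retryable status codes; 429 is only added when asked for.

    `base_codes` stands for whatever the SDK already treats as retryable
    (an assumption of this sketch, not a statement about SDK defaults).
    """
    codes = set(base_codes)
    if include_throttling:
        codes.add(429)  # TooManyRequests
    return codes
```

With `retry_slow_down=False`, a `SlowDown` response returns to the caller on the first attempt, while an error the default policy marks retryable is still retried until `max_retries` is reached.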

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen (Contributor)

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@wyxxxcat force-pushed the client_retry_strategy branch from bb5a39c to ea244dd on May 28, 2026 03:02

@wyxxxcat (Collaborator, Author)

run buildall

@hello-stephen (Contributor)

Cloud UT Coverage Report

Increment line coverage 100.00% (1/1) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
|---|---|
| Function Coverage | 78.06% (1854/2375) |
| Line Coverage | 64.51% (33333/51668) |
| Region Coverage | 65.17% (16520/25350) |
| Branch Coverage | 55.75% (8842/15860) |

@hello-stephen (Contributor)

BE UT Coverage Report

Increment line coverage 100.00% (1/1) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
|---|---|
| Function Coverage | 53.94% (20959/38855) |
| Line Coverage | 37.53% (198805/529766) |
| Region Coverage | 33.76% (155571/460779) |
| Branch Coverage | 34.78% (67784/194881) |

@hello-stephen (Contributor)

TPC-H: Total hot run time: 31606 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit ea244ddd3702f5ae3eb2dc1d2f2cbe90bf46e643, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17626	3999	3968	3968
q2
q3	10758	1374	827	827
q4	4682	480	347	347
q5	7618	2261	2113	2113
q6	251	174	139	139
q7	920	778	651	651
q8	9336	1720	1674	1674
q9	6979	4928	4968	4928
q10	6436	2256	1890	1890
q11	428	270	245	245
q12	697	427	291	291
q13	18210	3372	2774	2774
q14	262	257	241	241
q15
q16	820	768	706	706
q17	919	942	1022	942
q18	6864	5669	5487	5487
q19	1189	1332	1183	1183
q20	525	425	286	286
q21	6080	2721	2609	2609
q22	457	370	305	305
Total cold run time: 101057 ms
Total hot run time: 31606 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4762	4782	5117	4782
q2
q3	4949	5280	4647	4647
q4	2122	2226	1389	1389
q5	4815	4818	4686	4686
q6	230	177	129	129
q7	1859	1686	1629	1629
q8	2461	2064	1931	1931
q9	7488	7418	7453	7418
q10	4797	4696	4230	4230
q11	544	388	363	363
q12	725	735	530	530
q13	3091	3421	2797	2797
q14	277	284	264	264
q15
q16	679	713	620	620
q17	1289	1252	1253	1252
q18	7337	6795	6685	6685
q19	1127	1114	1066	1066
q20	2224	2234	1949	1949
q21	5321	4571	4476	4476
q22	538	452	414	414
Total cold run time: 56635 ms
Total hot run time: 51257 ms

@hello-stephen (Contributor)

TPC-DS: Total hot run time: 171937 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ea244ddd3702f5ae3eb2dc1d2f2cbe90bf46e643, data reload: false

query5	4319	646	509	509
query6	331	225	194	194
query7	4247	541	322	322
query8	323	227	252	227
query9	8822	4056	4017	4017
query10	458	352	290	290
query11	5790	2483	2257	2257
query12	179	130	126	126
query13	1301	630	450	450
query14	6087	5489	5197	5197
query14_1	4451	4511	4464	4464
query15	216	211	187	187
query16	1012	462	453	453
query17	1163	757	624	624
query18	2454	507	371	371
query19	223	212	174	174
query20	134	138	130	130
query21	220	139	118	118
query22	13808	13649	13402	13402
query23	17358	16560	16180	16180
query23_1	16393	16331	16553	16331
query24	7442	1780	1307	1307
query24_1	1363	1313	1345	1313
query25	576	508	455	455
query26	1312	335	175	175
query27	2684	555	355	355
query28	4459	1994	2027	1994
query29	1004	645	523	523
query30	315	248	199	199
query31	1128	1093	964	964
query32	94	76	73	73
query33	557	367	299	299
query34	1167	1122	658	658
query35	779	859	704	704
query36	1440	1401	1272	1272
query37	167	104	97	97
query38	3218	3163	3037	3037
query39	922	909	894	894
query39_1	873	887	864	864
query40	234	145	118	118
query41	67	63	63	63
query42	110	114	105	105
query43	325	329	291	291
query44	
query45	230	201	195	195
query46	1122	1210	730	730
query47	2453	2351	2239	2239
query48	414	444	302	302
query49	662	495	384	384
query50	996	354	257	257
query51	4334	4400	4219	4219
query52	105	102	93	93
query53	260	284	202	202
query54	326	272	258	258
query55	97	91	88	88
query56	317	314	298	298
query57	1436	1408	1324	1324
query58	301	270	267	267
query59	1537	1650	1403	1403
query60	335	317	288	288
query61	160	155	162	155
query62	689	647	575	575
query63	245	202	214	202
query64	2428	795	637	637
query65	
query66	1710	489	363	363
query67	29827	29694	29569	29569
query68	
query69	471	344	315	315
query70	992	1038	1001	1001
query71	308	272	262	262
query72	3050	2669	2451	2451
query73	838	752	433	433
query74	5122	4944	4796	4796
query75	2704	2618	2271	2271
query76	2300	1143	791	791
query77	397	412	340	340
query78	12458	12475	11790	11790
query79	1489	1039	773	773
query80	692	549	455	455
query81	458	286	238	238
query82	1377	159	124	124
query83	317	277	248	248
query84	271	144	115	115
query85	905	538	471	471
query86	407	367	329	329
query87	3444	3368	3259	3259
query88	3600	2750	2765	2750
query89	463	394	351	351
query90	1935	179	177	177
query91	183	173	146	146
query92	85	75	78	75
query93	1467	1557	854	854
query94	555	370	334	334
query95	681	371	437	371
query96	1030	806	348	348
query97	2734	2729	2626	2626
query98	231	228	236	228
query99	1178	1160	1024	1024
Total cold run time: 254591 ms
Total hot run time: 171937 ms

@hello-stephen (Contributor)

BE Regression && UT Coverage Report

Increment line coverage 100.00% (1/1) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
|---|---|
| Function Coverage | 72.09% (27432/38054) |
| Line Coverage | 55.48% (293158/528410) |
| Region Coverage | 52.50% (244212/465194) |
| Branch Coverage | 53.75% (105137/195607) |

@wyxxxcat (Collaborator, Author)

/review

@github-actions (bot, Contributor) left a comment

Review result: no blocking findings in the actual PR file set reported by GitHub (S3/object retry strategy changes only).

Critical checkpoint conclusions:

  • Goal/test: The change avoids retrying AWS S3 SlowDown responses by passing retry_slow_down=false in BE and recycler S3 clients; no automated test is included, so behavior relies on the narrow condition in S3CustomRetryStrategy.
  • Scope: The implementation is small and focused: common retry strategy plus the two S3 client construction sites; Azure explicit 429 insertion removal appears consistent with relying on SDK retry defaults.
  • Concurrency/lifecycle: No new shared mutable concurrency or lifecycle ownership changes beyond immutable retry-strategy configuration at client construction.
  • Config/compatibility: No new config or storage/protocol compatibility impact found.
  • Parallel paths: BE S3 client and cloud recycler S3 accessor were both updated; no other S3CustomRetryStrategy call sites were found.
  • Error handling/observability: SlowDown now returns false before retry metric/logging, which matches the intended no-retry behavior; other retryable errors still preserve existing metric/log behavior.
  • Data correctness/transactions/persistence: No data visibility, transaction, delete-bitmap, or persistence paths are changed.
  • Performance: The change reduces retry/backoff work for the targeted SlowDown response and does not add hot-path overhead.
  • Tests: I did not run tests in this review. The main residual risk is lack of a unit test for ShouldRetry covering SlowDown vs ordinary 503/retryable errors.
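
The ordering noted in the error-handling checkpoint, together with the kind of unit test mentioned as a residual risk, could look roughly like the sketch below. It is a Python illustration under stated assumptions, not the actual `S3CustomRetryStrategy` implementation. The class name `RetryStrategySketch`, the `retry_metric` counter, the `retry_log` list, and the set of retryable HTTP statuses are all hypothetical.

```python
class RetryStrategySketch:
    """Illustrative retry strategy: SlowDown exits before metrics and logging."""

    # Assumed set of retryable statuses, used only for this example.
    RETRYABLE_STATUSES = (500, 502, 503, 504)

    def __init__(self, max_retries=3, retry_slow_down=False):
        self.max_retries = max_retries
        self.retry_slow_down = retry_slow_down
        self.retry_metric = 0  # stand-in for a retry counter metric
        self.retry_log = []    # stand-in for retry log lines

    def should_retry(self, error_name, http_status, attempted_retries):
        if attempted_retries >= self.max_retries:
            return False
        # Early return: no metric increment and no log entry for throttling.
        if error_name == "SlowDown" and not self.retry_slow_down:
            return False
        if http_status in self.RETRYABLE_STATUSES:
            self.retry_metric += 1
            self.retry_log.append(f"retry {attempted_retries + 1}: {error_name}")
            return True
        return False


# A SlowDown response and an ordinary 503 share a status code but differ in outcome.
strategy = RetryStrategySketch()
slow_down_result = strategy.should_retry("SlowDown", 503, 0)
plain_503_result = strategy.should_retry("ServiceUnavailable", 503, 0)
```

In this sketch only the ordinary 503 increments the counter, which is the distinction a `ShouldRetry` unit test would need to pin down.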

User focus: no additional user-provided review focus was specified.

@wyxxxcat force-pushed the client_retry_strategy branch from ea244dd to e439889 on May 28, 2026 07:28
@wyxxxcat requested a review from luwei16 as a code owner on May 28, 2026 07:28

@wyxxxcat (Collaborator, Author)

run buildall

@hello-stephen (Contributor)

TPC-H: Total hot run time: 31845 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit e439889ead481a5c6706214d00a60abc77a5c9e0, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17874	4261	4007	4007
q2
q3	10756	1382	833	833
q4	4691	482	346	346
q5	7593	2265	2129	2129
q6	273	184	144	144
q7	928	823	644	644
q8	9372	1715	1636	1636
q9	6937	5014	5008	5008
q10	6446	2254	1867	1867
q11	438	270	249	249
q12	701	432	297	297
q13	18173	3390	2776	2776
q14	271	258	241	241
q15
q16	821	797	716	716
q17	1016	1021	971	971
q18	6938	5713	5578	5578
q19	1238	1444	1154	1154
q20	517	398	271	271
q21	5955	2726	2669	2669
q22	440	384	309	309
Total cold run time: 101378 ms
Total hot run time: 31845 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5061	5006	5200	5006
q2
q3	4886	5237	4770	4770
q4	2193	2259	1443	1443
q5	5042	4719	4759	4719
q6	243	182	134	134
q7	1941	1787	1655	1655
q8	2615	2119	2042	2042
q9	7500	7505	7434	7434
q10	4756	4730	4236	4236
q11	560	415	391	391
q12	771	759	581	581
q13	3070	3352	2789	2789
q14	273	293	255	255
q15
q16	685	701	617	617
q17	1310	1277	1292	1277
q18	7581	6978	6836	6836
q19	1111	1076	1116	1076
q20	2245	2228	1930	1930
q21	5393	4671	4594	4594
q22	524	471	402	402
Total cold run time: 57760 ms
Total hot run time: 52187 ms

@hello-stephen (Contributor)

TPC-DS: Total hot run time: 171958 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e439889ead481a5c6706214d00a60abc77a5c9e0, data reload: false

query5	4331	661	532	532
query6	325	219	203	203
query7	4220	576	308	308
query8	330	234	226	226
query9	8785	4112	4100	4100
query10	467	346	305	305
query11	5812	2430	2200	2200
query12	189	130	132	130
query13	1274	623	460	460
query14	6174	5521	5170	5170
query14_1	4489	4508	4477	4477
query15	221	205	190	190
query16	990	481	430	430
query17	1149	762	639	639
query18	2484	510	368	368
query19	231	215	167	167
query20	141	132	131	131
query21	219	142	157	142
query22	13685	13626	13393	13393
query23	17424	16573	16128	16128
query23_1	16382	16458	16427	16427
query24	7497	1776	1313	1313
query24_1	1332	1329	1315	1315
query25	533	476	422	422
query26	1325	321	173	173
query27	2731	583	348	348
query28	4465	2041	2010	2010
query29	955	611	495	495
query30	303	237	205	205
query31	1136	1078	955	955
query32	91	77	74	74
query33	553	369	290	290
query34	1204	1168	660	660
query35	784	808	707	707
query36	1403	1426	1248	1248
query37	149	103	91	91
query38	3214	3180	3077	3077
query39	951	922	902	902
query39_1	901	917	884	884
query40	227	153	125	125
query41	65	66	62	62
query42	109	106	109	106
query43	328	335	298	298
query44	
query45	219	206	194	194
query46	1061	1245	759	759
query47	2342	2370	2245	2245
query48	395	434	297	297
query49	630	496	391	391
query50	976	348	254	254
query51	4415	4299	4298	4298
query52	107	107	95	95
query53	253	284	215	215
query54	328	300	254	254
query55	97	92	87	87
query56	298	301	304	301
query57	1437	1412	1332	1332
query58	314	290	280	280
query59	1571	1660	1422	1422
query60	335	335	319	319
query61	182	180	180	180
query62	701	644	589	589
query63	250	207	215	207
query64	2477	851	713	713
query65	
query66	1748	483	381	381
query67	29874	29702	29525	29525
query68	
query69	475	358	319	319
query70	1060	967	1053	967
query71	312	284	330	284
query72	3035	2634	2404	2404
query73	890	776	449	449
query74	5136	4930	4773	4773
query75	2730	2614	2281	2281
query76	2257	1135	779	779
query77	394	413	343	343
query78	12417	12534	11891	11891
query79	1466	1084	788	788
query80	638	535	456	456
query81	453	281	249	249
query82	1346	159	122	122
query83	352	285	249	249
query84	269	138	108	108
query85	897	584	462	462
query86	396	340	319	319
query87	3444	3390	3258	3258
query88	3636	2752	2722	2722
query89	460	393	341	341
query90	1980	182	188	182
query91	179	170	138	138
query92	79	79	75	75
query93	1454	1470	966	966
query94	551	345	305	305
query95	688	391	357	357
query96	1118	827	354	354
query97	2720	2713	2628	2628
query98	232	227	222	222
query99	1155	1174	1041	1041
Total cold run time: 254645 ms
Total hot run time: 171958 ms

@hello-stephen (Contributor)

BE Regression && UT Coverage Report

Increment line coverage 100.00% (1/1) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
|---|---|
| Function Coverage | 73.79% (28081/38054) |
| Line Coverage | 57.76% (305226/528421) |
| Region Coverage | 54.89% (255358/465245) |
| Branch Coverage | 56.38% (110295/195627) |

2 participants