[fix](s3) Avoid retrying object storage SlowDown errors #63776
Force-pushed from bb5a39c to ea244dd.
run buildall
TPC-H: Total hot run time: 31606 ms
TPC-DS: Total hot run time: 171937 ms
/review
Review result: no blocking findings in the actual PR file set reported by GitHub (S3/object retry strategy changes only).
Critical checkpoint conclusions:
- Goal/test: The change avoids retrying AWS S3 SlowDown responses by passing retry_slow_down=false in BE and recycler S3 clients; no automated test is included, so behavior relies on the narrow condition in S3CustomRetryStrategy.
- Scope: The implementation is small and focused: common retry strategy plus the two S3 client construction sites; Azure explicit 429 insertion removal appears consistent with relying on SDK retry defaults.
- Concurrency/lifecycle: No new shared mutable concurrency or lifecycle ownership changes beyond immutable retry-strategy configuration at client construction.
- Config/compatibility: No new config or storage/protocol compatibility impact found.
- Parallel paths: BE S3 client and cloud recycler S3 accessor were both updated; no other S3CustomRetryStrategy call sites were found.
- Error handling/observability: SlowDown now returns false before retry metric/logging, which matches the intended no-retry behavior; other retryable errors still preserve existing metric/log behavior.
- Data correctness/transactions/persistence: No data visibility, transaction, delete-bitmap, or persistence paths are changed.
- Performance: The change reduces retry/backoff work for the targeted SlowDown response and does not add hot-path overhead.
- Tests: I did not run tests in this review. The main residual risk is lack of a unit test for ShouldRetry covering SlowDown vs ordinary 503/retryable errors.
User focus: no additional user-provided review focus was specified.
Force-pushed from ea244dd to e439889.
run buildall
TPC-H: Total hot run time: 31845 ms
TPC-DS: Total hot run time: 171958 ms
What problem does this PR solve?
Object storage throttling errors can be retried by the SDK retry policy. When requests are already rate limited, these retries add extra sleep time and delay the caller from moving on to its next processing step.
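As a rough illustration of that cost, the sketch below sums the sleep time of a capped exponential backoff. The function name `total_backoff_ms` and the base delay, cap, and retry counts used with it are hypothetical; they are not taken from Doris or from any SDK's defaults.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: total time spent sleeping across `retries` retries
// when retry i waits min(base_ms * 2^i, cap_ms) milliseconds.
std::int64_t total_backoff_ms(int retries, std::int64_t base_ms, std::int64_t cap_ms) {
    assert(retries >= 0 && base_ms >= 0 && cap_ms >= 0);
    std::int64_t total = 0;
    std::int64_t delay = base_ms;
    for (int i = 0; i < retries; ++i) {
        total += std::min(delay, cap_ms);
        // Stop doubling once the cap is reached so the value cannot overflow.
        if (delay < cap_ms) {
            delay *= 2;
        }
    }
    return total;
}
```

With an assumed 25 ms base and 1000 ms cap, three retries sleep for 25 + 50 + 100 = 175 ms in total, and ten retries sleep for 5575 ms. A throttled request that is not retried spends none of that time sleeping.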
This change disables retry for throttling responses in object storage clients:
- `SlowDown` errors are not retried.
- `TooManyRequests` is not added to retryable status codes.
- Other retryable errors keep the existing retry behavior.
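The rule described above can be sketched as a standalone predicate. This is a minimal illustration, not the actual `S3CustomRetryStrategy` code: `ObjectStorageError`, `ThrottleAwareRetryPolicy`, and their fields are hypothetical stand-ins rather than AWS or Azure SDK types. Only the `SlowDown` error code and the `retry_slow_down` flag name come from this PR.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for an SDK error object (not a real SDK type).
struct ObjectStorageError {
    int http_status;         // e.g. 503
    std::string error_code;  // e.g. "SlowDown"
    bool sdk_retryable;      // what the default SDK policy would decide
};

// Hypothetical policy: defer to the SDK decision, except for SlowDown.
class ThrottleAwareRetryPolicy {
public:
    ThrottleAwareRetryPolicy(int max_retries, bool retry_slow_down)
            : max_retries_(max_retries), retry_slow_down_(retry_slow_down) {
        assert(max_retries >= 0);
    }

    bool should_retry(const ObjectStorageError& err, int attempted_retries) const {
        if (attempted_retries >= max_retries_) {
            return false;
        }
        // Reject throttling first, before any retry metric or log would be emitted.
        if (!retry_slow_down_ && err.error_code == "SlowDown") {
            return false;
        }
        // Every other error keeps the default decision.
        return err.sdk_retryable;
    }

private:
    int max_retries_;
    bool retry_slow_down_;
};
```

With `retry_slow_down` set to `false`, a call such as `should_retry({503, "SlowDown", true}, 0)` returns `false`, while a different retryable 503 error still returns `true` until the retry budget is exhausted. Matching on the error code rather than the HTTP status keeps the condition narrow, so other 503 responses are unaffected.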
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)