Skip to content

Conversation

jeffswenson
Copy link
Collaborator

Backport 1/1 commits from #151817.

/cc @cockroachdb/release

Release justification: Adds a setting cloudstorage.s3.client_retry_token_bucket.enabled which can be set to false to mitigate an issue impacting backups on large AWS clusters.


This change adds two settings to control the S3 client retry behavior:

  1. cloudstorage.s3.max_retries: this replaces a constant that was hardcoded to 10. It controls how many times the s3 client will retry an error.
  2. cloudstorage.s3.client_retry_token_bucket.enabled: this disables the client retry token bucket. The token bucket has no time based refill, and can only refill when new operations are started, so it can get stuck in an empty state with fixed concurrency operations like backup and restore.

There is also a minor improvement to verbose logging in the S3 client:

  • --vmodule=s3_storage=1 will enable retry logging and deprecation
    warnings.
  • --vmodule=s3_storage=2 logs requests and responses.
  • --vmodule=s3_storage=3 logs headers+bodies of messages.

Informs: #151748
Release note: Tunes S3 client retry behavior to be more reliable in the presence of correlated errors.

This change adds two settings to control the S3 client retry behavior:
1. `cloudstorage.s3.max_retries`: this replaces a constant that was
	 hardcoded to 10. It controls how many times the s3 client will retry
	 an error.
2. `cloudstorage.s3.client_retry_token_bucket.enabled`: this disables the
	 client retry token bucket. The token bucket has no time based refill,
	 and can only refill when new operations are started, so it can get
	 stuck in an empty state with fixed concurrency operations like backup
	 and restore.

There is also a minor improvement to verbose logging in the S3 client:
* --vmodule=s3_storage=1 will enable retry logging and deprecation
	warnings.
* --vmodule=s3_storage=2 logs requests and responses.
* --vmodule=s3_storage=3 logs headers+bodies of messages.

Informs: cockroachdb#151748
Release note: Tunes S3 client retry behavior to be more reliable in the
presence of correlated errors.
@jeffswenson jeffswenson requested a review from a team as a code owner August 14, 2025 17:22
@jeffswenson jeffswenson requested review from kev-cao and removed request for a team August 14, 2025 17:22
Copy link

blathers-crl bot commented Aug 14, 2025

Thanks for opening a backport.

Before merging, please confirm that it falls into one of the following categories (select one):

  • Non-production code changes. Includes test-only changes, build system changes, etc.
  • Fixes for serious issues. Defined in the policy as correctness, stability, or security issues, data corruption/loss, significant performance regressions, breaking working and widely used functionality, or an inability to detect and debug production issues.
  • Other approved changes. These changes must be gated behind a disabled-by-default feature flag unless there is a strong justification not to.

Add a brief release justification to the PR description explaining your selection.

Also, confirm that the change does not break backward compatibility and complies with all aspects of the backport policy.

All backports must be reviewed by the TL and EM for the owning area.

@blathers-crl blathers-crl bot added backport Label PR's that are backports to older release branches T-disaster-recovery labels Aug 14, 2025
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link

blathers-crl bot commented Aug 14, 2025

✅ PR #151875 is compliant with backport policy

Confidence: high
Feature flag detected: Yes
Backward compatible: true
Explanation: This pull request is compliant with the backport policy due to the introduction of new cluster settings that serve as feature flags and are disabled by default. Specifically, the cloudstorage.s3.client_retry_token_bucket.enabled setting is introduced to control the new S3 client retry behavior, which can be a critical adjustment for stable operation in specific scenarios like backups on large AWS clusters. The default state of this setting is explicitly set to false in the backport, ensuring it meets the feature flag requirement that new features or changes be disabled by default. Additionally, there are other improvements like logging enhancements and a replacement of a hardcoded retry count which now relies on a configurable setting, also defaulting the retry count to the historic norm of 10 retries. These types of changes, while altering the behavior, are configured in a way that adheres to the policy for backporting.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@jeffswenson jeffswenson requested a review from msbutler August 14, 2025 17:51
@jeffswenson jeffswenson merged commit 32530f7 into cockroachdb:release-24.3 Aug 14, 2025
25 of 26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Label PR's that are backports to older release branches T-disaster-recovery v24.3.20
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants