Conversation

jeffswenson
Collaborator

Backport 1/1 commits from #151817.

/cc @cockroachdb/release

Release justification: Adds a setting, cloudstorage.s3.client_retry_token_bucket.enabled, that can be set to false to mitigate an issue impacting backups on large AWS clusters.
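
For operators affected by the issue, a minimal mitigation sketch follows, assuming both knobs are ordinary cluster settings reachable via `SET CLUSTER SETTING`; the connection string and driver choice are illustrative and not part of this PR.

```go
// Hypothetical mitigation sketch: flips the two settings from a SQL session
// using database/sql and the pgx stdlib driver. The connection string and the
// use of SET CLUSTER SETTING are assumptions, not taken from this PR.
package main

import (
	"database/sql"
	"log"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
)

func main() {
	// Assumed local, insecure connection string for illustration only.
	db, err := sql.Open("pgx", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Disable the client retry token bucket to mitigate stuck retries.
	if _, err := db.Exec(`SET CLUSTER SETTING cloudstorage.s3.client_retry_token_bucket.enabled = false`); err != nil {
		log.Fatal(err)
	}
	// Illustrative value; the PR states the previous hardcoded retry count was 10.
	if _, err := db.Exec(`SET CLUSTER SETTING cloudstorage.s3.max_retries = 10`); err != nil {
		log.Fatal(err)
	}
}
```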


This change adds two settings to control the S3 client retry behavior (an illustrative retry-configuration sketch follows this list):

  1. `cloudstorage.s3.max_retries`: replaces a constant that was hardcoded to 10. It controls how many times the S3 client retries a failed request.
  2. `cloudstorage.s3.client_retry_token_bucket.enabled`: setting this to false disables the client retry token bucket. The token bucket has no time-based refill and only refills when new operations are started, so it can get stuck empty under fixed-concurrency workloads such as backup and restore.
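
For context, here is a minimal sketch of what these two knobs correspond to in the AWS SDK for Go v2 retryer: `MaxAttempts` bounds the attempt count, and swapping in `ratelimit.None` removes the client-side retry token bucket. The `maxRetries` and `tokenBucketEnabled` constants are hypothetical stand-ins for the cluster settings, not the actual CockroachDB wiring.

```go
// Illustrative sketch only: shows how an S3 client's retry attempt count and
// client-side retry token bucket can be configured with the AWS SDK for Go v2.
// maxRetries and tokenBucketEnabled are hypothetical stand-ins for the two
// cluster settings; this is not the CockroachDB code itself.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/ratelimit"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	const maxRetries = 10            // stand-in for cloudstorage.s3.max_retries
	const tokenBucketEnabled = false // stand-in for cloudstorage.s3.client_retry_token_bucket.enabled

	cfg, err := config.LoadDefaultConfig(context.Background(),
		config.WithRetryer(func() aws.Retryer {
			return retry.NewStandard(func(o *retry.StandardOptions) {
				// MaxAttempts counts the initial attempt plus retries.
				o.MaxAttempts = maxRetries + 1
				if !tokenBucketEnabled {
					// ratelimit.None is a no-op rate limiter: retries are then
					// governed only by backoff, not by a token bucket that can
					// run dry under a fixed set of long-running operations.
					o.RateLimiter = ratelimit.None
				}
			})
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = s3.NewFromConfig(cfg) // S3 client that uses the retryer above
}
```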

There is also a minor improvement to verbose logging in the S3 client (see the log-mode sketch after this list):

  • --vmodule=s3_storage=1 enables retry logging and deprecation warnings.
  • --vmodule=s3_storage=2 logs requests and responses.
  • --vmodule=s3_storage=3 logs the headers and bodies of messages.
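
As a rough illustration of what each verbosity level surfaces, the sketch below maps a vmodule-style level onto AWS SDK for Go v2 client log modes. The `logModeForVerbosity` helper and the level-to-mode mapping are assumptions for illustration; CockroachDB gates its own log lines on vmodule rather than exposing such a helper.

```go
// Illustrative sketch only: maps a vmodule-style verbosity level onto AWS
// SDK for Go v2 client log modes. logModeForVerbosity is a hypothetical
// helper, not part of CockroachDB or the SDK.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func logModeForVerbosity(level int) aws.ClientLogMode {
	var mode aws.ClientLogMode
	if level >= 1 {
		mode |= aws.LogRetries // retry attempts (level 1)
	}
	if level >= 2 {
		mode |= aws.LogRequest | aws.LogResponse // request/response lines (level 2)
	}
	if level >= 3 {
		mode |= aws.LogRequestWithBody | aws.LogResponseWithBody // full messages with bodies (level 3)
	}
	return mode
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background(),
		config.WithClientLogMode(logModeForVerbosity(2)),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = s3.NewFromConfig(cfg)
}
```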

Informs: #151748
Release note: Tunes S3 client retry behavior to be more reliable in the presence of correlated errors.

jeffswenson requested a review from a team as a code owner August 14, 2025 17:17
jeffswenson requested review from kev-cao and removed request for a team August 14, 2025 17:17

blathers-crl bot commented Aug 14, 2025

Thanks for opening a backport.

Before merging, please confirm that it falls into one of the following categories (select one):

  • Non-production code changes. Includes test-only changes, build system changes, etc.
  • Fixes for serious issues. Defined in the policy as correctness, stability, or security issues, data corruption/loss, significant performance regressions, breaking working and widely used functionality, or an inability to detect and debug production issues.
  • Other approved changes. These changes must be gated behind a disabled-by-default feature flag unless there is a strong justification not to.

Add a brief release justification to the PR description explaining your selection.

Also, confirm that the change does not break backward compatibility and complies with all aspects of the backport policy.

All backports must be reviewed by the TL and EM for the owning area.

blathers-crl bot added the backport and T-disaster-recovery labels Aug 14, 2025

blathers-crl bot commented Aug 14, 2025

❌ PR #151874 does not comply with backport policy

Confidence: medium
Explanation: The PR introduces new settings for controlling S3 client retry behavior, which could potentially address performance stability. However, these changes do not explicitly fix a 'critical bug' as defined by CockroachDB's backport policy. Since the changes introduce new behavior and settings, they should ideally be gated behind a disabled feature flag. The criteria for critical bugs include stability/security issues, data corruption/loss, significant performance regressions, or incorrect results/suboptimal performance, none of which are directly evidenced in the PR details. The justification provided in the PR, related to mitigating an issue in large AWS clusters, needs more explicit linking to the critical bugs criteria. While the PR does introduce a feature flag (cloudstorage.s3.client_retry_token_bucket.enabled), it's unclear if this setting is disabled by default which is a requirement for compliance.
Recommendation: Request confirmation that the new settings (cloudstorage.s3.max_retries and cloudstorage.s3.client_retry_token_bucket.enabled) are disabled by default to ensure compliance with the feature flag requirements or reconsider the backport until it can be confirmed.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

jeffswenson requested a review from msbutler August 14, 2025 17:18
jeffswenson merged commit 045390c into cockroachdb:release-25.2 Aug 14, 2025
15 of 16 checks passed
Labels: backport, T-disaster-recovery, v25.2.6
3 participants