Increase retry backoff for Storage API batch to survive AppendRows quota refill #31837

Merged: 7 commits merged into apache:master on Jul 15, 2024

Conversation

@ahmedabu98 (Contributor) commented Jul 11, 2024

After fixing the concurrent connections issue (#31710), the only blocker to making Storage API batch scalable is managing the AppendRows throughput quota. The Storage API backend splits this quota into a short-term (cell) quota and a long-term (region) quota:

  • the short-term quota can take up to 10 seconds to refill
  • the long-term quota is an aggregate of multiple cells and can take up to 10 minutes to refill

It's important to note that all append operations are rejected while a quota is being refilled.

The standard throughput quota is not sufficient for large writes. A large pipeline will typically exhaust the long-term quota quickly, leading to consistent failures for up to 10 minutes. With enough failures (10 failed attempts per bundle, 4 failed bundles per Dataflow pipeline), the pipeline eventually gives up and fails.

To deal with this, we should increase the number of retries so that pipelines can survive until the throughput quota is refilled.

Disclaimer:

Before this change, in the worst case where all append operations fail with quota errors, each bundle retries for up to 66 seconds. After 4 bundle failures, a Dataflow pipeline can spend up to about 4.4 minutes before failing.

With this change, the worst-case wait time goes up to 10 minutes per bundle and 40 minutes per pipeline.
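
For illustration, here is a minimal sketch (not the connector's actual code) of how a bounded retry policy turns into a per-bundle wait budget. The FluentBackoff settings and the appendWithRetries wrapper are assumptions chosen to line up with the ~10 minute figure above, not values taken from this PR:

```java
import java.io.IOException;
import com.google.api.client.util.BackOff;
import com.google.api.client.util.BackOffUtils;
import com.google.api.client.util.Sleeper;
import org.apache.beam.sdk.util.FluentBackoff;
import org.joda.time.Duration;

public class AppendRetryBudget {

  /** Stand-in for a single AppendRows attempt that may hit a quota error. */
  interface Append {
    void run() throws IOException;
  }

  static void appendWithRetries(Append append) throws IOException, InterruptedException {
    // Illustrative settings: capping per-attempt backoff at 1 minute and allowing
    // ~10 minutes of cumulative backoff lets a bundle outlast the long-term
    // (region) quota refill instead of giving up after ~66 seconds.
    BackOff backoff =
        FluentBackoff.DEFAULT
            .withInitialBackoff(Duration.standardSeconds(1))
            .withMaxBackoff(Duration.standardMinutes(1))
            .withMaxCumulativeBackoff(Duration.standardMinutes(10)) // per-bundle budget
            .withMaxRetries(30)
            .backoff();

    while (true) {
      try {
        append.run();
        return;
      } catch (IOException quotaError) {
        // BackOffUtils.next returns false once the policy is exhausted; with the
        // budget above that is ~10 min per bundle, or ~40 min across Dataflow's
        // 4 bundle-level retries.
        if (!BackOffUtils.next(Sleeper.DEFAULT, backoff)) {
          throw quotaError;
        }
      }
    }
  }
}
```

Per the commit messages below, the actual change simply increases the allowed retries rather than adding a quota-specific backoff.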

@ahmedabu98 ahmedabu98 marked this pull request as draft July 11, 2024 00:28
codecov bot commented Jul 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.45%. Comparing base (8153867) to head (49f0ba3).
Report is 47 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #31837      +/-   ##
============================================
- Coverage     58.62%   58.45%   -0.18%     
- Complexity     3020     3023       +3     
============================================
  Files          1120     1119       -1     
  Lines        172215   171442     -773     
  Branches       3257     3262       +5     
============================================
- Hits         100968   100220     -748     
+ Misses        67956    67927      -29     
- Partials       3291     3295       +4     
Flag Coverage Δ
java 69.64% <100.00%> (+0.01%) ⬆️


@github-actions github-actions bot removed the python label Jul 11, 2024
@ahmedabu98 (Contributor, Author) commented:

[Before fix] Job writing 20B records: 2024-07-10_14_16_17-4717587713994683558

Eventually hits the long-term quota and experiences consistent failures until the pipeline gives up:

[screenshot]


[After fix] Identical job writing 20B records: 2024-07-10_15_51_42-2248769606448533416

Enough breathing space is created so that the pipeline can survive until the quota cools off:

[screenshot]

Throughout a pipeline's lifecycle, the long-term quota can be repeatedly exhausted but the pipeline remains resilient:

[screenshot]

@ahmedabu98 ahmedabu98 changed the title Increase retry backoff for Storage API batch Increase retry backoff for Storage API batch to survive AppendRows quota refill Jul 11, 2024
@ahmedabu98 ahmedabu98 marked this pull request as ready for review July 11, 2024 01:49
@ahmedabu98 (Contributor, Author) commented:

R: @Abacn
R: @reuvenlax


Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

@liferoad liferoad added this to the 2.58.0 Release milestone Jul 11, 2024
@liferoad (Collaborator) commented:

Can we do a cherry-pick for this? @jrmccluskey

@Abacn (Contributor) left a comment:

Thanks, this LGTM

@ahmedabu98 (Contributor, Author) commented Jul 11, 2024

Turns out managing the AppendRows quota actually isn't the last blocker. I tried writing a much bigger load (2024-07-11_13_09_24-157554027781390667), and the sink handled all the append operations well, but it got stuck at the finalize and commit step:

RESOURCE_EXHAUSTED: Exceeds quota limit subject: bigquerystorage.googleapis.com/write/pending_stream_bytes
[screenshot]

pending_stream_bytes is a quota placed on PENDING stream types (which is what we use for batch). The limit is 1 TB for small regions.

In our finalize and commit step, we finalize each stream one by one, then perform a single commit on all of them:

return writeStreamService.commitWriteStreams(tableId, streamsToCommit);

For large writes, the aggregate byte size of all streams can easily exceed 1 TB. Instead, we should probably break this up into multiple commit operations.
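
A hypothetical sketch of breaking this into size-bounded commit batches is below; the WriteStreamService stand-in, the per-stream size lookup, and the 1 TB threshold are illustrative assumptions rather than existing connector code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ChunkedCommit {

  /** Stand-in for the service wrapper that owns commitWriteStreams. */
  interface WriteStreamService {
    void commitWriteStreams(String tableId, List<String> streamsToCommit);
  }

  // Roughly the pending_stream_bytes limit mentioned above (~1 TB).
  private static final long MAX_COMMIT_BYTES = 1L << 40;

  private final WriteStreamService writeStreamService;

  ChunkedCommit(WriteStreamService writeStreamService) {
    this.writeStreamService = writeStreamService;
  }

  /** Commits streams in batches whose aggregate byte size stays under the quota. */
  void commitInChunks(
      String tableId, List<String> streamsToCommit, Map<String, Long> streamBytes) {
    List<String> batch = new ArrayList<>();
    long batchBytes = 0;
    for (String stream : streamsToCommit) {
      long size = streamBytes.getOrDefault(stream, 0L);
      // Flush the current batch before it would exceed the quota.
      if (!batch.isEmpty() && batchBytes + size > MAX_COMMIT_BYTES) {
        writeStreamService.commitWriteStreams(tableId, batch);
        batch = new ArrayList<>();
        batchBytes = 0;
      }
      batch.add(stream);
      batchBytes += size;
    }
    if (!batch.isEmpty()) {
      writeStreamService.commitWriteStreams(tableId, batch);
    }
  }
}
```

Note that this would trade the single atomic commit for several smaller ones, which is the concern raised in the next comment.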

@Abacn (Contributor) commented Jul 11, 2024

Just wondering if committing once is intended. Conversely, if the data is committed and the job later fails, will it leave partially written data in the sink?

@liferoad (Collaborator) commented:

Can we create a new issue to track #31837 (comment)? For this PR, can we update CHANGES.md?

@ahmedabu98 (Contributor, Author) commented:

Just wondering if committing once is intended. Conversely, if the data is committed and the job later fails, will it leave partially written data in the sink?

This is a good question; it could indeed be intentional. I raised a tracker here: #31872

@ahmedabu98 ahmedabu98 merged commit 0b61035 into apache:master Jul 15, 2024
19 checks passed
This was referenced Jul 16, 2024
jrmccluskey pushed a commit that referenced this pull request Jul 16, 2024
* Increase retry backoff for Storage API batch

* longer waits for quota error only

* cleanup

* add to CHANGES.md

* no need for quota backoff. just increase allowed retries

* cleanup
acrites pushed a commit to acrites/beam that referenced this pull request Jul 17, 2024
…ota refill (apache#31837)

* Increase retry backoff for Storage API batch

* longer waits for quota error only

* cleanup

* add to CHANGES.md

* no need for quota backoff. just increase allowed retries

* cleanup