Increase retry backoff for Storage API batch to survive AppendRows quota refill #31837
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:

@@             Coverage Diff              @@
##             master    #31837      +/-  ##
============================================
- Coverage     58.62%    58.45%   -0.18%
- Complexity     3020      3023       +3
============================================
  Files          1120      1119       -1
  Lines        172215    171442     -773
  Branches       3257      3262       +5
============================================
- Hits         100968    100220     -748
+ Misses        67956     67927      -29
- Partials       3291      3295       +4
[Before fix] Job writing 20B records: 2024-07-10_14_16_17-4717587713994683558
Eventually hits the long-term quota and experiences consistent failures until the pipeline gives up.

[After fix] Identical job writing 20B records: 2024-07-10_15_51_42-2248769606448533416
Enough breathing space is created so that the pipeline can survive until the quota cools off.

Throughout a pipeline's lifecycle, the long-term quota can be repeatedly exhausted, but the pipeline remains resilient.
R: @Abacn |
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment |
Can we do a cherry-pick for this? @jrmccluskey |
Thanks, this LGTM
...io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/RetryManager.java
Turns out managing the AppendRows quota isn't the last blocker after all. I tried writing with a much bigger load (2024-07-11_13_09_24-157554027781390667) and the sink handled all the append operations well, but it got stuck at the finalize and commit step:

Pending stream bytes is a quota placed on PENDING stream types (which is what we use for batch). It's a maximum of 1 TB for small regions. In our finalize-and-commit step, we finalize each stream one by one and then perform a single commit on all of them (Line 166 in 50a3403).

For large writes, the aggregate byte size of all streams can easily exceed 1 TB. Instead, we should probably break this up into multiple commit operations; a rough sketch follows.
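To make the suggestion concrete, here is a minimal sketch of chunked commits against the Storage Write API client. This is not the sink's actual code: `commitInChunks` and `MAX_STREAMS_PER_COMMIT` are hypothetical names, and a real implementation would chunk by aggregate byte size rather than stream count.

```java
import com.google.cloud.bigquery.storage.v1.BatchCommitWriteStreamsRequest;
import com.google.cloud.bigquery.storage.v1.BatchCommitWriteStreamsResponse;
import com.google.cloud.bigquery.storage.v1.BigQueryWriteClient;
import java.io.IOException;
import java.util.List;

public class ChunkedCommitSketch {

  // Hypothetical limit; a real implementation would track the aggregate byte
  // size of the finalized streams to stay under the pending-stream-bytes quota.
  private static final int MAX_STREAMS_PER_COMMIT = 100;

  static void commitInChunks(
      BigQueryWriteClient client, String tableName, List<String> finalizedStreams)
      throws IOException {
    for (int start = 0; start < finalizedStreams.size(); start += MAX_STREAMS_PER_COMMIT) {
      List<String> chunk =
          finalizedStreams.subList(
              start, Math.min(start + MAX_STREAMS_PER_COMMIT, finalizedStreams.size()));

      // Each call commits only this chunk of PENDING streams.
      BatchCommitWriteStreamsRequest request =
          BatchCommitWriteStreamsRequest.newBuilder()
              .setParent(tableName)
              .addAllWriteStreams(chunk)
              .build();
      BatchCommitWriteStreamsResponse response = client.batchCommitWriteStreams(request);

      if (!response.getStreamErrorsList().isEmpty()) {
        throw new IOException("Commit failed for chunk: " + response.getStreamErrorsList());
      }
    }
  }
}
```

The trade-off is that the table-level commit is no longer atomic: an earlier chunk may already be visible when a later chunk fails, which is exactly the partial-write concern raised below.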
Just wondering if committing once is intended. Conversely, if the data is committed and the job later fails, will it leave partially written data in the sink?
Can we create a new issue to track #31837 (comment)? For this PR, can we update CHANGES.md?
…er_backoff_storapi
This is a good question, it could indeed be intentional. I raised a tracker here: #31872
…ota refill (apache#31837)
* Increase retry backoff for Storage API batch
* longer waits for quota error only
* cleanup
* add to CHANGES.md
* no need for quota backoff. just increase allowed retries
* cleanup
After fixing the concurrent connections issue (#31710), the only remaining blocker to making Storage API batch scalable is managing the AppendRows throughput quota. The Storage API backend enforces this as two quotas: a short-term (cell) quota and a long-term (region) quota.
It's important to note that all append operations are rejected while a quota is being refilled.
The standard throughput quota is not sufficient for large writes. A large pipeline will typically exhaust the long-term quota quickly, leading to consistent failures for about 10 minutes. With enough failures (10 failed attempts per bundle, 4 failed bundles per Dataflow pipeline), the pipeline eventually gives up and fails.
To deal with this, we should increase the number of retries so that pipelines can survive long enough for the throughput quota to be refilled.
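A rough sketch of the idea (this is not the actual RetryManager change; the constants, names, and error check below are hypothetical): appends that fail with a quota error get a much larger retry budget, with capped exponential backoff plus jitter, so a bundle can outlast the quota refill window.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class QuotaAwareRetrySketch {

  // Hypothetical budgets: ordinary errors keep a small retry budget,
  // quota errors get a much larger one so the long-term quota can refill.
  private static final int DEFAULT_MAX_RETRIES = 5;
  private static final int QUOTA_ERROR_MAX_RETRIES = 20;
  private static final long INITIAL_BACKOFF_MS = 1_000L;
  private static final long MAX_BACKOFF_MS = 60_000L;

  static <T> T runWithRetries(Callable<T> appendOp) throws Exception {
    int attempt = 0;
    while (true) {
      try {
        return appendOp.call();
      } catch (Exception e) {
        int maxRetries = isQuotaError(e) ? QUOTA_ERROR_MAX_RETRIES : DEFAULT_MAX_RETRIES;
        attempt++;
        if (attempt > maxRetries) {
          throw e;
        }
        // Capped exponential backoff with jitter so workers don't retry in lockstep.
        long backoff = Math.min(MAX_BACKOFF_MS, INITIAL_BACKOFF_MS << Math.min(attempt, 16));
        Thread.sleep(backoff / 2 + ThreadLocalRandom.current().nextLong(backoff / 2 + 1));
      }
    }
  }

  private static boolean isQuotaError(Exception e) {
    // Hypothetical check; real code would inspect the gRPC status / error reason.
    String msg = String.valueOf(e.getMessage());
    return msg.contains("RESOURCE_EXHAUSTED") || msg.toLowerCase().contains("quota");
  }
}
```

Per the squashed commit messages above, the actual change keeps the normal backoff and only raises the number of allowed retries; the sketch just illustrates treating quota failures differently from other failures.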
Disclaimer:
Before this change, in the worst case where all append operations fail with quota errors, each bundle retries for about 66 seconds. After 4 bundle failures, a Dataflow pipeline can wait up to roughly 4.4 minutes before failing.
With this change, the worst-case wait time goes up to about 10 minutes per bundle and 40 minutes per pipeline.
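For reference, the arithmetic behind those numbers: about 66 s of retrying per bundle × 4 bundle attempts ≈ 264 s ≈ 4.4 min before the change, and about 10 min per bundle × 4 attempts ≈ 40 min with it.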