Increase retry backoff for Storage API batch to survive AppendRows quota refill #31837
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:

@@             Coverage Diff              @@
##             master    #31837      +/-  ##
============================================
- Coverage     58.62%    58.45%   -0.18%
- Complexity     3020      3023       +3
============================================
  Files          1120      1119       -1
  Lines        172215    171442     -773
  Branches       3257      3262       +5
============================================
- Hits         100968    100220     -748
+ Misses        67956     67927      -29
- Partials       3291      3295       +4
[Before fix] Job writing 20B records: 2024-07-10_14_16_17-4717587713994683558
Eventually hits the long-term quota and experiences consistent failures until the pipeline gives up.

[After fix] Identical job writing 20B records: 2024-07-10_15_51_42-2248769606448533416
Enough breathing space is created so that the pipeline can survive until the quota cools off.

Throughout a pipeline's lifecycle, the long-term quota can be repeatedly exhausted, but the pipeline remains resilient.
R: @Abacn |
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment |
Can we do a cherry-pick for this? @jrmccluskey |
Thanks, this LGTM
...io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/RetryManager.java
Turns out managing the AppendRows quota isn't the last blocker after all. I tried writing with a much bigger load (2024-07-11_13_09_24-157554027781390667) and the sink handled all the append operations well, but it got stuck at the finalize and commit step:

Pending stream bytes is a quota placed on PENDING stream types (which is what we use for batch). It's a maximum of 1 TB for small regions. In our finalize-and-commit step, we finalize each stream one by one and then perform a single commit on all of them (Line 166 in 50a3403).

For large writes, the aggregate byte size of all streams can easily exceed 1 TB. Instead, we should probably break this up into multiple commit operations; a rough sketch follows.
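To make the suggestion concrete, here is a minimal sketch of chunked commits against the Storage Write API client. This is not the sink's actual code: `commitInChunks` and `MAX_STREAMS_PER_COMMIT` are hypothetical names, and a real implementation would chunk by aggregate byte size rather than stream count.

```java
import com.google.cloud.bigquery.storage.v1.BatchCommitWriteStreamsRequest;
import com.google.cloud.bigquery.storage.v1.BatchCommitWriteStreamsResponse;
import com.google.cloud.bigquery.storage.v1.BigQueryWriteClient;
import java.io.IOException;
import java.util.List;

public class ChunkedCommitSketch {

  // Hypothetical limit; a real implementation would track the aggregate byte
  // size of the finalized streams to stay under the pending-stream-bytes quota.
  private static final int MAX_STREAMS_PER_COMMIT = 100;

  static void commitInChunks(
      BigQueryWriteClient client, String tableName, List<String> finalizedStreams)
      throws IOException {
    for (int start = 0; start < finalizedStreams.size(); start += MAX_STREAMS_PER_COMMIT) {
      List<String> chunk =
          finalizedStreams.subList(
              start, Math.min(start + MAX_STREAMS_PER_COMMIT, finalizedStreams.size()));

      // Each call commits only this chunk of PENDING streams.
      BatchCommitWriteStreamsRequest request =
          BatchCommitWriteStreamsRequest.newBuilder()
              .setParent(tableName)
              .addAllWriteStreams(chunk)
              .build();
      BatchCommitWriteStreamsResponse response = client.batchCommitWriteStreams(request);

      if (!response.getStreamErrorsList().isEmpty()) {
        throw new IOException("Commit failed for chunk: " + response.getStreamErrorsList());
      }
    }
  }
}
```

The trade-off is that the table-level commit is no longer atomic: an earlier chunk may already be visible when a later chunk fails, which is exactly the partial-write concern raised below.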
Just wondering if committing once is intended. Conversely, if the data is committed and the job later fails, will it leave partially written data in the sink?
Can we create a new issue to track #31837 (comment)? For this PR, can we update CHANGES.md?
…er_backoff_storapi
This is a good question, it could indeed be intentional. I raised a tracker here: #31872
…ota refill (apache#31837)
* Increase retry backoff for Storage API batch
* longer waits for quota error only
* cleanup
* add to CHANGES.md
* no need for quota backoff. just increase allowed retries
* cleanup
After fixing the concurrent connections issue (#31710), the only remaining blocker to making Storage API batch scalable is managing the AppendRows throughput quota. The Storage API backend enforces this as two quotas: a short-term (cell) quota and a long-term (region) quota.
It's important to note that all append operations are rejected while a quota is being refilled.
The standard throughput quota is not sufficient for large writes. A large pipeline will typically exhaust the long-term quota quickly, leading to consistent failures for about 10 minutes. With enough failures (10 failed attempts per bundle, 4 failed bundles per Dataflow pipeline), the pipeline eventually gives up and fails.
To deal with this, we should increase the number of retries so that pipelines can survive long enough for the throughput quota to be refilled.
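A rough sketch of the idea (this is not the actual RetryManager change; the constants, names, and error check below are hypothetical): appends that fail with a quota error get a much larger retry budget, with capped exponential backoff plus jitter, so a bundle can outlast the quota refill window.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class QuotaAwareRetrySketch {

  // Hypothetical budgets: ordinary errors keep a small retry budget,
  // quota errors get a much larger one so the long-term quota can refill.
  private static final int DEFAULT_MAX_RETRIES = 5;
  private static final int QUOTA_ERROR_MAX_RETRIES = 20;
  private static final long INITIAL_BACKOFF_MS = 1_000L;
  private static final long MAX_BACKOFF_MS = 60_000L;

  static <T> T runWithRetries(Callable<T> appendOp) throws Exception {
    int attempt = 0;
    while (true) {
      try {
        return appendOp.call();
      } catch (Exception e) {
        int maxRetries = isQuotaError(e) ? QUOTA_ERROR_MAX_RETRIES : DEFAULT_MAX_RETRIES;
        attempt++;
        if (attempt > maxRetries) {
          throw e;
        }
        // Capped exponential backoff with jitter so workers don't retry in lockstep.
        long backoff = Math.min(MAX_BACKOFF_MS, INITIAL_BACKOFF_MS << Math.min(attempt, 16));
        Thread.sleep(backoff / 2 + ThreadLocalRandom.current().nextLong(backoff / 2 + 1));
      }
    }
  }

  private static boolean isQuotaError(Exception e) {
    // Hypothetical check; real code would inspect the gRPC status / error reason.
    String msg = String.valueOf(e.getMessage());
    return msg.contains("RESOURCE_EXHAUSTED") || msg.toLowerCase().contains("quota");
  }
}
```

Per the squashed commit messages above, the actual change keeps the normal backoff and only raises the number of allowed retries; the sketch just illustrates treating quota failures differently from other failures.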
Disclaimer:
Before this change, in the worst case where all append operations fail with quota errors, each bundle retries for about 66 seconds. After 4 bundle failures, a Dataflow pipeline can wait up to roughly 4.4 minutes before failing.
With this change, the worst-case wait time goes up to about 10 minutes per bundle and 40 minutes per pipeline.
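For reference, the arithmetic behind those numbers: about 66 s of retrying per bundle × 4 bundle attempts ≈ 264 s ≈ 4.4 min before the change, and about 10 min per bundle × 4 attempts ≈ 40 min with it.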