-
Notifications
You must be signed in to change notification settings - Fork 4.5k
[BEAM-9439] Return split instead of total backlog size in KinesisIO #11377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-9439] Return split instead of total backlog size in KinesisIO #11377
Conversation
The split backlog size can be over-estimated as it reports the size of the records across all shards (and also assumes that all shards in the split have the same progress). This can lead to unnecessary decisions to scale up the number of workers but will never fail to scale up when this is necessary. Also the watermark policy of the reader is user configurable and we don't have a control over it. The calculation of backlog bytes should always take into account the time that the record was inserted into Kinesis as this gives accurate results which are independent of current processing time or any other conditions that reader watermark can be using. Therefore a separate, fixed policy is used to track processed records for the backlog calculation.
|
@sgraca Thank you for contribution! I'll take a look on this next week. |
|
Hi @aromanenko-dev Have you been able to make any progress with the review? |
|
Run Java PreCommit |
|
@sgraca Thank you for pinging me, sorry for delay. I'll try to review this asap. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, it LGTM in general, I just left one minor question.
sdks/java/io/kinesis/src/main/java/org/apache/beam/sdk/io/kinesis/ShardRecordsIterator.java
Show resolved
Hide resolved
aromanenko-dev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you @sgraca for working on this.
|
@aromanenko-dev Thank you for the review and the merge. |
This merge request changes how backlog bytes are reported by KinesisReader.
Instead of total backlog bytes, which could result in Dataflow pipelines not scaling up,
split backlog is now reported.
The split backlog size can be over-estimated as it reports the size
of the records across all shards (and also assumes that all shards
in the split have the same progress).
This can lead to unnecessary decisions to scale up the number of workers
but will never fail to scale up when this is necessary.
Also the watermark policy of the reader is user configurable and we
don't have a control over it. The calculation of backlog bytes should
always take into account the time that the record was inserted into
Kinesis as this gives accurate results which are independent of
current processing time or any other conditions that reader watermark
can be using. Therefore a separate, fixed policy is used to track
processed records for the backlog calculation.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username).[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replaceBEAM-XXXwith the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.