Skip to content

Conversation

@sgraca
Copy link
Contributor

@sgraca sgraca commented Apr 10, 2020

This merge request changes how backlog bytes are reported by KinesisReader.

Instead of total backlog bytes, which could result in Dataflow pipelines not scaling up,
split backlog is now reported.

The split backlog size can be over-estimated as it reports the size
of the records across all shards (and also assumes that all shards
in the split have the same progress).
This can lead to unnecessary decisions to scale up the number of workers
but will never fail to scale up when this is necessary.

Also the watermark policy of the reader is user configurable and we
don't have a control over it. The calculation of backlog bytes should
always take into account the time that the record was inserted into
Kinesis as this gives accurate results which are independent of
current processing time or any other conditions that reader watermark
can be using. Therefore a separate, fixed policy is used to track
processed records for the backlog calculation.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
--- Build Status
Build Status
Build Status
Build Status
Build Status
--- --- Build Status
XLang --- --- --- Build Status --- --- Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status
Build Status
Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

The split backlog size can be over-estimated as it reports the size
of the records across all shards (and also assumes that all shards
in the split have the same progress).
This can lead to unnecessary decisions to scale up the number of workers
but will never fail to scale up when this is necessary.

Also the watermark policy of the reader is user configurable and we
don't have a control over it. The calculation of backlog bytes should
always take into account the time that the record was inserted into
Kinesis as this gives accurate results which are independent of
current processing time or any other conditions that reader watermark
can be using. Therefore a separate, fixed policy is used to track
processed records for the backlog calculation.
@sgraca
Copy link
Contributor Author

sgraca commented Apr 10, 2020

R: @aromanenko-dev

@aromanenko-dev aromanenko-dev self-requested a review April 10, 2020 15:33
@aromanenko-dev
Copy link
Contributor

@sgraca Thank you for contribution! I'll take a look on this next week.

@sgraca
Copy link
Contributor Author

sgraca commented May 6, 2020

Hi @aromanenko-dev Have you been able to make any progress with the review?

@aromanenko-dev
Copy link
Contributor

Run Java PreCommit

@aromanenko-dev
Copy link
Contributor

@sgraca Thank you for pinging me, sorry for delay. I'll try to review this asap.

Copy link
Contributor

@aromanenko-dev aromanenko-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it LGTM in general, I just left one minor question.

Copy link
Contributor

@aromanenko-dev aromanenko-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @sgraca for working on this.

@aromanenko-dev aromanenko-dev merged commit 8fdc9ce into apache:master May 11, 2020
@aromanenko-dev aromanenko-dev changed the title [BEAM-9439] Return split instead of total backlog size [BEAM-9439] KinesisIOreturn split instead of total backlog size in KinesisIO May 11, 2020
@aromanenko-dev aromanenko-dev changed the title [BEAM-9439] KinesisIOreturn split instead of total backlog size in KinesisIO [BEAM-9439] Return split instead of total backlog size in KinesisIO May 11, 2020
@sgraca
Copy link
Contributor Author

sgraca commented May 11, 2020

@aromanenko-dev Thank you for the review and the merge.

@sgraca sgraca deleted the BEAM-9439-fix-backlog-estimate branch May 11, 2020 18:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants