-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-9877] Estimate sizes of group-by-key values behind a key lazily only. #11598
Conversation
… such iterables can have enormous sizes and size estimates would cause reading the data multiple times. Furthermore, the size of the entire group-by-key collection is already known precisely at read time (and encoded into corresponding counters), which means the estimation is in fact not only expensive but redundant and unnecessary.
R: @kennknowles |
retest this please |
How will this impact PCollection size estimation shown in the Dataflow UI? |
In general, the Dataflow UI ought to use the exact counter GBK has for
effecting the sanity checking, if it doesn't do that already. However, I
believe the estimate in this case is for the PCollection output one step
down from the read from shuffle, at which point I don't know how useful any
estimation is at all when reading from a GBK (except for accounting the
amount of data read and possibly reduced / transformed).
…On Mon, May 4, 2020 at 8:38 AM Lukasz Cwik ***@***.***> wrote:
How will this impact PCollection size estimation shown in the Dataflow UI?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#11598 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABRMIED6PZHHSF32BM24HSLRP3OKTANCNFSM4MYNL2QA>
.
|
…bservableIterable so that size estimation is lazy
…bservableIterable so that size estimation is lazy
Updated the change to perform lazy estimation instead of side-stepping the estimation altogether. |
retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -165,12 +168,17 @@ public WindowReiterable( | |||
} | |||
|
|||
@Override | |||
public Reiterator<V> iterator() { | |||
public WindowReiterator<V> iterator() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's odd that ElementByteSizeObservableIterable::iterator
adds observers within the method body. I assume this is for historic reasons, since it doesn't seem to do anything now, and the comment documenting references a setObserver
method that doesn't exist. Anyway, your change looks fine. But we should consider cleaning this up.
Lines 49 to 61 in 6453e85
/** | |
* Returns a new iterator for this iterable. If an observer was set in a previous call to | |
* setObserver(), it will observe the iterator returned. | |
*/ | |
@Override | |
public InputT iterator() { | |
InputT iterator = createIterator(); | |
for (Observer observer : observers) { | |
iterator.addObserver(observer); | |
} | |
observers.clear(); | |
return iterator; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed BEAM-9878 as a wish item.
LGTM |
Perform lazy estimation only for such iterables since they can have enormous sizes and size estimates would cause reading the data multiple times.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.