[BEAM-7972] Always use Global window in reshuffle and then apply wind…#9334
angoenka merged 1 commit into apache:master
e003e25 to 47caba9
R: @lukecwik
LGTM.
robertwb
left a comment
This is just needed to work around a Dataflow JRH bug, right?
  key, value = element
- return key, TimestampedValue(value, timestamp)
+ # Transport the window as part of the value and restore it later.
+ return key, TimestampedValue((value, window), timestamp)
Any reason not to use a WindowedValue here?
Not really. I will make it windowed values.
Actually, while making the change I realized that the timestamp would be duplicated when using windowed values, without any benefit. Hence I am dropping the use of WindowedValue.
Why would the timestamp be duplicated? The (window, timestamp, value) tuple seems best represented by a WindowedValue. (I would be OK with a plain old 3-tuple as well, but only going half way seems odd).
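To make the trade-off in this thread concrete, here is a minimal sketch comparing the two encodings. `SketchTimestampedValue` and `SketchWindowedValue` are hypothetical stand-ins written for illustration, not the Beam classes, and windows are represented as plain strings.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class SketchTimestampedValue:
    """Hypothetical stand-in: a value plus its timestamp."""
    value: Any
    timestamp: float


@dataclass(frozen=True)
class SketchWindowedValue:
    """Hypothetical stand-in: value, timestamp, and windows in one record."""
    value: Any
    timestamp: float
    windows: Tuple[str, ...]


def reify_half_way(element, timestamp, window):
    # First revision: window tucked into the value, timestamp kept outside.
    key, value = element
    return key, SketchTimestampedValue((value, window), timestamp)


def reify_windowed(element, timestamp, window):
    # Reviewer's suggestion: one record that stores each of the three fields once.
    key, value = element
    return key, SketchWindowedValue(value, timestamp, (window,))


half_way = reify_half_way(("k", 1), 10.0, "w[0,60)")
windowed = reify_windowed(("k", 1), 10.0, "w[0,60)")
```

In this sketch neither form stores the timestamp twice; the second simply avoids nesting the window inside the value.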
The defaulting to global window should be deleted since the Python SDK now does send a proper windowing strategy (same as Go SDK). The code was added as a migration path to allow for differences in where the Python/Go/Java SDKs were when submitting jobs to Dataflow.
So we should update the reshuffle code to not pass the non-standard window from Python.
We shouldn't have to, but if the alternative is significant JRH refactoring, then this code should be OK and we can add a comment that we're working around bugs in the Dataflow JRH.
From a quick look, JRH refactoring seems significant.
I would like to keep it simple for now and will continue with the changes in the Reshuffle transform, with an additional comment.
> Why would the timestamp be duplicated? The (window, timestamp, value) tuple seems best represented by a WindowedValue. (I would be OK with a plain old 3-tuple as well, but only going half way seems odd).
Ohh, I see what you mean. Updated the code.
angoenka
left a comment
I think this will be needed irrespective of the JRH bug as we don't want to introduce a new windowing function which should be interpreted by the runner.
Here is the code that defaults to the global window when it's not able to deserialize the windowing strategy: https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/GroupAlsoByWindowParDoFnFactory.java#L106
Gentle reminder for the review.
OK, yuck, that JRH code is really bad (and possibly buggy). The windowing function need not be interpreted by the runner; it should just note that it's non-merging and pass things through.
That piece of JRH seems to be very intricate, and I don't think it has an easy fix without a lot of refactoring.
47caba9 to ec92139
Request for another pass for the review.
  return [
      windowed_value.WindowedValue(
-         (key, value.value), value.timestamp, [window])
+         (key, value.value), value.timestamp, value.windows)
You could do
key, windowed_values = element
return [wv.with_value((key, wv.value)) for wv in windowed_values]
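As a rough illustration of why the suggested one-liner is tidier, the sketch below uses a hypothetical `SketchWindowedValue` stand-in (not the Beam class) with a `with_value` method that copies the record and swaps only the value. The timestamp and windows then carry over without being spelled out again.

```python
from dataclasses import dataclass, replace
from typing import Any, Tuple


@dataclass(frozen=True)
class SketchWindowedValue:
    """Hypothetical stand-in for a windowed value; windows are plain strings."""
    value: Any
    timestamp: float
    windows: Tuple[str, ...]

    def with_value(self, new_value):
        # Return a copy with only the value replaced.
        return replace(self, value=new_value)


def restore_timestamps(element):
    # After grouping, re-attach the key to every value while keeping
    # each record's original timestamp and windows.
    key, windowed_values = element
    return [wv.with_value((key, wv.value)) for wv in windowed_values]


grouped = ("k", [SketchWindowedValue(1, 10.0, ("w[0,60)",)),
                 SketchWindowedValue(2, 70.0, ("w[60,120)",))])
restored = restore_timestamps(grouped)
```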
  ungrouped = pcoll | Map(reify_timestamps)
+
+ # TODO(BEAM-8104) Using global window as one of the standard window.
+ # This is to mitigate the Java Runner Harness limitation to
s/Java Runner Harness/Dataflow Java Runner Harness/
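The overall idea of the change, as described in the PR title, can be sketched without Beam at all. The snippet below is a pure-Python simulation under stated assumptions: `reshuffle_sketch` and `GLOBAL_WINDOW` are hypothetical names, windows are strings, and grouping is done with a dictionary rather than a real shuffle.

```python
from collections import defaultdict

GLOBAL_WINDOW = "GlobalWindow"  # placeholder label, not a Beam object


def reshuffle_sketch(elements):
    """elements: iterable of (key, value, timestamp, window) tuples."""
    # 1. Reify: fold timestamp and window into the payload and place
    #    every element in the single global window.
    reified = [(key, (value, ts, window), GLOBAL_WINDOW)
               for key, value, ts, window in elements]
    # 2. Group by key; the grouping step only ever sees the global window.
    groups = defaultdict(list)
    for key, payload, _window in reified:
        groups[key].append(payload)
    # 3. Restore: re-emit each value with its original timestamp and window.
    return [(key, value, ts, window)
            for key, payloads in groups.items()
            for value, ts, window in payloads]


inputs = [("a", 1, 10.0, "w[0,60)"),
          ("b", 2, 20.0, "w[0,60)"),
          ("a", 3, 70.0, "w[60,120)")]
outputs = reshuffle_sketch(inputs)
```

The point of the design is that the grouping step never needs to understand the original windowing; it is carried as data and reapplied afterwards.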
ec92139 to f22231c
LGTM
…On Fri, Aug 30, 2019, 3:37 PM Ankur ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In sdks/python/apache_beam/transforms/util.py
<#9334 (comment)>:
> for value in values]
ungrouped = pcoll | Map(reify_timestamps)
+
+ # TODO(BEAM-8104) Using global window as one of the standard window.
+ # This is to mitigate the Java Runner Harness limitation to
done
Thanks!
…ow again.