-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-7386] Adding EventTimeEquiJoin #15275
Conversation
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/join/EventTimeEquiJoin.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/join/EventTimeEquiJoin.java
Outdated
Show resolved
Hide resolved
* </pre> | ||
*/ | ||
@AutoValue | ||
public abstract class EventTimeEquiJoin<K, V1, V2> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would like to think on the name a bit more, as EventTimeEquiJoin seems a bit awkward to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also wonder about whether this PTransform should be in terms of KVs or not. However I'm starting to think that it should be and then we have the option to wrap it in a higher-level Join transform
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked up what Flink and Spark call them. Spark doesn't really have a name. Flink's joins seem to be different (processing on a table at a point of time instead of things being equivalent). Tyson's suggestion in the doc was EventTimeBoundedEquiJoin. Not sure I have any other suggestions. Maybe we could shorten it and move some to the function: E.g. EventTimeBoundedJoin.innerEquiOf() or something like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some other options that were suggested previously:
- TimestampBoundedEquijoin
- EventTimeLimitedDurationEquiJoin
- EventTimeScopedDurationInnerJoin
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/join/EventTimeEquiJoin.java
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/join/EventTimeEquiJoin.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/join/Pair.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/join/EventTimeEquiJoin.java
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/join/EventTimeEquiJoin.java
Outdated
Show resolved
Hide resolved
Instant nextBucketStart = | ||
Instant.ofEpochMilli( | ||
cleanupTime.getMillis() / TIMER_BUCKET * TIMER_BUCKET + TIMER_BUCKET); | ||
cleanupTimers.get(Long.toString(nextBucketStart.getMillis())).set(nextBucketStart); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally we should call withOutputTimestamp to hold the watermark back to the earliest buffered element in that timer's range. Doing this probably requires keeping a histogram in a ValueState tracking the min timestamp per minute bucket.
Let me know when you're ready for another review round! |
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/join/EventTimeEquiJoin.java
Outdated
Show resolved
Hide resolved
@reuvenlax @kennknowles I implemented changes as if we were to remove the checks we discussed offline (one I avoided with the skew and the other by removing it). Please take a look, thanks! |
ec252da
to
7a8ba6d
Compare
Ping. :) @reuvenlax @kennknowles |
@kennknowles @reuvenlax - Could you please take a look at this? |
a17efd4
to
9c8eca7
Compare
@kennknowles @reuvenlax - Could you please take a look at this? @laraschmidt - Could you look at the failing precommit test or re-run it? |
Run Java PreCommit |
Run Java_Examples_Dataflow PreCommit |
1 similar comment
Run Java_Examples_Dataflow PreCommit |
Run Java PreCommit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
@Override | ||
public void encode(Pair<V1, V2> value, OutputStream outStream) | ||
throws CoderException, IOException { | ||
firstCoder.encode(value.getFirst(), outStream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we deprecated the coder "context" idea? Or do you just not want to apply it here? I would expect PairCoder to be identical to KvCoder, anyhow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not familiar with the deprecated context thing but I copied it over from KVCoder. What exactly is it used for? I assume that in my case I want context on both (whereas the KV coder only put it on the value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm, having both failed so I swapped it to just be on the second element.
Instant ts, | ||
OrderedListState<ThisT> thisCollection, | ||
OrderedListState<OtherT> otherCollection, | ||
ValueState<Instant> oldestTimestampState, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see how having a specific timestamp queue state would make it easier to have all the methods we need in a single "state type". Interestingly enough, the original design for state (before all the annotations) made it pretty easy to define composite types of state, whereas now we really cannot.
* @param <V2> the type of the value in the second {@code PCollection} | ||
* </pre> | ||
*/ | ||
@AutoValue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Experimental
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
be5bf5c
to
08fc756
Compare
Addressed comments, PTAL. :) Thanks! @kennknowles |
@kennknowles - Could you please take a look? |
08fc756
to
ddaf28b
Compare
Ignore that. I saw the recent changes. |
I think it's still relevant. It was blocked on Lara's other PR.
…On Wed, Dec 29, 2021 at 10:23 AM Ahmet Altay ***@***.***> wrote:
Is this still relevant? Should we close it?
—
Reply to this email directly, view it on GitHub
<#15275 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AFAYJVLMOFDSNGLDW7JIHBTUTNGZDANCNFSM5BPZWL6Q>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@laraschmidt - how is your other PR doing? :) |
The other one is in but it does not work for portable runner because making it work for protable runner breaks dataflow on portable runner. :) But the changes can actually be treated as pretty independent. That one was making the function we use here not as deprecated but we can still use it in this PR even if deprecated. So this is still ready for review. |
But you needed to make that function apply to timers as well, right?
…On Fri, Jan 7, 2022 at 9:10 AM laraschmidt ***@***.***> wrote:
The other one is in but it does not work for portable runner because
making it work for protable runner breaks dataflow on portable runner. :)
But the changes can actually be treated as pretty independent. That one
was making the function we use here not as deprecated but we can still use
it in this PR even if deprecated. So this is still ready for review.
—
Reply to this email directly, view it on GitHub
<#15275 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AFAYJVPVQDONYV2V4STERPTUU4NB5ANCNFSM5BPZWL6Q>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
That was for an earlier iteration. The solution we ended up going with only outputs from processElement and only uses timers to hold the watermark. So we didn't end up actually needing to allow older elements from timers because we accepted that we will allow older elements from processElement function. |
Right, but you need to be able to set a timer with a hold older than the
current input element, right?
…On Fri, Jan 7, 2022 at 10:23 AM laraschmidt ***@***.***> wrote:
That was for an earlier iteration. The solution we ended up going with
only outputs from the actual doFn and only uses timers to hold the
watermark. So we didn't end up actually needing to allow older elements
from timers because we accepted that we will allow older elements from the
actual DoFn.
—
Reply to this email directly, view it on GitHub
<#15275 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AFAYJVIA3SCSRCB5NQQW47TUU4VSVANCNFSM5BPZWL6Q>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
I'm trying to page this back in. But if I remember correctly, the first element holds the watermark back until all not-late second elements would have appeared. So we don't actually need to hold it backwards. It's just the processElement that puts out older elements. Here's the relevant code for the timer which are not negative: |
I need to take a look at the code, since it's changed a lot. However
elements don't arrive in timestamp order, so I'm not sure how you
would avoid this.
…On Fri, Jan 7, 2022 at 10:35 AM laraschmidt ***@***.***> wrote:
I'm trying to page this back in. But if I remember correctly, the first
element holds the watermark back until all not-late second elements would
have appeared. So we don't actually need to hold it backwards. It's just
the processElement that puts out older elements.
Here's the relevant code for the timer which are not negative:
if (elementAffectsWatermarkHolds) {
addTimer(watermarkHolds,
ts.plus(thisCollectionValidFor)).withOutputTimestamp(ts);
—
Reply to this email directly, view it on GitHub
<#15275 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AFAYJVOZTULKXCPAB2WT3VTUU4W7ZANCNFSM5BPZWL6Q>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
We use the same watermark hold for a bucket of elements (E.g. arriving between t and t+e). And we just hold it back til t+e. So it shouldn't matter that they come out of order from what I can tell. |
What is the next step on this PR? - (I will stop pinging. Let me know if I can do anything to help.) |
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions. |
This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Adding EventTimeEquiJoin which is a PTransform that joins two streams where the comparison is time-stamp bounded.
@reuvenlax @kennknowles
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
ValidatesRunner
compliance status (on master branch)Examples testing status on various runners
Post-Commit SDK/Transform Integration Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.