[BEAM-7386] Introduce EventTimeBoundedEquijoin. #12915

Closed
wants to merge 3 commits into from

tysonjh
Contributor

@tysonjh tysonjh commented Sep 23, 2020

Similar to other inner joins except it includes a temporal predicate,
allowing users to join unbounded PCollections in the GlobalWindow.



@tysonjh
Contributor Author

tysonjh commented Sep 23, 2020

R: @reuvenlax

@tysonjh
Contributor Author

tysonjh commented Sep 23, 2020

/cc @kennknowles

@tysonjh tysonjh force-pushed the bitemporaljoin-noremove branch 2 times, most recently from 037894b to 8a163ad Compare September 23, 2020 18:52
@reuvenlax
Contributor

Out of curiosity, why are you adding this here instead of the schema join library (which SQL uses)?

@tysonjh
Contributor Author

tysonjh commented Sep 23, 2020

> Out of curiosity, why are you adding this here instead of the schema join library (which SQL uses)?

I wasn't aware of the other join library. I saw the join extension library implementations, plus the previously closed PR in BEAM-7386, and thought this new one should be placed nearby, assuming that SQL would reuse the implementation. Looking at it now, though, it seems the SQL joins don't use the join extension library.

Should I keep this one around or refactor into the SQL schema join library?

@tysonjh
Contributor Author

tysonjh commented Oct 6, 2020

> > Out of curiosity, why are you adding this here instead of the schema join library (which SQL uses)?
>
> I wasn't aware of the other join library. I saw the join extension library implementations, plus the previously closed PR in BEAM-7386, and thought this new one should be placed nearby, assuming that SQL would reuse the implementation. Looking at it now, though, it seems the SQL joins don't use the join extension library.
>
> Should I keep this one around or refactor into the SQL schema join library?

After looking at the SQL schema join library I think it would be useful to keep this join in join-extension so it can be used with non-schema'd PCollections and support more than equijoins. The schema join should be able to reuse this implementation in the future by refactoring the Join#expand method. Maybe at that point we would discuss inducting the join-extension into core.

@tysonjh tysonjh closed this Oct 6, 2020
@tysonjh tysonjh reopened this Oct 6, 2020
@tysonjh
Contributor Author

tysonjh commented Oct 6, 2020

Oops - didn't realize that 'Close with Comment' was for the whole PR. I thought it was just for the comment thread.

@kennknowles kennknowles self-requested a review October 21, 2020 17:53
@reuvenlax
Contributor

Sorry for the delay.

AFAIK both this and the schema library are limited today to equijoins. The schema API is designed so that we can extend it later with non-equijoins; however, doing arbitrary join conditions in a distributed manner can be a hard problem.

@tysonjh
Contributor Author

tysonjh commented Oct 30, 2020

> Sorry for the delay.
>
> AFAIK both this and the schema library are limited today to equijoins. The schema API is designed so that we can extend it later with non-equijoins; however, doing arbitrary join conditions in a distributed manner can be a hard problem.

This implementation allows for simple comparisons between records for the join beyond an equijoin by allowing the user to provide a SimpleFunction<KV<V1, V2>, Boolean> compareFn. Do you see an issue with this?

@reuvenlax
Contributor

I am a bit confused about the usage of compareFn here. State is per key, so I believe that your DoFn will only join items that have the same key - the compareFn will never even get to compare items with different keys. Is the idea to allow the user to generate a subset of an equijoin?

@tysonjh
Contributor Author

tysonjh commented Nov 2, 2020

> I am a bit confused about the usage of compareFn here. State is per key, so I believe that your DoFn will only join items that have the same key - the compareFn will never even get to compare items with different keys. Is the idea to allow the user to generate a subset of an equijoin?

Yes, it will be a subset of an equijoin. Sorry for the confusion.

Contributor

@reuvenlax reuvenlax left a comment

Sorry for the long delay!

@kennknowles
Member

To be clear, this join is a temporal join. You have to have a condition relating the timestamps of the two elements that can translate into a garbage collection threshold.

@kennknowles
Member

(the garbage collection can be looser than the actual comparison)
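To make the garbage-collection point concrete, here is a minimal in-memory sketch in plain Java. It is not the DoFn from this PR and uses no Beam API: the class `BoundedEquijoinSketch`, its methods, and the use of `long` millisecond timestamps are all assumptions made for illustration. Buffers are kept per key, the join predicate is `|leftTs - rightTs| <= bound`, and eviction uses a separate `gcHorizon` that may be looser than `bound`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of an event-time bounded equijoin; not the PR's implementation. */
final class BoundedEquijoinSketch {
  /** A buffered value with its event timestamp in milliseconds (assumed representation). */
  static final class Timestamped {
    final String value;
    final long ts;

    Timestamped(String value, long ts) {
      this.value = value;
      this.ts = ts;
    }
  }

  private final long bound;      // join predicate: |leftTs - rightTs| <= bound
  private final long gcHorizon;  // eviction threshold; may be looser than bound
  private final Map<String, List<Timestamped>> left = new HashMap<>();
  private final Map<String, List<Timestamped>> right = new HashMap<>();

  BoundedEquijoinSketch(long bound, long gcHorizon) {
    if (gcHorizon < bound) {
      throw new IllegalArgumentException("gcHorizon must be >= bound");
    }
    this.bound = bound;
    this.gcHorizon = gcHorizon;
  }

  List<String> addLeft(String key, String value, long ts) {
    return add(left, right, key, value, ts, true);
  }

  List<String> addRight(String key, String value, long ts) {
    return add(right, left, key, value, ts, false);
  }

  /** Buffers the element, then matches it against the other side's buffer for the same key only. */
  private List<String> add(
      Map<String, List<Timestamped>> own,
      Map<String, List<Timestamped>> other,
      String key, String value, long ts, boolean isLeft) {
    own.computeIfAbsent(key, k -> new ArrayList<>()).add(new Timestamped(value, ts));
    List<String> out = new ArrayList<>();
    for (Timestamped t : other.getOrDefault(key, Collections.<Timestamped>emptyList())) {
      if (Math.abs(ts - t.ts) <= bound) {
        out.add(isLeft ? value + "|" + t.value : t.value + "|" + value);
      }
    }
    return out;
  }

  /** Evicts every buffered element whose timestamp plus gcHorizon is behind the watermark. */
  void advanceWatermark(long watermark) {
    for (Map<String, List<Timestamped>> side : Arrays.asList(left, right)) {
      for (List<Timestamped> buffer : side.values()) {
        buffer.removeIf(t -> t.ts + gcHorizon < watermark);
      }
    }
  }

  /** Total number of elements still held in both buffers. */
  int buffered() {
    int n = 0;
    for (Map<String, List<Timestamped>> side : Arrays.asList(left, right)) {
      for (List<Timestamped> buffer : side.values()) {
        n += buffer.size();
      }
    }
    return n;
  }
}
```

Because the buffers are keyed, elements with different keys never meet, which is why any extra comparison can only narrow an equijoin. A looser `gcHorizon` only keeps state around longer; it does not change which pairs match.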

@reuvenlax
Contributor

reuvenlax commented Nov 2, 2020 via email

@kennknowles
Member

I think there is a hidden innovation here when you refer to AS OF joins. In standard SQL this refers to the past state of a table as it is mutated in processing time. It is primarily useful for inspecting the evolution of a table.

To yield correct results for a quotes/trades pipeline it must refer to event time. If all events from quotes and trades are ordered (and jointly ordered) then AS OF can answer the query. Otherwise it cannot. This is true for standard SQL databases: if a quote is inserted that contradicts a prior result of the quote/trade match, then the ordering used by AS OF cannot be used to determine the price for a trade.

I think the re-interpretation of AS OF to refer to event time is a good change. I know that Flink has done this same thing. I think having a transform that can correctly address the quote/trade problem is also good. In documentation and API just be very careful to make sure it is clear. We already have a lot of users / StackOverflow questions talking about "before" and "after" and "as of" in terms of processing time, mixing it up with event time.
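As a rough illustration of the event-time reading of AS OF, the sketch below looks up the latest quote whose event time is at or before a trade's event time. It is plain Java with hypothetical names (`AsOfSketch`, `addQuote`, `priceAsOf`) and `long` timestamps, not a Beam or SQL API. It also shows why ordering matters: a quote that arrives late but carries an earlier event time changes the answer for a trade that was already matched.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of an event-time AS OF lookup for a single symbol. */
final class AsOfSketch {
  // Quotes ordered by event time, regardless of the order in which they arrived.
  private final TreeMap<Long, Double> quotesByEventTime = new TreeMap<>();

  void addQuote(long eventTime, double price) {
    quotesByEventTime.put(eventTime, price);
  }

  /** Returns the price of the latest quote with event time <= tradeTime, or null if none. */
  Double priceAsOf(long tradeTime) {
    Map.Entry<Long, Double> entry = quotesByEventTime.floorEntry(tradeTime);
    return entry == null ? null : entry.getValue();
  }
}
```

For example, with quotes at event times 100 and 200, a trade at 150 resolves to the quote at 100. If a quote with event time 140 then arrives out of order, the same trade resolves to that quote instead, so the earlier match is only final once no more quotes at or before 150 can arrive.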

@kennknowles
Member

We could call this one "timestamp-bounded equijoin" or some such.

@reuvenlax
Contributor

reuvenlax commented Nov 2, 2020 via email

@tysonjh
Contributor Author

tysonjh commented Nov 3, 2020

> > I am a bit confused about the usage of compareFn here. State is per key, so I believe that your DoFn will only join items that have the same key - the compareFn will never even get to compare items with different keys. Is the idea to allow the user to generate a subset of an equijoin?
>
> Yes, it will be a subset of an equijoin. Sorry for the confusion.

Now that I'm thinking about this further, the compareFn may be unnecessarily complicating the API for this join. I imagined it would be helpful for a user who wants to add logic before emitting a matched result, like a filter, but it would be more idiomatic for the user to apply a filter transform to the join result instead.
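A small sketch of why the two approaches are interchangeable, using plain Java collections rather than Beam transforms. The class and method names here are hypothetical. In a pipeline, the second step would be a filter transform applied to the join output.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.stream.Collectors;

/** Hypothetical sketch: a predicate inside the join versus a filter applied afterwards. */
final class FilterAfterJoinSketch {
  /** Pairs every left value with every right value for one key, keeping pairs that pass compareFn. */
  static List<Map.Entry<Integer, Integer>> join(
      List<Integer> left, List<Integer> right, BiPredicate<Integer, Integer> compareFn) {
    List<Map.Entry<Integer, Integer>> out = new ArrayList<>();
    for (Integer l : left) {
      for (Integer r : right) {
        if (compareFn.test(l, r)) {
          out.add(new SimpleEntry<>(l, r));
        }
      }
    }
    return out;
  }

  /** Applies the same predicate to an already joined result. */
  static List<Map.Entry<Integer, Integer>> filterAfter(
      List<Map.Entry<Integer, Integer>> joined, BiPredicate<Integer, Integer> predicate) {
    return joined.stream()
        .filter(e -> predicate.test(e.getKey(), e.getValue()))
        .collect(Collectors.toList());
  }
}
```

Joining with an always-true predicate and then calling `filterAfter` yields the same pairs as passing the predicate into `join`, so dropping the in-join hook loses no expressiveness for per-pair conditions.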

@reuvenlax
Contributor

reuvenlax commented Nov 3, 2020 via email

@tysonjh
Contributor Author

tysonjh commented Nov 3, 2020

> We could call this one "timestamp-bounded equijoin" or some such.

Yeah, this is a tough one to name; more input is welcome. I floated the following: EventTimeLimitedDurationInnerJoin, EventTimeScopedDurationInnerJoin

@reuvenlax
Contributor

FYI the actual class name can be a bit longer, as long as there is a good builder method. For example, you could do something like:

Join.boundedInnerJoin(pc1, pc2);

This would be easier to deal with if this contrib Join library used PTransforms instead of functions.

Similar to other inner joins except it includes a temporal predicate,
allowing users to join unbounded PCollection<KV>s in the GlobalWindow.
@tysonjh tysonjh force-pushed the bitemporaljoin-noremove branch 2 times, most recently from 8c0e05b to b074975 Compare November 20, 2020 05:08
Contributor Author

@tysonjh tysonjh left a comment

Thanks for the review! I had some questions, resolved some of your comments, PTAL.

V1 left = e.getValue().getKey();
V2 right = e.getValue().getValue();
if (left != null) {
  leftState.add(TimestampedValue.of(left, timestamp));
Contributor Author

Could you elaborate a bit please? I don't understand why we want the hold, or what it accomplishes. The docs are a bit tricky to follow regarding this.

if (left != null) {
  leftState.add(TimestampedValue.of(left, timestamp));
  rightState
      .readRange(timestamp.minus(temporalBound), timestamp.plus(temporalBound))
Contributor Author

Will the timer family reduce the worst case? The O(n^2) comes from searching through the state on each input. Won't that still be required to find the 'joined elements' to output in the timer?
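For reference on the cost question, here is a minimal sketch of a range read over a timestamp-ordered buffer, written in plain Java with a `TreeMap`. The names (`RangeReadSketch`, `add`, `readRange`) and the inclusive bounds are assumptions for illustration; they are not taken from the state API used in the diff. With an ordered buffer each lookup costs O(log n + k) for k matches rather than a scan over all n elements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch of a timestamp-ordered buffer with a bounded range read. */
final class RangeReadSketch {
  // Values grouped by event timestamp (milliseconds), kept in ascending timestamp order.
  private final TreeMap<Long, List<String>> byTimestamp = new TreeMap<>();

  void add(String value, long ts) {
    byTimestamp.computeIfAbsent(ts, k -> new ArrayList<>()).add(value);
  }

  /** Returns values with timestamps in [t - bound, t + bound], in ascending timestamp order. */
  List<String> readRange(long t, long bound) {
    List<String> out = new ArrayList<>();
    for (List<String> values : byTimestamp.subMap(t - bound, true, t + bound, true).values()) {
      out.addAll(values);
    }
    return out;
  }
}
```

The worst case stays quadratic when every buffered element falls inside the bound, since then k equals n for each input and the join output itself has that many pairs. The ordered read only helps when the bound selects a small slice of the buffer.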

@tysonjh tysonjh changed the title [BEAM-7386] Introduce temporal inner join. [BEAM-7386] Introduce EventTimeBoundedEquijoin. Nov 20, 2020
@tysonjh tysonjh force-pushed the bitemporaljoin-noremove branch 3 times, most recently from d3ba9ec to b5198b2 Compare November 20, 2020 05:44
Tyson Hamilton added 2 commits December 2, 2020 16:38
Refactor name of classes and methods, remove the compareFn,
fix eviction bug, convert boolean to state, and other smaller changes.
@tysonjh
Contributor Author

tysonjh commented Aug 5, 2021

Obsolete.

@tysonjh tysonjh closed this Aug 5, 2021

3 participants