Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LINQ] Add timestamp join #215

Merged
merged 9 commits into from Mar 26, 2022

Conversation

pschafhalter
Copy link
Member

Introduces the TimestampJoin, which joins messages with matching timestamps from two different streams by performing a cartesian product. I've tested the operator locally, and it seems to work.

A few things I wasn't sure about that can hopefully be addressed in review:

  1. Consistent naming: TimestampJoin seems more ergonomic, but this should be named TimestampJoinOperator to stay consistent with the other operators.
  2. Example: I wasn't sure how to best add the timestamp join to the existing LINQ and full pipeline examples. These examples may also need to be revamped.

Copy link
Collaborator

@sukritkalra sukritkalra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I'm in favor of renaming it to TimestampJoinOperator to remain consistent with the rest of the operators. Maybe at some point we could have a generic JoinOperator whose joining behavior (i.e., window, tumbling, timestamp etc.) could be chosen through an enum.
  2. We have a JoinSumOperator in the full_pipeline example that could be replaced with a TimestampJoinOperator followed by a MapOperator. (I believe Flink does this using an apply method, which we could also provide in the LINQ API and just use a MapOperator underneath)

erdos/src/dataflow/operators/join.rs Outdated Show resolved Hide resolved
@pschafhalter
Copy link
Member Author

I revamped the LINQ example with the timestamp join.

Do we also want to include it in the full pipeline example? If so, can you let me know the purpose of the full pipeline example? Is the intent to mirror the LINQ example using the connect_* functions, or is it trying to demonstrate operator implementation as well? I notice there are several unused and redundant operators (e.g. SquareOperator could be replaced by a FlatMapOperator).

@pschafhalter pschafhalter changed the title [LINQ] Add TimestampJoin operator [LINQ] Add timestamp join Mar 26, 2022
Copy link
Collaborator

@sukritkalra sukritkalra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the intent of the full_pipeline example was to show the operator construction and connection API, which is simplified by the LINQ API since they are all data transformation operators. Maybe we can leave it untouched for now, since we're planning to replace that with a Perception example anyways.

Also, what is the behavior of a TimestampJoinOperator when you get no inputs from one of the streams for a timestamp?

@pschafhalter
Copy link
Member Author

Messages are dropped when there are no inputs from one of the streams for a given timestamp. The behavior is documented in the table above the TimestampJoinOperator.

Copy link
Collaborator

@sukritkalra sukritkalra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Can you add a comment on the connect methods about why we require them to take a state initialization just in case we forget again?

@pschafhalter pschafhalter merged commit 82d9a6a into erdos-project:redesign Mar 26, 2022
@pschafhalter pschafhalter deleted the timestamp-join branch March 26, 2022 19:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants