This repository has been archived by the owner on May 12, 2021. It is now read-only.

APEXMALHAR-2085: Operator supporting the Beam concepts of windowing, watermarks, triggering and accumulation #319

Merged
merged 1 commit into from Jul 12, 2016

Conversation

davidyan74
Contributor

@davidyan74 davidyan74 commented Jun 15, 2016

This review-only PR contains the interfaces and a first rough draft of the implementation of the operator that supports the concepts of windowing, watermarks, triggering and accumulation mode specified by the Apache Beam API.

When you review this PR, please:

  • note the TODO comments
  • look at how the pi and wordcount samples work
  • try running the pi and wordcount samples (there is a main method that you can run in your IDE)

Thank you.

@davidyan74 davidyan74 force-pushed the windowedOperator branch 6 times, most recently from 348a570 to 7fb7543 Compare June 18, 2016 00:07
*
*/
@InterfaceStability.Evolving
public interface Accumulation<InputT, AccumT, OutputT>
Contributor

Might be helpful to add an example where AccumT is different from OutputT

Contributor Author

done
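
To illustrate the request above, here is a minimal sketch of an accumulation whose accumulated type differs from its output type: a running average that accumulates a (sum, count) pair and outputs a single number. `SimpleAccumulation`, its method names, `SumAndCount`, and `AverageAccumulation` are hypothetical stand-ins written for this example; they are not claimed to match the interface under review.

```java
// Hypothetical, simplified stand-in for the Accumulation interface under review.
// The method names are assumptions made for this sketch.
interface SimpleAccumulation<InputT, AccumT, OutputT> {
  AccumT defaultAccumulatedValue();

  AccumT accumulate(AccumT accumulated, InputT input);

  AccumT merge(AccumT left, AccumT right);

  OutputT getOutput(AccumT accumulated);
}

// Mutable accumulated type: the running sum and the number of inputs seen.
final class SumAndCount {
  double sum;
  long count;
}

// AccumT is SumAndCount, OutputT is Double: the average can only be computed
// at output time, so the two types must differ.
final class AverageAccumulation implements SimpleAccumulation<Double, SumAndCount, Double> {
  @Override
  public SumAndCount defaultAccumulatedValue() {
    return new SumAndCount();
  }

  @Override
  public SumAndCount accumulate(SumAndCount accumulated, Double input) {
    accumulated.sum += input;
    accumulated.count++;
    return accumulated;
  }

  @Override
  public SumAndCount merge(SumAndCount left, SumAndCount right) {
    // Fold the right-hand state into the left-hand one.
    left.sum += right.sum;
    left.count += right.count;
    return left;
  }

  @Override
  public Double getOutput(SumAndCount accumulated) {
    // An empty window yields 0.0 here; that choice is arbitrary for the sketch.
    return accumulated.count == 0 ? 0.0 : accumulated.sum / accumulated.count;
  }
}
```

Keeping the sum and count separate is what makes `merge` possible: two averages cannot be combined correctly without knowing how many inputs each one covers.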

@siyuanh
Contributor

siyuanh commented Jun 20, 2016

@davidyan74 Your change causes some compile errors in the Stream API and in the classes that depend on it. Shall I resolve those issues?

@davidyan74
Contributor Author

@siyuanh Yes, please do so. Also review my changes and see if you're okay with them.

@davidyan74
Contributor Author

@siyuanh Note that I removed the support for count-based windows (for now), because the concepts of watermarks, timestamps, and early/late triggers don't apply any more. Let me know if you want to discuss this further.

@davidyan74 davidyan74 force-pushed the windowedOperator branch 6 times, most recently from 7129580 to b1c6b0c Compare June 22, 2016 22:25
@@ -94,6 +94,18 @@
<artifactId>cglib</artifactId>
<version>3.2.1</version>
</dependency>
<dependency>
<!-- required by twitter demo -->
Contributor

Why this dependency?

@davidyan74 davidyan74 force-pushed the windowedOperator branch 3 times, most recently from 20fc1d3 to de84a96 Compare June 23, 2016 21:30
* This interface describes the individual window.
*/
@InterfaceStability.Evolving
public interface Window
Contributor

I think Window could provide a single method, isWithinWindow(long time). GlobalWindow can then simply return true. In SessionWindow, instead of using a static method to merge windows, we can just call extendWindow(). The current Window interface seems to assume a fixed boundary (it could be the interface of FixedWindow), but neither GlobalWindow nor SessionWindow has a fixed boundary.

Contributor Author

Session window can be extended, and two session windows can be merged into one.
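
The two operations discussed in this thread, extending a session window and merging two session windows, can be sketched as follows. This is only an illustration under stated assumptions: `SessionWindowSketch`, its fields, and its methods are hypothetical names, and the half-open interval convention and the gap handling are choices made for the example, not taken from the PR.

```java
// Hypothetical session window modeled as a half-open interval [beginMillis, endMillis).
final class SessionWindowSketch {
  long beginMillis; // inclusive
  long endMillis;   // exclusive

  SessionWindowSketch(long beginMillis, long endMillis) {
    this.beginMillis = beginMillis;
    this.endMillis = endMillis;
  }

  // The single membership check proposed in the review comment.
  boolean isWithinWindow(long timestampMillis) {
    return timestampMillis >= beginMillis && timestampMillis < endMillis;
  }

  // Grow the window so that it covers an event at the given time plus the session gap.
  void extend(long timestampMillis, long gapMillis) {
    beginMillis = Math.min(beginMillis, timestampMillis);
    endMillis = Math.max(endMillis, timestampMillis + gapMillis);
  }

  // Two sessions that touch or overlap belong to the same session.
  boolean overlaps(SessionWindowSketch other) {
    return beginMillis <= other.endMillis && other.beginMillis <= endMillis;
  }

  // Merging produces a new window spanning both inputs.
  static SessionWindowSketch merge(SessionWindowSketch a, SessionWindowSketch b) {
    return new SessionWindowSketch(
        Math.min(a.beginMillis, b.beginMillis),
        Math.max(a.endMillis, b.endMillis));
  }
}
```

Extending alone is not enough: once an extended window reaches a neighboring session, the two have to be merged into one, which is why both operations appear in the sketch.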

long horizon = currentWatermark - allowedLatenessMillis;
if (allowedLatenessMillis >= 0) {
  // purge window that are too late to accept any more input
  dataStorage.removeUpTo(horizon);
Contributor

Before removing from storage, should we check whether the triggers have fired and the data has been sent downstream for windows that are too late?

Contributor Author

Good point, will make the changes.
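
One way to address the concern raised in this thread is to emit the result of every window that falls behind the lateness horizon before dropping it from storage. The sketch below assumes windows keyed by their end timestamp, a `long` sum as the accumulated value, and an inclusive horizon; `LatePurgeSketch` and all of its members are hypothetical and do not reproduce the operator in this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical operator fragment: fires late windows before purging them.
final class LatePurgeSketch {
  // Window end timestamp (ms) -> accumulated sum for that window.
  private final NavigableMap<Long, Long> dataStorage = new TreeMap<>();
  // Stand-in for the output port: records what would be sent downstream.
  private final List<String> emitted = new ArrayList<>();
  private final long allowedLatenessMillis;

  LatePurgeSketch(long allowedLatenessMillis) {
    this.allowedLatenessMillis = allowedLatenessMillis;
  }

  void accumulate(long windowEndMillis, long value) {
    dataStorage.merge(windowEndMillis, value, Long::sum);
  }

  void processWatermark(long currentWatermark) {
    if (allowedLatenessMillis < 0) {
      return; // assumption: a negative lateness means windows are never purged
    }
    long horizon = currentWatermark - allowedLatenessMillis;
    // View of all windows ending at or before the horizon (inclusive by assumption).
    NavigableMap<Long, Long> tooLate = dataStorage.headMap(horizon, true);
    for (Map.Entry<Long, Long> entry : tooLate.entrySet()) {
      // Give each window a final chance to fire before it is dropped.
      emitted.add(entry.getKey() + "=" + entry.getValue());
    }
    // Clearing the view removes the same entries from dataStorage.
    tooLate.clear();
  }

  List<String> emitted() {
    return emitted;
  }

  int windowCount() {
    return dataStorage.size();
  }
}
```

A real operator would also need to skip windows whose latest state has already been emitted, so that the final firing does not duplicate output; the sketch omits that bookkeeping.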

*
* @param <T> The type of the data that is stored per window
*
* TODO: Look at the possibility of integrating spillable data structure: https://issues.apache.org/jira/browse/APEXMALHAR-2026
Contributor

This seems very similar to the managed state implementations. Can we also look at managed state directly before looking at Spillable data structures?

Contributor Author

@bhupeshchawda We will for sure look at ManagedState.

@davidyan74 davidyan74 changed the title from "APEXMALHAR-2085: REVIEW ONLY: Operator supporting the Beam concepts of windowing, watermarks, triggering and accumulation" to "APEXMALHAR-2085: Operator supporting the Beam concepts of windowing, watermarks, triggering and accumulation" Jul 8, 2016
@davidyan74
Contributor Author

@siyuanh please review and merge

@siyuanh
Contributor

siyuanh commented Jul 9, 2016

@chinmaykolhatkar @brightchen @tweise If you are all OK with the change, I will merge it.
@vrozov I think you make a good point about the Accumulation interface, but can we discuss and solve the problem in a separate JIRA? If so, could you please create a ticket for that? Thanks!

@tweise
Contributor

tweise commented Jul 9, 2016

I would prefer a bit more clarity on the storage part. We already have ManagedState and the in-memory implementation of the spillable data structures. Is there a reason why we don't use them here?

@chinmaykolhatkar
Contributor

@siyuanh I agree with the changes, and this is good to go as a first version. Further changes can follow afterwards.

@davidyan74
Contributor Author

@tweise We don't have an implementation of SpillableByteMap yet in master. Looks like this PR from @ilooner #324 has not provided that yet, but when we have an in-memory spillable map implementation, we will make another PR to change the current in-memory implementation of WindowedStorage. This should not be a blocker for merging this PR.

@ilooner
Contributor

ilooner commented Jul 11, 2016

@davidyan74 This PR may not be the place for it, but it would be good to have a discussion about which spillable data structures will be needed. To me it looks like your WindowStorage interfaces are basically array list multimaps. If that's the case, I would recommend you make a note on your Storage interfaces that they are temporary and will likely disappear very soon, since SpillableDatastructures already provides that abstraction.

As long as we do those things, I think it would be acceptable to pull this in and make the improvements in the next round of changes.
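
The observation that per-window storage behaves like an array list multimap can be sketched with a plain in-memory structure: each window key maps to a list of values. `InMemoryWindowedListStorage` and its methods are hypothetical names chosen for this illustration, windows are identified only by their end timestamp, and nothing here is meant to reproduce the WindowedStorage interfaces or the spillable data structures themselves.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory store: window end timestamp (ms) -> list of values,
// which is the shape of an array list multimap.
final class InMemoryWindowedListStorage<T> {
  private final TreeMap<Long, List<T>> valuesByWindowEnd = new TreeMap<>();

  // Append a value to the list kept for the given window.
  void put(long windowEndMillis, T value) {
    valuesByWindowEnd.computeIfAbsent(windowEndMillis, key -> new ArrayList<>()).add(value);
  }

  // Read-only view of a window's values; empty if the window is unknown.
  List<T> get(long windowEndMillis) {
    List<T> values = valuesByWindowEnd.get(windowEndMillis);
    return values == null ? Collections.<T>emptyList() : Collections.unmodifiableList(values);
  }

  // Drop every window ending at or before the horizon (inclusive by assumption).
  void removeUpTo(long horizonMillis) {
    valuesByWindowEnd.headMap(horizonMillis, true).clear();
  }

  int windowCount() {
    return valuesByWindowEnd.size();
  }
}
```

If the storage interface really reduces to these few operations, swapping the backing map for a spillable multimap later would mostly be a change of implementation rather than of callers.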

@ilooner
Contributor

ilooner commented Jul 11, 2016

@davidyan74 In fact, as a next step it may be a good exercise to validate the Spillable Data Structures interfaces by using the InMemorySpillableArrayListMultimap, which is already in master.

@davidyan74
Contributor Author

@ilooner Thanks, will take a look at InMemorySpillableArrayListMultimap.

@davidyan74
Contributor Author

@tweise I already marked the WindowedStorage interface Unstable and added a note in the Javadoc that the WindowedStorage interfaces may change or go away entirely soon, and that we plan to integrate with spillable data structures in the very near future. Let me know if you think this PR is good to go.

addressing PR comments

Added trigger unit tests

Fixed sliding window bug, added unit tests

Fixed session window bug; added more unit tests

Gives window a chance to trigger before purging because of lateness

added more unit tests

Process watermark only at end window

Renamed WatermarkOpt to Type, and fixed bug when window in retraction storage is not dropped when it's too late

added copyright

rat check

changed WindowOption from an abstract class to an interface

Use mutable objects for accumulated types

Retraction storage needs to be based on the output, not accumulated type

Added more unit tests

added support of fixed lateness in case watermark is not available from upstream

changed name from fixed lateness to fixed watermark

support inheriting windowed tuple from upstream when window option is not given
@tweise
Contributor

tweise commented Jul 12, 2016

I think this PR is good to go, but we need to see the follow-up work for scalable checkpointing; benchmarking would also be good.

@asfgit asfgit merged commit 7a77274 into apache:master Jul 12, 2016
@vrozov
Member

vrozov commented Jul 13, 2016

Filed two JIRAs: one to revisit the Accumulation interface (https://issues.apache.org/jira/browse/APEXMALHAR-2145) and one to benchmark the new operators and evaluate the boxing impact (https://issues.apache.org/jira/browse/APEXMALHAR-2146).

10 participants