[BEAM-3708] Adding grouping table to Precombine step. #5795

youngoli · 2018-06-27T21:29:49Z

Adds a grouping table to the Precombine step of a lifted Combine Per
Key, which enables a Partial Group by Key optimization. The grouping
table code is fairly generic, so other runners that want to perform a
Partial Group by Key can reuse it.
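
For readers unfamiliar with the optimization, here is a minimal sketch of the idea, not the code in this PR. `PartialCombineTable`, `maxEntries`, and the `BiConsumer` output are hypothetical names chosen for illustration. Unlike the table in this PR, which evicts only part of its contents when full, the sketch simply flushes everything.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;

/** Hypothetical, simplified stand-in for a precombine grouping table (not the Beam class). */
final class PartialCombineTable<K, V> {
  private final int maxEntries;             // flush threshold; an assumption for this sketch
  private final BinaryOperator<V> combiner; // merges a new value into the existing accumulator
  private final Map<K, V> table = new HashMap<>();

  PartialCombineTable(int maxEntries, BinaryOperator<V> combiner) {
    this.maxEntries = maxEntries;
    this.combiner = combiner;
  }

  /** Combines the value into the entry for its key, flushing everything if the table overflows. */
  void put(K key, V value, BiConsumer<K, V> output) {
    table.merge(key, value, combiner);
    if (table.size() > maxEntries) {
      flush(output);
    }
  }

  /** Emits every buffered (key, accumulator) pair downstream and clears the table. */
  void flush(BiConsumer<K, V> output) {
    for (Map.Entry<K, V> entry : table.entrySet()) {
      output.accept(entry.getKey(), entry.getValue());
    }
    table.clear();
  }
}

final class PartialCombineDemo {
  /** Feeds three elements with two distinct keys and returns what reaches the output. */
  static List<String> run() {
    List<String> emitted = new ArrayList<>();
    BiConsumer<String, Integer> output = (k, v) -> emitted.add(k + "=" + v);
    PartialCombineTable<String, Integer> table = new PartialCombineTable<>(2, Integer::sum);
    table.put("a", 1, output);
    table.put("a", 2, output);
    table.put("b", 5, output);
    table.flush(output); // end of bundle
    return emitted;
  }
}
```

In this run three input elements leave the step as two partially combined pairs, so less data reaches the downstream Group by Key.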

Note for any reviewers:
I wasn't entirely sure where to commit the GroupingTable code, since it's somewhat generic. I'm starting with the most specific directory it would fit in, but I may move the GroupingTable files to a new sub-directory named "utils" or something similar, or to a completely different directory if anyone has suggestions.

youngoli · 2018-06-27T21:31:29Z

R: @lukecwik

lukecwik · 2018-06-27T22:44:11Z

sdks/java/harness/src/main/java/org/apache/beam/fn/harness/CombineRunners.java

+              accumCoder);
+
+      // Register the appropriate handlers.
+      addStartFunction.accept(runner::startBundle);


You add the start function twice.

lukecwik · 2018-06-27T22:45:31Z

sdks/java/harness/src/main/java/org/apache/beam/fn/harness/CombineRunners.java

+      return runner;
+    }
+  }
+
  static <KeyT, InputT, AccumT>
  ThrowingFunction<KV<KeyT, InputT>, KV<KeyT, AccumT>> createPrecombineMapFunction(


This code is no longer used.

Simplifying the grouping table code by making it more specific to the precombine. The class doesn't need to be so generic when only the Precombine runner is going to use it in the foreseeable future.

youngoli · 2018-06-29T23:59:14Z

Run Java PreCommit

lukecwik

All the comments are improvements to the code. I'll merge as-is and let you work on the next iteration improving the implementation based upon the comments I left in this PR.

lukecwik · 2018-06-29T23:49:17Z

sdks/java/harness/src/main/java/org/apache/beam/fn/harness/CombineRunners.java

+      // Input coder may sometimes be WindowedValueCoder depending on runner, instead of the
+      // expected KvCoder.
+      Coder<?> uncastInputCoder = rehydratedComponents.getCoder(mainInput.getCoderId());
+      KvCoder<KeyT, InputT> inputCoder;


You don't use the inputCoder anywhere except to get the key coder.

Consider dropping the local variable inputCoder and setting keyCoder directly.

lukecwik · 2018-06-29T23:51:56Z

sdks/java/harness/src/main/java/org/apache/beam/fn/harness/CombineRunners.java

+
+    void processElement(WindowedValue<KV<KeyT, InputT>> elem) throws Exception {
+      groupingTable.put(
+          elem, (Object outputElem) -> output.accept((WindowedValue<KV<KeyT, AccumT>>) outputElem));


If you use a cast, you should be able to pass this in as a method reference instead of a lambda.
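
One possible reading of this suggestion, sketched with hypothetical stand-in interfaces (`SketchReceiver` and `SketchFnDataReceiver` are not the Beam types): cast the typed receiver once, then pass `accept` as a method reference, so no per-element cast is needed inside a lambda.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the untyped Receiver quoted in the diff. */
interface SketchReceiver {
  void process(Object outputElem) throws Exception;
}

/** Hypothetical stand-in for a typed downstream receiver. */
interface SketchFnDataReceiver<T> {
  void accept(T input) throws Exception;
}

final class MethodRefCastSketch {
  /** Pretends to be groupingTable.flush(receiver): pushes one buffered element downstream. */
  static void flush(Object buffered, SketchReceiver receiver) throws Exception {
    receiver.process(buffered);
  }

  /** Lambda form, as in the diff: the cast happens per element inside the lambda body. */
  static void withLambda(Object buffered, SketchFnDataReceiver<String> output) throws Exception {
    flush(buffered, (Object outputElem) -> output.accept((String) outputElem));
  }

  /** Method-reference form: one unchecked cast of the receiver, then a plain method reference. */
  @SuppressWarnings("unchecked")
  static void withMethodRef(Object buffered, SketchFnDataReceiver<String> output)
      throws Exception {
    SketchFnDataReceiver<Object> untyped =
        (SketchFnDataReceiver<Object>) (SketchFnDataReceiver<?>) output;
    flush(buffered, untyped::accept);
  }

  /** Runs both forms against the same sink and returns what it received. */
  static List<String> demo() {
    List<String> seen = new ArrayList<>();
    try {
      withLambda("first", seen::add);
      withMethodRef("second", seen::add);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return seen;
  }
}
```

The unchecked cast is only safe when the table is known to emit elements of the expected type, which is the same assumption the lambda's cast already makes.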

lukecwik · 2018-06-29T23:52:04Z

sdks/java/harness/src/main/java/org/apache/beam/fn/harness/CombineRunners.java

+
+    void finishBundle() throws Exception {
+      groupingTable.flush(
+          (Object outputElem) -> output.accept((WindowedValue<KV<KeyT, AccumT>>) outputElem));


Ditto here: if you use a cast, you should be able to pass this in as a method reference instead of a lambda.

lukecwik · 2018-06-29T23:53:00Z

sdks/java/harness/src/main/java/org/apache/beam/fn/harness/GroupingTable.java

+public interface GroupingTable<K, InputT, AccumT> {
+
+  /** Abstract interface of things that accept inputs one at a time via process(). */
+  interface Receiver {


I don't think we'll need to make this generic in this sense. Consider using FnDataReceiver directly here instead of Receiver.

lukecwik · 2018-06-29T23:53:33Z

sdks/java/harness/src/main/java/org/apache/beam/fn/harness/GroupingTable.java

+  }
+
+  /** Adds a pair to this table, possibly flushing some entries to output if the table is full. */
+  void put(Object pair, Receiver receiver) throws Exception;


You can use KV<InputT, AccumT>

lukecwik · 2018-06-30T00:07:17Z

sdks/java/harness/src/main/java/org/apache/beam/fn/harness/PrecombineGroupingTable.java

+  }
+
+  /** Provides client-specific operations for combining values. */
+  public interface Combiner<K, InputT, AccumT, OutputT> {


You'll only have one implementation of a Combiner. You should be able to use the CombineFn directly everywhere.

lukecwik · 2018-06-30T00:11:27Z

sdks/java/harness/src/main/java/org/apache/beam/fn/harness/PrecombineGroupingTable.java

+
+    if (size >= maxSize) {
+      long targetSize = (long) (TARGET_LOAD * maxSize);
+      Iterator<GroupingTableEntry<K, InputT, AccumT>> entries = table.values().iterator();


We would do a lot better if we used an LRU strategy for cache eviction.
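
As an illustration of what LRU eviction could look like, here is a sketch built on `java.util.LinkedHashMap` in access order. It is not the code in this PR; `LruTableSketch`, `maxEntries`, and `onEvict` are hypothetical names. Evicted entries are handed to a callback rather than dropped, since a precombine table has to emit them downstream.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical LRU-evicting table; evicted entries are emitted rather than discarded. */
final class LruTableSketch<K, V> extends LinkedHashMap<K, V> {
  private static final long serialVersionUID = 1L;
  private final int maxEntries;
  private final transient BiConsumer<K, V> onEvict;

  LruTableSketch(int maxEntries, BiConsumer<K, V> onEvict) {
    super(16, 0.75f, /* accessOrder= */ true); // iterate from least to most recently used
    this.maxEntries = maxEntries;
    this.onEvict = onEvict;
  }

  /** Called by put() after an insertion; evicts the least recently used entry when over capacity. */
  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    if (size() > maxEntries) {
      onEvict.accept(eldest.getKey(), eldest.getValue());
      return true;
    }
    return false;
  }
}

final class LruTableDemo {
  /** Counts keys with a capacity of two and returns emitted pairs in emission order. */
  static List<String> run() {
    List<String> emitted = new ArrayList<>();
    LruTableSketch<String, Integer> table =
        new LruTableSketch<>(2, (k, v) -> emitted.add(k + "=" + v));
    for (String key : new String[] {"a", "b", "a", "c"}) {
      Integer old = table.get(key);
      table.put(key, old == null ? 1 : old + 1);
    }
    // End of bundle: emit whatever is still buffered.
    for (Map.Entry<String, Integer> entry : table.entrySet()) {
      emitted.add(entry.getKey() + "=" + entry.getValue());
    }
    return emitted;
  }
}
```

In the demo the recently touched key `a` survives the insertion of `c` and `b` is evicted instead, so hot keys keep accumulating in the table.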

lukecwik · 2018-06-30T00:12:33Z

sdks/java/harness/src/main/java/org/apache/beam/fn/harness/PrecombineGroupingTable.java

+  private long size = 0;
+  private Map<Object, GroupingTableEntry<K, InputT, AccumT>> table;
+
+  PrecombineGroupingTable(


I know that this is a copy from Dataflow internally, but it could be replaced with a much better implementation such as Caffeine.

lukecwik · 2018-07-02T18:02:18Z

sdks/java/harness/src/main/java/org/apache/beam/fn/harness/PrecombineGroupingTable.java

+  }
+
+  @VisibleForTesting
+  public void setMaxSize(long maxSize) {


It would be much better if this were a final value, configurable only via the constructor.
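
A minimal sketch of that shape, using a hypothetical class (`BoundedTableSketch` and `DEFAULT_MAX_SIZE` are illustrative names and values, not taken from this PR): the capacity is a `final` field, and tests pick a small value through a package-private constructor instead of a setter.

```java
/** Hypothetical table with an immutable capacity; names and default are illustrative only. */
final class BoundedTableSketch {
  static final long DEFAULT_MAX_SIZE = 100_000_000L; // assumed default, for illustration

  private final long maxSize;

  /** Production entry point: uses the default capacity. */
  BoundedTableSketch() {
    this(DEFAULT_MAX_SIZE);
  }

  /** Package-private overload so tests can choose a small capacity without a setter. */
  BoundedTableSketch(long maxSize) {
    if (maxSize <= 0) {
      throw new IllegalArgumentException("maxSize must be positive, got " + maxSize);
    }
    this.maxSize = maxSize;
  }

  /** Mirrors the `size >= maxSize` check quoted in the diff. */
  boolean isFull(long currentSize) {
    return currentSize >= maxSize;
  }
}
```

With the field final, the capacity cannot change while entries are buffered, and the `@VisibleForTesting` setter is no longer needed.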

lukecwik · 2018-07-02T18:04:14Z

sdks/java/harness/src/test/java/org/apache/beam/fn/harness/CombineRunnersTest.java

@@ -93,6 +93,7 @@ public Integer extractOutput(Integer accum) {
  private RunnerApi.PTransform pTransform;
  private String inputPCollectionId;
  private String outputPCollectionId;
+  private RunnerApi.Pipeline pProto;


nit: pProto -> pipeline or pipelineProto

lukecwik · 2018-07-02T18:06:26Z

Run Java PreCommit

lukecwik requested changes Jun 27, 2018

youngoli added 2 commits June 29, 2018 16:26

[BEAM-3708] Simplifying precombine grouping tables.

69b34f6

Simplifying the grouping table code by making it more specific to the precombine. The class doesn't need to be so generic when only the Precombine runner is going to use it in the foreseeable future.

youngoli force-pushed the beam3708 branch from 3660668 to 69b34f6 Compare June 29, 2018 23:40

lukecwik approved these changes Jul 2, 2018

lukecwik merged commit 6e3c055 into apache:master Jul 2, 2018
