[BEAM-2287] UDAF support #3447

mingmxu · 2017-06-27T05:16:22Z

Add an abstract class BeamSqlUdaf, following the UDAF definition in Calcite. The built-in COUNT/SUM/AVG/MAX/MIN aggregations are also rewritten in this new format.

Note that the unit test is ignored after rebasing onto BEAM-2446. I will re-enable it in BEAM-2520.

coveralls · 2017-06-27T06:45:26Z

Changes Unknown when pulling 0fc4724 on XuMingmin:BEAM-2287 into apache:DSL_SQL.

xumingming

One question: can we avoid exposing BeamSqlRow to UDF writers? It might look strange to them.

xumingming · 2017-06-27T05:51:23Z

dsls/sql/src/main/java/org/apache/beam/dsls/sql/transform/BeamBuiltinAggregations.java

+   * Build-in aggregation for SUM.
+   */
+  public static class Sum<T> extends BeamSqlUdaf<T, T> {
+    private static List<Integer> supportedType = Arrays.asList(Types.INTEGER,


how about Decimal?

Decimal is supported, will update.

xumingming · 2017-06-27T08:48:08Z

dsls/sql/src/main/java/org/apache/beam/dsls/sql/transform/BeamBuiltinAggregations.java

+  }
+
+  /**
+   * Build-in aggregation for MIN.


mingmxu · 2017-06-27T16:57:48Z

BeamSqlRow seems a good option to me for storing the intermediate data of a UDAF, as 1) it's self-describing and supports variable data types, 2) it can hold multiple data elements, as AVG needs, and 3) there is no need to write a customized coder. I would consider it a sub-row in the aggregator; it may need some JavaDoc to explain it a bit.

Any other suggestions?

xumingming · 2017-06-28T03:21:45Z

Not needing to write a customized Coder is a strong point for exposing BeamSqlRow: it simplifies the implementation for UDF developers, and in exchange they just need to know what a BeamSqlRow is. I'm OK with doing it this way.

xumingming

LGTM

1. support DECIMAL in built-in aggregators; 2. add JavaDoc for BeamSqlUdaf;

coveralls · 2017-06-28T06:03:06Z

Changes Unknown when pulling 449d1fa on XuMingmin:BEAM-2287 into apache:DSL_SQL.

takidau

Thanks! This will be nice to have. Some initial comments.

takidau · 2017-06-28T23:55:12Z

dsls/sql/src/main/java/org/apache/beam/dsls/sql/schema/BeamSqlUdaf.java

+ * /Float/Date/BigDecimal, mapping as SQL type INTEGER/BIGINT/SMALLINT/TINYINE/DOUBLE/FLOAT
+ * /TIMESTAMP/DECIMAL;<br>
+ * 3. wrap intermediate data in a {@link BeamSqlRow}, and do not rely on elements in class;<br>
+ * 4. The intermediate value of UDAF function is stored in a {@code BeamSqlRow} object.<br>


Why is constraint #4 necessary? Are SQL operations actually applied somewhere to the intermediate result? It looks like BeamSqlRow is just being used for the convenience of its built-in casting methods, which seems unnecessary.

Right, BeamSqlRow is introduced here to hold variable data types; the intermediate result is only used by the CombineFn. I'm also thinking of using a generic type here and asking developers to provide a coder if they use a custom object.

I like that. It sounds cleaner.

takidau · 2017-06-28T23:58:34Z

dsls/sql/src/test/java/org/apache/beam/dsls/sql/BeamSqlDslAggregationTest.java

+
+    //The test case is disabled temporally as BeamSql doesn't have methods to regester UDF/UDAF,
+    //pending on task BEAM-2520
+//    BeamSqlEnv.registerUdaf("squaresum", SquareSum.class);


Uncomment and remove note?

Maybe let's remove this test case and move it to BEAM-2520.

takidau · 2017-06-28T23:59:20Z

dsls/sql/src/test/java/org/apache/beam/dsls/sql/BeamSqlDslAggregationTest.java

+   */
+  public static class SquareSum extends BeamSqlUdaf<Integer, Integer> {
+    private int outputFieldType;
+    private BeamSqlRecordType accType;


final on both?

takidau · 2017-06-28T23:59:40Z

dsls/sql/src/main/java/org/apache/beam/dsls/sql/transform/BeamBuiltinAggregations.java

+   * Built-in aggregation for COUNT.
+   */
+  public static class Count<T> extends BeamSqlUdaf<T, Long> {
+    private BeamSqlRecordType accType;


takidau · 2017-06-29T00:21:42Z

dsls/sql/src/main/java/org/apache/beam/dsls/sql/transform/BeamAggregationTransforms.java

+              throw new IllegalStateException(e);
+            }
+          } else {
+            throw new UnsupportedOperationException();


Provide more details here? This will be a pretty opaque error when it gets hit otherwise.

takidau · 2017-06-29T00:27:37Z

dsls/sql/src/main/java/org/apache/beam/dsls/sql/transform/BeamAggregationTransforms.java

-        case "SUM":
-          //for both AVG/SUM, a summary value is hold at first.
-          switch (ex.getOutputType()) {
+          switch (call.type.getSqlTypeName()) {


Can you please pull each of these switch statements on getSqlTypeName into a factory method in the corresponding aggregation (e.g., BeamBuiltinAggregations.Max.create(SqlTypeName typeName))? That will help manage some of the size of this method, and also put the per-type construction logic somewhere reusable.
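A minimal sketch of the factory-method shape suggested above, assuming a simplified MAX aggregation. `SqlType` is a hypothetical stand-in for Calcite's `SqlTypeName` so the example compiles without dependencies, and `MaxSketch` is not the class from this PR. The default branch also shows one way to make the `UnsupportedOperationException` less opaque by naming the function and the offending type.

```java
import java.math.BigDecimal;

// Hypothetical stand-in for Calcite's SqlTypeName, so the sketch compiles on its own.
enum SqlType { INTEGER, BIGINT, DOUBLE, DECIMAL, VARCHAR }

// Hypothetical, simplified MAX aggregation; not the class from the PR.
final class MaxSketch<T extends Comparable<T>> {
  private final SqlType fieldType;

  private MaxSketch(SqlType fieldType) {
    this.fieldType = fieldType;
  }

  // Factory method: the per-type switch lives next to the aggregation it builds,
  // instead of inside the transform that wires aggregations together.
  static MaxSketch<?> create(SqlType fieldType) {
    switch (fieldType) {
      case INTEGER:
        return new MaxSketch<Integer>(fieldType);
      case BIGINT:
        return new MaxSketch<Long>(fieldType);
      case DOUBLE:
        return new MaxSketch<Double>(fieldType);
      case DECIMAL:
        return new MaxSketch<BigDecimal>(fieldType);
      default:
        // Name the function and the offending type so the failure is easy to diagnose.
        throw new UnsupportedOperationException(
            "MAX does not support SQL type " + fieldType);
    }
  }

  SqlType fieldType() {
    return fieldType;
  }

  // Keeps the larger of the two values; null means no value has been seen yet.
  T add(T current, T input) {
    if (current == null) {
      return input;
    }
    return input.compareTo(current) > 0 ? input : current;
  }
}
```

A caller would then write something like `MaxSketch.create(SqlType.DECIMAL)` in place of an inline switch.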

takidau · 2017-06-29T00:51:30Z

dsls/sql/src/main/java/org/apache/beam/dsls/sql/transform/BeamAggregationTransforms.java

+      return deltaAcc;
+    }
+    @Override
+    public List<BeamSqlRow> mergeAccumulators(Iterable<List<BeamSqlRow>> accumulators) {


Please add tests to exercising merging. It looks like none of the merge methods are getting tested; many of them loop forever.

It seems merge is not covered in unit tests with TestPipeline. Any suggestions?

Is it just a matter of exercising a UDAF query with session windowing?

takidau · 2017-06-29T00:52:46Z

dsls/sql/src/main/java/org/apache/beam/dsls/sql/transform/BeamAggregationTransforms.java

+      List<BeamSqlRow> deltaAcc = new ArrayList<>();
+      for (int idx = 0; idx < aggregators.size(); ++idx) {
+        List<BeamSqlRow> accs = new ArrayList<>();
+        while (accumulators.iterator().hasNext()) {


Use a for loop.

Also note that Iterable.iterator() always returns a new iterator, so you need to store it in a variable before calling hasNext() and next() unless you're only trying to get the very first element. There are a bunch of instances of this across this PR that should be fixed.
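To illustrate the pitfall described above: for standard collections, each call to `iterator()` starts again at the first element, so a loop that calls `accumulators.iterator().hasNext()` on every pass never finishes for a non-empty input. The sketch below shows the two safe shapes; the class and method names are hypothetical and not taken from this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical demo class; names are illustrative only.
final class IteratorLoopDemo {

  // Buggy shape (do not use): a fresh iterator is created on every check, so
  //   while (iterable.iterator().hasNext()) { ... iterable.iterator().next() ... }
  // keeps seeing the first element and loops forever on non-empty input.

  // Preferred: the enhanced for loop obtains one iterator and advances it.
  static <T> List<T> collectWithForEach(Iterable<T> iterable) {
    List<T> out = new ArrayList<>();
    for (T element : iterable) {
      out.add(element);
    }
    return out;
  }

  // Equivalent explicit form: store the iterator once, then use hasNext()/next() on it.
  static <T> List<T> collectWithIterator(Iterable<T> iterable) {
    List<T> out = new ArrayList<>();
    Iterator<T> it = iterable.iterator();
    while (it.hasNext()) {
      out.add(it.next());
    }
    return out;
  }

  public static void main(String[] args) {
    List<Integer> accumulators = Arrays.asList(1, 2, 3);
    System.out.println(collectWithForEach(accumulators));
    System.out.println(collectWithIterator(accumulators));
  }
}
```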

takidau · 2017-06-29T00:56:39Z

dsls/sql/src/test/java/org/apache/beam/dsls/sql/BeamSqlDslAggregationTest.java

+  /**
+   * GROUP-BY with UDAF.
+   */
+  @Ignore


mingmxu · 2017-06-30T04:50:44Z

@takidau I changed the definition of BeamSqlUdaf to BeamSqlUdaf<InputT, AccumT, OutputT> to get rid of BeamSqlRow. Can you review it? Let's finish this before your vacation.

@xumingming do you want to take a peek again, as it's a fairly big change?
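For readers following the thread, here is a rough, dependency-free sketch of what a UDAF base class with three type parameters might look like. The method names (`init`, `add`, `merge`, `result`) and the class names are assumptions for illustration and may not match the code in this PR; the coder-related method discussed later in the review is left out.

```java
// Hypothetical sketch of a three-type-parameter UDAF base class.
// Method names are assumptions, not confirmed against the PR.
abstract class UdafSketch<InputT, AccumT, OutputT> {
  // Creates an empty accumulator.
  abstract AccumT init();

  // Folds one input value into the accumulator.
  abstract AccumT add(AccumT accumulator, InputT input);

  // Combines partial accumulators, for example from different workers.
  abstract AccumT merge(Iterable<AccumT> accumulators);

  // Extracts the final output from the accumulator.
  abstract OutputT result(AccumT accumulator);
}

// Sum of squares, modeled on the SquareSum test UDAF mentioned in this PR.
final class SquareSumSketch extends UdafSketch<Integer, Integer, Integer> {
  @Override
  Integer init() {
    return 0;
  }

  @Override
  Integer add(Integer accumulator, Integer input) {
    return accumulator + input * input;
  }

  @Override
  Integer merge(Iterable<Integer> accumulators) {
    int merged = 0;
    for (Integer partial : accumulators) {
      merged += partial;
    }
    return merged;
  }

  @Override
  Integer result(Integer accumulator) {
    return accumulator;
  }
}
```

Because the accumulator is a plain `Integer` here, no row wrapper is involved; a UDAF with a custom accumulator class would additionally need to supply a coder for it.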

mingmxu · 2017-06-30T06:19:28Z

retest this please

coveralls · 2017-06-30T07:35:03Z

Changes Unknown when pulling 5201359 on XuMingmin:BEAM-2287 into apache:DSL_SQL.

takidau

Much nicer, thank you! A few more minor comments, then I think this is probably ready to go.

takidau · 2017-06-30T17:21:54Z

dsls/sql/src/main/java/org/apache/beam/dsls/sql/transform/BeamBuiltinAggregations.java

-    public Max(int outputFieldType) {
-      this.accType = BeamSqlRecordType.create(Arrays.asList("__max"),
-          Arrays.asList(outputFieldType));
+    private SqlTypeName fieldType;


takidau · 2017-06-30T17:23:25Z

dsls/sql/src/main/java/org/apache/beam/dsls/sql/transform/BeamBuiltinAggregations.java

-    public Min(int outputFieldType) {
-      this.accType = BeamSqlRecordType.create(Arrays.asList("__min"),
-          Arrays.asList(outputFieldType));
+    private SqlTypeName fieldType;


takidau · 2017-06-30T17:37:02Z

dsls/sql/src/main/java/org/apache/beam/dsls/sql/transform/BeamBuiltinAggregations.java

    }

    @Override
-    public T result(BeamSqlRow accumulator) {
-      return (T) accumulator.getFieldValue(0);
+    public Coder<T> getAccumulatorCoder(CoderRegistry registry) throws CannotProvideCoderException {


Pull this into a shared getNumericSqlTypeCoder method that MAX and MIN can both use?

Can you give more hints? Do you mean a helper function for subclasses? I don't think it can be called in BeamSqlUdaf.getAccumulatorCoder(), as there is no guarantee that AccumT is one of the Beam SQL types.

I was just thinking a static method in BeamBuiltinAggregations. It could take a registry parameter and a String describing the command name for the UnsupportedOperationException. What do you think?

Agreed, that avoids duplicate code in MAX and MIN. Meanwhile, I could add a default coder for Short/Float/Date/BigDecimal in BeamSqlUdaf.getAccumulatorCoder, so that developers do not need to care about the coder if the accumulator uses Beam SQL field types.

Helper method looks good, thanks! Please see my other concern about the auto-registration stuff, though.

takidau · 2017-06-30T19:49:21Z

dsls/sql/src/main/java/org/apache/beam/dsls/sql/schema/BeamSqlUdaf.java

+    registry.registerCoderForClass(Short.class, SerializableCoder.of(Short.class));
+    registry.registerCoderForClass(Float.class, SerializableCoder.of(Float.class));
+    registry.registerCoderForClass(BigDecimal.class, BigDecimalCoder.of());
+    registry.registerCoderForClass(Date.class, SerializableCoder.of(Date.class));


Adding coders to the registry automatically in a get method feels wrong. The user won't be expecting that, and if for some reason they've registered their own custom coders for those types, they won't be able to use SQL accumulators with them. I'd rather see us look into adding these to the common registry somehow as a separate change, possibly as part of creating a BeamSqlEnv or something similar?

Sounds reasonable, let me remove these lines and leave it as a follow-up task.

takidau · 2017-06-30T19:50:40Z

dsls/sql/src/main/java/org/apache/beam/dsls/sql/transform/BeamBuiltinAggregations.java

    }

    @Override
-    public T result(BeamSqlRow accumulator) {
-      return (T) accumulator.getFieldValue(0);
+    public Coder<T> getAccumulatorCoder(CoderRegistry registry) throws CannotProvideCoderException {


Helper method looks good, thanks! Please see my other concern about the auto-registration stuff, though.

takidau · 2017-06-30T21:01:20Z

Am I right that we still have no test coverage for the merge methods in the accumulators? If so, what's the plan for getting test coverage there?

Other than that, this PR LGTM.

mingmxu · 2017-06-30T21:36:25Z

I guess merge is only called in a distributed environment, which TestPipeline is not. I could add some tests that call the built-in aggregators directly.
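One way such a direct test could look, without a pipeline: build partial accumulators by hand, merge them, and check the result. The sketch below uses a hypothetical AVG combiner with a (sum, count) accumulator; it is not the implementation from this PR, and a real test would likely use JUnit rather than a `main` method.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical AVG combiner with a (sum, count) accumulator; not the PR's implementation.
final class AvgSketch {
  static final class Accum {
    final long sum;
    final long count;

    Accum(long sum, long count) {
      this.sum = sum;
      this.count = count;
    }
  }

  static Accum add(Accum acc, long input) {
    return new Accum(acc.sum + input, acc.count + 1);
  }

  // Merging adds sums and counts component-wise; averaging partial averages would be wrong.
  static Accum merge(Iterable<Accum> accumulators) {
    long sum = 0;
    long count = 0;
    for (Accum acc : accumulators) {
      sum += acc.sum;
      count += acc.count;
    }
    return new Accum(sum, count);
  }

  static double result(Accum acc) {
    return acc.count == 0 ? 0.0 : (double) acc.sum / acc.count;
  }

  public static void main(String[] args) {
    // Two partial accumulators, as two workers might produce them.
    Accum left = add(add(new Accum(0, 0), 1), 2); // values 1 and 2
    Accum right = add(new Accum(0, 0), 6);        // value 6
    List<Accum> parts = Arrays.asList(left, right);

    double merged = result(merge(parts));
    if (merged != 3.0) {
      throw new AssertionError("expected 3.0 but got " + merged);
    }
    System.out.println("merge ok: " + merged);
  }
}
```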


takidau · 2017-06-30T22:28:29Z

That's okay, let's call this good enough for now. LGTM, I'll go ahead and merge. However, can you please:

1. File a Jira to make sure we add tests exercising merge for accumulators.
2. Follow up with @tgroh to find out how we can exercise merge methods in combiners in a SQL DSL test.

Thank you!

takidau · 2017-06-30T22:41:58Z

Merged.

coveralls · 2017-07-01T05:15:33Z

Changes Unknown when pulling 4e8e4ed on XuMingmin:BEAM-2287 into apache:DSL_SQL.

mingmxu · 2017-07-01T05:38:42Z

Thanks @takidau @xumingming !

mingmxu force-pushed the BEAM-2287 branch from eab43c0 to 0fc4724 Compare June 27, 2017 05:22

xumingming reviewed Jun 27, 2017

xumingming approved these changes Jun 28, 2017

mingmxu force-pushed the BEAM-2287 branch from d42581d to 447e96e Compare June 28, 2017 04:25

mingmxu added 2 commits June 27, 2017 21:26

support of UDAF + rebase

a8d7213

1. support DECIMAL in built-in aggregators; 2. add JavaDoc for BeamSqlUdaf;

fix findbug reports

449d1fa

mingmxu force-pushed the BEAM-2287 branch from 447e96e to 449d1fa Compare June 28, 2017 04:46

takidau requested changes Jun 29, 2017

change BeamSqlUdaf to BeamSqlUdaf<InputT, AccumT, OutputT>

5201359

takidau requested changes Jun 30, 2017

cleanup and update Coder behaviour in BeamSqlUdaf

c81d52f

takidau requested changes Jun 30, 2017

cleanup Coders in BeamSqlUdaf

4e8e4ed

asfgit pushed a commit that referenced this pull request Jun 30, 2017

[BEAM-2287] This closes #3447

7ba77dd

mingmxu closed this Jul 1, 2017

mingmxu deleted the BEAM-2287 branch July 1, 2017 05:38
