
[FLINK-7491] [Table API & SQL] add MultiSetTypeInfo; add built-in Collect Aggregate Function for Flink SQL. #4585

Closed
wants to merge 6 commits into from

Conversation

suez1224

@suez1224 suez1224 commented Aug 25, 2017

What is the purpose of the change

This change adds the COLLECT aggregate function to the Flink SQL API.

Brief change log

  • Add Multiset SQL type support
  • Add COLLECT aggregate function
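The semantics being added can be sketched independently of Flink: COLLECT aggregates values into a multiset, which this PR represents as a map from element to occurrence count. The class and method names below are illustrative only, not the PR's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of COLLECT semantics: build a multiset
// (element -> occurrence count) from a sequence of values.
public class CollectSketch {
    public static <E> Map<E, Integer> collect(Iterable<E> values) {
        Map<E, Integer> multiset = new HashMap<>();
        for (E value : values) {
            if (value != null) { // COLLECT ignores null values
                multiset.merge(value, 1, Integer::sum);
            }
        }
        return multiset;
    }
}
```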

Verifying this change

This change added tests and can be verified as follows:

  • Added unit tests for MultisetTypeInfo
  • Added unit tests for the COLLECT aggregate functions
  • Added integration tests for the COLLECT aggregate functions in stream/batch SQL

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): yes
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? Not yet; I could not find a place in sql.md to document it and would like some suggestions.

@suez1224 suez1224 force-pushed the collect-multiset branch 3 times, most recently from 7bcb0db to a50ad5b Compare August 25, 2017 07:41
@wuchong
Member

wuchong commented Aug 25, 2017

Hi @suez1224, thanks for the PR. I think we can use Array instead of AbstractMultiSet; AbstractMultiSet is too obscure for users. In that case, we would not need the MultiSetSerializer and MultiSetTypeInfo, and downstream queries could apply a UDF to the array-typed field as an eval(...) parameter.

@fhueske
Contributor

fhueske commented Aug 30, 2017

Hi @suez1224, please read and fill out the template in the PR description. Thank you.

@suez1224 suez1224 changed the title [FLINK-7491] add MultiSetTypeInfo; add built-in Collect Aggregate Function for Flink SQL. [FLINK-7491] [Table API & SQL] add MultiSetTypeInfo; add built-in Collect Aggregate Function for Flink SQL. Aug 30, 2017
@suez1224
Author

suez1224 commented Sep 1, 2017

Hi @fhueske , I've filled out the PR template. Please take a look. Thanks a lot.

@suez1224 suez1224 force-pushed the collect-multiset branch 2 times, most recently from 69a96ec to 138e78d Compare September 1, 2017 22:53
@fhueske
Contributor

fhueske commented Sep 4, 2017

Thanks @suez1224, I'm quite busy atm but will try to have a look soon.
Thanks, Fabian

Contributor

@twalthr twalthr left a comment


Thanks for the PR @suez1224. I agree with @wuchong's comment: we should use a different Java type rather than depending on an external library. See my inline comments.

@@ -80,6 +80,13 @@ under the License.
<!-- managed version -->
</dependency>

<!-- For multiset -->
<dependency>
<groupId>org.apache.commons</groupId>
Contributor

We should not add additional dependencies to Flink just for a new data type, and there is no particular reason for choosing this library. Couldn't we just use a regular Java Map? Otherwise, I would propose adding a class for our own type, as we did for org.apache.flink.types.Row. Calcite uses List, which is not very nice but would also work.

Author

Thanks. Used java.util.Map instead.

@@ -211,6 +218,14 @@ class FlinkTypeFactory(typeSystem: RelDataTypeSystem) extends JavaTypeFactoryImp
canonize(relType)
}

override def createMultisetType(elementType: RelDataType, maxCardinality: Long): RelDataType = {
val relType = new MultisetRelDataType(
Contributor

There are multiple locations where a new type has to be registered, e.g., FlinkRelNode.

Author

Added changes in FlinkRelNode & ExpressionReducer

@suez1224 suez1224 force-pushed the collect-multiset branch 3 times, most recently from 29ba8e0 to d58071e Compare September 16, 2017 05:33
Contributor

@fhueske fhueske left a comment

Thanks for the PR @suez1224. I had a brief look at it and it looks mostly good. I left a few comments.
@twalthr is more familiar with the type system, so it would be good if he would have another look as well.

Thanks, Fabian

import scala.collection.JavaConverters._

/** The initial accumulator for Collect aggregate function */
class CollectAccumulator[E] extends JTuple1[util.Map[E, Integer]]
Contributor

We can use a MapView here. This feature was recently added and automatically backs the Map with a MapState if possible. Otherwise, it uses a Java HashMap (as right now). The benefit of backing the accumulator by MapState is that only the keys and values that are accessed need to be deserialized. In contrast, a regular HashMap is completely de/serialized every time the accumulator is read. Using MapView would require that the accumulator is implemented as a POJO (instead of a Tuple1).

Check the MapView class for details and let me know if you have questions.

Author

Please take another look, I've updated to use MapView.
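The accumulate/retract logic under discussion can be sketched against a plain Java map (in the actual PR, MapView may back this map with MapState; all names here are illustrative, not the PR's code):

```java
import java.util.HashMap;
import java.util.Map;

// Plain-map sketch of the COLLECT accumulator: accumulate increments an
// element's count, retract decrements it and drops entries reaching zero.
public class CollectAccumulatorSketch<E> {
    final Map<E, Integer> counts = new HashMap<>();

    void accumulate(E value) {
        if (value != null) { // null values are ignored
            counts.merge(value, 1, Integer::sum);
        }
    }

    void retract(E value) {
        if (value == null) {
            return;
        }
        Integer count = counts.get(value);
        if (count != null) {
            if (count == 1) {
                counts.remove(value); // keep the map free of zero counts
            } else {
                counts.put(value, count - 1);
            }
        }
    }
}
```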

def accumulate(accumulator: CollectAccumulator[E], value: E): Unit = {
if (value != null) {
if (accumulator.f0.containsKey(value)) {
val add = (x: Integer, y: Integer) => x + y
Contributor

add is not used, right?

Author

yes, removed.

override def getAccumulatorType: TypeInformation[CollectAccumulator[E]] = {
new TupleTypeInfo(
classOf[CollectAccumulator[E]],
new GenericTypeInfo[util.Map[E, Integer]](classOf[util.Map[E, Integer]]))
Contributor

Don't use a generic type here. This will result in a KryoSerializer which can be quite inefficient and result in state that cannot be upgraded. Rather use MapTypeInformation.

Author

Changed to use MapViewTypeInfo here. However, if E is not a basic type, I can only use GenericTypeInfo (please see ObjectCollectAggFunction). Is there a better way? @fhueske

Contributor

We could have an abstract method getElementTypeInfo() that returns the type info for the elements. The basic types can be properly handled and for Object we fall back to GenericType.

Author

@fhueske Thanks. I think that's exactly what the current code does. Please take another look.

elementType,
isNullable) {

override def toString = s"MULTISET($typeInfo)"
Contributor

This should rather be s"MULTISET($elementType)". TypeInformation is a Flink concept, whereas RelDataType belongs to the Calcite context.

Author

Done

val tEnv = TableEnvironment.getTableEnvironment(env, config)

val sqlQuery =
"SELECT b, COLLECT(b)" +
Contributor

Collect should be added to the SQL documentation under "Built-in Function" -> "Aggregate Functions"

Moreover, we should add MULTISET to the supported data types.

It would also be nice if you could open a JIRA to add support for COLLECT to the Table API. We try to keep both in sync and it helps if we have a list of things that need to be added.

Author

Updated the documentation.

Table API ticket created: https://issues.apache.org/jira/browse/FLINK-7658?filter=-1

@suez1224 suez1224 force-pushed the collect-multiset branch 2 times, most recently from 4ce5223 to f07216a Compare September 21, 2017 07:45
Contributor

@fhueske fhueske left a comment

Hi @suez1224, thanks for the update!
I added a few minor comments.

A major question is how null values are handled. I'm not familiar with the semantics of COLLECT but if we want to support null values, we need to change some serialization code.

Best, Fabian

@@ -746,6 +746,7 @@ The SQL runtime is built on top of Flink's DataSet and DataStream APIs. Internal
| `Types.PRIMITIVE_ARRAY`| `ARRAY` | e.g. `int[]` |
| `Types.OBJECT_ARRAY` | `ARRAY` | e.g. `java.lang.Byte[]`|
| `Types.MAP` | `MAP` | `java.util.HashMap` |
| `Types.MULTISET` | `MULTISET` | `java.util.HashMap` |
Contributor

Should we explain how the HashMap is used to represent the multiset, i.e., that a multiset of String is a HashMap&lt;String, Integer&gt;?

Author

done

public final class MultisetTypeInfo<T> extends MapTypeInfo<T, Integer> {

private static final long serialVersionUID = 1L;

Contributor

Remove the newline.

Author

done

* @param <T> The type of the elements in the Multiset.
*/
@PublicEvolving
public final class MultisetTypeInfo<T> extends MapTypeInfo<T, Integer> {
Contributor

Add this to the org.apache.flink.table.api.Types class for easy creation of the TypeInformation.

Contributor

Does SQL MULTISET also support null values? If yes, we would need to wrap the MapSerializer; otherwise we would have to rely on the key serializer supporting null, which many serializers do not. A solution would be to wrap the MapSerializer and additionally serialize the count of null elements.

Author

I took a look at the Calcite tests for the COLLECT function; null values are ignored.

Contributor

Great! That makes things a lot easier :-)

// ------------------------------------------------------------------------

@Override
public boolean isBasicType() {
Contributor

This is implemented by MapTypeInfo; no need to override.

Author

done

}

@Override
public boolean isTupleType() {
Contributor

This is implemented by MapTypeInfo; no need to override.

Author

done

}
map
} else {
null.asInstanceOf[util.Map[E, Integer]]
Contributor

According to the specs of COLLECT, is null the correct return value or an empty Multiset?

Author

Checked against the Calcite tests; it should return an empty multiset instead.
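The agreed behavior (empty multiset rather than null when nothing was collected) can be sketched like this; the helper name and signature are illustrative, not the PR's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: materialize the accumulator map as the COLLECT result,
// yielding an empty multiset (not null) when nothing was collected.
public class GetValueSketch {
    public static <E> Map<E, Integer> getValue(Map<E, Integer> accumulator) {
        return accumulator == null
            ? new HashMap<>()          // empty multiset, never null
            : new HashMap<>(accumulator);
    }
}
```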

@@ -1414,8 +1414,29 @@ object AggregateUtil {
aggregates(index) = udagg.getFunction
accTypes(index) = udagg.accType

case unSupported: SqlAggFunction =>
throw new TableException(s"unsupported Function: '${unSupported.getName}'")
case other: SqlAggFunction =>
Contributor

Change this case to case collect: SqlAggFunction if collect.getKind == SqlKind.COLLECT => to have a dedicated case for this built-in function. Also, move this case after case _: SqlCountAggFunction to keep all built-in functions together.

Author

done

@@ -1414,8 +1414,29 @@ object AggregateUtil {
aggregates(index) = udagg.getFunction
accTypes(index) = udagg.accType

case unSupported: SqlAggFunction =>
Contributor

Since we add a dedicated case for COLLECT, this case should not remain at the end of this match.

case _ =>
new ObjectCollectAggFunction
}
} else {
Contributor

The else case can be removed because we keep the catch-all.

Author

done

{% endhighlight %}
</td>
<td>
<p>Returns a multiset of the <i>value</i>s.</p>
Contributor

Be more specific about the handling of null values. Are they ignored? What is returned if only null values are added (null or empty multiset)?

Author

done

@suez1224 suez1224 force-pushed the collect-multiset branch 5 times, most recently from 851cd36 to 1741f10 Compare September 29, 2017 05:25
Contributor

@fhueske fhueske left a comment

Thanks for the update @suez1224.
I have only a few more comments. After that the PR should be good to merge.

@twalthr, would you like to have a look as well?

Thanks, Fabian

}
}

abstract class CollectAggFunction[E]
Contributor

I don't think we need to make this class abstract. Instead, we should add a constructor that takes the TypeInformation of the value. Then we don't need to subclass the aggregation function and can avoid generic value types for most non-primitive fields.

@@ -1410,6 +1410,26 @@ object AggregateUtil {
case _: SqlCountAggFunction =>
aggregates(index) = new CountAggFunction

case collect: SqlAggFunction if collect.getKind == SqlKind.COLLECT =>
aggregates(index) = sqlTypeName match {
Contributor

We can pass the actual TypeInformation of the argument type here to the constructor of the CollectAggFunction and don't need to distinguish the different argument types.

def testUnboundedGroupByCollect(): Unit = {

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = TableEnvironment.getTableEnvironment(env)
Contributor

add env.setStateBackend(this.getStateBackend) to enforce serialization through the MapView.

def testUnboundedGroupByCollectWithObject(): Unit = {

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = TableEnvironment.getTableEnvironment(env)
Contributor

add env.setStateBackend(this.getStateBackend) to enforce serialization through the MapView.

case _ =>
new ObjectCollectAggFunction
}

Contributor

we need to set accTypes(index) = aggregates(index).getAccumulatorType in order to activate the MapView feature.

@suez1224
Author

suez1224 commented Oct 5, 2017

@fhueske Addressed your comments. PTAL. Much appreciated.

new FloatCollectAggFunction
case DOUBLE =>
new DoubleCollectAggFunction
case TINYINT | SMALLINT | INTEGER | BIGINT | VARCHAR | CHAR | FLOAT | DOUBLE =>
Contributor

I was rather thinking of removing the match-case block completely and setting

aggregates(index) = new CollectAggFunction(FlinkTypeFactory.toTypeInfo(relDataType))
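The design suggested here can be sketched generically: a single COLLECT function parameterized by the element's type descriptor replaces one subclass per SQL type. Class&lt;E&gt; below stands in for Flink's TypeInformation&lt;E&gt;, and all names are illustrative, not the PR's actual code:

```java
// Sketch: one generic COLLECT function whose constructor receives the
// element type descriptor, so the per-type match block at the call site
// (FloatCollectAggFunction, DoubleCollectAggFunction, ...) is unnecessary.
public class CollectAggFunctionSketch<E> {
    private final Class<E> elementType; // stand-in for TypeInformation<E>

    public CollectAggFunctionSketch(Class<E> elementType) {
        this.elementType = elementType;
    }

    public Class<E> getElementType() {
        return elementType;
    }

    // The dispatch site then needs no per-type case analysis:
    public static <E> CollectAggFunctionSketch<E> forType(Class<E> type) {
        return new CollectAggFunctionSketch<>(type);
    }
}
```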

fhueske pushed a commit to fhueske/flink that referenced this pull request Oct 10, 2017
@fhueske
Contributor

fhueske commented Oct 10, 2017

Merging
