
[FLINK-7491] [Table API & SQL] add MultiSetTypeInfo; add built-in Collect Aggregate Function for Flink SQL. #4585

Closed
wants to merge 6 commits into from

Conversation

suez1224

@suez1224 suez1224 commented Aug 25, 2017

What is the purpose of the change

This change adds the COLLECT aggregate function to the Flink SQL API.

Brief change log

  • Add Multiset SQL type support
  • Add COLLECT aggregate function
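The semantics being added can be sketched independently of Flink: COLLECT aggregates values into a multiset, which this PR represents as a map from element to occurrence count. The class and method names below are illustrative only, not the PR's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of COLLECT semantics: build a multiset
// (element -> occurrence count) from a sequence of values.
public class CollectSketch {
    public static <E> Map<E, Integer> collect(Iterable<E> values) {
        Map<E, Integer> multiset = new HashMap<>();
        for (E value : values) {
            if (value != null) { // COLLECT ignores null values
                multiset.merge(value, 1, Integer::sum);
            }
        }
        return multiset;
    }
}
```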

Verifying this change

This change added tests and can be verified as follows:

  • Added unit tests for MultisetTypeInfo
  • Added unit tests for the COLLECT aggregate functions
  • Added integration tests for the COLLECT aggregate functions in stream/batch SQL

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): yes
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? Not yet; I could not find a place in sql.md to document it and would like some suggestions.

@suez1224 suez1224 force-pushed the collect-multiset branch 3 times, most recently from 7bcb0db to a50ad5b Compare August 25, 2017 07:41
@wuchong
Member

wuchong commented Aug 25, 2017

Hi @suez1224, thanks for the PR. I think we can use Array instead of AbstractMultiSet; AbstractMultiSet is too obscure for users. In that case, we would not need the MultiSetSerializer and MultiSetTypeInfo, and downstream queries could apply a UDF to the array-typed field as an eval(...) parameter.

@fhueske
Contributor

fhueske commented Aug 30, 2017

Hi @suez1224, please read and fill out the template in the PR description. Thank you.

@suez1224 suez1224 changed the title [FLINK-7491] add MultiSetTypeInfo; add built-in Collect Aggregate Function for Flink SQL. [FLINK-7491] [Table API & SQL] add MultiSetTypeInfo; add built-in Collect Aggregate Function for Flink SQL. Aug 30, 2017
@suez1224
Author

suez1224 commented Sep 1, 2017

Hi @fhueske , I've filled out the PR template. Please take a look. Thanks a lot.

@suez1224 suez1224 force-pushed the collect-multiset branch 2 times, most recently from 69a96ec to 138e78d Compare September 1, 2017 22:53
@fhueske
Contributor

fhueske commented Sep 4, 2017

Thanks @suez1224, I'm quite busy atm but will try to have a look soon.
Thanks, Fabian

Contributor

@twalthr twalthr left a comment


Thanks for the PR @suez1224. I agree with @wuchong's comment: we should use a different Java type rather than depending on an external library. See my inline comments.

@@ -80,6 +80,13 @@ under the License.
<!-- managed version -->
</dependency>

<!-- For multiset -->
<dependency>
<groupId>org.apache.commons</groupId>
Contributor

We should not add additional dependencies to Flink just for a new data type, and there is no particular reason for choosing this library. Couldn't we just use a regular Java Map? Otherwise, I would propose adding a class for our own type, as we did for org.apache.flink.types.Row. Calcite uses List, which is not very nice but would also work.

Author

Thanks. Used java.util.Map instead.

@@ -211,6 +218,14 @@ class FlinkTypeFactory(typeSystem: RelDataTypeSystem) extends JavaTypeFactoryImp
canonize(relType)
}

override def createMultisetType(elementType: RelDataType, maxCardinality: Long): RelDataType = {
val relType = new MultisetRelDataType(
Contributor

There are multiple locations where a new type has to be registered, e.g., FlinkRelNode.

Author

Added changes in FlinkRelNode & ExpressionReducer

@suez1224 suez1224 force-pushed the collect-multiset branch 3 times, most recently from 29ba8e0 to d58071e Compare September 16, 2017 05:33
Contributor

@fhueske fhueske left a comment

Thanks for the PR @suez1224. I had a brief look at it and it looks mostly good. I left a few comments.
@twalthr is more familiar with the type system, so it would be good if he would have another look as well.

Thanks, Fabian

import scala.collection.JavaConverters._

/** The initial accumulator for Collect aggregate function */
class CollectAccumulator[E] extends JTuple1[util.Map[E, Integer]]
Contributor

We can use a MapView here. This feature was recently added and automatically backs the Map with a MapState if possible. Otherwise, it uses a Java HashMap (as right now). The benefit of backing the accumulator by MapState is that only the keys and values that are accessed need to be deserialized. In contrast, a regular HashMap is completely de/serialized every time the accumulator is read. Using MapView would require that the accumulator is implemented as a POJO (instead of a Tuple1).

Check the MapView class for details and let me know if you have questions.

Author

Please take another look, I've updated to use MapView.
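The accumulate/retract logic under discussion can be sketched against a plain Java map (in the actual PR, MapView may back this map with MapState; all names here are illustrative, not the PR's code):

```java
import java.util.HashMap;
import java.util.Map;

// Plain-map sketch of the COLLECT accumulator: accumulate increments an
// element's count, retract decrements it and drops entries reaching zero.
public class CollectAccumulatorSketch<E> {
    final Map<E, Integer> counts = new HashMap<>();

    void accumulate(E value) {
        if (value != null) { // null values are ignored
            counts.merge(value, 1, Integer::sum);
        }
    }

    void retract(E value) {
        if (value == null) {
            return;
        }
        Integer count = counts.get(value);
        if (count != null) {
            if (count == 1) {
                counts.remove(value); // keep the map free of zero counts
            } else {
                counts.put(value, count - 1);
            }
        }
    }
}
```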

def accumulate(accumulator: CollectAccumulator[E], value: E): Unit = {
if (value != null) {
if (accumulator.f0.containsKey(value)) {
val add = (x: Integer, y: Integer) => x + y
Contributor

add is not used, right?

Author

yes, removed.

override def getAccumulatorType: TypeInformation[CollectAccumulator[E]] = {
new TupleTypeInfo(
classOf[CollectAccumulator[E]],
new GenericTypeInfo[util.Map[E, Integer]](classOf[util.Map[E, Integer]]))
Contributor

Don't use a generic type here. This will result in a KryoSerializer which can be quite inefficient and result in state that cannot be upgraded. Rather use MapTypeInformation.

Author

Changed to use MapViewTypeInfo here. However, if E is not a basic type, I can only use GenericTypeInfo (please see ObjectCollectAggFunction). Is there a better way? @fhueske

Contributor

We could have an abstract method getElementTypeInfo() that returns the type info for the elements. The basic types can be properly handled and for Object we fall back to GenericType.

Author

@fhueske Thanks. I think that's exactly what the current code does. Please take another look.

elementType,
isNullable) {

override def toString = s"MULTISET($typeInfo)"
Contributor

This should rather be s"MULTISET($elementType)". TypeInformation is a Flink concept, whereas RelDataType belongs to the Calcite context.

Author

Done

val tEnv = TableEnvironment.getTableEnvironment(env, config)

val sqlQuery =
"SELECT b, COLLECT(b)" +
Contributor

Collect should be added to the SQL documentation under "Built-in Function" -> "Aggregate Functions"

Moreover, we should add MULTISET to the supported data types.

It would also be nice if you could open a JIRA to add support for COLLECT to the Table API. We try to keep both in sync and it helps if we have a list of things that need to be added.

Author

Updated the documentation.

Table API ticket created: https://issues.apache.org/jira/browse/FLINK-7658?filter=-1

@suez1224 suez1224 force-pushed the collect-multiset branch 2 times, most recently from 4ce5223 to f07216a Compare September 21, 2017 07:45
Contributor

@fhueske fhueske left a comment

Hi @suez1224, thanks for the update!
I added a few minor comments.

A major question is how null values are handled. I'm not familiar with the semantics of COLLECT but if we want to support null values, we need to change some serialization code.

Best, Fabian

@@ -746,6 +746,7 @@ The SQL runtime is built on top of Flink's DataSet and DataStream APIs. Internal
| `Types.PRIMITIVE_ARRAY`| `ARRAY` | e.g. `int[]` |
| `Types.OBJECT_ARRAY` | `ARRAY` | e.g. `java.lang.Byte[]`|
| `Types.MAP` | `MAP` | `java.util.HashMap` |
| `Types.MULTISET` | `MULTISET` | `java.util.HashMap` |
Contributor

Should we explain how the HashMap is used to represent the multiset, i.e., that a multiset of String is a HashMap&lt;String, Integer&gt;?

Author

done

public final class MultisetTypeInfo<T> extends MapTypeInfo<T, Integer> {

private static final long serialVersionUID = 1L;

Contributor

Remove the newline.

Author

done

* @param <T> The type of the elements in the Multiset.
*/
@PublicEvolving
public final class MultisetTypeInfo<T> extends MapTypeInfo<T, Integer> {
Contributor

Add this to the org.apache.flink.table.api.Types class for easy creation of the TypeInformation.

Contributor

Does SQL MULTISET also support null values? If yes, we would need to wrap the MapSerializer; otherwise we would have to rely on the key serializer supporting null, which many serializers do not. A solution would be to wrap the MapSerializer and additionally serialize the count of null elements.

Author

I took a look at the Calcite tests for the COLLECT function; null values are ignored.

Contributor

Great! That makes things a lot easier :-)

// ------------------------------------------------------------------------

@Override
public boolean isBasicType() {
Contributor

This is implemented by MapTypeInfo; no need to override.

Author

done

}

@Override
public boolean isTupleType() {
Contributor

This is implemented by MapTypeInfo; no need to override.

Author

done

}
map
} else {
null.asInstanceOf[util.Map[E, Integer]]
Contributor

According to the specs of COLLECT, is null the correct return value or an empty Multiset?

Author

Checked against the Calcite tests; it should return an empty multiset instead.
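The agreed behavior (empty multiset rather than null when nothing was collected) can be sketched like this; the helper name and signature are illustrative, not the PR's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: materialize the accumulator map as the COLLECT result,
// yielding an empty multiset (not null) when nothing was collected.
public class GetValueSketch {
    public static <E> Map<E, Integer> getValue(Map<E, Integer> accumulator) {
        return accumulator == null
            ? new HashMap<>()          // empty multiset, never null
            : new HashMap<>(accumulator);
    }
}
```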

@@ -1414,8 +1414,29 @@ object AggregateUtil {
aggregates(index) = udagg.getFunction
accTypes(index) = udagg.accType

case unSupported: SqlAggFunction =>
throw new TableException(s"unsupported Function: '${unSupported.getName}'")
case other: SqlAggFunction =>
Contributor

Change this case to case collect: SqlAggFunction if collect.getKind == SqlKind.COLLECT => to have a dedicated case for this built-in function. Also, move this case after case _: SqlCountAggFunction to keep all built-in functions together.

Author

done

@@ -1414,8 +1414,29 @@ object AggregateUtil {
aggregates(index) = udagg.getFunction
accTypes(index) = udagg.accType

case unSupported: SqlAggFunction =>
Contributor

Since we add a dedicated case for COLLECT, this case should not remain at the end of this match.

case _ =>
new ObjectCollectAggFunction
}
} else {
Contributor

The else case can be removed because we keep the catch-all.

Author

done

{% endhighlight %}
</td>
<td>
<p>Returns a multiset of the <i>value</i>s.</p>
Contributor

Be more specific about the handling of null values. Are they ignored? What is returned if only null values are added (null or empty multiset)?

Author

done

@suez1224 suez1224 force-pushed the collect-multiset branch 5 times, most recently from 851cd36 to 1741f10 Compare September 29, 2017 05:25
Contributor

@fhueske fhueske left a comment

Thanks for the update @suez1224.
I have only a few more comments. After that the PR should be good to merge.

@twalthr, would you like to have a look as well?

Thanks, Fabian

}
}

abstract class CollectAggFunction[E]
Contributor

I don't think we need to make this class abstract. Instead, we should add a constructor that takes the TypeInformation of the value. Then we don't need to subclass the aggregation function and can avoid generic value types for most non-primitive fields.

@@ -1410,6 +1410,26 @@ object AggregateUtil {
case _: SqlCountAggFunction =>
aggregates(index) = new CountAggFunction

case collect: SqlAggFunction if collect.getKind == SqlKind.COLLECT =>
aggregates(index) = sqlTypeName match {
Contributor

We can pass the actual TypeInformation of the argument type here to the constructor of the CollectAggFunction and don't need to distinguish the different argument types.

def testUnboundedGroupByCollect(): Unit = {

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = TableEnvironment.getTableEnvironment(env)
Contributor

add env.setStateBackend(this.getStateBackend) to enforce serialization through the MapView.

def testUnboundedGroupByCollectWithObject(): Unit = {

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = TableEnvironment.getTableEnvironment(env)
Contributor

add env.setStateBackend(this.getStateBackend) to enforce serialization through the MapView.

case _ =>
new ObjectCollectAggFunction
}

Contributor

we need to set accTypes(index) = aggregates(index).getAccumulatorType in order to activate the MapView feature.

@suez1224
Author

suez1224 commented Oct 5, 2017

@fhueske Addressed your comments. PTAL. Much appreciated.

new FloatCollectAggFunction
case DOUBLE =>
new DoubleCollectAggFunction
case TINYINT | SMALLINT | INTEGER | BIGINT | VARCHAR | CHAR | FLOAT | DOUBLE =>
Contributor

I was rather thinking of removing the match-case block completely and setting

aggregates(index) = new CollectAggFunction(FlinkTypeFactory.toTypeInfo(relDataType))
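The design suggested here can be sketched generically: a single COLLECT function parameterized by the element's type descriptor replaces one subclass per SQL type. Class&lt;E&gt; below stands in for Flink's TypeInformation&lt;E&gt;, and all names are illustrative, not the PR's actual code:

```java
// Sketch: one generic COLLECT function whose constructor receives the
// element type descriptor, so the per-type match block at the call site
// (FloatCollectAggFunction, DoubleCollectAggFunction, ...) is unnecessary.
public class CollectAggFunctionSketch<E> {
    private final Class<E> elementType; // stand-in for TypeInformation<E>

    public CollectAggFunctionSketch(Class<E> elementType) {
        this.elementType = elementType;
    }

    public Class<E> getElementType() {
        return elementType;
    }

    // The dispatch site then needs no per-type case analysis:
    public static <E> CollectAggFunctionSketch<E> forType(Class<E> type) {
        return new CollectAggFunctionSketch<>(type);
    }
}
```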

fhueske pushed a commit to fhueske/flink that referenced this pull request Oct 10, 2017
@fhueske
Contributor

fhueske commented Oct 10, 2017

Merging
