
[FLINK-6250] Distinct procTime with Rows boundaries #3732

Closed
wants to merge 5 commits

Conversation

huawei-flink

Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration.
If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the How To Contribute guide.
In addition to going through the list, please provide a meaningful description of your changes.

  • General

    • The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text")
    • The pull request addresses only one issue
    • Each commit in the PR has a meaningful commit message (including the JIRA id)
  • Documentation

    • Documentation has been added for new functionality
    • Old documentation affected by the pull request has been updated
    • JavaDoc for public methods has been added
  • Tests & Build

    • Functionality added by the pull request is covered by tests
    • mvn clean verify has been executed successfully locally or a Travis build has passed

smallestTsState = getRuntimeContext.getState(smallestTimestampDescriptor)

val distinctValDescriptor : MapStateDescriptor[Any, Row] =
new MapStateDescriptor[Any, Row]("distinctValuesBufferMapState", classOf[Any], classOf[Row])
Contributor

It seems we should use the concrete key type and Row type, otherwise it cannot be serialized.

Author

you are right

Author

OK, I have bypassed the problem by using a separate distinct MapState per aggregation [value, long]. I don't think it is necessary to preserve the type in the serialization, as aggregation works with numbers, and these do not have serialization problems. Or am I missing something?
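For illustration, the "one distinct map per aggregation" idea can be sketched with plain `HashMap`s standing in for Flink's `MapState` (the class and method names below are hypothetical, not code from this PR):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one value -> count map per aggregate, so each map
// only ever holds values of that aggregate's field instead of a single
// Map[Any, Row] shared across all aggregates.
class DistinctBuffers {
    private final Map<Object, Long>[] buffers;

    @SuppressWarnings("unchecked")
    DistinctBuffers(int numAggregates) {
        buffers = new Map[numAggregates];
        for (int i = 0; i < numAggregates; i++) {
            buffers[i] = new HashMap<>();
        }
    }

    // Returns true if this is the first occurrence of the value for
    // aggregate i, i.e. the value should actually be accumulated.
    boolean add(int i, Object value) {
        long count = buffers[i].getOrDefault(value, 0L);
        buffers[i].put(value, count + 1);
        return count == 0L;
    }
}
```

Keeping the maps separate also means each one can later be given its own typed state descriptor instead of `classOf[Any]`.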

i = 0
while (i < aggregates.length) {
val accumulator = accumulators.getField(i).asInstanceOf[Accumulator]
retractVal = retractList.get(0).getField(aggFields(i)(0))
Contributor

Why use a two-dimensional array? It seems enough to use a one-dimensional array to record the aggregate index.

Author

If I understand your question correctly, the list is necessary to include co-occurring events (i.e., rows processed in the same millisecond). However, this part was already included in the original code that was merged.

retractVal = retractList.get(0).getField(aggFields(i)(0))
if(distinctAggsFlag(i)){
distinctCounter += 1
val counterRow = distinctValueState.get(retractVal)
Contributor

Different aggregate params have different types; I think we cannot use one MapState to store, e.g., e1 and e2 in SUM(DIST e1), SUM(DIST e2).

Author

The MapState takes Any as the key, but I agree with you that I didn't test this aspect thoroughly.

Author

The test you suggested indeed failed, great point! The new version fixes the problem.


val sqlQuery = "SELECT a, " +
" SUM(DIST(e)) OVER (" +
" ORDER BY procTime() ROWS BETWEEN 10 PRECEDING AND CURRENT ROW) AS sumE " +
Contributor

I think we should add some tests for multiple aggregations with DISTINCT.

Contributor

@fhueske left a comment

Hi @huawei-flink,

thanks for working on this. I think the approach with the MapState[X, Long] is very good. We should integrate this with the code generation. This could work as follows: GeneratedAggregations is extended by a method initialize(ctx: RuntimeContext). Then the function can register its own state objects. We would use this to generate MapStates for all distinct aggregates. The code-generating function CodeGenerator.generateAggregations() would need another Array[Boolean] parameter. If the flags are set, it generates the initialize() method to register the state objects. The accumulate() and retract() methods would be generated to check the map state first for distinct aggregates. It would also be good if the code-gen would try to reuse the same MapState if two functions aggregate the same distinct field.
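A rough sketch of the proposed shape, with all names illustrative (the real GeneratedAggregations and RuntimeContext live in Flink and are only hinted at here; plain maps stand in for MapState):

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the runtime context that hands out keyed state.
interface RuntimeContextLike {
    <K> Map<K, Long> getMapState(String name);
}

// Illustrative shape of the proposed extension: the generated class gets an
// initialize() hook where it registers the state it needs.
abstract class GeneratedAggregationsSketch {
    // Generated only when a distinct flag is set for some aggregate.
    abstract void initialize(RuntimeContextLike ctx);
    abstract void accumulate(Object value);
}

class DistinctSumSketch extends GeneratedAggregationsSketch {
    private Map<Object, Long> seen;
    long sum = 0L;

    void initialize(RuntimeContextLike ctx) {
        // The generated initialize() registers one map per distinct aggregate.
        seen = ctx.getMapState("distinct-sum");
    }

    void accumulate(Object value) {
        long count = seen.getOrDefault(value, 0L);
        seen.put(value, count + 1);
        if (count == 0L) { // first occurrence: really accumulate
            sum += ((Number) value).longValue();
        }
    }
}
```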

Regarding the approach with the DIST() function, I have a few concerns. Although it would work, I think it has a few issues:

  • we would need to remove it later (and possibly break the queries of users)
  • using a function to indicate DISTINCT is not semantically correct
  • users could try to apply it anywhere else in a query

Does @rtudoran have an estimate of when DISTINCT will be available in Calcite master? The Calcite community usually releases quite often (from my experience, at least once per Flink release cycle).

What do you think?
Best, Fabian

import java.nio.charset.Charset
import java.util.List

import org.apache.calcite.rel.`type`._
Contributor

Many unused imports

*/
object DistinctAggregatorExtractor extends SqlFunction("DIST", SqlKind.OTHER_FUNCTION,
ReturnTypes.ARG0, InferTypes.RETURN_TYPE,
OperandTypes.NUMERIC, SqlFunctionCategory.NUMERIC) {
Contributor

An aggregation function can also return non-numeric types such as MIN(String). So SqlFunctionCategory.NUMERIC might not be the right category. OTOH, I don't know how Calcite uses this category.

Contributor

One never stops learning. :-)

@@ -91,6 +93,22 @@ class DataStreamOverAggregate(

val overWindow: org.apache.calcite.rel.core.Window.Group = logicWindow.groups.get(0)

val distinctVarMap: Map[String,Boolean] = new HashMap[String, Boolean]
Contributor

I would do the extraction in the DataStreamOverAggregateRule. There we have proper access to the input Calc and the RexProgram. Extracting the function call as a String is quite fragile. The Calc could for instance contain an attribute called "DISTRIBUTION".

The rule would unnest the expression from the DIST() RexNode and remove DIST. The distinct information would need to be added to the DataStreamOverAggregate.

Contributor

This is a good point. The string trick is a temporary workaround anyway.
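The fragility is easy to demonstrate: a plain substring check matches any expression whose text happens to contain "DIST" (illustrative only, not the PR's actual parsing code):

```java
// A substring check cannot tell a DIST() call apart from any other
// identifier that merely contains the letters "DIST".
class DistMatchDemo {
    static boolean looksLikeDist(String expr) {
        return expr.contains("DIST");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDist("DIST($2)"));         // true, intended
        System.out.println(looksLikeDist("DISTRIBUTION($2)")); // true, false positive
    }
}
```

Matching on the RexNode's operator in the rule, as suggested above, avoids this entirely.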

val exp = iter.next
if(exp.contains("DIST")){
val varName = exp.substring(exp.indexOf("$"))
distinctVarMap.put(varName,true)
Contributor

+space

isRowTimeType: Boolean,
isRowsClause: Boolean): DataStream[Row] = {

val overWindow: Group = logicWindow.groups.get(0)
val partitionKeys: Array[Int] = overWindow.keys.toArray
val namedAggregates: Seq[CalcitePair[AggregateCall, String]] = generateNamedAggregates

val aggregateCalls = overWindow.getAggregateCalls(logicWindow)
val distinctAggFlags: Array[Boolean] = new Array[Boolean](aggregateCalls.size)
for (i <- 0 until aggregateCalls.size()){
Contributor

+space .size()) {

// decrease the counter and continue
else {
distinctValCounter -= 1
distinctValueStateList(i).put(retractVal,distinctValCounter)
Contributor

+space

distinctValCounter -= 1
distinctValueStateList(i).put(retractVal,distinctValCounter)
}
}else {
Contributor

+space


// get oldest element beyond buffer size
// and if oldest element exist, retract value
var removeCounter :Integer = 0
Contributor

not used

// get oldest element beyond buffer size
// and if oldest element exist, retract value
var removeCounter :Integer = 0
var distinctCounter : Integer = 0
Contributor

not used

var distinctValCounter: Long = distinctValueStateList(i).get(inputValue)
// if counter is 0L first time we aggregate
// for a seen value but never accumulated
if(distinctValCounter == 0L){
Contributor

doesn't MapState.get return null when the key is not contained?

Contributor

if it is a long, it returns 0L, don't ask me why. :-D

Contributor

OK, good to know :-)
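The accumulate/retract bookkeeping discussed above can be sketched with a plain map where an absent key behaves like 0L, mirroring the MapState behavior observed here (a hypothetical helper, not the PR's code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the distinct counter logic: accumulate into the aggregate only
// when the count goes 0 -> 1, retract from it only when it drops back to 0.
class DistinctCounter {
    private final Map<Object, Long> counts = new HashMap<>();

    // Returns true if the value must be added to the aggregate.
    boolean accumulate(Object value) {
        long c = counts.getOrDefault(value, 0L); // absent key acts like 0L
        counts.put(value, c + 1);
        return c == 0L;
    }

    // Returns true if the value must be retracted from the aggregate.
    boolean retract(Object value) {
        long c = counts.getOrDefault(value, 0L) - 1;
        if (c <= 0L) {
            counts.remove(value); // last occurrence gone, drop the entry
            return true;
        }
        counts.put(value, c);
        return false;
    }
}
```

Removing the entry on the last retract keeps the state from growing with values that are no longer in the window.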

@stefanobortoli
Contributor

@fhueske, I agree with you about the risk of the temporary DIST() ingestion. Perhaps we could meanwhile just work on the "ProcessFunction + code generation" approach, keeping the DIST function for test purposes. My concern is that the code may change again and all the work would be wasted. To be honest, the code generation is quite new to me, and I will have to learn it to work on that. Meanwhile, I have almost completed a version that relies on the current code generation, nesting the distinct logic. As it is almost done, I will share this one as well and then, if necessary, move to the code generation. What do you think?

@fhueske
Contributor

fhueske commented Apr 21, 2017

Sounds good to me. IMO, we can also add the runtime code to the code base even if there is no API support yet, provided we cover it with test cases. Then we could quickly enable it once DISTINCT in OVER becomes available.

@fhueske
Contributor

fhueske commented Apr 21, 2017

Btw, the code generation is not so fancy. The best way to learn it would be to debug a simple batch GROUP BY query (once batch aggregations are code-gen'd as well).

@stefanobortoli
Contributor

@fhueske I have just pushed a version working with code generation (without modifying the code generation itself). There will be the need for some refactoring in the AggregateUtil function, but if the overall concept is sound, I will fix things.

@hongyuhong , @shijinkui you could also have a look if you have time.

@rtudoran
Contributor

@fhueske @stefanobortoli I suggest we merge this temporary solution into Flink (using a special marker for distinct) until the Flink module is upgraded to the next Calcite release. I have fixed the issue in Calcite.
The advantages of pushing this already are:

  1. we can reuse the code
  2. when we have the distinct marker, we can simply modify the check for distinct for the aggregates in the DataStreamOver

@stefanobortoli
Contributor

@fhueske @sunjincheng121 @shijinkui @hongyuhong I have created a PR against the latest master with the code-generated distinct, #3771, please have a look. If it is fine, we can basically support distinct for all the window types.

@fhueske
Contributor

fhueske commented Apr 26, 2017

Thanks @stefanobortoli! I'll have a look at #3771

@fhueske
Contributor

fhueske commented Apr 26, 2017

Since PR #3771 is the follow up of this PR, could you close this one? Thanks
