Skip to content

Conversation

@haohui
Copy link

@haohui haohui commented Apr 24, 2017

No description provided.

@haohui
Copy link
Author

haohui commented Apr 24, 2017

Note that this PR contains minimal amount of tests. Would love the feedbacks on what kinds of tests are required here.

@fhueske
Copy link
Contributor

fhueske commented Apr 25, 2017

Hi @haohui, thanks for the PR! I like the approach of the wrapping distinct aggregator.

Unfortunately, this approach won't work with the upcoming changes for the the UDAGG interface. The AggregateFunction interface won't define methods to accumulate, etc. Instead, these methods can be implemented for different types, will be identified by reflection and called from generated code. We handle scalar and table functions the same way. See PR #3762 for how the AggregateFunction interface will evolve.

Best, Fabian

@haohui
Copy link
Author

haohui commented Apr 27, 2017

Updated the PR to codegen the parts used by distinct accumulator. Each column is calculated independently.

@fhueske
Copy link
Contributor

fhueske commented Apr 28, 2017

Hi @haohui,

I suggested before that PR #3771 might be used for DISTINCT group window functions. However, this does not work because we cannot register state for an AggregateFunction. The benefit of the approach of #3771 would have been that it does not need to deserialize the Map every time a record is accumulated (or retracted). Instead the distinct values are kept in a MapState that can be accessed (and deserialized) per look up key. But this approach does not work with the AggregateFunction that we use for early aggregation.

To be honest, I'm a bit concerned about the performance of the approach of this PR because the state of the DistinctAccumulator accumulator (i.e., the complete map) will be de/serialized every time we access it.

I think we can use this approach for now, but should look out, whether we can use an approach similar to the batch side where distinct aggregations (on different keys) are translated into multiple aggregations which are later joined together (the join would be rather cheap because its a 1-to-1 join).

I'll have a look at this PR later today.
Thanks, Fabian

@sunjincheng121
Copy link
Member

sunjincheng121 commented May 2, 2017

Hi @haohui @fhueske I am very interested in DISTINCT, Let me share some ideas about this:
First up, in standard database there are two situations can using DISTINCT keyword.

  • in SELECT Clause, e.g.: SELECT DISTINCT name FROM table
  • in AGG Clause, e.g.: COUNT([ALL|DISTINCT] expression),SUM([ALL|DISTINCT] expression), etc.

In this post,we talk about AGG Clause. The DISTINCT keyword tells the database system to aggregate only the distinct, or unique, values within the scope of the aggregate function. i.e. database system will deal with the DISTINCT keyword, and put the unique value into AGG. Based on this understanding, I think FLINK FRAMEWORK(not the AGG) should deal with the DISTINCT keyword. we do not need DistinctAccumulator.java. About GROUP WINDOW, I think I like analyze whether the data is duplicated in XXXWindowFunction and DataSetXXXAggFunction, And add Map[String, Boolean] variable Mapping aggName -> isFirstTimeProcess( identifies whether the data is duplicated as a parameter of GeneratedAggregationsFunction. GeneratedAggregationsFunction process data according to aggCall.isDistinct(we must keep this boolean value in GeneratedAggregationsFunction) and isFirstTimeProcess. What do you think? @haohui @fhueske

Best,
SunJincheng

@fhueske
Copy link
Contributor

fhueske commented May 7, 2018

The features that this PR was going to implement has been resolved by PR #5555.
I will close it.

fhueske pushed a commit to fhueske/flink that referenced this pull request May 7, 2018
…eaming tables.

This closes apache#3764.
This closes apache#3765. // Has been resolved by another PR.
@asfgit asfgit closed this in d65d932 May 7, 2018
sampathBhat pushed a commit to sampathBhat/flink that referenced this pull request Jul 26, 2018
…eaming tables.

This closes apache#3764.
This closes apache#3765. // Has been resolved by another PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants