[FLINK-6388] Add support for DISTINCT into Code Generated Aggregations #3783
|
@fhueske please have a look at this PR, it contains just the code generation part with optional distinct. |
|
Please fix the PR title, you are referencing the wrong JIRA. |
|
Thanks for this PR @huawei-flink! I think I made a mistake when I suggested to use the code-gen'd functions with registered What does this mean? We cannot use the current approach of registering However, we can still use your code for distinct over windows ( I'll try to have a closer look at this PR soon. Best, Fabian |
|
@fhueske @stefanobortoli
Based on how things are implemented now, this would involve keeping a separate list of aggregate functions for the distinct aggregates, so that we can control when to accumulate these values. |
|
@rtudoran @fhueske the first implementation I made was with the state in the ProcessFunction without code generated aggregation function. Second, I pushed a branch with the state in the process function using the code generated process function. Then, third I moved the state within the code generated function. It is not clear to me why the state cannot be within the code generated function. Could you please clarify so that we can understand whether it is worth working around it. This feature is quite important for us. Anyway, you could have a look at the branch that uses the state in the process function and uses the code generated aggregation functions. Basically, rather than generate one code generated function for all the aggregations, I create one class for each, and then I call the corresponding accumulate/retract using the distinct logic when marked in the process function. |
|
Hi @huawei-flink, I did not say that we need to move the state out of the code-gen'd function. We can and should leave the PR as it is. However, we cannot use this code for distinct group window aggregates but only for distinct over aggregates once they are supported in the API. |
|
@fhueske thanks for the clarification. I think it is also good to have the solution for the over windows :) I also wanted to ask you about Calcite and the DISTINCT/DIST syntax. What do you think is the right plan to proceed? Do we push it with DIST, sync with the Calcite community on when they will have the next release, and then create a pull request to upgrade the Calcite version used? |
|
@fhueske , this is the PR with the code generated distinct aggregation for OVER. You mentioned that the value of the aggregation should be a Row, but what is kept in the distinct state is just the event value, not its "aggregation value state". Perhaps you can try to explain it better to me so that I can complete this PR and we can move on. What do you think? |
fhueske
left a comment
Hi @huawei-flink, thanks for the PR and sorry for the long time it took me to review your code. The overall logic looks good, but I think we need to add a bit more to it:
- We need support for user-defined aggregate functions with multiple parameters.
- A deduplication of the distinct attributes would be good to reduce the size of the state.
- The generated function should be tested with unit tests. The HarnessTestBase provides the infrastructure for state-related tests (RuntimeContext, access to the number of state objects).
What do you think?
Best, Fabian
val initDist: String = if( distinctAggsFlags.isDefined ) {
  val statePackage = "org.apache.flink.api.common.state"
  val distAggsFlags = distinctAggsFlags.get
  for(i <- distAggsFlags.indices) yield
please add a space after for, if, while, etc.
It would be good to share distinct state across aggregations if they are on the same fields, i.e., make distinct state distinct ;-). I would do this in a preprocessing step in the generateAggregations method.
It would also change how we access the state, because we may increment / decrement a key only once per record (call of accumulate) if state is shared.
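As a rough illustration of the preprocessing step suggested above, the following sketch assigns one shared distinct-state slot to every DISTINCT aggregate that reads the same argument fields. The class and method names (DistinctStateSharing, assignStateSlots) are hypothetical and not part of the PR, and plain Java arrays and collections stand in for the code generator's own data structures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not taken from the PR.
public class DistinctStateSharing {

    /**
     * Returns, for each aggregate, the index of the distinct state it should use,
     * or -1 if the aggregate is not DISTINCT. Aggregates with identical argument
     * fields receive the same index, so they can share one state object.
     */
    public static int[] assignStateSlots(int[][] aggFields, boolean[] distinctFlags) {
        Map<List<Integer>, Integer> slotByFields = new HashMap<>();
        int[] slots = new int[aggFields.length];
        for (int i = 0; i < aggFields.length; i++) {
            if (!distinctFlags[i]) {
                slots[i] = -1;
                continue;
            }
            // Use the list of argument field indices as the deduplication key.
            List<Integer> key = new ArrayList<>();
            for (int field : aggFields[i]) {
                key.add(field);
            }
            Integer slot = slotByFields.get(key);
            if (slot == null) {
                slot = slotByFields.size();
                slotByFields.put(key, slot);
            }
            slots[i] = slot;
        }
        return slots;
    }

    public static void main(String[] args) {
        int[][] aggFields = {{0}, {0}, {1, 2}, {0}};
        boolean[] distinct = {true, false, true, true};
        // Prints [0, -1, 1, 0]: the first and last aggregate share slot 0.
        System.out.println(Arrays.toString(assignStateSlots(aggFields, distinct)));
    }
}
```

With shared slots, the per-record count update for a slot would have to happen once per slot rather than once per aggregate, which is the access-pattern change mentioned in the comment.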
val distAggsFlags = distinctAggsFlags.get
for(i <- distAggsFlags.indices) yield
  if(distAggsFlags(i)) {
    val typeString = javaTypes(aggFields(i)(0))
UDAGGs can have more than a single parameter.
Since the key can only be a single object, we have to put all arguments into a TupleX (I think the limitation of 25 fields is reasonable and we can throw an exception if we observe a DISTINCT UDAGG with more than 25 fields).
sounds good
Actually, why shouldn't we use Rows directly? Is there any specific reason to prefer Tuple?
That's actually a very good point! We have to use Row because Tuple doesn't support null values.
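To make the null-handling requirement concrete, here is a minimal sketch of a composite key that tolerates null arguments. NullableArgsKey is a hypothetical stand-in written for illustration only; it is not Flink's Row class and does not reproduce its API, and a plain HashMap stands in for the distinct state.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Row-like key; not Flink's Row.
public class NullableArgsKey {
    private final Object[] fields;

    public NullableArgsKey(Object... fields) {
        // Defensive copy so later changes to the caller's array do not alter the key.
        this.fields = fields.clone();
    }

    @Override
    public boolean equals(Object other) {
        // Arrays.equals treats two null elements at the same position as equal.
        return other instanceof NullableArgsKey
            && Arrays.equals(fields, ((NullableArgsKey) other).fields);
    }

    @Override
    public int hashCode() {
        // Arrays.hashCode accepts null elements.
        return Arrays.hashCode(fields);
    }

    public static void main(String[] args) {
        Map<NullableArgsKey, Long> counts = new HashMap<>();
        counts.merge(new NullableArgsKey("a", null), 1L, Long::sum);
        counts.merge(new NullableArgsKey("a", null), 1L, Long::sum);
        counts.merge(new NullableArgsKey("a", 1), 1L, Long::sum);
        // Two distinct argument combinations: ("a", null) and ("a", 1).
        System.out.println(counts.size());
    }
}
```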
}.mkString("\n")
if (distinctAggsFlags.isDefined && distAggsFlags(i)){
  j"""
    | Long distValCount$i = (Long) distStateList[$i].get(${parameters(i)});
We need to put the parameters into a tuple before accessing the map state. The tuple should be reused.
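The count lookup in the hunk above is the core of the distinct logic: a value only reaches the underlying aggregate when its count goes from 0 to 1, and is only retracted when its count drops back to 0. The sketch below shows that idea for a single-argument SUM(DISTINCT x). DistinctSum is a hypothetical class, not the generated code, and a HashMap stands in for the keyed MapState that the generated function would use.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of count-based distinct accumulate/retract.
public class DistinctSum {
    // Stand-in for the distinct state: value -> number of occurrences seen.
    private final Map<Long, Long> distinctCounts = new HashMap<>();
    private long sum = 0L;

    public void accumulate(long value) {
        long count = distinctCounts.getOrDefault(value, 0L);
        if (count == 0L) {
            // First occurrence: forward the value to the actual aggregate.
            sum += value;
        }
        distinctCounts.put(value, count + 1);
    }

    public void retract(long value) {
        Long count = distinctCounts.get(value);
        if (count == null) {
            return; // Nothing to retract for an unseen value.
        }
        if (count == 1L) {
            // Last occurrence leaves the window: retract from the aggregate.
            distinctCounts.remove(value);
            sum -= value;
        } else {
            distinctCounts.put(value, count - 1);
        }
    }

    public long getValue() {
        return sum;
    }
}
```

For a multi-argument aggregate, the map key would be the composite of all arguments rather than a single value, as discussed in the thread.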
  }
}
if(existDistinct){
  val initReusMember = {
initReusMember -> initReuseMember
The code for the distinct deduplication and tuple reuse for multi-argument aggregate functions could go here.
def initialize(ctx: RuntimeContext)

/**
 * Sets the results of the aggregations (partial or final) to the output row.
Why would you like to change this comment?
I think it was some formatting issue
|
@fhueske thanks for the comments. Did we include the latest calcite already? |
|
Not yet, but AFAIK @twalthr is working on that. |
|
Hi @huawei-flink , I'll close this PR later today when merging #5555. Best, Fabian |
- distinct values are stored in a MapView, either on the heap or in a StateBackend depending on the context. This closes apache#5555. This closes apache#3783 (PR was superseded by apache#5555)