[FLINK-5582] [streaming] Add a general distributive aggregate function #3186

StephanEwen · 2017-01-20T19:42:33Z

The DataStream API currently has two aggregation functions that can be used on windows and in state, both of which have limitations:

ReduceFunction only supports one type as the type that is added and aggregated/returned.
FoldFunction supports different types to add and return, but is not distributive, i.e. it cannot be used for hierarchical aggregation, for example to split the aggregation into a pre-aggregation and a final aggregation.

This pull request adds a generic and powerful aggregation function that supports:

Different types to add, accumulate, and return
The ability to merge partial aggregates by merging their accumulators.

The requirement for this addition came from both the Table API and from plans to improve the efficiency of internal state handling.

Part 1: Cleaning up the internal state abstraction

The main change is in commit ca5ce84

State interfaces (like ListState, ValueState, ReducingState) are very sparse and contain only the methods exposed to users. That makes sense, because it keeps the stable public API minimal.
At the same time, the runtime needs more methods for its internal interaction with state, such as:

setting namespaces
accessing raw values
merging namespaces

These are currently realized by re-creating or re-obtaining the state objects from the KeyedStateBackend. That approach causes considerable overhead on each access to the state. The KeyedStateBackend tries some tricks to reduce that overhead, but does so only partially and introduces other overhead in the process.

The root cause of all these issues is a problem in the design: there is no proper "internal state abstraction" comparable to the external state abstraction (the public state API).
This pull request adds a similar hierarchy of states for the internal methods:

             State
               |
               +-------------------InternalKvState
               |                         |
          MergingState                   |
               |                         |
               +-----------------InternalMergingState
               |                         |
      +--------+------+                  |
      |               |                  |
 ReducingState    ListState        +-----+-----------------+
      |               |            |                       |
      +-----------+   +-----------   -----------------InternalListState
                  |                |
                  +---------InternalReducingState
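
The hierarchy above can be sketched in plain Java. This is a minimal illustration, not the code from the commit: the interfaces are simplified local stand-ins, the method names setCurrentNamespace and mergeNamespaces are assumptions chosen to match the "setting namespaces" and "merging namespaces" items listed earlier, and HeapListState is a hypothetical toy implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InternalStateSketch {

    // Simplified stand-ins for the public, user-facing state interfaces.
    interface State { void clear(); }
    interface MergingState<IN, OUT> extends State { OUT get(); void add(IN value); }
    interface ListState<T> extends MergingState<T, Iterable<T>> { }

    // Internal counterparts: they add runtime-only methods (names are assumptions).
    interface InternalKvState<N> extends State {
        void setCurrentNamespace(N namespace);
    }
    interface InternalMergingState<N, IN, OUT> extends InternalKvState<N>, MergingState<IN, OUT> {
        void mergeNamespaces(N target, Collection<N> sources);
    }
    interface InternalListState<N, T> extends InternalMergingState<N, T, Iterable<T>>, ListState<T> { }

    /** Hypothetical heap-backed list state, keyed by namespace only. */
    static final class HeapListState<N, T> implements InternalListState<N, T> {
        private final Map<N, List<T>> byNamespace = new HashMap<>();
        private N current;

        public void setCurrentNamespace(N namespace) { current = namespace; }

        public Iterable<T> get() {
            return byNamespace.getOrDefault(current, Collections.<T>emptyList());
        }

        public void add(T value) {
            byNamespace.computeIfAbsent(current, k -> new ArrayList<>()).add(value);
        }

        public void clear() { byNamespace.remove(current); }

        public void mergeNamespaces(N target, Collection<N> sources) {
            List<T> merged = byNamespace.computeIfAbsent(target, k -> new ArrayList<>());
            for (N source : sources) {
                if (source.equals(target)) {
                    continue; // nothing to move
                }
                List<T> part = byNamespace.remove(source);
                if (part != null) {
                    merged.addAll(part);
                }
            }
        }
    }

    public static void main(String[] args) {
        HeapListState<String, Integer> state = new HeapListState<>();
        state.setCurrentNamespace("w1");
        state.add(1);
        state.add(2);
        state.setCurrentNamespace("w2");
        state.add(3);
        state.mergeNamespaces("w1", Arrays.asList("w2"));
        state.setCurrentNamespace("w1");
        System.out.println(state.get()); // prints [1, 2, 3]
    }
}
```

The point of the split is that user code only ever sees ListState, while the runtime holds the same object as InternalListState and can switch or merge namespaces without re-obtaining it from the backend.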

Part 2: The AggregateFunction

The main change is in commit 4a9fe96

The proposed interface is below. This type of interface is found in many APIs, like that of various databases, and also in Apache Beam:

The accumulator is the state of the running aggregate
Accumulators can be merged
Values are added to the accumulator
Getting the result from the accumulator performs an optional finalizing operation

public interface AggregateFunction<IN, ACC, OUT> extends Function {

	/** create a holder for the intermediate state */
	ACC createAccumulator();

	/** adds a value into the accumulator */
	void add(IN value, ACC accumulator);

	/** Gets the aggregate from the accumulator, possibly finalizing it */
	OUT getResult(ACC accumulator);

	/** Merges two accumulators */
	ACC merge(ACC a, ACC b);
}

Example use:

public class AverageAccumulator {
    long count;
    long sum;
}

// implementation of a simple average
public class Average implements AggregateFunction<Integer, AverageAccumulator, Double> {

    public AverageAccumulator createAccumulator() {
        return new AverageAccumulator();
    }

    public AverageAccumulator merge(AverageAccumulator a, AverageAccumulator b) {
        a.count += b.count;
        a.sum += b.sum;
        return a;
    }

    public void add(Integer value, AverageAccumulator acc) {
        acc.sum += value;
        acc.count++;
    }

    public Double getResult(AverageAccumulator acc) {
        return acc.sum / (double) acc.count;
    }
}

// implementation of a weighted average
// this reuses the same accumulator type as the aggregate function for 'average'
public class WeightedAverage implements AggregateFunction<Datum, AverageAccumulator, Double> {

    public AverageAccumulator createAccumulator() {
        return new AverageAccumulator();
    }

    public AverageAccumulator merge(AverageAccumulator a, AverageAccumulator b) {
        a.count += b.count;
        a.sum += b.sum;
        return a;
    }

    public void add(Datum value, AverageAccumulator acc) {
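        // the total weight is the denominator; each value contributes in proportion to its weight
        acc.count += value.getWeight();
        acc.sum += value.getValue() * value.getWeight();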
    }

    public Double getResult(AverageAccumulator acc) {
        return acc.sum / (double) acc.count;
    }
}
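
To illustrate what "distributive" means here, the following sketch pre-aggregates two partitions separately and then merges the partial accumulators. It is only an illustration under stated assumptions: the interface is a local copy of the one proposed above (without the Flink Function supertype), and preAggregate and twoPhaseAverage are hypothetical helper names that are not part of this pull request.

```java
import java.util.Arrays;
import java.util.List;

class DistributiveAverageSketch {

    /** Local copy of the proposed interface, so that the example is self-contained. */
    interface AggregateFunction<IN, ACC, OUT> {
        ACC createAccumulator();
        void add(IN value, ACC accumulator);
        OUT getResult(ACC accumulator);
        ACC merge(ACC a, ACC b);
    }

    static final class AverageAccumulator {
        long count;
        long sum;
    }

    static final class Average implements AggregateFunction<Integer, AverageAccumulator, Double> {
        public AverageAccumulator createAccumulator() { return new AverageAccumulator(); }

        public void add(Integer value, AverageAccumulator acc) {
            acc.sum += value;
            acc.count++;
        }

        public Double getResult(AverageAccumulator acc) { return acc.sum / (double) acc.count; }

        public AverageAccumulator merge(AverageAccumulator a, AverageAccumulator b) {
            a.count += b.count;
            a.sum += b.sum;
            return a;
        }
    }

    /** Hypothetical helper: pre-aggregates one partition into a fresh accumulator. */
    static <IN, ACC> ACC preAggregate(AggregateFunction<IN, ACC, ?> fn, List<IN> partition) {
        ACC acc = fn.createAccumulator();
        for (IN value : partition) {
            fn.add(value, acc);
        }
        return acc;
    }

    /** Pre-aggregates both partitions independently, then merges and finalizes. */
    static double twoPhaseAverage(List<Integer> first, List<Integer> second) {
        Average avg = new Average();
        AverageAccumulator a = preAggregate(avg, first);
        AverageAccumulator b = preAggregate(avg, second);
        return avg.getResult(avg.merge(a, b));
    }

    public static void main(String[] args) {
        // Same result as averaging 1..6 in a single pass.
        System.out.println(twoPhaseAverage(Arrays.asList(1, 2, 3), Arrays.asList(4, 5, 6))); // prints 3.5
    }
}
```

A FoldFunction cannot be split this way, because it has no operation for combining two partial results.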

shaoxuan-wang

Hi, Stephan. Thanks for implementing this sweet new API for aggregates. I was trying to conduct an integration test with your new API, but it seems KVStateRequestSerializerRocksDBTest is broken. Can you please take a look?

Error:(62, 22) java: <N,T,ACC>createFoldingState(org.apache.flink.api.common.typeutils.TypeSerializer,org.apache.flink.api.common.state.FoldingStateDescriptor<T,ACC>) in org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend cannot override <N,T,ACC>createFoldingState(org.apache.flink.api.common.typeutils.TypeSerializer,org.apache.flink.api.common.state.FoldingStateDescriptor<T,ACC>) in org.apache.flink.runtime.state.AbstractKeyedStateBackend
return type org.apache.flink.api.common.state.FoldingState<T,ACC> is not compatible with org.apache.flink.runtime.state.internal.InternalFoldingState<N,T,ACC>

Error:(87, 53) java: incompatible types: org.apache.flink.api.common.state.ListState cannot be converted to org.apache.flink.runtime.state.internal.InternalListState<N,T>

shaoxuan-wang

I skipped (disabled) KVStateRequestSerializerRocksDBTest. Then with your new aggregate API on WindowedStream, I have tested some aggregates with tumble window as well as session window. It works pretty well (both tests returned expected results). Good job!

shaoxuan-wang · 2017-01-22T16:22:50Z

flink-core/src/main/java/org/apache/flink/api/common/functions/AggregateFunction.java

+
+	ACC createAccumulator();
+
+	void add(IN value, ACC accumulator);


As proposed in https://goo.gl/00ea5j, this function will handle not only accumulation but also retraction. Instead of "add", could you please consider using "update"?

Retractions are specific to the Table API. Do you expect this same interface to be used for user-defined aggregations there as well?

My first feeling is to keep the name add() because it fits better together with the term Accumulator. One can view retractions as adding negative values. What do you think about that?

Table API UDAGGs will eventually be translated to this windowStream API. Both accumulate and retract will be handled in this add function. I think it is OK if we "view retractions as adding negative values".
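
One way to picture "retractions as adding negative values" is an input record that carries a sign. This is a sketch under assumptions, not how the Table API implements retraction: SignedValue is a hypothetical type, and the static add and getResult methods only mirror the shape of the corresponding methods in the proposed interface.

```java
class RetractionSketch {

    /** Hypothetical input record: a value plus a sign (+1 = accumulate, -1 = retract). */
    static final class SignedValue {
        final long value;
        final int sign;

        SignedValue(long value, int sign) {
            this.value = value;
            this.sign = sign;
        }

        static SignedValue accumulate(long value) { return new SignedValue(value, +1); }
        static SignedValue retract(long value) { return new SignedValue(value, -1); }
    }

    static final class AverageAccumulator {
        long count;
        long sum;
    }

    /** Same shape as add(IN, ACC): a retraction subtracts the value and decrements the count. */
    static void add(SignedValue in, AverageAccumulator acc) {
        acc.sum += in.sign * in.value;
        acc.count += in.sign;
    }

    static double getResult(AverageAccumulator acc) {
        return acc.sum / (double) acc.count;
    }

    static double demo() {
        AverageAccumulator acc = new AverageAccumulator();
        add(SignedValue.accumulate(10), acc);
        add(SignedValue.accumulate(20), acc);
        add(SignedValue.accumulate(60), acc);
        add(SignedValue.retract(60), acc); // undo the last record
        return getResult(acc);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 15.0
    }
}
```

This only works for aggregates whose accumulator can be updated in both directions, such as sums and counts. A max or min would need a richer accumulator to support retraction.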

shaoxuan-wang · 2017-01-22T16:25:03Z

flink-core/src/main/java/org/apache/flink/api/common/functions/AggregateFunction.java

+
+	OUT getResult(ACC accumulator);
+
+	ACC merge(ACC a, ACC b);


Do you think it would be useful to extend the merge function to accept a list of ACC: ACC merge(List<ACC> a)? There are cases where merging a whole list of instances at once is much more efficient than merging only two instances at a time.

This could be done, true. It would currently mean a slight overhead (for list creation), but that is probably okay.

I would like to address that change in a separate pull request: We should probably adjust the merging state implementation as well, to exploit that new signature.
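
For illustration only, a list-based merge could sit next to the pairwise one as a default method. The name mergeAll and the default implementation are assumptions for this sketch; they are not part of this pull request, and the follow-up change may look different.

```java
import java.util.Arrays;
import java.util.List;

class ListMergeSketch {

    /** Local copy of the proposed interface plus a hypothetical list-based merge. */
    interface AggregateFunction<IN, ACC, OUT> {
        ACC createAccumulator();
        void add(IN value, ACC accumulator);
        OUT getResult(ACC accumulator);
        ACC merge(ACC a, ACC b);

        /** Hypothetical: falls back to pairwise merging; implementations may override it. */
        default ACC mergeAll(List<ACC> accumulators) {
            ACC result = createAccumulator();
            for (ACC acc : accumulators) {
                result = merge(result, acc);
            }
            return result;
        }
    }

    static final class AverageAccumulator {
        long count;
        long sum;
    }

    static final class Average implements AggregateFunction<Integer, AverageAccumulator, Double> {
        public AverageAccumulator createAccumulator() { return new AverageAccumulator(); }

        public void add(Integer value, AverageAccumulator acc) {
            acc.sum += value;
            acc.count++;
        }

        public Double getResult(AverageAccumulator acc) { return acc.sum / (double) acc.count; }

        public AverageAccumulator merge(AverageAccumulator a, AverageAccumulator b) {
            a.count += b.count;
            a.sum += b.sum;
            return a;
        }
    }

    /** Builds one partial accumulator from the given values. */
    static AverageAccumulator partial(Average fn, int... values) {
        AverageAccumulator acc = fn.createAccumulator();
        for (int v : values) {
            fn.add(v, acc);
        }
        return acc;
    }

    static double demo() {
        Average fn = new Average();
        List<AverageAccumulator> partials =
                Arrays.asList(partial(fn, 1, 2), partial(fn, 3), partial(fn, 4, 5, 6));
        return fn.getResult(fn.mergeAll(partials));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 3.5
    }
}
```

An implementation whose accumulators are expensive to combine pairwise (for example sorted structures) could override mergeAll to process the whole list in one pass.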

shaoxuan-wang

Added a few comments/suggestions on the AggregateFunction interface.

StephanEwen · 2017-01-22T16:33:49Z

Update: The first version had an issue with binary compatibility in the Scala DataStream API:

In Flink 1.0, the Scala API accidentally exposed a public method (an internal utility method) with an internal type as a parameter. That should not have been the case, since the method's compatibility cannot be guaranteed when a parameter is a non-public type.

Unfortunately, the automatic tooling for binary compatibility checks is not flexible enough to work around that: Exclusions for methods do not work if parameter types were altered.

Due to that, the newer version rolls back the cleanup commit that renames the internal AggregateFunction.

StephanEwen · 2017-01-22T20:20:54Z

@shaoxuan-wang I cannot reproduce the compile error you posted.
The latest commit also gets a green light from Travis CI: https://travis-ci.org/StephanEwen/incubator-flink/builds/194239397

EDIT: I think you have pulled in an intermediate commit from my private repository. That commit seems to have been broken, but the problem no longer exists in this branch.

This introduces an internal state hierarchy that mirrors the external state hierarchy, but gives the runtime access to methods that should not be part of the user-facing API, such as:

setting namespaces
accessing raw values
merging namespaces

StephanEwen · 2017-01-22T21:12:40Z

Since this constantly requires extensive merge conflict resolution against master, I want to merge this soon.

@shaoxuan-wang has tested it live, and the CI passes as well.

shaoxuan-wang reviewed Jan 22, 2017

StephanEwen force-pushed the aggregtions branch from 4a9fe96 to ce509d4 Compare January 22, 2017 16:31

StephanEwen changed the title ~~[FLINK-5582] [streaming] dd a general distributive aggregate function~~ [FLINK-5582] [streaming] Add a general distributive aggregate function Jan 22, 2017

StephanEwen added 7 commits January 22, 2017 21:22

[hotfix] Cleanups of the AbstractKeyedStateBackend

51c02ee

[hotfix] Remove no longer used Generic State classes

254a700

[hotfix] [streaming api] Non-merging triggers throw UnsupportedOperationException instead of RuntimeException

2f1c474

[hotfix] [runtime] Various code cleanups and reductions of warnings in heap state restoring code

b8a784e

[hotfix] Remove sysout logging in SavepointMigrationTestBase and fix several warnings.

2b3fd39

[FLINK-5582] [streaming] Add 'AggregateFunction' and 'AggregatingState'.

09380e4

The AggregateFunction implements a very flexible interface for distributive aggregations.

StephanEwen force-pushed the aggregtions branch from ce509d4 to 09380e4 Compare January 22, 2017 21:17

asfgit merged commit 09380e4 into apache:master Jan 22, 2017

rmetzger added the component=API/DataStream label Mar 14, 2019
