Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-8690][table]Support group window distinct aggregation on DataStream #5940

Closed
wants to merge 3 commits into from

Conversation

walterddr
Copy link
Contributor

What is the purpose of the change

  • Allow FlinkLogicalAggregate to support distinct aggregations on DataStream, while keeping DataSet to decompose distinct aggs into GROUP BY follow by normal aggregates.

Brief change log

  • Moved AggregateExpandDistinctAggregatesRule.JOIN to DATASET_NORM_RULES
  • Enabled DataStreamGroupWindowAggregate to support distinct agg while maintaining unsupported for [DataStream/DataSet]GroupAggregate.
  • Fixed typo in codegen for distinct aggregate when merge
  • Fixed a possible codegen test error for UNION ALL.

Verifying this change

  • Unit-test are added for DistinctAggregateTest
  • Added ITCase for distinct group window agg

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): yes (codegen)
  • Anything that affects deployment or recovery: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? not yet, should we put in Aggregate section or Group Window section? inputs are highly appreciated. Also distinct over aggregate is bug-fixed in FLINK-8689 but not documented.

@walterddr walterddr changed the title [FLINK-8690][table]Support DistinctAgg on DataStream [FLINK-8690][table]Support group window distinct aggregation on DataStream Apr 28, 2018
@@ -82,7 +82,7 @@ private class FlinkLogicalAggregateConverter
case _ => true
}

!agg.containsDistinctCall() && supported
supported

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think we need this extra local variable here after your change.

@@ -49,11 +49,16 @@ trait CommonAggregate {

val aggs = namedAggregates.map(_.getKey)
val aggStrings = aggs.map( a => s"${a.getAggregation}(${
if (a.getArgList.size() > 0) {
val prefix = if (a.isDistinct) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one line should be fine here, slightly more compact IMO.

val prefix = if (a.isDistinct) "DISTINCT " else ""

Copy link
Contributor

@fhueske fhueske left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @walterddr!

I noticed that some of the changes are based on @haohui's PR #3764. I think it would be good to put your changes on top of his commit.
I left a few comments that should be easy to address.

FYI, I also created a JIRA issue to support distinct aggregates also in non-windowed grouping queries.
Given the runtime support, this should be very easy to achieve and basically just require a few tests for validation.

Thanks, Fabian

@@ -641,7 +641,8 @@ class AggregationCodeGenerator(
| java.util.Map.Entry entry = (java.util.Map.Entry) mergeIt$i.next();
| Object k = entry.getKey();
| Long v = (Long) entry.getValue();
| if (aDistinctAcc$i.add(k, v)) {
| if (aDistinctAcc$i.add(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key in the entry is a Row already

@@ -641,7 +641,8 @@ class AggregationCodeGenerator(
| java.util.Map.Entry entry = (java.util.Map.Entry) mergeIt$i.next();
| Object k = entry.getKey();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change this line to ${classOf[Row].getCanonicalName} k = (${classOf[Row].getCanonicalName}) entry.getKey();

}

@Test
def testDistinctAggregateWithNonDistinctAndGrouping(): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test can be removed

def testDistinctAggWithMergeOnEventTimeSessionGroupWindow(): Unit = {
// create a watermark with 10ms offset to delay the window emission by 10ms to verify merge
val sessionWindowTestdata = List(
(1L, 1, "Hello"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is not checking for DISTINCT semantics since all aggregated values are distinct. We could do COUNT(DISTINCT num) (int has to be renamed to num because its a SQL keyword).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check the correct merge behavior, we need two windows which aggregate the same value that is than deduplicated in merge.

Some data like:

      (1L, 2, "Hello"), // 1. Hello window
      (2L, 2, "Hello"), // 1. Hello window, deduped
      (8L, 2, "Hello"), // 2. Hello window, deduped during merge
      (10L, 3, "Hello"), // 2. Hello window, forwarded during merge
      (9L, 9, "Hello World"), // 1. Hello World window
      (4L, 1, "Hello"), // 1. Hello window, triggering merge of 1. and 2. Hello windows
      (16L, 16, "Hello")) // 3. Hello window (not merged)

tEnv.registerTable("MyTable", table)

val sqlQuery = "SELECT string, " +
" COUNT(DISTINCT long) " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add the end timestamp of the windows (SESSION_END(rowtime, INTERVAL '0.005' SECOND)) to make it easier to eyeball the expected test results.

@walterddr
Copy link
Contributor Author

Thanks @suez1224 @fhueske for the comments. I will change them accordingly.

Yes I copied a lot of test cases from @haohui's PR for my own testing. I can definitely put it on top given the runtime support is already merged in #5555. Procedure-wise question: should I rebase his commit then add my change on top, then attached to this PR? I am not sure if there's a clever way to both preserve the discussion in this thread and rebase on top of his change.

@fhueske
Copy link
Contributor

fhueske commented May 4, 2018

Hmmm, good point. The discussion would be lost.
How about I put your changes on top of Haohui's changes before merging?

@walterddr
Copy link
Contributor Author

LOL. I think I found a way:

  1. Rebase [FLINK-6335] Parse DISTINCT over grouped windows in stream SQL. #3764 over to current master;
  2. Rebase this branch to the rebased [FLINK-6335] Parse DISTINCT over grouped windows in stream SQL. #3764;
  3. Make changes on top

:-)

Haohui Mai and others added 2 commits May 4, 2018 09:16
@fhueske
Copy link
Contributor

fhueske commented May 7, 2018

Thanks for the update @walterddr.
The PR is good to merge.

@fhueske
Copy link
Contributor

fhueske commented May 7, 2018

merging

fhueske pushed a commit to fhueske/flink that referenced this pull request May 7, 2018
@asfgit asfgit closed this in 53610c3 May 7, 2018
sampathBhat pushed a commit to sampathBhat/flink that referenced this pull request Jul 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants