[FLINK-11074] [table][tests] Enable harness tests with RocksdbStateBackend and add harness tests for CollectAggFunction #7253

Closed
wants to merge 4 commits into from

@dianfu dianfu commented Dec 6, 2018

What is the purpose of the change

This pull request enables the harness tests to run with RocksdbStateBackend and adds a harness test for CollectAggFunction.

Brief change log

  • Adds AggFunctionHarnessTest
  • Fixes an issue in CollectAggFunction

Verifying this change

  • Added tests in AggFunctionHarnessTest

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

Contributor

@walterddr walterddr left a comment

Thanks for putting the PR together so quickly, @dianfu. I haven't taken a close look yet; I will check it out and play around with it over the weekend. I have just left a few suggestions based on a quick look. Please take a look and see if they make sense :-)


@Test
def testCollectAggregate(): Unit = {
val processFunction = new LegacyKeyedProcessOperator[String, CRow, CRow](
Contributor

This is deprecated, right?

Contributor Author

Correct. That's because GroupAggProcessFunction extends ProcessFunction, not KeyedProcessFunction, so we have to use LegacyKeyedProcessOperator at present. We can definitely improve GroupAggProcessFunction to extend KeyedProcessFunction. How about doing that in a separate ticket?

def testCollectAggregate(): Unit = {
val processFunction = new LegacyKeyedProcessOperator[String, CRow, CRow](
new GroupAggProcessFunction(
genCollectAggFunction,
Contributor

@walterddr walterddr Dec 7, 2018

Since this is the only place related to the actual agg function, we can probably make this more generic (see how the window operator test harness was implemented).

Contributor Author

I have reimplemented this PR. Please take a look and check whether this issue still exists.

expectedOutput.add(new StreamRecord(CRow("aaa", Map(1 -> 1).asJava), 1))
testHarness.processElement(new StreamRecord(CRow(2L: JLong, 1: JInt, "bbb"), 1))
expectedOutput.add(new StreamRecord(CRow("bbb", Map(1 -> 1).asJava), 1))

Contributor

@walterddr walterddr Dec 7, 2018

Can we add something like:

// do a snapshot & close
State snapshot = testHarness.snapshot(0L, 0L);
testHarness.close();
// reopen and restore
testHarness = createTestHarness(operator);
testHarness.setup();
testHarness.initializeState(snapshot);
testHarness.open();

This will catch some of the weird serialization/deserialization problems as well.
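
To illustrate why the suggested snapshot, close, restore cycle catches serialization problems, here is a minimal, self-contained sketch. It does not use Flink's test harness: `CollectOperator` and its `snapshot`/`initializeState` methods are hypothetical stand-ins that only mirror the shape of the calls in the comment above, and the byte encoding is an assumption made for the example.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object SnapshotRestoreSketch {
  // Hypothetical stand-in for a keyed operator: counts occurrences of each value per key.
  final class CollectOperator {
    private var state: Map[String, Map[Int, Int]] = Map.empty

    // Updates the per-key counts and returns the counts for that key.
    def processElement(key: String, value: Int): Map[Int, Int] = {
      val counts = state.getOrElse(key, Map.empty[Int, Int])
      val updated = counts.updated(value, counts.getOrElse(value, 0) + 1)
      state = state.updated(key, updated)
      updated
    }

    // Writes the whole state to bytes, as a checkpoint would.
    def snapshot(): Array[Byte] = {
      val bytes = new ByteArrayOutputStream()
      val out = new DataOutputStream(bytes)
      out.writeInt(state.size)
      for ((key, counts) <- state) {
        out.writeUTF(key)
        out.writeInt(counts.size)
        for ((value, count) <- counts) {
          out.writeInt(value)
          out.writeInt(count)
        }
      }
      out.flush()
      bytes.toByteArray
    }

    // Rebuilds the state from bytes; a bug in either direction shows up after restore.
    def initializeState(snapshot: Array[Byte]): Unit = {
      val in = new DataInputStream(new ByteArrayInputStream(snapshot))
      state = (0 until in.readInt()).map { _ =>
        val key = in.readUTF()
        val counts = (0 until in.readInt()).map(_ => in.readInt() -> in.readInt()).toMap
        key -> counts
      }.toMap
    }
  }
}
```

A test built this way processes some elements, restores a fresh operator from the snapshot, and then asserts that further elements produce the same output as an operator that was never restarted.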

Contributor Author

Good idea. What about adding this kind of test to the operator tests, such as GroupAggregateHarnessTest, JoinHarnessTest, etc.? I think it is more useful for operator tests.

Contributor Author

dianfu commented Dec 7, 2018

@twalthr @sunjincheng121 @walterddr I have added an AggFunctionHarnessTest that takes a different approach from the current harness tests: the operator under test is obtained by compiling a SQL query that generates it, rather than by wrapping a hand-written GeneratedAggregate directly. I think this has the following benefits:

  1. The harness test can also cover AggregateCodeGenerator, as the operator is now code generated. For example, we are now able to test whether the DataView is actually replaced with StateDataView.
  2. Writing a new harness test becomes easy. We just need to construct a SQL query that generates the operator we want to test.

Currently I have not changed the other harness tests. We can improve them in the same way in follow-up PRs once we are sure this direction is what we want.

Could you help to take a look at this PR? Thanks a lot.

Contributor

@walterddr walterddr left a comment

Thanks for the quick update, @dianfu. I like the way the operator code is now constructed through SQL instead of hard-coded codegen. My only concern is that this might make the test more of an "ITCase" than a unit test, since test exceptions can now also come from compilation. Maybe @twalthr can provide more insight on this.

Overall the change looks very good to me. I only added minor comments. If we all agree with how the harness test codegen can be replaced by actual compilation, we can definitely work together and change others as well. 👍

@@ -87,7 +85,7 @@ class HarnessTestBase {
new RowTypeInfo(distinctCountAggregates.map(getAccumulatorTypeOfAggregateFunction(_)): _*)

protected val distinctCountDescriptor: String = EncodingUtils.encodeObjectToString(
new MapStateDescriptor("distinctAgg0", distinctCountAggregationStateType, Types.LONG))
Contributor

This change doesn't seem necessary

Contributor Author

Done.

} else {
extractExpectedTransformation(one.getInput, prefixOperatorName)
}
case two: TwoInputTransformation[_, _, _] =>
Contributor

Let's throw an unsupported-operation exception for now, since no code path exercises a two-input transformation yet. We can always add this logic later when necessary.
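
As a rough illustration of this suggestion, the sketch below walks upstream through a chain of transformations looking for an operator by name prefix and fails fast on a two-input node. The `Transformation` hierarchy and `extractExpected` are simplified, hypothetical models written for this example; they are not Flink's classes and not the code in this PR.

```scala
object TransformationWalkSketch {
  // Hypothetical, simplified model of a transformation graph (not Flink's classes).
  sealed trait Transformation { def name: String }
  final case class Source(name: String) extends Transformation
  final case class OneInput(name: String, input: Transformation) extends Transformation
  final case class TwoInput(name: String, left: Transformation, right: Transformation)
      extends Transformation

  // Walks upstream until a one-input operator whose name starts with the prefix is found.
  @scala.annotation.tailrec
  def extractExpected(t: Transformation, prefix: String): OneInput = t match {
    case one: OneInput if one.name.startsWith(prefix) => one
    case one: OneInput => extractExpected(one.input, prefix)
    case two: TwoInput =>
      // Fail loudly instead of guessing which branch to follow.
      throw new UnsupportedOperationException(
        s"Two-input transformation '${two.name}' is not supported yet")
    case src: Source =>
      throw new NoSuchElementException(
        s"No operator with prefix '$prefix' found before source '${src.name}'")
  }
}
```

Throwing here keeps the helper honest: a future test that reaches a join gets a clear error rather than a silently wrong operator.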

Contributor Author

Done

testHarness.close()
}

private def getState(
Contributor

This can probably be put into HarnessTestBase as well. As of now, I can only imagine the operator needing to be accessible when we have to manipulate its internal state to mock state backend operations.

Contributor Author

Makes sense. Done.

Contributor Author

@dianfu dianfu left a comment

@walterddr Thanks a lot for your review and comments. I have updated the PR accordingly.

Contributor Author

dianfu commented Dec 11, 2018

@twalthr Sorry to interrupt. It would be great if you could share some thoughts on the changes in this PR. Thanks in advance.

Contributor Author

dianfu commented Dec 13, 2018

@twalthr @sunjincheng121 Could you help to take a look at this PR? You can see from the harness test of #7286 that this change makes it very easy to write a new harness test.

@sunjincheng121 sunjincheng121 self-assigned this Dec 13, 2018
Member

@sunjincheng121 sunjincheng121 left a comment

Hi @dianfu, thanks for the great work. I like your approach to the harness test, which overrides the codegen logic and covers testing of the accumulator's use of the DataView. Very good!
I have only left two minor suggestions.
Thanks,
Jincheng

expectedOutput.add(new StreamRecord(CRow(false, "aaa", Map(1 -> 2).asJava), 1))
expectedOutput.add(new StreamRecord(CRow("aaa", Map(1 -> 2, 2 -> 1).asJava), 1))

// remove some state: state may be cleaned up by the state backend if not accessed more than ttl
Member

more than ttl -> beyond ttl time?
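
The code comment under review describes state being dropped once it has gone unaccessed for longer than the TTL. A minimal sketch of that behavior follows. `TtlValueState` is a hypothetical class written for this example, with the clock passed in explicitly; the assumption that every read refreshes the access time is taken from the wording of the comment, not from Flink's implementation.

```scala
object TtlStateSketch {
  // Hypothetical single-value state that expires when idle for longer than ttlMillis.
  final class TtlValueState[T](ttlMillis: Long) {
    private var entry: Option[(T, Long)] = None // (value, last access time)

    def update(value: T, now: Long): Unit =
      entry = Some((value, now))

    // Returns the value and refreshes the access time, or clears the state if it has expired.
    def value(now: Long): Option[T] = entry match {
      case Some((v, lastAccess)) if now - lastAccess <= ttlMillis =>
        entry = Some((v, now))
        Some(v)
      case _ =>
        entry = None
        None
    }
  }
}
```

In a harness test the same effect is usually simulated by removing the state by hand and then checking that the operator behaves as if the key had never been seen.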

@@ -491,13 +491,83 @@ class HarnessTestBase {
distinctCountFuncName,
distinctCountAggCode)

def createHarnessTester[KEY, IN, OUT](
dataStream: DataStream[_],
prefixOperatorName: String)
Member

How about adding aggFieldAlias: String = "" to handle scenarios where GROUP BY appears in multiple UNION branches? E.g.:

      (SELECT b, max(a) as maxA FROM T GROUP BY b)
       UNION (
         SELECT b, min(a) as minA FROM (
          SELECT a, b FROM T GROUP BY a, b
         ) GROUP BY b
       )

We would then call this method as follows: createHarnessTester(xx, "groupBy", "minA")
I haven't found a case that has to be tested this way; it is just a suggested enhancement.
What do you think?
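
A small sketch of why a prefix alone can be ambiguous for the query above, and how an optional alias would disambiguate. `findOperator` and the operator name strings are hypothetical and only loosely modeled on names of the form "groupBy: (...), select: (...)"; this is not the signature of createHarnessTester.

```scala
object AliasLookupSketch {
  // Hypothetical lookup: the first name with the given prefix that also mentions the alias.
  // The default alias "" matches every name, so existing callers keep their behavior.
  def findOperator(
      names: Seq[String],
      prefix: String,
      aggFieldAlias: String = ""): Option[String] =
    names.find(n => n.startsWith(prefix) && n.contains(aggFieldAlias))

  // Hypothetical names for the three GROUP BY operators in the UNION query.
  val unionPlan: Seq[String] = Seq(
    "groupBy: (b), select: (b, MAX(a) AS maxA)",
    "groupBy: (a, b), select: (a, b)",
    "groupBy: (b), select: (b, MIN(a) AS minA)")
}
```

With only the prefix "groupBy" the lookup returns the maxA operator; passing "minA" as the alias selects the third one instead.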

Contributor Author

createHarnessTester will be used not only in agg-related tests but also in other harness tests, such as stream join tests, temporal join tests, sort tests, etc. So a field named aggFieldAlias seems a little weird from my point of view. What about adding it when we really need it? By then we may have a better idea of what such a field should look like. Thoughts?

Member

Hmm, if this tester will be used in join tests, we should add a TwoInputTransformation check to the extractExpectedTransformation logic, and we would then also need to add an xxName parameter (maybe not named aggFieldAlias), e.g. for scenarios with multiple joins.

Contributor Author

Yes, you're right. Will do that when updating the join related harness tests.

Contributor Author

dianfu commented Dec 14, 2018

@sunjincheng121 Thanks a lot for your review, much appreciated! I have updated the PR accordingly.

@sunjincheng121
Member

Thanks for the update and feedback, @dianfu.
This PR lays the groundwork for the harness tests, and I am fine with improving the tester in your follow-up JIRAs.
From my point of view, the PR looks good.

Best,
Jincheng

@sunjincheng121
Member

Merging...

@asfgit asfgit closed this in 9a45fca Dec 17, 2018
Contributor Author

dianfu commented Dec 17, 2018

@sunjincheng121 @walterddr Thanks a lot for the review and merge. Much appreciated!

tisonkun pushed a commit to tisonkun/flink that referenced this pull request Jan 17, 2019
…ckend and add harness tests for CollectAggFunction

This closes apache#7253
@dianfu dianfu deleted the optimize_harness_test branch June 10, 2020 02:49