
Conversation

@xintongsong
Contributor

What is the purpose of the change

This PR fixes YarnClusterDescriptorTest#testFailIfTaskSlotsHigherThanMaxVcores and #testConfigOverwrite, which should have failed long ago but were masked by another problem.

The original purpose of these two test cases was to verify the validation logic against the Yarn max allocation vcores. These two cases should have started failing when we changed the validation logic to get the Yarn max allocation vcores from the yarnClient instead of from the configuration, because no Yarn cluster (not even a MiniYARNCluster) is started in these cases, so yarnClient#getNodeReports will never return.

The cases did not fail because another IllegalConfigurationException was thrown in validateClusterSpecification due to a memory validation failure. That memory validation failure is by design; to verify their original purpose, these two test cases should have been updated with reasonable memory sizes, which was unfortunately overlooked.
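To make the masking effect concrete, here is a minimal sketch of the test shape (hypothetical; the helper method is an illustrative assumption, not the actual test code):

@Test
public void testFailIfTaskSlotsHigherThanMaxVcores() {
	// Hypothetical shape of the affected tests: only the exception type is asserted,
	// so a failure in the memory validation satisfies the assertion just as well as
	// the intended failure in the vcores validation.
	try {
		// With masterMemoryMB = 1 and taskManagerMemoryMB = 1, validateClusterSpecification
		// throws IllegalConfigurationException for the memory check first, so the vcores
		// check (the behavior this test is meant to verify) is never reached.
		deployWithSlotsExceedingYarnMaxVcores(); // hypothetical helper
		fail("The deployment should have failed.");
	} catch (IllegalConfigurationException expected) {
		// The test passes, but for the wrong reason.
	}
}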

Brief change log

  • 04ffffe: Update memory setups to uncover the problem.
    • I leave this as a separate commit for the convenience of code review. It should be squashed at merge time.
  • 54fdafd: Fix test cases by mocking the yarn max allocation vcores.

Verifying this change

This change is already covered by existing tests.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@xintongsong
Contributor Author

cc @azagrebin

@flinkbot
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 54fdafd (Tue Jan 14 10:09:50 UTC 2020)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@xintongsong
Contributor Author

@flinkbot
Collaborator

flinkbot commented Jan 14, 2020

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

.setMasterMemoryMB(1)
.setTaskManagerMemoryMB(1)
.setNumberTaskManagers(1)
.setTaskManagerMemoryMB(1024)
Contributor

Do we really need to set the TaskManagerMemoryMB to 1024? Or is the default value (768) enough?

Contributor Author

The default is not enough. It would cause a memory validation failure, because the derived network memory is smaller than the default minimum.

Contributor

Should we actually try to change the default ClusterSpecificationBuilder.taskManagerMemoryMB to 1024 if it is already clear that it is inconsistent in general? This would also give us more confidence that what we are fixing now does not happen in other tests as well.

Contributor Author

I'll give it a try and change the default to 1024, to see if it breaks anything.

I'm actually thinking about removing ClusterSpecification entirely. I've visited all the usages of this class, and it seems to me that all of its information can be fetched directly from the configuration. However, I'd like to keep that out of the scope of this PR.

Contributor Author

Making ClusterSpecificationBuilder#taskManagerMemoryMB default to 1024 turns out to work well; no tests break due to this change. https://travis-ci.org/xintongsong/flink/builds/637743056
I pushed a fixup commit to this PR.

.setTaskManagerMemoryMB(1)
.setNumberTaskManagers(1)
.setSlotsPerTaskManager(1)
.setTaskManagerMemoryMB(1024)
Contributor

Same as above.

Contributor Author

Same here.

@xintongsong force-pushed the FLINK-15564-yarn-descriptor-test branch from 54fdafd to 494422c on January 14, 2020 11:48
Contributor

@wangyang0918 left a comment

Thanks for your contribution. LGTM.

@xintongsong xintongsong requested a review from azagrebin January 15, 2020 11:33
Contributor

@azagrebin left a comment

Thanks for fixing this @xintongsong
The idea looks good to me, I have left some questions.

}

@VisibleForTesting
protected int getNumYarnMaxVcores() throws YarnDeploymentException {
Contributor

I am wondering whether it would actually be less invasive to introduce a StubYarnClientImpl and use it in YarnClusterDescriptorTest#setup, instead of changing the production class YarnClusterDescriptor for this:

class StubYarnClientImpl extends YarnClient {
	@Override
	public List<NodeReport> getNodeReports(NodeState... states) {
		return Collections.singletonList(new NodeReport() {
			@Override
			public Resource getCapability() {
				return new Resource() {
					@Override
					public int getVirtualCores() {
						return NUM_YARN_MAX_VCORES;
					}
					// ...
				};
			}
			// ...
		});
	}
	// ...
}

public class YarnClusterDescriptorTest extends TestLogger {
	// ...
	@BeforeClass
	public static void setupClass() {
		yarnConfiguration = new YarnConfiguration();
		yarnClient = new StubYarnClientImpl();
		yarnClient.init(yarnConfiguration);
		yarnClient.start();
	}
	// ...
}

It is quite a lot of useless code, but we can put it into a separate file.

Contributor Author

I'm not sure about this.

YarnClient is an abstract class, and to introduce a StubYarnClientImpl we would also need to implement more than 20 useless methods and mock the NodeReport as well. The complication seems unnecessary to me.

It is true that the current approach touches the production class YarnClusterDescriptor, but only with a trivial refactoring. IMO, even without this testability issue, it is not a bad thing to extract the logic for getting the max vcores from Yarn into a separate method.

Contributor

I agree with Andrey that overriding production code methods should always be our last resort when it comes to testing. The danger is that this method evolves and that we override important behaviour unintentionally. I would propose two solutions to the problem:

  1. Introduce a YarnClusterInformationRetriever interface which offers the method getMaxVcores. The default implementation will simply use the YarnClient to retrieve the max vcores. In the test we can provide a testing implementation (a sketch follows below).
  2. Alternatively, similar to TestingYarnClient override the getNodeReports method from YarnClientImpl.
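
A minimal sketch of what option 1 could look like (the interface name and getMaxVcores follow the proposal above; the implementation class name, the exception type, and the max-over-nodes logic are illustrative assumptions, not necessarily what was merged; imports omitted and the two types sketched together for brevity):

/** Sketch of the proposed abstraction for retrieving Yarn cluster information. */
@FunctionalInterface
public interface YarnClusterInformationRetriever {
	int getMaxVcores() throws FlinkException;
}

/** Illustrative default implementation backed by the YarnClient. */
class YarnClientYarnClusterInformationRetriever implements YarnClusterInformationRetriever {

	private final YarnClient yarnClient;

	YarnClientYarnClusterInformationRetriever(YarnClient yarnClient) {
		this.yarnClient = yarnClient;
	}

	@Override
	public int getMaxVcores() throws FlinkException {
		try {
			// Take the largest vcore capability reported by any running node.
			return yarnClient.getNodeReports(NodeState.RUNNING).stream()
				.mapToInt(nodeReport -> nodeReport.getCapability().getVirtualCores())
				.max()
				.orElse(0);
		} catch (YarnException | IOException e) {
			throw new FlinkException("Could not retrieve the maximum number of vcores from Yarn.", e);
		}
	}
}

Since the interface has a single method, a test can stub it with a lambda instead of implementing the 20+ abstract methods of YarnClient mentioned above.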

Contributor

I also looked into YarnClientImpl. My only concern about this option was that it is annotated with @Private and @Unstable, but I had not seen that we actually already decided once to extend it. I would be OK with it. I guess we will refactor it if this class is gone after updating the Yarn dependency.

Contributor

@tillrohrmann left a comment

Thanks for creating this PR @xintongsong. I agree with Andrey that overriding production code methods in order to test something should always be our last resort and is usually a code/testing smell. It shows that the code is not modular enough to test it properly. Hence, I would propose either introducing a YarnClusterInformationRetriever interface, which encapsulates the retrieval logic and allows us to provide a testing implementation, or extending YarnClientImpl directly to override the getNodeReports method.


…lusterInformationRetriever for getting Yarn max allocation vcores.
I'm putting these changes in a separate commit for the convenience of code review.
This commit should be squashed with the subsequent commit, using the commit message of the latter.
…IfTaskSlotsHigherThanMaxVcores and #testConfigOverwrite not validating the original intended behaviors.
@xintongsong force-pushed the FLINK-15564-yarn-descriptor-test branch from e0f8d6d to 9968737 on January 16, 2020 13:46
@xintongsong
Contributor Author

Thanks for the review and explanations, @tillrohrmann and @azagrebin.

I think YarnClusterInformationRetriever sounds like a really good solution.

As for having a testing implementation of YarnClient, my concern is that we would need to introduce too many unused method implementations. And if the Yarn API is extended in later versions, we might need to further extend our testing implementation, or even run into dependency problems across various Hadoop versions.
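
With a single-method interface like the one sketched above, the test-side stub becomes trivial; a hypothetical example (the constant name follows NUM_YARN_MAX_VCORES from the earlier sketch, the value and field name are arbitrary assumptions):

// Hypothetical test-side stub for the retriever sketched earlier in this thread.
private static final int NUM_YARN_MAX_VCORES = 16;

private static final YarnClusterInformationRetriever TESTING_RETRIEVER =
	() -> NUM_YARN_MAX_VCORES;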

I've updated the PR with the YarnClusterInformationRetriever approach. Please take another look.

Contributor

@tillrohrmann left a comment

The changes look good to me. Thanks a lot for updating the PR @xintongsong. Merging this PR once Travis gives green light.

tillrohrmann added a commit to tillrohrmann/flink that referenced this pull request Jan 16, 2020
…criptor

For better testability this commit introduces the YarnClusterInformationRetriever which
is responsible for retrieving the maximum number of vcores.

This closes apache#10852.
tillrohrmann added a commit that referenced this pull request Jan 16, 2020
…criptor

For better testability this commit introduces the YarnClusterInformationRetriever which
is responsible for retrieving the maximum number of vcores.

This closes #10852.