[FLINK-15564][yarn][test] Fix YarnClusterDescriptorTest that failed to validate the original intended behavior #10852
Conversation
cc @azagrebin
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community review your pull request.

Automated Checks: last check on commit 54fdafd (Tue Jan 14 10:09:50 UTC 2020). Warnings:
Mention the bot in a comment to re-run the automated checks.

Review Progress: please see the Pull Request Review Guide for a full explanation of the review process. The bot tracks the review progress through labels, which are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands: the @flinkbot bot supports the following commands:
Travis passed: https://travis-ci.org/xintongsong/flink/builds/636391268
CI report:
flink-yarn/src/test/java/org/apache/flink/yarn/YarnClusterDescriptorTest.java
```diff
 		.setMasterMemoryMB(1)
-		.setTaskManagerMemoryMB(1)
+		.setTaskManagerMemoryMB(1024)
 		.setNumberTaskManagers(1)
```
Do we really need to set the TaskManagerMemoryMB to 1024? Or is the default value (768) enough?
The default is not enough. It would cause memory validation failure because the derived network memory is smaller than default min.
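To make the failure mode concrete, here is a toy sketch of a fraction-based network memory check. The fraction and minimum below are illustrative values only, not Flink's actual defaults; the point is just that a small total TaskManager memory can derive a network budget below a configured minimum:

```java
// Sketch of a fraction-based memory validation, as discussed above.
// FRACTION and MIN_NETWORK_MB are illustrative values, NOT Flink's real defaults.
public class NetworkMemoryCheckSketch {

	static final double FRACTION = 0.1;    // share of TM memory given to network buffers (assumed)
	static final int MIN_NETWORK_MB = 80;  // minimum network memory in MB (assumed)

	// Derive the network memory budget from the total TaskManager memory.
	static int derivedNetworkMb(int taskManagerMemoryMb) {
		return (int) (taskManagerMemoryMb * FRACTION);
	}

	// Validation fails when the derived network memory is below the minimum.
	static boolean passesValidation(int taskManagerMemoryMb) {
		return derivedNetworkMb(taskManagerMemoryMb) >= MIN_NETWORK_MB;
	}

	public static void main(String[] args) {
		// With these illustrative numbers, 768 MB fails while 1024 MB passes.
		if (passesValidation(768)) {
			throw new AssertionError("expected 768 to fail validation");
		}
		if (!passesValidation(1024)) {
			throw new AssertionError("expected 1024 to pass validation");
		}
		System.out.println("768 MB -> " + derivedNetworkMb(768)
			+ " MB network (too small), 1024 MB -> " + derivedNetworkMb(1024) + " MB network (ok)");
	}
}
```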
Should we actually try to change the default ClusterSpecificationBuilder.taskManagerMemoryMB to 1024 if it is already clear that it is inconsistent in general? This would also give us more confidence that what we are fixing now does not happen in other tests as well.
I'll give it a try to change the default to 1024 and see if it breaks anything.
I'm actually thinking about removing ClusterSpecification entirely. I've visited all the usages of this class, and it seems to me all its information can be fetched directly from the configuration. However, I'd like to scope this out of this PR.
Making ClusterSpecificationBuilder#taskManagerMemoryMB default to 1024 turns out to work well. No tests broke due to this change. https://travis-ci.org/xintongsong/flink/builds/637743056
I pushed a fixup commit to this PR.
```diff
-		.setTaskManagerMemoryMB(1)
+		.setTaskManagerMemoryMB(1024)
 		.setNumberTaskManagers(1)
 		.setSlotsPerTaskManager(1)
```
Same as above.
Same here.
Force-pushed 54fdafd to 494422c
wangyang0918 left a comment
Thanks for your contribution. LGTM.
azagrebin left a comment
Thanks for fixing this @xintongsong
The idea looks good to me, I have left some questions.
```java
	}

	@VisibleForTesting
	protected int getNumYarnMaxVcores() throws YarnDeploymentException {
```
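For illustration, the extracted method under discussion might look roughly like the following sketch. It uses stand-in interfaces in place of Hadoop's YarnClient, NodeReport, and Resource, and the assumed logic (take the largest vcore capability over all node reports) may differ from the actual code in the PR:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: stand-in types replace Hadoop's YarnClient/NodeReport/Resource.
public class VcoresSketch {

	interface Resource { int getVirtualCores(); }
	interface NodeReport { Resource getCapability(); }
	interface YarnClient { List<NodeReport> getNodeReports(); }

	// Assumed logic: the maximum vcore capability reported by any node.
	static int getNumYarnMaxVcores(YarnClient yarnClient) {
		return yarnClient.getNodeReports().stream()
			.mapToInt(report -> report.getCapability().getVirtualCores())
			.max()
			.orElse(0);
	}

	// Build a stub client reporting the given per-node vcore capabilities.
	static YarnClient stubWith(int... vcores) {
		List<NodeReport> reports = new ArrayList<>();
		for (int v : vcores) {
			reports.add(() -> () -> v);
		}
		return () -> reports;
	}

	public static void main(String[] args) {
		if (getNumYarnMaxVcores(stubWith(4, 16)) != 16) {
			throw new AssertionError("expected max vcores of 16");
		}
		System.out.println("max vcores = " + getNumYarnMaxVcores(stubWith(4, 16)));
	}
}
```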
I am wondering whether it is actually less invasive to introduce a StubYarnClientImpl and use it in YarnClusterDescriptorTest#setup instead of changing the production class YarnClusterDescriptor for this:
```java
class StubYarnClientImpl extends YarnClient {
	@Override
	public List<NodeReport> getNodeReports(NodeState... states) {
		return Collections.singletonList(new NodeReport() {
			@Override
			public Resource getCapability() {
				return new Resource() {
					@Override
					public int getVirtualCores() {
						return NUM_YARN_MAX_VCORES;
					}
					// ...
				};
			}
			// ...
		});
	}
	// ...
}
```
```java
public class YarnClusterDescriptorTest extends TestLogger {
	// ...
	@BeforeClass
	public static void setupClass() {
		yarnConfiguration = new YarnConfiguration();
		yarnClient = new StubYarnClientImpl();
		yarnClient.init(yarnConfiguration);
		yarnClient.start();
	}
	// ...
}
```
It is quite some useless code, but we can put it into a separate file.
I'm not sure about this.
YarnClient is an abstract class, and to introduce a StubYarnClientImpl we would also need to implement more than 20 useless methods and mock the NodeReport as well. The complication seems unnecessary to me.
It is true that the current approach touches the production class YarnClusterDescriptor, but only with a trivial refactor. IMO, even without this testability issue, it is not a bad thing to extract the logic for getting the max vcores from Yarn into a separate method.
I agree with Andrey that overriding production code methods should always be our last resort when it comes to testing. The danger is that this method evolves and that we override important behaviour unintentionally. I would propose two solutions to the problem:
- Introduce a `YarnClusterInformationRetriever` interface which offers the method `getMaxVcores`. The default implementation will simply use the `YarnClient` to retrieve the max vcores. In the test we can provide a testing implementation.
- Alternatively, similar to `TestingYarnClient`, override the `getNodeReports` method from `YarnClientImpl`.
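As a rough illustration of the first option, a minimal retriever could look like the sketch below. The interface name comes from the discussion; the exact method shape, the `fixedMaxVcores` helper, and `validateVcores` are assumptions made for this sketch:

```java
// Sketch of the proposed interface; the actual shape in the PR may differ.
public class RetrieverSketch {

	interface YarnClusterInformationRetriever {
		int getMaxVcores();
	}

	// In production this would wrap a YarnClient; in tests a fixed-value
	// implementation avoids talking to a real Yarn cluster.
	static YarnClusterInformationRetriever fixedMaxVcores(int maxVcores) {
		return () -> maxVcores;
	}

	// Hypothetical validation using the retriever instead of the YarnClient directly.
	static void validateVcores(int requestedSlots, YarnClusterInformationRetriever retriever) {
		if (requestedSlots > retriever.getMaxVcores()) {
			throw new IllegalArgumentException(
				"Requested slots exceed the yarn max allocation vcores.");
		}
	}

	public static void main(String[] args) {
		YarnClusterInformationRetriever testing = fixedMaxVcores(4);
		validateVcores(2, testing); // within the limit, passes
		boolean failed = false;
		try {
			validateVcores(8, testing); // above the limit, must fail
		} catch (IllegalArgumentException e) {
			failed = true;
		}
		if (!failed) {
			throw new AssertionError("expected validation failure for 8 slots");
		}
		System.out.println("validation behaved as expected");
	}
}
```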
I also looked into YarnClientImpl. My only concern about this option was that it is annotated with @Private and @Unstable but I did not see that we actually already decided once to extend it. I would be ok with it. I guess we will refactor it if this class is gone after updating the Yarn dependency.
tillrohrmann left a comment
Thanks for creating this PR @xintongsong. I agree with Andrey that overriding production code methods in order to test something should always be our last resort and is usually a code/testing smell. It shows that the code is not modular enough to properly test it. Hence, I would propose to either introduce a YarnClusterInformationRetriever interface which encapsulates the retrieval logic and allows us to provide a testing implementation or to extend directly YarnClientImpl to override the getNodeReports method.
…lusterInformationRetriever for getting Yarn max allocation vcores.
I'm putting these changes in a separate commit for the convenience of code review. This commit should be squashed with the subsequent commit, using the commit message of the latter.
…IfTaskSlotsHigherThanMaxVcores and #testConfigOverwrite not validating the original intended behaviors.
Force-pushed e0f8d6d to 9968737
Thanks for the review and explanations, @tillrohrmann and @azagrebin. I've updated the PR with the `YarnClusterInformationRetriever` approach.
tillrohrmann left a comment
The changes look good to me. Thanks a lot for updating the PR @xintongsong. Merging this PR once Travis gives green light.
…criptor For better testability this commit introduces the YarnClusterInformationRetriever which is responsible for retrieving the maximum number of vcores. This closes apache#10852.
What is the purpose of the change
This PR fixes `YarnClusterDescriptorTest#testFailIfTaskSlotsHigherThanMaxVcores` and `#testConfigOverwrite`, which should have failed long ago but were masked by another problem.

The original purpose of these two test cases was to verify the validation logic against the yarn max allocation vcores. These two cases should have failed when we changed the validation logic to get the yarn max allocation vcores from the yarnClient instead of the configuration, because no yarn cluster (not even a `MiniYARNCluster`) is started in these cases, thus `yarnClient#getNodeReports` will never return.

The cases have not failed because another `IllegalConfigurationException` was thrown in `validateClusterSpecification` due to a memory validation failure. The memory validation failure was by design, and in order to verify the original purpose these two test cases should have been updated with reasonable memory sizes, which was unfortunately overlooked.

Brief change log
Verifying this change
This change is already covered by existing tests.
Does this pull request potentially affect one of the following parts:
- `@Public(Evolving)`: (no)

Documentation