
Conversation

@xintongsong
Contributor

What is the purpose of the change

This PR fixes YarnClusterDescriptorTest#testFailIfTaskSlotsHigherThanMaxVcores and #testConfigOverwrite, which should have failed long ago but were masked by another problem.

The original purpose of these two test cases was to verify the validation logic against the Yarn max allocation vcores. These two cases should have started failing when we changed the validation logic to get the Yarn max allocation vcores from the yarnClient instead of from the configuration, because no Yarn cluster (not even a MiniYARNCluster) is started in these cases, so yarnClient#getNodeReports will never return.

The cases did not fail because another IllegalConfigurationException was thrown in validateClusterSpecification due to a memory validation failure. That memory validation failure is by design; to verify their original purpose, these two test cases should have been updated with reasonable memory sizes, which was unfortunately overlooked.
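To make the masking effect concrete, here is a minimal sketch of the test shape (hypothetical; the helper method is an illustrative assumption, not the actual test code):

@Test
public void testFailIfTaskSlotsHigherThanMaxVcores() {
	// Hypothetical shape of the affected tests: only the exception type is asserted,
	// so a failure in the memory validation satisfies the assertion just as well as
	// the intended failure in the vcores validation.
	try {
		// With masterMemoryMB = 1 and taskManagerMemoryMB = 1, validateClusterSpecification
		// throws IllegalConfigurationException for the memory check first, so the vcores
		// check (the behavior this test is meant to verify) is never reached.
		deployWithSlotsExceedingYarnMaxVcores(); // hypothetical helper
		fail("The deployment should have failed.");
	} catch (IllegalConfigurationException expected) {
		// The test passes, but for the wrong reason.
	}
}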

Brief change log

  • 04ffffe: Update memory setups to uncover the problem.
    • I leave this as a separate commit for the convenience of code review. It should be squashed at merge time.
  • 54fdafd: Fix test cases by mocking the yarn max allocation vcores.

Verifying this change

This change is already covered by existing tests.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@xintongsong
Contributor Author

cc @azagrebin

@flinkbot
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 54fdafd (Tue Jan 14 10:09:50 UTC 2020)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@xintongsong
Contributor Author

@flinkbot
Collaborator

flinkbot commented Jan 14, 2020

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

.setMasterMemoryMB(1)
.setTaskManagerMemoryMB(1)
.setNumberTaskManagers(1)
.setTaskManagerMemoryMB(1024)
Contributor

Do we really need to set the TaskManagerMemoryMB to 1024? Or is the default value (768) enough?

Contributor Author

The default is not enough. It would cause a memory validation failure, because the derived network memory is smaller than the default minimum.

Contributor

Should we actually try to change the default ClusterSpecificationBuilder.taskManagerMemoryMB to 1024 if it is already clear that it is inconsistent in general? This would also give us more confidence that what we are fixing now does not happen in other tests as well.

Contributor Author

I'll give it a try and change the default to 1024, to see if it breaks anything.

I'm actually thinking about removing ClusterSpecification entirely. I've visited all the usages of this class, and it seems to me that all of its information can be fetched directly from the configuration. However, I'd like to keep that out of the scope of this PR.

Contributor Author

Making ClusterSpecificationBuilder#taskManagerMemoryMB default to 1024 turns out to work well; no tests break due to this change. https://travis-ci.org/xintongsong/flink/builds/637743056
I pushed a fixup commit to this PR.

.setTaskManagerMemoryMB(1)
.setNumberTaskManagers(1)
.setSlotsPerTaskManager(1)
.setTaskManagerMemoryMB(1024)
Contributor

Same as above.

Contributor Author

Same here.

@xintongsong force-pushed the FLINK-15564-yarn-descriptor-test branch from 54fdafd to 494422c on January 14, 2020 11:48
Contributor

@wangyang0918 left a comment

Thanks for your contribution. LGTM.

@xintongsong xintongsong requested a review from azagrebin January 15, 2020 11:33
Contributor

@azagrebin left a comment

Thanks for fixing this @xintongsong
The idea looks good to me, I have left some questions.

}

@VisibleForTesting
protected int getNumYarnMaxVcores() throws YarnDeploymentException {
Contributor

I am wondering whether it would actually be less invasive to introduce a StubYarnClientImpl and use it in YarnClusterDescriptorTest#setup, instead of changing the production class YarnClusterDescriptor for this:

class StubYarnClientImpl extends YarnClient {
	@Override
	public List<NodeReport> getNodeReports(NodeState... states) {
		return Collections.singletonList(new NodeReport() {
			@Override
			public Resource getCapability() {
				return new Resource() {
					@Override
					public int getVirtualCores() {
						return NUM_YARN_MAX_VCORES;
					}
					// ...
				};
			}
			// ...
		});
	}
	// ...
}

public class YarnClusterDescriptorTest extends TestLogger {
	// ...
	@BeforeClass
	public static void setupClass() {
		yarnConfiguration = new YarnConfiguration();
		yarnClient = new StubYarnClientImpl();
		yarnClient.init(yarnConfiguration);
		yarnClient.start();
	}
	// ...
}

It is quite a lot of useless code, but we can put it into a separate file.

Contributor Author

I'm not sure about this.

YarnClient is an abstract class, and to introduce a StubYarnClientImpl we would also need to implement more than 20 useless methods and mock the NodeReport as well. The complication seems unnecessary to me.

It is true that the current approach touches the production class YarnClusterDescriptor, but only with a trivial refactoring. IMO, even without this testability issue, it is not a bad thing to extract the logic for getting the max vcores from Yarn into a separate method.

Contributor

I agree with Andrey that overriding production code methods should always be our last resort when it comes to testing. The danger is that this method evolves and that we override important behaviour unintentionally. I would propose two solutions to the problem:

  1. Introduce a YarnClusterInformationRetriever interface which offers the method getMaxVcores. The default implementation will simply use the YarnClient to retrieve the max vcores. In the test we can provide a testing implementation (a sketch follows below).
  2. Alternatively, similar to TestingYarnClient override the getNodeReports method from YarnClientImpl.
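
A minimal sketch of what option 1 could look like (the interface name and getMaxVcores follow the proposal above; the implementation class name, the exception type, and the max-over-nodes logic are illustrative assumptions, not necessarily what was merged; imports omitted and the two types sketched together for brevity):

/** Sketch of the proposed abstraction for retrieving Yarn cluster information. */
@FunctionalInterface
public interface YarnClusterInformationRetriever {
	int getMaxVcores() throws FlinkException;
}

/** Illustrative default implementation backed by the YarnClient. */
class YarnClientYarnClusterInformationRetriever implements YarnClusterInformationRetriever {

	private final YarnClient yarnClient;

	YarnClientYarnClusterInformationRetriever(YarnClient yarnClient) {
		this.yarnClient = yarnClient;
	}

	@Override
	public int getMaxVcores() throws FlinkException {
		try {
			// Take the largest vcore capability reported by any running node.
			return yarnClient.getNodeReports(NodeState.RUNNING).stream()
				.mapToInt(nodeReport -> nodeReport.getCapability().getVirtualCores())
				.max()
				.orElse(0);
		} catch (YarnException | IOException e) {
			throw new FlinkException("Could not retrieve the maximum number of vcores from Yarn.", e);
		}
	}
}

Since the interface has a single method, a test can stub it with a lambda instead of implementing the 20+ abstract methods of YarnClient mentioned above.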

Contributor

I also looked into YarnClientImpl. My only concern about this option was that it is annotated with @Private and @Unstable, but I had not seen that we actually already decided once to extend it. I would be OK with it. I guess we will refactor it if this class is gone after updating the Yarn dependency.

Contributor

@tillrohrmann left a comment

Thanks for creating this PR @xintongsong. I agree with Andrey that overriding production code methods in order to test something should always be our last resort and is usually a code/testing smell. It shows that the code is not modular enough to test it properly. Hence, I would propose either introducing a YarnClusterInformationRetriever interface, which encapsulates the retrieval logic and allows us to provide a testing implementation, or extending YarnClientImpl directly to override the getNodeReports method.


…lusterInformationRetriever for getting Yarn max allocation vcores.
I'm putting these changes in a separate commit for the convenience of code review.
This commit should be squashed with the subsequent commit, using the commit message of the latter.
…IfTaskSlotsHigherThanMaxVcores and #testConfigOverwrite not validating the original intended behaviors.
@xintongsong force-pushed the FLINK-15564-yarn-descriptor-test branch from e0f8d6d to 9968737 on January 16, 2020 13:46
@xintongsong
Contributor Author

Thanks for the review and explanations, @tillrohrmann and @azagrebin.

I think YarnClusterInformationRetriever sounds like a really good solution.

As for having a testing implementation of YarnClient, my concern is that we would need to introduce too many unused method implementations. And if the Yarn API is extended in later versions, we might need to further extend our testing implementation, or even run into dependency problems across various Hadoop versions.
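
With a single-method interface like the one sketched above, the test-side stub becomes trivial; a hypothetical example (the constant name follows NUM_YARN_MAX_VCORES from the earlier sketch, the value and field name are arbitrary assumptions):

// Hypothetical test-side stub for the retriever sketched earlier in this thread.
private static final int NUM_YARN_MAX_VCORES = 16;

private static final YarnClusterInformationRetriever TESTING_RETRIEVER =
	() -> NUM_YARN_MAX_VCORES;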

I've updated the PR with the YarnClusterInformationRetriever approach. Please take another look.

Contributor

@tillrohrmann left a comment

The changes look good to me. Thanks a lot for updating the PR @xintongsong. Merging this PR once Travis gives green light.

tillrohrmann added a commit to tillrohrmann/flink that referenced this pull request Jan 16, 2020
…criptor

For better testability this commit introduces the YarnClusterInformationRetriever which
is responsible for retrieving the maximum number of vcores.

This closes apache#10852.
tillrohrmann added a commit that referenced this pull request Jan 16, 2020
…criptor

For better testability this commit introduces the YarnClusterInformationRetriever which
is responsible for retrieving the maximum number of vcores.

This closes #10852.