Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-40490][YARN][TESTS][3.3] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios #37962

Closed
wants to merge 1 commit into from

Conversation

LuciferYang
Copy link
Contributor

@LuciferYang LuciferYang commented Sep 22, 2022

What changes were proposed in this pull request?

After SPARK-17321, YarnShuffleService will persist data to local shuffle state db/reload data from local shuffle state db only when Yarn NodeManager start with YarnConfiguration#NM_RECOVERY_ENABLED = true.

YarnShuffleIntegrationSuite not set YarnConfiguration#NM_RECOVERY_ENABLED and the default value of the configuration is false, so YarnShuffleIntegrationSuite will neither trigger data persistence to the db nor verify the reload of data.

This pr aims to let YarnShuffleIntegrationSuite restart the verification of registeredExecFile reload scenarios, to achieve this goal, this pr make the following changes:

  1. Add a new un-document configuration spark.yarn.shuffle.testing to YarnShuffleService, and Initialize _recoveryPath when _recoveryPath == null && spark.yarn.shuffle.testing == true.

  2. Only set spark.yarn.shuffle.testing = true in YarnShuffleIntegrationSuite, and add assertions to check registeredExecFile is not null to ensure that registeredExecFile reload scenarios will be verified.

Why are the changes needed?

Fix registeredExecFile reload test scenarios.

Why not test by configuring YarnConfiguration#NM_RECOVERY_ENABLED as true?

This configuration has been tried

Hadoop 3.3.2

build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite -Phadoop-3
YarnShuffleIntegrationSuite:
*** RUN ABORTED ***
  java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/org/iq80/leveldb/DBException
  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:313)
  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:370)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
  at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
  at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
  at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:327)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
  at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:111)
  at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
  ...
  Cause: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.org.iq80.leveldb.DBException
  at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:313)
  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:370)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
  at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
  at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)

Hadoop 2.7.4

build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite -Phadoop-2
YarnShuffleIntegrationSuite:
org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite *** ABORTED ***
  java.lang.IllegalArgumentException: Cannot support recovery with an ephemeral server port. Check the setting of yarn.nodemanager.address
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:395)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceStart(MiniYARNCluster.java:560)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
  at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  ...
Run completed in 3 seconds, 992 milliseconds.
Total number of tests run: 0
Suites: completed 1, aborted 1
Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
*** 1 SUITE ABORTED ***

From the above test, we need to use a fixed port to enable Yarn NodeManager recovery, but this is difficult to be guaranteed in UT, so this pr try a workaround way.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass GitHub Actions

@@ -222,6 +227,10 @@ protected void serviceInit(Configuration externalConf) throws Exception {

boolean stopOnFailure = _conf.getBoolean(STOP_ON_FAILURE_KEY, DEFAULT_STOP_ON_FAILURE);

if (_recoveryPath == null && _conf.getBoolean(INTEGRATION_TESTING, false)) {
_recoveryPath = new Path(Files.createTempDir().toURI());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a little difference here.
com.google.common.io.Files.createTempDir is used instead due to SPARK-39102 is not in Spark 3.3, so there is no createTempDir method in JavaUtils

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. (Pending CIs).

cc @tgravescs

@dongjoon-hyun
Copy link
Member

The test passed. Merged to branch-3.3.

dongjoon-hyun pushed a commit that referenced this pull request Sep 22, 2022
…sts registeredExecFile reload scenarios

### What changes were proposed in this pull request?
After SPARK-17321, `YarnShuffleService` will persist data to local shuffle state db/reload data from local shuffle state db only when Yarn NodeManager start with `YarnConfiguration#NM_RECOVERY_ENABLED = true`.

`YarnShuffleIntegrationSuite` not set `YarnConfiguration#NM_RECOVERY_ENABLED` and the default value of the configuration is false,  so `YarnShuffleIntegrationSuite` will neither trigger data persistence to the db nor verify the reload of data.

This pr aims to let `YarnShuffleIntegrationSuite` restart the verification of registeredExecFile reload scenarios, to achieve this goal, this pr make the following changes:

1. Add a new un-document configuration `spark.yarn.shuffle.testing` to `YarnShuffleService`, and Initialize `_recoveryPath` when `_recoveryPath == null && spark.yarn.shuffle.testing == true`.

2. Only set `spark.yarn.shuffle.testing = true` in `YarnShuffleIntegrationSuite`, and add assertions to check `registeredExecFile` is not null to ensure that registeredExecFile reload scenarios will be verified.

### Why are the changes needed?
Fix registeredExecFile reload  test scenarios.

Why not test by configuring `YarnConfiguration#NM_RECOVERY_ENABLED` as true?

This configuration has been tried

**Hadoop 3.3.2**

```
build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite -Phadoop-3
```

```
YarnShuffleIntegrationSuite:
*** RUN ABORTED ***
  java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/org/iq80/leveldb/DBException
  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:313)
  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:370)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
  at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
  at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
  at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:327)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
  at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:111)
  at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
  ...
  Cause: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.org.iq80.leveldb.DBException
  at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:313)
  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:370)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
  at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597)
  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
  at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
```

**Hadoop 2.7.4**

```
build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite -Phadoop-2
```

```
YarnShuffleIntegrationSuite:
org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite *** ABORTED ***
  java.lang.IllegalArgumentException: Cannot support recovery with an ephemeral server port. Check the setting of yarn.nodemanager.address
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:395)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceStart(MiniYARNCluster.java:560)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
  at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  ...
Run completed in 3 seconds, 992 milliseconds.
Total number of tests run: 0
Suites: completed 1, aborted 1
Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
*** 1 SUITE ABORTED ***
```

From the above test, we need to use a fixed port to enable Yarn NodeManager recovery, but this is difficult to be guaranteed in UT, so this pr try a workaround way.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #37962 from LuciferYang/SPARK-40490-33.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@LuciferYang
Copy link
Contributor Author

thanks @dongjoon-hyun

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants