
[SPARK-14963][Yarn] Using recoveryPath if NM recovery is enabled #12994

Closed
wants to merge 5 commits

Conversation

3 participants
@jerryshao
Contributor

commented May 9, 2016

What changes were proposed in this pull request?

Since Hadoop 2.5+, the YARN NM supports NM recovery, which uses a recovery path for auxiliary services such as spark_shuffle and mapreduce_shuffle. So here we change to use this path instead of the NM local dir if NM recovery is enabled.

How was this patch tested?

Unit test + local test.
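To make the proposed behavior concrete, here is a minimal standalone sketch of the path-selection and upgrade-move logic described above. This is not the actual `YarnShuffleService` code; the class and method names (`RecoveryPathSketch`, `chooseDbFile`) are illustrative, and only `registeredExecutors.ldb` comes from the PR itself.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Illustrative sketch (hypothetical names) of the recovery-path selection:
// use the NM recovery path when NM recovery is enabled, otherwise fall
// back to an NM local dir, and migrate any pre-existing DB on upgrade.
public class RecoveryPathSketch {
    static final String DB_NAME = "registeredExecutors.ldb";

    // recoveryPath == null models "NM recovery disabled".
    static File chooseDbFile(File recoveryPath, File[] localDirs) throws IOException {
        if (recoveryPath != null) {
            File target = new File(recoveryPath, DB_NAME);
            // Upgrade case: if an old DB exists in a local dir, move it into
            // the recovery path so previously registered executors survive.
            for (File dir : localDirs) {
                File old = new File(dir, DB_NAME);
                if (old.exists() && !old.equals(target)) {
                    Files.move(old.toPath(), target.toPath(),
                               StandardCopyOption.REPLACE_EXISTING);
                    break;
                }
            }
            return target;
        }
        // Recovery disabled: keep the old behavior and use a local dir.
        return new File(localDirs[0], DB_NAME);
    }
}
```

The key property is that the same executor-state DB is found again after an NM restart or an upgrade that turns recovery on, rather than a fresh empty DB being created next to the old one.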

@SparkQA


commented May 9, 2016

Test build #58116 has finished for PR 12994 at commit 08557bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
}

/**
* Get the recovery path, this will override the default one to get the our own maintained

@tgravescs

tgravescs May 9, 2016

Contributor

nit: "to get our" remove "the"


// In case this NM was killed while there were running spark applications, we need to restore
// lost state for the existing executors. We look for an existing file in the NM's local dirs.
// If we don't find one, then we choose a file to use to save the state next time. Even if
// an application was stopped while the NM was down, we expect yarn to call stopApplication()
// when it comes back
registeredExecutorFile =
-  findRegisteredExecutorFile(conf.getTrimmedStrings("yarn.nodemanager.local-dirs"));
+  new File(getRecoveryPath().toUri().getPath(), "registeredExecutors.ldb");

@tgravescs

tgravescs May 9, 2016

Contributor

I know you didn't change this, but could you make a constant for "registeredExecutors.ldb"?

// dirs, which is compatible with the old code.
_recoveryPath = new Path(dir);
} else {
// If NM recovery is enabled and recovery file is existed in old NM local dirs, which

@tgravescs

tgravescs May 9, 2016

Contributor

nit change "and recovery file is existed in old" to "the recovery file exists in the old"

}

test("get correct recovery path") {

@tgravescs

tgravescs May 9, 2016

Contributor

could you also add a test to make sure the move happens properly when upgrading.
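An upgrade test along these lines might look like the following standalone Java sketch. This is not the actual Spark test suite (which is in Scala); the class name and directory layout here are illustrative assumptions, and the move step is simulated rather than performed by the real service.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Illustrative upgrade check: a DB written under an NM local dir before
// the upgrade should end up under the recovery path afterwards, with the
// old copy removed.
public class UpgradeMoveCheck {
    public static void main(String[] args) throws Exception {
        File localDir = Files.createTempDirectory("nm-local").toFile();
        File recoveryDir = Files.createTempDirectory("nm-recovery").toFile();

        // Pre-upgrade state: the shuffle service kept its DB in the local dir.
        File oldDb = new File(localDir, "registeredExecutors.ldb");
        Files.write(oldDb.toPath(), "executor state".getBytes());

        // Post-upgrade: NM recovery is enabled, so the service should move
        // the DB into the recovery path on startup (simulated here).
        File newDb = new File(recoveryDir, "registeredExecutors.ldb");
        if (oldDb.exists()) {
            Files.move(oldDb.toPath(), newDb.toPath(),
                       StandardCopyOption.REPLACE_EXISTING);
        }

        // The old copy must be gone and the state preserved at the new path.
        if (oldDb.exists() || !newDb.exists()) {
            throw new AssertionError("DB was not moved to the recovery path");
        }
        System.out.println("upgrade move ok");
    }
}
```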

@tgravescs

Contributor

commented May 9, 2016

A few minor comments, but mostly looks good. Did you build against both hadoop 2.5+ and hadoop < 2.5?

Did you manually test the upgrade path?

@jerryshao

Contributor Author

commented May 10, 2016

Thanks @tgravescs for your comments, I will change the code and do a more comprehensive test accordingly.

@jerryshao

Contributor Author

commented May 10, 2016

@tgravescs , I tested locally using Hadoop 2.4 and 2.6 with different scenarios:

  1. Only Hadoop 2.4
  2. Hadoop 2.4 upgrade to 2.6 with NM recovery disabled.
  3. Hadoop 2.4 upgrade to 2.6 with NM recovery enabled.
  4. Hadoop 2.6 NM recovery disabled to enabled.

Looks fine in all these scenarios.

One missing part: do we need to take care of downgrade scenarios, like 2.6 to 2.4, or NM recovery enabled to disabled?

@SparkQA


commented May 10, 2016

Test build #58201 has finished for PR 12994 at commit 4e5c2fd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA


commented May 10, 2016

Test build #58205 has finished for PR 12994 at commit 519bf07.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@jerryshao

Contributor Author

commented May 10, 2016

Jenkins, retest this please.

@SparkQA


commented May 10, 2016

Test build #58207 has finished for PR 12994 at commit 519bf07.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA


commented May 10, 2016

Test build #58210 has finished for PR 12994 at commit 02752c9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA


commented May 10, 2016

Test build #58216 has finished for PR 12994 at commit 6d4a8f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@tgravescs

Contributor

commented May 10, 2016

I'm not concerned with the downgrade case. It just won't find the file if YARN isn't setting the recovery path any longer (it will create a new one in the local dir), but I don't see that as a big issue: if someone is downgrading their cluster or has turned off recovery, they should kill everything that is running.

@tgravescs

Contributor

commented May 10, 2016

+1 Thanks @jerryshao

@asfgit asfgit closed this in aab99d3 May 10, 2016
