[SPARK-22372][core, yarn] Make cluster submission use SparkApplication. #19631

Closed
vanzin wants to merge 11 commits into master from vanzin:SPARK-22372

Conversation

@vanzin (Contributor) commented Nov 1, 2017

The main goal of this change is to allow multiple cluster-mode
submissions from the same JVM, without having them end up with
mixed configuration. That is done by extending the SparkApplication
trait, and doing so was reasonably trivial for standalone and
mesos modes.
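
For reference, a minimal sketch of the shape of that entry-point trait, inferred from this description (the PR's actual definition may differ in details):

import org.apache.spark.SparkConf

// Entry point for submission clients: each invocation receives its own
// SparkConf, so concurrent submissions in one JVM don't share mutable state.
private[spark] trait SparkApplication {
  def start(args: Array[String], conf: SparkConf): Unit
}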

For YARN mode, there was a complication. YARN used a "SPARK_YARN_MODE"
system property to control behavior indirectly in a whole bunch of
places, mainly in the SparkHadoopUtil / YarnSparkHadoopUtil classes.
Most of the changes here are removing that.

Since we removed support for Hadoop 1.x, some methods that lived in
YarnSparkHadoopUtil can now live in SparkHadoopUtil. The remaining
methods don't need to be part of the class, and can be called directly
from the YarnSparkHadoopUtil object, so now there's a single
implementation of SparkHadoopUtil.

There were two places in the code that relied on SPARK_YARN_MODE to
make decisions about YARN-specific functionality; they now explicitly check
the master from the configuration instead (a sketch follows the list):

  • fetching the external shuffle service port, which can come from the YARN
    configuration.

  • propagation of the authentication secret using Hadoop credentials. This also
    was cleaned up a little to not need so many methods in SparkHadoopUtil.
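
A hedged sketch of that master-based check (illustrative names, not the PR's exact code):

import org.apache.spark.SparkConf

// Decide YARN-specific behavior from the configured master instead of
// the old SPARK_YARN_MODE system property; "yarn" covers both client
// and cluster deploy modes.
def isYarnMaster(conf: SparkConf): Boolean =
  conf.get("spark.master", "").startsWith("yarn")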

With those out of the way, actually changing the YARN client
to extend SparkApplication was easy.

Tested with existing unit tests, and also by running YARN apps
with auth and kerberos both on and off in a real cluster.

@vanzin (Contributor Author) commented Nov 1, 2017

retest this please

@vanzin (Contributor Author) commented Nov 1, 2017

@shaneknapp any idea what's going on?

chmod: cannot access `target/*': No such file or directory

@vanzin (Contributor Author) commented Nov 1, 2017

retest this please

@SparkQA commented Nov 1, 2017

Test build #83299 has finished for PR 19631 at commit c80554b.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkHadoopUtil is not documented as a public class in Spark's API
docs, so just added new exclusions.
@vanzin (Contributor Author) commented Nov 1, 2017

retest this please

@SparkQA commented Nov 1, 2017

Test build #83305 has started for PR 19631 at commit cee6be2.

@vanzin (Contributor Author) commented Nov 2, 2017

retest this please

@vanzin (Contributor Author) commented Nov 2, 2017

Sigh. retest this please

@SparkQA commented Nov 2, 2017

Test build #83352 has finished for PR 19631 at commit cee6be2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shaneknapp (Contributor) commented Nov 4, 2017 via email

return
}

// In YARN, force creation of a new secret if this is client mode. This ensures each

Is there a reason this has to be unique to YARN? Will this solve the problem (currently in Mesos) where executors bootstrap without security (unless you "bake" the secret and secret config into the container image)? It looks like propagating the env var is only handled in the YARN case?

Contributor Author

This behaves the same way as before for non-YARN. Standalone and Mesos have always used hardcoded secrets in the config to authenticate executors to driver and the driver to the master (in the case of standalone).

You can see the code I'm changing in this class, where for non-YARN it would throw an error if the secret was not set. If changing that behavior is desired for Mesos, then it should be done in a separate change.
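
For context, a hedged sketch of that pre-existing non-YARN behavior, assuming a sparkConf in scope as inside SecurityManager (error message paraphrased):

// Sketch only: outside YARN, the secret must be set explicitly in the
// config, and SecurityManager throws when spark.authenticate is on but
// no secret is configured.
val secret = sparkConf.getOption(SecurityManager.SPARK_AUTH_SECRET_CONF).getOrElse {
  throw new IllegalArgumentException(
    "Error: a secret key must be specified via the " +
      s"${SecurityManager.SPARK_AUTH_SECRET_CONF} config")
}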

Contributor Author

BTW, standalone already propagates the secret using an env var; the issue is that standalone needs the same secret everywhere, including where the driver authenticates with the master. Mesos just inherited that.

@ArtRand commented Nov 7, 2017

Yes. I guess I'm wondering whether, with your change, you think this will work in all cases, not just YARN? Perhaps obviously, I'm looking into changing this for Mesos in the future.

Contributor Author

I'm not changing the previous behavior for non-YARN. So my change shouldn't make it any easier or harder to make things work for other cluster managers.

Whether it will work depends on how the auth secret is used in those cases.


Ok, thanks for the clarification.

@@ -365,22 +370,21 @@ object SparkSubmit extends CommandLineUtils with Logging {

// This security manager will not need an auth secret, but set a dummy value in case
// spark.authenticate is enabled, otherwise an exception is thrown.

this comment is no longer true?

env.filterKeys { k =>
// SPARK_HOME is filtered out because it is usually wrong on the remote machine (SPARK-12345)
(k.startsWith("SPARK_") && k != "SPARK_ENV_LOADED" && k != "SPARK_HOME") ||
k.startsWith("MESOS_")

Might this break Mesos when using the mesos bundle?

Contributor Author

I don't know, but this should be the exact same behavior as before, no? I didn't really change this code.


Yes, I apologize; you're correct. I think this is actually there to filter out things like MESOS_EXECUTOR_ID and MESOS_FRAMEWORK_ID.

@SparkQA commented Nov 7, 2017

Test build #83555 has finished for PR 19631 at commit cb6c8ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor Author) commented Nov 13, 2017

@tgravescs @jerryshao

@jerryshao (Contributor) left a comment

This PR also involves many changes across many files; I need to go through it again to understand them.

val SPARK_AUTH_CONF: String = "spark.authenticate"
val SPARK_AUTH_SECRET_CONF: String = "spark.authenticate.secret"
val SPARK_AUTH_CONF = NETWORK_AUTH_ENABLED.key
val SPARK_AUTH_SECRET_CONF = "spark.authenticate.secret"
Contributor

I think we can also make this a ConfigEntry.

Contributor Author

In a separate change, maybe. There are too many references to this right now; it would be really noisy here.
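
For reference, a hedged sketch of what such a ConfigEntry might look like with Spark's ConfigBuilder (illustrative only; not part of this PR):

import org.apache.spark.internal.config.ConfigBuilder

// A hypothetical typed entry for the auth secret; the existing string
// constant would be replaced by AUTH_SECRET.key at the call sites.
private[spark] val AUTH_SECRET = ConfigBuilder("spark.authenticate.secret")
  .doc("Secret key used for Spark internal authentication.")
  .stringConf
  .createOptional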

// Following constants are visible for testing.
private[deploy] val YARN_SUBMIT_CLASS = "org.apache.spark.deploy.yarn.YarnClusterApplication"
private[deploy] val REST_SUBMIT_CLASS = classOf[RestSubmissionClientApp].getName()
private[deploy] val STANDALONE_SUBMIT_CLASS = classOf[ClientApp].getName()
Contributor

Maybe we can rename these variables to XXX_CLUSTER_SUBMIT_CLASS.

Contributor Author

Sure.

@@ -216,7 +216,9 @@ private[spark] object CoarseGrainedExecutorBackend extends Logging {
if (driverConf.contains("spark.yarn.credentials.file")) {
logInfo("Will periodically update credentials from: " +
driverConf.get("spark.yarn.credentials.file"))
SparkHadoopUtil.get.startCredentialUpdater(driverConf)
Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil")
Contributor

What's the purpose of changing it this way?

Contributor Author

YarnSparkHadoopUtil does not extend SparkHadoopUtil anymore (it doesn't even exist as a class), so I'm using this to invoke the method on the YARN object.

I plan to clean this up once #19272 is in, since that's a cleaner way to do this token renewal thing.
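
A hedged reconstruction of the reflective call the truncated diff above ends with (the method name comes from the removed line in the diff; the exact signature is an assumption):

// No compile-time dependency on the yarn module: look up the
// YarnSparkHadoopUtil object's static forwarder and invoke it.
Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil")
  .getMethod("startCredentialUpdater", classOf[SparkConf])
  .invoke(null, driverConf)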

Contributor

I see, thanks for the explanation; this kind of reflection seems not so elegant.

Contributor

#19272 is already in; do you plan to update this here in this PR or in a separate PR?

Contributor Author

I plan to do it in a separate PR (also to make YARN and Mesos share more code in that area).

@SparkQA commented Nov 14, 2017

Test build #83858 has finished for PR 19631 at commit 593eb6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao (Contributor) left a comment

Just did another round of review and left some questions; LGTM overall.

One more question: does this PR imply that launching multiple SparkSubmit instances in one JVM will only work in cluster mode?

@@ -412,8 +412,6 @@ class SparkContext(config: SparkConf) extends Logging {
}
}

if (master == "yarn" && deployMode == "client") System.setProperty("SPARK_YARN_MODE", "true")
Contributor

Not sure why this is not required anymore?

Contributor Author

This change is removing all references to SPARK_YARN_MODE.


Option(amKeytabFileName).foreach { k => props.setProperty(KEYTAB.key, k) }
Option(amKeytabFileName).foreach { k =>
// Do not propagate the app's secret using the config file.
if (k != SecurityManager.SPARK_AUTH_SECRET_CONF) {
Contributor

Is it necessary to add a check here? I'm not sure how this could happen.

Contributor Author

Oh, I think this is the wrong place for the check.

@SparkQA commented Nov 15, 2017

Test build #83875 has finished for PR 19631 at commit 121bcf8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 15, 2017

Test build #83876 has finished for PR 19631 at commit 08f47ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ClientSuite extends SparkFunSuite with Matchers

@jerryshao (Contributor) commented

Did another round of review; LGTM overall. @tgravescs do you have any comments?

@tgravescs (Contributor) commented

Sorry, on vacation till Monday. I would like to look at this, though, since it changes secret passing to an env variable.

@SparkQA commented Nov 16, 2017

Test build #83946 has finished for PR 19631 at commit 2129ccb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 17, 2017

Test build #83951 has finished for PR 19631 at commit 86f0bf8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor) commented

fyi - in the process of reviewing; should be done in an hour or so.

@tgravescs (Contributor) left a comment

My only other concern here is that an environment variable, I think, opens up more possibility for leaking the secret. For instance, if someone starts a server that gets hit by other users, that server could be started with the same env and then inadvertently expose the secret to other users. You could argue they should handle that, but I think it's just something that could happen more easily than if the secret is in the Hadoop credentials.

private val yarnClient = YarnClient.createYarnClient
private val yarnConf = new YarnConfiguration(hadoopConf)
private val hadoopConf = new YarnConfiguration(SparkHadoopUtil.newConfiguration(sparkConf))
Contributor

nit - why the name change here? this is a yarnConf

Contributor Author

Both are effectively the same; I chose to keep the name used more often, to minimize changes.


// If an auth secret is configured, propagate it to executors.
Option(securityMgr.getSecretKey()).foreach { secret =>
sparkConf.setExecutorEnv(SecurityManager.ENV_AUTH_SECRET, secret)
Contributor

We also set ENV_AUTH_SECRET in the ExecutorRunnable, so isn't this a duplicate?

@@ -824,6 +821,11 @@ private[spark] class Client(
}
}
sys.env.get("PYTHONHASHSEED").foreach(env.put("PYTHONHASHSEED", _))
} else {
// Propagate the auth secret to the AM using the environment, if set.
Contributor

Only propagate ... in client mode.

def getSecretKey(): String = secretKey
def getSecretKey(): String = {
if (isAuthenticationEnabled) {
Option(sparkConf.getenv(ENV_AUTH_SECRET))
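
The diff above is truncated; a hedged reconstruction of the env-first lookup under discussion (fallback and error wording are assumptions):

def getSecretKey(): String = {
  if (isAuthenticationEnabled) {
    // Prefer the secret propagated through the environment (YARN), then
    // fall back to the config (standalone / Mesos); fail if auth is on
    // but no secret was provided.
    Option(sparkConf.getenv(ENV_AUTH_SECRET))
      .orElse(sparkConf.getOption(SPARK_AUTH_SECRET_CONF))
      .getOrElse(throw new IllegalArgumentException(
        s"A secret key must be specified via the $SPARK_AUTH_SECRET_CONF config"))
  } else {
    null
  }
}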
Contributor

It seems that now, for the driver, we are using the conf (SPARK_AUTH_SECRET_CONF) just as a holding point for YARN. To me this introduces a bit of risk that it more easily gets out to the user. We are filtering it out from the Spark conf written for executors, but that seems more brittle than if it's just not in there. I realize this makes the code a bit more common with the other modes, but the other modes aren't really secure. I would almost rather keep the in-memory secretKey variable as storage on YARN.

I think this also makes the secret key available for the user to get on the driver side (sc.getConf.get..), which I think would be better to hide.


FWIW, in Mesos, we are planning on using the Secrets primitives to distribute ENV_AUTH_SECRET. This way Mesos and YARN can both use the same secret-generation code and only differ in the distribution of the secret. SPARK_AUTH_SECRET_CONF is already somewhat awkward because it has to be in the config.

Contributor Author

I can add a different, internal config for this if re-using SPARK_AUTH_SECRET_CONF is confusing. But I'm not too concerned about exposing this to the user code running the application; they can just as easily get that info from the UGI currently. Spark already redacts this kind of information when writing it to things like the event log, which would be one place where it might leak out.

Contributor Author

I changed things and now they work pretty much as before. It would be good to separate secret generation from distribution, but I'd rather do that separately.

Contributor

I agree SPARK_AUTH_SECRET_CONF is awkward and not really secure; when I initially did this, it's what was requested by other committers, since standalone and Mesos needed more security work around it anyway.

I don't fully follow how the MesosSecretConfig is going to be used. Are these just regular Spark configs passed around, or distributed through Mesos somehow?

@vanzin (Contributor Author) commented Nov 30, 2017

> For instance, if someone starts a server that gets hit by other users, that server could be started with the same env and then inadvertently expose the secret to other users.

If I understand what you're saying correctly, that should be considered a security issue in that server application regardless of this change. The server should not be exposing its environment to unprivileged users.

That being said, it seems Spark's own thrift server does that. If the following works in a spark-shell, it probably would do the same through the STS:

scala> spark.sql("set spark.sql.columnNameOfCorruptRecord=${env:SCALA_HOME}").show()
+--------------------+------------------+
|                 key|             value|
+--------------------+------------------+
|spark.sql.columnN...|/apps/scala-2.11.7|
+--------------------+------------------+

So this change would expose that secret to users also in YARN mode (it's already exposed in Standalone and Mesos currently, because it's in the config).

Let me think a little about this. I prefer the environment to the credentials approach because the latter are written to disk, but at the same time, that's less problematic than exposing the secret to users in the STS.

@SparkQA commented Dec 1, 2017

Test build #84363 has finished for PR 19631 at commit c752453.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao (Contributor) commented

LGTM.

@@ -36,6 +36,12 @@ object MimaExcludes {

// Exclude rules for 2.3.x
lazy val v23excludes = v22excludes ++ Seq(
// SPARK-22372: Make cluster submission use SparkApplication.
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.deploy.SparkHadoopUtil.getSecretKeyFromUserCredentials"),
@tgravescs (Contributor) commented Dec 1, 2017

Is this overall change only going into Spark 2.3? Wondering about compatibility. I realize this is a DeveloperApi, but it would be nice not to change it in a minor release, I think.

Contributor Author

The change is 2.3 only.

I've always questioned why SparkHadoopUtil is public in the first place. I'm not that worried about removing the functionality in this method because Spark apps don't really need to depend on it; it's easy to call the Hadoop API directly, and the API did nothing before except in YARN mode.
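
A hedged sketch of the direct Hadoop call alluded to here; the credential alias is an assumption for illustration:

import org.apache.hadoop.io.Text
import org.apache.hadoop.security.UserGroupInformation

// Read a secret straight from the current user's credentials, without
// going through SparkHadoopUtil. "sparkCookie" is a hypothetical alias.
val creds = UserGroupInformation.getCurrentUser().getCredentials()
val secretBytes: Array[Byte] = creds.getSecretKey(new Text("sparkCookie"))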

@tgravescs (Contributor) commented

> If I understand what you're saying correctly, that should be considered a security issue in that server application regardless of this change. The server should not be exposing its environment to unprivileged users.

Yes, this is what I was saying, and I agree one could argue it's their fault; it's more that we are being nice by not adding it in, since they might not realize it's something they shouldn't expose. I'm not sure there was anything that sensitive in there before, but it's still good practice not to expose it either way.

> Let me think a little about this. I prefer the environment to the credentials approach because the
> latter are written to disk, but at the same time, that's less problematic than exposing the secret to users in the STS.

I agree that written to disk isn't ideal either, but it's only local disk and not HDFS, so a user would have to compromise the box, vs. someone simply accidentally allowing access to the env/conf via a UI or console.

reviewing changes now.

@tgravescs (Contributor) commented

+1

@vanzin (Contributor Author) commented Dec 4, 2017

Merging to master.

@vanzin vanzin closed this Dec 4, 2017
@vanzin vanzin deleted the SPARK-22372 branch December 4, 2017 19:05
asfgit pushed a commit that referenced this pull request Dec 4, 2017
[SPARK-22372][core, yarn] Make cluster submission use SparkApplication.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #19631 from vanzin/SPARK-22372.