[SPARK-22372][core, yarn] Make cluster submission use SparkApplication. #19631
Conversation
The main goal of this change is to allow multiple cluster-mode submissions from the same JVM, without having them end up with mixed configuration. That is done by extending the SparkApplication trait, and doing so was reasonably trivial for standalone and mesos modes.

For YARN mode, there was a complication. YARN used a "SPARK_YARN_MODE" system property to control behavior indirectly in a whole bunch of places, mainly in the SparkHadoopUtil / YarnSparkHadoopUtil classes. Most of the changes here are removing that.

Since we removed support for Hadoop 1.x, some methods that lived in YarnSparkHadoopUtil can now live in SparkHadoopUtil. The remaining methods don't need to be part of the class, and can be called directly from the YarnSparkHadoopUtil object, so now there's a single implementation of SparkHadoopUtil.

One remaining use case was fetching the external shuffle service port, which can come from the YARN configuration. That is now done by checking the master used to submit the app, instead of the system property.

The other use case was the propagation of the auth secret. That was done by stashing the secret in the current user's credentials in YARN mode. Instead, the secret is now propagated using the config / env variable like for standalone and mesos. That allowed a few methods in SparkHadoopUtil to go away. There's still a little bit of code in SecurityManager that is currently YARN-specific, but that uses the conf's master to detect whether the app is a YARN app. This also has the benefit of not stashing the secret in a shared location (the current UGI), making sure different apps use different secrets.

With those out of the way, actually changing the YARN client to extend SparkApplication was easy.

Tested with existing unit tests, and also by running YARN apps with auth and kerberos both on and off in a real cluster.
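As a rough illustration of the entry point described above, a minimal sketch of a trait shaped like SparkApplication (the exact Spark definition may differ; this is an assumption based on the description):

import org.apache.spark.SparkConf

// Sketch only: a cluster-submission entry point that receives its own SparkConf
// instead of relying on global system properties, which is what makes multiple
// submissions from the same JVM possible without mixed configuration.
trait SparkApplication {
  def start(args: Array[String], conf: SparkConf): Unit
}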
retest this please
@shaneknapp any idea what's going on?
retest this please
Test build #83299 has finished for PR 19631 at commit
SparkHadoopUtil is not documented as a public class in Spark's API docs, so just added new exclusions.
retest this please
Test build #83305 has started for PR 19631 at commit
retest this please
Sigh. retest this please |
Test build #83352 has finished for PR 19631 at commit
That's an expected message from the build and has no impact on it. I'll
change the build config to suppress the message when I'm back from holiday
late next week.
On Nov 1, 2017 6:11 PM, "Marcelo Vanzin" <notifications@github.com> wrote:
@shaneknapp <https://github.com/shaneknapp> any idea what's going on?
chmod: cannot access `target/*': No such file or directory
return
}
// In YARN, force creation of a new secret if this is client mode. This ensures each |
Is there a reason this has to be unique to YARN? Will this solve the problem (currently in Mesos) where, when the executors bootstrap, they do so without security (unless you "bake" the secret and secret config into the container image)? It looks like propagating the env var is only handled in the YARN case?
This behaves the same way as before for non-YARN. Standalone and Mesos have always used hardcoded secrets in the config to authenticate executors to driver and the driver to the master (in the case of standalone).
You can see the code I'm changing in this class, where for non-YARN it would throw an error if the secret was not set. If changing that behavior is desired for Mesos, then it should be done in a separate change.
BTW standalone at least propagates the secret using an env var, the issue is just that standalone, at least, needs the same secret everywhere, including the part where the driver authenticates with the master. Mesos just inherited that.
Yes. I guess I'm wondering if, with your change, you think this will work in all cases, not just YARN? Perhaps obviously, I'm looking into changing this for Mesos in the future.
I'm not changing the previous behavior for non-YARN. So my change shouldn't make it easier nor harder to make things work for other cluster managers.
Whether it will work depends on how the auth secret is used in those cases.
Ok, thanks for the clarification.
@@ -365,22 +370,21 @@ object SparkSubmit extends CommandLineUtils with Logging {
// This security manager will not need an auth secret, but set a dummy value in case
// spark.authenticate is enabled, otherwise an exception is thrown.
this comment is no longer true?
env.filterKeys { k =>
  // SPARK_HOME is filtered out because it is usually wrong on the remote machine (SPARK-12345)
  (k.startsWith("SPARK_") && k != "SPARK_ENV_LOADED" && k != "SPARK_HOME") ||
    k.startsWith("MESOS_")
Might this break Mesos when using the mesos bundle?
I don't know, but this should be the exact same behavior as before, no? I didn't really change this code.
Yes, I apologize, you're correct. I think this is actually to filter out things like MESOS_EXECUTOR_ID and MESOS_FRAMEWORK_ID.
Test build #83555 has finished for PR 19631 at commit
This PR also seems to involve many different changes across many files; I need to go through it again.
val SPARK_AUTH_CONF: String = "spark.authenticate"
val SPARK_AUTH_SECRET_CONF: String = "spark.authenticate.secret"
val SPARK_AUTH_CONF = NETWORK_AUTH_ENABLED.key
val SPARK_AUTH_SECRET_CONF = "spark.authenticate.secret"
I think we can also make this a ConfigEntry.
In a separate change maybe. There are too many references to this right now, it would be really noisy here.
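For reference, a minimal sketch of what such a ConfigEntry could look like using Spark's internal ConfigBuilder API (the entry name and placement are assumptions for illustration, not part of this PR):

import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical ConfigEntry for the auth secret; illustrative only.
private[spark] val AUTH_SECRET = ConfigBuilder("spark.authenticate.secret")
  .doc("Secret key used for authentication between Spark processes.")
  .stringConf
  .createOptional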
// Following constants are visible for testing.
private[deploy] val YARN_SUBMIT_CLASS = "org.apache.spark.deploy.yarn.YarnClusterApplication"
private[deploy] val REST_SUBMIT_CLASS = classOf[RestSubmissionClientApp].getName()
private[deploy] val STANDALONE_SUBMIT_CLASS = classOf[ClientApp].getName()
Maybe we can rename these variables to XXX_CLUSTER_SUBMIT_CLASS.
Sure.
@@ -216,7 +216,9 @@ private[spark] object CoarseGrainedExecutorBackend extends Logging {
if (driverConf.contains("spark.yarn.credentials.file")) {
  logInfo("Will periodically update credentials from: " +
    driverConf.get("spark.yarn.credentials.file"))
  SparkHadoopUtil.get.startCredentialUpdater(driverConf)
  Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil")
What's the purpose of changing to this way?
YarnSparkHadoopUtil does not extend SparkHadoopUtil anymore (it doesn't even exist as a class), so I'm using this to invoke the method on the YARN object.
I plan to clean this up once #19272 is in, since that's a cleaner way to do this token renewal thing.
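For context, the reflective call being discussed typically looks something like this (a sketch assuming a startCredentialUpdater(SparkConf) method on the YARN object; the exact signature here is an assumption):

// Sketch only: call into the YARN-specific object via reflection so that core
// does not need a compile-time dependency on the yarn module.
Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil")
  .getMethod("startCredentialUpdater", classOf[SparkConf])
  .invoke(null, driverConf)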
I see, thanks for the explanation; this kind of reflection seems not so elegant.
#19272 is already in, do you plan to update it here in this PR or in a separate PR?
I plan to do it in a separate PR (also to make YARN and Mesos share more code in that area).
Test build #83858 has finished for PR 19631 at commit
Just did another round of review and left some questions; LGTM overall.
One more question: does this PR imply that launching multiple SparkSubmit instances in one JVM will only work in cluster mode?
@@ -412,8 +412,6 @@ class SparkContext(config: SparkConf) extends Logging {
  }
}
if (master == "yarn" && deployMode == "client") System.setProperty("SPARK_YARN_MODE", "true")
Not sure why this is not required anymore?
This change is removing all references to SPARK_YARN_MODE.
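A minimal sketch of the kind of check that replaces the system property (an assumption about the shape of the replacement, not the merged code):

// Illustrative only: derive YARN-specific behavior from the configured master
// instead of a global SPARK_YARN_MODE system property.
def isYarnApp(conf: SparkConf): Boolean =
  conf.getOption("spark.master").exists(_.startsWith("yarn"))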
Option(amKeytabFileName).foreach { k => props.setProperty(KEYTAB.key, k) }
Option(amKeytabFileName).foreach { k =>
  // Do not propagate the app's secret using the config file.
  if (k != SecurityManager.SPARK_AUTH_SECRET_CONF) {
Is it necessary to add a check here? I'm not sure how this could happen.
Oh, I think this is the wrong place for the check.
Test build #83875 has finished for PR 19631 at commit
Test build #83876 has finished for PR 19631 at commit
Did another round of review, LGTM overall. @tgravescs do you have any comments?
Sorry, on vacation till Monday. I would like to look at this though, since it changes secret passing to an env variable.
Test build #83946 has finished for PR 19631 at commit
Test build #83951 has finished for PR 19631 at commit
FYI - in the process of reviewing; should be done in an hour or so.
My only other concern here is that an environment variable, I think, opens up more possibility for leaking the secret as well. For instance, if someone is starting a server that is getting hit by other users, that server could be started with the same env and then inadvertently expose the secret to other users. For this you could argue they should handle it, but I think it's just something that could happen more easily than if the secret is in the Hadoop credentials.
private val yarnClient = YarnClient.createYarnClient
private val yarnConf = new YarnConfiguration(hadoopConf)
private val hadoopConf = new YarnConfiguration(SparkHadoopUtil.newConfiguration(sparkConf))
nit - why the name change here? this is a yarnConf
Both are effectively the same; I chose to keep the name used more often to minimize changes.
// If an auth secret is configured, propagate it to executors.
Option(securityMgr.getSecretKey()).foreach { secret =>
  sparkConf.setExecutorEnv(SecurityManager.ENV_AUTH_SECRET, secret)
We also set ENV_AUTH_SECRET in ExecutorRunnable, so isn't this a duplicate?
@@ -824,6 +821,11 @@ private[spark] class Client(
  }
}
sys.env.get("PYTHONHASHSEED").foreach(env.put("PYTHONHASHSEED", _))
} else {
  // Propagate the auth secret to the AM using the environment, if set.
Only propagate ... in client mode.
def getSecretKey(): String = secretKey
def getSecretKey(): String = {
  if (isAuthenticationEnabled) {
    Option(sparkConf.getenv(ENV_AUTH_SECRET))
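A minimal sketch of the lookup order implied by this hunk, with the truncated lines filled in by assumption (not necessarily the exact code that was merged):

// Illustrative only: prefer the env variable, fall back to the config,
// and fail if authentication is enabled but no secret was provided.
def getSecretKey(): String = {
  if (isAuthenticationEnabled) {
    Option(sparkConf.getenv(ENV_AUTH_SECRET))
      .orElse(sparkConf.getOption(SPARK_AUTH_SECRET_CONF))
      .getOrElse(throw new IllegalArgumentException(
        s"A secret key must be specified via the $SPARK_AUTH_SECRET_CONF config"))
  } else {
    null
  }
}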
It seems that now for the driver we are using the conf (SPARK_AUTH_SECRET_CONF) just as a holding point for YARN. To me this introduces a bit of risk that it more easily gets out to the user. We are filtering it out from the spark conf written for executors, but that seems more brittle than if it's just not in there.
I realize this makes the code a bit more common for the other modes, but the other modes aren't really secure. I would almost rather keep the in-memory secretKey variable as storage on YARN.
I think this also makes the secret key available for the user to get on the driver side (sc.getConf.get..), which I think it would be better to hide.
FWIW, in Mesos we are planning on using the Secrets primitives to distribute ENV_AUTH_SECRET. This way Mesos and YARN can both use the same secret-generation code and only differ in the distribution of the secret. SPARK_AUTH_SECRET_CONF is already somewhat awkward because it has to be in the config.
I can add a different, internal config for this if re-using SPARK_AUTH_SECRET_CONF is confusing. But I'm not too concerned about exposing this to the user code running the application; they can just as easily get that info from the UGI currently. Spark already redacts this kind of information when writing it to things like the event log, which would be one place where it might leak out.
I changed things and now they work pretty much as before. It would be good to separate secret generation from distribution, but I'd rather do that separately.
I agree SPARK_AUTH_SECRET_CONF is awkward and not really secure; when I initially did this, it's what was requested by other committers, since standalone and mesos needed more security work around it anyway.
I don't fully follow how the MesosSecretConfig is going to be used. Are these just regular spark configs passed around, or distributed through Mesos somehow?
If I understand what you're saying correctly, that should be considered a security issue in that server application regardless of this change. The server should not be exposing its environment to unprivileged users. That being said, it seems Spark's own thrift server does that. If the following works in a spark-shell, it probably would do the same through the STS:
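A minimal sketch of that kind of probe, assuming the secret is exposed through an environment variable with a name like _SPARK_AUTH_SECRET (the name and snippet are illustrative):

scala> // Illustrative only: any user code running in the driver JVM can read its environment,
scala> // so an env-based secret would be visible to it like this.
scala> sys.env.filterKeys(_.contains("AUTH_SECRET"))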
So this change would expose that secret to users also in YARN mode (it's already exposed in Standalone and Mesos currently, because it's in the config). Let me think a little about this. I prefer the environment to the credentials approach because the latter are written to disk, but at the same time, that's less problematic than exposing the secret to users in the STS.
Test build #84363 has finished for PR 19631 at commit
LGTM.
@@ -36,6 +36,12 @@ object MimaExcludes {
// Exclude rules for 2.3.x
lazy val v23excludes = v22excludes ++ Seq(
  // SPARK-22372: Make cluster submission use SparkApplication.
  ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.deploy.SparkHadoopUtil.getSecretKeyFromUserCredentials"),
This overall change is only going into Spark 2.3? Wondering about compatibility. I realize this is a DeveloperApi, but it would be nice to not change it in a minor release, I think.
The change is 2.3 only.
I've always questioned why SparkHadoopUtil is public in the first place. I'm not that worried about removing the functionality in this method because Spark apps don't really need to depend on it; it's easy to call the Hadoop API directly, and the API did nothing before except in YARN mode.
Yes, this is what I was saying, and I agree one could argue it's their fault; it's more a matter of us being nice, since we are not adding it in and they might not realize it's something they shouldn't expose. I'm not sure if there was anything that sensitive in there before, but it's still good practice not to expose it either way.
I agree written to disk isn't ideal either, but it's only local disk and not HDFS, so a user would have to compromise the box, vs. someone simply accidentally allowing access to env/conf via a UI or console. Reviewing changes now.
+1
Merging to master.
The main goal of this change is to allow multiple cluster-mode
submissions from the same JVM, without having them end up with
mixed configuration. That is done by extending the SparkApplication
trait, and doing so was reasonably trivial for standalone and
mesos modes.

For YARN mode, there was a complication. YARN used a "SPARK_YARN_MODE"
system property to control behavior indirectly in a whole bunch of
places, mainly in the SparkHadoopUtil / YarnSparkHadoopUtil classes.
Most of the changes here are removing that.

Since we removed support for Hadoop 1.x, some methods that lived in
YarnSparkHadoopUtil can now live in SparkHadoopUtil. The remaining
methods don't need to be part of the class, and can be called directly
from the YarnSparkHadoopUtil object, so now there's a single
implementation of SparkHadoopUtil.

There were two places in the code that relied on SPARK_YARN_MODE to
make decisions about YARN-specific functionality, and now explicitly
check the master from the configuration for that instead:

* fetching the external shuffle service port, which can come from the
  YARN configuration.
* propagation of the authentication secret using Hadoop credentials.
  This also was cleaned up a little to not need so many methods in
  SparkHadoopUtil.

With those out of the way, actually changing the YARN client to extend
SparkApplication was easy.

Tested with existing unit tests, and also by running YARN apps with
auth and kerberos both on and off in a real cluster.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #19631 from vanzin/SPARK-22372.