SPARK-1126. spark-app preliminary #86
Merged build triggered.
Merged build started.
Merged build finished.
One or more automated tests failed
System.err.println(
  "Usage: spark-app <primary binary> [options] \n" +
  "Options:\n" +
  "  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local\n" +
I'd disable the scalastyle here so that it doesn't produce build errors. In this case I think it's fine to violate the line limit:
http://www.scalastyle.org/configuration.html
Also, there is a different way to do multi-line strings in Scala, but up to you:
http://downgra.de/2010/09/14/multi-line_strings_with_scala/
Hey Sandy, the overall approach looks good, though I made some comments throughout. It would be really nice to avoid launching a second JVM if possible. It seems that the main reasons are to set environment vars or to pass arguments to the YARN launcher, but we can call the YARN launcher directly.
Also, not sure what people think about calling this "spark-submit" instead of "spark-app". For the in-cluster use case it's really just for submitting, and I imagine that case will be more popular over time.
Thanks for taking a look, Matei. If we use system properties instead of env variables, the remaining reason we'd want to start a second JVM is to be able to have a --driver-memory property. The only way around this I can think of would be to require users to set this with an environment variable instead of a command line option. One small weird thing about this is that the client would still be given the max heap specified in SPARK_DRIVER_MEMORY even when the driver is being run on the cluster.
I uploaded a new patch that takes most of the review feedback into account. Includes the following changes:
I still need to tidy up the usage string. And there's the outstanding question of whether we can avoid starting a new JVM.
Merged build triggered.
Merged build started.
Merged build finished.
One or more automated tests failed
I see, regarding the memory part, it sounds like we could do it in bash, but it might be kind of painful. We could do the following:
I agree that we shouldn't use the full memory you required if you submitted to a cluster. I'm not sure how hard it is to parse these arguments in bash -- it shouldn't be that hard, but we'll also have to do it in .cmd scripts on Windows and such. Otherwise, it would be good to test how slow this is with two JVM launches (maybe we can avoid a lot of the slowness).
I uploaded a new patch that doesn't start a new JVM and parses --driver-memory in bash. It wasn't as bad as I expected (thanks to some help from @umbrant and @atm). I've verified that it works with YARN in both deploy modes. I'm still planning to add some tests and docs, but I wanted to upload it with the new approach in case there are any comments.
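For context, the approach described above can be sketched in bash roughly as follows. This is a hedged sketch, not the actual patch: the function name is hypothetical, and only the --driver-memory flag from this thread is handled. The key point is extracting the memory setting while leaving the original argument list intact so it can be forwarded to spark-class unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of parsing --driver-memory in pure bash.
# The full argument list is preserved in ORIG_ARGS for forwarding.
parse_submit_args() {
  DRIVER_MEMORY=""
  ORIG_ARGS=("$@")   # kept unchanged, to be forwarded to spark-class later
  while [ $# -gt 0 ]; do
    if [ "$1" = "--driver-memory" ] && [ $# -gt 1 ]; then
      DRIVER_MEMORY="$2"
      shift 2
    else
      shift
    fi
  done
}

# Example invocation; flag names follow the usage string in this PR.
parse_submit_args --master yarn-cluster --driver-memory 2g app.jar
echo "$DRIVER_MEMORY"
echo "${ORIG_ARGS[*]}"
```

The same scan would have to be duplicated in the Windows .cmd scripts, which is the pain point raised above.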
Merged build triggered.
Merged build started.
fi

$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit $ORIG_ARGS
Are we envisioning a corresponding .cmd file once the review of this is done?
Yeah, though I think as a separate JIRA.
Merged build finished.
One or more automated tests failed
  System.err.println("Unknown/unsupported param " + unknownParam)
}
System.err.println(
  """Usage: spark-submit <primary binary> [options]
Would it make sense to say <application jar> instead of <primary binary>?
The thinking behind "primary binary" was that we might support binaries that aren't jars for non-Java apps.
Merged build finished.
All automated tests passed.
Some surface-level comments, but looking pretty good. Will try to test on a standalone cluster later tonight.
Updated patch addresses Patrick's comments. |
Merged build triggered.
Merged build started.
Merged build finished.
All automated tests passed.
Unlike in Spark standalone and Mesos mode, in which the master's address is specified in the "master" parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the master parameter is simply "yarn-client" or "yarn-cluster".

The spark-submit script described in the [cluster mode overview](cluster-overview.html) provides the most straightforward way to submit a compiled Spark application to YARN in either deploy mode. For info on the lower-level invocations it uses, read ahead. For running spark-shell against YARN, skip down to the yarn-client section.

## Launching a Spark application with yarn-cluster mode
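A hedged example of such a launch follows. The class name, memory setting, and jar path are illustrative placeholders (not taken from this PR); the command is echoed rather than executed here, since actually running it requires a built Spark distribution.

```shell
# Hypothetical yarn-cluster launch; --class, --driver-memory value, and the
# jar path are placeholders. Echoed for illustration only.
echo ./bin/spark-submit \
  --master yarn-cluster \
  --class org.apache.spark.examples.SparkPi \
  --driver-memory 2g \
  examples/target/spark-examples.jar
```

In client mode, the same invocation would use yarn-client as the master instead, and the driver would run locally.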
It might be useful to give an example of actual usage, e.g. running one of the examples.
If I just run ./bin/spark-submit, the usage has <primary binary> and the documentation (cluster-overview.md) seems to have <jar>. Should we make those the same? Also, it's not clear to me, if I want to run one of the examples (SparkPi) on YARN, whether the primary binary should be the examples jar or the spark jar itself. Perhaps an example would help with this, or explaining what a primary binary is. Note I haven't looked at the code in detail, so this is just from a user's point of view. I'll dig into the missing class error to figure out what I was doing wrong.
Looks like the issue with the missing Client class is due to https://spark-project.atlassian.net/browse/SPARK-1330, not this PR. Once that was fixed, I was able to run both cluster and client mode on YARN. Another thing I noticed is that the spark-submit script uses --arg and the spark-class script uses --args. Not a big deal, just want to make sure we want arg vs. args. I don't have a strong opinion on it, but if people are used to using spark-class, it's just a change. It is a bit unfortunate that we still have to specify the first arg as yarn-client or yarn-cluster for the spark examples so it can be passed to SparkContext, but I guess there isn't much we can do about that, since if it was real user code, the user could have it hardcoded or put it as any argument (not just the first one). Great work Sandy! It's nice to have this easier interface.
Thanks for the feedback, Tom. Regarding "primary binary" and "jar", to clear up confusion I'm just going to call it "app jar" for now and if/when we add support for non-jar binaries we can find something more suitable. Regarding arg vs. args, I found the plural in args confusing - it makes it seem like the parameter should take multiple values when in fact it takes a single value and can be specified multiple times. Other parameters with plurals, like "jars", don't work this way. We could possibly add --arg and deprecate --args for the spark-class way? |
I guess it's actually the YARN ClientArguments that takes --args, not spark-class directly. I would be in favor of adding --arg and deprecating --args. With the spark-submit script I expect it to be hidden from most people going forward anyway.
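One hedged way to handle the deprecation compatibly, sketched in bash (the function name and rewrite logic are hypothetical, not part of this patch; only the flag names come from this thread): accept the old --args spelling, warn, and rewrite it to --arg before forwarding.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: rewrite the deprecated --args flag to --arg,
# warning on stderr, before forwarding the argument list.
rewrite_deprecated_args() {
  NEW_ARGS=()
  while [ $# -gt 0 ]; do
    if [ "$1" = "--args" ] && [ $# -gt 1 ]; then
      echo "Warning: --args is deprecated; use --arg instead" >&2
      NEW_ARGS+=("--arg" "$2")
      shift 2
    else
      NEW_ARGS+=("$1")
      shift
    fi
  done
}

# Example: a legacy invocation gets translated transparently.
rewrite_deprecated_args --master yarn-cluster --args 10 app.jar
echo "${NEW_ARGS[*]}"
```

This keeps existing spark-class users working while nudging them toward the new flag.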
I'm also +1 on moving to --arg.
Can one of the admins verify this patch? |
Hey @sryza I'm going to submit a PR with some suggested follow-on changes, but I think we can go ahead and merge this for now as a starting point. Thanks for your work on this! |
This is a starting version of the spark-app script for running compiled binaries against Spark. It still needs tests and some polish. The only testing I've done so far has been using it to launch jobs in yarn-standalone mode against a pseudo-distributed cluster.

This leaves out the changes required for launching python scripts. I think it might be best to save those for another JIRA/PR (while keeping to the design so that they won't require backwards-incompatible changes).

Author: Sandy Ryza <sandy@cloudera.com>

Closes apache#86 from sryza/sandy-spark-1126 and squashes the following commits:

d428d85 [Sandy Ryza] Commenting, doc, and import fixes from Patrick's comments
e7315c6 [Sandy Ryza] Fix failing tests
34de899 [Sandy Ryza] Change --more-jars to --jars and fix docs
299ddca [Sandy Ryza] Fix scalastyle
a94c627 [Sandy Ryza] Add newline at end of SparkSubmit
04bc4e2 [Sandy Ryza] SPARK-1126. spark-submit script
…pache#86) * Check for user jars/files existence before creating the driver pod. Close apache-spark-on-k8s#85 * CR
Added Hive support, as well as SparkR