SPARK-1126. spark-app preliminary #86
Merged build triggered.
Merged build started.
Merged build finished.
One or more automated tests failed
System.err.println(
  "Usage: spark-app <primary binary> [options] \n" +
  "Options:\n" +
  "  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local\n" +
I'd disable the scalastyle here so that it doesn't produce build errors. In this case I think it's fine to violate the line limit:
http://www.scalastyle.org/configuration.html
Also, there is a different way to do multi-line strings in Scala, but up to you:
http://downgra.de/2010/09/14/multi-line_strings_with_scala/
Hey Sandy, the overall approach looks good, though I made some comments throughout. It would be really nice to avoid launching a second JVM if possible. It seems that the main reasons are to set environment vars or to pass arguments to the YARN launcher, but we can call the YARN launcher directly.
Also, not sure what people think about calling this "spark-submit" instead of "spark-app". For the in-cluster use case it's really just for submitting, and I imagine that case will be more popular over time.
Thanks for taking a look, Matei. If we use system properties instead of env variables, the remaining reason we'd want to start a second JVM is to be able to have a --driver-memory property. The only way around this I can think of would be to require users to set this with an environment variable instead of a command line option. One small weird thing about this is that the client would still be given the max heap specified in SPARK_DRIVER_MEMORY even when the driver is being run on the cluster.
I uploaded a new patch that takes most of the review feedback into account. Includes the following changes:
I still need to tidy up the usage string. And there's the outstanding question of whether we can avoid starting a new JVM.
Merged build triggered.
Merged build started.
Merged build finished.
One or more automated tests failed
I see, regarding the memory part, it sounds like we could do it in bash, but it might be kind of painful. We could do the following:
I agree that we shouldn't use the full memory you required if you submitted to a cluster. I'm not sure how hard it is to parse these arguments in bash -- it shouldn't be that hard, but we'll also have to do it in .cmd scripts on Windows and such. Otherwise, it would be good to test how slow this is with two JVM launches (maybe we can avoid a lot of the slowness).
I uploaded a new patch that doesn't start a new JVM and parses --driver-memory in bash. It wasn't as bad as I expected (thanks to some help from @umbrant and @atm). I've verified that it works with YARN in both deploy modes. I'm still planning to add some tests and docs, but I wanted to upload it with the new approach in case there are any comments.
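For context, the approach described above can be sketched in bash roughly as follows. This is a hedged sketch, not the actual patch: the function name is hypothetical, and only the --driver-memory flag from this thread is handled. The key point is extracting the memory setting while leaving the original argument list intact so it can be forwarded to spark-class unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of parsing --driver-memory in pure bash.
# The full argument list is preserved in ORIG_ARGS for forwarding.
parse_submit_args() {
  DRIVER_MEMORY=""
  ORIG_ARGS=("$@")   # kept unchanged, to be forwarded to spark-class later
  while [ $# -gt 0 ]; do
    if [ "$1" = "--driver-memory" ] && [ $# -gt 1 ]; then
      DRIVER_MEMORY="$2"
      shift 2
    else
      shift
    fi
  done
}

# Example invocation; flag names follow the usage string in this PR.
parse_submit_args --master yarn-cluster --driver-memory 2g app.jar
echo "$DRIVER_MEMORY"
echo "${ORIG_ARGS[*]}"
```

The same scan would have to be duplicated in the Windows .cmd scripts, which is the pain point raised above.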
Merged build triggered.
Merged build started.
fi

$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit $ORIG_ARGS
Are we envisioning a corresponding .cmd file once the review of this is done?
Yeah, though I think as a separate JIRA.
Merged build finished.
One or more automated tests failed
  System.err.println("Unknown/unsupported param " + unknownParam)
}
System.err.println(
  """Usage: spark-submit <primary binary> [options]
Would it make sense to say <application jar> instead of <primary binary>?
The thinking behind "primary binary" was that we might support binaries that aren't jars for non-Java apps.
Merged build finished.
All automated tests passed.
Some surface-level comments, but looking pretty good. Will try to test on a standalone cluster later tonight.
Updated patch addresses Patrick's comments. |
Merged build triggered.
Merged build started.
Merged build finished.
All automated tests passed.
Unlike in Spark standalone and Mesos mode, in which the master's address is specified in the "master" parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the master parameter is simply "yarn-client" or "yarn-cluster".

The spark-submit script described in the [cluster mode overview](cluster-overview.html) provides the most straightforward way to submit a compiled Spark application to YARN in either deploy mode. For info on the lower-level invocations it uses, read ahead. For running spark-shell against YARN, skip down to the yarn-client section.

## Launching a Spark application with yarn-cluster mode
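A hedged example of such a launch follows. The class name, memory setting, and jar path are illustrative placeholders (not taken from this PR); the command is echoed rather than executed here, since actually running it requires a built Spark distribution.

```shell
# Hypothetical yarn-cluster launch; --class, --driver-memory value, and the
# jar path are placeholders. Echoed for illustration only.
echo ./bin/spark-submit \
  --master yarn-cluster \
  --class org.apache.spark.examples.SparkPi \
  --driver-memory 2g \
  examples/target/spark-examples.jar
```

In client mode, the same invocation would use yarn-client as the master instead, and the driver would run locally.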
It might be useful to give an example of actual usage, e.g. running one of the examples.
If I just run ./bin/spark-submit, the usage has <primary binary> and the documentation (cluster-overview.md) seems to have <jar>. Should we make those the same? Also, it's not clear to me, if I want to run one of the examples (SparkPi) on YARN, whether the primary binary should be the examples jar or the spark jar itself. Perhaps an example would help with this, or explaining what a primary binary is. Note I haven't looked at the code in detail, so this is just from a user's point of view. I'll dig into the missing class error to figure out what I was doing wrong.
Looks like the issue with the missing Client class is due to https://spark-project.atlassian.net/browse/SPARK-1330, not this PR. Once that was fixed, I was able to run both cluster and client mode on YARN. Another thing I noticed is that the spark-submit script uses --arg and the spark-class script uses --args. Not a big deal, just want to make sure we want arg vs. args. I don't have a strong opinion on it, but if people are used to using spark-class, it's just a change. It is a bit unfortunate that we still have to specify the first arg as yarn-client or yarn-cluster for the spark examples so it can be passed to SparkContext, but I guess there isn't much we can do about that, since if it was real user code, the user could have it hardcoded or put it as any argument (not just the first one). Great work Sandy! It's nice to have this easier interface.
Thanks for the feedback, Tom. Regarding "primary binary" and "jar", to clear up confusion I'm just going to call it "app jar" for now and if/when we add support for non-jar binaries we can find something more suitable. Regarding arg vs. args, I found the plural in args confusing - it makes it seem like the parameter should take multiple values when in fact it takes a single value and can be specified multiple times. Other parameters with plurals, like "jars", don't work this way. We could possibly add --arg and deprecate --args for the spark-class way? |
I guess it's actually the YARN ClientArguments that takes --args, not spark-class directly. I would be in favor of adding --arg and deprecating --args. With the spark-submit script I expect it to be hidden from most people going forward anyway.
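One hedged way to handle the deprecation compatibly, sketched in bash (the function name and rewrite logic are hypothetical, not part of this patch; only the flag names come from this thread): accept the old --args spelling, warn, and rewrite it to --arg before forwarding.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: rewrite the deprecated --args flag to --arg,
# warning on stderr, before forwarding the argument list.
rewrite_deprecated_args() {
  NEW_ARGS=()
  while [ $# -gt 0 ]; do
    if [ "$1" = "--args" ] && [ $# -gt 1 ]; then
      echo "Warning: --args is deprecated; use --arg instead" >&2
      NEW_ARGS+=("--arg" "$2")
      shift 2
    else
      NEW_ARGS+=("$1")
      shift
    fi
  done
}

# Example: a legacy invocation gets translated transparently.
rewrite_deprecated_args --master yarn-cluster --args 10 app.jar
echo "${NEW_ARGS[*]}"
```

This keeps existing spark-class users working while nudging them toward the new flag.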
I'm also +1 on moving to --arg.
Can one of the admins verify this patch? |
Hey @sryza I'm going to submit a PR with some suggested follow-on changes, but I think we can go ahead and merge this for now as a starting point. Thanks for your work on this! |
This is a starting version of the spark-app script for running compiled binaries against Spark. It still needs tests and some polish. The only testing I've done so far has been using it to launch jobs in yarn-standalone mode against a pseudo-distributed cluster.

This leaves out the changes required for launching python scripts. I think it might be best to save those for another JIRA/PR (while keeping to the design so that they won't require backwards-incompatible changes).

Author: Sandy Ryza <sandy@cloudera.com>

Closes apache#86 from sryza/sandy-spark-1126 and squashes the following commits:

d428d85 [Sandy Ryza] Commenting, doc, and import fixes from Patrick's comments
e7315c6 [Sandy Ryza] Fix failing tests
34de899 [Sandy Ryza] Change --more-jars to --jars and fix docs
299ddca [Sandy Ryza] Fix scalastyle
a94c627 [Sandy Ryza] Add newline at end of SparkSubmit
04bc4e2 [Sandy Ryza] SPARK-1126. spark-submit script
…pache#86) * Check for user jars/files existence before creating the driver pod. Close apache-spark-on-k8s#85 * CR
Added Hive support, as well as SparkR