
SPARK-1126. spark-app preliminary #86

Closed
wants to merge 6 commits from sryza/sandy-spark-1126

Conversation

@sryza (Contributor) commented Mar 6, 2014

This is a starting version of the spark-app script for running compiled binaries against Spark. It still needs tests and some polish. The only testing I've done so far has been using it to launch jobs in yarn-standalone mode against a pseudo-distributed cluster.

This leaves out the changes required for launching Python scripts. I think it might be best to save those for another JIRA/PR (while keeping the design such that they won't require backwards-incompatible changes later).

@AmplabJenkins
Merged build triggered.

@AmplabJenkins
Merged build started.

@AmplabJenkins
Merged build finished.

@AmplabJenkins
One or more automated tests failed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13010/

System.err.println(
  "Usage: spark-app <primary binary> [options] \n" +
  "Options:\n" +
  "  --master MASTER_URL   spark://host:port, mesos://host:port, yarn, or local\n" +
Contributor commented:

I'd disable scalastyle here so that it doesn't produce build errors. In this case I think it's fine to violate the line limit:
http://www.scalastyle.org/configuration.html

Also, there is a different way to do multi-line strings in Scala, but it's up to you:
http://downgra.de/2010/09/14/multi-line_strings_with_scala/
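
A quick sketch of both suggestions (the usage text is illustrative, not the exact patch):

// Option 1: silence the line-length rule just around the long literal.
// scalastyle:off line.size.limit
val usageConcat =
  "Usage: spark-app <primary binary> [options] \n" +
  "Options:\n" +
  "  --master MASTER_URL   spark://host:port, mesos://host:port, yarn, or local\n"
// scalastyle:on line.size.limit

// Option 2: a triple-quoted multi-line string with stripMargin, which
// avoids the chain of + concatenations entirely.
val usageMultiline =
  """Usage: spark-app <primary binary> [options]
    |Options:
    |  --master MASTER_URL   spark://host:port, mesos://host:port, yarn, or local
    |""".stripMargin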

@mateiz (Contributor) commented Mar 6, 2014

Hey Sandy, the overall approach looks good, though I made some comments throughout. It would be really nice to avoid launching a second JVM if possible. It seems that the main reasons are to set environment vars or to pass arguments to the YARN launcher, but we can call the YARN launcher directly.

@mateiz (Contributor) commented Mar 6, 2014

Also, not sure what people think about calling this "spark-submit" instead of "spark-app". For the in-cluster use case it's really just for submitting, and I imagine that case will be more popular over time.

@sryza (Contributor, Author) commented Mar 6, 2014

Thanks for taking a look, Matei. If we use system properties instead of env variables, the remaining reason we'd want to start a second JVM is to be able to have a --driver-memory option. The only way around this I can think of would be to require users to set it with an environment variable instead of a command-line option. One slightly odd consequence is that the client JVM would still be given the max heap specified in SPARK_DRIVER_MEMORY even when the driver is being run on the cluster.

@sryza (Contributor, Author) commented Mar 6, 2014

I uploaded a new patch that takes most of the review feedback into account. It includes the following changes:

  • changes Opt to OptionAssigner and uses default parameters (see the sketch below)
  • makes deployMode a boolean instead of an int
  • renames to spark-submit
  • adds (and takes advantage of) system properties for the yarn-client configs, which can be used instead of env vars
  • removes --client-classpath
  • marks helper classes as private

I still need to tidy up the usage string. And there's the outstanding question of whether we can avoid starting a new JVM.
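
For illustration, a minimal sketch of what the renamed helper might look like (field names and types here are assumptions based on this thread, not the exact patch):

package org.apache.spark.deploy

// Sketch only: default parameters replace telescoping constructors, and
// the deploy mode is a Boolean rather than an Int code.
private[spark] case class OptionAssigner(
    value: String,             // the user-supplied value to forward
    clusterManager: Int,       // which cluster managers the option applies to (assumed encoding)
    deployOnCluster: Boolean,  // true when the driver runs inside the cluster
    clOption: String = null,   // launcher command-line flag to set, if any
    sysProp: String = null)    // system property to set instead, if any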

@AmplabJenkins
Merged build triggered.

@AmplabJenkins
Merged build started.

@AmplabJenkins
Merged build finished.

@AmplabJenkins
One or more automated tests failed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13027/

@mateiz (Contributor) commented Mar 6, 2014

I see. Regarding the memory part, it sounds like we could do it in bash, but it might be kind of painful. We could do the following (see the sketch after this list):

  • Look for just the driver memory and cluster mode arguments using bash
  • If we're not running in cluster mode, set the -Xmx and -Xms parameters when launching

I agree that we shouldn't use the full requested memory when submitting to a cluster. I'm not sure how hard it is to parse these arguments in bash; it shouldn't be that hard, but we'll also have to do it in .cmd scripts on Windows and such. Otherwise, it would be good to test how slow two JVM launches are (maybe we can avoid a lot of the slowness).
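
A minimal bash sketch of that idea (flag names and the SPARK_MEM hand-off are assumptions; spark-class sizes the JVM heap from SPARK_MEM):

# Scan the arguments for deploy mode and driver memory before any JVM starts.
# (Note: this flattening does not preserve quoted whitespace in arguments.)
ORIG_ARGS="$@"
while (($#)); do
  if [ "$1" = "--deploy-mode" ]; then
    DEPLOY_MODE="$2"
  elif [ "$1" = "--driver-memory" ]; then
    DRIVER_MEMORY="$2"
  fi
  shift
done

# Only honor --driver-memory locally: in cluster mode the driver's heap is
# sized on the cluster, not in this client JVM.
if [ -n "$DRIVER_MEMORY" ] && [ "$DEPLOY_MODE" = "client" ]; then
  export SPARK_MEM="$DRIVER_MEMORY"
fi

"$SPARK_HOME"/bin/spark-class org.apache.spark.deploy.SparkSubmit $ORIG_ARGS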

@sryza (Contributor, Author) commented Mar 7, 2014

I uploaded a new patch that doesn't start a new JVM and parses --driver-memory in bash. It wasn't as bad as I expected (thanks to some help from @umbrant and @atm).

I've verified that it works on YARN with both deploy modes. I'm still planning to add some tests and docs, but I wanted to upload the new approach in case there are any comments.

@AmplabJenkins
Merged build triggered.

@AmplabJenkins
Merged build started.

fi

$SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit $ORIG_ARGS

Contributor commented:

Are we envisioning a corresponding .cmd file once the review of this is done?

sryza (Author) replied:

Yeah, though I think as a separate JIRA.

@AmplabJenkins
Merged build finished.

@AmplabJenkins
One or more automated tests failed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13036/

System.err.println("Unknown/unsupported param " + unknownParam)
}
System.err.println(
"""Usage: spark-submit <primary binary> [options]
Contributor commented:

Would it make sense to say <application jar> instead of <primary binary>?

sryza (Author) replied:

The thinking behind "primary binary" was that we might support binaries that aren't jars for non-Java apps.

@AmplabJenkins
Merged build finished.

@AmplabJenkins
All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13451/

@pwendell (Contributor) commented:

Some surface-level comments, but looking pretty good. Will try to test on a standalone cluster later tonight.

@sryza (Contributor, Author) commented Mar 26, 2014

Updated patch addresses Patrick's comments.

@AmplabJenkins
Merged build triggered.

@AmplabJenkins
Merged build started.

@AmplabJenkins
Merged build finished.

@AmplabJenkins
All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13472/


Unlike in Spark standalone and Mesos mode, in which the master's address is specified in the "master" parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the master parameter is simply "yarn-client" or "yarn-cluster".

The spark-submit script described in the [cluster mode overview](cluster-overview.html) provides the most straightforward way to submit a compiled Spark application to YARN in either deploy mode. For info on the lower-level invocations it uses, read ahead. For running spark-shell against YARN, skip down to the yarn-client section.
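
For example, a submission might look like this (a hypothetical invocation: the jar path is a placeholder and the --class flag is assumed; the master is passed again via --arg because SparkPi reads it as its first application argument):

# Run the SparkPi example in yarn-cluster mode.
./bin/spark-submit path/to/spark-examples-assembly.jar \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --arg yarn-cluster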

## Launching a Spark application with yarn-cluster mode
Contributor commented:

It might be useful to give an example of actual usage, e.g. running one of the Spark examples.

@tgravescs (Contributor) commented:

If I just run ./bin/spark-submit, the usage has <primary binary> and the documentation (cluster-overview.md) seems to have <jar>. Should we make those the same?

Also, it's not clear to me: if I want to run one of the examples (SparkPi) on YARN, should the primary binary be the examples jar or the Spark jar itself? Perhaps an example, or an explanation of what a primary binary is, would help with this.
When I try to run it, I'm getting an error about a missing "org.apache.spark.deploy.yarn.Client" class. I'm not sure if this is related to this PR, if I'm just running it wrong, or if it's broken in master.

Note that I haven't looked at the code in detail, so this is just from a user's point of view. I'll dig into the missing-class error to figure out what I was doing wrong.

@tgravescs (Contributor) commented:

Looks like the issue with the missing Client class is due to https://spark-project.atlassian.net/browse/SPARK-1330, not this PR. Once that was fixed, I was able to run both cluster and client mode on YARN.

Another thing I noticed is that the spark-submit script uses --arg and the spark-class script uses --args. Not a big deal; I just want to make sure we want --arg vs. --args. I don't have a strong opinion on it, but if people are used to using spark-class, it's just a change.

It is a bit unfortunate that we still have to specify the first arg as yarn-client or yarn-cluster for the Spark examples so it can be passed to SparkContext, but I guess there isn't much we can do about that: if it were real user code, the user could have it hardcoded or accept it as any argument (not just the first one).

Great work, Sandy! It's nice to have this easier interface.

@sryza (Contributor, Author) commented Mar 26, 2014

Thanks for the feedback, Tom.

Regarding "primary binary" and "jar", to clear up confusion I'm just going to call it "app jar" for now and if/when we add support for non-jar binaries we can find something more suitable.

Regarding arg vs. args, I found the plural in args confusing - it makes it seem like the parameter should take multiple values when in fact it takes a single value and can be specified multiple times. Other parameters with plurals, like "jars", don't work this way. We could possibly add --arg and deprecate --args for the spark-class way?
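
To illustrate the distinction (a hypothetical invocation; the jar and class names are placeholders):

# Each --arg carries exactly one application argument, in order; passing
# three arguments means repeating the flag three times.
./bin/spark-submit my-app.jar \
  --class com.example.MyApp \
  --arg yarn-cluster \
  --arg /input/path \
  --arg /output/path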

@tgravescs (Contributor) commented:

I guess it's actually the YARN ClientArguments class that takes --args, not spark-class directly. I would be in favor of adding --arg and deprecating --args. With the spark-submit script, I expect it to be hidden from most people going forward anyway.

@pwendell (Contributor) commented:

I'm also +1 on moving to --arg instead of --args.

@AmplabJenkins

Can one of the admins verify this patch?

@pwendell (Contributor) commented:

Hey @sryza, I'm going to submit a PR with some suggested follow-on changes, but I think we can go ahead and merge this for now as a starting point. Thanks for your work on this!

@asfgit closed this in 1617816 on Mar 29, 2014
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014

Author: Sandy Ryza <sandy@cloudera.com>

Closes apache#86 from sryza/sandy-spark-1126 and squashes the following commits:

d428d85 [Sandy Ryza] Commenting, doc, and import fixes from Patrick's comments
e7315c6 [Sandy Ryza] Fix failing tests
34de899 [Sandy Ryza] Change --more-jars to --jars and fix docs
299ddca [Sandy Ryza] Fix scalastyle
a94c627 [Sandy Ryza] Add newline at end of SparkSubmit
04bc4e2 [Sandy Ryza] SPARK-1126. spark-submit script
robert3005 pushed a commit to robert3005/spark that referenced this pull request Jan 12, 2017
lins05 added a commit to lins05/spark that referenced this pull request Feb 9, 2017
lins05 added a commit to lins05/spark that referenced this pull request Apr 23, 2017
erikerlandson pushed a commit to erikerlandson/spark that referenced this pull request Jul 28, 2017
Igosuki pushed a commit to Adikteev/spark that referenced this pull request Jul 31, 2018
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
Labels: None yet
Projects: None yet
7 participants