[SPARK-5341] Use maven coordinates as dependencies in spark-shell and spark-submit #4215
Test build #26136 has finished for PR 4215 at commit
Test build #26135 has finished for PR 4215 at commit
--master | --deploy-mode | --class | --name | --jars | --py-files | --files | \
--conf | --properties-file | --driver-memory | --driver-java-options | \
--master | --deploy-mode | --class | --name | --jars | --maven | --py-files | --files | \
--conf | --maven_repos | --properties-file | --driver-memory | --driver-java-options | \
Rename this to --maven-repos with a dash instead of an underscore; every other flag uses a dash.
Test build #26191 has finished for PR 4215 at commit
# modify NOT ONLY this script but also SparkSubmitArgument.scala
SUBMISSION_OPTS=()
APPLICATION_OPTS=()
while (($#)); do
  case "$1" in
    --master | --deploy-mode | --class | --name | --jars | --py-files | --files | \
    --conf | --properties-file | --driver-memory | --driver-java-options | \
    --master | --deploy-mode | --class | --name | --jars | --maven | --py-files | --files | \
For this one, maybe we could call it --packages. IMO --maven is a little confusing because it's also the name of a piece of software. Below, I'd also just say --repositories. A sketch of the corresponding Scala-side change follows.
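The script comment above is a reminder that these flags must also be recognized in SparkSubmitArgument.scala. A minimal sketch of what the Scala side of the rename could look like, with hypothetical field and object names rather than the actual patch:

object PackagesArgSketch {
  // Hypothetical holders for the two renamed flags
  var packages: String = null
  var repositories: String = null

  // Peel each flag and its value off the argument list
  def parse(args: List[String]): Unit = args match {
    case "--packages" :: value :: tail =>
      packages = value
      parse(tail)
    case "--repositories" :: value :: tail =>
      repositories = value
      parse(tail)
    case _ :: tail => parse(tail)
    case Nil =>
  }
}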
Test build #26200 has finished for PR 4215 at commit
Test build #26255 has finished for PR 4215 at commit
Test build #26256 has finished for PR 4215 at commit
Test build #26258 has finished for PR 4215 at commit
Test build #26262 has finished for PR 4215 at commit
Test build #26277 has finished for PR 4215 at commit
Interesting... The tests succeed on my local computer but fail in Jenkins... The end-to-end test that downloads spark-avro and spark-csv succeeds, which is nice. Searching for artifacts at other repositories looks like it failed, but it actually says: Test succeeded, but ended abruptly.
Test build #26493 has finished for PR 4215 at commit
Test build #26492 has finished for PR 4215 at commit
val path = SparkSubmitUtils.resolveMavenCoordinates("com.agimatec:agimatec-validation:0.9.3",
  Option("https://oss.sonatype.org/content/repositories/agimatec/"), None, true)
assert(path.indexOf("agimatec-validation") >= 0, "should find package. If it doesn't, check " +
  "if the package still exists. If it has been removed, replace the example in this test.")
It would be cool if there was some way to mock out the Maven repository so that this test isn't reliant on third-party services that we don't control; that would also allow the test to run without an internet connection.
This is kind of tricky, though, since I guess we do want to test against the actual repository at some point.
Yeah, it would be awesome if we could mock it, but on the other hand, we still want to be sure that we can access these remote repositories correctly. I would prefer to keep it for now.
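One middle ground, sketched here under stated assumptions: exercise the resolver against a Maven-layout directory on the local filesystem through a file:// URI, so no third-party service is involved. The resolveMavenCoordinates signature is copied from the test above; the coordinate and directory layout are hypothetical, and a working mock would still need a minimal POM and jar in place:

import java.io.File
import java.nio.file.Files

// Temporary directory standing in for a remote repository
val repoRoot = Files.createTempDirectory("dummy-repo").toFile
// Standard Maven layout for the hypothetical coordinate my.test:mylib:0.1
val artifactDir = new File(repoRoot, "my/test/mylib/0.1")
artifactDir.mkdirs()
// Ivy would look for mylib-0.1.pom and mylib-0.1.jar under artifactDir
val path = SparkSubmitUtils.resolveMavenCoordinates(
  "my.test:mylib:0.1", Option(repoRoot.toURI.toString), None, true)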
Test build #26558 has finished for PR 4215 at commit
@brkyvz I see that one of the TODOs is for adding Windows compatibility. Beyond the additions to the shell script command-line parsing, what features are we missing for Windows support? I've been testing a few Windows things today in a VM, so if it's just a matter of testing I'd be glad to try things out.
@JoshRosen I actually don't know what we are missing. I think it only requires testing, because the directory structure (backslashes instead of slashes) and command-line parsing should all be handled. If you can test it, I'd really appreciate it!
artifacts.map { artifactInfo =>
  val artifactString = artifactInfo.toString
  val jarName = artifactString.drop(artifactString.lastIndexOf("!") + 1)
  cacheDirectory.getAbsolutePath + "/" + jarName.substring(0, jarName.lastIndexOf(".jar") + 4)
Hardcoding / as the file separator character will probably break things on Windows; I think we should use File.separator instead.
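For instance, reusing jarName and cacheDirectory from the hunk above, the concatenation could be made portable like this; a sketch of the suggestion, not the exact fix that was pushed:

import java.io.File

val jar = jarName.substring(0, jarName.lastIndexOf(".jar") + 4)
// new File(parent, child) inserts the platform-specific separator,
// so nothing is hardcoded for Unix vs. Windows
val localPath = new File(cacheDirectory, jar).getAbsolutePath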
Good catch! Fixed it. Pushing update in a few secs
Test build #26615 has finished for PR 4215 at commit
@pwendell, I think this is in good shape to go in right before you cut the branch. Having the community test it out under many different settings and setups would help a lot. @JoshRosen, what do you think?
import org.apache.ivy.plugins.matcher.GlobPatternMatcher
import org.apache.ivy.plugins.resolver.{ChainResolver, IBiblioResolver}

import org.apache.spark.Logging
This shouldn't use the existing Spark logging framework. We actually just directly print the output elsewhere in this tool (look at uses of printStream).
Then should I just log to System.out and System.err?
Never mind, just saw the printStream in SparkSubmit.
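For reference, the pattern being pointed to looks roughly like this; a minimal sketch, with the field name mirroring the comments above rather than verified SparkSubmit code:

import java.io.PrintStream

object PrintStreamSketch {
  // Defaults to the console; tests can swap in their own stream
  var printStream: PrintStream = System.err

  def printMessage(msg: String): Unit = printStream.println(msg)
}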
}
// Log the callers for each dependency
rr.getDependencies.toArray.foreach { case dependency: IvyNode =>
  var logMsg = s"$dependency will be retrieved as a dependency for:"
After running this myself, I think your original instinct is right. Let's not bother printing this since there is already fairly thorough printing in ivy.
Test build #26685 has finished for PR 4215 at commit
LGTM pending tests.
Test build #26692 has finished for PR 4215 at commit
I merged this - thanks Burak!
[SPARK-5341] Use maven coordinates as dependencies in spark-shell and spark-submit

This PR adds support for using maven coordinates as dependencies to spark-shell. Coordinates can be provided as a comma-delimited string after the flag `--packages`. Additional remote repositories (like SonaType) can be supplied as a comma-delimited string after the flag `--repositories`. Uses the Ivy library to resolve dependencies. Unfortunately the library has no decent documentation, therefore solving more complex dependency issues can be a problem. pwendell, mateiz, mengxr

**Note: This is still a WIP. The following need to be handled:**
- [x] add docs for the methods
- [x] take local ivy cache path as an argument
- [x] add tests
- [x] add Windows compatibility
- [x] exclude unused Ivy dependencies

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #4215 from brkyvz/SPARK-5341ivy and squashes the following commits:

9215851 [Burak Yavuz] ready to merge
db2a5cc [Burak Yavuz] changed logging to printStream
9dae87f [Burak Yavuz] file separators changed
71c374d [Burak Yavuz] merge conflicts fixed
c08dc9f [Burak Yavuz] fixed merge conflicts
3ada19a [Burak Yavuz] fixed Jenkins error (hopefully) and added comment on oro
43c2290 [Burak Yavuz] fixed that ONE line
231f72f [Burak Yavuz] addressed code review
2cd6562 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into SPARK-5341ivy
85ec5a3 [Burak Yavuz] added oro as a dependency explicitly
ea44ca4 [Burak Yavuz] add oro back to dependencies
cef0e24 [Burak Yavuz] IntelliJ is just messing things up
97c4a92 [Burak Yavuz] fix more weird IntelliJ formatting
9cf077d [Burak Yavuz] fix weird IntelliJ formatting
dcf5e13 [Burak Yavuz] fix windows command line flags
3a23f21 [Burak Yavuz] excluded ivy dependencies
53423e0 [Burak Yavuz] tests added
3705907 [Burak Yavuz] remove ivy-repo as a command line argument. Use global ivy cache as default
c04d885 [Burak Yavuz] take path to ivy cache as a conf
2edc9b5 [Burak Yavuz] managed to exclude Spark and it's dependencies
a0870af [Burak Yavuz] add docs. remove unnecesary new lines
6645af4 [Burak Yavuz] [SPARK-5341] added base implementation
882c4c8 [Burak Yavuz] added maven dependency download

(cherry picked from commit 6aed719)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
This PR adds support for using maven coordinates as dependencies to spark-shell. Coordinates can be provided as a comma-delimited string after the flag --packages. Additional remote repositories (like Sonatype) can be supplied as a comma-delimited string after the flag --repositories. The Ivy library is used to resolve dependencies. Unfortunately, the library has no decent documentation, so solving more complex dependency issues can be difficult.
@pwendell, @mateiz, @mengxr
Note: This is still a WIP. The following need to be handled:
- [x] add docs for the methods
- [x] take local ivy cache path as an argument
- [x] add tests
- [x] add Windows compatibility
- [x] exclude unused Ivy dependencies
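As a usage sketch, the two flags feed the resolver entry point exercised by the unit test earlier in this conversation. The signature is taken from that test; the coordinate below is hypothetical and the argument roles are my reading, not documented behavior:

// --packages     -> comma-delimited Maven coordinates (first argument)
// --repositories -> comma-delimited extra repositories (second argument)
val jarPaths = SparkSubmitUtils.resolveMavenCoordinates(
  "com.example:example-lib:1.0.0",  // hypothetical coordinate
  Option("https://oss.sonatype.org/content/repositories/releases/"),  // illustrative repo URL
  None,  // default local ivy cache path
  true)  // same final flag as in the unit test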