[SPARK-10658][SPARK-11421][PYSPARK][CORE] Provide add jars to py spark api #9313
Conversation
…rom python + add a util function to simplify. Follow-up should make KafkaUtils & related use the util function
Test build #44471 has finished for PR 9313 at commit
Aside: I think that another Py4J contributor almost fixed the JAR thing at some point, but if I recall correctly their patch got bogged down with some OSGi-related issues that caused it to go stale and not get merged. I'd be interested to see if we can eventually fix this upstream in Py4J via a smaller, more focused patch there.
I think we would still want some of this patch, since it adds the classes to the class loader for the currently running JVM; if the Py4J class loader issue were fixed, this would be even more useful.
Test build #44497 has finished for PR 9313 at commit
Does the Scala SparkContext#addJar add the jar to the driver classpath? My impression was that it does not. If so, this would be a little inconsistent, right?
    Loads a JVM class using the MutableURLClassLoader used by Spark.
    This function exists because Py4J uses a different class loader.
    """
    self._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(className)
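To make the motivation for this helper concrete, here is a toy Python model of the SPARK-5185 situation it works around. This is not Spark or Py4J code; the "loaders" are plain sets and every name in it is invented for illustration:

```python
# Invented toy model: two "class loaders" with different views of the world.
# Spark's mutable loader has seen the newly added jar; the loader Py4J
# resolves classes through by default has not.
spark_mutable_loader = {"com.example.MyClass"}  # sees jars added via addJar
py4j_default_loader = set()                     # stale view, jar not visible

def load_class(loader, name):
    """Resolve a class name against a loader's view, loosely like loadClass."""
    if name not in loader:
        raise ImportError(f"class not found: {name}")
    return name

# Resolving through Py4J's own loader fails (ClassNotFoundException in the JVM)...
try:
    load_class(py4j_default_loader, "com.example.MyClass")
except ImportError:
    pass

# ...while routing the load through the context (mutable) loader succeeds,
# which is what the helper above does via Thread.currentThread().
print(load_class(spark_mutable_loader, "com.example.MyClass"))
```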
In Utils.getContextOrSparkClassLoader, it seems like the context classloader can sometimes be null. Is that possible here as well?
Based on http://www.javaworld.com/article/2077344/core-java/find-a-way-out-of-the-classloader-maze.html, every thread has a class loader associated with it unless it was created by native code. This is also the technique @tdas used for getting the class loader in the Python Kafka utils, although this did come up in https://issues.apache.org/jira/browse/SPARK-1403. I'll fix it here and make a follow-up JIRA to update the Kafka utils in PySpark to use the common methodology.
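The null-safe fallback that `Utils.getContextOrSparkClassLoader` performs can be sketched in plain Python. This is a hedged illustration of the pattern, not Spark's code, and the function and parameter names are mine:

```python
def get_context_or_spark_loader(context_loader, spark_loader):
    """Mirror of the fallback pattern in Spark's Utils.getContextOrSparkClassLoader:
    a thread created by native code can have a null (here: None) context class
    loader, so fall back to the loader that loaded Spark's own classes."""
    return context_loader if context_loader is not None else spark_loader

# Normal case: the thread's context loader is used.
print(get_context_or_spark_loader("contextLoader", "sparkLoader"))

# Native-thread case: context loader is None, so fall back.
print(get_context_or_spark_loader(None, "sparkLoader"))
```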
Fixed, follow up JIRA for streaming is SPARK-11397
…e can have a null context class loader (see SPARK-1403)
@sryza it does not. We could make it a flag so the default behaviour stays consistent, although this behaviour is in line with addPyFile, which does add the file to the currently running context.
Test build #44567 has finished for PR 9313 at commit
Could we add that flag to the Scala API as well? Would that break binary compatibility?
We can try and see what the binary compatibility checker says.
…d expose it through the general spark context
Test build #44670 has finished for PR 9313 at commit
@sryza so I've added it to the Scala API - what are your thoughts on this approach?
 * sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass("class name")
 * as described in SPARK-5185.
 */
def updatePrimaryClassLoader(sc: SparkContext) {
What's the advantage of this update method, which compares the full contents of sc.addedJars to those already registered with the current classloader, vs. just calling cl.addURL directly on the jar being added?
A possibly confusing effect of this approach is that if I have code like:
sc.addJar(jar1)
... other stuff blah blah blah ...
sc.addJar(jar2, true)
jar1 will get added to the classpath only when sc.addJar is called the second time.
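The ordering surprise described above can be demonstrated with a toy model. This is an invented simulation of the diff-based approach under discussion, not the actual SparkContext code, and all names in it are mine:

```python
class DiffBasedClasspath:
    """Toy model of a diff-based update: jars are recorded immediately in
    added_jars, but only pushed to the 'class loader' when a call with
    update=True triggers a full reconciliation of the two lists."""
    def __init__(self):
        self.added_jars = []   # loose analogue of sc.addedJars
        self.classpath = []    # loose analogue of URLs on the mutable loader

    def add_jar(self, jar, update=False):
        self.added_jars.append(jar)
        if update:  # reconcile everything recorded so far, not just this jar
            for j in self.added_jars:
                if j not in self.classpath:
                    self.classpath.append(j)

cp = DiffBasedClasspath()
cp.add_jar("jar1")               # jar1 recorded, but NOT on the classpath yet
cp.add_jar("jar2", update=True)  # both jars land on the classpath at once
print(cp.classpath)              # ['jar1', 'jar2']
```

Calling addURL directly on the jar being added would avoid this deferred effect, which is the reviewer's point.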
Ah, that's a good point; it made a bit more sense when this was being done only for Python and there wasn't a flag. I'll switch it around.
…er, not all classes
Test build #44848 has finished for PR 9313 at commit
addJar(path, false)
}

def addJar(path: String, addToCurrentThread: Boolean) {
Would it be correct to say that in nearly all cases, setting the second argument to true will result in the jar being added to all of the application's threads? Because Spark sets the context classloader to a MutableClassLoader before loading the application's main class, and then all other app threads will inherit this as the default unless they explicitly change the context class loader?
I think that is correct.
In that case, maybe it would be good to call this out explicitly in the doc? Also, the naming addToCurrentThread makes it seem like the jar will be added only to the current thread, so something like addToContextClassLoader would be clearer.
Test build #44952 has finished for PR 9313 at commit
@@ -18,6 +18,7 @@
package org.apache.spark.api.python

import java.io.File
import java.net.{URL, URI}
Nit: alphabetize
val currentCL = Utils.getContextOrSparkClassLoader
currentCL match {
  case cl: MutableURLClassLoader => cl.addURL(new URI(key).toURL())
  case _ => logWarning(s"Unsupported cl $currentCL will not update jars thread cl")
My opinion is that it's better to throw an exception here. Though as that's an API behavior question, probably best to defer to a core maintainer? @pwendell any thoughts? And thoughts on adding this option to addJar in general?
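The API question being raised (warn vs. throw on an unsupported loader) could be sketched like this. It is a toy Python simulation with invented names, and the strict flag is hypothetical, not something proposed in the PR:

```python
import warnings

class MutableURLLoader:
    """Stand-in for a class loader that can accept new URLs at runtime,
    loosely analogous to Spark's MutableURLClassLoader."""
    def __init__(self):
        self.urls = []
    def add_url(self, url):
        self.urls.append(url)

def add_jar_url(loader, url, strict=False):
    """Only a mutable loader can accept new URLs. On anything else, either
    warn (the patch's current behavior) or raise (the reviewer's suggestion,
    gated here behind a hypothetical strict flag)."""
    if isinstance(loader, MutableURLLoader):
        loader.add_url(url)
    elif strict:
        raise TypeError(f"Unsupported class loader: {loader!r}")
    else:
        warnings.warn(f"Unsupported class loader {loader!r}; jar not added")

loader = MutableURLLoader()
add_jar_url(loader, "file:/tmp/dep.jar")
print(loader.urls)  # ['file:/tmp/dep.jar']
```

Raising makes the failure visible to the caller immediately, at the cost of turning a previously silent no-op into an API-breaking error.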
Test build #45307 has finished for PR 9313 at commit
@sryza Any additional thoughts? Should we try pinging some people in the core group?
Test build #48498 has finished for PR 9313 at commit
I think the ClosureCleaner suite failure is unrelated. Jenkins, retest this please.
Test build #48508 has finished for PR 9313 at commit
Jenkins, retest this please.
Test build #48745 has finished for PR 9313 at commit
It doesn't seem like there is interest in having this added right now, so I'll close. But if this is something we want to add to the API in the future, I'm more than happy to re-open and update.
Any updates on this? Seems like a nice thing to have.
Jenkins, retest this please.
Since Py4J now uses the context classloader, we can remove the Python pieces about loading a class by name. @holdenk if you want, I can revisit this PR. This case occurs for me specifically because I have Python modules that bundle their jars with them, and when using spark-submit it is rather tedious to have to manually muck around with the classloader under Python. We can probably also add it to SparkR, since I assume they have similar requirements to the PySpark side.
This does some work to allow dynamically adding jars to a running PySpark instance. It still suffers from some of the restrictions mentioned in SPARK-5185, but provides a utility method for dealing with that. Should we eventually fix the class loader used by Py4J, this should continue to work and we can kill the helper method.
Something to note for reviewers: I'm using the test JAR also used by R. I could make a copy of it, but I figured referencing it would be OK; I want to double-check that.