
[SPARK-10658][SPARK-11421][PYSPARK][CORE] Provide add jars to py spark api #9313

Closed

Conversation

holdenk
Contributor

@holdenk holdenk commented Oct 27, 2015

This does some work to allow dynamically adding jars (and their classes) to a running PySpark instance. It still suffers from some of the restrictions mentioned in SPARK-5185, but provides a utility method for dealing with that. Should we eventually fix the class loader used by Py4J, this should continue to work and we can kill the helper method.

Something to note for reviewers: I'm using the test JAR that is also used by R. I could make a copy of it, but I figured referencing it would be OK; I want to double-check that.

…rom python + add a util function to simplify. Follow-up should make KafkaUtils & related use the util function
@SparkQA

SparkQA commented Oct 28, 2015

Test build #44471 has finished for PR 9313 at commit 3b8b1b0.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * warning. Intended for use by addJar(), although constructing an instance of the class will
      * sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass("class name")
      case _ => logWarning(s"Unsupported class loader $currentCL will not update jars")

@JoshRosen
Contributor

Aside: I think that another Py4J contributor almost fixed the JAR thing at some point, but if I recall correctly their patch got bogged down with some OSGi-related issues, went stale, and never got merged. I'd be interested to see if we can eventually fix this upstream in Py4J via a smaller, more focused patch there.

@holdenk
Contributor Author

holdenk commented Oct 28, 2015

I think we'd still want some of this patch, since it adds the classes to the class loader of the currently running JVM; if the Py4J class loader issue got fixed, this would be even more useful.

@SparkQA

SparkQA commented Oct 28, 2015

Test build #44497 has finished for PR 9313 at commit b8b8d72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * warning. Intended for use by addJar(), although constructing an instance of the class will
      * sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass("class name")
      case _ => logWarning(s"Unsupported class loader $currentCL will not update jars")

@holdenk holdenk changed the title [SPARK-10658][PYSPARK][WIP] Provide add jars to py spark api [SPARK-10658][PYSPARK] Provide add jars to py spark api Oct 29, 2015
@sryza
Contributor

sryza commented Oct 29, 2015

Does the Scala SparkContext#addJar add the jar to the driver classpath? My impression was that it does not. If so, this would be a little inconsistent, right?

Loads a JVM class using the MutableURLClassLoader used by Spark.
This function exists because Py4J uses a different class loader.
"""
self._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(className)
Contributor

In Utils.getContextOrSparkClassLoader, it seems like the context classloader can sometimes be null. Is that possible here as well?

Contributor Author

Based on http://www.javaworld.com/article/2077344/core-java/find-a-way-out-of-the-classloader-maze.html, every thread has a class loader associated with it unless it was created by native code. This is also the technique @tdas used for getting the class loader in the Python Kafka utils, although the null case did come up in https://issues.apache.org/jira/browse/SPARK-1403. I'll fix it here and file a follow-up JIRA to update the PySpark Kafka utils to use the common methodology.
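(For reference, a minimal sketch of the null-safe pattern that Utils.getContextOrSparkClassLoader follows, falling back to the loader that loaded Spark itself; the method name below is illustrative:)

def contextOrSparkClassLoader: ClassLoader = {
  // A thread created by native code can have a null context classloader
  // (SPARK-1403), so fall back to the loader that loaded this class.
  Option(Thread.currentThread().getContextClassLoader)
    .getOrElse(getClass.getClassLoader)
}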

Contributor Author

Fixed, follow up JIRA for streaming is SPARK-11397

…e can have a null context class loader (see SPARK-1403)
@holdenk
Contributor Author

holdenk commented Oct 29, 2015

@sryza it does not. We could make it a flag so the default behaviour is more consistent, although this behaviour is in line with addPyFile, which does add the file to the currently running context.

@SparkQA

SparkQA commented Oct 29, 2015

Test build #44567 has finished for PR 9313 at commit a0fdddb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * warning. Intended for use by addJar(), although constructing an instance of the class will
      * sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass("class name")
      case _ => logWarning(s"Unsupported class loader $currentCL will not update jars")

@sryza
Contributor

sryza commented Oct 29, 2015

Could we add that flag to the Scala API as well? Would that break binary compatibility?

@holdenk
Contributor Author

holdenk commented Oct 29, 2015

We can try it and see what the binary compatibility checker says. I'd move the mutating class loader stuff to the general Utils class then as well. Should we maybe ping one of the core reviewers as well?


@holdenk holdenk changed the title [SPARK-10658][PYSPARK] Provide add jars to py spark api [SPARK-10658][PYSPARK][CORE] Provide add jars to py spark api Oct 30, 2015
@holdenk holdenk changed the title [SPARK-10658][PYSPARK][CORE] Provide add jars to py spark api [SPARK-10658][SPARK-11421][PYSPARK][CORE] Provide add jars to py spark api Oct 30, 2015
@holdenk holdenk changed the title [SPARK-10658][SPARK-11421][PYSPARK][CORE] Provide add jars to py spark api [SPARK-10658][SPARK-11421][PYSPARK][CORE][WIP] Provide add jars to py spark api Oct 30, 2015
…d expose it through the general spark context
@SparkQA

SparkQA commented Oct 30, 2015

Test build #44670 has finished for PR 9313 at commit 371fc7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * warning. Intended for use by addJar(), although constructing an instance of the class will
      * sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass("class name")

@holdenk holdenk changed the title [SPARK-10658][SPARK-11421][PYSPARK][CORE][WIP] Provide add jars to py spark api [SPARK-10658][SPARK-11421][PYSPARK][CORE] Provide add jars to py spark api Nov 2, 2015
@holdenk
Contributor Author

holdenk commented Nov 2, 2015

@sryza so I've added it to the Scala API. What are your thoughts on this approach?

* sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass("class name")
* as described in SPARK-5185.
*/
def updatePrimaryClassLoader(sc: SparkContext) {
Contributor

What's the advantage of this update method, which compares the full contents of sc.addedJars to those already registered with the current classloader, vs. just calling cl.addURL directly on the jar being added?

A possibly confusing effect of this approach is that if I have code like:

sc.addJar(jar1)
... other stuff blah blah blah ...
sc.addJar(jar2, true)

jar1 will get added to the classpath only when sc.addJar is called the second time.
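(A sketch of that alternative, assuming the context classloader is Spark's MutableURLClassLoader as in the diff excerpts in this thread; the helper name is illustrative:)

import java.net.URI
import org.apache.spark.util.{MutableURLClassLoader, Utils}

// Add just this jar's URL at addJar() time, instead of diffing the full
// contents of sc.addedJars against the classloader later.
def addSingleJarToContextLoader(path: String): Unit = {
  Utils.getContextOrSparkClassLoader match {
    case cl: MutableURLClassLoader => cl.addURL(new URI(path).toURL)
    case other => // unsupported loader; warn or throw (see discussion below)
  }
}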

Contributor Author

Ah, that's a good point. It made a bit more sense when this was being done only for Python and there wasn't a flag. I'll switch it around.

@SparkQA

SparkQA commented Nov 2, 2015

Test build #44848 has finished for PR 9313 at commit 72b9f36.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

addJar(path, false)
}

def addJar(path: String, addToCurrentThread: Boolean) {
Contributor

Would it be correct to say that in nearly all cases, setting the second argument to true will result in the jar being added to all of the application's threads? Because Spark sets the context classloader to a MutableClassLoader before loading the application's main class, and then all other app threads will inherit this as the default unless they explicitly change the context class loader?
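(A plain-JVM illustration of that inheritance behavior, independent of Spark; a child thread picks up its creator's context classloader unless it is explicitly overridden:)

val parentLoader = Thread.currentThread().getContextClassLoader
val t = new Thread(new Runnable {
  // The Thread constructor copies the creating thread's context classloader,
  // so this assertion holds unless someone calls setContextClassLoader.
  def run(): Unit =
    assert(Thread.currentThread().getContextClassLoader eq parentLoader)
})
t.start()
t.join()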

Contributor Author

I think that is correct.

Contributor

In that case, maybe good to call this out explicitly in the doc? Also, the naming addToCurrentThread makes it seem like it'll get added only to the current thread, so maybe something like addToContextClassLoader would be more clear.

@SparkQA

SparkQA commented Nov 3, 2015

Test build #44952 has finished for PR 9313 at commit 2467ce5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -18,6 +18,7 @@
package org.apache.spark.api.python

import java.io.File
import java.net.{URL, URI}
Contributor

Nit: alphabetize
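(i.e. the alphabetized form would be:)

import java.net.{URI, URL}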

val currentCL = Utils.getContextOrSparkClassLoader
currentCL match {
case cl: MutableURLClassLoader => cl.addURL(new URI(key).toURL())
case _ => logWarning(s"Unsupported cl $currentCL will not update jars thread cl")
Contributor

My opinion is that it's better to throw an exception here. Though as that's an API behavior question, probably best to defer to a core maintainer? @pwendell any thoughts? And thoughts on adding this option to addJar in general?
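(For concreteness, the throwing variant might look like the following sketch; the exception type is illustrative:)

Utils.getContextOrSparkClassLoader match {
  case cl: MutableURLClassLoader => cl.addURL(new URI(key).toURL)
  case other =>
    throw new UnsupportedOperationException(
      s"Class loader $other does not support adding jars at runtime")
}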

@SparkQA

SparkQA commented Nov 8, 2015

Test build #45307 has finished for PR 9313 at commit d9d1375.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor Author

holdenk commented Dec 1, 2015

@sryza Any additional thoughts? Should we try pinging some people in the core group?

@SparkQA

SparkQA commented Dec 30, 2015

Test build #48498 has finished for PR 9313 at commit bf3e98f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor Author

holdenk commented Dec 30, 2015

I think the ClosureCleaner suite failure is unrelated. Jenkins, retest this please.

@SparkQA

SparkQA commented Dec 30, 2015

Test build #48508 has finished for PR 9313 at commit bf3e98f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor Author

holdenk commented Jan 5, 2016

Jenkins, retest this please.

@SparkQA

SparkQA commented Jan 5, 2016

Test build #48745 has finished for PR 9313 at commit bf3e98f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor Author

holdenk commented Feb 6, 2016

There doesn't seem to be interest in having this added right now, so I'll close. But if this is something we want to add to the API in the future, I'm more than happy to re-open and update.

@holdenk holdenk closed this Mar 21, 2016
@vojnovski

Any updates on this? Seems like a nice thing to have.

@vojnovski

Jenkins, retest this please.

@mariusvniekerk
Member

So since Py4J now uses the context classloader, we can remove the Python pieces about loading a class by name.

@holdenk If you want I can revisit this PR.

This case comes up for me specifically because I have Python modules that bundle their jars with them, and when using spark-submit it is rather tedious to have to manually muck around with the classloader from Python.

We could probably also add it to SparkR, since I assume it has similar requirements to the PySpark side.
