[SPARK-10658][SPARK-11421][PYSPARK][CORE] Provide add jars to py spark api #9313
Conversation
…rom python + add a util function to simplify. Follow-up should make KafkaUtils & related use the util function
Test build #44471 has finished for PR 9313 at commit
Aside: I think that another Py4J contributor almost fixed the JAR thing at some point, but if I recall correctly their patch got bogged down with some OSGi-related issues that caused it to go stale and not get merged. I'd be interested to see if we can eventually fix this upstream in Py4J via a smaller, more focused patch there.
I think we would still want some of this patch, since it adds the classes to the class loader for the currently running JVM; if the Py4J class loader issue were fixed, this would be even more useful.
Test build #44497 has finished for PR 9313 at commit
Does the Scala SparkContext#addJar add the jar to the driver classpath? My impression was that it does not. If so, this would be a little inconsistent, right?
    Loads a JVM class using the MutableURLClassLoader used by Spark.
    This function exists because Py4J uses a different class loader.
    """
    self._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(className)
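To make the motivation for this helper concrete, here is a toy Python model of the SPARK-5185 situation it works around. This is not Spark or Py4J code; the "loaders" are plain sets and every name in it is invented for illustration:

```python
# Invented toy model: two "class loaders" with different views of the world.
# Spark's mutable loader has seen the newly added jar; the loader Py4J
# resolves classes through by default has not.
spark_mutable_loader = {"com.example.MyClass"}  # sees jars added via addJar
py4j_default_loader = set()                     # stale view, jar not visible

def load_class(loader, name):
    """Resolve a class name against a loader's view, loosely like loadClass."""
    if name not in loader:
        raise ImportError(f"class not found: {name}")
    return name

# Resolving through Py4J's own loader fails (ClassNotFoundException in the JVM)...
try:
    load_class(py4j_default_loader, "com.example.MyClass")
except ImportError:
    pass

# ...while routing the load through the context (mutable) loader succeeds,
# which is what the helper above does via Thread.currentThread().
print(load_class(spark_mutable_loader, "com.example.MyClass"))
```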
In Utils.getContextOrSparkClassLoader, it seems like the context classloader can sometimes be null. Is that possible here as well?
Based on http://www.javaworld.com/article/2077344/core-java/find-a-way-out-of-the-classloader-maze.html, every thread has a class loader associated with it unless it was created by native code. This is also the technique @tdas used for getting the class loader in the Python Kafka utils, although this did come up in https://issues.apache.org/jira/browse/SPARK-1403. I'll fix it here and make a follow-up JIRA to update the Kafka utils in PySpark to use the common methodology.
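The null-safe fallback that `Utils.getContextOrSparkClassLoader` performs can be sketched in plain Python. This is a hedged illustration of the pattern, not Spark's code, and the function and parameter names are mine:

```python
def get_context_or_spark_loader(context_loader, spark_loader):
    """Mirror of the fallback pattern in Spark's Utils.getContextOrSparkClassLoader:
    a thread created by native code can have a null (here: None) context class
    loader, so fall back to the loader that loaded Spark's own classes."""
    return context_loader if context_loader is not None else spark_loader

# Normal case: the thread's context loader is used.
print(get_context_or_spark_loader("contextLoader", "sparkLoader"))

# Native-thread case: context loader is None, so fall back.
print(get_context_or_spark_loader(None, "sparkLoader"))
```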
Fixed, follow up JIRA for streaming is SPARK-11397
…e can have a null context class loader (see SPARK-1403)
@sryza it does not. We could make it a flag so the default behaviour stays consistent, although this behaviour is in line with addPyFile, which does add the file to the currently running context.
Test build #44567 has finished for PR 9313 at commit
Could we add that flag to the Scala API as well? Would that break binary compatibility?
We can try and see what the binary compatibility checker says.
…d expose it through the general spark context
Test build #44670 has finished for PR 9313 at commit
@sryza so I've added it to the Scala API - what are your thoughts on this approach?
 * sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass("class name")
 * as described in SPARK-5185.
 */
def updatePrimaryClassLoader(sc: SparkContext) {
What's the advantage of this update method, which compares the full contents of sc.addedJars to those already registered with the current classloader, vs. just calling cl.addURL directly on the jar being added?
A possibly confusing effect of this approach is that if I have code like:
sc.addJar(jar1)
... other stuff blah blah blah ...
sc.addJar(jar2, true)
jar1 will get added to the classpath only when sc.addJar is called the second time.
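The ordering surprise described above can be demonstrated with a toy model. This is an invented simulation of the diff-based approach under discussion, not the actual SparkContext code, and all names in it are mine:

```python
class DiffBasedClasspath:
    """Toy model of a diff-based update: jars are recorded immediately in
    added_jars, but only pushed to the 'class loader' when a call with
    update=True triggers a full reconciliation of the two lists."""
    def __init__(self):
        self.added_jars = []   # loose analogue of sc.addedJars
        self.classpath = []    # loose analogue of URLs on the mutable loader

    def add_jar(self, jar, update=False):
        self.added_jars.append(jar)
        if update:  # reconcile everything recorded so far, not just this jar
            for j in self.added_jars:
                if j not in self.classpath:
                    self.classpath.append(j)

cp = DiffBasedClasspath()
cp.add_jar("jar1")               # jar1 recorded, but NOT on the classpath yet
cp.add_jar("jar2", update=True)  # both jars land on the classpath at once
print(cp.classpath)              # ['jar1', 'jar2']
```

Calling addURL directly on the jar being added would avoid this deferred effect, which is the reviewer's point.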
Ah, that's a good point; it made a bit more sense when this was being done only for Python and there wasn't a flag. I'll switch it around.
…er, not all classes
Test build #44848 has finished for PR 9313 at commit
addJar(path, false)
}

def addJar(path: String, addToCurrentThread: Boolean) {
Would it be correct to say that in nearly all cases, setting the second argument to true will result in the jar being added to all of the application's threads? Because Spark sets the context classloader to a MutableClassLoader before loading the application's main class, and then all other app threads will inherit this as the default unless they explicitly change the context class loader?
I think that is correct.
In that case, maybe it would be good to call this out explicitly in the doc? Also, the naming addToCurrentThread makes it seem like the jar will be added only to the current thread, so something like addToContextClassLoader would be clearer.
Test build #44952 has finished for PR 9313 at commit
@@ -18,6 +18,7 @@
package org.apache.spark.api.python

import java.io.File
import java.net.{URL, URI}
Nit: alphabetize
val currentCL = Utils.getContextOrSparkClassLoader
currentCL match {
  case cl: MutableURLClassLoader => cl.addURL(new URI(key).toURL())
  case _ => logWarning(s"Unsupported cl $currentCL will not update jars thread cl")
My opinion is that it's better to throw an exception here. Though as that's an API behavior question, probably best to defer to a core maintainer? @pwendell any thoughts? And thoughts on adding this option to addJar in general?
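The API question being raised (warn vs. throw on an unsupported loader) could be sketched like this. It is a toy Python simulation with invented names, and the strict flag is hypothetical, not something proposed in the PR:

```python
import warnings

class MutableURLLoader:
    """Stand-in for a class loader that can accept new URLs at runtime,
    loosely analogous to Spark's MutableURLClassLoader."""
    def __init__(self):
        self.urls = []
    def add_url(self, url):
        self.urls.append(url)

def add_jar_url(loader, url, strict=False):
    """Only a mutable loader can accept new URLs. On anything else, either
    warn (the patch's current behavior) or raise (the reviewer's suggestion,
    gated here behind a hypothetical strict flag)."""
    if isinstance(loader, MutableURLLoader):
        loader.add_url(url)
    elif strict:
        raise TypeError(f"Unsupported class loader: {loader!r}")
    else:
        warnings.warn(f"Unsupported class loader {loader!r}; jar not added")

loader = MutableURLLoader()
add_jar_url(loader, "file:/tmp/dep.jar")
print(loader.urls)  # ['file:/tmp/dep.jar']
```

Raising makes the failure visible to the caller immediately, at the cost of turning a previously silent no-op into an API-breaking error.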
Test build #45307 has finished for PR 9313 at commit
@sryza Any additional thoughts? Should we try pinging some people in the core group?
Test build #48498 has finished for PR 9313 at commit
I think the ClosureCleaner suite failure is unrelated. Jenkins, retest this please.
Test build #48508 has finished for PR 9313 at commit
Jenkins, retest this please.
Test build #48745 has finished for PR 9313 at commit
It doesn't seem like there is interest in having this added right now, so I'll close. But if this is something we want to add to the API in the future, I'm more than happy to re-open and update.
Any updates on this? Seems like a nice thing to have.
Jenkins, retest this please.
Since Py4J now uses the context classloader, we can remove the Python pieces about loading a class by name. @holdenk if you want, I can revisit this PR. This case occurs for me specifically because I have Python modules that bundle their jars with them, and when using spark-submit it is rather tedious to have to manually muck around with the classloader under Python. We can probably also add it to SparkR, since I assume they have similar requirements to the PySpark side.
This does some work to allow dynamically adding jars to a running PySpark instance. It still suffers from some of the restrictions mentioned in SPARK-5185, but provides a utility method for dealing with that. Should we eventually fix the class loader used by Py4J, this should continue to work and we can kill the helper method.
Something to note for reviewers: I'm using the test JAR also used by R. I could make a copy of it, but I figured referencing it would be OK; I want to double-check that.