
[SPARK-11195][CORE] Use correct classloader for TaskResultGetter #9367

Closed
wants to merge 4 commits into from

Conversation

choochootrain

Make sure we are using the context classloader when deserializing failed TaskResults instead of the Spark classloader.
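
For reference, a minimal sketch of the change being described (not the exact diff; per the discussion and merged commit below, the relevant code is in `TaskResultGetter.enqueueFailedTask`):

```scala
// Simplified sketch of the fix in TaskResultGetter.enqueueFailedTask.
// Before: deserialization used Utils.getSparkClassLoader, which cannot
// see user classes from the spark-submitted jar, so a user-defined
// exception in a failed TaskResult raised ClassNotFoundException.
// After: use the context classloader, which can see them.
val loader = Utils.getContextOrSparkClassLoader
if (serializedData != null && serializedData.limit() > 0) {
  reason = serializer.get().deserialize[TaskEndReason](serializedData, loader)
}
```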

Make sure we are using the context classloader when deserializing failed
TaskResults instead of the Spark classloader.
@JoshRosen
Contributor

Jenkins, this is ok to test.

@choochootrain
Author

I have a manual test that exhibits this behavior in https://issues.apache.org/jira/browse/SPARK-11195 and I am working on adding a test to the repo.

In order to test this I basically want to mirror the classloader hierarchy created by spark-submit - are there any conventions or existing tests which do something like this that I can look at?

@JoshRosen
Contributor

Jenkins, this is ok to test.

@JoshRosen
Contributor

@brkyvz might know more about testing Spark Submit.

@yhuai
Contributor

yhuai commented Oct 30, 2015

Will SparkSubmitSuite help?

@choochootrain
Author

SparkSubmitSuite is helpful, but I want to catch and assert the type of exception that is thrown when the job fails - calling into doMain in SparkSubmit.scala would be closer?

@yhuai
Contributor

yhuai commented Oct 30, 2015

How about this: in the main method, you can catch the exception, and if it is the expected type, let the main method finish successfully. Otherwise, throw an exception, which causes a non-zero exit code. In runSparkSubmit, we will see that the exit code is not 0 and fail the test.
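
A minimal sketch of that pattern (all names are hypothetical; only the exit-code convention comes from the suggestion above):

```scala
// Hypothetical application submitted via spark-submit. It exits 0 only
// if the job fails with the expected exception; any other outcome exits
// non-zero, which runSparkSubmit reports as a test failure.
object TestApp {
  // Hypothetical job body standing in for a real Spark job; it fails
  // with the user-defined exception on purpose.
  def runJob(): Unit = throw new RuntimeException("MyException")

  def main(args: Array[String]): Unit = {
    try {
      runJob()
      System.exit(1) // the job did not fail at all: fail the test
    } catch {
      case e: Throwable if e.getMessage == "MyException" =>
        () // expected failure: fall through and exit 0, so the test passes
      case _: Throwable =>
        System.exit(1) // unexpected exception: non-zero exit fails the test
    }
  }
}
```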

@SparkQA

SparkQA commented Oct 30, 2015

Test build #44655 has finished for PR 9367 at commit 90c47aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Oct 30, 2015

This looks good as I'm not otherwise clear why these two blocks would use different classloaders.

@srowen
Member

srowen commented Nov 1, 2015

@choochootrain are you able to put together a small test like what @yhuai mentions? then I think this is good to go.

@choochootrain
Author

i'm writing the test right now, is there an easy way to get the relative path to the assembled spark jar so I can compile my job against it?

@yhuai
Contributor

yhuai commented Nov 2, 2015

@choochootrain
Author

@yhuai i can't write the test directly in SparkSubmitSuite because this is a classloader issue that only repros (as far as I can tell) when an external jar is loaded. I'm adding a test that uses TestUtils.createCompiledClass to compile and submit my external jar.

@yhuai
Contributor

yhuai commented Nov 3, 2015

@choochootrain I just noticed that this PR is for branch-1.5. Should we also fix it in master?

@choochootrain
Author

@yhuai yep this should also be in master. should I also submit a pr on master when this one looks good or will a maintainer be able to cherry-pick the commit?

@yhuai
Contributor

yhuai commented Nov 3, 2015

@choochootrain It would be great if you can submit a PR against master. Our merge script only cherry-picks commits from master to a branch.

@SparkQA

SparkQA commented Nov 4, 2015

Test build #44987 has finished for PR 9367 at commit c63ca09.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

This test compiles an external Spark job which throws a user defined
exception and asserts that Spark handles the TaskResult deserialization
properly.
@choochootrain
Author

whoops, fixed the style error.

|}
""".stripMargin)
// scalastyle:on line.size.limit
val sparkJar = "../assembly/target/scala-2.10/spark-assembly-1.5.1-hadoop2.2.0.jar"
Author

the only remaining issue is that this jar is hardcoded for the maven profiles that i am using. i was experimenting with putting the entire maven target/classes directory on the classpath but that seems equally janky. any suggestions here?

Contributor

I probably missed something. Why do we need to put Spark's assembly jar here? When we run tests, Spark's classes are already on the class path.

Contributor

btw, when you unit test SparkSubmitSuite, you have to do assembly/assembly before you can run any of these tests.

Contributor

This is what our jenkins does.

Author

in order to test this issue, i need to submit an external jar to spark-submit. i can compile my job using TestUtils, but i need to put (some version of) spark on the classpath so that javac can resolve RDD and so on.

Contributor

Why do you need to compile using Spark? Can't you just create your own exception, compile it, and pass that to Spark Submit? Then move all of the code here down to something like SimpleApplicationTest, where you create your exception through reflection and then throw it?

Contributor

Or maybe I didn't understand the issue that this patch is trying to solve very well

Contributor

Maybe you can compile a jar that just includes your exception class, and put your app below. In your app, you use reflection to create an instance of your exception.
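
A sketch of that reflection idea (the class name and jar layout are hypothetical):

```scala
// The exception class lives only in the external jar, so the test app
// cannot reference it at compile time; load, instantiate, and throw it
// via reflection instead.
val exceptionClass = Thread.currentThread().getContextClassLoader
  .loadClass("repro.MyException") // hypothetical FQCN inside the jar
val exception = exceptionClass
  .getDeclaredConstructor()
  .newInstance()
  .asInstanceOf[Exception]
throw exception
```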

Author

all this patch does is make TaskResult deserialization use Utils.getContextOrSparkClassLoader (the classloader which loaded the spark-submitted jar) instead of Utils.getSparkClassLoader (this is AppClassLoader, which only has spark classes in it). without this patch, a failed task is not able to deserialize an exception whose class is not visible to Utils.getSparkClassLoader.
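
For reference, the two helpers differ roughly like this (paraphrased from org.apache.spark.util.Utils; a sketch, not the verbatim source):

```scala
// The Spark classloader is whatever loaded Spark itself (AppClassLoader
// under spark-submit); the context variant prefers the thread's context
// classloader, which spark-submit points at the user jar, falling back
// to the Spark one when no context loader is set.
def getSparkClassLoader: ClassLoader = getClass.getClassLoader

def getContextOrSparkClassLoader: ClassLoader =
  Option(Thread.currentThread().getContextClassLoader)
    .getOrElse(getSparkClassLoader)
```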

in order to reproduce this issue, i set up a situation where Utils.getContextOrSparkClassLoader contains MyException but Utils.getSparkClassLoader does not (see https://issues.apache.org/jira/browse/SPARK-11195). this is easy to test manually with spark-submit and a user-defined exception, but turning this into an automated test is proving to be much trickier. here are the 3 options:

  • ❌ if i place all of the code into SparkSubmitSuite, the bug won't be hit because MyException will be in the root classloader and my patch makes no difference.

  • ❔ if i place all of the code into an external jar and run spark-submit, i can set up the same situation as my repro which found this bug. the issue i am running into is that i need a spark classpath in order to compile my jar. i can use the assembled jar, but this changes depending on the maven profiles that are enabled and so on.

  • ❔ i can try @brkyvz & @yhuai's hybrid approach of putting only the exception into a jar and the rest of the code into SparkSubmitSuite. i will have to do the following in order to repro this issue:

    • load the jar with MyException in a new classloader whose parent is the root classloader
    • somehow allow this classloader to be used by the driver and the executor without changing Utils.getSparkClassLoader.

    at this point am i not reimplementing spark-submit? :)

the final approach is certainly worth trying, i'll take a look at it later today.

Contributor

We should go with the simplest option that reproduces the issue. In other SparkSubmitSuite tests we used (2) but only out of necessity, where we just prepackage a jar and put it in the test resources dir. This makes it a little hard to maintain, e.g. we need a separate jar for scala-2.11.

In this case, maybe (3) is the simplest and most maintainable. It's unlikely that we'll ever have to modify MyException, but the reproduction code itself should be kept flexible. Could you give it a try?

@SparkQA

SparkQA commented Nov 4, 2015

Test build #44991 has finished for PR 9367 at commit 53f7c4c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Nov 10, 2015

@yhuai @choochootrain how is this one going? it does seem desirable to test this as much as possible. If we're having to introduce very complex mechanisms to do so and they aren't working, is there anything simpler even if less effective we can do to test it?

@yhuai
Contributor

yhuai commented Nov 10, 2015

I feel the easiest way is to have something like https://github.com/apache/spark/tree/master/sql/hive/src/test/resources/regression-test-SPARK-8489. So, we will not need to change anything in TestUtils. We just have a jar containing your class and the main object.

@choochootrain
Author

@srowen @andrewor14 @yhuai

it seems like the SPARK-8489 approach would be less invasive, but also less maintainable. any preferences? i'd like to stick with one and get this out asap.

@brkyvz
Contributor

brkyvz commented Nov 11, 2015

My vote is with (3). I feel it requires the least amount of new code that you have to write and is more maintainable.

@yhuai
Contributor

yhuai commented Nov 11, 2015

I'd like to go with 3.

@choochootrain
Author

should be less invasive now :)
note that i still have to make some minor changes to TestUtils in order for the jar to be in the correct format.

@SparkQA

SparkQA commented Nov 14, 2015

Test build #45913 has finished for PR 9367 at commit 71f8df9.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// load the exception from the jar
val loader = new MutableURLClassLoader(
  new Array[URL](0), Thread.currentThread.getContextClassLoader)
loader.addURL(jarFile.toURI.toURL)
Thread.currentThread().setContextClassLoader(loader)
Contributor

Can we set the original loader back?

Author

++

@SparkQA

SparkQA commented Nov 16, 2015

Test build #46008 timed out for PR 9367 at commit fadf2ca after a configured wait of 175m.

@SparkQA

SparkQA commented Nov 17, 2015

Test build #2073 has finished for PR 9367 at commit fadf2ca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2015

Test build #2079 has finished for PR 9367 at commit fadf2ca.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2015

Test build #2080 has finished for PR 9367 at commit fadf2ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Nov 17, 2015

@choochootrain We should also open a PR for master, right?

assert(unknownFailure.findFirstMatchIn(exceptionMessage).isEmpty)

// reset the classloader to the default value
Thread.currentThread.setContextClassLoader(originalClassLoader)
Contributor

Can we use the following pattern?

// Get the original classloader
try {
  // do our test
} finally {
  // reset our classloader
}

So, we will not mess up the classloader even if the test somehow failed.
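
Applied to the classloader swap from the earlier snippet, that pattern looks roughly like this (a sketch; jarFile comes from the test setup):

```scala
// Save the original loader, install one that can also see the test jar,
// and restore the original even if the test body throws.
val originalClassLoader = Thread.currentThread().getContextClassLoader
try {
  val loader = new MutableURLClassLoader(
    new Array[URL](0), originalClassLoader)
  loader.addURL(jarFile.toURI.toURL)
  Thread.currentThread().setContextClassLoader(loader)
  // ... run the spark-submit repro and assertions ...
} finally {
  Thread.currentThread().setContextClassLoader(originalClassLoader)
}
```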

Author

++

@yhuai
Contributor

yhuai commented Nov 17, 2015

@choochootrain Please open a PR against the master branch (this is our typical workflow), so we can fix it in master and branch-1.6. I'd prefer fixing master and branch-1.6 first and then backporting the fix to 1.5.

@choochootrain
Author

sounds good, i'll go ahead and squash the commits and submit a pr to master and branch 1.6

asfgit pushed a commit that referenced this pull request Nov 18, 2015
Make sure we are using the context classloader when deserializing failed TaskResults instead of the Spark classloader.

The issue is that `enqueueFailedTask` was using the incorrect classloader which results in `ClassNotFoundException`.

Adds a test in TaskResultGetterSuite that compiles a custom exception, throws it on the executor, and asserts that Spark handles the TaskResult deserialization instead of returning `UnknownReason`.

See #9367 for previous comments
See SPARK-11195 for a full repro

Author: Hurshal Patel <hpatel516@gmail.com>

Closes #9779 from choochootrain/spark-11195-master.

(cherry picked from commit 3cca5ff)
Signed-off-by: Yin Huai <yhuai@databricks.com>
ghost pushed a commit to dbtsai/spark that referenced this pull request Nov 18, 2015
Make sure we are using the context classloader when deserializing failed TaskResults instead of the Spark classloader.

The issue is that `enqueueFailedTask` was using the incorrect classloader which results in `ClassNotFoundException`.

Adds a test in TaskResultGetterSuite that compiles a custom exception, throws it on the executor, and asserts that Spark handles the TaskResult deserialization instead of returning `UnknownReason`.

See apache#9367 for previous comments
See SPARK-11195 for a full repro

Author: Hurshal Patel <hpatel516@gmail.com>

Closes apache#9779 from choochootrain/spark-11195-master.
asfgit pushed a commit that referenced this pull request Nov 18, 2015
Make sure we are using the context classloader when deserializing failed TaskResults instead of the Spark classloader.

The issue is that `enqueueFailedTask` was using the incorrect classloader which results in `ClassNotFoundException`.

Adds a test in TaskResultGetterSuite that compiles a custom exception, throws it on the executor, and asserts that Spark handles the TaskResult deserialization instead of returning `UnknownReason`.

See #9367 for previous comments
See SPARK-11195 for a full repro

Author: Hurshal Patel <hpatel516@gmail.com>

Closes #9779 from choochootrain/spark-11195-master.

(cherry picked from commit 3cca5ff)
Signed-off-by: Yin Huai <yhuai@databricks.com>

Conflicts:
	core/src/main/scala/org/apache/spark/TestUtils.scala
@yhuai
Contributor

yhuai commented Nov 18, 2015

@choochootrain #9779 has been merged. Can you close this one?

@choochootrain
Author

thanks!

@choochootrain choochootrain deleted the spark-11195 branch November 18, 2015 18:45