
[SPARK-14540][Core] Fix remaining major issues for Scala 2.12 Support #21930

Closed · wants to merge 1 commit

Conversation

@skonto (Contributor) commented Jul 31, 2018

What changes were proposed in this pull request?

This PR addresses issues 2 and 3 in this document.

  • We modified the closure cleaner to identify closures that are implemented via the LambdaMetaFactory mechanism (serializedLambdas) (issue 2); a hedged detection sketch follows the note below.

  • We also fix the issue where overloading resolution for functions fails with Unit adaptation (scala/bug#11016). There are two options for solving the Unit issue: either add () at the end of the closure, or use the trick described in the doc; otherwise overloading resolution does not work (we are not going to eliminate either of the methods here). The compiler tries to adapt the closure to Unit and makes both methods candidates for overloading, but when the overload is polymorphic there is no ambiguity, which is the workaround implemented. This does not look great, but it serves its purpose: we need to support two different uses of addTaskCompletionListener, one that passes a TaskCompletionListener and one that passes a closure that is wrapped with a TaskCompletionListener later on (issue 3). A minimal sketch of the workaround follows this list.
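
A minimal, self-contained sketch of the overloading situation and the polymorphic-result workaround. The names `Ctx`, `Listener` and `Registrar` below are hypothetical stand-ins rather than Spark's actual `TaskContext`/`TaskCompletionListener` API; only the shape of the two overloads matters.

```scala
// Hedged sketch of the scala/bug#11016 workaround; all names are made up.
trait Ctx { def partitionId(): Int }

trait Listener { def onComplete(c: Ctx): Unit }

class Registrar {
  // Overload 1: takes an explicit listener object (a SAM type under 2.12).
  def addListener(l: Listener): Registrar = this

  // Overload 2: takes a closure. If this parameter were `Ctx => Unit`, a call
  // such as `addListener { c => c.partitionId() }` would need Unit adaptation
  // (the body returns Int), and under 2.12 overloading resolution against the
  // SAM overload above then fails. Making the result type a type parameter
  // removes the need for the adaptation, so resolution succeeds.
  def addListener[U](f: Ctx => U): Registrar =
    addListener(new Listener { def onComplete(c: Ctx): Unit = f(c) })
}

// Usage: compiles under both 2.11 and 2.12 without spelling out U.
// new Registrar().addListener { c => c.partitionId() }
```

The alternative mentioned above, appending `()` to the closure body so that it is already `Unit`-typed, sidesteps the adaptation at each call site instead of in the API.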

Note: regarding issue 1 in the doc the plan is:

Do Nothing. Don’t try to fix this as this is only a problem for Java users who would want to use 2.11 binaries. In that case they can cast to MapFunction to be able to utilize lambdas. In Spark 3.0.0 the API should be simplified so that this issue is removed.
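
Returning to the closure-cleaner change for issue 2: the sketch below is a hedged illustration, not the exact code in this PR, of how a closure produced through the LambdaMetaFactory mechanism can be recognized. Serializable lambdas carry a synthetic `writeReplace` method that returns a `java.lang.invoke.SerializedLambda`, and that proxy is what a cleaner can inspect.

```scala
import java.lang.invoke.SerializedLambda

// Hedged sketch: returns the SerializedLambda proxy if `closure` was generated
// as a serializable lambda via LambdaMetafactory, None otherwise.
def serializedLambdaOf(closure: AnyRef): Option[SerializedLambda] = {
  if (closure == null || !closure.getClass.isSynthetic) {
    // LMF-generated lambda classes are marked synthetic, so bail out early here.
    None
  } else {
    try {
      val writeReplace = closure.getClass.getDeclaredMethod("writeReplace")
      writeReplace.setAccessible(true)
      writeReplace.invoke(closure) match {
        case lambda: SerializedLambda => Some(lambda)
        case _ => None
      }
    } catch {
      case _: NoSuchMethodException => None // not a serializable lambda
    }
  }
}
```

From the proxy, accessors such as `getImplClass` and `getImplMethodName` point at the method that actually holds the closure body, which is the kind of information a bytecode visitor can then follow.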

How was this patch tested?

This was manually tested:

./build/mvn -DskipTests -Pscala-2.12 clean package
./build/mvn -Pscala-2.12 clean package -DwildcardSuites=org.apache.spark.serializer.ProactiveClosureSerializationSuite -Dtest=None
./build/mvn -Pscala-2.12 clean package -DwildcardSuites=org.apache.spark.util.ClosureCleanerSuite -Dtest=None
./build/mvn -Pscala-2.12 clean package -DwildcardSuites=org.apache.spark.streaming.DStreamClosureSuite -Dtest=None

@holdensmagicalunicorn

@skonto, thanks! I am a bot who has found some folks who might be able to help with the review: @pwendell, @cloud-fan and @mateiz

@skonto (Contributor Author) commented Jul 31, 2018

@lrytz @retronym @adriaanm @debasishg @srowen @felixcheung fyi, pls review.

@cloud-fan (Contributor)

cc @JoshRosen too

@@ -123,7 +123,7 @@ abstract class TaskContext extends Serializable {
    *
    * Exceptions thrown by the listener will result in failure of the task.
    */
-  def addTaskCompletionListener(f: (TaskContext) => Unit): TaskContext = {
+  def addTaskCompletionListener[U](f: (TaskContext) => U): TaskContext = {
Contributor:

Do we need to change this? I don't think it is a problem binary-compatibility-wise, but it seems a bit weird since we don't use the result of the function.

Member:

This is to work around scala/bug#11016 right? I'd prefer any solution that doesn't involve changing all the callers, but looks like both workarounds require something to be done. At least I'd document the purpose of U here.

That said, user code can call this right? And it would have to implement a similar change to work with 2.12? that's probably OK in the sense that any user app must make several changes to be compatible with 2.12.

I don't think 2.11 users would find there is a change to the binary API. Would a 2.11 user need to change their calls to specify a type for U with this change? Because it looks like it's not optional, given that Spark code has to change its calls. Is that not a source incompatibility?

If it's not, then, I guess I wonder if you can avoid changing all the calls in Spark?

@skonto (Contributor Author), Jul 31, 2018:

Yes, this covers that bug. So if you build with 2.11 you don't need to specify the type Unit when you make the call (I tried that), since there is no ambiguity and the compiler does not face an overloading issue. So at the source level there shouldn't be an issue. Binary compatibility is also described in the doc.
With 2.12 both addTaskCompletionListener methods end up being SAM types and the Unit adaptation causes this issue. I am not sure we can do anything more here; this is specific to 2.12, otherwise you get compilation errors for that version. @retronym or @lrytz may add more context. I certainly should document this.

Member:

OK, if it's binary- and source-compatible with existing user programs for 2.11 users, that's fine. Bets are off for 2.12 users anyway.

When the release notes are crafted for 2.4, we'll want to mention this JIRA (I'll tag it) and issues like this.

    // the outer closure's parent pointer. This will make `inner2` serializable.
    verifyCleaning(
      inner2, serializableBefore = false, serializableAfter = true, transitive = true)
    if(!ClosureCleanerSuite2.supportsLMFs) {
Member:

Nit: space after "if". scalastyle might flag that. While you're at it, what about flipping the blocks to avoid a negation? just for a tiny bit of clarity.

Contributor Author:

sure.


@SparkQA commented Jul 31, 2018

Test build #93828 has finished for PR 21930 at commit d466a9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • logDebug(s\" + cloning the object $obj of class $

@skonto (Contributor Author) commented Jul 31, 2018

This patch adds the following public classes (experimental): logDebug(s\" + cloning the object $obj of class $

Is this normal?

@hvanhovell (Contributor)

Yeah I would not worry about it

@felixcheung (Member)

I think that's a binary-incompatible breaking API change, right?
ex. https://github.com/apache/spark/pull/21930/files#diff-2b8f0f66fe5397b169d0f754e99da8d5R64

@skonto (Contributor Author) commented Aug 1, 2018

@felixcheung AFAIK, as stated (by @lrytz) in the doc, it shouldn't be. It can be tested, I guess.

@skonto (Contributor Author) commented Aug 1, 2018

@felixcheung I tested that with this simple program with Scala 2.11. The app jar was built against the officially released artifacts (2.3.1) with Scala 2.11.8:

    val spark = SparkSession
      ...
    spark.sparkContext.makeRDD(1 to 10000).foreachPartition { x =>
      TaskContext.get().addTaskCompletionListener { y: TaskContext =>
        println(s"Finishing...${y.partitionId()}...${x.length}")
      }
      x
    }

I ran this app with the 2.3.1 official distro and also with a distro built from this PR, again with Scala 2.11.6.
I got no errors; in both cases the output is:

Finishing...3...1250
Finishing...0...1250
Finishing...4...1250
Finishing...7...1250
Finishing...5...1250
Finishing...6...1250
Finishing...1...1250
Finishing...2...1250

The definition of binary compatibility states that when you change a class, client code using that class should not break (without re-compiling the client code). I hope I am not missing something here.
I also ran the same app against a distro built with Scala 2.12 from this PR. Again I did not see any issue.

@skonto (Contributor Author) commented Aug 1, 2018

@srowen I updated the PR with the minor fixes.

@SparkQA commented Aug 1, 2018

Test build #93878 has finished for PR 21930 at commit c72362b.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • logDebug(s\" + cloning the object $obj of class $

@srowen (Member) commented Aug 1, 2018

I can imagine it is binary-compatible because the change is just to generic types, that are erased here (no ClassTags or anything).

To be clear about source compatibility, do I have this right?

  • The change to addTaskCompletionListener callers is necessary for 2.12 callers
  • It isn't necessary for 2.11 callers
  • It's necessary for Spark because it compiles against both

@skonto (Contributor Author) commented Aug 1, 2018

@srowen that is my understanding too, and yes, they are erased AFAIK, but just in case (out of curiosity) I tried it...

@SparkQA commented Aug 1, 2018

Test build #93879 has finished for PR 21930 at commit 30d83f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • logDebug(s\" + cloning the object $obj of class $

@srowen (Member) left a comment:

I think it's mergeable as-is, even, but if you have time for one more pass on some nits, this could also improve a little on the code that was there.

   * @param closure the closure to check.
   */
  private def getSerializedLambda(closure: AnyRef): Option[SerializedLambda] = {
    if (scala.util.Properties.versionString.contains("2.11")) {
Member:

Ah, this part of the diff was collapsed and I didn't see it. I have a few more minor questions/suggestions on the closure cleaner change but would indeed like to get this in for 2.4, and it looks close.

Here, what about storing the result of this in a private field so as not to compute it every time? it might not be a big deal, don't know.

Contributor Author:

sure.
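
A tiny sketch of what the suggested private field could look like (the object and field names are made up, not necessarily what the final commit uses):

```scala
object ScalaVersionCheck {
  // Evaluate the Scala-version check once instead of on every call.
  val isScala2_11: Boolean =
    scala.util.Properties.versionString.contains("2.11")
}
```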

      closure.getClass.isSynthetic &&
        closure
          .getClass
          .getInterfaces.exists{x: Class[_] => x.getName.equals("scala.Serializable") }
Member:

More nits: does (_.getName.equals...) not work?

Contributor Author:

Need to test, I think I had to use it.
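
For reference, the shorthand the reviewer has in mind would look roughly like this; whether type inference accepts it at that call site is exactly what needs testing (the wrapper function name is made up):

```scala
def extendsScalaSerializable(closure: AnyRef): Boolean =
  closure.getClass.getInterfaces.exists(_.getName == "scala.Serializable")
```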


    if (isClosureCandidate) {
      try {
        val res = inspect(closure)
Member:

Inline res? just little stuff that might streamline this a bit, it doesn't matter

logDebug(" + outer objects: " + outerObjects.size)
outerObjects.foreach { o => logDebug(" " + o) }
}
if(lambdaFunc.isEmpty) {
Member:

Nit: space after if

    val declaredMethods = func.getClass.getDeclaredMethods

    if (log.isDebugEnabled) {
      logDebug(" + declared fields: " + declaredFields.size)
Member:

How about using interpolation, while bothering to update these debug statements?

    // If accessed fields is not populated yet, we assume that
    // the closure we are trying to clean is the starting one
    if (accessedFields.isEmpty) {
      logDebug(s" + populating accessed fields because this is the starting closure")
Member:

Doesn't actually need interpolation after all


    // List of outer (class, object) pairs, ordered from outermost to innermost
    // Note that all outer objects but the outermost one (first one in this list) must be closures
    var outerPairs: List[(Class[_], AnyRef)] = (outerClasses zip outerObjects).reverse
Member:

I know this was existing code but might be OK to clean this up to val outerPairs = outerClasses.zip(outerObjects).reverse

Contributor Author:

OK, I can do that.

    // Note that all outer objects but the outermost one (first one in this list) must be closures
    var outerPairs: List[(Class[_], AnyRef)] = (outerClasses zip outerObjects).reverse
    var parent: AnyRef = null
    if (outerPairs.size > 0) {
Member:

.nonEmpty? same, it was existing code, just seeing if we can scrub it

@skonto (Contributor Author), Aug 1, 2018:

Ok sure.

// closure passed is $anonfun$t$1$adapted while actual code resides in $anonfun$s$1
// visitor will see only $anonfun$s$1$adapted, so we remove the suffix, see
// https://github.com/scala/scala-dev/issues/109
val isTargetMethod = if (targetMethodName.isEmpty) {
Member:

Simplify to isTargetMethod = targetMethodName.isEmpty || ... || ...?

Contributor Author:

ok will give it a shot.

@skonto (Contributor Author) commented Aug 1, 2018

@srowen I think it's OK now. Ready to go.

@SparkQA commented Aug 2, 2018

Test build #93915 has finished for PR 21930 at commit 095aed2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • logDebug(s\" + cloning the object $obj of class $

@SparkQA commented Aug 2, 2018

Test build #93916 has finished for PR 21930 at commit af65ef3.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • logDebug(s\" + cloning the object $obj of class $

@skonto (Contributor Author) commented Aug 2, 2018

@srowen gentle ping.

@srowen (Member) commented Aug 2, 2018

Merged to master

@asfgit closed this in a657369 on Aug 2, 2018
@skonto (Contributor Author) commented Aug 2, 2018

@srowen thanks! So 2.12 will be optional for Spark 2.4? And the major version for Spark 3.0?
What is the plan?

@srowen (Member) commented Aug 2, 2018

Yes we need to also create a 2.12 build of Spark in 2.4. We might still have to label it "beta" as I still kind of suspect there's a corner case lurking here. I can't speak for 3.0, but would assume it would try to support 2.13, and not support 2.11.

@skonto (Contributor Author) commented Aug 2, 2018

Sure, @lrytz can have a second look at this; it also needs to be battle-tested.
