
[SPARK-3256] Added support for :cp <jar> that was broken in Scala 2.10.x for REPL #1929

Closed
wants to merge 4 commits into from

@chipsenkbeil commented Aug 13, 2014

As seen with SI-6502 of Scala, the :cp command was broken in Scala 2.10.x. As the Spark shell is a friendly wrapper on top of the Scala REPL, it is also affected by this problem.

My solution was to alter the internal classpath and invalidate any new entries. I also had to add the ability to add new entries to the parent classloader of the interpreter (SparkIMain's global).

The advantage of this versus wiping the interpreter and replaying all of the commands is that you don't have to worry about rerunning heavy Spark-related commands (going to the cluster) or potentially reloading data that might have changed. Instead, you get to work from where you left off.

Until this is fixed upstream for 2.10.x, I had to use reflection to alter the internal compiler classpath.
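In rough outline, that reflection-based piece looks something like the sketch below. This is illustrative only, not the exact patch: it assumes the interpreter's Global uses a JavaPlatform whose merged classpath sits behind a private currentClassPath var (hence the reflective setter lookup), and the helper name is made up.

    import java.net.URL
    import scala.tools.nsc.Global
    import scala.tools.nsc.backend.JavaPlatform
    import scala.tools.nsc.io.AbstractFile
    import scala.tools.nsc.util.MergedClassPath

    // Sketch: append new jar entries to the compiler's merged classpath via
    // reflection, then ask the compiler to (re)load the affected packages.
    def updateCompilerClassPath(global: Global, urls: URL*): Unit = {
      val platform = global.platform.asInstanceOf[JavaPlatform]
      val existing = platform.classPath.asInstanceOf[MergedClassPath[AbstractFile]]

      // Build a new merged classpath: old entries plus entries for the new jars
      val newEntries = urls.map(url => existing.context.newClassPath(AbstractFile.getURL(url)))
      val merged     = new MergedClassPath(existing.entries ++ newEntries, existing.context)

      // The classpath var has no public setter in 2.10.x, so reach it reflectively
      val setter = platform.getClass.getMethods
        .find(_.getName.endsWith("currentClassPath_$eq"))
        .getOrElse(sys.error("classpath setter not found"))
      setter.invoke(platform, Some(merged))

      // Invalidate the new entries so the compiler picks them up
      global.invalidateClassPathEntries(urls.map(_.getPath): _*)
    }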

The solution now looks like this:
screen shot 2014-08-13 at 3 46 02 pm

@AmplabJenkins commented Aug 13, 2014

Can one of the admins verify this patch?

val totalClasspath = ClassPath.join(settings.classpath.value, addedClasspath)
echo("Added '%s'. Your new classpath is:\n\"%s\"".format(f.path, totalClasspath))
replay()
intp.addUrlsToClassPath(f.toURI.toURL)

@chipsenkbeil (Author) commented Aug 13, 2014

This adds the specified jar to both our compile time and runtime classpaths in the interpreter.

@@ -130,6 +131,9 @@ import org.apache.spark.util.Utils
private var _classLoader: AbstractFileClassLoader = null // active classloader
private val _compiler: Global = newCompiler(settings, reporter) // our private compiler

private trait ExposeAddUrl extends URLClassLoader { def addNewUrl(url: URL) = this.addURL(url) }
private var _runtimeClassLoader: URLClassLoader with ExposeAddUrl = null // wrapper exposing addURL

@chipsenkbeil (Author) commented Aug 13, 2014

I needed to add a parent loader that explicitly exposed the addURL method so I could add new jars to our runtime, otherwise just adding to the compile time classpath would allow you to import the classes but not use them in the interpreter.

* @note Currently only supports jars, not directories
* @param urls The list of items to add to the compile and runtime classpaths
*/
def addUrlsToClassPath(urls: URL*): Unit = {

@chipsenkbeil (Author) commented Aug 13, 2014

This is the publicly-available method I added to SparkIMain to allow you to add jars to the compile and runtime classpaths of the interpreter (global).
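In outline, the body roughly amounts to the two calls below (a simplified sketch; the exact final form, including an extra initialization step, is shown later in this thread):

    def addUrlsToClassPath(urls: URL*): Unit = {
      urls.foreach(_runtimeClassLoader.addNewUrl) // runtime: make the new classes loadable
      updateCompilerClassPath(urls: _*)           // compile time: let the REPL compile against them
    }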

@heathermiller commented Aug 26, 2014

so if the first line in this method is:

new Run  // force some initialization

as the comment says, it forces some initialization, and the bugs below seem to disappear in Scala (no idea about Java, but it should likely be fine)

@pwendell (Contributor) commented Aug 13, 2014

@ScrapCodes can you look at this one?

@mateiz (Contributor) commented Aug 13, 2014

Just curious, is this what the Scala team intends to do here (or did in 2.11), or could it interfere with other things in the compiler?

@@ -130,6 +131,9 @@ import org.apache.spark.util.Utils
private var _classLoader: AbstractFileClassLoader = null // active classloader
private val _compiler: Global = newCompiler(settings, reporter) // our private compiler

private trait ExposeAddUrl extends URLClassLoader { def addNewUrl(url: URL) = this.addURL(url) }

@thunterdb (Contributor) commented Aug 13, 2014

You are mixing a trait with URLClassLoader. Is there a reason for not implementing a derived class?

@chipsenkbeil (Author) commented Aug 13, 2014

I previously was using the following inline:

new URLClassLoader(...) {
  def addNewUrl(url: URL) = this.addURL(url)
}

But I was unable to access the exposed method, since that creates an anonymous subclass. In terms of providing a trait versus a class that extends URLClassLoader, I figured a trait was a lighter way to mix in the needed exposure.
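To make the typing issue concrete, here is a small self-contained sketch (hypothetical names, purely illustrative):

    import java.net.{URL, URLClassLoader}

    object ClassLoaderExposureSketch {
      // Same idea as the patch: expose the protected addURL under a new name
      trait ExposeAddUrl extends URLClassLoader {
        def addNewUrl(url: URL) = this.addURL(url)
      }

      // An anonymous subclass has no named type carrying addNewUrl, so a field
      // declared as URLClassLoader cannot see the method:
      //   var loader: URLClassLoader = new URLClassLoader(Array.empty[URL], null) {
      //     def addNewUrl(url: URL) = this.addURL(url)
      //   }
      //   loader.addNewUrl(url)  // does not compile: not a member of URLClassLoader

      // Mixing in the trait gives the field a type that still exposes the method:
      var runtimeLoader: URLClassLoader with ExposeAddUrl =
        new URLClassLoader(Array.empty[URL], getClass.getClassLoader) with ExposeAddUrl

      def addJar(url: URL): Unit = runtimeLoader.addNewUrl(url) // compiles and works
    }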

@chipsenkbeil (Author) commented Aug 13, 2014

@mateiz I don't know what the Scala team's plans would be. I know that the only suggestion thus far was to wipe the global reference and recreate it (https://issues.scala-lang.org/browse/SI-6502?focusedCommentId=68422&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-68422), which is not ideal. There hasn't been any discussion since February on this.

At a quick glance, it doesn't look like they have fixed anything with the REPL in 2.11.2. The code was moved (https://github.com/scala/scala/blob/v2.11.2/src/repl/scala/tools/nsc/interpreter/ILoop.scala), but it has the same issue.

The 2.11.2 REPL gives me the same problem:

screen shot 2014-08-13 at 4 48 39 pm

@mateiz (Contributor) commented Aug 14, 2014

Alright, it might be good to ping them on that JIRA to make sure this doesn't break subtle things. Although obviously it does seem to work.

@ScrapCodes (Member) commented Aug 14, 2014

Looks okay to me. I was wondering if you tried Scala reflection before resorting to Java reflection?

@chipsenkbeil (Author) commented Aug 14, 2014

@ScrapCodes I recently picked up Scala and am much more familiar with Java's reflection than Scala's runtime mirrors and such. Do you have any pseudo code to express what this would look like using Scala's mirrors?

@mateiz What side effects are you concerned about? FYI, I did add a comment about this pull request asking for any side effects it might have: https://issues.scala-lang.org/browse/SI-6502?focusedCommentId=70380&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-70380

@thunterdb (Contributor) commented Aug 14, 2014

I believe @mateiz is concerned about compatibility with other Scala 2.10.x versions. We would have to make sure this works for all supported Scala versions, as the internals may change between minor releases. In that case, we might have to add different code paths for different versions. This is unlikely, though, and I think the benefits far outweigh the costs in this case.

@mateiz (Contributor) commented Aug 14, 2014

I actually wasn't super concerned about new 2.10.x releases, since I don't think they'll change the REPL, but I'm just worried that there may be assumptions in the compiler that we're violating by adding these JARs. Maybe not, but it wouldn't hurt to ask them whether this is an okay fix.

@mateiz (Contributor) commented Aug 14, 2014

BTW I'd also be okay merging it if there are only minor problems, since adding these JARs at runtime is very useful.

@som-snytt commented Aug 14, 2014

This is similar to what I tried last year, but I recently saw:

scala/scala#3884

which says the mechanisms are in flux in 2.11.

@chipsenkbeil (Author) commented Aug 15, 2014

@mateiz @som-snytt My summary is that this should work for 2.10.x, since it doesn't appear that the removal of the class-invalidation logic in scala/scala#3884 is being backported there. I've also taken a closer look at the internal classpath editing in 2.10, and I don't see any potential side effects from appending to the existing merged classpath.

In terms of Spark moving to 2.11(.3 or higher) in the future, one option would be to simply include a trait mixed in with global that keeps the functionality they removed. From what I've seen, the only reason they removed it was because it wasn't used in the compiler or sbt, not that it had any issues. The underlying functionality wasn't broken, so I don't see any harm in doing that as a backup.

@mateiz (Contributor) commented Aug 18, 2014

Got it, thanks for looking into this.

@heathermiller commented Aug 18, 2014

So, this isn't my area of expertise (I'm not sure who at Typesafe/EPFL is currently responsible for the development/maintenance of the REPL, so I don't know who could comment on future plans/roadmap).

However,

  1. I think the reliance of this patch on invalidateClassPathEntries (removed in scala/scala#3884) is inconsequential, because I take from @gkossakowski's comment (scala/scala#3884 (comment)) that this functionality can easily (and most likely will) be provided by 2.11.x's new classpath handling code. So the approach taken in this PR is probably still valid for 2.11. It's obligatory to note that we don't yet know exactly what it'll look like or precisely when it'll land, but since SI-6502 is marked "major" and has an owner, I would bet on it being included in 2.11.x.
  2. There were some questions above about reflection – I second your Java reflection-based approach for two reasons: (1) as many know, Scala reflection is not thread-safe in 2.10.x, and, perhaps more importantly, (2) it's an experimental API that is still fluctuating, and has already fluctuated quite a bit between 2.10.x and 2.11.x. For now, it's simply a safer, more stable bet to rely on Java reflection if you can.

Since I’m not the foremost expert on this stuff, I’m also going to ping @gkossakowski (our best bet haha) – hopefully he can confirm/deny/clarify these insights?

@chipsenkbeil (Author) commented Aug 19, 2014

@heathermiller Thanks for the feedback! It looks like @gkossakowski commented here: SI-6502.

So, it looks like issues that might not be tackled by the current pull request are as follows:

  1. what should happen when appended jar contains package that overlap with what's already on class path?
  2. what should happen when appended jar contains a class that shadows existing class?

At least, he leads me to believe that invalidateClassPathEntries doesn't handle that. While I didn't mention it in the pull request earlier, I assumed that added entries wouldn't overlap with the existing classpath. If a feature to properly handle this is added later, even better. For now, it looks like this implementation works as long as you don't add conflicting classpath entries. @mateiz, I'm not sure whether that is a small enough problem for you to still want this.

Personally, I haven't run into these issues with Apache libraries that I've used with the patch. Nor have I run into issues trying out different jars for testing such as oddballs like JGoodies and lwjgl. Obviously, if someone wants to import jars at runtime, they need to be aware of conflicting classes.

I did just test this with the following files:

Placed inside TestJar.jar

package com.ibm.testjar;

public class TestClass {
    public void runMe() {
        System.out.println("I WAS RUN!");
    }
}

Placed inside TestJar2.jar

package com.ibm.testjar;

public class TestClass2 {
    public void runMe() {
        System.out.println("I WAS ALSO RUN!");
    }
}

And running yields something like this:

scala> :cp /path/to/TestJar.jar

scala> :cp /path/to/TestJar2.jar

scala> import com.ibm.testjar._

scala> new TestClass().runMe
I WAS RUN!
scala> new TestClass2().runMe
I WAS ALSO RUN!

So, I think the current solution does work for multiple jars that share the same package structure but contain entirely different classes. I don't know how it handles shadowing classes, if at all.

@som-snytt commented Aug 19, 2014

FWIW, I remembered my fix included overriding the ClassPath (in the JavaPlatform) of the REPL compiler. So it was a bit more nuanced in handling the replacement; it was a long time ago now, in Internet or dog years. But having an "additive" solution is better than nothing, the perfect being the enemy of the good and all that.

@heathermiller commented Aug 19, 2014

Yep, so shortly after I posted my last message, I had a conference call with @gkossakowski and then chatted briefly with @mateiz. Here's an attempt to summarize our discussions. Important note: @gkossakowski is currently recovering from surgery on his wrists, so written responses from him ATM are precious :) (and we shouldn't push him for too many...)

RE: point 2)

@gkossakowski and I discussed the issue of shadowing other jars on the classpath.

He explained a bit about the history of invalidateClassPathEntries; he was previously tasked to work on the resident compiler to enable faster incremental compilation, exploitable by both the Scala IDE and sbt. Thus, the functionality he was working on would have to manage large and complicated builds where shadowing and managing multiple versions of the same library would be a real challenge (he explained a bit the sort of maintenance that'd have to be done with Symbols, and all of the things that could go wrong, and it's really a nightmare). In the end, the approach was abandoned due to its complexity. invalidateClassPathEntries is a not-completely-baked artifact from this effort, and as far as I understand, no, it doesn't support any of the scenarios you list.

However, my intuition (which could be wrong) was that, for now, this is likely less important for users of spark-shell. Rationale: you're manually adding jars to the classpath; it's unlikely that the list of jars one would add to the classpath would be so numerous that you'd shadow something else that you've added. The real concern would be that someone adds a different version of a library that's already on the classpath – and you're right, for this we have no support as far as I can tell. I agree with you that, for now, if folks are going to import jars at runtime, they should just be aware of conflicting classes (at least importing jars at runtime would now once again be a possibility).

@gkossakowski also assured me that for the rest of 2.10.x, the logic that this PR touches most likely won't change, so we shouldn't worry about any of this breaking for the rest of the 2.10.x series.

The particularly important part of my conversation with @gkossakowski, though, centered on what spark-shell would do when migrating to 2.11.x, when you can't even rely on a partial solution like invalidateClassPathEntries. In the end, the consensus was that, on the scala/scalac side, we need at minimum some lightweight fix for SI-6502 that is at least sufficient for spark-shell.

Right now, the idea is to write a specification of the bare minimum that spark-shell/the Scala REPL should do when there are conflicts on the classpath. And, ideally, we could try to get this fix into one of the next two releases of the 2.11.x series (due dates: Sept 19, Nov 28) so that the fix doesn't have to live in Spark and depend on delicate scalac internals.

RE: point 1)

After chatting briefly with @mateiz, it was determined that the later-appended jars should take precedence (right Matei? or do I have that backwards?)

@gkossakowski commented Aug 19, 2014

Thanks @heathermiller for summing up my conversation.

I realized that I introduced a bit of confusion with my comment on SI-6502. Let me clarify.

The logic that is being removed by scala/scala#3884 does support the scenario of appending a jar to the classpath. This is demonstrated by @rcsenkbeil's comment. That piece of logic promises much more: it promises to handle arbitrary changes to the classpath, including removal of and changes to single classes. However, it doesn't deliver on that promise. We discovered all sorts of subtle issues while working on the resident compilation mode of scalac. This is not a real surprise: the Scala compiler wasn't designed with resident compilation in mind.

The invalidateClassPathEntries method is a half-baked implementation of a very ambitious goal. It comes with little documentation, zero tests, and no clients using it. For that reason we decided to remove it.

As I mentioned in SI-6502, we have a good chance to implement a small subset of what invalidateClassPathEntries does, and that should be enough to satisfy Spark's needs. You guys don't need full-blown resident compilation logic just to load some jars.

How do we determine what the small subset is? As I mentioned in SI-6502, the API that the :cp command needs has to deal with two concerns:

  1. what should happen when appended jar contains package that overlap with what's already on classpath?
  2. what should happen when appended jar contains a class that shadows existing class?

One might think that the first point is not important because it's unlikely that the same package will span several jars. That would be true if packages were flat. However, in Scala packages are nested, so if you have one jar with an org.apache.spark package and another jar with org.apache.commons, then you have overlapping packages. From Scala's point of view you have:

# first jar
  org
    apache
      spark

# second jar
  org
    apache
      commons

When appending the second jar, you must merge the org and apache packages. What you merge is the contents of PackageSymbol, the symbol that represents a single (nested) package from the classpath. invalidateClassPathEntries handles this scenario by clearing the entire contents of the symbol representing org, reloading fresh entries from the classpath, and making sure that the old symbol for the spark package is not lost (so we have just one symbol representing the spark package).
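As a purely illustrative single-file rendering of that overlap (in reality the two classes arrive in separate jars):

    // Both jars contribute children to the same nested packages, so appending
    // the second jar means merging the already-loaded `org` and `org.apache`
    // package symbols rather than replacing them.
    package org.apache {
      package spark   { class FromFirstJar  } // already on the classpath
      package commons { class FromSecondJar } // arrives with the appended jar
    }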

The second point is extremely hard to implement properly. That's why I think the alternative API to the current invalidateClassPathEntries should just abort in such a case.

Here's what needs to happen for Scala 2.11.x:

  • define minimal classpath manipulation api that will be enough for Spark use case
  • write tests for scenarios outlined above
  • implement the minimal classpath manipulation (invalidateClassPathEntries can serve as inspiration)
  • make sure that the implementation aborts in every case that is not supported; this way we'll avoid some weird compiler crashes caused by the classpath and symbols getting out of sync
@mateiz (Contributor) commented Aug 25, 2014

@gkossakowski, thanks for the detailed reply. From my point of view, what we want when new JARs are added is for earlier JARs to take precedence. This is what makes the most sense. If you already instantiated an object from the previous version of the class and stored it in a variable, it's not possible for it to suddenly change class. So instead the effect should be the same as tacking on another JAR at the end of your classpath -- only classes that are not found in earlier JARs come from there. Would these semantics be possible to implement for 2.11?
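For the runtime half, those semantics match ordinary parent-first classloader delegation; a minimal sketch of the idea (illustrative, not part of the patch):

    import java.net.{URL, URLClassLoader}

    // "Earlier entries win": a standard URLClassLoader asks its parent first, so a
    // class already loadable from the existing classpath is found there and the
    // copy inside the newly appended jar is simply never consulted.
    def withExtraJar(existing: ClassLoader, extraJar: URL): ClassLoader =
      new URLClassLoader(Array(extraJar), existing)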

@mateiz (Contributor) commented Aug 25, 2014

BTW this actual patch looks good to me for 2.10, so I'll probably merge it after a bit of manual testing. @pwendell should we put this in branch-1.1 as well? It is technically a bug fix, but not a critical one.

@mateiz (Contributor) commented Aug 26, 2014

Jenkins, test this please

@mateiz (Contributor) commented Aug 26, 2014

Hey @rcsenkbeil, so I tested this out manually, but unfortunately it doesn't seem to quite work -- you can't use the new classes in the REPL, though you can create them using reflection. Here's a sample session:

scala> :cp /Users/matei/workspace/jar-test/simple-jar-v1.jar
Added '/Users/matei/workspace/jar-test/simple-jar-v1.jar'.  Your new classpath is:
"/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/JObjC.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/ext/dnsns.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/ext/localedata.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/ext/sunec.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/ext/sunjce_provider.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/ext/sunpkcs11.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/ext/zipfs.jar:/System/Library/Java/Extensions/MRJToolkit.jar:/System/Library/Java/Extensions/QTJava.zip:/Users/matei/workspace/apache-spark/conf:/Users/matei/workspace/apache-spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar:/Users/matei/workspace/jar-test/simple-jar-v1.jar"

scala> Class.forName("test.Test").newInstance
res0: Any = Test2

scala> new test.Test
<console>:11: error: not found: value test
              new test.Test
                  ^

My simple-jar-v1.jar just contains a class called Test in package test.

Have you seen this happen yourself? It doesn't seem like this will be usable enough in its current form.

@mateiz (Contributor) commented Aug 26, 2014

BTW, the problem happens even if I call new test.Test before creating it through reflection first. Also, since we're in Spark-land, it would probably be good to call SparkContext.addJar on the newly added JARs to send them to executors in the cluster.

@mateiz (Contributor) commented Aug 26, 2014

Actually, one follow-up: It seems that the patch works if you try to use the new class before calling :cp and fail to find it, but fails if you just call :cp first and then try to use the class. Here's a session that worked for me:

scala> import test.Test
<console>:10: error: not found: value test
       import test.Test
              ^

scala> :cp /Users/matei/workspace/jar-test/simple-jar-v1.jar
Added '/Users/matei/workspace/jar-test/simple-jar-v1.jar'.  Your new classpath is:
"/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/JObjC.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/ext/dnsns.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/ext/localedata.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/ext/sunec.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/ext/sunjce_provider.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/ext/sunpkcs11.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre/lib/ext/zipfs.jar:/System/Library/Java/Extensions/MRJToolkit.jar:/System/Library/Java/Extensions/QTJava.zip:/Users/matei/workspace/apache-spark/conf:/Users/matei/workspace/apache-spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar:/Users/matei/workspace/jar-test/simple-jar-v1.jar"

scala> import test.Test
import test.Test

scala> new Test
res0: test.Test = Test2
@chipsenkbeil (Author) commented Aug 26, 2014

@mateiz I have not run into that issue. In fact, I just cloned a copy of my branch I submitted on a fresh computer and it still worked fine.

To test, I cloned the branch and ran sbt/sbt assembly to build the new jars.

From there, I created a file called TestClass.java as an illustration:

package com.ibm.testjar;

public class TestClass {
    public void runMe() {
        System.out.println("I WAS RUN!");
    }
}

I created a jar from this by issuing javac TestClass.java, mv TestClass.class com/ibm/testjar/, and jar -cf TestJar.jar com/ibm/testjar/TestClass.class.

Finally, I ran two different cases after starting the Spark shell through bin/spark-shell (after sbt/sbt assembly):

screen shot 2014-08-25 at 9 43 08 pm
screen shot 2014-08-25 at 9 44 35 pm

@mateiz (Contributor) commented Aug 26, 2014

Yeah, probably with myFunc() it does another search for a version of myFunc without arguments.

@chipsenkbeil (Author) commented Aug 26, 2014

@mateiz I've added a hack that just runs a compilation error in silence before adding anything, to allow this. It works as seen here (and doesn't interfere with anything else). I'll spend a little more time investigating, but this does seem to work for Scala classes now.

screen shot 2014-08-26 at 6 05 01 pm

@heathermiller commented Aug 26, 2014

Ok, so I think I have a solution. It took an embarrassing amount of experimentation and grokking of the REPL (not my area of expertise), but if you just create a dummy Run instance, it seems to force symbols to be initialized when you need them to be (e.g., for classloading).

So, in your implementation of addUrlsToClassPath, what seems to work is:

    def addUrlsToClassPath(urls: URL*): Unit = {
      new Run // force some initialization
      urls.foreach(_runtimeClassLoader.addNewUrl) // Add jars/classes to runtime for execution
      updateCompilerClassPath(urls: _*)           // Add jars/classes to compile time for compiling
    }

After running into 10,000 walls and writing a bunch of code to manually force the initialization of symbols in previous/current runs, I discovered this happening in Global:
https://github.com/scala/scala/blob/2.11.x/src/compiler/scala/tools/nsc/Global.scala#L754
which in turn forces initialization in Run, in places like:
https://github.com/scala/scala/blob/2.11.x/src/compiler/scala/tools/nsc/Global.scala#L1051

@chipsenkbeil (Author) commented Aug 27, 2014

Awesome! Great find, @heathermiller! I've added an additional commit to move to that solution. @mateiz, what do you think now? This is what Global itself does to reinitialize. Is that clean enough?

FYI, I ran through tests with both Scala and Java classes (from within jars) and everything checks out.

I also added the SparkContext.addJar call as you requested to make this more Sparkish.
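Conceptually the extra step is just a call into the existing SparkContext API (sketch, not the exact diff):

    import org.apache.spark.SparkContext

    // Besides updating the driver-side REPL classpath, register the jar with the
    // running SparkContext so executors download it before running tasks that
    // reference its classes.
    def shipJarToExecutors(sc: SparkContext, jarPath: String): Unit =
      sc.addJar(jarPath)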

I tested this by starting up a standalone cluster using the sbin/start-all.sh script to launch a cluster with a single worker and connecting using
./bin/spark-shell --master spark://rcsenkbe.local:7077.

I'm hoping this demonstrates that the jars also work on the remote cluster. I don't have access to a remote Spark cluster at the moment.

screen shot 2014-08-26 at 7 18 16 pm

The Java implementation:

package my.java;

public class MyJavaClass {
    public int add(int x, int y) {
        return x + y;
    }
}

The Scala implementation:

package my.scala

class MyScalaClass {
    def add(x: Int, y: Int) = x + y
}

screen shot 2014-08-26 at 7 20 58 pm
screen shot 2014-08-26 at 7 21 30 pm
screen shot 2014-08-26 at 7 22 25 pm
screen shot 2014-08-26 at 7 23 04 pm

@mateiz (Contributor) commented Aug 27, 2014

Cool, thanks! Running a cluster locally sounds fine as a way to test it; it will go through the same codepath.

@mateiz (Contributor) commented Aug 27, 2014

Jenkins, test this please

@mateiz (Contributor) commented Aug 27, 2014

test this please

@SparkQA commented Aug 27, 2014

QA tests have started for PR 1929 at commit f420cbf.

  • This patch merges cleanly.
@SparkQA commented Aug 27, 2014

QA tests have finished for PR 1929 at commit f420cbf.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • rem In this case, leave out the main class (org.apache.spark.deploy.SparkSubmit) and use our own.
@gkossakowski commented Aug 27, 2014

@gkossakowski, thanks for the detailed reply. From my point of view, what we want when new JARs are added is for earlier JARs to take precedence. This is what makes the most sense. If you already instantiated an object from the previous version of the class and stored it in a variable, it's not possible for it to suddenly change class. So instead the effect should be the same as tacking on another JAR at the end of your classpath -- only classes that are not found in earlier JARs come from there. Would these semantics be possible to implement for 2.11?

We agree on the semantics. I called changing an existing class "shadowing", but we mean the same thing: changes to existing classes should not be allowed.

Adding jars to the classpath means just adding new classes that were not previously available. For that we need merging of packages, as I explained earlier. It's possible to implement this kind of API for 2.11, but it doesn't exist yet.

I hope we can figure out how to merge your changes and work on the API on the compiler side. The current approach of going deep into the internals of Global, as seen in this PR, is fine as short-term experimentation so you can quickly deliver a fix to your users. The long-term solution would be migrating most of Spark's code that talks to compiler internals into the Scala code base.

@pwendell (Contributor) commented Aug 27, 2014

@mateiz I'd prefer not to merge this into branch-1.1 at this point unless you see a really compelling need. I'm scarred from Spark 1.0.0, which actually shipped with a major REPL bug (which was itself an attempt to fix another bug!).

@ScrapCodes (Member) commented Aug 27, 2014

I have an easy test case which crashes the compiler.

jar1 has a package repl1 and a case class A:

:cp jar1

import repl1
val a = repl1.A

jar2 also has a package repl1 and case classes A and B:

:cp jar2
val b = repl1.A

b == a // will crash the compiler.

@chipsenkbeil (Author) commented Aug 27, 2014

@ScrapCodes, this is the case I said I didn't support: bringing in different jars with the same package structure and the same class names is rare for the REPL, and it can be tricky to handle in many cases with the current code base. This is where we would want something upstream in Scala, as @gkossakowski has mentioned. Also, as he discussed, this would really be a feature for adding new classes, not replacing existing ones.

[EDIT] Also, you're going so far as to mix a previous and a current class implementation in an equality comparison. Maybe this is something that should be supported, but it just seems to ask for trouble. I'd think that people adding jars to the REPL would be more cautious. Maybe it makes more sense to give this command a different name, like :addNewJar or something, with a disclaimer about mixing old/new implementations?

@mateiz (Contributor) commented Aug 27, 2014

I think :cp is fine, no reason to make a new command. I tested this again and it seems to work well (also on a cluster), so I'm going to merge this into 1.2. And I'll open a new JIRA for a Scala 2.11 version of this.

@mateiz (Contributor) commented Aug 27, 2014

Hey Chip, one more thing, can you create an account on https://issues.apache.org/jira/browse/SPARK so I can assign this issue to you? Then update the pull request to say [SPARK-3256] in the title. I created this as an issue: https://issues.apache.org/jira/browse/SPARK-3256.

@chipsenkbeil (Author) commented Aug 27, 2014

@chipsenkbeil changed the title from "Added support for :cp <jar> that was broken in Scala 2.10.x for REPL" to "[SPARK-3256] Added support for :cp <jar> that was broken in Scala 2.10.x for REPL" on Aug 27, 2014

@mateiz (Contributor) commented Aug 27, 2014

Alright, thanks! I've merged this in. I also opened https://issues.apache.org/jira/browse/SPARK-3257 to do this for the Scala 2.11 REPL. @gkossakowski it would be great if you guys add an API for this in the next Scala 2.11 release -- our 1.2 release will be in roughly 3 months.

@asfgit closed this in 191d7cf on Aug 27, 2014

@gkossakowski commented Aug 29, 2014

@mateiz, we haven't been planning to touch the REPL code, and this is not a trivial request, so I'll have to ask around about it. I'll update this ticket once I know what our plans are.

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014

[SPARK-3256] Added support for :cp <jar> that was broken in Scala 2.10.x for REPL


Author: Chip Senkbeil <rcsenkbe@us.ibm.com>

Closes apache#1929 from rcsenkbeil/FixReplClasspathSupport and squashes the following commits:

f420cbf [Chip Senkbeil] Added SparkContext.addJar calls to support executing code on remote clusters
a826795 [Chip Senkbeil] Updated AddUrlsToClasspath to use 'new Run' suggestion over hackish compiler error
2ff1d86 [Chip Senkbeil] Added compilation failure on symbols hack to get Scala classes to load correctly
a220639 [Chip Senkbeil] Added support for :cp <jar> that was broken in Scala 2.10.x for REPL

@Leemoonsoo referenced this pull request Feb 4, 2015: Fix/Improve Dependency loader #308 (merged; 3 of 3 tasks complete)

@scabug referenced this pull request Apr 7, 2017: :cp does not work #6502 (closed)
