
[SPARK-2848] Shade Guava in uber-jars. #1813

Closed
wants to merge 8 commits into master from vanzin:SPARK-2848

Conversation

vanzin
Contributor

@vanzin vanzin commented Aug 6, 2014

For further discussion, please check the JIRA entry.

This change moves Guava classes to a different package so that they don't conflict with the user-provided Guava (or the Hadoop-provided one). Since one class (Optional) was exposed through Spark's public API, that class was forked from Guava at the current dependency version (14.0.1) so that it can be kept going forward (until the API is cleaned).

Note this change has a few implications:

  • all classes in the final jars will reference the relocated classes. If Hadoop classes are included (i.e. "-Phadoop-provided" is not activated), those will also reference the Guava 14 classes (instead of the Guava 11 classes from the Hadoop classpath).
  • if the Guava version in Spark is ever changed, the new Guava will still reference the forked Optional class; this may or may not be a problem, but in the long term it's better to think about removing Optional from the public API.

For the end user, there are two visible implications:

  • Guava is not provided as a transitive dependency anymore (since it's "provided" in Spark)
  • At runtime, unless they provide their own, they'll either have no Guava or Hadoop's version of Guava (11), depending on how they set up their classpath.

Note that this patch does not change the sbt deliverables; those will still contain guava in its original package, and provide guava as a compile-time dependency. This assumes that maven is the canonical build, and sbt-built artifacts are not (officially) published.
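For readers unfamiliar with relocation-based shading, the sketch below shows roughly how such a relocation is configured with the maven-shade-plugin. This is an illustration only, not the exact configuration in this patch; in particular, the shaded package name org.spark-project.guava is an assumption.

<!-- Illustrative sketch of relocation-based shading (shaded package name assumed). -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <!-- Rewrite com.google.common.* references throughout the uber-jar -->
            <pattern>com.google.common</pattern>
            <shadedPattern>org.spark-project.guava</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>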

@vanzin
Contributor Author

vanzin commented Aug 6, 2014

I used a tool I wrote to check the references in the final jars (check it out: https://gist.github.com/vanzin/bd9057fadf4a296220b7). I verified that only references to Optional point at the old "com.google.common" package.

I'm currently running unit tests (I don't think they'd really be affected) and will double-check the sbt-generated jar to make sure the asm dependencies haven't changed from before.

@@ -0,0 +1,243 @@
/*
* Copyright (C) 2011 The Guava Authors
Member

PS I checked and there is no additional change we need to make to LICENSE or NOTICE as a result of including this. Guava is already AL2 and has no NOTICE that pertains to this file.

Contributor

Shouldn't we still list it in our LICENSE? We've listed all source files that are not copyright Apache, which would include this one.

Member

Not according to my reading of http://www.apache.org/dev/licensing-howto.html, in the part covering bundled AL2 dependencies, but this stuff is always less than intuitive to me.

I suppose the reasoning is that Spark's AL2 license already exactly describes the terms of the licensing for that file. That's not quite true for BSD or MIT licenses.

The copyright owner doesn't matter; there are hundreds of them. What matters is how the owners, whoever they are, license their IP to be used.

Contributor Author

BTW if we're comfortable with not locking this file to the 14.0.1 version (or remembering to do this when we change guava versions), I can remove this fork (and clean up some code elsewhere in this patch too).

Contributor

I see; I would still include it because the copyright is from another organization. I don't see how adding it would hurt.

Member

@vanzin You're referring to not including this source file at all and instead getting it in binary form from the Guava artifact, right? Because yes, if it stays in source form it certainly must keep its copyright statement.

Contributor Author

@srowen yes that's the alternative.

@SparkQA

SparkQA commented Aug 6, 2014

QA tests have started for PR 1813. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18069/consoleFull

@SparkQA

SparkQA commented Aug 6, 2014

QA results for PR 1813:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
public abstract class Optional implements Serializable {
class Relocator(prefix: String, shaded: String, includes: Seq[Regex], excludes: Seq[Regex]) {
class RelocatorRemapper(relocators: List[Relocator]) extends Remapper {
class ShadeStrategy(relocators: List[Relocator], preferLocal: List[Regex]) extends MergeStrategy {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18069/consoleFull

@srowen
Member

srowen commented Aug 7, 2014

After a side-bar -- I still think compile scope is correct for Guava, but that it can be marked optional as an extra safeguard against people accidentally linking to Spark's Guava. This seems to be the motivation for provided scope, since it achieves this as a secondary effect. But in the exceptional case where Spark is being used outside an environment where Spark is provided, this has the undesirable primary effect of making Spark fail to execute for lack of Guava, unless the packager also includes Guava for it. Either way can be made to work, and both are equivalent for the common case, where Spark is provided.

EDIT: OK, is this accurate?

A "compile" scope Guava is the simple thing to do, and allows the shading solution to proceed. It however causes Guava to leak into apps that depend on Spark. But, only if they don't do the normal thing and treat Spark as "provided".

A "provided" scope Guava is more unusual, and requires some extra work in POM, but also enables the shading solution. Guava doesn't leak at all, but, people not depending on Spark as "provided" must remember to depend on Guava for Spark.

A "compile" scope Guava that is also "optional" works with the shading solution (? I think). Guava doesn't leak. But people using not using Spark as "provided" still have to remember to include Guava for Spark.

The last two seem the same, so, I suppose I can't really argue for "compile+optional" and against "provided" since the semantics are the same and both are not precisely what we desire (which would be something like "runtime by default for transitive dependers").

Proceed unless anyone else has an opinion to the contrary, IMHO.
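For reference, the three scope options discussed above would look roughly as follows in a POM; this is a sketch of the alternatives, not the change made in this PR.

<!-- Option 1: plain compile scope; leaks Guava to apps that don't mark Spark as "provided". -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>14.0.1</version>
</dependency>

<!-- Option 2: provided scope; never leaks, but users who need Guava must declare it themselves. -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>14.0.1</version>
  <scope>provided</scope>
</dependency>

<!-- Option 3: compile + optional; Spark compiles against Guava, but dependers don't inherit it. -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>14.0.1</version>
  <optional>true</optional>
</dependency>

In all three cases the assembly can still shade Guava; the choice only affects what downstream builds see on their classpath.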

@vanzin
Contributor Author

vanzin commented Aug 7, 2014

@pwendell I think you may have suggested in jira at some point to create a separate deliverable for the shaded guava jar, and have the spark build depend on that. Is that what you had in mind?

If that's the preferred approach, it would require code changes (change imports) and I'd need pointers for how to create this separate deliverable (I don't think it would live inside the spark build? how is it done for hive?).

@pwendell
Contributor

pwendell commented Aug 7, 2014

hey @vanzin - so yeah, in the past we've done this by creating separate dependencies and publishing them to maven central. This is a bit more work on our side, but it plays nicer for downstream users because the correct functioning of spark does not depend on our distribution artifacts. For instance, users who write applications that embed spark don't have to try to recreate or mimic our distribution artifacts. For this reason I have a slight bias for the existing approach, just because we've used it for a long time without any issues.

The way we do this type of shading has been ad hoc: we've typically cloned the relevant projects, rewritten class names, and then published them to maven central under the org.spark-project namespace.

On the dev list I recently explained to @avati that it would be nice to include scripts in the Spark repo that do this. He's actually started working on it, and you can see an example for protobuf - I think the stuff he's done so far looks great:

https://github.com/avati/spark-shaded/blob/master/0001-protobuf.sh

I downloaded guava the other day and played around; I think it will be pretty straightforward to do this for guava as well.

I actually know several other projects that have started to use our shaded artifacts (we tend to shade dependencies that are a common pain point) - so I think there is some value in continuing to do this.

@vanzin
Contributor Author

vanzin commented Aug 7, 2014

@pwendell creating the shaded deliverable for Guava should be trivial, but where do I create a pull request for it? I can work on the pom file to enable that and publish it to my github in the meantime.

@vanzin
Contributor Author

vanzin commented Aug 7, 2014

Here's a repo with a pom file to shade guava:
https://github.com/vanzin/spark-shaded

@pwendell
Contributor

pwendell commented Aug 7, 2014

@vanzin yeah, so the model I'd like to have is that we include scripts in the Spark source code that can generate these "from scratch" - i.e. take a look at the script that @avati wrote. Then these can be contributed to Spark directly; we can put them in /dev/shaded-deps/.

@vanzin
Contributor Author

vanzin commented Aug 7, 2014

@pwendell I looked at that script but it looks overly complicated, at least for guava. A simple pom that invokes the shade plugin is much, much simpler. Would it be ok to have the script just run the pom?

@pwendell
Contributor

pwendell commented Aug 7, 2014

Does the shade plug-in work if we want to actually publish a jar to maven central? If that's the case, including a POM seems like a solid option. We'd want to at least have some simple script, though, that downloads guava and then inserts that pom - rather than maintaining it in a separate repo.

@avati
Contributor

avati commented Aug 8, 2014

The shade plug-in works in many cases, but not all. From what I understand, if the dependency is direct, the shade plugin works fine. In the case of protobuf it was an indirect dependency - akka-2.3 specifically needs protobuf 2.5 and hadoop1 specifically needs protobuf 2.4 - so a custom rebuilt package of one or the other was inevitable.

I haven't looked deeply into the guava situation; I'm just providing some context.

* @param includes Regexes for classes to include inside the matching package (empty = all).
* @param excludes Regexes for classes to exclude from the matching package (empty = none).
*/
class Relocator(prefix: String, shaded: String, includes: Seq[Regex], excludes: Seq[Regex]) {
Contributor

This might be useful as a standalone sbt plug-in, I'm guessing a lot of other projects would use this - did you think at all about doing that? (of course that's totally up to you, just had the thought).

@pwendell
Contributor

pwendell commented Aug 8, 2014

Hey so after thinking about it a bit more, maybe it's worth experimenting with this simpler approach of in-lining the shaded dependency in our jars using build plug-ins. The main reason I think it's worth trying is that, in the worst case, if this has some random unintended consequence we can just revert to our existing method of publishing fully shaded dependencies to maven central.

One thought for @vanzin and @srowen - what if, instead of having our own copy of Optional, we just continue to advertise a full compile dependency on guava and get Optional from that? Then basically we have a "public" use of Guava, which is for this Optional class, and a "private" use of Guava, which we shade away. I think that will work as long as Optional does not change in guava... but that is the exact scenario in which the current approach works as well.

@srowen
Member

srowen commented Aug 8, 2014

@pwendell This is complicated, so let me summarize what I think I understand and then give an opinion.

The plan at hand concerned changes to the assembly JARs, pretty much exclusively. Spark continues to depend on Guava in order to compile, and so that consumers of Spark also see this dependency in their builds.

(You can see the discussion with @vanzin about whether to alter how downstream consumers of the Maven build see the Guava dependency -- provided vs compile vs optional -- but this doesn't affect the assembly, and only affects whether Guava automatically becomes a part of the downstream consumer's classpath. I still have a minor issue with provided, but unless @pwendell has any opinion on this, consider me outvoted.)

So I think the proposal was already to continue showing a dependency on Guava, but, to alter the assembly to include only Optional, and shaded copies of all else. And maybe to tweak the dependency to hide it by default for downstream consumers.

Inlining the shaded JARs is all but like removing use of Guava, as far as the world is concerned. That certainly takes it out of the picture. The little snag is Optional. It will now inevitably collide with any other app that builds Guava in too. And downstream consumers can't manage away the conflict normally. It may either be a harmless warning or cause a real problem if Optional changes later. In general I don't like pushing the shading into the compile-time world. Remember the hive-exec awfulness? This would not be my first choice, versus:

I think your second paragraph is basically the same thing proposed already, with one difference. Yes, the build still continues to say it depends on Guava. The concrete assembly is put together differently. I think you can use Guava's Optional rather than embedding a source copy of it and compiling that. Right, @vanzin? With some shading cleverness?

@vanzin
Contributor Author

vanzin commented Aug 8, 2014

Lots of comments, let me try to write a coherent response to all of them. :-)

Shaded jars and maven central

Shouldn't be a problem. You're publishing your project's artifact to maven central, which just happens to include the shaded classes. This is similar to how assembly jars are published (http://search.maven.org/#browse|-1363484021).

(EDIT: yes, there's only a pom for the assembly, but if the pom's packaging type is "jar", you get the jar there. Well, you get the point.)

The Optional class

It should be possible to use Guava's original Optional without having to fork it. The only reason I forked it is to allow changing the version of Guava without affecting the Spark API, but that can be done on demand if needed. I'd still package it in the spark-core jar; otherwise the guava dependency would need to have "compile" scope (see below).

Compile vs. Provided

My main argument for using "provided" is to avoid leaking guava into the user's compilation classpath. Users depend on spark-core (for example), and if spark-core has a compile dependency on guava, guava will be available in the user's compilation classpath (regardless of whether spark-core is set up as compile or provided). If that dependency is provided (and thus not transitive), then it never shows up; if the user needs guava, he needs to explicitly depend on it.

This does have effects on packaging, though: for people using an existing spark assembly to run their apps, nothing changes. But if the user is creating his own uber-jar with spark and everything else in it, and running it in some custom environment, he'll need to explicitly package guava (even if his app doesn't use it). My belief is that the former case is the overwhelming majority, and the latter case means the user has to care about too many things already (e.g. choosing compatible versions of everything between his app and spark and hadoop et al) for this little thing to be a real concern.

Shading at compile time

@srowen let me know if I misunderstood, but do you mean creating a spark-core jar with the shaded guava classes in it? I actually started with that alternative, but it seems messy. Either you have all downstream projects reference the shaded classes (something that would be needed if going with the separate shaded jar approach), or you get into a weird place, where spark-core is referencing the shaded classes but all downstream projects are not. The assembly would fix everything (by shading again), but someone not using the assembly could get really weird errors.

Also, having duplicate class names in the classpath really confuses some IDEs that automatically add imports. More manual configuration for people to add...

sbt shading plugin

I think the code I wrote can be added to the sbt-assembly plugin (with some modifications). It probably is the best place for that, anyway, since I created it as a merge strategy for that plugin. But that's for later.
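On the Optional point above, the following sketch illustrates one way the relocation can leave Guava's own Optional at its original coordinates while everything else is shaded. It is an assumption-laden illustration, not the final diff: the shaded package name is assumed, and in practice any package-private companion classes that Optional relies on would need the same treatment.

<relocation>
  <pattern>com.google.common</pattern>
  <shadedPattern>org.spark-project.guava</shadedPattern>
  <excludes>
    <!-- Keep the class exposed through Spark's public API at its original name -->
    <exclude>com.google.common.base.Optional*</exclude>
  </excludes>
</relocation>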

@srowen
Member

srowen commented Aug 8, 2014

@vanzin Yes there's no problem with publishing artifacts containing copies of other artifacts as far as Maven Central is concerned. It's just icky for downstream consumers. I think the non-assembly artifacts should be the 'pristine' raw ingredients and assemblies can be crazy remixes.

I can see the argument for provided. It has the pros and cons you mention. It's a moot point for people that use Spark as provided, which is what they're supposed to do. So, meh. Just drop a comment in the POM to explain the scope.

I had understood that the goal was to alter the Spark assembly in some way, such that with some certain footwork in the Hive on Spark build, the Hive on Spark assembly would work with the Spark assembly as deployed on people's clusters. That's mostly about getting Guava out of the way.

I don't think this entails making anyone depend on a shaded anything in Maven. I think that's avoidable and undesirable. So I think we are all basically agreeing on that? So to be clear, Spark continues to express a dependency on Guava in its POM. It must. The shading affects deployed assemblies.

The only delta with what @pwendell suggests seems to be shading in Guava's Optional rather than keeping a source copy. But that seems fine.

Marcelo Vanzin added 2 commits August 12, 2014 12:23
Just package the class from the guava dependency instead. Note the
dependency-fu needed in core/pom.xml and assembly/pom.xml to not
expose the guava dependency in the user's compilation classpath.
Conflicts:
	project/SparkBuild.scala
@vanzin
Contributor Author

vanzin commented Aug 12, 2014

Updated to remove forked code. I'm not too proud of the code to package things in the sbt build, but it's how I got things to work without having to expose Guava in the "compile" scope.

@vanzin
Contributor Author

vanzin commented Aug 12, 2014

Jenkins, retest this please.

@SparkQA

SparkQA commented Aug 12, 2014

QA tests have started for PR 1813. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18386/consoleFull

@SparkQA

SparkQA commented Aug 12, 2014

QA results for PR 1813:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Relocator(prefix: String, shaded: String, includes: Seq[Regex], excludes: Seq[Regex]) {
class RelocatorRemapper(relocators: List[Relocator]) extends Remapper {
class ShadeStrategy(relocators: List[Relocator]) extends MergeStrategy {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18386/consoleFull

lazy val settings = assemblySettings ++ Seq(
test in assembly := {},
jarName in assembly <<= (version, moduleName) map { (v, mName) => mName + "-"+v + "-hadoop" +
Option(System.getProperty("hadoop.version")).getOrElse("1.0.4") + ".jar" },
mergeStrategy in assembly := {
case PathList("org", "datanucleus", xs @ _*) => MergeStrategy.discard
case PathList("org", "objectweb", "asm", xs @ _*) => MergeStrategy.discard
Contributor

Is this related to SPARK-2848?

Contributor Author

Ah, no, this is leftover from a previous change in this patch. I'll remove.

@pwendell
Contributor

Hey Marcelo,

I took a look through this - I also discussed in some detail with @srowen this past weekend. If we want to fix this problem using shading in our build, this seems like as good a way as any to do it. It does substantially increase the complexity of the build which is not great, so maybe longer term we could consider moving towards just publishing a shaded version of guava upstream.

Regarding this specific implementation, I noticed there is a lot of effort to mirror this logic in the sbt build. I wonder if maybe we should just leave this out of the sbt build entirely to reduce the complexity, and in the sbt build simply have guava be a compile-time dependency like it was before. We've said in the past that our official reference build for packagers is maven, so I think it's fine if some of the more esoteric build configurations do not exist in sbt.

@vanzin
Contributor Author

vanzin commented Aug 18, 2014

Hi Patrick,

Regarding having the compile-scoped dependency in sbt only, the only way I can think of doing it is to create a profile that is always enabled from the sbt build, since the dependencies always come from the poms.

I agree that the sbt build is a little hacky with this patch, but at least it generates the same artifacts as the maven one. So it would avoid "this works with maven but not with sbt" situations, even though they might be rare.

@vanzin
Contributor Author

vanzin commented Aug 18, 2014

Hi @pwendell, I reverted the sbt changes as we talked about offline; it required a tiny change in the pom so that everything works as intended in both builds.

@SparkQA

SparkQA commented Aug 18, 2014

QA tests have started for PR 1813 at commit 9bdffb0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 18, 2014

QA tests have finished for PR 1813 at commit 9bdffb0.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<scope>compile</scope>
Contributor

hey, just wondering - since spark core lists this as a compile dependency, and everything else depends on spark core, what makes this necessary?

Contributor

sorry - I see - the issue is that elsewhere this is not in compile scope. However, why not just have it be compile scope by default in the root pom? Is the issue that we need to make sure it does not exist at compile scope in our published root pom?

Contributor Author

See previous discussion I had with Sean in this PR.

There's really no good scope in maven for what we want to do. "compile" means that guava will show up in the user's compilation classpath, and since we're shading guava in the assembly, it may lead to runtime errors since guava won't be there. "provided" means that guava is not exposed to the user at all; so if the user wants to use guava, he'll need to add an explicit dependency and package it himself.

I like "provided" better because it means things fail for the user at build time, not runtime, if the user is using guava but hasn't declared the dependency. And that's why I had all those hacks in the sbt build that you asked me to remove.

Contributor

I just meant that, since we are using the shade plug-in, wouldn't it just remove guava from our published poms anyway... then I realized that it most likely won't remove guava from the root pom, since we only shade inside of spark-core.
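To make the scope arrangement in this thread concrete, the general pattern looks roughly like the sketch below. It is an illustration of the idea only; the exact poms touched in this patch (core/pom.xml and assembly/pom.xml, per the commit note) may arrange it differently.

<!-- In the pom that downstream users inherit from: Guava is "provided",
     so it does not leak transitively into applications that depend on Spark. -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <scope>provided</scope>
</dependency>

<!-- In the module whose shade step must physically bundle and relocate the
     Guava classes: re-declare the dependency at compile scope, since the
     shade plugin only bundles compile/runtime-scoped artifacts. -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <scope>compile</scope>
</dependency>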

@pwendell
Contributor

Just a minor question but LGTM - we should merge this one into master only.

@pwendell
Contributor

Jenkins, test this please.

@SparkQA

SparkQA commented Aug 20, 2014

QA tests have started for PR 1813 at commit 9bdffb0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 20, 2014

QA tests have finished for PR 1813 at commit 9bdffb0.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pwendell
Contributor

Okay - let's give this a try. Merging it into master.

vanzin pushed a commit that referenced this pull request Aug 20, 2014
For further discussion, please check the JIRA entry.

This change moves Guava classes to a different package so that they don't conflict with the user-provided Guava (or the Hadoop-provided one). Since one class (Optional) was exposed through Spark's public API, that class was forked from Guava at the current dependency version (14.0.1) so that it can be kept going forward (until the API is cleaned).

Note this change has a few implications:
- *all* classes in the final jars will reference the relocated classes. If Hadoop classes are included (i.e. "-Phadoop-provided" is not activated), those will also reference the Guava 14 classes (instead of the Guava 11 classes from the Hadoop classpath).
- if the Guava version in Spark is ever changed, the new Guava will still reference the forked Optional class; this may or may not be a problem, but in the long term it's better to think about removing Optional from the public API.

For the end user, there are two visible implications:

- Guava is not provided as a transitive dependency anymore (since it's "provided" in Spark)
- At runtime, unless they provide their own, they'll either have no Guava or Hadoop's version of Guava (11), depending on how they set up their classpath.

Note that this patch does not change the sbt deliverables; those will still contain guava in its original package, and provide guava as a compile-time dependency. This assumes that maven is the canonical build, and sbt-built artifacts are not (officially) published.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #1813 from vanzin/SPARK-2848 and squashes the following commits:

9bdffb0 [Marcelo Vanzin] Undo sbt build changes.
819b445 [Marcelo Vanzin] Review feedback.
05e0a3d [Marcelo Vanzin] Merge branch 'master' into SPARK-2848
fef4370 [Marcelo Vanzin] Unfork Optional.java.
d3ea8e1 [Marcelo Vanzin] Exclude asm classes from final jar.
637189b [Marcelo Vanzin] Add hacky filter to prefer Spark's copy of Optional.
2fec990 [Marcelo Vanzin] Shade Guava in the sbt build.
616998e [Marcelo Vanzin] Shade Guava in the maven build, fork Guava's Optional.java.
@vanzin
Contributor Author

vanzin commented Aug 21, 2014

@pwendell any idea why this PR wasn't automatically closed after the merge?

@vanzin vanzin closed this Aug 21, 2014
@vanzin vanzin deleted the SPARK-2848 branch August 21, 2014 19:42
@liancheng
Contributor

@vanzin Would you mind having a look at SPARK-3217? I think this PR breaks the Spark Maven build. Jenkins didn't complain because it uses SBT and thus doesn't shade Guava.

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
@qiaohaijun

+1

littlee1032 pushed a commit to MediaV/spark that referenced this pull request Nov 11, 2014
christopherbozeman pushed a commit to christopherbozeman/spark that referenced this pull request Nov 28, 2014
@mfawzymkh

Do we have an ETA for getting this pull request merged to master? The guava shading issue is causing a problem for client libs that have a dependency on swift-service when spark is compiled with hadoop-2.4.

@srowen
Member

srowen commented Jan 19, 2015

@mfawzymkh As you can see this was merged into Spark a long time ago. What are you waiting on?

@mfawzymkh

Sorry, I confused this with some other issue. Thanks.

