[SPARK-2848] Shade Guava in uber-jars. #1813

Closed. Wants to merge 8 commits.
18 changes: 18 additions & 0 deletions assembly/pom.xml
@@ -43,6 +43,12 @@
</properties>

<dependencies>
<!-- Promote Guava to compile scope in this module so it's included while shading. -->
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
@@ -113,6 +119,18 @@
<goal>shade</goal>
</goals>
<configuration>
<relocations>
<relocation>
<pattern>com.google</pattern>
<shadedPattern>org.spark-project.guava</shadedPattern>
<includes>
<include>com.google.common.**</include>
</includes>
<excludes>
<exclude>com.google.common.base.Optional**</exclude>
</excludes>
</relocation>
</relocations>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
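For orientation (this explanatory note is not part of the PR): the relocation above makes the shade plugin rewrite both the class-file paths and the bytecode references to them, so the Guava copy inside the uber-jar lives under a private package and cannot clash with a user-supplied Guava. The mapping below is derived from the pattern/exclude rules and is purely illustrative:

<!-- Illustrative effect of the relocation above:
     com/google/common/collect/Lists.class  -> org/spark-project/guava/common/collect/Lists.class
     com/google/common/base/Optional.class  -> unchanged (matches the exclude, stays public)
     References inside all shaded class files are rewritten to match. -->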
35 changes: 35 additions & 0 deletions core/pom.xml
@@ -68,9 +68,15 @@
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-server</artifactId>
</dependency>
<!--
Promote Guava to "compile" so that maven-shade-plugin picks it up (for packaging the Optional
class exposed in the Java API). The plugin will then remove this dependency from the published
pom, so that Guava does not pollute the client's compilation classpath.
-->

Contributor: Is there logic in here that specifically tells maven to exclude this from our published pom? Does this automatically happen by virtue of including the artifactSet below?

Contributor: I just tested this locally and it appears that it does get excluded by virtue of the artifact set.
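The mechanism behind that exclusion (background, not a change in this PR) is the shade plugin's dependency-reduced POM: artifacts that get shaded into the jar are dropped from the POM that Maven publishes for the module. The relevant switch, shown with its default value:

<configuration>
  <!-- maven-shade-plugin default: artifacts included in the shaded jar are
       removed from the dependency list of the published (reduced) POM. -->
  <createDependencyReducedPom>true</createDependencyReducedPom>
</configuration>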
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
@@ -322,6 +328,35 @@
</arguments>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<shadedArtifactAttached>false</shadedArtifactAttached>
<artifactSet>
<includes>
<include>com.google.guava:guava</include>
</includes>
</artifactSet>
<filters>
<!-- See comment in the guava dependency declaration above. -->
<filter>
<artifact>com.google.guava:guava</artifact>
<includes>
<include>com/google/common/base/Optional*</include>
</includes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
</plugins>

<resources>
26 changes: 25 additions & 1 deletion examples/pom.xml
@@ -46,8 +46,14 @@
</dependencies>
</profile>
</profiles>

<dependencies>
<!-- Promote Guava to compile scope in this module so it's included while shading. -->
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
@@ -209,6 +215,12 @@
</includes>
</artifactSet>
<filters>
<filter>
<artifact>com.google.guava:guava</artifact>
<excludes>
<exclude>com/google/common/base/Optional*</exclude>
</excludes>
</filter>

Contributor: Ditto here, add a comment on why we keep Optional.

Contributor: (I guess it's above too, but wouldn't hurt to place it exactly where we do the exclude.)
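The comment being requested would presumably note that Optional is deliberately left un-shaded because it is exposed through Spark's public Java API (for example, the Java API's outer joins return Optional values), so user code must see it under its original name. A possible wording (an assumption, not text from this PR):

<!-- Keep com/google/common/base/Optional* un-shaded: Optional is exposed
     through Spark's public Java API, so applications must be able to
     reference it as com.google.common.base.Optional. -->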
<filter>
<artifact>*:*</artifact>
<excludes>
@@ -226,6 +238,18 @@
<goal>shade</goal>
</goals>
<configuration>
<relocations>
<relocation>
<pattern>com.google</pattern>
<shadedPattern>org.spark-project.guava</shadedPattern>
<includes>
<include>com.google.common.**</include>
</includes>
<excludes>
<exclude>com.google.common.base.Optional**</exclude>
</excludes>
</relocation>
</relocations>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
16 changes: 16 additions & 0 deletions pom.xml
@@ -260,6 +260,7 @@
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>14.0.1</version>
<scope>provided</scope>

Member: I'll probably learn something here, but why is it necessary to mark it provided and then compile-scope again in the assembly? What does it matter if it's compile scope everywhere? I realize Hive on Spark may need it to be provided, but it can change the nature of this dependency down-stream if needed.


Contributor (author): If you leave it as compile, it will pollute the compilation classpath of people using Spark in their projects (Guava will be pulled in as a compile-scope dependency too).

http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope

Adding it as compile-scope in the assembly project is then needed because maven-shade-plugin will only include compile-scope artifacts in the final jar (pretty sure there are settings to include other scopes, but I didn't want to play with those).
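To make the scenario concrete, here is a sketch of a downstream application's POM under this scheme (coordinates and versions are illustrative, not prescribed by the PR): Spark no longer leaks Guava transitively, so an app that uses Guava directly must declare, and package, its own copy:

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.1.0</version>
    <!-- Spark itself is supplied by the cluster's uber-jar. -->
    <scope>provided</scope>
  </dependency>
  <!-- Declared explicitly because it no longer arrives via Spark; the
       version must stay compatible with what the cluster ships. -->
  <dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>14.0.1</version>
  </dependency>
</dependencies>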

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I understand what the scopes mean. But Spark is already "provided" to any project compiling against Spark; you're not suppose to pull in Spark as compile-scope. This makes all its transitive dependencies provided in this scenario already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if someone "accidentally" pulls Spark as a compile-scope dependency, they'd run into what I described. This avoids that problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, in that scenario, someone's trying to run Spark outside of a cluster providing Spark or something. But then this fails since Guava is said to be provided, but is not actually provided by anything! Unless I suppose the app also depends on Guava.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, but that's what provided means, right? It's provided. Somehow. That "how" is not the concern of the library being pulled.

Normally it's provided by the Spark uber-jar which contains the dependency. But if someone is not using those, it becomes their job to provide it (and make sure it's a compatible version).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's what "provided" means, but why do you assume the app provides Guava itself, but none of the others? I see no reason to think of Guava specially and not Commons Math or log4j or whatever. What does it solve to treat this exceptionally? It means extra declarations in Spark, and does not affect your ability to shade the dependency in the assembly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this is kind of a problem with the scopes in maven and the way things are generally distributed...

For libraries, I believe all dependencies should be "provided". So commons-math and all others should, in my view, also be provided. Of course I won't make that change here.

When you package libraries into an app (which is the case of the uber-jars, meant to be distributed in a cluster), then you need to package those dependencies. In maven that generally means making them "compile" scope.

For a pure library, you wouldn't have this packaging step; that would be done by the users of the library.

Leaving the library with a compile-scope dependency means you'll run into all the issues I pointed out in the bug:

  • people will depend on the transitive dependency
  • at some point they might want to update the version
  • and at that point things will fail if they forget to package the new dependency

So this change solves that problem for Guava. If you use it, you have to explicitly declare and package it. For others, I'll leave it for the time when people decide to tackle other dependencies (I explicitly created a sub-task for Guava to avoid having to deal with other dependencies).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that if you leave it as a compile-time dependency, and people are using Guava in their app, and not packaging Guava, their app will fail to run. Because the uber jar in the cluster won't have those classes.

So saying "provided" here moves that particular error to the build.

</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
@@ -1017,6 +1018,21 @@

<profiles>

<!--
This profile is enabled automatically by the sbt build. It changes the scope of the guava
dependency, since we don't shade it in the artifacts generated by the sbt build.
-->
<profile>
<id>sbt</id>
<dependencies>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<scope>compile</scope>

Contributor: Hey, just wondering - since spark-core lists this as a compile dependency, and everything else depends on spark-core, what makes this necessary?


Contributor: Sorry - I see - the issue is that elsewhere this is not in compile scope. However, why not just have it be compile scope by default in the root pom? Is the issue that we need to make sure it does not exist at compile scope in our published root pom?


Contributor (author): See the previous discussion I had with Sean in this PR.

There's really no good scope in maven for what we want to do. "compile" means that guava will show up in the user's compilation classpath, and since we're shading guava in the assembly, it may lead to runtime errors since guava won't be there. "provided" means that guava is not exposed to the user at all; so if the user wants to use guava, he'll need to add an explicit dependency and package it himself.

I like "provided" better because it means things fail for the user at build time, not runtime, if the user is using guava but hasn't declared the dependency. And that's why I had all those hacks in the sbt build that you asked me to remove.


Contributor: I just meant that, since we are using the shade plug-in, it would just remove the entire guava entry from our published poms anyway... then I realized that it won't remove guava from the root pom, most likely, since we only shade inside of spark-core.

</dependency>
</dependencies>
</profile>

<!-- Ganglia integration is not included by default due to LGPL-licensed code -->
<profile>
<id>spark-ganglia-lgpl</id>
Expand Down
5 changes: 2 additions & 3 deletions project/SparkBuild.scala
@@ -61,7 +61,7 @@ object SparkBuild extends PomBuild {
def backwardCompatibility = {
import scala.collection.mutable
var isAlphaYarn = false
-    var profiles: mutable.Seq[String] = mutable.Seq.empty
+    var profiles: mutable.Seq[String] = mutable.Seq("sbt")
if (Properties.envOrNone("SPARK_GANGLIA_LGPL").isDefined) {
println("NOTE: SPARK_GANGLIA_LGPL is deprecated, please use -Pspark-ganglia-lgpl flag.")
profiles ++= Seq("spark-ganglia-lgpl")
@@ -116,7 +116,7 @@ object SparkBuild extends PomBuild {
retrieveManaged := true,
retrievePattern := "[type]s/[artifact](-[revision])(-[classifier]).[ext]",
publishMavenStyle := true,
-
+    resolvers += Resolver.mavenLocal,
otherResolvers <<= SbtPomKeys.mvnLocalRepository(dotM2 => Seq(Resolver.file("dotM2", dotM2))),
publishLocalConfiguration in MavenCompile <<= (packagedArtifacts, deliverLocal, ivyLoggingLevel) map {
@@ -228,7 +228,6 @@ object SQL {
object Hive {

lazy val settings = Seq(
-
javaOptions += "-XX:MaxPermSize=1g",
// Multiple queries rely on the TestHive singleton. See comments there for more details.
parallelExecution in Test := false,
4 changes: 4 additions & 0 deletions project/plugins.sbt
@@ -26,3 +26,7 @@ addSbtPlugin("com.alpinenow" % "junit_xml_listener" % "0.5.1")
addSbtPlugin("com.eed3si9n" % "sbt-unidoc" % "0.3.1")

addSbtPlugin("com.cavorite" % "sbt-avro" % "0.3.2")

libraryDependencies += "org.ow2.asm" % "asm" % "5.0.3"

libraryDependencies += "org.ow2.asm" % "asm-commons" % "5.0.3"