Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-2848] Shade Guava in uber-jars. #1813

Closed
wants to merge 8 commits into from
Closed
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 24 additions & 0 deletions assembly/pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,12 @@
</properties>

<dependencies>
<!-- Promote Guava to compile scope in this module so it's included while shading. -->
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
Expand Down Expand Up @@ -95,6 +101,12 @@
</includes>
</artifactSet>
<filters>
<filter>
<artifact>com.google.guava:guava</artifact>
<excludes>
<exclude>com/google/common/base/Optional*</exclude>
</excludes>
</filter>
<filter>
<artifact>*:*</artifact>
<excludes>
Expand All @@ -113,6 +125,18 @@
<goal>shade</goal>
</goals>
<configuration>
<relocations>
<relocation>
<pattern>com.google</pattern>
<shadedPattern>org.spark-project.guava</shadedPattern>
<includes>
<include>com.google.common.**</include>
</includes>
<excludes>
<exclude>com.google.common.base.Optional**</exclude>
</excludes>
</relocation>
</relocations>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
Expand Down
243 changes: 243 additions & 0 deletions core/src/main/java/com/google/common/base/Optional.java
Original file line number Diff line number Diff line change
@@ -0,0 +1,243 @@
/*
* Copyright (C) 2011 The Guava Authors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS I checked and there is no additional change we need to make to LICENSE or NOTICE as a result of including this. Guava is already AL2 and has no NOTICE that pertains to this file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we still list it in our LICENSE? We've listed all source files that are not copyright Apache, which would include this one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not according to my reading of
http://www.apache.org/dev/licensing-howto.html , in the part covering
bundled AL2 dependencies, but this stuff is always less than intuitive to
me.

I suppose the reasoning is that Spark's AL2 license already exactly
describes the terms of the licensing for that file. That's not quite true
for BSD or MIT licenses.

The copyright owner doesn't matter; there are hundreds of them. It matters
just how whoever they are license their IP to be used.
On Aug 11, 2014 12:22 AM, "Matei Zaharia" notifications@github.com wrote:

In core/src/main/java/com/google/common/base/Optional.java:

@@ -0,0 +1,243 @@
+/*

  • * Copyright (C) 2011 The Guava Authors

Shouldn't we still list it in our LICENSE? We've listed all source files
that are not copyright Apache, which would include this one.


Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/1813/files#r16034129.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW if we're comfortable with not locking this file to the 14.0.1 version (or remembering to do this when we change guava versions), I can remove this fork (and clean up some code elsewhere in this patch too).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I would still include it because the copyright is from another organization. I don't see why adding this would hurt.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vanzin You're referring to just not including this source file and instead getting it in binary form from the Guava artifact right? b/c yes if it stays in source form it certainly must keep its copyright statement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen yes that's the alternative.

*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package com.google.common.base;

import static com.google.common.base.Preconditions.checkNotNull;

import com.google.common.annotations.Beta;
import com.google.common.annotations.GwtCompatible;

import java.io.Serializable;
import java.util.Iterator;
import java.util.Set;

import javax.annotation.Nullable;

/**
* An immutable object that may contain a non-null reference to another object. Each
* instance of this type either contains a non-null reference, or contains nothing (in
* which case we say that the reference is "absent"); it is never said to "contain {@code
* null}".
*
* <p>A non-null {@code Optional<T>} reference can be used as a replacement for a nullable
* {@code T} reference. It allows you to represent "a {@code T} that must be present" and
* a "a {@code T} that might be absent" as two distinct types in your program, which can
* aid clarity.
*
* <p>Some uses of this class include
*
* <ul>
* <li>As a method return type, as an alternative to returning {@code null} to indicate
* that no value was available
* <li>To distinguish between "unknown" (for example, not present in a map) and "known to
* have no value" (present in the map, with value {@code Optional.absent()})
* <li>To wrap nullable references for storage in a collection that does not support
* {@code null} (though there are
* <a href="http://code.google.com/p/guava-libraries/wiki/LivingWithNullHostileCollections">
* several other approaches to this</a> that should be considered first)
* </ul>
*
* <p>A common alternative to using this class is to find or create a suitable
* <a href="http://en.wikipedia.org/wiki/Null_Object_pattern">null object</a> for the
* type in question.
*
* <p>This class is not intended as a direct analogue of any existing "option" or "maybe"
* construct from other programming environments, though it may bear some similarities.
*
* <p>See the Guava User Guide article on <a
* href="http://code.google.com/p/guava-libraries/wiki/UsingAndAvoidingNullExplained#Optional">
* using {@code Optional}</a>.
*
* @param <T> the type of instance that can be contained. {@code Optional} is naturally
* covariant on this type, so it is safe to cast an {@code Optional<T>} to {@code
* Optional<S>} for any supertype {@code S} of {@code T}.
* @author Kurt Alfred Kluever
* @author Kevin Bourrillion
* @since 10.0
*/
@GwtCompatible(serializable = true)
public abstract class Optional<T> implements Serializable {
/**
* Returns an {@code Optional} instance with no contained reference.
*/
@SuppressWarnings("unchecked")
public static <T> Optional<T> absent() {
return (Optional<T>) Absent.INSTANCE;
}

/**
* Returns an {@code Optional} instance containing the given non-null reference.
*/
public static <T> Optional<T> of(T reference) {
return new Present<T>(checkNotNull(reference));
}

/**
* If {@code nullableReference} is non-null, returns an {@code Optional} instance containing that
* reference; otherwise returns {@link Optional#absent}.
*/
public static <T> Optional<T> fromNullable(@Nullable T nullableReference) {
return (nullableReference == null)
? Optional.<T>absent()
: new Present<T>(nullableReference);
}

Optional() {}

/**
* Returns {@code true} if this holder contains a (non-null) instance.
*/
public abstract boolean isPresent();

/**
* Returns the contained instance, which must be present. If the instance might be
* absent, use {@link #or(Object)} or {@link #orNull} instead.
*
* @throws IllegalStateException if the instance is absent ({@link #isPresent} returns
* {@code false})
*/
public abstract T get();

/**
* Returns the contained instance if it is present; {@code defaultValue} otherwise. If
* no default value should be required because the instance is known to be present, use
* {@link #get()} instead. For a default value of {@code null}, use {@link #orNull}.
*
* <p>Note about generics: The signature {@code public T or(T defaultValue)} is overly
* restrictive. However, the ideal signature, {@code public <S super T> S or(S)}, is not legal
* Java. As a result, some sensible operations involving subtypes are compile errors:
* <pre> {@code
*
* Optional<Integer> optionalInt = getSomeOptionalInt();
* Number value = optionalInt.or(0.5); // error
*
* FluentIterable<? extends Number> numbers = getSomeNumbers();
* Optional<? extends Number> first = numbers.first();
* Number value = first.or(0.5); // error}</pre>
*
* As a workaround, it is always safe to cast an {@code Optional<? extends T>} to {@code
* Optional<T>}. Casting either of the above example {@code Optional} instances to {@code
* Optional<Number>} (where {@code Number} is the desired output type) solves the problem:
* <pre> {@code
*
* Optional<Number> optionalInt = (Optional) getSomeOptionalInt();
* Number value = optionalInt.or(0.5); // fine
*
* FluentIterable<? extends Number> numbers = getSomeNumbers();
* Optional<Number> first = (Optional) numbers.first();
* Number value = first.or(0.5); // fine}</pre>
*/
public abstract T or(T defaultValue);

/**
* Returns this {@code Optional} if it has a value present; {@code secondChoice}
* otherwise.
*/
public abstract Optional<T> or(Optional<? extends T> secondChoice);

/**
* Returns the contained instance if it is present; {@code supplier.get()} otherwise. If the
* supplier returns {@code null}, a {@link NullPointerException} is thrown.
*
* @throws NullPointerException if the supplier returns {@code null}
*/
@Beta
public abstract T or(Supplier<? extends T> supplier);

/**
* Returns the contained instance if it is present; {@code null} otherwise. If the
* instance is known to be present, use {@link #get()} instead.
*/
@Nullable
public abstract T orNull();

/**
* Returns an immutable singleton {@link Set} whose only element is the contained instance
* if it is present; an empty immutable {@link Set} otherwise.
*
* @since 11.0
*/
public abstract Set<T> asSet();

/**
* If the instance is present, it is transformed with the given {@link Function}; otherwise,
* {@link Optional#absent} is returned. If the function returns {@code null}, a
* {@link NullPointerException} is thrown.
*
* @throws NullPointerException if the function returns {@code null}
*
* @since 12.0
*/
public abstract <V> Optional<V> transform(Function<? super T, V> function);

/**
* Returns {@code true} if {@code object} is an {@code Optional} instance, and either
* the contained references are {@linkplain Object#equals equal} to each other or both
* are absent. Note that {@code Optional} instances of differing parameterized types can
* be equal.
*/
@Override
public abstract boolean equals(@Nullable Object object);

/**
* Returns a hash code for this instance.
*/
@Override
public abstract int hashCode();

/**
* Returns a string representation for this instance. The form of this string
* representation is unspecified.
*/
@Override
public abstract String toString();

/**
* Returns the value of each present instance from the supplied {@code optionals}, in order,
* skipping over occurrences of {@link Optional#absent}. Iterators are unmodifiable and are
* evaluated lazily.
*
* @since 11.0 (generics widened in 13.0)
*/
@Beta
public static <T> Iterable<T> presentInstances(
final Iterable<? extends Optional<? extends T>> optionals) {
checkNotNull(optionals);
return new Iterable<T>() {
@Override
public Iterator<T> iterator() {
return new AbstractIterator<T>() {
private final Iterator<? extends Optional<? extends T>> iterator =
checkNotNull(optionals.iterator());

@Override
protected T computeNext() {
while (iterator.hasNext()) {
Optional<? extends T> optional = iterator.next();
if (optional.isPresent()) {
return optional.get();
}
}
return endOfData();
}
};
}
};
}

private static final long serialVersionUID = 0;
}
26 changes: 25 additions & 1 deletion examples/pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -46,8 +46,14 @@
</dependencies>
</profile>
</profiles>

<dependencies>
<!-- Promote Guava to compile scope in this module so it's included while shading. -->
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
Expand Down Expand Up @@ -209,6 +215,12 @@
</includes>
</artifactSet>
<filters>
<filter>
<artifact>com.google.guava:guava</artifact>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto here, add a comment on why we keep Optional

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I guess it's above too, but wouldn't hurt to place it exactly where we do the exclude).

<excludes>
<exclude>com/google/common/base/Optional*</exclude>
</excludes>
</filter>
<filter>
<artifact>*:*</artifact>
<excludes>
Expand All @@ -226,6 +238,18 @@
<goal>shade</goal>
</goals>
<configuration>
<relocations>
<relocation>
<pattern>com.google</pattern>
<shadedPattern>org.spark-project.guava</shadedPattern>
<includes>
<include>com.google.common.**</include>
</includes>
<excludes>
<exclude>com.google.common.base.Optional**</exclude>
</excludes>
</relocation>
</relocations>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
Expand Down
1 change: 1 addition & 0 deletions pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -249,6 +249,7 @@
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>14.0.1</version>
<scope>provided</scope>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll probably learn something here, but why is it necessary to mark it provided and then compile-scope again in the assembly? what does it matter if it's compile scope everywhere? I realize Hive on Spark may need it to be provided but it can change the nature of this dependency down-stream if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you leave it as compile, it will pollute the compilation classpath of people using Spark in their projects (Guava will be pulled as a compile-scope dependency too).

http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope

Adding it as compile-scope in the assembly project is then needed because maven-shade-plugin will only include those in the final jar (pretty sure there are setting to include other scopes, but didn't want to play with those).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I understand what the scopes mean. But Spark is already "provided" to any project compiling against Spark; you're not suppose to pull in Spark as compile-scope. This makes all its transitive dependencies provided in this scenario already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if someone "accidentally" pulls Spark as a compile-scope dependency, they'd run into what I described. This avoids that problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, in that scenario, someone's trying to run Spark outside of a cluster providing Spark or something. But then this fails since Guava is said to be provided, but is not actually provided by anything! Unless I suppose the app also depends on Guava.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, but that's what provided means, right? It's provided. Somehow. That "how" is not the concern of the library being pulled.

Normally it's provided by the Spark uber-jar which contains the dependency. But if someone is not using those, it becomes their job to provide it (and make sure it's a compatible version).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's what "provided" means, but why do you assume the app provides Guava itself, but none of the others? I see no reason to think of Guava specially and not Commons Math or log4j or whatever. What does it solve to treat this exceptionally? It means extra declarations in Spark, and does not affect your ability to shade the dependency in the assembly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this is kind of a problem with the scopes in maven and the way things are generally distributed...

For libraries, I believe all dependencies should be "provided". So commons-math and all others should, in my view, also be provided. Of course I won't make that change here.

When you package libraries into an app (which is the case of the uber-jars, meant to be distributed in a cluster), then you need to package those dependencies. In maven that generally means making them "compile" scope.

For a pure library, you wouldn't have this packaging step; that would be done by the users of the library.

Leaving the library with a compile-scope dependency means you'll run into all the issues I pointed out in the bug:

  • people will depend on the transitive dependency
  • at some point they might want to update the version
  • and at that point things will fail if they forget to package the new dependency

So this change solves that problem for Guava. If you use it, you have to explicitly declare and package it. For others, I'll leave it for the time when people decide to tackle other dependencies (I explicitly created a sub-task for Guava to avoid having to deal with other dependencies).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that if you leave it as a compile-time dependency, and people are using Guava in their app, and not packaging Guava, their app will fail to run. Because the uber jar in the cluster won't have those classes.

So saying "provided" here moves that particular error to the build.

</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
Expand Down
Loading