[SPARK-8852] [flume] Trim dependencies in flume assembly. #7247

vanzin · 2015-07-06T23:50:37Z

Also, add support for the *-provided profiles. This avoids repackaging
things that are already in the Spark assembly, or, in the case of the
*-provided profiles, are provided by the distribution.

The flume-ng-auth dependency was also excluded since it's not really
used by Spark.

vanzin · 2015-07-06T23:51:56Z

Assembly came down to ~ 2.5 MB from ~ 80 MB. @harishreedharan tells me the flume-ng-auth dependency is not needed, so I excluded it. I also fixed some indentation issues in the assembly pom.

Also tested with the flume-provided profile enabled, in which case the assembly is ~ 170 kB.

SparkQA · 2015-07-07T02:06:16Z

Test build #36621 has finished for PR 7247 at commit c962082.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2015-07-07T08:26:50Z

pom.xml

I don't really disagree with this, since there is only one usage of flume in the project now. But the exclusions are specific to the flume-assembly module's needs, not to consumers of Flume in general in the project, right? Can this be managed down in the module alongside the other changes for the same purpose?

Yes, mostly they're for the flume-assembly build. I don't mind moving them out.

srowen · 2015-07-07T18:45:34Z

pom.xml

Does this exclusion belong in the child POM too? I had actually thought potentially all of them should go, unless we systematically want to exclude some deps from all uses of Flume across the project, of which there's really only one now anyway. That is, if the reason for the exclusion is specific to one child module, they can live there only. It's up to your better judgment IMHO so LGTM either way.

According to Hari this dependency is not used in Spark, so it sounds better to exclude it everywhere so that if something is added that uses it, things break (instead of just generating a broken assembly without needed classes).
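
To make the trade-off concrete, a project-wide exclusion might look roughly like the fragment below. This is a sketch under assumptions, not the actual diff from this PR: it assumes the exclusion sits on the `flume-ng-core` entry in the parent POM's `dependencyManagement`, and `flume.version` is an assumed property name.

```xml
<!-- Sketch only: placement and the flume.version property are assumptions. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.flume</groupId>
      <artifactId>flume-ng-core</artifactId>
      <version>${flume.version}</version>
      <exclusions>
        <!-- Excluded for every module that inherits this entry, so code
             that starts needing flume-ng-auth fails at build time. -->
        <exclusion>
          <groupId>org.apache.flume</groupId>
          <artifactId>flume-ng-auth</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  </dependencies>
</dependencyManagement>
```

With the exclusion declared once in the parent, child modules that depend on Flume without restating exclusions inherit it, which is what turns a silently incomplete assembly into a visible build failure.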

SparkQA · 2015-07-07T19:09:37Z

Test build #36693 has finished for PR 7247 at commit 298a7d5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2015-07-09T19:45:49Z

Ping?

srowen · 2015-07-09T19:53:44Z

LGTM, any objections?

tdas · 2015-07-09T21:36:42Z

What is the flume-provided profile for? If flume is already provided in the runtime environment, then the only thing that needs to be added is spark-streaming-flume, which can be directly added as its own JAR. Why have a separate profile in assembly?

vanzin · 2015-07-09T21:39:59Z

> Why have a separate profile in assembly?

Because that would mean that, depending on the profiles you enable, the pydocs in the code need to be changed, which is ugly. This way, while you have some redundancy, at least the user interface (or at least the user docs) remains consistent.

EDIT: I'm referring to this in flume.py:

2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
   Group Id = org.apache.spark, Artifact Id = spark-streaming-flume-assembly, Version = %s.
   Then, include the jar in the spark-submit command as

   $ bin/spark-submit --jars <spark-streaming-flume-assembly.jar> ...

tdas · 2015-07-09T22:10:13Z

Oh, it's easy to add to the pydoc that if Flume is already present you can just download spark-streaming-flume.jar. That is much better than having another maven profile to manage and reason about. And I don't even know who would use that profile to compile. If someone is smart enough to use that profile for some purpose, then it's fair to assume that he/she will be knowledgeable enough to know that just including spark-streaming-flume.jar is sufficient, instead of including spark-streaming-flume-assembly.jar.

vanzin · 2015-07-09T22:14:03Z

That maven profile already exists. We use it when packaging CDH. That's what all "*-provided" profiles are for - for distributions to use when they already provide the dependencies.
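
For readers unfamiliar with the pattern, a "*-provided" profile can be sketched as a property that switches the dependency scope. The fragment below is illustrative only: the profile id `flume-provided` comes from this thread, but the property names `flume.deps.scope` and `flume.version` are assumptions and may not match the real POM.

```xml
<!-- Sketch only: flume.deps.scope and flume.version are assumed property names. -->
<project>
  <properties>
    <!-- Default: Flume classes are bundled into the assembly. -->
    <flume.deps.scope>compile</flume.deps.scope>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.apache.flume</groupId>
      <artifactId>flume-ng-core</artifactId>
      <version>${flume.version}</version>
      <scope>${flume.deps.scope}</scope>
    </dependency>
  </dependencies>
  <profiles>
    <profile>
      <!-- Enabled with -Pflume-provided by distributions that already ship Flume. -->
      <id>flume-provided</id>
      <properties>
        <flume.deps.scope>provided</flume.deps.scope>
      </properties>
    </profile>
  </profiles>
</project>
```

Under this arrangement the assembly artifact keeps the same name either way, so the documented `--jars` instructions stay valid whether or not the profile is enabled.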

vanzin · 2015-07-09T22:28:12Z

> if Flume is already present you can just download spark-streaming-flume.jar

On top of my previous comment, that's also not enough. You also need spark-streaming-flume-sink.jar, which is included in the assembly... which is why having yet another set of instructions is confusing to users.

tdas · 2015-07-10T00:34:21Z

I understand it now. All right LGTM.

srowen reviewed Jul 7, 2015

Feedback.

298a7d5

srowen reviewed Jul 7, 2015

asfgit closed this in 0e78e40 Jul 10, 2015

vanzin deleted the SPARK-8852 branch July 30, 2015 00:07
