
[SPARK-3812] [BUILD] Adapt maven build to publish effective pom. #2673

Closed

Conversation

ScrapCodes
Member

I tried the maven-help-plugin first, but that published all projects in the top-level pom. So I was left with no choice but to roll my own trivial plugin. This patch basically installs an effective pom after mvn install finishes.

The problem it fixes is as follows:
If you install using maven, e.g.
mvn install -DskipTests -Dhadoop.version=2.2.0 -Phadoop-2.2
then without this patch the published pom(s) will still list the hadoop version as 1.0.4 (the pom default). This can be a problem at some point.
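For reference, the resolved pom that ought to be published can be previewed with the maven-help-plugin; a minimal sketch, where the output file name is just an example:

  mvn help:effective-pom -Phadoop-2.2 -Dhadoop.version=2.2.0 -Doutput=effective-pom.xml

This writes out the pom with active profiles and properties resolved, which is the kind of pom this patch aims to install.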

@SparkQA

SparkQA commented Oct 6, 2014

QA tests have started for PR 2673 at commit cfe0531.

  • This patch merges cleanly.

@ScrapCodes ScrapCodes changed the title Build changes to publish effective pom. [SPARK-3812] Build changes to publish effective pom. Oct 6, 2014
@ScrapCodes ScrapCodes changed the title [SPARK-3812] Build changes to publish effective pom. [SPARK-3812] [BUILD] Build changes to publish effective pom. Oct 6, 2014
@ScrapCodes ScrapCodes changed the title [SPARK-3812] [BUILD] Build changes to publish effective pom. [SPARK-3812] [BUILD] Adapt maven build to publish effective pom. Oct 6, 2014
@ScrapCodes
Member Author

@pwendell Take a look whenever you get time. It would be good if we could publish https://github.com/ScrapCodes/effective-pom-plugin.

@ScrapCodes
Member Author

I will have to add a similar thing for http://maven.apache.org/plugins/maven-deploy-plugin/deploy-file-mojo.html, but I am not sure about the repository URL field.
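For context, a deploy-file invocation would look roughly like the sketch below; the repository id, URL, and file paths are placeholders, not settings from this PR:

  mvn deploy:deploy-file \
    -Dfile=core/target/spark-core_2.10-1.2.0-SNAPSHOT.jar \
    -DpomFile=core/target/effective-pom.xml \
    -DrepositoryId=some-repo-id \
    -Durl=https://example.org/repository/releases/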

@SparkQA

SparkQA commented Oct 6, 2014

QA tests have finished for PR 2673 at commit cfe0531.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CacheTableCommand(tableName: String, plan: Option[LogicalPlan], isLazy: Boolean)
    • case class UncacheTableCommand(tableName: String) extends Command
    • case class CacheTableCommand(
    • case class UncacheTableCommand(tableName: String) extends LeafNode with Command
    • case class DescribeCommand(child: SparkPlan, output: Seq[Attribute])(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21323/
Test PASSed.

@vanzin
Contributor

vanzin commented Oct 6, 2014

@ScrapCodes what's the end goal here?

I understand what your change does, but I'm confused about how this actually translates to the published bits, unless there's more information that I'm missing.

AFAIK, right now, only one "spark-core_2.10" artifact is published, as can be seen at:

http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10

So, with this fix, the default hadoop version for that artifact would be changed if the build that publishes it is compiled with a different -Dhadoop.version=x.

Or is there a plan to publish different hadoop versions under different artifactIds (or versions)?

@ScrapCodes
Member Author

Well, sorry for the confusion. With the bits published now, one can easily exclude one dependency and include another, as long as it is only about a few deps. But once we start moving towards supporting Scala 2.11, we will want to publish artifacts for both Scala 2.10 and 2.11, and at that point publishing effective pom(s) will be essential. There is a related discussion on another PR, #2615.

@pwendell
Contributor

pwendell commented Oct 7, 2014

Hey @vanzin, this is something we need in order to support cross-building for Scala 2.10 and 2.11, but it's also a broadly useful thing that is lacking in the maven build ecosystem (SBT has this feature out of the box).

The reason why it's needed is that we need to publish different dependency trees for every artifact, say spark-core for both 2.10 and 2.11. Maven isn't good at distinguishing the published artifacts from the build itself, so this will allow us to publish only the "effective" dependency graph for whichever profile is activated during publishing.

The reason why it's useful more broadly is that we have a lot of profiles in Spark's build that get published with our poms, even though they are not usable by projects that depend on Spark (profiles are not a concept that works transitively across projects - or at least many build tools, such as sbt, do not respect profiles from other projects). They just make our poms look more confusing to people. What we really want to publish is a single set of dependencies for each Spark artifact.
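As a hypothetical illustration of that point (not an excerpt from this PR), the effective pom published from a Scala 2.10 build would pin the resolved Scala dependency directly, with no profiles left in it:

  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.4</version>
  </dependency>

and a 2.11 build would publish the corresponding _2.11 artifact with the 2.11 version of that dependency.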

@vanzin
Contributor

vanzin commented Oct 7, 2014

Thanks for the explanation. The issue with the Scala versions makes sense. What threw me off was the Hadoop example: I've always seen people say that the Spark API is independent of the Hadoop version, and that users should explicitly declare the Hadoop version they want in their projects (and have a matching Spark deployment). So explaining this PR in terms of publishing different Hadoop versions sounds a little bit at odds with that.

@ueshin
Member

ueshin commented Oct 7, 2014

Hi @pwendell, I had a similar issue related to artifacts in Maven Central and Hadoop versions.
Could you take a look at SPARK-3764 and #2638 please?

@pwendell
Contributor

pwendell commented Oct 8, 2014

@vanzin yeah I agree - I think it was just sort of a red herring based on the example in @ScrapCodes' description. AFAIK this is totally unrelated to Hadoop.

@AmplabJenkins

Can one of the admins verify this patch?

@dispalt

dispalt commented Oct 9, 2014

@pwendell As a fun fact, HBase publishes jars for hadoop1 and hadoop2 like so: http://search.maven.org/#artifactdetails%7Corg.apache.hbase%7Chbase%7C0.98.6.1-hadoop2%7Cpom - they have some tooling in the dev-support folder to help them automate that.

@SparkQA

SparkQA commented Oct 10, 2014

QA tests have started for PR 2673 at commit 285be82.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 10, 2014

QA tests have finished for PR 2673 at commit 285be82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21579/
Test PASSed.

@ScrapCodes
Member Author

@pwendell I tried to make the maven-shade-plugin work as an effective-pom generator, but that does not happen unless the project has dependencies, apart from its own artifact, to put in the uber jar. Please see the commits above and their commit messages for the reasons.

@SparkQA

SparkQA commented Oct 10, 2014

QA tests have started for PR 2673 at commit 242a24e.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 10, 2014

QA tests have finished for PR 2673 at commit 242a24e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21589/
Test PASSed.

@ScrapCodes
Member Author

@pwendell I don't see an easy way with the maven-shade-plugin either - do you? One way is to include a fake dependency and then ask the plugin to shade that across all artifacts, but I somehow felt this is more invasive.

@pwendell
Contributor

What if we just add the maven-shade-plugin but configure it in a way where it effectively does no operations? I.e. we add a relocation of a class that does not exist. Would it still produce an effective pom?
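For concreteness, the kind of no-op relocation meant here might look like the sketch below inside the shade-plugin configuration; the package names are made up and match nothing in the build:

  <relocations>
    <relocation>
      <!-- no classes actually live under this package, so nothing is rewritten -->
      <pattern>org.spark.unused.doesnotexist</pattern>
      <shadedPattern>org.spark.shaded.doesnotexist</shadedPattern>
    </relocation>
  </relocations>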

@pwendell
Contributor

I think a solution that doesn't require us to use our own plug-in is good, even if we have to do something silly like shading a class that is in fact unused.

@ScrapCodes
Member Author

Even if we do this trick and shade something in all artifacts, what about spark-parent? (I mean, how do we get its effective pom?) Since we don't build a jar there, the shading plugin throws NPEs.

@pwendell
Contributor

Hm - is there any workaround for that? We don't actually need to publish the parent pom anyway if we are using effective poms. Could we just install the other modules and not the parent?

The current approach also adds the Guava logic to the parent pom, which doesn't make much sense. I think ideally we'd just have a very simple "dummy" relocation (e.g. without any exclusions) in the parent, and the existing shading stuff around Guava would be in core/pom.xml. Does the NPE go away if you do that?

@ScrapCodes
Member Author

We can definitely install the other modules; I am just afraid the resulting (effective) pom(s) will still carry a reference to the parent pom. Let me try that out.

@ScrapCodes
Member Author

Yeah, I verified the resulting dependency tree, as shown in the gists above.

@pwendell
Contributor

Jenkins, test this please.

@SparkQA

SparkQA commented Oct 21, 2014

QA tests have started for PR 2673 at commit 6fff1ce.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 21, 2014

Tests timed out for PR 2673 at commit 6fff1ce after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21996/
Test FAILed.

@vanzin
Contributor

vanzin commented Oct 21, 2014

Hey @ScrapCodes @pwendell,

Funny thing. I've been playing with a different part of the build and I've found out I need exactly this to get what I wanted to work. I think I found a workaround that doesn't require a new plugin, although it's not pretty.

What you can do is create a "dummy" sub-module that just publishes an empty jar (or a jar with any random file, it really doesn't matter). In the projects where you want to publish the effective pom, add a call to the maven-shade-plugin that only shades this dummy dependency, and actually excludes all of its files in the filters section if needed.

That has both the effect of not packaging anything else in the final jar, and also removing the dummy dependency from the published pom. But well, it's pretty ugly.
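A rough sketch of that configuration, using made-up coordinates org.spark-project:unused-dummy for the dummy sub-module; this is only an outline of the idea, not the exact change in this PR:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <configuration>
      <shadedArtifactAttached>false</shadedArtifactAttached>
      <artifactSet>
        <includes>
          <!-- only the dummy dependency is shaded, so nothing real gets bundled -->
          <include>org.spark-project:unused-dummy</include>
        </includes>
      </artifactSet>
      <filters>
        <filter>
          <!-- drop every file from the dummy jar so the module's jar is unchanged -->
          <artifact>org.spark-project:unused-dummy</artifact>
          <excludes>
            <exclude>**/*</exclude>
          </excludes>
        </filter>
      </filters>
    </configuration>
    <executions>
      <execution>
        <phase>package</phase>
        <goals>
          <goal>shade</goal>
        </goals>
      </execution>
    </executions>
  </plugin>

The useful side effect is the dependency-reduced pom the shade plugin writes, which drops the dummy dependency and would then be installed and deployed in place of the raw pom.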

@vanzin
Contributor

vanzin commented Oct 21, 2014

Yay, I just noticed your last patch takes exactly that approach.

@pwendell
Contributor

@vanzin yep that's exactly what we did!

@pwendell
Contributor

Jenkins, test this please.

@SparkQA

SparkQA commented Oct 21, 2014

QA tests have started for PR 2673 at commit 6fff1ce.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 22, 2014

Tests timed out for PR 2673 at commit 6fff1ce after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22014/
Test FAILed.

@ScrapCodes
Member Author

Not sure which is better: creating an empty project with just a pom file in it, or depending on a random jar from Maven Central? I prefer the first approach.

@pwendell
Contributor

I'll just publish an empty jar in maven central - we can use it forever.

@pwendell
Contributor

I created a jar you can use. It might take a couple of hours to propagate to maven central:

  <groupId>org.spark-project.spark</groupId>
  <artifactId>unused</artifactId>
  <version>1.0.0</version>
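Presumably each module would then declare it as an ordinary dependency and list it in the shade-plugin's artifactSet, so it is removed again from the published dependency-reduced pom; a sketch only:

  <!-- in the module's <dependencies> section -->
  <dependency>
    <groupId>org.spark-project.spark</groupId>
    <artifactId>unused</artifactId>
    <version>1.0.0</version>
  </dependency>

  <!-- and in the shade-plugin's artifactSet includes -->
  <include>org.spark-project.spark:unused</include>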

@vanzin
Contributor

vanzin commented Oct 22, 2014

@ScrapCodes here's an approach that doesn't rely on maven central:
https://gist.github.com/vanzin/1ac8114faf39a6e8359d

Feel free to use / adapt it if you guys think that's preferred. (Also note that this solution runs the maven-shade-plugin even for modules that are not "jar" modules; it seems to work and not do anything evil, though.)

@pwendell
Contributor

Jenkins, test this please.

@SparkQA

SparkQA commented Oct 22, 2014

QA tests have started for PR 2673 at commit aa7b91d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 22, 2014

QA tests have finished for PR 2673 at commit aa7b91d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22045/
Test PASSed.

@pwendell
Contributor

Okay - both approaches seem decent, but the current one is a bit simpler for developers (otherwise we have to explain the empty source directory). I'm just going to pull this in to lay the groundwork for the other Scala 2.11 stuff.

@asfgit asfgit closed this in c5882c6 Oct 23, 2014
@witgo
Contributor

witgo commented Oct 23, 2014

@ScrapCodes @pwendell
This patch causes a maven-assembly-plugin error - the assembly jar comes out nearly empty:
./make-distribution.sh -Dhadoop.version=2.3.0-cdh5.0.1 -Dyarn.version=2.3.0-cdh5.0.1 -Phadoop-2.3 -Pyarn -Pnetlib-lgpl

du -sh dist/lib/*

4.0K    dist/lib/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.0.1.jar
928K    dist/lib/spark-examples-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.0.1.jar

@pwendell
Contributor

Thanks - I reverted this patch in master.

