
[SPARK-4048] Enhance and extend hadoop-provided profile. #2982

Closed · wants to merge 28 commits

Conversation

@vanzin (Contributor) commented Oct 28, 2014

This change does a few things to make the hadoop-provided profile more useful:

  • Create new profiles for other libraries and services that might be provided by the infrastructure.
  • Simplify and fix the poms so that the profiles are only activated while building assemblies.
  • Fix tests so that they're able to run when the profiles are activated.
  • Add a new environment variable to be used by distributions that use these profiles to provide the runtime classpath for Spark jobs and daemons.
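The last bullet can be sketched roughly as follows. This is an illustrative Java stand-in for the shell logic in compute-classpath.{cmd,sh}, not code from the patch; the `DistClasspath` class name, the base entries, and all paths are hypothetical:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch: a launcher appends the distribution-provided classpath (the
// value of SPARK_DIST_CLASSPATH) after Spark's own entries, so that jars
// excluded by the *-provided profiles are still visible at runtime.
public class DistClasspath {
    static String buildClasspath(List<String> base, String distClasspath) {
        List<String> entries = new ArrayList<>(base);
        if (distClasspath != null && !distClasspath.isEmpty()) {
            entries.add(distClasspath);
        }
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        // Illustrative entries only; a real deployment would read
        // System.getenv("SPARK_DIST_CLASSPATH") set by the distribution.
        List<String> base = List.of("/opt/spark/conf", "/opt/spark/lib/spark-assembly.jar");
        String dist = "/opt/hadoop/share/hadoop/common/*";
        System.out.println(buildClasspath(base, dist));
    }
}
```

A distribution would typically populate the variable from its own layout (for example, from the output of `hadoop classpath`) before starting Spark daemons.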

@vanzin (Contributor, Author) commented Oct 28, 2014

Notes:

  • I ran the unit tests and also deployed the assemblies built with all profiles enabled internally, and made sure everything still worked.
  • The individual commits have more explanations for each part of the overall patch, so take a look at them when in doubt about something.

@SparkQA commented Oct 28, 2014

Test build #22371 has started for PR 2982 at commit d261346.

  • This patch merges cleanly.

@SparkQA commented Oct 28, 2014

Test build #22371 has finished for PR 2982 at commit d261346.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorTransformer(object):
    • class Normalizer(VectorTransformer):
    • class JavaModelWrapper(VectorTransformer):
    • class StandardScalerModel(JavaModelWrapper):
    • class StandardScaler(object):
    • class HashingTF(object):
    • class IDFModel(JavaModelWrapper):
    • class IDF(object):
    • class Word2VecModel(JavaModelWrapper):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22371/

@qiaohaijun

+1

@SparkQA commented Nov 4, 2014

Test build #22836 has started for PR 2982 at commit a89b833.

  • This patch merges cleanly.

@SparkQA commented Nov 4, 2014

Test build #22836 has finished for PR 2982 at commit a89b833.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22836/

@SparkQA commented Nov 4, 2014

Test build #22838 has started for PR 2982 at commit 93b4789.

  • This patch merges cleanly.

@SparkQA commented Nov 4, 2014

Test build #22838 has finished for PR 2982 at commit 93b4789.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NullType(PrimitiveType):
    • case class ScalaUdfBuilder[T: TypeTag](f: AnyRef)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22838/

@SparkQA commented Nov 4, 2014

Test build #22884 has started for PR 2982 at commit 615118f.

  • This patch merges cleanly.

@SparkQA commented Nov 4, 2014

Test build #22884 has finished for PR 2982 at commit 615118f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RDDFunctions[T: ClassTag](self: RDD[T]) extends Serializable

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22884/

@SparkQA commented Nov 4, 2014

Test build #22893 has started for PR 2982 at commit 414e1f0.

  • This patch merges cleanly.

@SparkQA commented Nov 4, 2014

Test build #22893 has finished for PR 2982 at commit 414e1f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22893/

@SparkQA commented Nov 5, 2014

Test build #22952 has started for PR 2982 at commit a0f538c.

  • This patch merges cleanly.

@SparkQA commented Nov 6, 2014

Test build #22952 has finished for PR 2982 at commit a0f538c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22952/

@SparkQA commented Nov 6, 2014

Test build #23016 has started for PR 2982 at commit f298adc.

  • This patch merges cleanly.

@SparkQA commented Nov 6, 2014

Test build #23016 has finished for PR 2982 at commit f298adc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23016/

@@ -41,10 +41,6 @@
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
Contributor:

Why was this removed?

Contributor (Author):

I did not see any direct dependency on jetty from the streaming code. I'll double check, though.

Contributor (Author):

So, streaming does not depend on Jetty directly, but it does use the servlet APIs. Spark doesn't pull in any servlet API dependency of its own; instead it relies on the transitive one provided by Jetty, which is a Jetty repackaging of the servlet APIs rather than a common, standalone dependency. (BTW, it worked without the explicit dependency because the jar was coming in as a transitive dependency of spark-core.)

In light of that I'll restore this dependency, even though I don't really like it very much (it pulls in a dependency that is too coarse for what streaming actually needs).

@SparkQA commented Nov 12, 2014

Test build #23243 has started for PR 2982 at commit 3ee72b3.

  • This patch merges cleanly.

@SparkQA commented Nov 12, 2014

Test build #23244 has started for PR 2982 at commit f1173d6.

  • This patch merges cleanly.

@vanzin (Contributor, Author) commented Nov 12, 2014

Friendly ping. I still have to double check TD's question, but even pending that, the code builds and runs correctly.

@SparkQA commented Nov 12, 2014

Test build #23243 has finished for PR 2982 at commit 3ee72b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23243/

// compute-classpath.{cmd,sh} and makes all needed jars available to child processes
// when the assembly is built with the "*-provided" profiles enabled.
if (sys.props.contains("spark.testing")) {
environment.put("SPARK_DIST_CLASSPATH", sys.props("java.class.path"))
Contributor:

Is this needed in all cases, or only when you are actually running tests with a "hadoop-provided" build and you've supplied your own SPARK_DIST_CLASSPATH for the test JVM? If the latter is true, can we just propagate the value of SPARK_DIST_CLASSPATH from the parent to the child? It's a bit weird here because you're setting SPARK_DIST_CLASSPATH to also include all of the normal Spark classes.

Then you would just check if SPARK_DIST_CLASSPATH is set and, if so, propagate its value to the child environment.

Contributor (Author):

I used to run into problems with tests spawning child processes that referenced hadoop classes, and those not being available when the "hadoop-provided" profile was enabled. Let me try without this and see how it goes, since I noticed some changes in the configuration of the test plugins in the root pom.

Contributor:

Even if it is still a problem, my question was: can we just set SPARK_DIST_CLASSPATH in the child to the value of SPARK_DIST_CLASSPATH in the parent, instead of setting it to the value of java.class.path?

Contributor (Author):

But SPARK_DIST_CLASSPATH is not set in the parent (the parent being the surefire/scalatest runner, in this case). The goal of this code is to add the classpath built by those (which includes all the dependencies from the pom) to the child processes without having to modify the test code to manually do it.
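The mechanism under discussion can be sketched as a small, self-contained example. This is an illustrative Java stand-in, not the actual code from the patch; the `PropagateClasspath` class name is invented, and the spark.testing property is set inline only so the snippet runs on its own (the build normally sets it on the test JVM's command line):

```java
import java.util.HashMap;
import java.util.Map;

public class PropagateClasspath {
    public static void main(String[] args) {
        // Normally set by the build when launching the test JVM; set here
        // only to make the example self-contained.
        System.setProperty("spark.testing", "1");

        Map<String, String> childEnv = new HashMap<>();
        if (System.getProperty("spark.testing") != null) {
            // Hand the runner's full classpath (which includes every pom
            // dependency) to spawned child processes, since the parent
            // never sets SPARK_DIST_CLASSPATH itself.
            childEnv.put("SPARK_DIST_CLASSPATH", System.getProperty("java.class.path"));
        }
        System.out.println(
            childEnv.get("SPARK_DIST_CLASSPATH").equals(System.getProperty("java.class.path")));
    }
}
```

The point is simply that java.class.path, not an inherited SPARK_DIST_CLASSPATH, is the source of truth in the test JVM.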

@pwendell (Contributor) commented Jan 8, 2015

Hey @vanzin. I left a comment about the test set-up, but overall this looks good. Fine to keep SPARK_DIST_CLASSPATH as is. It seems reasonable to have a way for distributions to customize the default global classpath behavior. My main hesitation was that we used to have SPARK_CLASSPATH, and many users would just add their jars, etc. using that variable, which didn't work well in multi-tenant environments. Now we have better mechanisms to point users to, so this seems fine.

Marcelo Vanzin added 2 commits January 8, 2015 12:29
Conflicts:
	yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
@SparkQA commented Jan 8, 2015

Test build #25258 has started for PR 2982 at commit 4e38f4e.

  • This patch merges cleanly.

@SparkQA commented Jan 8, 2015

Test build #25258 has finished for PR 2982 at commit 4e38f4e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25258/

@SparkQA commented Jan 8, 2015

Test build #25263 has started for PR 2982 at commit eb228c0.

  • This patch merges cleanly.

@SparkQA commented Jan 8, 2015

Test build #25263 has finished for PR 2982 at commit eb228c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25263/

<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-install-plugin</artifactId>
<configuration>
<skip>true</skip>
Contributor:

Is this change a separate issue?


@SparkQA commented Jan 8, 2015

Test build #25273 has started for PR 2982 at commit 82eb688.

  • This patch merges cleanly.

@SparkQA commented Jan 9, 2015

Test build #25273 has finished for PR 2982 at commit 82eb688.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25273/

@pwendell (Contributor) commented Jan 9, 2015

Thanks @vanzin for this cleanup - I'll merge this.

@asfgit asfgit closed this in 48cecf6 Jan 9, 2015
@vanzin vanzin deleted the SPARK-4048 branch January 9, 2015 18:12
asfgit pushed a commit that referenced this pull request Mar 3, 2015
After #2982 (SPARK-4048) we rely on the newer HBase packaging format.