This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Conversation

kimoonkim
Member

@kimoonkim kimoonkim commented Feb 8, 2017

@ash211 @cvpatel @ssuchter

What changes were proposed in this pull request?

Currently, the kubernetes integration test runs in the maven test phase and fails because the test-job jars and other jars are missing from the target dir. (See #74) Those jars are supposed to be copied in the pre-integration-test phase, which comes after the test phase.

This change fixes the issue by triggering the scalatest plugin in the integration-test phase. The target directory now has the needed jars.
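A sketch of the kind of pom.xml execution this describes, for the integration-tests module (illustrative only; the execution id and layout here are assumptions, not the exact committed diff):

```xml
<!-- Illustrative sketch: bind an extra scalatest-maven-plugin execution
     to the integration-test phase, so KubernetesSuite runs only after
     the pre-integration-test copy steps have populated target/. -->
<plugin>
  <groupId>org.scalatest</groupId>
  <artifactId>scalatest-maven-plugin</artifactId>
  <executions>
    <execution>
      <id>integration-test</id>
      <phase>integration-test</phase>
      <goals>
        <goal>test</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```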

How was this patch tested?

Ran the integration test build command and saw that it got past the previously failing point.

$ build/mvn -B clean integration-test -Pkubernetes -Pkubernetes-integration-tests -pl resource-managers/kubernetes/integration-tests -am -Dtest=none -DwildcardSuites=org.apache.spark.deploy.kubernetes.integrationtest.KubernetesSuite

@ash211

ash211 commented Feb 10, 2017

@kimoonkim I ran the build command you listed on a fresh checkout of the k8s-support-alternate-incremental branch (without this change) and didn't see any failures.

The exact command was: build/mvn clean; build/mvn -B integration-test -Pkubernetes -Pkubernetes-integration-tests -pl resource-managers/kubernetes/integration-tests -am -Dtest=none -DwildcardSuites=org.apache.spark.deploy.kubernetes.integrationtest.KubernetesSuite

and the output included

KubernetesSuite:
- Run a simple example
- Run using spark-submit
- Run using spark-submit with the examples jar on the docker image
- Run with custom labels
- Enable SSL on the driver submit server
- Added files should exist on the driver.
Run completed in 10 minutes, 48 seconds.
Total number of tests run: 6
Suites: completed 2, aborted 0
Tests: succeeded 6, failed 0, canceled 0, ignored 0, pending 0

Do you have any ideas why that might be? Am I not cleaning intermediate testing artifacts properly?

I'd expect this command to fail before your patch and succeed afterwards, but what I'm observing is that it's actually succeeding before.

@kimoonkim
Member Author

@ash211 Ah, thanks for trying out the commands. (I am also going to do that myself to verify :-)).

I noticed your first $ mvn clean command did not specify -Pkubernetes-integration-tests, which means it won't remove the resource-managers/kubernetes/integration-tests/target dir if that dir already exists. Any chance the target dir was pre-populated? If so, I think that would explain why it succeeds.

Can you also try doing clean and integration-test in a single command, i.e.

$ build/mvn -B clean integration-test -Pkubernetes -Pkubernetes-integration-tests -pl resource-managers/kubernetes/integration-tests -am -Dtest=none -DwildcardSuites=org.apache.spark.deploy.kubernetes.integrationtest.KubernetesSuite

@kimoonkim
Member Author

FYI, I did try the single $ mvn clean integration-test command above on a fresh clone and reproduced the failure.

@kimoonkim
Member Author

Reproduced the failure also with the two commands above issued on another fresh clone.

Please let me know if I should upload the full log.

@kimoonkim
Member Author

kimoonkim commented Feb 10, 2017

I found a tool that can display the Maven build plan. To use it, one just needs to add a few lines to ~/.m2/settings.xml:

  <pluginGroups>
    <pluginGroup>fr.jcgay.maven.plugins</pluginGroup>
  </pluginGroups>

Then, issue $ mvn buildplan:list -Pkubernetes -Pkubernetes-integration-tests, which will show the build plan in time order.

Before this change, the integration-tests project shows the following. Notice scalatest-maven-plugin runs in the test phase, which is before maven-dependency-plugin runs copy-test-spark-jobs in the pre-integration-test phase:

PLUGIN PHASE ID GOAL
maven-enforcer-plugin validate enforce-versions enforce
scala-maven-plugin initialize eclipse-add-source add-source
maven-dependency-plugin generate-sources default-cli build-classpath
maven-remote-resources-plugin generate-resources default process
maven-resources-plugin process-resources default-resources resources
scala-maven-plugin process-resources scala-compile-first compile
maven-compiler-plugin compile default-compile compile
maven-antrun-plugin generate-test-resources create-tmp-dir run
maven-resources-plugin process-test-resources default-testResources testResources
scala-maven-plugin process-test-resources scala-test-compile-first testCompile
maven-compiler-plugin test-compile default-testCompile testCompile
maven-dependency-plugin test-compile generate-test-classpath build-classpath
maven-surefire-plugin test default-test test
maven-surefire-plugin test test test
scalatest-maven-plugin test test test
maven-jar-plugin prepare-package prepare-test-jar test-jar
maven-jar-plugin package default-jar jar
maven-site-plugin package attach-descriptor attach-descriptor
maven-shade-plugin package default shade
maven-source-plugin package create-source-jar jar-no-fork
maven-source-plugin package create-source-jar test-jar-no-fork
maven-dependency-plugin pre-integration-test copy-test-spark-jobs copy
maven-dependency-plugin pre-integration-test unpack-docker-driver-bundle unpack
maven-dependency-plugin pre-integration-test unpack-docker-executor-bundle unpack
download-maven-plugin pre-integration-test download-minikube-linux wget
download-maven-plugin pre-integration-test download-minikube-darwin wget
scala-maven-plugin verify attach-scaladocs doc-jar
scalastyle-maven-plugin verify default check
maven-checkstyle-plugin verify default check
maven-install-plugin install default-install install
maven-deploy-plugin deploy default-deploy deploy

Here is the build plan after this change. Notice there is one more run of scalatest-maven-plugin, triggered in the integration-test phase, which comes after copy-test-spark-jobs. With this change, that's where KubernetesSuite will run:

PLUGIN PHASE ID GOAL
maven-enforcer-plugin validate enforce-versions enforce
scala-maven-plugin initialize eclipse-add-source add-source
maven-dependency-plugin generate-sources default-cli build-classpath
maven-remote-resources-plugin generate-resources default process
maven-resources-plugin process-resources default-resources resources
scala-maven-plugin process-resources scala-compile-first compile
maven-compiler-plugin compile default-compile compile
maven-antrun-plugin generate-test-resources create-tmp-dir run
maven-resources-plugin process-test-resources default-testResources testResources
scala-maven-plugin process-test-resources scala-test-compile-first testCompile
maven-compiler-plugin test-compile default-testCompile testCompile
maven-dependency-plugin test-compile generate-test-classpath build-classpath
maven-surefire-plugin test default-test test
maven-surefire-plugin test test test
scalatest-maven-plugin test test test
maven-jar-plugin prepare-package prepare-test-jar test-jar
maven-jar-plugin package default-jar jar
maven-site-plugin package attach-descriptor attach-descriptor
maven-shade-plugin package default shade
maven-source-plugin package create-source-jar jar-no-fork
maven-source-plugin package create-source-jar test-jar-no-fork
maven-dependency-plugin pre-integration-test copy-test-spark-jobs copy
maven-dependency-plugin pre-integration-test unpack-docker-driver-bundle unpack
maven-dependency-plugin pre-integration-test unpack-docker-executor-bundle unpack
download-maven-plugin pre-integration-test download-minikube-linux wget
download-maven-plugin pre-integration-test download-minikube-darwin wget
scalatest-maven-plugin integration-test integration-test test
scala-maven-plugin verify attach-scaladocs doc-jar
scalastyle-maven-plugin verify default check
maven-checkstyle-plugin verify default check
maven-install-plugin install default-install install
maven-deploy-plugin deploy default-deploy deploy

See copy-test-spark-jobs execution of maven-dependency-plugin above. -->
<groupId>org.scalatest</groupId>
<artifactId>scalatest-maven-plugin</artifactId>
<configuration>...</configuration>

What do the three dots stand for?

Member Author


Ah. It turned out they aren't needed. I got this piece from a scalatest user forum, and the three dots were there just to indicate omitted text.

I am surprised that the presence of three dots in the config did not break Maven. Thanks for pushing me to look at this. I'll remove them in the next patch.

Member Author


Removed this in the new patch.

<goal>test</goal>
</goals>
<configuration>
<suffixes>(?<!Suite)</suffixes>

I guess the purpose of this negative pattern is to prevent the KubernetesSuite from being run in the test phase. Better to add a comment explaining this?
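The lookbehind mechanics can be illustrated outside Maven. Below is a rough Python sketch (not the plugin's actual matching code), assuming the suffixes pattern is anchored at the end of each discovered class name:

```python
import re

# The scalatest-maven-plugin "suffixes" setting is a regex matched against
# the end of each discovered class name. These two zero-width lookbehinds
# split the suites between the two executions:
#   (?<!Suite) matches only where the preceding text is NOT "Suite"
#   (?<=Suite) matches only where the preceding text IS "Suite"
UNIT_PHASE = re.compile(r"(?<!Suite)$")         # test phase: everything but *Suite
INTEGRATION_PHASE = re.compile(r"(?<=Suite)$")  # integration-test phase: only *Suite

def runs_in(pattern, class_name):
    """True if the class name's suffix matches the given pattern."""
    return pattern.search(class_name) is not None

print(runs_in(UNIT_PHASE, "KubernetesSuite"))         # False: deferred
print(runs_in(INTEGRATION_PHASE, "KubernetesSuite"))  # True: runs here
```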

Member Author


Thanks for the suggestion. Added a comment in the new patch.

<goal>test</goal>
</goals>
<configuration>
<suffixes>(?<=Suite)</suffixes>

Do we need to set this explicitly? I don't find it in the scalatest-maven-plugin section of the top-level pom.xml.

Member Author


Removed this in the new patch.

@lins05

lins05 commented Feb 11, 2017

@kimoonkim good to know the buildplan plugin, very nice tool!

@ash211

ash211 commented Feb 13, 2017

This does seem to make the tests run successfully now, though being newer to Maven I'm not fully following why.

@mccheah can you please take a look?

@kimoonkim
Member Author

@lins05 Thanks for taking a look. Addressed your comments in the new patch.

@kimoonkim
Member Author

@ash211 @mccheah Thanks for taking a look. I am relatively new to Maven myself, and I must say Maven is quite complicated.

Let me try to explain what I think is happening before this change. Suppose we issue a command like below (which will fail):

$ build/mvn -B clean integration-test -Pkubernetes -Pkubernetes-integration-tests  \
   -pl resource-managers/kubernetes/integration-tests -am -Dtest=none  \
   -DwildcardSuites=org.apache.spark.deploy.kubernetes.integrationtest.KubernetesSuite

Maven does three high level things:

  1. It activates the specified profiles, namely kubernetes and kubernetes-integration-tests. This enables the kubernetes maven modules such as resource-managers/kubernetes/core, resource-managers/kubernetes/integration-tests, resource-managers/kubernetes/integration-tests-spark-jobs, etc.
  2. Then, maven sorts and builds these modules in dependency order, where resource-managers/kubernetes/integration-tests-spark-jobs is built before resource-managers/kubernetes/integration-tests. The latter needs the test jobs jar from the former. Note maven builds one module at a time, so integration-tests-spark-jobs completes before integration-tests starts. This is why the test jobs jar is available when integration-tests starts.
  3. For each module, maven goes through build phases in a pre-defined order.

For (3), you can find the full list on the linked page, but here are some that are relevant to us:

compile	compile the source code of the project.
test-compile	compile the test source code into the test destination directory
test	run tests using a suitable unit testing framework. These tests should not require the code be packaged or deployed.
prepare-package	perform any operations necessary to prepare a package prior to the actual packaging. This often results in an unpacked, processed version of the package. (Maven 2.1 and above)
package	take the compiled code and package it in its distributable format, such as a JAR.
pre-integration-test	perform actions required before integration tests are executed. This may involve things such as setting up the required environment.
integration-test	process and deploy the package if necessary into an environment where integration tests can be run.

Notice the test -> pre-integration-test -> integration-test ordering.

Now, let's see what the resource-managers/kubernetes/integration-tests module's pom.xml does. The pom.xml specifies a few plugins to download and copy a number of tarballs and jars. Here's one example. Notice it specifies the pre-integration-test phase:

     <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-dependency-plugin</artifactId>
        <executions>
          <execution>
            <id>copy-test-spark-jobs</id>
            <phase>pre-integration-test</phase>
            <goals>
              <goal>copy</goal>
            </goals>
            <configuration>
              <artifactItems>
                <artifactItem>
                  <groupId>org.apache.spark</groupId>
                  <artifactId>spark-kubernetes-integration-tests-spark-jobs_${scala.binary.version}</artifactId>
                  <version>${project.version}</version>
                  <type>jar</type>
                  <outputDirectory>${project.build.directory}/integration-tests-spark-jobs</outputDirectory>
                </artifactItem>

As shown above, those tarballs and jars will be copied or unpacked into ${project.build.directory}, which is the resource-managers/kubernetes/integration-tests/target dir.

These are required inputs to KubernetesSuite. The test jobs jar is referenced by the KubernetesSuite code at lines 48-51 below:

 46 private[spark] class KubernetesSuite extends SparkFunSuite with BeforeAndAfter {
 47
 48   private val EXAMPLES_JAR = Paths.get("target", "integration-tests-spark-jobs")
 49     .toFile
 50     .listFiles()(0)
 51     .getAbsolutePath

The problem is that the scalatest plugin, which executes KubernetesSuite, triggers only in the test phase by default. (The pom.xml of resource-managers/kubernetes/integration-tests currently does not specify anything about the scalatest plugin; the setting is inherited from the top-level project pom.xml.)

This plugin ordering is displayed well by the build plan plugin in a previous comment. Copying the relevant part here again:

scalatest-maven-plugin	test	test	test
...
maven-dependency-plugin	pre-integration-test	copy-test-spark-jobs	copy
maven-dependency-plugin	pre-integration-test	unpack-docker-driver-bundle	unpack
maven-dependency-plugin	pre-integration-test	unpack-docker-executor-bundle	unpack
download-maven-plugin	pre-integration-test	download-minikube-linux	wget
download-maven-plugin	pre-integration-test	download-minikube-darwin	wget

So the above KubernetesSuite code will find the target directory missing, leading to the following exception: listFiles() on line 50 returns null for a nonexistent directory, and indexing the null result throws an NPE:

  java.lang.RuntimeException: Unable to load a Suite class that was discovered in the runpath: org.apache.spark.deploy.kubernetes.integrationtest.KubernetesSuite
  at org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:84)
  at org.scalatest.tools.DiscoverySuite$$anonfun$1.apply(DiscoverySuite.scala:38)
  at org.scalatest.tools.DiscoverySuite$$anonfun$1.apply(DiscoverySuite.scala:37)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  ...
  Cause: java.lang.NullPointerException:
  at org.apache.spark.deploy.kubernetes.integrationtest.KubernetesSuite.<init>(KubernetesSuite.scala:50)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
  at java.lang.Class.newInstance(Class.java:442)
  at org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:69)
  at org.scalatest.tools.DiscoverySuite$$anonfun$1.apply(DiscoverySuite.scala:38)
  at org.scalatest.tools.DiscoverySuite$$anonfun$1.apply(DiscoverySuite.scala:37)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  ...

Now, some of us did not encounter this exception before. How is that possible? Here's one sequence of commands that will hide the problem (I tried this sequence on my local checkout, and it does hide the problem).

  1. Issue a maven build command targeting pre-integration-test while specifying -DskipTests. This will allow the test jobs jar to be copied into place while skipping KubernetesSuite.
    $ build/mvn clean pre-integration-test -DskipTests -Pkubernetes -Pkubernetes-integration-tests -pl resource-managers/kubernetes/integration-tests -am
  2. Then, issue another maven command specifying the integration-test target, but without the clean goal. This time, maven will run KubernetesSuite (again in the test phase), but the exception won't happen because the test jobs jar was pre-populated by the command above.
    $ build/mvn -B integration-test -Pkubernetes -Pkubernetes-integration-tests -pl resource-managers/kubernetes/integration-tests -am -Dtest=none -DwildcardSuites=org.apache.spark.deploy.kubernetes.integrationtest.KubernetesSuite

One more related command that can add to the confusion is running mvn clean between (1) and (2) without specifying the two kubernetes profiles, i.e. $ build/mvn clean. This disables the kubernetes maven modules and thus won't wipe out target dirs like resource-managers/kubernetes/integration-tests/target. So doing (2) after cleaning this way will still hide the problem.

@mccheah

mccheah commented Feb 13, 2017

+1 this makes sense - the Maven pom was originally created under the assumption that the build process would execute both the test phase and the integration-test phase, but it looks like they just invoke the Scalatest plugin and that doesn't trigger integration-test by default. Should we not just change our tests to target the test phase and not the integration-test phase? Thus instead of putting the build preparation steps in the pre-integration-test phase, is there some equivalent like a pre-test phase?

@kimoonkim
Member Author

@mccheah Yes, targeting the test phase is one possible solution. There are phases like generate-test-resources or process-test-resources that run before the test phase. So we could use them.

From the same maven phase list web page above:

generate-test-resources	create resources for testing.
process-test-resources	copy and process the resources into the test destination directory.
test-compile	compile the test source code into the test destination directory
process-test-classes	post-process the generated files from test compilation, for example to do bytecode enhancement on Java classes. For Maven 2.0.5 and above.
test	run tests using a suitable unit testing framework. These tests should not require the code be packaged or deployed.
...
package	take the compiled code and package it in its distributable format, such as a JAR.
pre-integration-test	perform actions required before integration tests are executed. This may involve things such as setting up the required environment.
integration-test	process and deploy the package if necessary into an environment where integration tests can be run.

The downside is that there is a subtle usage issue with this approach. If a user issues a maven command with the test target, then the other modules like resource-managers/kubernetes/integration-tests-spark-jobs might not produce the jars that resource-managers/kubernetes/integration-tests needs. Imagine the following command:

$ build/mvn -B clean test -Pkubernetes -Pkubernetes-integration-tests  \
   -pl resource-managers/kubernetes/integration-tests -am -Dtest=none  \
   -DwildcardSuites=org.apache.spark.deploy.kubernetes.integrationtest.KubernetesSuite

With this command, the resource-managers/kubernetes/integration-tests-spark-jobs module only runs until the test phase. The test jobs jar is produced at the package phase that comes after the test phase.

So when the resource-managers/kubernetes/integration-tests module starts, the test jobs jar will be missing. The copy-test-spark-jobs execution (now, say, at the process-test-resources phase) will fail as a result.

The failure can be avoided if one issues a maven command specifying the package target or any subsequent one, like below. But this is a bit counter-intuitive. People may try the above command first and do unnecessary troubleshooting before they reach here:

$ build/mvn -B clean package -Pkubernetes -Pkubernetes-integration-tests  \
   -pl resource-managers/kubernetes/integration-tests -am -Dtest=none  \
   -DwildcardSuites=org.apache.spark.deploy.kubernetes.integrationtest.KubernetesSuite

Please let me know what you guys think.

@mccheah

mccheah commented Feb 13, 2017

I thought copy-test-spark-jobs depends on the artifact from the integration-tests-spark-jobs module in a way such that it requires the equivalent phase of integration-tests-spark-jobs to be executed first?

Alternatively, we can try to take a compile-time dependency on integration-tests-spark-jobs and depend on that jar in a way that doesn't require a separate copy. For example, is it the case that in a multi-module build, if module B depends on module A, then does a.jar exist somewhere in module B's subtree in a location that we could easily reference from the integration test code? Maybe under target/?

@kimoonkim
Member Author

@mccheah How exactly the Maven reactor handles multi-module dependencies is a bit mysterious to me. I found a blog post saying the following:

As part of all the refactoring in Maven 3, a dependency resolution has been reworked to consistently check the reactor output. Apparently, the reactor output depends on the lifecycle phases that a project has completed. So if you invoke mvn compile or mvn test on a multi-module project, the loose class files from target/classes and target/test-classes, respectively, are used to create the required class path. As soon as the actual artifact has been assembled which usually happens during the package phase, dependency resolution will use this file.

I think we still need to run mvn package to use the jar. Can we directly use the target/classes of integration-tests-spark-jobs? Probably not, given spark-submit needs the jar?

@mccheah

mccheah commented Feb 14, 2017

Right - the test expects to ship the jar over as the application dependency of the tests.

@lins05

lins05 commented Feb 14, 2017

I also wanted to propose moving the copy-test-spark-jobs action to a phase like generate-test-resources that happens before test, which could effectively solve the problem this PR tries to address. But it seems pretty hard to get the spark-integration-test-jobs jar without running the package phase.

What if, in the spark-integration-test-jobs pom, we attach the jar:jar goal to the generate-test-resources phase?
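For illustration, that suggestion would look roughly like the following in the spark-integration-test-jobs pom. This is a hypothetical, untested sketch (the early-jar execution id is made up), and it is not clear the reactor would pick up a jar produced this early:

```xml
<!-- Hypothetical sketch of the idea above: run maven-jar-plugin's jar goal
     early, during generate-test-resources, so the test jobs jar exists
     before downstream modules reach their test phase. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <executions>
    <execution>
      <id>early-jar</id>
      <phase>generate-test-resources</phase>
      <goals>
        <goal>jar</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```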

@kimoonkim
Member Author

@lins05 Creating jars at the generate-test-resources phase sounds promising. The only concern is whether the Maven reactor will recognize jars produced in non-package phases. I think we can try and see. Also note this approach will lead to a slightly larger change, since we are going to touch other modules. @mccheah, what do you think of this suggestion? If you agree, I can write a new sister PR so we can compare it with this one.

Also, what do we think is the upside of using the test phase? I like that the overall build time can be shorter. Anything else? I am just curious.

@lins05

lins05 commented Feb 14, 2017

Also note this approach will lead to a slightly large change since we are going to touch other modules

Emm, what other modules? IIUC the affected modules would only be integration-tests, integration-tests-spark-jobs, and integration-tests-spark-jobs-helpers.

@kimoonkim
Member Author

Those are what I meant. Should we add docker-minimal-bundle there too?

@lins05

lins05 commented Feb 14, 2017

I see, then that's ok.

docker-minimal-bundle there too?

Right.

@kimoonkim
Member Author

kimoonkim commented Feb 14, 2017

@lins05 @mccheah
I think docker-minimal-bundle may be a show-stopper. The tarballs from that module contain jars from all other spark modules (and their dependencies). Building these spark module jars requires mvn package.

$ tar tvfz docker-minimal-bundle/target/spark-docker-minimal-bundle_2.11-2.2.0-SNAPSHOT-driver-docker-dist.tar.gz | grep spark-.*.jar | head
-rw-r--r-- kimoonkim/staff 12015040 2017-02-13 09:30 jars/spark-core_2.11-2.2.0-SNAPSHOT.jar
-rw-r--r-- kimoonkim/staff 66585 2017-02-13 09:27 jars/spark-launcher_2.11-2.2.0-SNAPSHOT.jar
-rw-r--r-- kimoonkim/staff 15490 2017-02-13 09:26 jars/spark-tags_2.11-2.2.0-SNAPSHOT.jar
-rw-r--r-- kimoonkim/staff 2367947 2017-02-13 09:26 jars/spark-network-common_2.11-2.2.0-SNAPSHOT.jar
-rw-r--r-- kimoonkim/staff 61878 2017-02-13 09:26 jars/spark-network-shuffle_2.11-2.2.0-SNAPSHOT.jar
-rw-r--r-- kimoonkim/staff 44552 2017-02-13 09:26 jars/spark-unsafe_2.11-2.2.0-SNAPSHOT.jar
-rw-r--r-- kimoonkim/staff 6195220 2017-02-13 09:39 jars/spark-mllib_2.11-2.2.0-SNAPSHOT.jar
-rw-r--r-- kimoonkim/staff 2175666 2017-02-13 09:31 jars/spark-streaming_2.11-2.2.0-SNAPSHOT.jar
-rw-r--r-- kimoonkim/staff 6847170 2017-02-13 09:37 jars/spark-sql_2.11-2.2.0-SNAPSHOT.jar
-rw-r--r-- kimoonkim/staff 30029 2017-02-13 09:26 jars/spark-sketch_2.11-2.2.0-SNAPSHOT.jar

$ tar tvfz docker-minimal-bundle/target/spark-docker-minimal-bundle_2.11-2.2.0-SNAPSHOT-driver-docker-dist.tar.gz | grep .jar | wc -l
163

@kimoonkim
Member Author

@lins05 I looked at docker-minimal-bundle. It uses maven-assembly-plugin to put all spark jars in the tarballs. The driver-assembly.xml puts in all the jars it depends on, except a few:

<dependencySets>
    <dependencySet>
      <outputDirectory>jars</outputDirectory>
      <useTransitiveDependencies>true</useTransitiveDependencies>
      <unpack>false</unpack>
      <scope>runtime</scope>
      <useProjectArtifact>false</useProjectArtifact>
      <excludes>
        <exclude>org.apache.spark:spark-assembly_${scala.binary.version}:pom</exclude>
        <exclude>org.spark-project.spark:unused</exclude>
        <exclude>org.apache.spark:spark-examples_${scala.binary.version}</exclude>
      </excludes>
    </dependencySet>

And the pom.xml specifies spark-assembly as its main dependency, from which all other spark module jars are pulled:

   <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-assembly_${scala.binary.version}</artifactId>
      <version>${project.version}</version>
      <type>pom</type>
    </dependency>

I can't imagine building all these jars in a non-package phase. That's just too many pom.xml's to touch.

If we only make docker-minimal-bundle run maven-assembly-plugin in the generate-test-resources phase, I think it'll try to put target/classes of all other spark modules into the tarball. I don't know if that will even succeed, but even if it does, it doesn't sound like what we want to have inside a docker image for testing.

I am afraid this does look like a show-stopper. Thoughts?

@mccheah

mccheah commented Feb 14, 2017

Hm, yeah, I suppose our findings show that the integration-test phase is the right one to use: resource preparation should be in pre-integration-test, and Scalatest should just be invoked in our integration-test phase.

@kimoonkim
Member Author

@mccheah SGTM. Then this PR is ready for merge?

@kimoonkim
Member Author

@lins05 @mccheah Thanks for the discussion so far. Do we have any more questions or feedback?

Given our findings so far, I believe this PR is useful as is. FYI, the patch was updated earlier to address @lins05's comments. Can you give it one more look?

@ash211

ash211 commented Feb 16, 2017

@kimoonkim an enormous thank you for all your work on this PR! Clearly you've put a lot of effort and research into getting this right.

I can't say I'm familiar enough with Maven to say this is right, but whether it's perfect or not, it's certainly a step in the right direction. Let's merge and move closer to running integration tests in Travis (one of the goals coming out of this week's weekly meeting).

Thanks again for the well-researched contribution!

@ash211 ash211 merged commit 9d250a2 into apache-spark-on-k8s:k8s-support-alternate-incremental Feb 16, 2017
@kimoonkim
Member Author

@ash211 I am not a big fan of Maven either, but it was a great learning experience for me :-) Thank you, @mccheah and @lins05, for asking the right questions and discussing this together.

@kimoonkim kimoonkim deleted the run-scalatest-on-integration-test-phase branch February 17, 2017 23:22
ash211 pushed a commit that referenced this pull request Mar 8, 2017
* Trigger scalatest plugin in the integration-test phase

* Clean up unnecessary config section
foxish pushed a commit that referenced this pull request Jul 24, 2017
* Trigger scalatest plugin in the integration-test phase

* Clean up unnecessary config section
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 25, 2019
…on-k8s#93)

* Trigger scalatest plugin in the integration-test phase

* Clean up unnecessary config section
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019
…on-k8s#93)

* Trigger scalatest plugin in the integration-test phase

* Clean up unnecessary config section
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Dec 4, 2019