
Move docker image management and test entrypoint to Maven #31

Merged: 23 commits merged into master from move-all-logic-to-maven on Jan 16, 2018

Conversation

@mccheah (Contributor, Author) commented Jan 12, 2018

Closes #30.

@mccheah (Contributor, Author) commented Jan 12, 2018

Requires both #29 and #14.

@mccheah (Contributor, Author) commented Jan 12, 2018

Also requires the test commands in Jenkins to change.

@mccheah (Contributor, Author) commented Jan 12, 2018

To do, not handled by this patch:

  • Clean up Docker images in the post-integration-test phase
  • Support using your own Docker images (it's theoretically possible, but the examples file server and the shared tag between the file server and the Spark code under test make this workflow less than intuitive)

@@ -0,0 +1,158 @@
#!/usr/bin/env bash

Member:

Why do we need this? We'll need to ensure that it stays in sync with our repo? Or maybe we can get rid of it once we merge this upstream?

mccheah (Contributor, Author):

We get rid of this when we merge into upstream. Even aside from that, this file has hardly been changing upstream, and it's theoretically completely isolated from what we will use to build Spark (since the shell forks a second Maven process, it's completely independent).

mccheah (Contributor, Author):

We also have to add it here now because, unlike before, mvn is now the entrypoint into the whole system; cloning the Spark repository down is only a substep of the build reactor. Aside from that, we also allow a Spark TGZ to be provided, which removes the use of the Spark source code entirely. So no matter what, we're going to need our own Maven here.

Member:

Instead of copying the content here, could you just have a much shorter script that just downloads from https://raw.githubusercontent.com/apache/spark/master/build/mvn and execs it? Something like build/mvn-getter whose content is:

#!/bin/sh
curl -s https://raw.githubusercontent.com/apache/spark/master/build/mvn | bash
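
For reference, a variant of that idea that also forwards the caller's Maven arguments might look like the following. This is only a sketch, assuming the downloaded wrapper is saved under build/ and then executed; it is not the exact script that landed in the PR:

    #!/usr/bin/env bash
    # Sketch: fetch Apache Spark's build/mvn wrapper once and run it with our arguments.
    set -e
    MVN_WRAPPER="build/mvn-downloaded"   # hypothetical local path
    if [[ ! -x "$MVN_WRAPPER" ]]; then
      curl -s https://raw.githubusercontent.com/apache/spark/master/build/mvn -o "$MVN_WRAPPER"
      chmod +x "$MVN_WRAPPER"
    fi
    exec "$MVN_WRAPPER" "$@"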

mccheah (Contributor, Author):

Hm, I think that works. Will incorporate that.

mccheah (Contributor, Author):

Done

#

set -ex
TEST_ROOT_DIR=$(git rev-parse --show-toplevel)
mccheah (Contributor, Author):

A lot of these defaults are repeated again in pom.xml - the idea is that maintainers of the framework can run the scripts in isolation without using Maven. This can be done either with the top-level setup-integration-test-env.sh script or with any of the sub-components - e.g. clone-spark.sh.

Member:

Hmm.. it seems error prone that this needs to stay in sync with the args in pom.xml. Is there a better dispatch mechanism than to use this?

mccheah (Contributor, Author):

The pom.xml can assume that these are the defaults and that they are correct, but I like having them explicitly enumerated in the pom.xml because the behavior can be known right from the pom. I also like being able to run the script independently without any arguments. All of this is just for convenience, though, so for the sake of developer sanity we can make one of these ideas less convenient.

Don't have a strong opinion on the best way to move forward.

DEPLOY_MODE=minikube
IMAGE_REPO="docker.io/kubespark"
SKIP_BUILDING_IMAGES=false
SPARK_TGZ="N/A"
mccheah (Contributor, Author), Jan 12, 2018:

This is less than ideal. It's more elegant to have here: SPARK_TGZ= and then to check in other places, if [[ -z $SPARK_TGZ ]]. The problem is that Maven's exec plugin has to pass in a fixed set of command line arguments. Maven will always have to pass in some spark-tgz argument. But we can't pass in empty, because we then end up with commands like setup-integration-test-env --spark-tgz --deploy-mode cloud which screws up the argument shifting.

An alternative we can employ in general is to allow these variables to be set by environment variables as well and to pass them in as environment variables from Maven. Ideas are welcome here.
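
As an illustration of that environment-variable alternative, a minimal sketch is below; the variable and flag names follow the script above, but the parsing details are assumed rather than taken from this PR:

    # Sketch: accept the value from the environment or from a fixed --spark-tgz flag,
    # treating the "N/A" sentinel that Maven always passes as "no TGZ provided".
    SPARK_TGZ="${SPARK_TGZ:-N/A}"

    while (( "$#" )); do
      case "$1" in
        --spark-tgz)
          SPARK_TGZ="$2"
          shift 2
          ;;
        *)
          shift
          ;;
      esac
    done

    if [[ "$SPARK_TGZ" == "N/A" || -z "$SPARK_TGZ" ]]; then
      echo "No Spark TGZ provided; the distribution will be built from source."
    else
      echo "Using the pre-built Spark distribution at $SPARK_TGZ."
    fi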

Member:

You could use the maven exec plugin like this:

              <executable>/bin/sh</executable>
              <arguments>
                <argument>-c</argument>
                <argument>arbitrary length here ; another command</argument>
              </arguments>

mccheah (Contributor, Author):

The problem is that the arguments themselves have to be fixed. I can't sometimes pass the --spark-tgz flag and sometimes omit it.

Member:

This gets simpler with the idea that this code always takes --spark-tgz, I think.

fi

rm -f $IMAGE_TAG_OUTPUT_FILE
touch $IMAGE_TAG_OUTPUT_FILE
mccheah (Contributor, Author):

Less than ideal. The problem is that Maven wants to pass the image tag into KubernetesSuite, but Maven doesn't know it offhand until it's resolved by the test environment script. We use a UUID for the default image tag because there can be multiple builds concurrently running against the same Spark git hash. But the UUID can't be generated by Maven, meaning the script has to make it and pass that information along somehow, and a temporary file was the best bridge I could think of. Other more elegant ideas would be appreciated here.
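
A minimal sketch of the hand-off being described; the variable names follow the script context above, and the target/imageTag.txt path and imageDockerTag property appear elsewhere in this PR, but the exact implementation may differ:

    # Sketch: generate a unique tag, record it, and let Maven read it back for the test suite.
    IMAGE_TAG="${IMAGE_TAG:-$(uuidgen)}"     # unique per run, even for the same git hash

    rm -f "$IMAGE_TAG_OUTPUT_FILE"
    echo -n "$IMAGE_TAG" > "$IMAGE_TAG_OUTPUT_FILE"   # e.g. target/imageTag.txt

    # Maven (or the test launcher it forks) can then pass the tag to KubernetesSuite as
    # -Dspark.kubernetes.test.imageDockerTag=$(cat target/imageTag.txt)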

Member:

I'm not quite sure about concurrent runs against the same git sha. If a new PR is submitted, it would apply that commit on the base and then run the tests, which would give each run a unique sha. If two runs have the same sha, then they are invariably identical and the image could be reused.

mccheah (Contributor, Author):

I think there are two git hashes to consider, and neither will work:

  • If we're talking about the integration test repository's git hash, multiple builds should certainly be able to run against the same commit - for example, multiple test runs on the same node testing two different commits of Spark but using the same integration test version.
  • If we're talking about the Spark repository's git hash, the inverse applies - multiple versions of the integration test running against the same Spark version. Reusing the images is difficult because we need to know when it's safe to GC an image, which is only when all tests holding that tag are finished. We can then run into race conditions, etc.

It seems a lot easier to bias towards isolation for correctness and optimize later.

Member:

Would it be simpler to just GC any images from this framework that are over a day old?
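
For illustration, time-based cleanup along those lines could look like the following; the label is hypothetical and would have to be applied by the framework when it builds the images:

    # Sketch: remove unused test images created more than 24 hours ago,
    # assuming they were built with a label such as spark-integration-test=true.
    docker image prune --all --force \
      --filter "until=24h" \
      --filter "label=spark-integration-test=true"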

mccheah (Contributor, Author):

There's still a potential race over who gets to build the image with a given hash, though. Two builds trying to use the same hash need to agree on who builds the image. I don't want to have to deal with that locking at all, so we just use completely unique tags everywhere.

Member:

Ok!

@mccheah (Contributor, Author) commented Jan 12, 2018

The diff is against a branch that is the outcome of merging master with #29 and #14.

@@ -1,39 +0,0 @@
#!/bin/bash
Member:

This is specific to the GKE test suites; I think we still need this setup script, even if the runner goes away.

mccheah (Contributor, Author):

Ah, I didn't notice the apt-get calls. I think we want these to be assumed by the framework. Installing components on demand in a test framework is a bit scary - not to mention that installation probably depends on the user's OS; e.g. if the user is running on Mac OS X this won't work. We should have separate scripts for these that aren't invoked in the build lifecycle.

Member:

Yes, did not intend for these to be part of the testsuite. These are external "setup" steps and the entrypoint for the k8s CI system.

@foxish (Member) commented Jan 12, 2018 via email

@mccheah mccheah changed the base branch from move-all-logic-to-maven-base to master January 12, 2018 23:19
The Kubernetes integration tests now always expect an image to be
pre-built, so we no longer build images with Scala code. Maven's
pre-integration-test phase invokes a single script to bootstrap the
environment with the built images, etc. In the transition we try to keep
as much of the same semantics as possible.
README.md Outdated
## Re-using Docker Images

By default, the test framework will build new Docker images on every test execution. A unique image tag is generated,
and it is written to file at `target/imageTag.txt`. To reuse the images built in a previous run, set:
Member:

I liked that it was using the git sha before - it made it easy to correlate the image to the distribution that was just tested.

mccheah (Contributor, Author):

Discussed in #31 (comment)

README.md Outdated
By default, the test framework will test the master branch of Spark from [here](https://github.com/apache/spark). You
can specify the following options to test against different source versions of Spark:

* `spark.kubernetes.test.sparkRepo` to the git or http URI of the Spark git repository to clone
Member:

gitRepo and gitBranch may be better names for this.


mkdir -p $UNPACKED_SPARK_TGZ
if [[ $SPARK_TGZ == "N/A" ]];
then
./dev/make-distribution.sh --tgz -Phadoop-2.7 -Pkubernetes -DskipTests;
Member:

If we omit --tgz, we should just get the unpacked distribution under dist/.

mccheah (Contributor, Author):

It makes it easier to reuse the distribution via the spark.kubernetes.test.sparkTgz option, so that configuration can pull double duty: pointing at custom Spark distributions and re-using dynamically built ones.
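
For example, a two-step workflow along these lines becomes possible (the TGZ path is a placeholder):

    # Build a distribution tarball once, from a Spark checkout...
    ./dev/make-distribution.sh --tgz -Phadoop-2.7 -Pkubernetes -DskipTests

    # ...then point later integration-test runs at it instead of rebuilding Spark.
    build/mvn integration-test \
      -Dspark.kubernetes.test.sparkTgz=/path/to/spark-dist.tgz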

Member:

I think for Jenkins we will want to use the pre-built distributions exclusively. This is because we don't want to keep the Minikube resource locked during the building of the distribution. If the Spark distribution builds "inside" of the spark-integration test, it will be harder to express this partial locking. As it currently stands, one Jenkins job creates the Spark distribution and another Jenkins job does the integration test, making the locking easy.



package object config {
val KUBERNETES_TEST_DOCKER_TAG_SYSTEM_PROPERTY = "spark.kubernetes.test.imageDockerTag"
val DRIVER_DOCKER_IMAGE = "spark.kubernetes.driver.container.image"
Member:

These need to be cleaned up I think.

@mccheah (Contributor, Author) commented Jan 12, 2018

Changed the base to master and removed the dependence on #29. It turns out we actually need this just to run at all now - our tests have been broken since the Docker image format change introduced in apache/spark#20192.


if [[ $SPARK_TGZ == "N/A" ]];
then
$SCRIPTS_DIR/clone-spark.sh "$@";
Member:

Instead of passing all the args around, we should parse the args here, then set the right env vars/args and call the other scripts, maybe? I think it's an antipattern for each script to be calling parse-args and seeing unrelated args.
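
A minimal sketch of that pattern, using names from the surrounding diff where possible (parse-args.sh and build-images.sh are hypothetical helpers):

    # Sketch: the entrypoint parses every argument once, exports the results,
    # and the sub-scripts read only environment variables.
    source "$SCRIPTS_DIR/parse-args.sh" "$@"   # hypothetical: sets SPARK_TGZ, DEPLOY_MODE, ...
    export SPARK_TGZ DEPLOY_MODE IMAGE_REPO IMAGE_TAG_OUTPUT_FILE

    if [[ "$SPARK_TGZ" == "N/A" ]]; then
      "$SCRIPTS_DIR/clone-spark.sh"      # no arguments; reads the exported variables
    fi
    "$SCRIPTS_DIR/build-images.sh"       # hypothetical sub-script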

README.md Outdated
-Dspark.kubernetes.test.sparkRepo=https://github.com/apache-spark-on-k8s/spark \
-Dspark.kubernetes.test.sparkBranch=new-feature

Additionally, you can use a pre-built Spark distribution. In this case, the repository is not cloned at all, and no
Member:

I don't understand the motivation for having the integration test code ever try to clone and build Spark. Why not just always depend on a pre-built Spark distribution? (Sorry if I missed the reason for this.)

mccheah (Contributor, Author):

Locally, I've found it a lot easier to be able to run a single command that both fetches and builds the Spark distribution and then runs the integration tests against it. We provide that option for the local development scenario, but we can still provide the TGZ explicitly in Jenkins via spark.kubernetes.test.sparkTgz.
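
As an illustration of that single-command flow, the (since-outdated) README excerpt above suggests an invocation along these lines; the exact entrypoint and flags in the final PR may differ:

    # Sketch: let the build clone and build Spark from a chosen fork/branch,
    # then run the Kubernetes integration tests against the result.
    build/mvn integration-test \
      -Dspark.kubernetes.test.sparkRepo=https://github.com/apache-spark-on-k8s/spark \
      -Dspark.kubernetes.test.sparkBranch=new-feature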


@mccheah (Contributor, Author) commented Jan 13, 2018

@ssuchter @foxish the approach has changed - dev/dev-run-integration-tests.sh is for local running and is what's documented in README.md; it builds the Spark distribution and sends it to the Maven reactor. The Maven reactor itself uses a separate script to build the Docker images, but it must be given the Spark distribution to run with.

@mccheah (Contributor, Author) commented Jan 13, 2018

The reason we have to do this was discussed in sig-big-data - Jenkins will only use the Maven build, without building the Spark distribution, so that the Maven build doesn't hold the Minikube lock while building Spark, which takes a long time.

@kimoonkim (Member):

rerun integration tests please

1 similar comment
@kimoonkim (Member):

rerun integration tests please

@mccheah (Contributor, Author) commented Jan 15, 2018

@kimoonkim the dev/dev-run-integration-tests.sh script is for development setups, though I took a look at the command, and it should theoretically work in Jenkins as well.

A couple of questions:

  • Minikube appears to be starting in the output, shouldn't it already be running?
  • Minikube is stopping at the end. Shouldn't we be keeping it running?

@mccheah (Contributor, Author) commented Jan 15, 2018

@ssuchter let's just have quotes everywhere.

@mccheah (Contributor, Author) commented Jan 15, 2018

@kimoonkim the minikube binary needs to be on the PATH.

@mccheah (Contributor, Author) commented Jan 15, 2018

Sorry, forgot to mention what the specific command should be in Jenkins without using the dev script: build/mvn integration-test -Dspark.kubernetes.test.sparkTgz=<tgz-path>.

@mccheah (Contributor, Author) commented Jan 16, 2018

rerun integration tests please

@mccheah (Contributor, Author) commented Jan 16, 2018

@kimoonkim integration tests aren't running when I comment, can you trigger the build?

@kimoonkim (Member):

> @kimoonkim integration tests aren't running when I comment, can you trigger the build?

I think build 98 actually ran upon your comment, given the timing of the build start. I believe there is a quiet period of a dozen seconds or so, so maybe it did not trigger right away.

@ssuchter (Member) left a review comment:

.

@kimoonkim (Member):

> Minikube appears to be starting in the output, shouldn't it already be running?
> Minikube is stopping at the end. Shouldn't we be keeping it running?

Right. It gets stopped and deleted at the end. The Pepperdata Jenkins cluster has only one node, and it still has jobs running the old integration tests from our fork. If we keep minikube running, that will interfere with the old integration tests. Stopping minikube is the best compromise for this setup, IMO.

@mccheah (Contributor, Author) commented Jan 16, 2018

Hm, but the error is this:

  java.io.IOException: Cannot run program "minikube": error=2, No such file or directory
  at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)

But the tests work locally for me. I wonder if Java's ProcessBuilder doesn't use the PATH to find locations of executables.

@kimoonkim (Member):

> Sorry, forgot to mention what the specific command should be in Jenkins without using the dev script: build/mvn integration-test -Dspark.kubernetes.test.sparkTgz=<tgz-path>.

Jenkins can switch to this command, but what is the advised way of building the Docker images now that the Scala code does not build them?

@mccheah (Contributor, Author) commented Jan 16, 2018

@kimoonkim this PR puts us in an in-between state where Maven is building the Docker images but we expect the external Jenkins job to build the Spark distribution. If we build using Maven, the reactor will still build the images for us.

@kimoonkim (Member):

> Hm, but the error is this:
>
>   java.io.IOException: Cannot run program "minikube": error=2, No such file or directory
>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
>
> But the tests work locally for me. I wonder if Java's ProcessBuilder doesn't use the PATH to find locations of executables.

I see. I'll try adding the minikube location to the PATH used by Maven.
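
For example, the Jenkins job could prepend the directory containing the minikube binary before invoking Maven; the install path and TGZ path below are hypothetical:

    # Sketch: make the minikube binary visible to the processes Maven spawns.
    export PATH="/opt/minikube/bin:$PATH"   # hypothetical install location
    minikube version                        # sanity check that the binary resolves
    build/mvn integration-test -Dspark.kubernetes.test.sparkTgz=/path/to/spark-dist.tgz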

@kimoonkim (Member):

rerun integration tests please

@kimoonkim (Member):

Now, the error from build 101 is this:

/home/jenkins/workspace/PR-dev-spark-integration/target/spark-dist-unpacked/bin/docker-image-tool.sh: line 66: docker: command not found

@ssuchter I believe you installed minikube on Pepperdata Jenkins. Did we install docker?

@ssuchter (Member):

We're working on that same issue (missing Docker) in the sig-big-data Slack.

@mccheah (Contributor, Author) commented Jan 16, 2018

rerun integration tests please

1 similar comment
@mccheah (Contributor, Author) commented Jan 16, 2018

rerun integration tests please

@foxish foxish merged commit c651127 into master Jan 16, 2018
@foxish foxish deleted the move-all-logic-to-maven branch January 16, 2018 21:04