Move docker image management and test entrypoint to Maven #31

Merged
merged 23 commits (Jan 16, 2018)
Changes from 8 commits
8 changes: 6 additions & 2 deletions .gitignore
@@ -1,6 +1,10 @@
.idea/
spark/
integration-test/target/
target/
build/*.jar
build/apache-maven*
build/scala*
build/zinc*
*.class
*.log
*.iml
*.swp
163 changes: 68 additions & 95 deletions README.md
@@ -8,98 +8,71 @@ title: Spark on Kubernetes Integration Tests
Note that the integration test framework is currently being heavily revised and
is subject to change. Currently, the integration tests only run with Java 8.

As shorthand to run the tests against any given cluster, you can use the `e2e/runner.sh` script.
The script assumes that you have a functioning Kubernetes cluster (1.6+) with kubectl
configured to access it. The master URL of the currently configured cluster on your
machine can be discovered as follows:

```
$ kubectl cluster-info

Kubernetes master is running at https://xyz
```

If you want to use a local [minikube](https://github.com/kubernetes/minikube) cluster,
the minimum tested version is 0.23.0, with the kube-dns addon enabled,
and the recommended configuration is 3 CPUs and 4G of memory. There is also a wrapper
script for running on minikube, `e2e/e2e-minikube.sh`, for testing the master branch
of the apache/spark repository specifically.

```
$ minikube start --memory 4000 --cpus 3
```

If you're using a non-local cluster, you must provide an image repository
to which you have write access, using the `-i` option, in order to store the Docker images
generated during the test.

Example usages of the script:

```
$ ./e2e/runner.sh -m https://xyz -i docker.io/foxish -d cloud
$ ./e2e/runner.sh -m https://xyz -i test -d minikube
$ ./e2e/runner.sh -m https://xyz -i test -r https://github.com/my-spark/spark -d minikube
$ ./e2e/runner.sh -m https://xyz -i test -r https://github.com/my-spark/spark -b my-branch -d minikube
```

# Detailed Documentation

## Running the tests using maven

Integration tests first require installing [Minikube](https://kubernetes.io/docs/getting-started-guides/minikube/) on
your machine, and the `minikube` binary must be on your `PATH`. Refer to the Minikube documentation for instructions
on how to install it. It is recommended to allocate at least 8 CPUs and 8GB of memory to the Minikube cluster.
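
For example, a Minikube cluster with those resources could be started as follows (the flag values below are only a suggestion, not a requirement enforced by the framework):

```
$ minikube start --cpus 8 --memory 8192
```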

Running the integration tests requires a Spark distribution package tarball that
contains Spark jars, submission clients, etc. You can download a tarball from
http://spark.apache.org/downloads.html. Or, you can create a distribution from
source code using `make-distribution.sh`. For example:

```
$ git clone git@github.com:apache/spark.git
$ cd spark
$ ./dev/make-distribution.sh --tgz \
-Phadoop-2.7 -Pkubernetes -Pkinesis-asl -Phive -Phive-thriftserver
```

The above command will create a tarball like `spark-2.3.0-SNAPSHOT-bin.tgz` in the
top-level directory. For more details, see the related section in
[building-spark.md](https://github.com/apache/spark/blob/master/docs/building-spark.md#building-a-runnable-distribution).


Once you prepare the tarball, the integration tests can be executed with Maven or
your IDE. Note that when running tests from an IDE, the `pre-integration-test`
phase must be run every time the Spark main code changes. When running tests
from the command line, the `pre-integration-test` phase should automatically be
invoked if the `integration-test` phase is run.
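
For example, to refresh just the prerequisites before re-running tests from an IDE, you could run the `pre-integration-test` phase on its own; this is only a sketch and assumes the same distribution tarball property as the full run:

```
$ mvn pre-integration-test \
  -Dspark-distro-tgz=spark/spark-2.3.0-SNAPSHOT-bin.tgz
```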

With Maven, the integration test can be run using the following command:

```
$ mvn clean integration-test \
-Dspark-distro-tgz=spark/spark-2.3.0-SNAPSHOT-bin.tgz
```

## Running against an arbitrary cluster

In order to run against any cluster, use the following:
```sh
$ mvn clean integration-test \
-Dspark-distro-tgz=spark/spark-2.3.0-SNAPSHOT-bin.tgz \
-DextraScalaTestArgs="-Dspark.kubernetes.test.master=k8s://https://<master>"
```

## Reuse the previous Docker images

The integration tests build a number of Docker images, which takes some time.
By default, the images are built every time the tests run. During development, you may want to skip
re-building those images if the distribution package has not changed since the last run. To do so, pass the
property `spark.kubernetes.test.imageDockerTag` to the test process, set to the tag of the previously built
images. Here is an example:

```
$ mvn clean integration-test \
-Dspark-distro-tgz=spark/spark-2.3.0-SNAPSHOT-bin.tgz \
-Dspark.kubernetes.test.imageDockerTag=latest
```
The simplest way to run the integration tests is to install and run Minikube, then run the following:

build/mvn integration-test

The minimum tested version of Minikube is 0.23.0. The kube-dns addon must be enabled. Minikube should
run with a minimum of 3 CPUs and 4G of memory:

minikube start --cpus 3 --memory 4G

You can download Minikube [here](https://github.com/kubernetes/minikube/releases).

# Integration test customization

The integration test runtime is configured by passing Java system properties to the Maven
command. The most useful options are outlined below.

## Use a non-local cluster

To use your own cluster running in the cloud, set the following:

* `spark.kubernetes.test.deployMode` to `cloud` to indicate that Minikube will not be used.
* `spark.kubernetes.test.master` to your cluster's externally accessible URL.
* `spark.kubernetes.test.imageRepo` to a write-accessible Docker image repository that provides the images for your
cluster. The framework assumes your local Docker client can push to this repository.

The resulting command looks like this:

build/mvn integration-test \
-Dspark.kubernetes.test.deployMode=cloud \
-Dspark.kubernetes.test.master=https://example.com:8443/apiserver \
-Dspark.kubernetes.test.imageRepo=docker.example.com/spark-images

## Re-using Docker Images

By default, the test framework will build new Docker images on every test execution. A unique image tag is generated
and written to a file at `target/imageTag.txt`. To reuse the images built in a previous run, set:
* `spark.kubernetes.test.imageTag` to the tag specified in `target/imageTag.txt`
* `spark.kubernetes.test.skipBuildingImages` to `true`

> **Review comment (Member):** I liked before that it was using the git sha - it made it easy to correlate to the distribution that was just tested.
>
> **Review comment (Contributor Author):** Discussed in #31 (comment)

The resulting command looks like this:

build/mvn integration-test \
-Dspark.kubernetes.test.imageTag=$(cat target/imageTag.txt) \
-Dspark.kubernetes.test.skipBuildingImages=true

## Customizing the Spark Source Code to Test

By default, the test framework will test the master branch of Spark from [here](https://github.com/apache/spark). You
can specify the following options to test against different source versions of Spark:

* `spark.kubernetes.test.sparkRepo` to the git or http URI of the Spark git repository to clone
* `spark.kubernetes.test.sparkBranch` to the branch of the repository to build.

> **Review comment (Member):** gitRepo and gitBranch may be better names for this.

An example:

build/mvn integration-test \
-Dspark.kubernetes.test.sparkRepo=https://github.com/apache-spark-on-k8s/spark \
-Dspark.kubernetes.test.sparkBranch=new-feature

Additionally, you can use a pre-built Spark distribution. In this case, the repository is not cloned at all, and no
source code has to be compiled.

> **Review comment (Member):** I don't understand the motivation of why we should have the integration test code ever try to clone and build spark. Why not just always depend on a pre-built spark distribution? (Sorry if I missed the reason for this)
>
> **Review comment (Contributor Author):** I've found it a lot easier locally to be able to run a single command that both fetches the Spark distribution and builds it and then the integration test uses that Spark distribution. We provide the optionality for the local development scenario but we can still provide the TGZ specifically in Jenkins via `spark.kubernetes.test.sparkTgz`.

* `spark.kubernetes.test.sparkTgz` can be set to a tarball containing the Spark distribution to test.

When the tests clone and build the repository, the resulting Spark distribution is placed in
`target/spark/spark-<VERSION>.tgz`. Reusing this tarball saves a significant amount of time when you are iterating on
the development of these integration tests.
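
For example, a subsequent run could point the tests directly at that previously built tarball instead of cloning and building again (the exact file name below is hypothetical and depends on the Spark version that was built):

build/mvn integration-test \
-Dspark.kubernetes.test.sparkTgz=target/spark/spark-2.3.0-SNAPSHOT-bin.tgz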
158 changes: 158 additions & 0 deletions build/mvn
@@ -0,0 +1,158 @@
#!/usr/bin/env bash
> **Review comment (Member):** Why do we need this? We'll need to ensure that it stays in sync with our repo? Or maybe we can get rid of it once we merge this upstream?
>
> **Review comment (Contributor Author):** We get rid of this when we merge into upstream, but even aside from that, this has hardly been changing in upstream and it's theoretically completely isolated from what we will use to build Spark (since the shell forks the second Maven process it's completely independent).
>
> **Review comment (Contributor Author):** We also have to add it here now because unlike before, now mvn is the entrypoint into the whole system, as it's only the substep of the build reactor that clones the Spark repository down. Aside from that, we also allow for Spark TGZs to be provided to remove the usage of the Spark source code entirely. So no matter what we're going to need our own Maven here.
>
> **Review comment (Member):** Instead of copying the content here, could you just have a much shorter script that just downloads from https://raw.githubusercontent.com/apache/spark/master/build/mvn and execs it? Something like build/mvn-getter whose content is:
>
>     #!/bin/sh
>     curl -s https://raw.githubusercontent.com/apache/spark/master/build/mvn | bash
>
> **Review comment (Contributor Author):** Hm, I think that works. Will incorporate that.
>
> **Review comment (Contributor Author):** Done

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Determine the current working directory
_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
# Preserve the calling directory
_CALLING_DIR="$(pwd)"
# Options used during compilation
_COMPILE_JVM_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

# Installs any application tarball given a URL, the expected tarball name,
# and, optionally, a checkable binary path to determine if the binary has
# already been installed
## Arg1 - URL
## Arg2 - Tarball Name
## Arg3 - Checkable Binary
install_app() {
  local remote_tarball="$1/$2"
  local local_tarball="${_DIR}/$2"
  local binary="${_DIR}/$3"

  # setup `curl` and `wget` silent options if we're running on Jenkins
  local curl_opts="-L"
  local wget_opts=""
  if [ -n "$AMPLAB_JENKINS" ]; then
    curl_opts="-s ${curl_opts}"
    wget_opts="--quiet ${wget_opts}"
  else
    curl_opts="--progress-bar ${curl_opts}"
    wget_opts="--progress=bar:force ${wget_opts}"
  fi

  if [ -z "$3" -o ! -f "$binary" ]; then
    # check if we already have the tarball
    # check if we have curl installed
    # download application
    [ ! -f "${local_tarball}" ] && [ $(command -v curl) ] && \
      echo "exec: curl ${curl_opts} ${remote_tarball}" 1>&2 && \
      curl ${curl_opts} "${remote_tarball}" > "${local_tarball}"
    # if the file still doesn't exist, lets try `wget` and cross our fingers
    [ ! -f "${local_tarball}" ] && [ $(command -v wget) ] && \
      echo "exec: wget ${wget_opts} ${remote_tarball}" 1>&2 && \
      wget ${wget_opts} -O "${local_tarball}" "${remote_tarball}"
    # if both were unsuccessful, exit
    [ ! -f "${local_tarball}" ] && \
      echo -n "ERROR: Cannot download $2 with cURL or wget; " && \
      echo "please install manually and try again." && \
      exit 2
    cd "${_DIR}" && tar -xzf "$2"
    rm -rf "$local_tarball"
  fi
}

# Determine the Maven version from the root pom.xml file and
# install maven under the build/ folder if needed.
install_mvn() {
  local MVN_VERSION=`grep "<maven.version>" "${_DIR}/../pom.xml" | head -n1 | awk -F '[<>]' '{print $3}'`
  echo $MVN_VERSION
  MVN_BIN="$(command -v mvn)"
  if [ "$MVN_BIN" ]; then
    local MVN_DETECTED_VERSION="$(mvn --version | head -n1 | awk '{print $3}')"
  fi
  # See simple version normalization: http://stackoverflow.com/questions/16989598/bash-comparing-version-numbers
  function version { echo "$@" | awk -F. '{ printf("%03d%03d%03d\n", $1,$2,$3); }'; }
  if [ $(version $MVN_DETECTED_VERSION) -lt $(version $MVN_VERSION) ]; then
    local APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='}

    install_app \
      "${APACHE_MIRROR}/maven/maven-3/${MVN_VERSION}/binaries" \
      "apache-maven-${MVN_VERSION}-bin.tar.gz" \
      "apache-maven-${MVN_VERSION}/bin/mvn"

    MVN_BIN="${_DIR}/apache-maven-${MVN_VERSION}/bin/mvn"
  fi
}

# Install zinc under the build/ folder
install_zinc() {
  local zinc_path="zinc-0.3.15/bin/zinc"
  [ ! -f "${_DIR}/${zinc_path}" ] && ZINC_INSTALL_FLAG=1
  local TYPESAFE_MIRROR=${TYPESAFE_MIRROR:-https://downloads.typesafe.com}

  install_app \
    "${TYPESAFE_MIRROR}/zinc/0.3.15" \
    "zinc-0.3.15.tgz" \
    "${zinc_path}"
  ZINC_BIN="${_DIR}/${zinc_path}"
}

# Determine the Scala version from the root pom.xml file, set the Scala URL,
# and, with that, download the specific version of Scala necessary under
# the build/ folder
install_scala() {
  # determine the Scala version used in Spark
  local scala_version=`grep "scala.version" "${_DIR}/../pom.xml" | head -n1 | awk -F '[<>]' '{print $3}'`
  local scala_bin="${_DIR}/scala-${scala_version}/bin/scala"
  local TYPESAFE_MIRROR=${TYPESAFE_MIRROR:-https://downloads.typesafe.com}

  install_app \
    "${TYPESAFE_MIRROR}/scala/${scala_version}" \
    "scala-${scala_version}.tgz" \
    "scala-${scala_version}/bin/scala"

  SCALA_COMPILER="$(cd "$(dirname "${scala_bin}")/../lib" && pwd)/scala-compiler.jar"
  SCALA_LIBRARY="$(cd "$(dirname "${scala_bin}")/../lib" && pwd)/scala-library.jar"
}

# Setup healthy defaults for the Zinc port if none were provided from
# the environment
ZINC_PORT=${ZINC_PORT:-"3030"}

# Remove `--force` for backward compatibility.
if [ "$1" == "--force" ]; then
echo "WARNING: '--force' is deprecated and ignored."
shift
fi

# Install the proper version of Scala, Zinc and Maven for the build
install_zinc
install_scala
install_mvn

# Reset the current working directory
cd "${_CALLING_DIR}"

# Now that zinc is ensured to be installed, check its status and, if it's
# not running or just installed, start it
if [ -n "${ZINC_INSTALL_FLAG}" -o -z "`"${ZINC_BIN}" -status -port ${ZINC_PORT}`" ]; then
export ZINC_OPTS=${ZINC_OPTS:-"$_COMPILE_JVM_OPTS"}
"${ZINC_BIN}" -shutdown -port ${ZINC_PORT}
"${ZINC_BIN}" -start -port ${ZINC_PORT} \
-scala-compiler "${SCALA_COMPILER}" \
-scala-library "${SCALA_LIBRARY}" &>/dev/null
fi

# Set any `mvn` options if not already present
export MAVEN_OPTS=${MAVEN_OPTS:-"$_COMPILE_JVM_OPTS"}

echo "Using \`mvn\` from path: $MVN_BIN" 1>&2

# Last, call the `mvn` command as usual
${MVN_BIN} -DzincPort=${ZINC_PORT} "$@"
39 changes: 0 additions & 39 deletions e2e/e2e-prow.sh

This file was deleted.