spark-k8s-addons

Dockerfile setup to install cloud-related utilities onto the standard Spark K8s Docker images.

The Spark K8s Docker images are built using this repository.

Note that the images here for Spark versions below 3.4.0 are Debian-based because of how the official script generates the Spark-Kubernetes images. For Spark versions above 3.4.0, the official script generates Ubuntu-based images instead.

How to build

BASE_VERSION=v3
SPARK_VERSION=3.4.1
JAVA_VERSION=11
HADOOP_VERSION=3.3.4
SCALA_VERSION=2.13
PYTHON_VERSION=3.9
IMAGE_VERSION=""

docker pull dsaidgovsg/spark-k8s-py:${BASE_VERSION}_${SPARK_VERSION}_hadoop-${HADOOP_VERSION}_scala-${SCALA_VERSION}_java-${JAVA_VERSION}

IMAGE_NAME=spark-k8s-addons
docker build -t "${IMAGE_NAME}" \
    --build-arg BASE_VERSION="${BASE_VERSION}" \
    --build-arg SPARK_VERSION="${SPARK_VERSION}" \
    --build-arg JAVA_VERSION="${JAVA_VERSION}" \
    --build-arg HADOOP_VERSION="${HADOOP_VERSION}" \
    --build-arg SCALA_VERSION="${SCALA_VERSION}" \
    --build-arg PYTHON_VERSION="${PYTHON_VERSION}" \
    --build-arg IMAGE_VERSION="${IMAGE_VERSION}" \
    .
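The base image tag pulled above is assembled from five of the version variables. As a minimal sketch, the tag can be composed in one place so that the pull and the build arguments cannot drift apart. The function name base_image_ref is a hypothetical helper, not something provided by this repository.

```shell
#!/bin/sh
# Compose the base image reference from the same version variables
# that are passed to docker build as --build-arg values.
# base_image_ref is a hypothetical helper name used for illustration.
base_image_ref() {
    printf 'dsaidgovsg/spark-k8s-py:%s_%s_hadoop-%s_scala-%s_java-%s\n' \
        "${BASE_VERSION}" "${SPARK_VERSION}" "${HADOOP_VERSION}" \
        "${SCALA_VERSION}" "${JAVA_VERSION}"
}

BASE_VERSION=v3
SPARK_VERSION=3.4.1
JAVA_VERSION=11
HADOOP_VERSION=3.3.4
SCALA_VERSION=2.13

# Prints: dsaidgovsg/spark-k8s-py:v3_3.4.1_hadoop-3.3.4_scala-2.13_java-11
base_image_ref
```

With this in place, the pull step could be written as docker pull "$(base_image_ref)".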

How to properly manage pip packages

Raw pip does not keep dependency versions compatible across multiple pip install sessions, so poetry has been installed system-wide. The directory containing its pyproject.toml is given by the environment variable POETRY_SYSTEM_PROJECT_DIR.

It is recommended that all pip installations go through poetry, which can be done like this:

pushd "${POETRY_SYSTEM_PROJECT_DIR}"
poetry add <package1> [<other packages to add>]
popd
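pushd and popd are bash builtins, so the snippet above does not work in a plain sh session (for example inside a Dockerfile RUN step that uses /bin/sh). A minimal sketch of a portable wrapper follows. The function name poetry_system_add and the POETRY_BIN override are hypothetical and exist only for this illustration; only POETRY_SYSTEM_PROJECT_DIR comes from the image.

```shell
#!/bin/sh
# Add packages to the system-wide poetry project without changing the
# caller's working directory.
# poetry_system_add and POETRY_BIN are hypothetical names for this sketch;
# POETRY_BIN defaults to "poetry" and only exists to allow a dry run.
poetry_system_add() {
    if [ -z "${POETRY_SYSTEM_PROJECT_DIR:-}" ]; then
        echo "POETRY_SYSTEM_PROJECT_DIR is not set" >&2
        return 1
    fi
    if [ "$#" -eq 0 ]; then
        echo "usage: poetry_system_add <package> [<package>...]" >&2
        return 1
    fi
    # The subshell keeps the cd local, which replaces the pushd/popd pair.
    ( cd "${POETRY_SYSTEM_PROJECT_DIR}" && "${POETRY_BIN:-poetry}" add "$@" )
}

# Example (commented out so that sourcing this file has no side effects):
# poetry_system_add requests
```

Running the function with POETRY_BIN=echo prints the poetry arguments instead of installing anything, which is a cheap way to check the invocation first.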

Add-ons

User spark

A more human-friendly spark username has been added at UID 185, which is the default UID dictated by the official Spark-Kubernetes Docker image build.
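To confirm the mapping from inside a container, a small check along these lines may help. This is a sketch only, and user_has_uid is a hypothetical helper name rather than something shipped in the image.

```shell
#!/bin/sh
# Succeed only if the named user exists and maps to the expected UID.
# user_has_uid is a hypothetical helper name used for illustration.
user_has_uid() {
    # id -u fails for unknown users, in which case the check fails too.
    actual="$(id -u "$1" 2>/dev/null)" || return 1
    [ "${actual}" = "$2" ]
}

# Example (commented out because the spark user only exists in the image):
# user_has_uid spark 185 && echo "spark user is mapped to UID 185"
```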

CLIs

The following command-line tools have been added onto the original K8s Docker images:

  • poetry to properly manage pip installation
  • AWS CLI installed via poetry
  • AWS IAM Authenticator. This is a statically linked Go binary, so it does not interact with any of the items above.

JARs

The following JARs have been added onto the original K8s Docker images:

  • AWS Hadoop SDK JAR
    • Appends spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem into spark-defaults.conf
  • Google Cloud Storage SDK JAR
  • MariaDB JDBC Connector JAR

Spark Configuration

AWS S3A Client

To use the AWS S3A client JAR, add the following line to your Spark application configuration:

echo "spark.hadoop.fs.s3a.impl  org.apache.hadoop.fs.s3a.S3AFileSystem" >> "${SPARK_HOME}/conf/spark-defaults.conf"

If you are using spark-shell or spark-submit, then you can add the above as a flag instead:

spark-shell --conf "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem"
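The echo append shown earlier adds another copy of the line every time it runs. A minimal sketch of an idempotent variant follows, assuming the key is written at the start of a line and separated from its value by whitespace or an equals sign. The function name ensure_spark_conf is hypothetical.

```shell
#!/bin/sh
# Append "key  value" to a Spark properties file only when the key is
# not already present, so repeated runs do not create duplicates.
# ensure_spark_conf is a hypothetical helper name used for illustration.
ensure_spark_conf() {
    conf_file="$1"
    key="$2"
    value="$3"
    # Create the file if it is missing so awk has something to read.
    touch "${conf_file}"
    # Split each line on whitespace or '=' and compare the first field
    # literally, which avoids treating the dots in the key as regex wildcards.
    if ! awk -v k="${key}" -F '[ \t=]+' \
        '$1 == k { found = 1 } END { exit found ? 0 : 1 }' "${conf_file}"; then
        printf '%s  %s\n' "${key}" "${value}" >> "${conf_file}"
    fi
}

# Example (commented out because SPARK_HOME is only set inside the image):
# ensure_spark_conf "${SPARK_HOME}/conf/spark-defaults.conf" \
#     spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
```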

How to Apply Template for CI build

Linux users can download Tera CLI v0.4 from https://github.com/guangie88/tera-cli/releases and place it in PATH.

Otherwise, you will need cargo, which can be installed via rustup.

Once cargo is installed, simply run cargo install tera-cli --version=^0.4.0.

Always make changes in templates/ci.yml.tmpl since the template will be applied onto .github/workflows/ci.yml.

Run templates/apply-vars.sh to apply the template once tera-cli has been installed.
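The steps above can be sketched as a small guard that refuses to run the script when the CLI is missing. This assumes the tera-cli package installs an executable named tera, which is not stated here. The function names have_cmd and apply_ci_template are hypothetical.

```shell
#!/bin/sh
# Return success when the given command can be found on PATH.
# have_cmd and apply_ci_template are hypothetical helper names.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Apply the CI template only when the Tera CLI is available.
# Assumption: the tera-cli package provides a binary named "tera".
apply_ci_template() {
    if ! have_cmd tera; then
        echo "tera not found; install it with: cargo install tera-cli --version=^0.4.0" >&2
        return 1
    fi
    # Must be run from the repository root so the relative path resolves.
    templates/apply-vars.sh
}

# Example (commented out so that sourcing this file has no side effects):
# apply_ci_template
```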
