## Spark on GPU-enabled Kubernetes

Build the container images used to run and submit Apache Spark applications on Kubernetes. The steps download files from NVIDIA and Apache Spark into a local `src/` folder. This example does not require any Kubernetes operators.

## Google Cloud services and resources:

- `Vertex AI`
- `Artifact Registry`
- `Cloud Storage`
- `Kubernetes Engine`
- `Compute Engine`

### Spark Docker Image preparation

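Set up the required parameters. `PROJECT_ID` and `REGION`, which the build commands below also reference, are assumed to be defined earlier in the notebook.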

Container Image:
- `VERSION`: The version or tag of the Docker image. Defaults to `latest`
- `REPO_NAME`: The name of the Artifact Registry repository that will store the Docker images
- `JOB_IMAGE_ID`: The name of the image that will be used to run Spark jobs on Kubernetes. The full image name: `<REGION>-docker.pkg.dev/<PROJECT_ID>/<REPO_NAME>/<JOB_IMAGE_ID>:<VERSION>`
- `BASE_IMAGE_ID`: The name of the image that will be used to submit jobs using Vertex AI. The full image name: `<REGION>-docker.pkg.dev/<PROJECT_ID>/<REPO_NAME>/<BASE_IMAGE_ID>:<VERSION>`
<br>
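
The full image names above follow one pattern, which can be sketched as a small helper. This is only an illustration: the function name `image_uri` and the project ID `my-project` are hypothetical, and the region is a placeholder.

```python
def image_uri(region: str, project_id: str, repo_name: str,
              image_id: str, version: str = "latest") -> str:
    """Return the full Artifact Registry image name for the given parts."""
    return f"{region}-docker.pkg.dev/{project_id}/{repo_name}/{image_id}:{version}"


# Placeholder values for illustration only; substitute your own project and region.
job_image = image_uri("us-central1", "my-project", "gke-mlops-pilot-docker", "spark-gke")
base_image = image_uri("us-central1", "my-project", "gke-mlops-pilot-docker", "component-base")
print(job_image)
print(base_image)
```

Building the name in one place keeps the `docker build`, `docker push`, and job submission steps pointing at the same image.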

Custom Job and Pipeline:
- `SERVICE_ACCOUNT`: The service account used to run custom jobs and pipelines

After the cells below have run, the local `src/` folder will contain: `Dockerfile.cuda`, `spark/` (folder), `getGpusResources.sh`, and `rapids-4-spark_2.12-23.10.0.jar`.

In [None]:
# Image Parameters
VERSION="latest"
REPO_NAME="gke-mlops-pilot-docker" # @param {type:"string"}
JOB_IMAGE_ID="spark-gke" # @param {type:"string"}
BASE_IMAGE_ID = "component-base" # @param {type:"string"}

# Vertex Custom Job parameters
SERVICE_ACCOUNT="757654702990-compute@developer.gserviceaccount.com" # @param {type:"string"}

In [None]:
%%writefile ./src/Dockerfile.cuda

#
# Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

FROM nvidia/cuda:11.8.0-devel-ubuntu20.04
ARG spark_uid=185

# https://forums.developer.nvidia.com/t/notice-cuda-linux-repository-key-rotation/212771
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub

# Install java dependencies
ENV DEBIAN_FRONTEND="noninteractive"
RUN apt-get update && apt-get install -y --no-install-recommends openjdk-8-jdk openjdk-8-jre
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV PATH $PATH:/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin

# Before building the docker image, first either download Apache Spark 3.1+ from
# http://spark.apache.org/downloads.html or build and make a Spark distribution following the
# instructions in http://spark.apache.org/docs/3.1.2/building-spark.html (see
# https://nvidia.github.io/spark-rapids/docs/download.html for other supported versions).  If this
# docker file is being used in the context of building your images from a Spark distribution, the
# docker build command should be invoked from the top level directory of the Spark
# distribution. E.g.: docker build -t spark:3.1.2 -f kubernetes/dockerfiles/spark/Dockerfile .

RUN set -ex && \
    ln -s /lib /lib64 && \
    mkdir -p /opt/spark && \
    mkdir -p /opt/spark/jars && \
    mkdir -p /opt/spark/examples && \
    mkdir -p /opt/spark/work-dir && \
    mkdir -p /opt/sparkRapidsPlugin && \
    touch /opt/spark/RELEASE && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd

COPY spark/jars /opt/spark/jars
COPY spark/bin /opt/spark/bin
COPY spark/sbin /opt/spark/sbin
COPY spark/kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY spark/examples /opt/spark/examples
COPY spark/kubernetes/tests /opt/spark/tests
COPY spark/data /opt/spark/data

COPY rapids-4-spark_2.12-23.10.0.jar /opt/sparkRapidsPlugin
COPY getGpusResources.sh /opt/sparkRapidsPlugin

RUN mkdir /opt/spark/python
# TODO: Investigate running both pip and pip3 via virtualenvs
RUN apt-get update && \
    apt install -y python wget && wget https://bootstrap.pypa.io/pip/2.7/get-pip.py && python get-pip.py && \
    apt install -y python3 python3-pip && \
    # We remove ensurepip since it adds no functionality since pip is
    # installed on the image and it just takes up 1.6MB on the image
    rm -r /usr/lib/python*/ensurepip && \
    pip install --upgrade pip setuptools && \
    # You may install with python3 packages by using pip3.6
    # Removed the .cache to save space
    rm -r /root/.cache && rm -rf /var/cache/apt/*

COPY spark/python/pyspark /opt/spark/python/pyspark
COPY spark/python/lib /opt/spark/python/lib

ENV SPARK_HOME /opt/spark

WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir

ENV TINI_VERSION v0.18.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /usr/bin/tini
RUN chmod +rx /usr/bin/tini

ENTRYPOINT [ "/opt/entrypoint.sh" ]

# Specify the User that the actual main process will run as
USER ${spark_uid}

In [None]:
# Download Spark 3.5.0 locally (releases that are no longer current are served from archive.apache.org)
! wget -P ./src https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

# Un-tar and rename to spark
! tar -xzf ./src/spark-3.5.0-bin-hadoop3.tgz -C ./src && \
  mv ./src/spark-3.5.0-bin-hadoop3 ./src/spark && \
  rm ./src/spark-3.5.0-bin-hadoop3.tgz

# Download the GPU discovery script (use the raw file URL; the /blob/ URL returns an HTML page)
! wget -P ./src https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh

# Download the RAPIDS Accelerator jar
! wget -P ./src https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.10.0/rapids-4-spark_2.12-23.10.0.jar
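
Before building, it can help to confirm that `src/` holds everything the `COPY` instructions in `Dockerfile.cuda` expect. The following is a minimal sketch, not part of the original notebook: the helper name `check_build_context` is hypothetical, and the file and folder names are taken from the Dockerfile above. It also flags a discovery script that is actually an HTML page, which happens when the file is fetched from a GitHub page URL instead of the raw file URL.

```python
from pathlib import Path
from typing import List

# Paths taken from the COPY instructions in Dockerfile.cuda.
REQUIRED_FILES = [
    "Dockerfile.cuda",
    "getGpusResources.sh",
    "rapids-4-spark_2.12-23.10.0.jar",
    "spark/kubernetes/dockerfiles/spark/entrypoint.sh",
]
REQUIRED_DIRS = [
    "spark/jars", "spark/bin", "spark/sbin", "spark/examples", "spark/data",
    "spark/kubernetes/tests", "spark/python/pyspark", "spark/python/lib",
]


def check_build_context(src_dir: str = "./src") -> List[str]:
    """Return a list of problems; an empty list means the context looks complete."""
    src = Path(src_dir)
    problems = []
    for name in REQUIRED_FILES:
        if not (src / name).is_file():
            problems.append(f"missing file: {name}")
    for name in REQUIRED_DIRS:
        if not (src / name).is_dir():
            problems.append(f"missing directory: {name}")
    script = src / "getGpusResources.sh"
    # A shell script should not begin with markup; "<" suggests an HTML download.
    if script.is_file() and script.read_text(errors="replace").lstrip().startswith("<"):
        problems.append("getGpusResources.sh looks like HTML, not a shell script")
    return problems


for problem in check_build_context("./src"):
    print(problem)
```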

In [None]:
# Build and push image to registry
! docker build ./src -f ./src/Dockerfile.cuda -t {REGION}-docker.pkg.dev/{PROJECT_ID}/{REPO_NAME}/{JOB_IMAGE_ID}:{VERSION}
! gcloud auth configure-docker {REGION}-docker.pkg.dev --quiet
! docker push {REGION}-docker.pkg.dev/{PROJECT_ID}/{REPO_NAME}/{JOB_IMAGE_ID}:{VERSION}
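
Once the job image is in the registry, a Spark application can reference it at submit time. The sketch below assembles a `spark-submit` argument list that points at the image and at the plugin files copied to `/opt/sparkRapidsPlugin` by `Dockerfile.cuda`. It is an assumption-laden illustration rather than the notebook's own submission code: the helper name `spark_submit_args`, the API server address, the image name, and the GPU amounts are all placeholders, and a real cluster will usually need further settings (for example a driver service account and memory sizing).

```python
from typing import List


def spark_submit_args(k8s_api_server: str, image: str,
                      namespace: str = "default", executors: int = 1) -> List[str]:
    """Build a spark-submit command for the SparkPi example on a GPU-enabled image."""
    plugin_dir = "/opt/sparkRapidsPlugin"  # created in Dockerfile.cuda
    plugin_jar = f"{plugin_dir}/rapids-4-spark_2.12-23.10.0.jar"
    conf = {
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.namespace": namespace,
        "spark.executor.instances": str(executors),
        # Illustrative values: one GPU per executor, one task per GPU.
        "spark.executor.resource.gpu.amount": "1",
        "spark.task.resource.gpu.amount": "1",
        "spark.executor.resource.gpu.vendor": "nvidia.com",
        "spark.executor.resource.gpu.discoveryScript": f"{plugin_dir}/getGpusResources.sh",
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.driver.extraClassPath": plugin_jar,
        "spark.executor.extraClassPath": plugin_jar,
    }
    args = [
        "bin/spark-submit",
        "--master", f"k8s://{k8s_api_server}",
        "--deploy-mode", "cluster",
        "--name", "spark-pi",
        "--class", "org.apache.spark.examples.SparkPi",
    ]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    # Examples jar shipped inside the Spark 3.5.0 distribution copied into the image.
    args.append("local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar")
    return args


# Placeholder API server address and image name.
example = spark_submit_args(
    "https://203.0.113.10:443",
    "us-central1-docker.pkg.dev/my-project/gke-mlops-pilot-docker/spark-gke:latest",
)
print(" ".join(example))
```

Returning a list instead of a single string makes it straightforward to pass the command to a container spec or to `subprocess.run` without shell quoting issues.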

## Custom Jobs

Submit Spark applications to Kubernetes using [Custom Jobs and Worker Pool Specs](https://cloud.google.com/vertex-ai/docs/training/create-custom-job).
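
A worker pool spec is a list of dictionaries describing the machine and container for each pool. The following is a minimal sketch of one such spec for a single-replica job that runs a command in the base image. The helper name `worker_pool_specs`, the machine type, the cluster name `my-cluster`, and the image name are assumptions for illustration; the commented-out submission additionally assumes the `google-cloud-aiplatform` package is installed and that a staging bucket (here the hypothetical `BUCKET_URI`) exists.

```python
from typing import Any, Dict, List


def worker_pool_specs(base_image: str, command: List[str],
                      machine_type: str = "n1-standard-4") -> List[Dict[str, Any]]:
    """Return a single-pool spec that runs `command` inside `base_image`."""
    return [
        {
            "machine_spec": {"machine_type": machine_type},
            "replica_count": 1,
            "container_spec": {
                "image_uri": base_image,
                "command": command,
                "args": [],
            },
        }
    ]


# Placeholder command: fetch cluster credentials, then call spark-submit.
# The cluster name and region are hypothetical.
command = [
    "bash", "-c",
    "gcloud container clusters get-credentials my-cluster --region us-central1 "
    "&& bin/spark-submit --version",
]
specs = worker_pool_specs(
    "us-central1-docker.pkg.dev/my-project/gke-mlops-pilot-docker/component-base:latest",
    command,
)
print(specs)

# Submitting the job (not executed here; requires google-cloud-aiplatform):
# from google.cloud import aiplatform
# aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)
# job = aiplatform.CustomJob(display_name="spark-on-gke", worker_pool_specs=specs)
# job.run(service_account=SERVICE_ACCOUNT)
```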

### Create Custom Job base-image

This image includes Spark, kubectl, and the gcloud SDK.

In [None]:
%%writefile ./src/Dockerfile

FROM google/cloud-sdk:latest

# Install kubectl from the Cloud SDK apt repository (assumed to be already
# configured in the google/cloud-sdk base image; the legacy apt.kubernetes.io
# repository is no longer served)
RUN apt-get update && \
    apt-get install -y apt-transport-https ca-certificates curl kubectl

# Install pip
RUN apt-get install -y python3-pip
RUN pip3 install --upgrade pip
RUN apt-get install -y cmake
RUN apt-get install -y net-tools

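# Download and extract Apache Spark (3.5.0 is served from the Apache archive);
# -f makes curl fail on HTTP errors and -L follows redirects
RUN curl -fSLO https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz && \
    tar -xzf spark-3.5.0-bin-hadoop3.tgz && \
    mv spark-3.5.0-bin-hadoop3 spark && \
    rm spark-3.5.0-bin-hadoop3.tgz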

# Change working directory to spark
WORKDIR /spark

In [None]:
# Build and push image to registry

! docker build ./src -f ./src/Dockerfile -t {REGION}-docker.pkg.dev/{PROJECT_ID}/{REPO_NAME}/{BASE_IMAGE_ID}:{VERSION}
! docker push {REGION}-docker.pkg.dev/{PROJECT_ID}/{REPO_NAME}/{BASE_IMAGE_ID}:{VERSION}