# GKE workloads and executions using Vertex AI 

## Objective
Run a Vertex AI custom job that submits a Spark workload to a GKE cluster that uses custom compute classes. A ComputeClass defines an ordered list of resource preferences. GKE tries to provision nodes in that order (for example, L4 GPU, then T4 GPU, then CPU) and falls back to the next entry when a preferred resource is unavailable.
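The fallback behaviour can be pictured as a first-match search over an ordered list. The sketch below only illustrates that idea and is not GKE's actual provisioning logic; `pick_first_available` and the `capacity` snapshot are hypothetical names introduced for this example.

```python
from typing import Callable, Optional, Sequence


def pick_first_available(
    priorities: Sequence[str],
    is_available: Callable[[str], bool],
) -> Optional[str]:
    """Return the first entry in `priorities` that is currently available."""
    for option in priorities:
        if is_available(option):
            return option
    # Nothing in the list could be provisioned.
    return None


# Hypothetical capacity snapshot: L4 GPUs are exhausted, T4 GPUs are not.
capacity = {"nvidia-l4": False, "nvidia-tesla-t4": True, "n1-cpu": True}

choice = pick_first_available(
    ["nvidia-l4", "nvidia-tesla-t4", "n1-cpu"],
    lambda name: capacity.get(name, False),
)
print(choice)
```

The order of the list is the only thing that encodes preference, which is why the `priorities` field of a ComputeClass is written from most to least preferred.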

## Flow Diagram
![Vertex AI to GKE flow diagram](./img/vertex_gke_flow.PNG)

### Work Flow Pattern
- A GKE cluster (Standard or Autopilot mode) is created, and a custom ComputeClass is set as the default for a namespace
- A Vertex AI custom job pulls a container image from Artifact Registry and submits the containerized workload, as configured in its WorkerPoolSpecs
- The Spark workload runs on the Kubernetes cluster specified in the job configuration


## Google Cloud services and resources:

- `Vertex AI`
- `Artifact Registry`
- `Cloud Storage`
- `Kubernetes Engine`
- `Compute Engine`

In [None]:
# Check the versions of the packages installed

! kubectl version --client
! python3 -c "import kfp; print('KFP SDK version: {}'.format(kfp.__version__))"
! python3 -c "import google_cloud_pipeline_components; print('google_cloud_pipeline_components version: {}'.format(google_cloud_pipeline_components.__version__))"

In [350]:
# Project parameters
PROJECT_ID = "sandbox-401718" # @param {type:"string"}
REGION="us-central1" # @param {type:"string"}

# Cluster parameters
NETWORK="beusebio-network" # @param {type:"string"}
cluster_name = "ccc-test-region-autopilot" # @param {type:"string"}
# The Autopilot cluster is regional, so this value is a region; pass it to gcloud with --location
cluster_zone = "us-central1" # @param {type:"string"}

# storage bucket to store intermediate artifacts such as YAML job files
BUCKET_URI = "gs://sandbox-401718-us-notebooks/gke-yaml"  # @param {type:"string"}

In [352]:
! gcloud container clusters create-auto {cluster_name} \
    --network={NETWORK} \
    --location=us-central1 \
    --release-channel=regular

Creating cluster ccc-test-region-autopilot in us-central1... Cluster is being configured...
Creating cluster ccc-test-region-autopilot in us-central1... Cluster is being deployed...
Creating cluster ccc-test-region-autopilot in us-central1... Cluster is being health-checked (Kubernetes Control Plane is healthy)...done.
Created [https://container.googleapis.com/v1/projects/sandbox-401718/zones/us-central1/clusters/ccc-test-region-autopilot].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/us-central1/ccc-test-region-autopilot?project=sandbox-401718
kubeconfig entry generated for ccc-test-region-autopilot.
NAME                       LOCATION     MASTER_VERSION      MASTER_IP      MACHINE_TYPE  NODE_VERSION        NUM_NODES  STATUS
ccc-test-region-autopilot  us-central1  1

### Set the Kubernetes control plane endpoint and fetch cluster credentials

In [353]:
K8S = "https://34.173.27.183" # @param {type:"string"}

! gcloud container clusters get-credentials {cluster_name} --location {cluster_zone} --project {PROJECT_ID}

Fetching cluster endpoint and auth data.
kubeconfig entry generated for ccc-test-region-autopilot.
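The control plane address in `K8S` is entered by hand. As an alternative, the endpoint could be read from `gcloud container clusters describe` with `--format "value(endpoint)"`. This is a minimal sketch under that assumption; `describe_endpoint_cmd`, `to_master_url`, and `lookup_master_url` are hypothetical helper names, and the lookup only works with an authenticated gcloud CLI.

```python
import subprocess
from typing import List


def describe_endpoint_cmd(cluster: str, location: str, project: str) -> List[str]:
    """Build the gcloud command that prints only the control plane endpoint."""
    return [
        "gcloud", "container", "clusters", "describe", cluster,
        "--location", location,
        "--project", project,
        "--format", "value(endpoint)",
    ]


def to_master_url(endpoint: str) -> str:
    """Turn a bare IP or hostname into the https:// URL stored in K8S."""
    endpoint = endpoint.strip()
    return endpoint if endpoint.startswith("https://") else f"https://{endpoint}"


def lookup_master_url(cluster: str, location: str, project: str) -> str:
    """Run gcloud and return the control plane URL (needs gcloud credentials)."""
    result = subprocess.run(
        describe_endpoint_cmd(cluster, location, project),
        check=True,
        capture_output=True,
        text=True,
    )
    return to_master_url(result.stdout)
```

In this notebook the call would look like `K8S = lookup_master_url(cluster_name, cluster_zone, PROJECT_ID)`, which avoids a stale address after the cluster is recreated.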


In [None]:
! gcloud container clusters describe {cluster_name} --location {cluster_zone}

### Define a Custom Compute Class

In [354]:
%%writefile ./src/computeclass.yaml

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: l4-t4-cpu
spec:
  priorities:
  - gpu:
      count: 1
      type: nvidia-l4
  - gpu:
      count: 1
      type: nvidia-tesla-t4
  - machineFamily: n1
    minCores: 16
  activeMigration:
    optimizeRulePriority: true
  nodePoolAutoCreation:
    enabled: true

Overwriting ./src/computeclass.yaml


In [355]:
# Apply compute class
! kubectl apply -f ./src/computeclass.yaml

computeclass.cloud.google.com/l4-t4-cpu created
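When the priority list needs to vary between experiments, the same manifest can be generated instead of hand-edited. The sketch below rebuilds the `l4-t4-cpu` ComputeClass as a Python dictionary and serializes it to JSON, which `kubectl apply -f` also accepts. `build_compute_class` is a hypothetical helper; the field names are copied from the YAML above.

```python
import json
from typing import Any, Dict, List


def build_compute_class(name: str, priorities: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Assemble a ComputeClass manifest with the same options as the YAML above."""
    return {
        "apiVersion": "cloud.google.com/v1",
        "kind": "ComputeClass",
        "metadata": {"name": name},
        "spec": {
            # Ordered from most to least preferred.
            "priorities": priorities,
            "activeMigration": {"optimizeRulePriority": True},
            "nodePoolAutoCreation": {"enabled": True},
        },
    }


manifest = build_compute_class(
    "l4-t4-cpu",
    [
        {"gpu": {"count": 1, "type": "nvidia-l4"}},
        {"gpu": {"count": 1, "type": "nvidia-tesla-t4"}},
        {"machineFamily": "n1", "minCores": 16},
    ],
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Writing `manifest_json` to a file such as `./src/computeclass.json` (a hypothetical path) and applying it with `kubectl apply -f` should be equivalent to applying the YAML version.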


In [356]:
! kubectl describe computeclass l4-t4-cpu

Name:         l4-t4-cpu
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  cloud.google.com/v1
Kind:         ComputeClass
Metadata:
  Creation Timestamp:  2025-02-07T00:28:02Z
  Generation:          1
  Managed Fields:
    API Version:  cloud.google.com/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:activeMigration:
          .:
          f:optimizeRulePriority:
        f:nodePoolAutoCreation:
          .:
          f:enabled:
        f:priorities:
        f:whenUnsatisfiable:
    Manager:         kubectl-client-side-apply
    Operation:       Update
    Time:            2025-02-07T00:28:02Z
  Resource Version:  10366
  UID:               4334af08-1960-4cec-8e23-c334a57625bf
Spec:
  Active Migration:
    Optimize Rule Priority:  true
  Node Pool Auto Creation:
    Enabled:  true
  Priorities:
    Gpu:
      Count:       

### Test Example Workload

In [359]:
%%writefile ./src/workload.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: custom-workload
  template:
    metadata:
      labels:
        app: custom-workload
    spec:
      nodeSelector:
        cloud.google.com/compute-class: l4-t4-cpu
      containers:
      - name: test
        image: gcr.io/google_containers/pause
        resources:
          requests:
            cpu: 1.5
            memory: "4Gi"

Overwriting ./src/workload.yaml


In [361]:
# Apply the test workload
! kubectl apply -f ./src/workload.yaml

deployment.apps/custom-workload created


In [362]:
# Show detailed information about the Deployment
! kubectl describe deployment custom-workload

In [367]:
# Check that all Pods are running
! kubectl get pods -l=app=custom-workload

NAME                              READY   STATUS    RESTARTS   AGE
custom-workload-79dd44d75-f2gbd   1/1     Running   0          106s
custom-workload-79dd44d75-lf99j   1/1     Running   0          106s
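Reading the `STATUS` column by eye works for two Pods, but a scripted check is easier to reuse. The sketch below parses the JSON that `kubectl get pods -o json` returns and reports whether every Pod is in the `Running` phase. `pod_phases` and `all_running` are hypothetical helpers, and the sample payload is hand-written, not captured output.

```python
import json
from typing import Dict


def pod_phases(pod_list_json: str) -> Dict[str, str]:
    """Map each Pod name to its status.phase from `kubectl get pods -o json`."""
    pods = json.loads(pod_list_json)["items"]
    return {p["metadata"]["name"]: p["status"].get("phase", "Unknown") for p in pods}


def all_running(pod_list_json: str) -> bool:
    """True only if at least one Pod exists and all Pods are Running."""
    phases = pod_phases(pod_list_json)
    return bool(phases) and all(phase == "Running" for phase in phases.values())


# Hand-written sample: one Pod is still waiting for a node.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "custom-workload-a"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "custom-workload-b"}, "status": {"phase": "Pending"}},
    ]
})
print(pod_phases(sample))
print(all_running(sample))
```

In a notebook cell, the input could come from `out = !kubectl get pods -l app=custom-workload -o json` followed by `all_running("\n".join(out))`. A Pod that stays `Pending` usually means the autoscaler is still creating a node for the compute class.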


In [167]:
# View nodes
! kubectl get nodes

NAME                                               STATUS   ROLES    AGE   VERSION
gke-ccc-test-autoprov-default-pool-f68a1614-f9wh   Ready    <none>   3m    v1.31.4-gke.1256000
gke-ccc-test-autoprov-default-pool-f68a1614-hv5t   Ready    <none>   3m    v1.31.4-gke.1256000
gke-ccc-test-autoprov-default-pool-f68a1614-wbx8   Ready    <none>   3m    v1.31.4-gke.1256000


## Spark on GPU-enabled Kubernetes

Build an image that can run and submit Apache Spark applications on Kubernetes. The steps include downloading files from NVIDIA and Apache Spark into a local `src/` folder. This example does not require a Kubernetes operator.

### Configure RBAC Role
Create the namespace, configure role-based access control (RBAC) for the Spark service account, and verify that the account has the permissions needed to run Spark workloads on Kubernetes.

In [370]:
%%writefile ./src/spark-role.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: spark-demo
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-demo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  # A ClusterRoleBinding is cluster-scoped, so it takes no namespace
  name: spark-role
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: spark-demo
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io

Overwriting ./src/spark-role.yaml


In [371]:
# Create the namespace, set the Custom Compute Class as its default, apply the RBAC config,
# and verify that the service account can run Spark workloads on Kubernetes
! kubectl create namespace spark-demo
! kubectl label namespaces spark-demo \
    cloud.google.com/default-compute-class=l4-t4-cpu
! kubectl --namespace=spark-demo apply -f ./src/spark-role.yaml
! kubectl auth can-i create pod --namespace spark-demo --as=system:serviceaccount:spark-demo:spark
! kubectl auth can-i delete services --namespace spark-demo --as=system:serviceaccount:spark-demo:spark

namespace/spark-demo created
namespace/spark-demo labeled
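The permission checks can be scripted so that a missing permission is caught before the Spark job is submitted. This sketch builds the same `kubectl auth can-i` commands for the `spark` service account in the `spark-demo` namespace and interprets the answer. `can_i_cmd` and `is_allowed` are hypothetical helpers, and the sketch assumes the command prints a line beginning with `yes` or `no`.

```python
from typing import List, Tuple


def can_i_cmd(verb: str, resource: str, namespace: str, service_account: str) -> List[str]:
    """Build a `kubectl auth can-i` command that impersonates a service account."""
    return [
        "kubectl", "auth", "can-i", verb, resource,
        "--namespace", namespace,
        "--as", f"system:serviceaccount:{namespace}:{service_account}",
    ]


def is_allowed(stdout: str) -> bool:
    """Interpret the command output; anything not starting with 'yes' is a denial."""
    return stdout.strip().lower().startswith("yes")


# The two checks used in this notebook.
CHECKS: List[Tuple[str, str]] = [("create", "pods"), ("delete", "services")]

commands = [can_i_cmd(verb, res, "spark-demo", "spark") for verb, res in CHECKS]
for cmd in commands:
    print(" ".join(cmd))
```

Each command can be run with `subprocess.run(cmd, capture_output=True, text=True)` and its `stdout` passed to `is_allowed`.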


### Spark Workload

In [378]:
# Image Parameters
VERSION="latest"
REPO_NAME="gke-mlops-pilot-docker" # @param {type:"string"}
JOB_IMAGE_ID="spark-gke" # @param {type:"string"}
BASE_IMAGE_ID = "component-base" # @param {type:"string"}

# Vertex Custom Job parameters
SERVICE_ACCOUNT="757654702990-compute@developer.gserviceaccount.com" # @param {type:"string"}
PIPELINE_ROOT="gs://sanbox-bucket-kfp-intro-demo" # @param {type:"string"}

In [379]:
# Import libraries

import os
from google.cloud import aiplatform

In [380]:
# Spark Pi test

CMD = [
    r"""gcloud container clusters get-credentials {cluster_name_} --location {cluster_zone_} --project {project} && ./bin/spark-submit \
        --master k8s://{k8s} \
        --deploy-mode cluster \
        --name spark-pi \
        --class org.apache.spark.examples.SparkPi \
        --conf spark.kubernetes.driver.request.cores=400m \
        --conf spark.kubernetes.executor.request.cores=100m \
        --conf spark.kubernetes.container.image={image} \
        --conf spark.kubernetes.namespace=spark-demo \
        --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
        local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar""".format(
        cluster_name_=cluster_name,
        cluster_zone_=cluster_zone,
        project=PROJECT_ID,
        k8s=K8S,
        image=f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{REPO_NAME}/{JOB_IMAGE_ID}:{VERSION}",
    )
]
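A long format string is easy to break when options are added or removed. As an alternative, the `spark-submit` command line can be assembled from a dictionary of Spark settings. This is a sketch of that approach with the same options as the cell above; `spark_submit_args` is a hypothetical helper, and the master address and image URI are placeholder values taken from this notebook.

```python
import shlex
from typing import Dict, List


def spark_submit_args(
    master: str,
    name: str,
    main_class: str,
    conf: Dict[str, str],
    app_jar: str,
) -> List[str]:
    """Assemble spark-submit arguments for cluster mode on Kubernetes."""
    args = [
        "./bin/spark-submit",
        "--master", f"k8s://{master}",
        "--deploy-mode", "cluster",
        "--name", name,
        "--class", main_class,
    ]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    # The application JAR comes last.
    args.append(app_jar)
    return args


args = spark_submit_args(
    master="https://34.173.27.183",
    name="spark-pi",
    main_class="org.apache.spark.examples.SparkPi",
    conf={
        "spark.kubernetes.driver.request.cores": "400m",
        "spark.kubernetes.executor.request.cores": "100m",
        "spark.kubernetes.container.image": "us-central1-docker.pkg.dev/sandbox-401718/gke-mlops-pilot-docker/spark-gke:latest",
        "spark.kubernetes.namespace": "spark-demo",
        "spark.kubernetes.authenticate.driver.serviceAccountName": "spark",
    },
    app_jar="local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar",
)
command = shlex.join(args)
print(command)
```

The resulting `command` string can be appended after the `gcloud container clusters get-credentials ... &&` prefix to form the single shell string passed to `sh -c`.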

In [381]:
WORKER_POOL_SPEC_ = [
    {
        "replica_count": 1,
        "machine_spec": {"machine_type": "n1-standard-4", "accelerator_count": 0},
        "container_spec": {
            "image_uri": f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{REPO_NAME}/{BASE_IMAGE_ID}:{VERSION}",
            "command": ["sh", "-c"],
            "args": CMD
        },
    }
]

In [382]:
custom_job = aiplatform.CustomJob(
    display_name="k8s-custom-job",
    worker_pool_specs=WORKER_POOL_SPEC_,
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=PIPELINE_ROOT
)

custom_job.run(sync=False, service_account=SERVICE_ACCOUNT)
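Because the job is started with `sync=False`, the cell returns before the Spark workload has finished. The sketch below is a generic polling loop that waits for a terminal state. `wait_for_terminal_state` is a hypothetical helper, and the state names are assumed to mirror the Vertex AI `JobState` enum. The state is supplied through a callable so that the loop does not depend on a particular SDK version.

```python
import time
from typing import Callable

# Assumed terminal state names, mirroring the Vertex AI JobState enum.
TERMINAL_STATES = {
    "JOB_STATE_SUCCEEDED",
    "JOB_STATE_FAILED",
    "JOB_STATE_CANCELLED",
    "JOB_STATE_EXPIRED",
}


def wait_for_terminal_state(
    get_state: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `get_state` until it returns a terminal state or the timeout passes."""
    waited = 0.0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still in state {state} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

With the SDK this might be called as `wait_for_terminal_state(lambda: custom_job.state.name)`, assuming the job resource has already been created when the first poll runs.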

### Check Kubernetes Task Completion and Output

In [383]:
! kubectl get pods --namespace=spark-demo

NAME                               READY   STATUS      RESTARTS   AGE
spark-pi-d68f3f94ddd6e3e4-driver   0/1     Completed   0          34m


In [384]:
# Check the logs for any Pod

pod = "spark-pi-d68f3f94ddd6e3e4-driver"  # @param {type:"string"}
! kubectl logs {pod} --namespace=spark-demo

++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry=
+ set -e
+ '[' -z '' ']'
+ '[' -w /etc/passwd ']'
+ echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
+ '[' -z /usr/lib/jvm/java-1.8.0-openjdk-amd64 ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
++ command -v readarray
+ '[' readarray ']'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*:/opt/spark/work-dir'
+ case "$1" in
+ shift 1
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --conf "spark.executorEnv.SPARK_DRIVER_POD_IP=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.86.0.71 --conf spark.executorEn
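The driver log is long, and the result of the job sits near the end. This sketch extracts the estimate from the log text, assuming the SparkPi example prints a line of the form `Pi is roughly <value>`; that line does not appear in the truncated output above. `extract_pi` is a hypothetical helper and the sample log is hand-written.

```python
import re
from typing import Optional

# Assumed result line format of the SparkPi example.
PI_LINE = re.compile(r"Pi is roughly (\d+\.\d+)")


def extract_pi(log_text: str) -> Optional[float]:
    """Return the value from the first 'Pi is roughly ...' line, if present."""
    match = PI_LINE.search(log_text)
    return float(match.group(1)) if match else None


# Hand-written sample, not captured output.
sample_log = "+ set -e\nsome unrelated log line\nPi is roughly 3.1416\n"
print(extract_pi(sample_log))
print(extract_pi("no result here"))
```

In the notebook, the log text could be captured with `out = !kubectl logs {pod} --namespace=spark-demo` and passed in as `"\n".join(out)`.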

In [387]:
! kubectl describe pod spark-pi-d68f3f94ddd6e3e4-driver --namespace=spark-demo

Name:             spark-pi-d68f3f94ddd6e3e4-driver
Namespace:        spark-demo
Priority:         0
Service Account:  spark
Node:             gk3-ccc-test-region-auto-nap-1rlkbinn-a000bd62-hm6z/10.128.15.206
Start Time:       Fri, 07 Feb 2025 00:37:23 +0000
Labels:           spark-app-name=spark-pi
                  spark-app-selector=spark-4709dc795e3747178f21befd37354f2b
                  spark-role=driver
                  spark-version=3.5.0
Annotations:      autopilot.gke.io/resource-adjustment:
                    {"input":{"containers":[{"limits":{"memory":"1408Mi"},"requests":{"cpu":"400m","memory":"1408Mi"},"name":"spark-kubernetes-driver"}]},"out...
                  autopilot.gke.io/warden-version: 31.23.0-gke.7
Status:           Succeeded
IP:               10.86.0.71
IPs:
  IP:  10.86.0.71
Containers:
  spark-kubernetes-driver:
    Container ID:  containerd://4e921f6493c8a0fa6daaf0ba79855aaef46df264b34e9df6aeb74a31f744f500
    Image:         us-central1-docker.pkg.dev/sandb

In [388]:
! kubectl get nodes -l cloud.google.com/compute-class=l4-t4-cpu

NAME                                                  STATUS   ROLES    AGE   VERSION
gk3-ccc-test-region-auto-nap-1rlkbinn-a000bd62-hm6z   Ready    <none>   92m   v1.31.4-gke.1256000
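The node list shows that a node was created for the compute class, but not which priority rule it satisfied. One way to find out is to inspect node labels. This sketch assumes GKE sets the `cloud.google.com/gke-accelerator` label on GPU nodes and uses the standard `node.kubernetes.io/instance-type` label for the machine type. `describe_node_hardware` and the sample node names are hypothetical.

```python
import json
from typing import Dict

# Assumption: GKE sets this label on nodes with attached GPUs.
ACCELERATOR_LABEL = "cloud.google.com/gke-accelerator"
INSTANCE_TYPE_LABEL = "node.kubernetes.io/instance-type"


def describe_node_hardware(node_list_json: str) -> Dict[str, str]:
    """Summarize each node from `kubectl get nodes -o json` as GPU or CPU-only."""
    summary = {}
    for node in json.loads(node_list_json)["items"]:
        labels = node["metadata"].get("labels", {})
        gpu = labels.get(ACCELERATOR_LABEL)
        machine = labels.get(INSTANCE_TYPE_LABEL, "unknown")
        summary[node["metadata"]["name"]] = f"gpu:{gpu}" if gpu else f"cpu-only:{machine}"
    return summary


# Hand-written sample with hypothetical node names.
sample_nodes = json.dumps({
    "items": [
        {"metadata": {"name": "node-a", "labels": {ACCELERATOR_LABEL: "nvidia-l4"}}},
        {"metadata": {"name": "node-b", "labels": {INSTANCE_TYPE_LABEL: "n1-standard-16"}}},
    ]
})
print(describe_node_hardware(sample_nodes))
```

The input could come from `kubectl get nodes -l cloud.google.com/compute-class=l4-t4-cpu -o json`. A `cpu-only` entry would suggest that the job fell through to the last rule in the priority list.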


In [262]:
# Delete Cluster
! gcloud container clusters delete {cluster_name} --location {cluster_zone} --quiet

deployment.apps "custom-workload" deleted


## Additional References
* [About Custom Compute Classes](https://cloud.google.com/kubernetes-engine/docs/concepts/about-custom-compute-classes)
* [Running Spark on Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html)
* [Getting Started with RAPIDS and Kubernetes](https://docs.nvidia.com/ai-enterprise/deployment-guide-spark-rapids-accelerator/0.1.0/kubernetes.html)