Skip to content

Latest commit

 

History

History
109 lines (85 loc) · 4.73 KB

kubeflow-integration.md

File metadata and controls

109 lines (85 loc) · 4.73 KB

Credit: This manifest refers a lot to the engineering blog "Building a Machine Learning Platform with Kubeflow and Ray on Google Kubernetes Engine" from Google Cloud.

Kubeflow: an interactive development solution

The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable.

Requirements

  • Dependencies

    • kustomize: v3.2.0 (Kubeflow manifest is sensitive to kustomize version.)
    • Kubernetes: v1.23
  • Computing resources:

    • 16GB RAM
    • 8 CPUs

Example: Use Kubeflow to provide an interactive development envirzonment

image

Step 1: Create a Kubernetes cluster with Kind.

# Kubeflow is sensitive to Kubernetes version and Kustomize version.
kind create cluster --image=kindest/node:v1.23.0
kustomize version --short
# 3.2.0

Step 2: Install Kubeflow v1.6-branch

  • This example installs Kubeflow with the v1.6-branch.

  • Install all Kubeflow official components and all common services using one command.

    • If you do not want to install all components, you can comment out KNative, Katib, Tensorboards Controller, Tensorboard Web App, Training Operator, and KServe from example/kustomization.yaml.

Step 3: Install KubeRay operator

  • Follow this document to install the latest stable KubeRay operator via Helm repository.

Step 4: Install RayCluster

# Create a RayCluster CR, and the KubeRay operator will reconcile a Ray cluster
# with 1 head Pod and 1 worker Pod.
helm install raycluster kuberay/ray-cluster --version 0.5.0 --set image.tag=2.2.0-py38-cpu

# Check RayCluster
kubectl get pod -l ray.io/cluster=raycluster-kuberay
# NAME                                          READY   STATUS    RESTARTS   AGE
# raycluster-kuberay-head-bz77b                 1/1     Running   0          64s
# raycluster-kuberay-worker-workergroup-8gr5q   1/1     Running   0          63s
  • This step uses rayproject/ray:2.2.0-py38-cpu as its image. Ray is very sensitive to the Python versions and Ray versions between the server (RayCluster) and client (JupyterLab) sides. This image uses:
    • Python 3.8.13
    • Ray 2.2.0

Step 5: Forward the port of Istio's Ingress-Gateway

  • Follow the instructions to forward the port of Istio's Ingress-Gateway and log in to Kubeflow Central Dashboard.

Step 6: Create a JupyterLab via Kubeflow Central Dashboard

  • Click "Notebooks" icon in the left panel.
  • Click "New Notebook"
  • Select kubeflownotebookswg/jupyter-scipy:v1.6.1 as OCI image.
  • Click "Launch"
  • Click "CONNECT" to connect into the JupyterLab instance.

Step 7: Use Ray client in the JupyterLab to connect to the RayCluster

Warning: Ray client has some known limitations and is not actively maintained.

  • As mentioned in Step 4, Ray is very sensitive to the Python versions and Ray versions between the server (RayCluster) and client (JupyterLab) sides. Open a terminal in the JupyterLab:
    # Check Python version. The version's MAJOR and MINOR should match with RayCluster (i.e. Python 3.8)
    python --version 
    # Python 3.8.10
    
    # Install Ray 2.2.0
    pip install -U ray[default]==2.2.0
  • Connect to RayCluster via Ray client.
    # Open a new .ipynb page.
    
    import ray
    # ray://${RAYCLUSTER_HEAD_SVC}.${NAMESPACE}.svc.cluster.local:${RAY_CLIENT_PORT}
    ray.init(address="ray://raycluster-kuberay-head-svc.default.svc.cluster.local:10001")
    print(ray.cluster_resources())
    # {'node:10.244.0.41': 1.0, 'memory': 3000000000.0, 'node:10.244.0.40': 1.0, 'object_store_memory': 805386239.0, 'CPU': 2.0}
    
    # Try Ray task
    @ray.remote
    def f(x):
        return x * x
    
    futures = [f.remote(i) for i in range(4)]
    print(ray.get(futures)) # [0, 1, 4, 9]
    
    # Try Ray actor
    @ray.remote
    class Counter(object):
        def __init__(self):
            self.n = 0
    
        def increment(self):
            self.n += 1
    
        def read(self):
            return self.n
    
    counters = [Counter.remote() for i in range(4)]
    [c.increment.remote() for c in counters]
    futures = [c.read.remote() for c in counters]
    print(ray.get(futures)) # [1, 1, 1, 1]