# Ray on Kubernetes

In [ray-on-compute-cluster](../2.ray-on-compute-cluster/ray-on-compute-cluster.ipynb), we learned how to submit a distributed training job with a Ray cluster enabled to multi-node Azure ML compute clusters.

Because Azure ML supports [Kubernetes as a compute target](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-attach-kubernetes-anywhere?view=azureml-api-2), we can also submit a job with a Ray cluster enabled to an existing Azure Kubernetes Service (AKS) cluster or Azure Arc-enabled Kubernetes (Arc Kubernetes) cluster.


## Prerequisites

Once a cluster extension is deployed on an AKS or Arc Kubernetes cluster, Azure Machine Learning can use that cluster to run training or inference workloads. To enable an existing Kubernetes cluster for Azure Machine Learning workloads, follow these steps:
1. Prepare an [Azure Kubernetes Service cluster](https://learn.microsoft.com/en-us/azure/aks/learn/quick-kubernetes-deploy-cli) or [Arc Kubernetes cluster](https://learn.microsoft.com/en-us/azure/azure-arc/kubernetes/quickstart-connect-cluster).
2. Deploy the [Azure Machine Learning extension](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-kubernetes-extension).


## Attach Kubernetes cluster to your Azure ML workspace

### Import required libraries

In [None]:
from azure.identity import DefaultAzureCredential

from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import Environment

### Connect to workspace using DefaultAzureCredential

`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios. 

In [None]:
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
    workspace = ml_client.workspace_name
    # Fetch the workspace once and read both values from it
    ws = ml_client.workspaces.get(workspace)
    subscription_id = ws.id.split("/")[2]  # ARM ID: /subscriptions/<id>/...
    resource_group = ws.resource_group
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)
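
The subscription ID above is read from the workspace's ARM resource ID, which follows the pattern `/subscriptions/<id>/resourceGroups/<name>/providers/...`. As a minimal sketch, the helper below splits such an ID into named parts, which is a little more robust than indexing by position. Note that `parse_arm_id` is a hypothetical name, not part of the Azure SDK, and the sample ID uses placeholder values.

```python
def parse_arm_id(resource_id: str) -> dict:
    """Split an ARM resource ID into named parts.

    Hypothetical helper for illustration; not an Azure SDK function.
    """
    parts = resource_id.strip("/").split("/")
    if (
        len(parts) < 4
        or parts[0].lower() != "subscriptions"
        or parts[2].lower() != "resourcegroups"
    ):
        raise ValueError(f"Not an ARM resource ID: {resource_id!r}")
    return {
        "subscription_id": parts[1],
        "resource_group": parts[3],
        # Everything after "providers" identifies the resource type and name
        "provider_path": "/".join(parts[5:]),
    }


# Placeholder ID, made up for this example
example_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/my-rg"
    "/providers/Microsoft.MachineLearningServices/workspaces/my-ws"
)
parsed = parse_arm_id(example_id)
print(parsed["subscription_id"], parsed["resource_group"])
```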

### Attach to workspace


In [None]:
from azure.ai.ml import load_compute

compute_name = "aks-cluster"

try:
    ml_client.compute.get(compute_name)
    print("Found attached Kubernetes cluster")
except Exception:
    print("Attaching Kubernetes cluster...")
    # Replace the placeholders with the details of your own AKS cluster
    aks_cluster_id = "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.ContainerService/managedClusters/<CLUSTER_NAME>"

    compute_params = [
        {"name": compute_name},
        {"type": "kubernetes"},
        {
            "resource_id": aks_cluster_id
        },
    ]
    k8s_compute = load_compute(source=None, params_override=compute_params)
    ml_client.compute.begin_create_or_update(k8s_compute).result()
    print("Kubernetes cluster is ready to use.")
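
The AKS resource ID passed as `resource_id` follows the pattern `/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.ContainerService/managedClusters/<CLUSTER_NAME>`. A minimal sketch, assuming you know the subscription, resource group, and cluster name: the hypothetical helper `build_attach_params` (not an SDK function) assembles the ID and returns a list in the same shape as `compute_params` in the cell above. All values in the usage line are placeholders.

```python
def build_attach_params(compute_name, subscription_id, resource_group, cluster_name):
    """Return a params_override list for attaching an AKS cluster.

    Hypothetical helper; mirrors the structure of `compute_params` above.
    """
    aks_cluster_id = (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.ContainerService/managedClusters/{cluster_name}"
    )
    return [
        {"name": compute_name},
        {"type": "kubernetes"},
        {"resource_id": aks_cluster_id},
    ]


# Placeholder values; substitute the details of your own cluster
params = build_attach_params(
    "aks-cluster", "<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<CLUSTER_NAME>"
)
print(params[2]["resource_id"])
```

The resulting list could then be passed as `params_override` to `load_compute`, in the same way as in the cell above.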


## Prepare the training script
We use the PyTorch example from Ray:
[https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mnist_pytorch.py](https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mnist_pytorch.py)

The script has been downloaded to [src/mnist_pytorch.py](./src/mnist_pytorch.py).

## Configure and Run Command

In this section we configure and run a distributed training `Command` job.

The `command` function lets you configure the following key aspects.
- `code` - The path where the code to run the command is located.
- `command` - The command to run. In this example, we execute the `mnist_pytorch.py` script downloaded from the [Ray GitHub repo](https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mnist_pytorch.py).
- `environment` - The environment needed for the command to run. In this example, we use the `rayproject/ray-ml` image provided by the Ray project.
- `compute` - The compute on which the command will run. In this example, compute is set to the Kubernetes cluster we just attached.
- `instance_count` - The number of nodes to use for the job. In this example, we use `2` nodes.
- `distribution` - Distribution configuration for distributed training scenarios. In this example, we set it to `ray`, and the Azure ML job engine sets up the Ray cluster automatically.
  - `port` - \[Optional\] The port of the head Ray process. Default is `6379`.
  - `address` - \[Optional\] The address of the Ray head node.
  - `include_dashboard` - \[Optional\] Provide this argument to start the Ray dashboard GUI. Default is `True`.
  - `dashboard_port` - \[Optional\] The port to bind the dashboard server to. Default is `8265`.
  - `head_node_additional_args` - \[Optional\] Additional arguments passed to `ray start` on the head node.
  - `worker_node_additional_args` - \[Optional\] Additional arguments passed to `ray start` on the worker nodes.
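
To make the `distribution` options concrete, here is a rough sketch of how they could translate into `ray start` invocations on the head and worker nodes. This is an illustration only: how Azure ML actually launches Ray is not described here, and both `ray_start_commands` and the `head_ip` value are hypothetical. The flags themselves (`--head`, `--port`, `--address`, `--include-dashboard`, `--dashboard-port`) are standard `ray start` options. The `address` option is left out of the sketch because its exact format is not specified above.

```python
def ray_start_commands(distribution: dict, head_ip: str):
    """Build illustrative `ray start` command lines from a distribution dict.

    Hypothetical helper; falls back to the documented defaults when a key is absent.
    """
    port = distribution.get("port", 6379)
    include_dashboard = distribution.get("include_dashboard", True)
    dashboard_port = distribution.get("dashboard_port", 8265)

    head = [
        "ray", "start", "--head",
        f"--port={port}",
        f"--include-dashboard={str(include_dashboard).lower()}",
        f"--dashboard-port={dashboard_port}",
    ]
    # Workers join the cluster by pointing at the head node's address and port
    worker = ["ray", "start", f"--address={head_ip}:{port}"]

    # Extra arguments are appended verbatim to the respective command
    head += distribution.get("head_node_additional_args", "").split()
    worker += distribution.get("worker_node_additional_args", "").split()
    return " ".join(head), " ".join(worker)


# head_ip is a made-up address for this example
head_cmd, worker_cmd = ray_start_commands({"type": "ray"}, head_ip="10.0.0.4")
print(head_cmd)
print(worker_cmd)
```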

In [None]:
job = command(
    experiment_name="mnist_pytorch",
    code="./src",  # local path where the code is stored
    command="python mnist_pytorch.py;",
    environment=Environment(
        image="rayproject/ray-ml:2.4.0-py38-cpu"
    ),
    compute="aks-cluster",
    instance_count=2,  # Number of nodes; this example uses a 2-node cluster.
    distribution={
        "type": "ray",
        # "port": 6379, # [Optional] The port of the head Ray process.
        # "include_dashboard": True, # [Optional] Whether to start the Ray dashboard.
        # "dashboard_port": 8265, # [Optional] The port to bind the dashboard server to.
        # "head_node_additional_args": "--verbose", # [Optional] Additional arguments passed to `ray start` on the head node.
        # "worker_node_additional_args": "--verbose", # [Optional] Additional arguments passed to `ray start` on the worker nodes.
    },
)

## Submit the job

In [None]:
active_job = ml_client.jobs.create_or_update(job)

active_job
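
A submitted job does not start running immediately, so it can help to wait for a given status before continuing. A minimal sketch of a polling loop follows. `wait_for_status` is a hypothetical helper, and the status fetch is passed in as a callable so the logic can be tried without a workspace. The status strings are assumed to match those reported by Azure ML.

```python
import time


def wait_for_status(get_status, targets=("Running",),
                    terminal=("Completed", "Failed", "Canceled"),
                    interval=10.0, timeout=1800.0, sleep=time.sleep):
    """Poll get_status() until it returns a target or terminal status.

    Hypothetical helper. Returns the last status seen; raises TimeoutError
    if neither a target nor a terminal status is reached in time.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status in targets or status in terminal:
            return status
        if waited >= timeout:
            raise TimeoutError(f"Job still in status {status!r} after {timeout} seconds")
        sleep(interval)
        waited += interval


# Dry run with a fake status sequence instead of a real job
fake_statuses = iter(["Queued", "Preparing", "Running"])
result = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(result)
```

With a real job, `get_status` could be something like `lambda: ml_client.jobs.get(active_job.name).status`.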

## View Ray Dashboard

### Retrieve Ray dashboard link through SDK
Once the job is **running**, you can get the dashboard link from `job.services`.

In [None]:
from IPython.display import display

# Refresh the job; the dashboard endpoint is available once the job is running
active_job = ml_client.jobs.get(active_job.name)

dashboard_url = active_job.services['ray-dashboard'].endpoint.replace('<nodeIndex>', '0')
display({'text/html': f'Ray Dashboard: <a href="{dashboard_url}" rel="noreferrer">{dashboard_url}</a>'}, raw=True)
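
If the cell above runs before the job is running, the `ray-dashboard` entry may not be present yet. A defensive variant is sketched below. `get_dashboard_url` is a hypothetical helper, and the stand-in object and example endpoint are made up purely to exercise it.

```python
from types import SimpleNamespace


def get_dashboard_url(services, node_index=0):
    """Return the Ray dashboard URL for a node, or None if not yet available.

    Hypothetical helper; `services` is expected to behave like `job.services`,
    a mapping from service name to an object with an `endpoint` attribute.
    """
    service = (services or {}).get("ray-dashboard")
    if service is None or not service.endpoint:
        return None
    # The endpoint is a template; fill in the index of the node to view
    return service.endpoint.replace("<nodeIndex>", str(node_index))


# Stand-in for job.services with a made-up endpoint
fake_services = {
    "ray-dashboard": SimpleNamespace(
        endpoint="https://example.invalid/nodes/<nodeIndex>/dashboard"
    )
}
print(get_dashboard_url(fake_services))
print(get_dashboard_url({}))
```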