# Ray on compute cluster

In the [ray-on-compute-instance notebook](../1.ray-on-compute-instance/ray-on-compute-instance.ipynb), you learned how to start a local Ray cluster and interactively run a Ray script on a compute instance.

In this notebook, you will learn how to submit a distributed training job, with a Ray cluster enabled, to a multi-node Azure ML compute cluster.

You should have completed the Azure Machine Learning Tutorial: [Get started creating your first ML experiment with the Python SDK](https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-1st-experiment-sdk-setup).

## Install required packages

In [None]:
%pip install "azure-ai-ml>=1.6.0"

## Import required libraries

In [None]:
# import required libraries
from azure.identity import DefaultAzureCredential

from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import Environment, BuildContext

## Connect to workspace using DefaultAzureCredential
`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios. 

If it does not work for you, see the [azure-identity reference doc](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python) for other available credentials.

In [None]:
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
    workspace = ml_client.workspace_name
    # Fetch the workspace once and read both values from it
    ws = ml_client.workspaces.get(workspace)
    subscription_id = ws.id.split("/")[2]
    resource_group = ws.resource_group
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"

    ml_client = MLClient(credential, subscription_id, resource_group, workspace)
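
The `split("/")[2]` lookup in the cell above works because a workspace `id` is an Azure resource ID with a fixed segment layout. As a minimal sketch, assuming the usual `/subscriptions/.../resourceGroups/.../providers/.../workspaces/...` shape, the hypothetical helper `parse_workspace_id` below names each part. The ID values are made-up placeholders, not a real workspace.

```python
from typing import Dict


def parse_workspace_id(resource_id: str) -> Dict[str, str]:
    """Hypothetical helper: split a workspace resource ID into named parts."""
    parts = resource_id.strip("/").split("/")
    # Segments alternate between a key and its value; keys are lowercased
    # so that "resourceGroups" and "resourcegroups" are treated the same.
    pairs = {key.lower(): value for key, value in zip(parts[0::2], parts[1::2])}
    return {
        "subscription_id": pairs["subscriptions"],
        "resource_group": pairs["resourcegroups"],
        "workspace": pairs["workspaces"],
    }


# Placeholder ID for illustration only
example_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/my-rg"
    "/providers/Microsoft.MachineLearningServices"
    "/workspaces/my-workspace"
)

# Index 0 is the empty string before the leading "/", index 1 is
# "subscriptions", so index 2 is the subscription ID itself.
print(example_id.split("/")[2])
print(parse_workspace_id(example_id))
```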

## Prepare the training script
We will use the PyTorch example from Ray:
[https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mnist_pytorch.py](https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mnist_pytorch.py)

The script has been downloaded to [src/mnist_pytorch.py](./src/mnist_pytorch.py).

## Configure and Run Command

In this section, we configure and run a distributed training `Command` job.

The `command` function lets you configure the following key aspects:
- `code` - The path where the code to run the command is located.
- `command` - The command that needs to be run. In this example, we run `mnist_pytorch.py`, downloaded from the [Ray GitHub repo](https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mnist_pytorch.py).
- `environment` - The environment needed for the command to run. In this example, we use the `rayproject/ray-ml` image provided by Ray.
- `compute` - The compute on which the command will run. In this example, `compute` is not specified, so the job runs on `serverless` compute.
- `instance_type` - The VM size of the `serverless` compute. In this example, we use `Standard_DS3_v2`, a CPU VM size.
- `instance_count` - The number of nodes to use for the job. In this example, we scale to `2` nodes.
- `distribution` - Distribution configuration for distributed training scenarios. In this example, we set its type to `ray`, and the Azure ML job engine sets up the Ray cluster automatically.
  - `port` - \[Optional\] The port of the head Ray process. Default is `6379`.
  - `address` - \[Optional\] The address of the Ray head node.
  - `include_dashboard` - \[Optional\] Provide this argument to start the Ray dashboard GUI. Default is `True`.
  - `dashboard_port` - \[Optional\] The port to bind the dashboard server to. Default is `8265`.
  - `head_node_additional_args` - \[Optional\] Additional arguments passed to `ray start` on the head node.
  - `worker_node_additional_args` - \[Optional\] Additional arguments passed to `ray start` on the worker nodes.
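
To make the optional `distribution` keys concrete, here is a minimal sketch of how such a dictionary could be assembled. `ray_distribution` is a hypothetical helper written for this notebook, not part of the Azure ML SDK; it only emits the keys listed above and leaves out anything you do not set, so the documented defaults apply.

```python
from typing import Any, Dict, Optional


def ray_distribution(
    port: Optional[int] = None,
    address: Optional[str] = None,
    include_dashboard: Optional[bool] = None,
    dashboard_port: Optional[int] = None,
    head_node_additional_args: Optional[str] = None,
    worker_node_additional_args: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build a `distribution` dict for a Ray job.

    Options left as None are omitted, so the service defaults are used.
    """
    options = {
        "port": port,
        "address": address,
        "include_dashboard": include_dashboard,
        "dashboard_port": dashboard_port,
        "head_node_additional_args": head_node_additional_args,
        "worker_node_additional_args": worker_node_additional_args,
    }
    distribution: Dict[str, Any] = {"type": "ray"}
    # Keep explicit False values (e.g. include_dashboard=False); drop only None
    distribution.update({k: v for k, v in options.items() if v is not None})
    return distribution


# Minimal configuration: only the distribution type is set
print(ray_distribution())

# Override a single option (8266 is an arbitrary example value)
print(ray_distribution(dashboard_port=8266))
```

The resulting dictionary has the same shape as the `distribution` argument passed to `command` in this notebook.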

In [None]:
job = command(
    experiment_name="mnist_pytorch",
    code="./src",  # local path where the code is stored
    command="python mnist_pytorch.py;",
    environment=Environment(
        image="rayproject/ray-ml:2.4.0-py38-cpu"
    ),
    # environment="ray-env",
    # compute="azureml:cpu-cluster",
    instance_type="Standard_DS3_v2",
    instance_count=2,  # number of nodes; this example uses a 2-node cluster
    distribution={
        "type": "ray",
        # "port": 6379, # [Optional] The port of the head ray process.
        # "include_dashboard": True, # [Optional] Whether to start the Ray dashboard GUI.
        # "dashboard_port": 8265, # [Optional] The port to bind the dashboard server to.
        # "head_node_additional_args": "--verbose", # [Optional] Additional arguments passed to ray start in head node.
        # "worker_node_additional_args": "--verbose", # [Optional] Additional arguments passed to ray start in worker node.
    },
)

## Submit the job

In [None]:
active_job = ml_client.jobs.create_or_update(job)

active_job

## View Ray Dashboard

### Retrieve Ray dashboard link through SDK
Once the job is **running**, you can get the link from `job.services`.

In [None]:
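import time
from IPython.display import display

# Poll until the job is running (or has already finished);
# the Ray dashboard link is only available once the job has started running
active_job = ml_client.jobs.get(active_job.name)
while active_job.status not in ("Running", "Completed", "Failed", "Canceled"):
    time.sleep(10)
    active_job = ml_client.jobs.get(active_job.name)

# The endpoint contains a <nodeIndex> placeholder; use node 0 for the dashboard
dashboard_url = active_job.services['ray-dashboard'].endpoint.replace('<nodeIndex>', '0')
display({'text/html': f'Ray Dashboard: <a href="{dashboard_url}" rel="noreferrer">{dashboard_url}</a>'}, raw=True)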
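
The endpoint returned in `services` is a template with a `<nodeIndex>` placeholder, which the cell above fills with `0`. The sketch below isolates that substitution in a hypothetical helper, `dashboard_url_for_node`. The template string is a made-up placeholder, not a real Azure ML endpoint.

```python
def dashboard_url_for_node(endpoint_template: str, node_index: int = 0) -> str:
    """Hypothetical helper: fill the `<nodeIndex>` placeholder in an endpoint."""
    if node_index < 0:
        raise ValueError("node_index must be non-negative")
    return endpoint_template.replace("<nodeIndex>", str(node_index))


# Made-up endpoint template, for illustration only
template = "https://example.invalid/dashboard/<nodeIndex>/"
print(dashboard_url_for_node(template))
```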

### View Ray dashboard from Azure ML Studio

## Submit the job through Azure ML CLI

We can also submit the same job through the Azure ML CLI by running `az ml job create`. More information about submitting a job through the CLI can be found [here](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-model?view=azureml-api-2&tabs=azurecli#4-submit-the-training-job).

Here's the equivalent YAML file:
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
experiment_name: mnist_pytorch
code: ./src
command: python mnist_pytorch.py;
environment:
  image: "rayproject/ray-ml:2.4.0-py38-cpu"
# compute: azureml:cpu-cluster
resources:
  instance_type: Standard_DS3_v2
  instance_count: 2
distribution:
  type: ray
  # port: 6379
  # include_dashboard: True
  # dashboard_port: 8265
  # head_node_additional_args: "--verbose"
  # worker_node_additional_args: "--verbose"
```