# Ray on an interactive compute cluster

In the [ray-on-compute-instance notebook](../1.ray-on-compute-instance/ray-on-compute-instance.ipynb), we learned how to start a local Ray cluster and run Ray scripts interactively on a compute instance.

In [ray-on-compute-cluster](../2.ray-on-compute-cluster/ray-on-compute-cluster.ipynb), we learned how to submit a distributed training job with a Ray cluster enabled to a multi-node Azure ML compute cluster.

In this notebook, we will learn how to combine these two scenarios to build an interactive, multi-node, heterogeneous Ray cluster.


## Prerequisites
To build an interactive, multi-node, heterogeneous Ray cluster, we need one compute instance as the head node and one or more CPU/GPU compute clusters as worker nodes.

The compute instance and the compute clusters must be placed in the same virtual network and subnet.

Please follow [this document](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-secure-training-vnet?view=azureml-api-2&tabs=cli%2Crequired) to set up one CPU compute instance and a two-node GPU compute cluster.

## Install required packages

More information about installing Ray can be found [here](https://docs.ray.io/en/latest/ray-overview/installation.html).

In [None]:
# Get and set the Python and Ray versions
import platform

# Use the module-qualified call so the function name is not shadowed by the variable
python_version = platform.python_version()
ray_version = '2.4.0'

In [1]:
%pip install --no-cache-dir \
 ../../private_wheel/azure_ai_ml-1.6.0a20230421002-py3-none-any.whl \
 'ray[default, air, tune]==2.4.0' \
 gpustat==1.0.0 \
 torch \
 torchvision

Processing /mnt/batch/tasks/shared/LS_root/mounts/clusters/ray/code/Users/daweil/ray/azureml-insiders/private_wheel/azure_ai_ml-1.6.0a20230421002-py3-none-any.whl
azure-ai-ml is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.
Note: you may need to restart the kernel to use updated packages.


## Start a Ray cluster on compute instance

We will use the current compute instance as the head node of the Ray cluster we are building.

In [2]:
import ray

dashboard_port = 8266

# Shut down any existing Ray instance before starting a new one
ray.shutdown()

ray_instance = ray.init(
    include_dashboard=True,
    dashboard_port=dashboard_port,
)
ray_instance

  from .autonotebook import tqdm as notebook_tqdm
2023-04-28 03:24:52,857	INFO worker.py:1616 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8266


| | |
|---|---|
| Python version: | 3.8.5 |
| Ray version: | 2.4.0 |
| Dashboard: | http://127.0.0.1:8266 |


## Attach worker nodes using compute cluster

After the head node has started, we can submit a job that starts only worker nodes by passing it the head node address.
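
Workers on other machines can only join if the head address is one they can route to. The sketch below is a minimal sanity check, assuming the address is an IPv4 `"ip:port"` string (the notebook later takes it from `ray_instance.address_info["address"]`); the helper names `split_head_address` and `is_non_loopback` are hypothetical and not part of Ray or Azure ML.

```python
import ipaddress


def split_head_address(address: str):
    """Split an "ip:port" string into (ip, port). IPv4 only in this sketch."""
    host, _, port = address.rpartition(":")
    return host, int(port)


def is_non_loopback(address: str) -> bool:
    """Return False for loopback addresses, which other machines cannot reach."""
    host, _ = split_head_address(address)
    return not ipaddress.ip_address(host).is_loopback


# Both addresses are illustrative values, not output captured from a cluster.
print(is_non_loopback("10.0.0.12:6379"))
print(is_non_loopback("127.0.0.1:6379"))
```

A loopback result would suggest that the head node was started with an address only visible on the compute instance itself.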

### Import required libraries

In [3]:
from azure.identity import DefaultAzureCredential

from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import Environment

### Connect to workspace using DefaultAzureCredential

`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios. 


In [4]:
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
    workspace = ml_client.workspace_name
    subscription_id = ml_client.workspaces.get(workspace).id.split("/")[2]
    resource_group = ml_client.workspaces.get(workspace).resource_group
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)

Found the config file in: /config.json
Class FeatureStoreOperations: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class FeatureSetOperations: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class FeatureStoreEntityOperations: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


## Build environment

We will use an Azure ML image and a conda YAML file to build an environment. More information about building environments can be found [here](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?view=azureml-api-2&tabs=python).

Because Ray requires an exact version match of both `python` and `ray` across nodes, let's generate a `conda.yml` file that matches the current kernel.
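
To make the version requirement concrete, here is a minimal sketch that compares two version descriptions and reports which entries differ. The function `version_mismatches` and the dictionary layout are hypothetical, introduced only for illustration; the worker values are made up.

```python
def version_mismatches(head: dict, worker: dict) -> list:
    """Return the keys ("python", "ray") whose versions differ between two nodes."""
    return [key for key in ("python", "ray") if head.get(key) != worker.get(key)]


# Head values are the ones this notebook reports; the worker values are hypothetical.
head = {"python": "3.8.5", "ray": "2.4.0"}
worker = {"python": "3.8.10", "ray": "2.4.0"}

# Only the patch-level Python difference is reported here.
print(version_mismatches(head, worker))
```

Generating `conda.yml` from the running kernel, as done in the next cell, avoids this kind of drift by construction.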


In [5]:
import platform

import yaml

# Get and set the Python and Ray versions so the workers match this kernel
python_version = platform.python_version()
ray_version = '2.4.0'

# safe_load avoids depending on the optional libyaml-backed CLoader
conda = yaml.safe_load(f"""
    name: ray-environment
    dependencies:
    - python={python_version}
    - pip:
        - ray[default, tune]=={ray_version}
        - torch
        - torchvision
""")

# Write to conda.yml file
with open('conda.yml', 'w') as conda_file:
    yaml.dump(conda, conda_file, default_flow_style=False)


# Build the environment from an Azure ML image and the conda.yml we just wrote
environment = Environment(
    image="mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.3-cudnn8-ubuntu20.04",
    conda_file="conda.yml",
)


### Configure and Run Command
In this section we will configure and run a `Command` job.

The `command` function lets the user configure the following key aspects.
- `command` - The command to run. In this example, we execute `sleep infinity`, which keeps the job running indefinitely instead of completing.
- `environment` - The environment the command needs to run. In this example, we use the environment we just built.
- `compute` - The compute on which the command will run. In this example, we specify the compute cluster we created in the same virtual network as the current compute instance.
- `instance_count` - The number of nodes to use for the job. In this example, we scale to `2` nodes.
- `distribution` - Distribution configuration for distributed training scenarios. In this example, we set the type to `ray`, so the Azure ML job engine sets up the Ray cluster automatically.
  - `port` - \[Optional\] The port of the head Ray process. The default is `6379`.
  - `address` - \[Optional\] The address of the Ray head node.
  - `worker_node_additional_args` - \[Optional\] Additional arguments passed to `ray start` on worker nodes.
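
The `distribution` options listed above can be assembled as a plain dictionary. The sketch below is one way to build it while leaving unset optional keys out; `build_ray_distribution` is a hypothetical helper, not part of the Azure ML SDK, and whether particular key combinations are accepted by the service is not verified here.

```python
from typing import Optional


def build_ray_distribution(
    address: Optional[str] = None,
    port: Optional[int] = None,
    worker_node_additional_args: Optional[str] = None,
) -> dict:
    """Build the `distribution` dictionary, omitting optional keys that are unset."""
    distribution = {"type": "ray"}
    if address is not None:
        distribution["address"] = address
    if port is not None:
        distribution["port"] = port
    if worker_node_additional_args is not None:
        distribution["worker_node_additional_args"] = worker_node_additional_args
    return distribution


# "10.0.0.12:6379" is a placeholder; in this notebook the value comes from
# ray_instance.address_info["address"].
print(build_ray_distribution(address="10.0.0.12:6379"))
```

The resulting dictionary has the same shape as the one passed to `command(distribution=...)` in the next cell.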

In [6]:
job = command(
    experiment_name="mnist_pytorch",
    command="sleep infinity",
    environment=environment,
    compute="gpu-cluster",
    instance_count=2,  # The compute cluster in this example has only 2 nodes.
    distribution={
        "type": "ray",
        "address": ray_instance.address_info["address"],  # [Optional] The address of the Ray head node
        # "worker_node_additional_args": "--verbose",  # [Optional] Additional arguments passed to `ray start` on worker nodes
    },
)

Field 'None': This is an experimental field, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class RayDistributionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class RayDistribution: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


### Submit the job

When we submit the command job, Azure ML scales up the compute cluster and the worker nodes connect to the head node.

In [7]:
active_job = ml_client.jobs.create_or_update(job)

active_job

Field 'distribution': This is an experimental field, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


| Experiment | Name | Type | Status | Details Page |
|---|---|---|---|---|
| mnist_pytorch | busy_fish_kybk6hbkzy | command | Starting | Link to Azure Machine Learning studio |
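
The job above is still in the `Starting` state, so the worker nodes may not have joined the cluster yet. One option is to poll the node list until enough nodes are alive before launching work. The sketch below assumes each node entry is a dictionary with an `"Alive"` key, as returned by `ray.nodes()`; the function `wait_for_nodes` and its parameters are hypothetical.

```python
import time
from typing import Callable, Dict, List


def wait_for_nodes(
    list_nodes: Callable[[], List[Dict]],
    expected: int,
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
) -> bool:
    """Poll until at least `expected` nodes report Alive, or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        alive = sum(1 for node in list_nodes() if node.get("Alive"))
        if alive >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# In this notebook the call might look like (1 head + 2 workers):
#     wait_for_nodes(ray.nodes, expected=3)
# A stand-in node list keeps the example runnable without a cluster.
fake_nodes = [{"Alive": True}, {"Alive": True}, {"Alive": False}]
print(wait_for_nodes(lambda: fake_nodes, expected=2, timeout_s=0))
print(wait_for_nodes(lambda: fake_nodes, expected=3, timeout_s=0, poll_s=0))
```

Passing the node-listing function as an argument keeps the helper independent of Ray, which makes it easy to try out with stand-in data.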


## Prepare the training script
We will continue to use the same PyTorch example from Ray:
[https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mnist_pytorch.py](https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mnist_pytorch.py)

The script has been downloaded to [src/mnist_pytorch.py](./src/mnist_pytorch.py).

We will run the application _interactively_ and see the output in real time.

In [8]:
%run src/mnist_pytorch.py --cuda

| | |
|---|---|
| Current time: | 2023-04-28 03:53:15 |
| Running for: | 00:28:00.61 |
| Memory: | 5.7/27.4 GiB |

| Trial name | status | loc | lr | momentum | acc | iter | total time (s) |
|---|---|---|---|---|---|---|---|
| train_mnist_4a723_00000 | TERMINATED | 10.0.0.4:211 | 0.000317946 | 0.436329 | 0.175 | 100 | 33.2725 |
| train_mnist_4a723_00001 | TERMINATED | 10.0.0.5:207 | 0.00123547 | 0.61251 | 0.065625 | 1 | 3.88125 |
| train_mnist_4a723_00002 | TERMINATED | 10.0.0.5:207 | 0.000565768 | 0.477196 | 0.0875 | 4 | 1.14075 |
| train_mnist_4a723_00003 | TERMINATED | 10.0.0.5:207 | 0.00843014 | 0.381995 | 0.86875 | 100 | 27.159 |
| train_mnist_4a723_00004 | TERMINATED | 10.0.0.4:211 | 0.000238998 | 0.889096 | 0.040625 | 1 | 0.329112 |
| train_mnist_4a723_00005 | TERMINATED | 10.0.0.4:211 | 0.00758023 | 0.647785 | 0.903125 | 100 | 27.8633 |
| train_mnist_4a723_00006 | TERMINATED | 10.0.0.5:207 | 0.00297821 | 0.534108 | 0.121875 | 1 | 0.311179 |
| train_mnist_4a723_00007 | TERMINATED | 10.0.0.5:207 | 0.00743872 | 0.479855 | 0.428125 | 4 | 1.13743 |
| train_mnist_4a723_00008 | TERMINATED | 10.0.0.5:207 | 0.000120783 | 0.240711 | 0.14375 | 1 | 0.311366 |
| train_mnist_4a723_00009 | TERMINATED | 10.0.0.5:207 | 0.00086271 | 0.786129 | 0.0875 | 1 | 0.314516 |


[2m[1m[36m(autoscaler +26s)[0m Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
[2m[1m[33m(autoscaler +26s)[0m Error: No available node types can fulfill resource request {'CPU': 2.0, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
[2m[1m[33m(autoscaler +1m1s)[0m Error: No available node types can fulfill resource request {'GPU': 1.0, 'CPU': 2.0}. Add suitable node types to this cluster to resolve this issue.
[2m[1m[33m(autoscaler +1m36s)[0m Error: No available node types can fulfill resource request {'GPU': 1.0, 'CPU': 2.0}. Add suitable node types to this cluster to resolve this issue.
[2m[1m[33m(autoscaler +2m12s)[0m Error: No available node types can fulfill resource request {'CPU': 2.0, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
[2m[1m[33m(autoscaler +2m47s)[0m Error: No available node types can fulfill resource request {'GPU': 1.0, 'CPU

[2m[33m(raylet, ip=10.0.0.5)[0m bash: /azureml-envs/azureml_215668d644dbddd7265ecc2b521fbe22/lib/libtinfo.so.6: no version information available (required by bash)
[2m[33m(raylet, ip=10.0.0.4)[0m bash: /azureml-envs/azureml_215668d644dbddd7265ecc2b521fbe22/lib/libtinfo.so.6: no version information available (required by bash)
100.0%36m(train_mnist pid=211, ip=10.0.0.4)[0m 


| Trial name | date | done | hostname | iterations_since_restore | mean_accuracy | node_ip | pid | time_since_restore | time_this_iter_s | time_total_s | timestamp | training_iteration | trial_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| train_mnist_4a723_00000 | 2023-04-28_03-52-32 | True | 7a537cc00f9f465e84e4fc26b8f5bbe9000002 | 100 | 0.175 | 10.0.0.4 | 211 | 33.2725 | 0.293424 | 33.2725 | 1682653952 | 100 | 4a723_00000 |
| train_mnist_4a723_00001 | 2023-04-28_03-52-03 | True | 7a537cc00f9f465e84e4fc26b8f5bbe9000003 | 1 | 0.065625 | 10.0.0.5 | 207 | 3.88125 | 3.88125 | 3.88125 | 1682653923 | 1 | 4a723_00001 |
| train_mnist_4a723_00002 | 2023-04-28_03-52-04 | True | 7a537cc00f9f465e84e4fc26b8f5bbe9000003 | 4 | 0.0875 | 10.0.0.5 | 207 | 1.14075 | 0.269105 | 1.14075 | 1682653924 | 4 | 4a723_00002 |
| train_mnist_4a723_00003 | 2023-04-28_03-52-33 | True | 7a537cc00f9f465e84e4fc26b8f5bbe9000003 | 100 | 0.86875 | 10.0.0.5 | 207 | 27.159 | 0.26853 | 27.159 | 1682653953 | 100 | 4a723_00003 |
| train_mnist_4a723_00004 | 2023-04-28_03-52-33 | True | 7a537cc00f9f465e84e4fc26b8f5bbe9000002 | 1 | 0.040625 | 10.0.0.4 | 211 | 0.329112 | 0.329112 | 0.329112 | 1682653953 | 1 | 4a723_00004 |
| train_mnist_4a723_00005 | 2023-04-28_03-53-00 | True | 7a537cc00f9f465e84e4fc26b8f5bbe9000002 | 100 | 0.903125 | 10.0.0.4 | 211 | 27.8633 | 0.264632 | 27.8633 | 1682653980 | 100 | 4a723_00005 |
| train_mnist_4a723_00006 | 2023-04-28_03-52-34 | True | 7a537cc00f9f465e84e4fc26b8f5bbe9000003 | 1 | 0.121875 | 10.0.0.5 | 207 | 0.311179 | 0.311179 | 0.311179 | 1682653954 | 1 | 4a723_00006 |
| train_mnist_4a723_00007 | 2023-04-28_03-52-35 | True | 7a537cc00f9f465e84e4fc26b8f5bbe9000003 | 4 | 0.428125 | 10.0.0.5 | 207 | 1.13743 | 0.269109 | 1.13743 | 1682653955 | 4 | 4a723_00007 |
| train_mnist_4a723_00008 | 2023-04-28_03-52-35 | True | 7a537cc00f9f465e84e4fc26b8f5bbe9000003 | 1 | 0.14375 | 10.0.0.5 | 207 | 0.311366 | 0.311366 | 0.311366 | 1682653955 | 1 | 4a723_00008 |
| train_mnist_4a723_00009 | 2023-04-28_03-52-36 | True | 7a537cc00f9f465e84e4fc26b8f5bbe9000003 | 1 | 0.0875 | 10.0.0.5 | 207 | 0.314516 | 0.314516 | 0.314516 | 1682653956 | 1 | 4a723_00009 |


2023-04-28 03:53:16,167	INFO tune.py:945 -- Total run time: 1681.56 seconds (1680.60 seconds for the tuning loop).


Best config is: {'lr': 0.007336481579534608, 'momentum': 0.7151465086180758}


### Show Ray cluster resources

In [9]:
ray.cluster_resources()

{'object_store_memory': 25599780454.0,
 'accelerator_type:P40': 2.0,
 'GPU': 2.0,
 'memory': 227431527630.0,
 'node:10.0.0.5': 1.0,
 'CPU': 16.0,
 'node:10.0.0.4': 1.0,
 'node:10.0.0.12': 1.0}
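
The resource dictionary can be condensed into a quick summary to confirm that the head node and both workers are present. This is a minimal sketch; `summarize_resources` is a hypothetical helper, and the `node:<ip>` key layout is taken from the output above and may differ in other Ray versions.

```python
def summarize_resources(resources: dict) -> dict:
    """Count nodes, CPUs, and GPUs in a dict shaped like ray.cluster_resources()."""
    node_ips = sorted(
        key.split(":", 1)[1] for key in resources if key.startswith("node:")
    )
    return {
        "nodes": len(node_ips),
        "node_ips": node_ips,
        "cpus": resources.get("CPU", 0.0),
        "gpus": resources.get("GPU", 0.0),
    }


# Values copied from the output above.
resources = {
    "object_store_memory": 25599780454.0,
    "accelerator_type:P40": 2.0,
    "GPU": 2.0,
    "memory": 227431527630.0,
    "node:10.0.0.5": 1.0,
    "CPU": 16.0,
    "node:10.0.0.4": 1.0,
    "node:10.0.0.12": 1.0,
}
summary = summarize_resources(resources)
print(summary["nodes"], summary["cpus"], summary["gpus"])
```

For the values shown, the summary reports three nodes with 16 CPUs and 2 GPUs in total.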

## Shut down the head and worker nodes

In [10]:
# Shut down the head node
ray.shutdown()

# Cancelling the worker job automatically shuts down the worker nodes
poller = ml_client.jobs.begin_cancel(name=active_job.name)

# Wait until the job is cancelled
poller.wait()

[2m[36m(train_mnist pid=207, ip=10.0.0.5)[0m Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz[32m [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)[0m
[2m[36m(train_mnist pid=211, ip=10.0.0.4)[0m Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to /root/data/MNIST/raw/t10k-labels-idx1-ubyte.gz[32m [repeated 7x across cluster][0m
[2m[36m(train_mnist pid=211, ip=10.0.0.4)[0m Extracting /root/data/MNIST/raw/t10k-labels-idx1-ubyte.gz to /root/data/MNIST/raw[32m [repeated 7x across cluster][0m
[2m[36m(train_mnist pid=211, ip=10.0.0.4)[0m [32m [repeated 6x across cluster][0m


[2m[36m(train_mnist pid=211, ip=10.0.0.4)[0m 100.0%[32m [repeated 8x across cluster][0m
