# Ray and MLflow

[MLflow](https://www.mlflow.org/) is an open-source framework that's designed to manage the complete machine learning lifecycle. Its ability to train and serve models on different platforms allows you to use a consistent set of tools regardless of where your experiments are running: locally on your computer, on a remote compute target, on a virtual machine, or on an Azure Machine Learning compute instance.

Azure Machine Learning workspaces are **MLflow-compatible**, which means you can use Azure Machine Learning workspaces in the same way that you'd use an MLflow server. See [MLflow and Azure Machine Learning](https://learn.microsoft.com/en-us/azure/machine-learning/concept-mlflow?view=azureml-api-2&viewFallbackFrom=azureml-api-1) for all supported MLflow and Azure Machine Learning functionality including MLflow Project support (preview) and model deployment.


In [ray-on-compute-cluster](../2.ray-on-compute-cluster/ray-on-compute-cluster.ipynb), we learned how to submit a distributed training job with Ray cluster enabled onto multi-nodes Azure ML compute clusters.

In this notebook, we would show an example of Ray Tune, MLflow and Azure ML integration.



## Install required packages

In [None]:
%pip install "azure-ai-ml>=1.6.0"

## Import required libraries

In [1]:
# import required libraries
from azure.identity import DefaultAzureCredential

from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import Environment, BuildContext

## Connect to workspace using DefaultAzureCredential
`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios. 

Reference for more available credentials if it does not work for you: [azure-identity reference doc](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [3]:
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
    workspace = ml_client.workspace_name
    subscription_id = ml_client.workspaces.get(workspace).id.split("/")[2]
    resource_group = ml_client.workspaces.get(workspace).resource_group
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"
    
    
    subscription_id = "381b38e9-9840-4719-a5a0-61d9585e1e91"
    resource_group = "daweil_canary"
    workspace = "daweil_canary"

    ml_client = MLClient(credential, subscription_id, resource_group, workspace)

We could not find config.json in: . or in its parent directories. Please provide the full path to the config file or ensure that config.json exists in the parent directories.


## Build training environment

We would use Azure ML image and a conda yaml file to build an environment. More info about how to build environment could be found [here](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-v2?view=azureml-api-2&tabs=python).

**`azureml-mlflow`** package is required for MLflow and Azure ML integration.

In [4]:
import yaml
from platform import python_version

# Get and set python and ray version
python_version = python_version()
ray_version = '2.4.0'

conda = yaml.load(f"""
    name: ray-environment
    dependencies:
    - python={python_version}
    - pip:
        - ray[default, tune]=={ray_version}
        - azureml-mlflow
        - torch
        - torchvision
""", Loader=yaml.CLoader)

# Write to conda.yml file
with open('conda.yml', 'w') as conda_file:
    yaml.dump(conda, conda_file, default_flow_style=False)


# Build environment using AzureML image and conda.yml we built
environment=Environment(
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file="conda.yml"
)

## Enable `MLflow` tracking

Ray support multiple integration with `MLflow`. We would use `MLflowLoggerCallback` with Ray Tune here.

Follow [Use MLflow with Tune](https://docs.ray.io/en/latest/tune/examples/tune-mlflow.html) document to modify the training script.

In this example, we would add each trial as nested run of the Command job we are going to submit.

Here's the modification we need to make to enable `MLflow` tracking.

```python
from ray.air.integrations.mlflow import MLflowLoggerCallback
import mlflow
from mlflow.utils.mlflow_tags import MLFLOW_PARENT_RUN_ID

# get or start mlflow run.
current_run = mlflow.active_run()
if(current_run is None):
    current_run = mlflow.start_run()

tuner = tune.Tuner(
    run_config=air.RunConfig(
        # Enable MLflow by using MLflowLoggerCallback
        callbacks=[MLflowLoggerCallback(
            tags={
                MLFLOW_PARENT_RUN_ID: current_run.info.run_id # each trial would be added as nested run.
            })],
        # ... other run config
    ),
    # .. other configs
)
```

## Configure and Run Command

In this section we will be configuring and running a distributed training `Command` job.

The `command` allows user to configure the following key aspects.
- `code` - This is the path where the code to run the command is located.
- `command` - This is the command that needs to be run. In this example, we would execute `mnist_pytorch.py` we downloaded from [ray github repo](https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mnist_pytorch.py).
- `environment` - This is the environment needed for the command to run. In this example, we would use the environment we just build.
- `compute` - The compute on which the command will run. In this example, compute is not specified which means it would use `serverless` compute.
- `instance_type` - VMSize of the `serverless` compute. In this example, we would use `Standard_DS3_v2` cpu cluster.
- `instance_count` - The number of nodes to use for the job. In this example, we would scale `2` nodes.
- `shm_size` - Size of the docker container's shared memory block. 
- `distribution` - Distribution configuration for distributed training scenarios. In this example, we would set it to `ray`. Azure ML job engine would setup Ray cluster automatically.
  - `port` - \[Optional\] The port of the head ray process. Default is `6379`
  - `address` - \[Optional\] The address of Ray head node.
  - `include_dashboard` - \[Optional\] Provide this argument to start the Ray dashboard GUI. Default is `True`
  - `dashboard_port` - \[Optional\] The port to bind the dashboard server to. Default is `8265`
  - `head_node_additional_args` - \[Optional\] Additional arguments passed to ray start in head node.
  - `worker_node_additional_args` - \[Optional\] Additional arguments passed to ray start in worker node.

In [5]:
job = command(
    experiment_name="mnist_pytorch_mlflow",
    code="./src",  # local path where the code is stored
    command="python mnist_pytorch_mlflow.py;",
    environment=Environment(
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
        conda_file="conda.yml"
    ),
    # compute="azureml:cpu-cluster",
    instance_type="Standard_DS3_v2",
    instance_count=2,  # In this, only 2 node cluster was created.
    shm_size="4g", # ~30% of 14G node memory
    distribution={
        "type": "ray",
        # "port": 6379, # [Optional] The port of the head ray process.
        # "include_dashboard": True, # [Optional] The port of the head ray process.
        # "dashboard_port": 8265, # [Optional] The port of the head ray process.
        # "head_node_additional_args": "--verbose", # [Optional] Additional arguments passed to ray start in head node.
        # "worker_node_additional_args": "--verbose", # [Optional] Additional arguments passed to ray start in head node.
    },
)

Field 'None': This is an experimental field, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class RayDistributionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class RayDistribution: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


## Submit the job

In [14]:
active_job = ml_client.jobs.create_or_update(job)

active_job

Experiment,Name,Type,Status,Details Page
mnist_pytorch_mlflow,gray_actor_wvr4gqpny6,command,Preparing,Link to Azure Machine Learning studio


In [15]:

active_job = ml_client.jobs.get(active_job.name)
active_job.services

{'Tracking': {'endpoint': 'azureml://eastus2euap.api.azureml.ms/mlflow/v1.0/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/daweil_canary/providers/Microsoft.MachineLearningServices/workspaces/daweil_canary?', 'type': 'Tracking'},
 'Studio': {'endpoint': 'https://ml.azure.com/runs/gray_actor_wvr4gqpny6?wsid=/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/daweil_canary/workspaces/daweil_canary&tid=72f988bf-86f1-41af-91ab-2d7cd011db47', 'type': 'Studio'}}

## Use Azure ML Studio to explorer MLflow runs

We can use [Azure ML Studio](https://ml.azure.com/) to explorer MLflow runs. All trials are displayed inside of the command job's **child jobs** tab.
![Using Studio to explorer MFflow runs](./assets/mlflow_studio.png)