# Multi-Node Training with Hugging Face accelerate and AzureML

Reference: https://nateraw.com/posts/multinode_training_accelerate_azureml.html

This notebook shows a basic example of multi-node distributed training.

[Note] Please use the `Python 3.10 - SDK v2 (azureml_py310_sdkv2)` conda environment.

## Step 1 - Load config file

---


In [11]:
%load_ext autoreload
%autoreload 2

import time
import json
import ipykernel
    
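def check_kernel():
    # Path to the JSON connection file of the running kernel
    connection_file = ipykernel.connect.get_connection_file()

    with open(connection_file, 'r') as f:
        data = json.load(f)

    # Fall back to an empty string if the key is missing from the connection file
    kernel_name = data.get("kernel_name", "")
    if kernel_name == "":
        print("Select kernel first!")
    else:
        print(f"Kernel: {kernel_name}")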

check_kernel()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Kernel: python310-sdkv2


In [12]:
import os
import yaml
from logger import logger
from datetime import datetime
snapshot_date = datetime.now().strftime("%Y-%m-%d")

with open('config.yml') as f:
    # safe_load only builds plain Python objects, which is all this config needs
    d = yaml.safe_load(f)
    
AZURE_SUBSCRIPTION_ID = d['config']['AZURE_SUBSCRIPTION_ID']
AZURE_RESOURCE_GROUP = d['config']['AZURE_RESOURCE_GROUP']
AZURE_WORKSPACE = d['config']['AZURE_WORKSPACE']
AZURE_DATA_NAME = d['config']['AZURE_DATA_NAME']    
USE_LOWPRIORITY_VM = d['config']['USE_LOWPRIORITY_VM']

use_builtin_env = d['train']['use_builtin_env']  
azure_env_name = d['train']['azure_env_name']  
azure_compute_cluster_name = d['train']['azure_compute_cluster_name']
azure_compute_cluster_size = d['train']['azure_compute_cluster_size']
num_training_nodes = d['train']['num_training_nodes']
experiment_name = d['train']['experiment_name']    
    
logger.info("===== 1. Azure ML Training Info =====")

logger.info(f"--- Global Config")
logger.info(f"AZURE_SUBSCRIPTION_ID={AZURE_SUBSCRIPTION_ID}")
logger.info(f"AZURE_RESOURCE_GROUP={AZURE_RESOURCE_GROUP}")
logger.info(f"AZURE_WORKSPACE={AZURE_WORKSPACE}")
logger.info(f"AZURE_DATA_NAME={AZURE_DATA_NAME}")
logger.info(f"USE_LOWPRIORITY_VM={USE_LOWPRIORITY_VM}")

logger.info(f"--- Train Config")
logger.info(f"use_builtin_env={use_builtin_env}")
logger.info(f"azure_env_name={azure_env_name}")
logger.info(f"azure_compute_cluster_name={azure_compute_cluster_name}")
logger.info(f"azure_compute_cluster_size={azure_compute_cluster_size}")
logger.info(f"num_training_nodes={num_training_nodes}")
logger.info(f"experiment_name={experiment_name}")

2025-03-10 05:33:24,966 - logger - INFO - ===== 1. Azure ML Training Info =====
2025-03-10 05:33:24,974 - logger - INFO - --- Global Config
2025-03-10 05:33:24,979 - logger - INFO - AZURE_SUBSCRIPTION_ID=59282147-4e06-47ed-bb04-cd383dd85c09
2025-03-10 05:33:24,985 - logger - INFO - AZURE_RESOURCE_GROUP=rg-az02-rnd-gpu-aks
2025-03-10 05:33:24,989 - logger - INFO - AZURE_WORKSPACE=ml-az02-rnd-gpu-aks
2025-03-10 05:33:24,995 - logger - INFO - AZURE_DATA_NAME=<YOUR-DATA>
2025-03-10 05:33:24,999 - logger - INFO - USE_LOWPRIORITY_VM=False
2025-03-10 05:33:25,011 - logger - INFO - --- Train Config
2025-03-10 05:33:25,016 - logger - INFO - use_builtin_env=True
2025-03-10 05:33:25,022 - logger - INFO - azure_env_name=aml-accelerate
2025-03-10 05:33:25,022 - logger - INFO - azure_compute_cluster_name=aks-test
2025-03-10 05:33:25,023 - logger - INFO - azure_compute_cluster_size=Standard_ND96isr_H100_v5
2025-03-10 05:33:25,023 - logger - INFO - num_training_nodes=2
2025-03-10 05:33:25,024 - logger
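
A missing key in `config.yml` only surfaces as a `KeyError` at the first lookup. The sketch below is one possible way to check the whole file up front. The helper name `find_missing_keys` and the placeholder values are hypothetical; the key names are taken from the lookups in the cell above.

```python
# Hypothetical helper: the key names mirror the lookups in the cell above.
REQUIRED_KEYS = {
    "config": [
        "AZURE_SUBSCRIPTION_ID",
        "AZURE_RESOURCE_GROUP",
        "AZURE_WORKSPACE",
        "AZURE_DATA_NAME",
        "USE_LOWPRIORITY_VM",
    ],
    "train": [
        "use_builtin_env",
        "azure_env_name",
        "azure_compute_cluster_name",
        "azure_compute_cluster_size",
        "num_training_nodes",
        "experiment_name",
    ],
}


def find_missing_keys(cfg):
    """Return a sorted list of 'section.key' names that are absent from cfg."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        section_values = cfg.get(section) or {}
        for key in keys:
            if key not in section_values:
                missing.append(f"{section}.{key}")
    return sorted(missing)


# Placeholder values only; replace them with your own settings.
example_cfg = {
    "config": {
        "AZURE_SUBSCRIPTION_ID": "<YOUR-SUBSCRIPTION-ID>",
        "AZURE_RESOURCE_GROUP": "<YOUR-RESOURCE-GROUP>",
        "AZURE_WORKSPACE": "<YOUR-WORKSPACE>",
        "AZURE_DATA_NAME": "<YOUR-DATA>",
        "USE_LOWPRIORITY_VM": False,
    },
    "train": {
        "use_builtin_env": True,
        "azure_env_name": "aml-accelerate",
        "azure_compute_cluster_name": "<YOUR-CLUSTER>",
        "azure_compute_cluster_size": "Standard_ND96isr_H100_v5",
        "num_training_nodes": 2,
        "experiment_name": "<YOUR-EXPERIMENT>",
    },
}

missing_in_example = find_missing_keys(example_cfg)
missing_in_empty = find_missing_keys({})
print(missing_in_example)
print(len(missing_in_empty))
```

In the notebook this could be called as `find_missing_keys(d)` right after loading the file, before any of the values are read.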

### Configure workspace details

To connect to a workspace, we need identifying parameters - a subscription, a resource group, and a workspace name. We will use these details in the MLClient from azure.ai.ml to get a handle on the Azure Machine Learning workspace we need. We will use the default Azure authentication for this hands-on.


In [13]:
# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input, Output, command, load_component
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import Data, Environment, BuildContext, Model
from azure.ai.ml.constants import AssetTypes
from azure.core.exceptions import ResourceNotFoundError, ResourceExistsError

logger.info(f"===== 2. Training preparation =====")
logger.info(f"Calling DefaultAzureCredential.")
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    ml_client = MLClient(credential, AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP, AZURE_WORKSPACE)

2025-03-10 05:33:25,265 - logger - INFO - ===== 2. Training preparation =====
2025-03-10 05:33:25,266 - logger - INFO - Calling DefaultAzureCredential.
Found the config file in: /config.json


In [14]:
# Get workspace info
ws = ml_client.workspaces.get(name=AZURE_WORKSPACE)
print(ws.container_registry)

/subscriptions/59282147-4e06-47ed-bb04-cd383dd85c09/resourceGroups/rg-az02-rnd-gpu-aks/providers/Microsoft.ContainerRegistry/registries/craz02rndgpuaks


## Step 2 - Create Compute Targets

---


In [15]:
from azure.ai.ml.entities import AmlCompute
def get_or_create_compute_target(ml_client, compute_cluster_name, compute_cluster_size, 
                                 num_training_nodes=1,
                                 use_lowpriority_vm=False, update=False):

    # NOTE: `update` is accepted for interface compatibility but is currently unused.
    try:
        compute = ml_client.compute.get(compute_cluster_name)
        print("The compute cluster already exists! Reusing it for the current run")
    except ResourceNotFoundError:
        print(
            f"Looks like the compute cluster doesn't exist. Creating a new one with compute size {compute_cluster_size}!"
        )
        tier = 'LowPriority' if use_lowpriority_vm else 'Dedicated'
        logger.info(f"Attempt #1 - Trying to create a {tier} compute")
        compute = AmlCompute(
            name=compute_cluster_name,
            size=compute_cluster_size,
            tier=tier,
            max_instances=num_training_nodes,  # For multi-node training, set this to an integer greater than 1
        )
        try:
            # Block until provisioning finishes and keep the created resource
            compute = ml_client.compute.begin_create_or_update(compute).result()
        except Exception as e:
            print(f"Error: {e}")
            raise  # do not return a compute object that was never provisioned
    return compute

In [16]:
cpu_compute_cluster_name = "cpu-cluster"
cpu_compute_cluster_size = "Standard_E4ds_v4"
gpu_compute_cluster_name = azure_compute_cluster_name
gpu_compute_cluster_size = azure_compute_cluster_size

def get_num_gpus(gpu_compute_cluster_size):
    num_gpu_dict = {
        "Standard_NC24ads_A100_v4": 1,
        "Standard_NC48ads_A100_v4": 2,
        "Standard_NC96ads_A100_v4": 4,
        "Standard_NC40ads_H100_v5": 1,
        "Standard_NC80adis_H100_v5": 2,
        "Standard_ND96isr_H100_v5": 8,
    }
    return num_gpu_dict[gpu_compute_cluster_size]

num_gpus_per_node = get_num_gpus(gpu_compute_cluster_size)

cpu_compute = get_or_create_compute_target(
    ml_client, cpu_compute_cluster_name, 
    cpu_compute_cluster_size, num_training_nodes=1, update=False
)

gpu_compute = get_or_create_compute_target(
    ml_client, gpu_compute_cluster_name, 
    gpu_compute_cluster_size, num_training_nodes=num_training_nodes, update=False
)

The compute cluster already exists! Reusing it for the current run
The compute cluster already exists! Reusing it for the current run


## Step 3 - Upload Data to AzureML

---

So, let’s get some data into AzureML! To do that, we’ll create a data-prep-step that:

-   downloads compressed data from a URL,
-   extracts it to a new location in the AzureML workspace’s storage.

Once we do this, we’ll be able to mount this data to our training run later. 💾

We start off by creating a `./src` directory where all of our code will live. AzureML uploads all the files within this source directory, so we want to keep it clean.

We’ll also define an experiment name, so all the jobs we run here are grouped together.


In [17]:
from pathlib import Path
src_dir = './src'
Path(src_dir).mkdir(exist_ok=True, parents=True)

### Define Data Upload Script

Here’s the data upload script. It simply takes in a path (to a `.tar.gz` file) and extracts it to `output_folder`.


In [18]:
%%writefile {src_dir}/read_write_data.py
import argparse
import os
import tarfile

parser = argparse.ArgumentParser()
parser.add_argument("--input_data", type=str)
parser.add_argument("--output_folder", type=str)
args = parser.parse_args()

# Extract every member of the archive into the output folder;
# the context manager closes the archive even if extraction fails.
with tarfile.open(args.input_data) as archive:
    archive.extractall(args.output_folder)

Overwriting ./src/read_write_data.py
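
Before submitting the job, the extraction logic can be tried locally on a tiny stand-in archive. The sketch below is illustrative only: `extract_archive` is a hypothetical wrapper around the same `tarfile` calls, and the archive it builds contains a single placeholder file under an `images/` folder, mimicking a layout where the archive's top-level folder ends up inside the output folder (which would be consistent with the training input later pointing at `pets/images`).

```python
import os
import tarfile
import tempfile


def extract_archive(input_path, output_folder):
    """Extract every member of a tar archive into output_folder."""
    with tarfile.open(input_path) as archive:
        archive.extractall(output_folder)


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in archive with one file under images/.
    src = os.path.join(tmp, "images")
    os.makedirs(src)
    with open(os.path.join(src, "cat.txt"), "w") as f:
        f.write("placeholder")
    archive_path = os.path.join(tmp, "images.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(src, arcname="images")

    # Extract into a separate folder and list what landed there.
    out_dir = os.path.join(tmp, "out")
    extract_archive(archive_path, out_dir)
    top_level = sorted(os.listdir(out_dir))
    extracted = sorted(os.listdir(os.path.join(out_dir, "images")))

print(top_level, extracted)
```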


### Define Data Upload Job

Now that we have some code to run, we can define the job. The cell below defines:

-   Inputs: The inputs to our script. In our case it’s a `tar.gz` file stored at a URL. This will be downloaded when the job runs. We provide it to the script we wrote above via the `--input_data` flag.
-   Outputs: The path where we will save the outputs in our workspace’s data store. We pass this to `--output_folder` in our script.
-   Environment: We use one of AzureML’s curated environments, which will result in the job starting faster. Later, for the training job, we’ll define a custom environment.
-   Compute: We tell the job to run on our `cpu-cluster`.

Any inputs/outputs you define can be referenced via `${{inputs.<name>}}` and `${{outputs.<name>}}` in the command, so the values are passed along to the script.
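
To make the placeholder mechanism concrete, here is a toy renderer. AzureML performs this substitution itself when the job runs; `render_command` is a hypothetical function written only for illustration, and the `/mnt/...` paths are made-up stand-ins for whatever locations the service resolves at runtime.

```python
import re

# Matches ${{inputs.<name>}} and ${{outputs.<name>}} placeholders.
PLACEHOLDER = re.compile(r"\$\{\{\s*(inputs|outputs)\.(\w+)\s*\}\}")


def render_command(template, inputs, outputs):
    """Replace each placeholder with the matching value from inputs/outputs."""
    values = {"inputs": inputs, "outputs": outputs}

    def substitute(match):
        kind, name = match.group(1), match.group(2)
        return values[kind][name]

    return PLACEHOLDER.sub(substitute, template)


rendered = render_command(
    "python read_write_data.py --input_data ${{inputs.pets_zip}} --output_folder ${{outputs.pets}}",
    inputs={"pets_zip": "/mnt/in/images.tar.gz"},  # made-up path
    outputs={"pets": "/mnt/out/pets"},             # made-up path
)
print(rendered)
```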


In [19]:
aml_sub = AZURE_SUBSCRIPTION_ID
aml_rsg = AZURE_RESOURCE_GROUP
aml_ws_name = AZURE_WORKSPACE

In [24]:
# Input in this case is a URL that will be downloaded
inputs = {
    "pets_zip": Input(
        type=AssetTypes.URI_FILE,
        path="https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz",
    ),
}

# Define output data. The resulting path will be used as the input of the training job (train.py)
outputs = {
    "pets": Output(
        type=AssetTypes.URI_FOLDER,
        path=f"azureml://subscriptions/{aml_sub}/resourcegroups/{aml_rsg}/workspaces/{aml_ws_name}/datastores/workspaceblobstore/paths/pets",
    )
}

env = Environment(
    image='mcr.microsoft.com/azureml/curated/sklearn-1.5:21',
    name="aks-test-env",
    description="Environment created from a Docker image.",
)

# Define our job
job = command(
    code=src_dir,
    command="python read_write_data.py --input_data ${{inputs.pets_zip}} --output_folder ${{outputs.pets}}",
    inputs=inputs,
    outputs=outputs,
    environment=env,
    compute=cpu_compute_cluster_name,
    experiment_name=experiment_name,
    display_name='data-prep-step'
)
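
The output path above follows a long-form datastore URI pattern, and the same pattern is needed again for the training input. A small helper can keep the two in sync. This is a sketch: `datastore_uri` is a hypothetical name, and the default datastore `workspaceblobstore` is simply the one used in the cell above.

```python
def datastore_uri(subscription_id, resource_group, workspace, relative_path,
                  datastore="workspaceblobstore"):
    """Build a long-form azureml:// URI for a path inside a workspace datastore."""
    return (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}/paths/{relative_path.lstrip('/')}"
    )


# Placeholder identifiers; in the notebook these would be aml_sub, aml_rsg and aml_ws_name.
pets_uri = datastore_uri("<sub-id>", "<resource-group>", "<workspace>", "pets")
images_uri = datastore_uri("<sub-id>", "<resource-group>", "<workspace>", "/pets/images")
print(pets_uri)
print(images_uri)
```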

### Run Data Upload Job

If everything goes smoothly, the cell below should launch the `data-prep-step` job and print a link where you can watch it run.

You only really need to run this job once, and then can reference it as many times as you like in the training step we are going to define in the next section.


In [25]:
# submit the command
returned_job = ml_client.jobs.create_or_update(job)
returned_job

Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
pathOnCompute is not a known attribute

Experiment,Name,Type,Status,Details Page
accelerate-cv-multinode-example,gifted_queen_6bvlpql332,command,Starting,Link to Azure Machine Learning studio
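
The job is only `Starting` at this point, and the training step reads the extracted data, so it may help to wait for completion before moving on. One option is `ml_client.jobs.stream(returned_job.name)`, which this notebook uses later for the training job. The sketch below shows a polling alternative. `wait_for_job` is a hypothetical helper, and the set of terminal status strings is an assumption that should be checked against the statuses your SDK version reports.

```python
import time

# Assumed terminal states; verify against the statuses reported by your SDK version.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_job(get_status, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Call get_status() until it returns a terminal status or the timeout elapses."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still '{status}' after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with a fake status source and no real sleeping.
fake_statuses = iter(["Starting", "Running", "Completed"])
final_status = wait_for_job(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

In the notebook, the status source could be `lambda: ml_client.jobs.get(returned_job.name).status`.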


## Step 4 - Train

---

Ok, we have some data! 🙏

Let’s see how we can set up multi-node/multi-gpu training with accelerate.


### Define Training Environment

For the training job, we’ll define a custom training environment, as our dependencies aren’t included in the curated environments offered by AzureML. We pin most of these to exact versions so the environment stays reproducible over time and when shared with others.


In [36]:
def get_or_create_environment_asset(ml_client, env_name, base_image, conda_yml="cloud/conda.yml", update=False):
    
    try:
        latest_env_version = max([int(e.version) for e in ml_client.environments.list(name=env_name)])
        if update:
            raise ResourceExistsError('Found Environment asset, but will update the Environment.')
        else:
            env_asset = ml_client.environments.get(name=env_name, version=latest_env_version)
            logger.info(f"Found Environment asset: {env_name}. Will not create again")
    except (ResourceNotFoundError, ResourceExistsError, ValueError) as e:  # ValueError: max() of an empty list when no version exists yet
        print(f"Exception: {e}")        
        env_docker_image = Environment(
            image=base_image,
            conda_file=conda_yml,
            name=env_name,
            description="Environment created for llm fine-tuning.",
        )
        env_asset = ml_client.environments.create_or_update(env_docker_image)
        logger.info(f"Created/Updated Environment asset: {env_name}")
        
    return env_asset

def get_or_create_docker_environment_asset(ml_client, env_name, docker_dir, update=False):
    
    try:
        latest_env_version = max([int(e.version) for e in ml_client.environments.list(name=env_name)])
        if update:
            raise ResourceExistsError('Found Environment asset, but will update the Environment.')
        else:
            env_asset = ml_client.environments.get(name=env_name, version=latest_env_version)
            logger.info(f"Found Environment asset: {env_name}. Will not create again")
    except (ResourceNotFoundError, ResourceExistsError, ValueError) as e:  # ValueError: max() of an empty list when no version exists yet
        logger.info(f"Exception: {e}")
        env_docker_image = Environment(
            build=BuildContext(path=docker_dir),
            name=env_name,
            description="Environment created from a Docker context.",
        )
        env_asset = ml_client.environments.create_or_update(env_docker_image)
        logger.info(f"Created Environment asset: {env_name}")
    
    return env_asset

def get_or_create_data_asset(ml_client, data_name, data_local_dir, update=False):
    
    try:
        latest_data_version = max([int(d.version) for d in ml_client.data.list(name=data_name)])
        if update:
            raise ResourceExistsError('Found Data asset, but will update the Data.')            
        else:
            data_asset = ml_client.data.get(name=data_name, version=latest_data_version)
            logger.info(f"Found Data asset: {data_name}. Will not create again")
    except (ResourceNotFoundError, ResourceExistsError, ValueError) as e:  # ValueError: max() of an empty list when no version exists yet
        data = Data(
            path=data_local_dir,
            type=AssetTypes.URI_FOLDER,
            description=f"{data_name} for fine tuning",
            tags={"FineTuningType": "Instruction", "Language": "En"},
            name=data_name
        )
        data_asset = ml_client.data.create_or_update(data)
        logger.info(f"Created/Updated Data asset: {data_name}")
        
    return data_asset

Next, we write a conda environment file that specifies additional dependencies on top of a base Docker image from AzureML; the commented-out cell further below uses the curated `acpt-pytorch-2.2-cuda12.1` image as that base.

For more information on creating environments in AzureML SDK v2, check out the docs.


In [56]:
%%writefile {src_dir}/train_environment.yml
name: aml-video-accelerate
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip=24.0
  - pip:
    - pyarrow==18.0.0
    - timm==1.0.11
    - setfit==1.1.0
    - fire==0.7.0

Overwriting ./src/train_environment.yml


In [57]:
# base_image = "mcr.microsoft.com/azureml/curated/acpt-pytorch-2.2-cuda12.1:19"
# env = get_or_create_environment_asset(ml_client, azure_env_name, base_image, conda_yml=f"{src_dir}/train_environment.yml", update=True)

### Define Training Script

For our training script, we’re going to use the [complete_cv_example.py](https://github.com/huggingface/accelerate/blob/main/examples/complete_cv_example.py) script from the official [accelerate examples](https://github.com/huggingface/accelerate/tree/main/examples) on GitHub.


In [58]:
! wget -O {src_dir}/train.py -nc https://raw.githubusercontent.com/huggingface/accelerate/main/examples/complete_cv_example.py

File ‘./src/train.py’ already there; not retrieving.


In [59]:
%%writefile {src_dir}/requirements.txt
timm
setfit
fire

Overwriting ./src/requirements.txt
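
Note that the conda file above pins exact versions, while this `requirements.txt` does not, so the curated-environment path may install different versions over time. A quick check like the following can flag unpinned entries. `unpinned_requirements` is a hypothetical helper and only looks for `==`; it does not understand the full requirements syntax.

```python
def unpinned_requirements(lines):
    """Return requirement lines that carry no exact '==' version pin."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and "==" not in line:
            unpinned.append(line)
    return unpinned


current = unpinned_requirements(["timm", "setfit", "fire"])
pinned = unpinned_requirements(["timm==1.0.11", "setfit==1.1.0", "fire==0.7.0"])
print(current)
print(pinned)
```

The pinned versions in the second call are the ones already listed in `train_environment.yml`.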


### Define Training Job

The moment of truth! Let’s see if we can train an image classifier using multiple GPUs across multiple nodes on AzureML 🤞

Here, we’ll create a job called `train-step` that specifies:

-   An input, `pets`, which points to the data store path where we stored our processed data earlier.
-   Our training command, providing the following flags:
    -   `--data_dir`: Supplies the input reference path.
    -   `--with_tracking`: Makes sure we save logs.
    -   `--checkpointing_steps epoch`: Makes sure we save a checkpoint every epoch.
    -   `--output_dir ./outputs`: Saves to the `./outputs` directory, which is a special directory in AzureML meant for saving any artifacts from training.
-   The environment: either a curated environment or the custom training environment defined above, depending on `use_builtin_env`.
-   The `distribution` as `PyTorch`, specifying `process_count_per_instance`, which is how many GPUs there are per node (in our case, 8 for `Standard_ND96isr_H100_v5`).

For more information on how Multi-Node GPU training works on AzureML, you can refer to the [docs](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-distributed-gpu?view=azureml-api-2).
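
To see what `instance_count` and `process_count_per_instance` imply, the sketch below enumerates one process per GPU and assigns each a node rank, local rank, and global rank. It is an illustration rather than AzureML code: `rank_layout` is a hypothetical function, and the contiguous numbering (`rank = node_rank * process_count_per_instance + local_rank`) is the conventional layout, assumed here. At runtime, each process would read the corresponding values from environment variables such as `RANK`, `LOCAL_RANK`, `NODE_RANK`, and `WORLD_SIZE`, as described in the linked docs.

```python
def rank_layout(instance_count, process_count_per_instance):
    """List the (node_rank, local_rank, rank) assignment for every training process."""
    world_size = instance_count * process_count_per_instance
    layout = []
    for node_rank in range(instance_count):
        for local_rank in range(process_count_per_instance):
            layout.append({
                "node_rank": node_rank,
                "local_rank": local_rank,
                # Assumed conventional contiguous numbering across nodes.
                "rank": node_rank * process_count_per_instance + local_rank,
                "world_size": world_size,
            })
    return layout


# Two nodes with eight GPUs each, matching this notebook's configuration.
layout = rank_layout(instance_count=2, process_count_per_instance=8)
print(len(layout))
print(layout[8])
```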

The `command` function lets you configure the following key aspects:

-   `inputs` - This is the dictionary of inputs using name value pairs to the command. Inputs can be referenced in the `command` using the `${{inputs.<input_name>}}` expression. To use files or folders as inputs, we can use the `Input` class. The `Input` class supports three parameters:
    -   `type` - The type of input. This can be a `uri_file` or `uri_folder`. The default is `uri_folder`.
    -   `path` - The path to the file or folder. These can be local or remote files or folders. For remote files, http/https and wasb are supported.
        -   Azure ML `data`/`dataset` or `datastore` are of type `uri_folder`. To use `data`/`dataset` as input, you can use a registered dataset in the workspace using the format `<data_name>:<version>`, e.g. `Input(type='uri_folder', path='my_dataset:1')`.
    -   `mode` - Mode of how the data should be delivered to the compute target. Allowed values are `ro_mount`, `rw_mount` and `download`. Default is `ro_mount`.
-   `code` - This is the path where the code to run the command is located.
-   `compute` - The compute on which the command will run. You can run it on the local machine by using `local` for the compute.
-   `command` - This is the command that needs to be run.
-   `environment` - This is the environment needed for the command to run. Curated (built-in) or custom environments from the workspace can be used.
-   `instance_count` - Number of nodes. Default is 1.
-   `distribution` - Distribution configuration for distributed training scenarios. Azure Machine Learning supports PyTorch, TensorFlow, and MPI-based distributed training.


In [60]:
str_command = ""

if use_builtin_env:
    str_env = "azureml://registries/azureml/environments/tensorflow-2.16-cuda12/versions/6"  # Use Curated (built-in) Environment asset
    str_command += "pip install -r requirements.txt && "
else:
    str_env = f"{azure_env_name}@latest" # Use Custom Environment asset
    
str_command += "python train.py --data_dir ${{inputs.pets}} --with_tracking --checkpointing_steps epoch --output_dir ./outputs"

In [61]:
print(str_env)

azureml://registries/azureml/environments/tensorflow-2.16-cuda12/versions/6


In [62]:
# Define inputs, which in our case is the folder written by read_write_data.py in the data-prep step
inputs = dict(
    pets=Input(
        type=AssetTypes.URI_FOLDER,
        path=f"azureml://subscriptions/{aml_sub}/resourcegroups/{aml_rsg}/workspaces/{aml_ws_name}/datastores/workspaceblobstore/paths/pets/images",
    ),
)

# Define the job!
job = command(
    code=src_dir,
    inputs=inputs,
    command=str_command,
    environment=str_env,
    compute=gpu_compute_cluster_name,
    instance_count=num_training_nodes,  # Number of nodes to train on (2 in this run)
    distribution={
        "type": "PyTorch",
        # set process count to the number of GPUs per node
        # (looked up via get_num_gpus; 8 for Standard_ND96isr_H100_v5)
        "process_count_per_instance": num_gpus_per_node,
    },
    # environment_variables={
    #     "MLFLOW_TRACKING_URI": ""  # no use mlflow
    # },    
    experiment_name=experiment_name,
    display_name='train-step'
)

### Run Training Job


In [63]:
train_job = ml_client.jobs.create_or_update(job)
display(train_job)

Uploading src (0.03 MBs): 100%|██████████| 29292/29292 [00:00<00:00, 928376.98it/s]



Experiment,Name,Type,Status,Details Page
accelerate-cv-multinode-example,helpful_river_3j3jpdw446,command,Starting,Link to Azure Machine Learning studio


In [64]:
logger.info("""Started training job. Now a dedicated Compute Cluster for training is provisioned and the environment
required for training is automatically set up from Environment.

If you have set up a new custom Environment, it will take approximately 20 minutes or more to set up the Environment before provisioning the training cluster.
""")
ml_client.jobs.stream(train_job.name)

RunId: helpful_river_3j3jpdw446
Web View: https://ml.azure.com/runs/helpful_river_3j3jpdw446?wsid=/subscriptions/59282147-4e06-47ed-bb04-cd383dd85c09/resourcegroups/rg-az02-rnd-gpu-aks/workspaces/ml-az02-rnd-gpu-aks

Execution Summary
RunId: helpful_river_3j3jpdw446
Web View: https://ml.azure.com/runs/helpful_river_3j3jpdw446?wsid=/subscriptions/59282147-4e06-47ed-bb04-cd383dd85c09/resourcegroups/rg-az02-rnd-gpu-aks/workspaces/ml-az02-rnd-gpu-aks

{"Compliant":"System unhealthy with error: Component Executor unhealthy with err Service 'Executor' returned invalid response: status: Unknown, message: \"transport error\", details: [], metadata: MetadataMap { headers: {} }"}
{"Compliant":"Component Executor unhealthy with err Service 'Executor' returned invalid response: status: Unknown, message: \"transport error\", details: [], metadata: MetadataMap { headers: {} }"}
{"Compliant":"Service 'Executor' returned invalid response: status: Unknown
AML Kubernetes Compute job failed.
-1200: Conta

JobException: Exception : 
 {
    "error": {
        "code": "ServiceError",
        "message": "{\"Compliant\":\"System unhealthy with error: Component Executor unhealthy with err Service 'Executor' returned invalid response: status: Unknown, message: \\\"transport error\\\", details: [], metadata: MetadataMap { headers: {} }\"}\n{\"Compliant\":\"Component Executor unhealthy with err Service 'Executor' returned invalid response: status: Unknown, message: \\\"transport error\\\", details: [], metadata: MetadataMap { headers: {} }\"}\n{\"Compliant\":\"Service 'Executor' returned invalid response: status: Unknown, message: \\\"transport error\\\", details: [], metadata: MetadataMap { headers: {} }\"}\n{\"Compliant\":\"Service 'Executor' returned invalid response: status: Unknown, message: \\\"transport error\\\", details: [], metadata: MetadataMap { headers: {} }\"}\n{\n  \"code\": \"LifecycleError\",\n  \"target\": \"\",\n  \"category\": \"SystemError\",\n  \"error_details\": [],\n  \"inner_error\": {\n    \"code\": \"SystemUnhealthyError\",\n    \"target\": \"\",\n    \"category\": \"SystemError\",\n    \"error_details\": [],\n    \"inner_error\": {\n      \"code\": \"HealthError\",\n      \"target\": \"\",\n      \"category\": \"SystemError\",\n      \"error_details\": [],\n      \"inner_error\": {\n        \"code\": \"InvalidResponse\",\n        \"target\": \"\",\n        \"category\": \"SystemError\",\n        \"error_details\": []\n      }\n    }\n  }\n}",
        "message_parameters": {},
        "details": []
    },
    "time": "0001-01-01T00:00:00.000Z"
} 
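
The failure above is reported as a chain of nested `inner_error` objects, and the innermost code is usually the most specific. The sketch below walks such a chain. `error_code_chain` is a hypothetical helper, and the dictionary is a trimmed copy of the structure embedded in the message above.

```python
def error_code_chain(error):
    """Collect the 'code' of an error and of each nested 'inner_error', outermost first."""
    chain = []
    node = error
    while isinstance(node, dict):
        code = node.get("code")
        if code:
            chain.append(code)
        node = node.get("inner_error")
    return chain


# Trimmed copy of the nested structure from the JobException message above.
lifecycle_error = {
    "code": "LifecycleError",
    "category": "SystemError",
    "inner_error": {
        "code": "SystemUnhealthyError",
        "category": "SystemError",
        "inner_error": {
            "code": "HealthError",
            "category": "SystemError",
            "inner_error": {
                "code": "InvalidResponse",
                "category": "SystemError",
            },
        },
    },
}

chain = error_code_chain(lifecycle_error)
print(" -> ".join(chain))
```

Every level here is categorized as `SystemError`, which suggests the failure came from the compute side rather than from the training script.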

In [None]:
# keep the job name so later cells (and other notebooks, via %store) can reference this run
job_name = train_job.name

In [None]:
%store job_name


## Step 5 (Optional) - Create model asset and download the trained model to a local folder

---

### Create model asset


In [None]:
def get_or_create_model_asset(ml_client, model_name, job_name, model_dir="outputs", model_type="custom_model", update=False):
    
    try:
        latest_model_version = max([int(m.version) for m in ml_client.models.list(name=model_name)])
        if update:
            raise ResourceExistsError('Found Model asset, but will update the Model.')
        else:
            model_asset = ml_client.models.get(name=model_name, version=latest_model_version)
            logger.info(f"Found Model asset: {model_name}. Will not create again")
    except (ResourceNotFoundError, ResourceExistsError, ValueError) as e:  # ValueError: max() of an empty list when no version exists yet
        logger.info(f"Exception: {e}")        
        model_path = f"azureml://jobs/{job_name}/outputs/artifacts/paths/{model_dir}/"    
        run_model = Model(
            name=model_name,        
            path=model_path,
            description="Model created from run.",
            type=model_type # mlflow_model, custom_model, triton_model
        )
        model_asset = ml_client.models.create_or_update(run_model)
        logger.info(f"Created Model asset: {model_name}")

    return model_asset

In [None]:
azure_model_name = d['serve']['azure_model_name']
model_dir = d['train']['model_dir']

model = get_or_create_model_asset(ml_client, azure_model_name, job_name, model_dir, model_type="custom_model", update=False)

logger.info("===== 4. (Optional) Create model asset and get fine-tuned LLM to local folder =====")
logger.info(f"azure_model_name={azure_model_name}")
logger.info(f"model_dir={model_dir}")
logger.info(f"model={model}")

### Download the model (this is optional)


In [None]:
# local_model_dir = "./artifact_downloads"
# os.makedirs(local_model_dir, exist_ok=True)
# ml_client.models.download(name=azure_model_name, download_path=local_model_dir, version=model.version)