# Fine-tuning Open Source LLM using the Azure ML Python SDK (MLflow)

### Overview

An Azure ML workspace is compatible with MLflow and can be used as an MLflow Tracking Server, as described in Microsoft's official documentation. MLflow provides experiment tracking, model management, and model deployment, which lets you manage data science and machine learning workflows more efficiently and systematically. Below are the main advantages of using Azure ML and MLflow together.

#### 1. Experiment tracking and management

You can systematically manage the parameters, metrics, and artifacts of all your experiments. Integrating with Azure ML lets you track and manage this information within your Azure ML workspace.

#### 2. Model management

MLflow provides a model registry for model versioning. Integrate with AzureML to systematically manage and deploy all versions of your models. When combined with AzureML's deployment capabilities, models can be easily deployed to a variety of environments (e.g. Azure Kubernetes Service, Azure Container Instances).

#### 3. Reproducibility and collaboration

MLflow records the parameters and environment of every experiment, so you can accurately reproduce the experiment. This is very useful when you need to redo the same experiment across collaborating team members, or when you need to rerun an experiment at a later date.

#### 4. CI/CD integration

MLflow makes it easy to implement continuous integration (CI) and continuous deployment (CD) of machine learning models. Integrate with Azure DevOps or GitHub Actions to automatically run training, validation, and deployment processes as model changes occur.

#### 5. Integrating Logging with HF

When training a model with Hugging Face's Trainer API, specifying `report_to="azure_ml"` logs basic metrics automatically, with no additional code. You can still log custom metrics from your own training script as usual, but Azure ML's built-in logging works well as a baseline.
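
As a minimal sketch of that baseline, the keyword arguments below could be passed to Hugging Face's `TrainingArguments`. Only `report_to="azure_ml"` comes from the text above; the helper name `baseline_training_kwargs`, the output directory, and the `logging_steps` value are illustrative assumptions.

```python
def baseline_training_kwargs(output_dir: str, logging_steps: int = 10) -> dict:
    """Return keyword arguments intended for transformers.TrainingArguments."""
    return {
        "output_dir": output_dir,        # where checkpoints are written (assumed path)
        "report_to": "azure_ml",         # enable the Azure ML logging integration
        "logging_steps": logging_steps,  # how often metrics are reported (assumed value)
    }


kwargs = baseline_training_kwargs("./outputs")
print(kwargs)

# With transformers installed, the arguments would be used roughly like this:
#   from transformers import TrainingArguments
#   training_args = TrainingArguments(**kwargs)
```

Keeping the arguments in a plain dictionary makes it easy to log the same values as run parameters later.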

[Note] Please use `Python 3.10 - SDK v2 (azureml_py310_sdkv2)` conda environment.


## Configuration

---

### Load config file

In [15]:
%load_ext autoreload
%autoreload 2

import os, sys
lab_prep_dir = os.getcwd().split("SLMWorkshopCN")[0] + "SLMWorkshopCN/0_lab_preparation"
sys.path.append(os.path.abspath(lab_prep_dir))

from common import check_kernel
check_kernel()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Kernel: python31014jvsc74a57bd01f90a0206bde5cf3732dab79adbbcc7570d5fab64b89fc69d46a8fe33664a709


In [1]:
import os
# from dotenv import load_dotenv
# load_dotenv()
import yaml
from logger import logger
from datetime import datetime
snapshot_date = datetime.now().strftime("%Y-%m-%d")

with open('config_prd.yml') as f:
    d = yaml.load(f, Loader=yaml.FullLoader)
    
AZURE_SUBSCRIPTION_ID = d['config']['AZURE_SUBSCRIPTION_ID']
AZURE_RESOURCE_GROUP = d['config']['AZURE_RESOURCE_GROUP']
AZURE_WORKSPACE = d['config']['AZURE_WORKSPACE']
AZURE_DATA_NAME = d['config']['AZURE_DATA_NAME']    
DATA_DIR = d['config']['DATA_DIR']
CLOUD_DIR = d['config']['CLOUD_DIR']
HF_MODEL_NAME_OR_PATH = d['config']['HF_MODEL_NAME_OR_PATH']

azure_env_name = d['train']['azure_env_name']
azure_compute_cluster_name = d['train']['azure_compute_cluster_name']
azure_compute_cluster_size = d['train']['azure_compute_cluster_size']

os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(CLOUD_DIR, exist_ok=True)

logger.info("===== 0. Azure ML Training Info =====")
logger.info(f"AZURE_SUBSCRIPTION_ID={AZURE_SUBSCRIPTION_ID}")
logger.info(f"AZURE_RESOURCE_GROUP={AZURE_RESOURCE_GROUP}")
logger.info(f"AZURE_WORKSPACE={AZURE_WORKSPACE}")
logger.info(f"AZURE_DATA_NAME={AZURE_DATA_NAME}")
logger.info(f"DATA_DIR={DATA_DIR}")
logger.info(f"CLOUD_DIR={CLOUD_DIR}")
logger.info(f"HF_MODEL_NAME_OR_PATH={HF_MODEL_NAME_OR_PATH}")

logger.info(f"azure_env_name={azure_env_name}")
logger.info(f"azure_compute_cluster_name={azure_compute_cluster_name}")
logger.info(f"azure_compute_cluster_size={azure_compute_cluster_size}")

2025-02-26 16:02:24,152 - logger - INFO - ===== 0. Azure ML Training Info =====
2025-02-26 16:02:24,153 - logger - INFO - AZURE_SUBSCRIPTION_ID=49aee8bf-3f02-464f-a0ba-e3467e7d85e2
2025-02-26 16:02:24,154 - logger - INFO - AZURE_RESOURCE_GROUP=rg-slmwrkshp_9
2025-02-26 16:02:24,155 - logger - INFO - AZURE_WORKSPACE=mlw-pgwgybluulpec
2025-02-26 16:02:24,156 - logger - INFO - AZURE_DATA_NAME=lgds-gsm8k-main-demo
2025-02-26 16:02:24,157 - logger - INFO - DATA_DIR=./dataset
2025-02-26 16:02:24,158 - logger - INFO - CLOUD_DIR=./cloud
2025-02-26 16:02:24,159 - logger - INFO - HF_MODEL_NAME_OR_PATH=microsoft/phi-4
2025-02-26 16:02:24,160 - logger - INFO - azure_env_name=llm-sft-2024-11-05
2025-02-26 16:02:24,161 - logger - INFO - azure_compute_cluster_name=gpu-h100
2025-02-26 16:02:24,162 - logger - INFO - azure_compute_cluster_size=Standard_NC40ads_H100_v5
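
Because every later cell depends on these values, it can help to fail fast when a key is missing from `config_prd.yml`. The sketch below checks an already loaded config dictionary against the keys read in the cell above. The helper name `find_missing_keys` and the sample dictionary are hypothetical; the key lists simply mirror the names used in that cell.

```python
REQUIRED_KEYS = {
    "config": [
        "AZURE_SUBSCRIPTION_ID", "AZURE_RESOURCE_GROUP", "AZURE_WORKSPACE",
        "AZURE_DATA_NAME", "DATA_DIR", "CLOUD_DIR", "HF_MODEL_NAME_OR_PATH",
    ],
    "train": [
        "azure_env_name", "azure_compute_cluster_name", "azure_compute_cluster_size",
    ],
}


def find_missing_keys(cfg: dict, required: dict = REQUIRED_KEYS) -> list:
    """Return 'section.key' strings for required entries that are absent or empty."""
    missing = []
    for section, keys in required.items():
        values = cfg.get(section) or {}
        for key in keys:
            if not values.get(key):
                missing.append(f"{section}.{key}")
    return missing


# Hypothetical, deliberately incomplete config for illustration
sample = {"config": {"AZURE_WORKSPACE": "my-workspace"}, "train": {}}
print(find_missing_keys(sample))
```

In the notebook, this could be called as `find_missing_keys(d)` right after the YAML file is loaded, raising an error if the returned list is not empty.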


### Configure workspace details

To connect to a workspace, we need identifying parameters - a subscription, a resource group, and a workspace name. We will use these details in the MLClient from azure.ai.ml to get a handle on the Azure Machine Learning workspace we need. We will use the default Azure authentication for this hands-on.

In [2]:
# import required libraries
import time
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import load_component
from azure.ai.ml import command
from azure.ai.ml.entities import Data, Environment, BuildContext
from azure.ai.ml.entities import Model
from azure.ai.ml import Input
from azure.ai.ml import Output
from azure.ai.ml.constants import AssetTypes
from azure.core.exceptions import ResourceNotFoundError, ResourceExistsError

logger.info(f"===== 2. Training preparation =====")
logger.info(f"Calling DefaultAzureCredential.")
credential = DefaultAzureCredential()
ml_client = MLClient(credential, AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP, AZURE_WORKSPACE)  # Create the AML workspace client; an Azure AI Foundry (AIF) project corresponds to one AML workspace
ml_client

2025-02-26 16:02:32,572 - logger - INFO - ===== 2. Training preparation =====
2025-02-26 16:02:32,573 - logger - INFO - Calling DefaultAzureCredential.


MLClient(credential=<azure.identity._credentials.default.DefaultAzureCredential object at 0x7f055e126050>,
         subscription_id=49aee8bf-3f02-464f-a0ba-e3467e7d85e2,
         resource_group_name=rg-slmwrkshp_9,
         workspace_name=mlw-pgwgybluulpec)

<br>

## 1. Dataset preparation

---

Preparing the dataset is the first step in training a model. If you want to use Hugging Face datasets, you can load them with the `datasets` library.<br>
Otherwise, you can use your own dataset from previous hands-on sessions.

We have prepared a dataset, [`lab1_augmented_samples.json`](lab1_augmented_samples.json), for this hands-on session.


In [None]:
# USE_HF_DATASETS = False # Determine if we use Hugging Face Datasets or not

# import json
# import random
# from datasets import load_dataset
# from random import randrange
# from logger import logger

In [None]:
# if not USE_HF_DATASETS:

#     # Function to load data from the provided file and convert to JSONL format for single-turn conversations
#     def load_and_convert_to_jsonl(file_path, system_prompt_msg="You're an AI assistant."):
#         with open(file_path, 'r') as file:
#             data = json.load(file)
        
#         result = []
        
#         for item in data:
#             jsonl_entry = {
#                 "prompt": system_prompt_msg,
#                 "messages": [
#                     {"content": item["input"], "role": "user"},
#                     {"content": item["output"], "role": "assistant"}
#                 ]
#             }
#             result.append(json.dumps(jsonl_entry))
        
#         return result

#     def save_jsonl_data(jsonl_data, file_path):
#         with open(file_path, 'w') as file:
#             for entry in jsonl_data:
#                 file.write(entry + '\n')
                
#     # Function to split data into training and testing sets
#     def split_train_test(jsonl_data, train_size=0.8):
#         # Shuffle the data
#         random.shuffle(jsonl_data)
        
#         # Calculate split index
#         split_index = int(len(jsonl_data) * train_size)
        
#         # Split the data
#         train_data = jsonl_data[:split_index]
#         test_data = jsonl_data[split_index:]
        
#         return train_data, test_data            

#     logger.info(f"===== 1. Custom Dataset preparation from Lab 1.  =====")
#     logger.info(f"Preparing dataset.")
#     file_path = "lab1_augmented_samples.json"
#     system_prompt_msg = "You are the SME (Subject Matter Expert) in Distributed training on Cloud. Please answer the questions accurately."
#     jsonl_dataset = load_and_convert_to_jsonl(file_path, system_prompt_msg) # Convert to the training data format
#     train_dataset, test_dataset = split_train_test(jsonl_dataset, train_size=0.8)
#     logger.info(f"Save dataset to {DATA_DIR}")
#     save_jsonl_data(train_dataset, f"{DATA_DIR}/train.jsonl")
#     save_jsonl_data(test_dataset, f"{DATA_DIR}/eval.jsonl")

2025-01-08 13:48:37,357 - logger - INFO - ===== 1. Custom Dataset preparation from Lab 1.  =====
2025-01-08 13:48:37,359 - logger - INFO - Preparing dataset.
2025-01-08 13:48:37,801 - logger - INFO - Save dataset to ./dataset
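
The commented-out cell above shuffles the list in place without a seed, so each run produces a different train/eval split. As a hedged alternative, the sketch below keeps the same single-turn record layout (`prompt` plus a user/assistant `messages` pair) but uses a private, seeded random generator so the split is repeatable. The `seed` parameter and the sample data are assumptions added for illustration.

```python
import json
import random


def to_jsonl_entry(item: dict, system_prompt_msg: str) -> str:
    """Serialize one {"input", "output"} sample as a single-turn chat record."""
    return json.dumps({
        "prompt": system_prompt_msg,
        "messages": [
            {"content": item["input"], "role": "user"},
            {"content": item["output"], "role": "assistant"},
        ],
    })


def split_train_test(entries: list, train_size: float = 0.8, seed: int = 42):
    """Shuffle a copy of the entries with a fixed seed, then split."""
    shuffled = list(entries)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # private RNG, so the split is repeatable
    split_index = int(len(shuffled) * train_size)
    return shuffled[:split_index], shuffled[split_index:]


# Made-up samples for illustration only
samples = [{"input": f"question {i}", "output": f"answer {i}"} for i in range(10)]
entries = [to_jsonl_entry(s, "You're an AI assistant.") for s in samples]
train, test = split_train_test(entries)
print(len(train), len(test))
```

Calling `split_train_test(entries)` again with the same seed returns the same two lists, which fits the reproducibility goal described in the overview.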


The training data can be read directly from the local development environment, or it can be registered as an Azure ML data asset.

In [3]:
def get_or_create_data_asset(ml_client, data_name, data_local_dir, update=False):
    
    try:
        latest_data_version = max([int(d.version) for d in ml_client.data.list(name=data_name)])
        if update:
            raise ResourceExistsError('Found Data asset, but will update the Data.')            
        else:
            data_asset = ml_client.data.get(name=data_name, version=latest_data_version)
            logger.info(f"Found Data asset: {data_name}. Will not create again")
    except (ResourceNotFoundError, ResourceExistsError) as e:
        data = Data(
            path=data_local_dir,
            type=AssetTypes.URI_FOLDER,
            description=f"{data_name} for fine tuning",
            tags={"FineTuningType": "Instruction", "Language": "En"},
            name=data_name
        )
        data_asset = ml_client.data.create_or_update(data)  # Appears under "Data + indexes" in the AI Foundry project and under Data in the corresponding AML workspace
        logger.info(f"Created/Updated Data asset: {data_name}")
        
    return data_asset
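
The `int(d.version)` conversion in the helper above matters because asset versions come back as strings (the printed asset shows `version: '1'`), and strings compare lexicographically. A small sketch with made-up version values shows the difference. It assumes every version is a numeric string; a non-numeric version would make `int()` raise `ValueError`.

```python
versions = ["1", "2", "9", "10"]  # made-up version strings

latest_as_text = max(versions)                    # lexicographic comparison
latest_as_number = max(int(v) for v in versions)  # numeric comparison, as in the helper

print(latest_as_text)    # 9
print(latest_as_number)  # 10
```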


In [5]:
# We have prepared the data in `./dataset`. The data is extracted from `openai/gsm8k` using `./dataset-preparation/datapreparation.py`.

data = get_or_create_data_asset(ml_client, AZURE_DATA_NAME, data_local_dir=DATA_DIR, update=False)
print(data)

Uploading dataset (0.03 MBs): 100%|██████████| 31660/31660 [00:00<00:00, 1213041.61i

creation_context:
  created_at: '2025-02-26T00:43:11.477091+00:00'
  created_by: Gang Luo
  created_by_type: User
  last_modified_at: '2025-02-26T00:43:11.487717+00:00'
description: lgds-gsm8k-main-demo for fine tuning
id: /subscriptions/49aee8bf-3f02-464f-a0ba-e3467e7d85e2/resourceGroups/rg-slmwrkshp_9/providers/Microsoft.MachineLearningServices/workspaces/mlw-pgwgybluulpec/data/lgds-gsm8k-main-demo/versions/1
name: lgds-gsm8k-main-demo
path: azureml://subscriptions/49aee8bf-3f02-464f-a0ba-e3467e7d85e2/resourcegroups/rg-slmwrkshp_9/workspaces/mlw-pgwgybluulpec/datastores/workspaceblobstore/paths/LocalUpload/22bb1f97e3eec91f29c1fb50c45854a9/dataset/
properties: {}
tags:
  FineTuningType: Instruction
  Language: En
type: uri_folder
version: '1'



<br>

## 2. Training preparation

---

### 2.1. Create AzureML environment
Azure ML defines the containers (called environment assets) in which your code runs. You can use a built-in (curated) environment or build a custom one from a Docker build context or a conda specification.
This hands-on builds a custom environment from a Docker build context.


#### Docker environment


In [53]:
%%writefile {CLOUD_DIR}/train/Dockerfile
FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2004-cu124-py310-torch241:biweekly.202410.2

USER root

RUN apt-get update && apt-get -y upgrade
RUN pip install --upgrade pip
RUN apt-get install -y openssh-server openssh-client

COPY requirements.txt .
RUN pip install -r requirements.txt --no-cache-dir
RUN pip install vllm


Overwriting ./cloud/train/Dockerfile


In [None]:
%%writefile {CLOUD_DIR}/train/requirements.txt
azureml-mlflow==1.58.0
accelerate==1.4.0
beautifulsoup4==4.13.3
bitsandbytes==0.45.3
datasets==3.3.2
deepspeed==0.15.4
huggingface_hub==0.29.1
latex2sympy2_extended==1.0.6
Markdown==3.7
math_verify==0.5.2
mlflow_skinny==2.15.0
numpy~=1.23.5
openai==1.64.0
packaging==24.2
pandas==2.2.3
peft==0.14.0
python-dotenv==1.0.1
safetensors==0.5.2
torch==2.5.1
tqdm==4.66.4
transformers==4.48.2
trl==0.15.1
unsloth==2025.2.15
unsloth_zoo==2025.2.7
wandb==0.19.7
azureml-sdk==1.58.0

Overwriting ./cloud/train/requirements.txt


In [55]:

def get_or_create_docker_environment_asset(ml_client, env_name, docker_dir, update=False):
    
    try:
        latest_env_version = max([int(e.version) for e in ml_client.environments.list(name=env_name)])
        if update:
            raise ResourceExistsError('Found Environment asset, but will update the Environment.')
        else:
            env_asset = ml_client.environments.get(name=env_name, version=latest_env_version)
            logger.info(f"Found Environment asset: {env_name}. Will not create again")
    except (ResourceNotFoundError, ResourceExistsError) as e:
        logger.info(f"Exception: {e}")
        env_docker_image = Environment(
            build=BuildContext(path=docker_dir),
            name=env_name,
            description="Environment created from a Docker context.",
        )
        env_asset = ml_client.environments.create_or_update(env_docker_image)  # Not listed in AI Foundry (although AIF/Code uses a built-in env); the asset itself lives under Environments in the corresponding AML workspace
        logger.info(f"Created Environment asset: {env_name}")
    
    return env_asset


In [None]:
env = get_or_create_docker_environment_asset(ml_client, azure_env_name, docker_dir=f"{CLOUD_DIR}/train", update=False)
env

2025-02-26 20:19:21,115 - logger - INFO - Exception: Found Environment asset, but will update the Environment.
Uploading train (0.0 MBs): 100%|██████████| 1158/1158 [00:01<00:00, 640.94it/s]

2025-02-26 20:19:42,330 - logger - INFO - Created Environment asset: llm-sft-2024-11-05


Environment({'arm_type': 'environment_version', 'latest_version': None, 'image': None, 'intellectual_property': None, 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'llm-sft-2024-11-05', 'description': 'Environment created from a Docker context.', 'tags': {}, 'properties': {'azureml.labels': 'latest'}, 'print_as_yaml': False, 'id': '/subscriptions/49aee8bf-3f02-464f-a0ba-e3467e7d85e2/resourceGroups/rg-slmwrkshp_9/providers/Microsoft.MachineLearningServices/workspaces/mlw-pgwgybluulpec/environments/llm-sft-2024-11-05/versions/12', 'Resource__source_path': '', 'base_path': '/mnt/d/BT/SRC/NLP/LLM/O1/reasoningimprove/phi4_rl', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f052919f610>, 'serialize': <msrest.serialization.Serializer object at 0x7f052919de70>, 'version': '12', 'conda_file': None, 'build': <azure.ai.ml.entities._assets.environment.BuildContext object at 0x7f052919f310>, 'inference_config': None, 'os

<br>

## 3. Training

---

### 3.1. Create the compute cluster


In [57]:
from azure.ai.ml.entities import AmlCompute

logger.info(f"===== 3. Training =====")
### Create the compute cluster
try:
    compute = ml_client.compute.get(azure_compute_cluster_name)
    logger.info("The compute cluster already exists! Reusing it for the current run")
except Exception as ex:
    logger.info(
        f"Looks like the compute cluster doesn't exist. Creating a new one with compute size {azure_compute_cluster_size}!"
    )
    try:
        print("Attempt #1 - Trying to create a dedicated compute")
        tier = 'Dedicated'
        compute = AmlCompute(
            name=azure_compute_cluster_name,
            size=azure_compute_cluster_size, # Standard_NC40ads_H100_v5
            tier=tier,
            max_instances=1,  # For multi node training set this to an integer value more than 1
        )
        ml_client.compute.begin_create_or_update(compute).wait()  # In AI Foundry: AI project -> Management center -> hub menu group -> Compute. To change the CPU/GPU quota the compute needs, choose Quota in the same Management center menu
    except Exception as e:
        logger.info(f"Error: {e}")
print(compute)

2025-02-26 20:19:42,357 - logger - INFO - ===== 3. Training =====
2025-02-26 20:19:43,438 - logger - INFO - The compute cluster already exists! Reusing it for the current run


enable_node_public_ip: true
id: /subscriptions/49aee8bf-3f02-464f-a0ba-e3467e7d85e2/resourceGroups/rg-slmwrkshp_9/providers/Microsoft.MachineLearningServices/workspaces/mlw-pgwgybluulpec/computes/gpu-h100
idle_time_before_scale_down: 120
location: eastus
max_instances: 1
min_instances: 0
name: gpu-h100
network_settings: {}
provisioning_state: Succeeded
size: Standard_NC40ads_H100_v5
ssh_public_access_enabled: true
tier: dedicated
type: amlcompute



### 3.2. Training script


In [58]:
# !pygmentize src_train/train_mlflow.py

### 3.3. Start training job

The `command` function lets you configure the following key aspects of a job.

-   `inputs` - A dictionary of name/value pairs passed to the command. Inputs are referenced in the `command` string using the `${{inputs.<input_name>}}` expression. To use files or folders as inputs, use the `Input` class, which supports three parameters:
    -   `type` - The type of input. This can be a `uri_file` or `uri_folder`. The default is `uri_folder`.
    -   `path` - The path to the file or folder. It can be local or remote. For remote files, http/https and wasb are supported.
        -   Azure ML `data`/`dataset` or `datastore` are of type `uri_folder`. To use `data`/`dataset` as input, reference a dataset registered in the workspace using the format `<data_name>:<version>`, for example `Input(type='uri_folder', path='my_dataset:1')`.
    -   `mode` - How the data is delivered to the compute target. Allowed values are `ro_mount`, `rw_mount`, and `download`. The default is `ro_mount`.
-   `code` - The path where the code to run the command is located.
-   `compute` - The compute on which the command will run. You can run it on the local machine by using `local` for the compute.
-   `command` - The command that needs to be run.
-   `environment` - The environment needed for the command to run. Curated (built-in) or custom environments from the workspace can be used.
-   `instance_count` - Number of nodes. The default is 1.
-   `distribution` - Distribution configuration for distributed training scenarios. Azure Machine Learning supports PyTorch, TensorFlow, and MPI-based distributed training.
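
A typo in an `${{inputs.<input_name>}}` reference usually surfaces only after the job is submitted. The sketch below is a local pre-check, not Azure ML's own validation: it scans a command string for input placeholders and reports any name that is missing from the inputs dictionary. The helper name `unresolved_inputs` and the regular expression are assumptions made for illustration.

```python
import re

# Matches ${{inputs.<name>}} and captures <name>
PLACEHOLDER = re.compile(r"\$\{\{\s*inputs\.(\w+)\s*\}\}")


def unresolved_inputs(command_text: str, inputs: dict) -> list:
    """Return input names referenced in the command but missing from `inputs`."""
    referenced = PLACEHOLDER.findall(command_text)
    return sorted({name for name in referenced if name not in inputs})


command_text = (
    "python train_mlflow.py "
    "--train_dir ${{inputs.train_dir}} "
    "--model_dir ${{inputs.model_dir}}"
)
print(unresolved_inputs(command_text, {"train_dir": "my_dataset:1"}))
```

Here `model_dir` is reported because the example dictionary defines only `train_dir`; an empty list means every placeholder has a matching input.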


In [59]:
from azure.ai.ml import command
from azure.ai.ml import Input
from azure.ai.ml.entities import ResourceConfiguration

USE_BUILTIN_ENV = False
str_command = ""

if USE_BUILTIN_ENV:
    str_env = "azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/77" # Use a built-in Environment asset. This is the asset ID: in the AML workspace, open Environments -> Curated environments, pick an environment, and look at the Overview tab of its details page
    str_command += "pip install -r requirements.txt && " # requirements.txt lives in the folder that the `code` argument of command() points to
else:
    str_env = f"{azure_env_name}@latest" # Use the custom Environment asset created above
    
# str_command += "python train_mlflow.py \
#             --model_name_or_path ${{inputs.model_name_or_path}} \
#             --train_dir ${{inputs.train_dir}} \
#             --epochs ${{inputs.epoch}} \
#             --train_batch_size ${{inputs.train_batch_size}} \
#             --eval_batch_size ${{inputs.eval_batch_size}} \
#             --model_dir ${{inputs.model_dir}}"
str_command += "python train_mlflow.py \
            --train_dir ${{inputs.train_dir}} \
            --model_dir ${{inputs.model_dir}}" # Values referenced in the command string come from the inputs dict passed to command() below

logger.info(f"Env: {str_env}")
logger.info(f"Command: {str_command}")

job = command(
    inputs=dict( # These are the arguments of the training script src_train/train_mlflow.py
        # model_name_or_path=HF_MODEL_NAME_OR_PATH,
        #train_dir=Input(type="uri_folder", path=DATA_DIR), # Get data from local path
        train_dir=Input(path=f"{AZURE_DATA_NAME}@latest"),  # Get data from Data asset
        # epoch=d['train']['epoch'],
        # train_batch_size=d['train']['train_batch_size'],
        # eval_batch_size=d['train']['eval_batch_size'],  
        model_dir=d['train']['model_dir']
    ),
    code="./src_train",  # local path where the code is stored
    compute=azure_compute_cluster_name,
    command=str_command,
    environment=str_env,
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1, # For multi-gpu training set this to an integer value more than 1
    },
)
returned_job = ml_client.jobs.create_or_update(job, experiment_name='mlflwphi4') # Command/Spark objects can be used directly. Creates and starts a job; it is not shown in AI Foundry but appears under Jobs in the corresponding AML workspace
logger.info("""Started training job. Now a dedicated Compute Cluster for training is provisioned and the environment
required for training is automatically set up from Environment.

If you have set up a new custom Environment, it will take approximately 20 minutes or more to set up the Environment before provisioning the training cluster.
""")
ml_client.jobs.stream(returned_job.name) # Stream the logs of the running job

2025-02-26 20:19:43,478 - logger - INFO - Env: llm-sft-2024-11-05@latest
2025-02-26 20:19:43,482 - logger - INFO - Command: python train_mlflow.py             --train_dir ${{inputs.train_dir}}             --model_dir ${{inputs.model_dir}}
2025-02-26 20:19:51,399 - logger - INFO - Started training job. Now a dedicated Compute Cluster for training is provisioned and the environment
required for training is automatically set up from Environment.

If you have set up a new custom Environment, it will take approximately 20 minutes or more to set up the Environment before provisioning the training cluster.



RunId: happy_gold_q5lypvqkw5
Web View: https://ml.azure.com/runs/happy_gold_q5lypvqkw5?wsid=/subscriptions/49aee8bf-3f02-464f-a0ba-e3467e7d85e2/resourcegroups/rg-slmwrkshp_9/workspaces/mlw-pgwgybluulpec

Streaming azureml-logs/20_image_build_log.txt

The run ID for the image build on serverless compute is imgbldrun_dfabf2a
Additional logs for the run: https://ml.azure.com/experiments/id/prepare_image/runs/imgbldrun_dfabf2a?wsid=/subscriptions/49aee8bf-3f02-464f-a0ba-e3467e7d85e2/resourcegroups/rg-slmwrkshp_9/workspaces/mlw-pgwgybluulpec&tid=16b3c013-d300-468d-ac64-7eda0820b6d3
2025-02-26T12:20:04: Logging into Docker registry: f7c27ee9fb96407c9b8fa5c76209316e.azurecr.io
2025-02-26T12:20:04: https://docs.docker.com/engine/reference/commandline/login/#credentials-store

2025-02-26T12:20:04: Login Succeeded


2025-02-26T12:20:04: Running: ['docker', 'build', '-f', 'Dockerfile', '.', '-t', 'f7c27ee9fb96407c9b8fa5c76209316e.azurecr.io/azureml/azureml_cc48f16dfd4daa2effa03cfb0dd70415', '-t',

In [60]:
display(returned_job) # By this point the job has already finished

Experiment,Name,Type,Status,Details Page
mlflwphi4,happy_gold_q5lypvqkw5,command,Starting,Link to Azure Machine Learning studio


In [61]:
# Keep the job name so the model asset can be created from this run's outputs
job_name = returned_job.name

In [62]:
%store job_name

Stored 'job_name' (str)


<br>

## 4. (Optional) Create model asset and get fine-tuned LLM to local folder

---

### 4.1. Create model asset


In [63]:
def get_or_create_model_asset(ml_client, model_name, job_name, model_dir="outputs", model_type="custom_model", update=False):
    
    try:
        latest_model_version = max([int(m.version) for m in ml_client.models.list(name=model_name)])
        if update:
            raise ResourceExistsError('Found Model asset, but will update the Model.')
        else:
            model_asset = ml_client.models.get(name=model_name, version=latest_model_version)
            logger.info(f"Found Model asset: {model_name}. Will not create again")
    except (ResourceNotFoundError, ResourceExistsError) as e:
        logger.info(f"Exception: {e}")        
        model_path = f"azureml://jobs/{job_name}/outputs/artifacts/paths/{model_dir}/"    
        run_model = Model(
            name=model_name,        
            path=model_path,
            description="Model created from run.",
            type=model_type # mlflow_model, custom_model, triton_model
        )
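        model_asset = ml_client.models.create_or_update(run_model)  # Appears under Models in the corresponding AML workspace. Currently the AI Foundry project's Models page only shows deployments and service endpoints, while AML also holds the weight files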
        logger.info(f"Created Model asset: {model_name}")

    return model_asset

Note that `model_type="custom_model"` is intentional. MLflow's auto-logging support for newer models is less mature, so the model is saved the traditional way and registered as a custom model.
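
The helper above builds the model path from the job name and `model_dir`. In the run shown here, `model_dir` is `./outputs`, so the resulting URI contains a `./` segment. The sketch below normalizes the directory before building the same URI pattern. The helper name `job_output_model_path` is hypothetical, and the source does not say whether the service requires this normalization (the run below succeeded without it).

```python
import posixpath


def job_output_model_path(job_name: str, model_dir: str = "outputs") -> str:
    """Build the job-artifact URI used as the model path, normalizing model_dir."""
    clean_dir = posixpath.normpath(model_dir).strip("/")  # "./outputs" -> "outputs"
    return f"azureml://jobs/{job_name}/outputs/artifacts/paths/{clean_dir}/"


print(job_output_model_path("happy_gold_q5lypvqkw5", "./outputs"))
```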


In [64]:
azure_model_name = d['serve']['azure_model_name']
model_dir = d['train']['model_dir']
model = get_or_create_model_asset(ml_client, azure_model_name, job_name, model_dir, model_type="custom_model", update=False)

logger.info("===== 4. (Optional) Create model asset and get fine-tuned LLM to local folder =====")
logger.info(f"azure_model_name={azure_model_name}")
logger.info(f"model_dir={model_dir}")
logger.info(f"model={model}")

2025-02-26 21:18:51,490 - logger - INFO - Exception: (UserError) The specified resource was not found.
Code: UserError
Message: The specified resource was not found.
Exception Details:	(ModelNotFound) Model container with name: phi4-grpo-2024-11-05 not found.
	Code: ModelNotFound
	Message: Model container with name: phi4-grpo-2024-11-05 not found.
2025-02-26 21:18:57,620 - logger - INFO - Created Model asset: phi4-grpo-2024-11-05
2025-02-26 21:18:57,621 - logger - INFO - ===== 4. (Optional) Create model asset and get fine-tuned LLM to local folder =====
2025-02-26 21:18:57,622 - logger - INFO - azure_model_name=phi4-grpo-2024-11-05
2025-02-26 21:18:57,623 - logger - INFO - model_dir=./outputs
2025-02-26 21:18:57,628 - logger - INFO - model=creation_context:
  created_at: '2025-02-26T13:18:56.848645+00:00'
  created_by: Gang Luo
  created_by_type: User
  last_modified_at: '2025-02-26T13:18:56.848645+00:00'
  last_modified_by: Gang Luo
  last_modified_by_type: User
description: Model cre

### 4.2. Get fine-tuned LLM to local folder

You can download the registered model to a local directory to run inference there, or serve it in Azure (for example, from a real-time endpoint).


In [65]:
# Download the model (this is optional) 
local_model_dir = "./artifact_downloads"
os.makedirs(local_model_dir, exist_ok=True)

ml_client.models.download(name=azure_model_name, download_path=local_model_dir, version=model.version)  # The download may require the Storage File Data Privileged Contributor and Storage Blob Data Contributor roles

Downloading the model ExperimentRun/dcid.happy_gold_q5lypvqkw5/outputs at ./artifact_downloads/phi4-grpo-2024-11-05/outputs

Unable to stream download: HTTPSConnectionPool(host='stpgwgybluulpec.blob.core.windows.net', port=443): Read timed out.


ConnectionError: HTTPSConnectionPool(host='stpgwgybluulpec.blob.core.windows.net', port=443): Read timed out.
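
Large model downloads can hit transient read timeouts like the one above. One common mitigation is to retry with exponential backoff. The sketch below is a generic wrapper, not part of the Azure ML SDK. The helper name `retry`, the default of three attempts, and catching `OSError` are assumptions; adjust `retry_on` to whatever exception type your environment actually raises.

```python
import time


def retry(operation, attempts=3, base_delay=1.0, retry_on=(OSError,), sleep=time.sleep):
    """Call `operation()` until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)


calls = {"count": 0}


def flaky_download():
    """Stand-in for a download call: times out twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("Read timed out.")
    return "downloaded"


result = retry(flaky_download, sleep=lambda seconds: None)  # skip real waiting in the demo
print(result, calls["count"])
```

In the notebook, the download call could be wrapped as `retry(lambda: ml_client.models.download(name=azure_model_name, download_path=local_model_dir, version=model.version))`.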

## Clean up


In [None]:
!rm -rf $DATA_DIR {local_model_dir}