## Chat Completion - Ultrachat-200k

This sample shows how to use `chat-completion` components from the `azureml` system registry to fine tune a model to complete a conversation between two people using the ultrachat_200k dataset. We then deploy the fine tuned model to an online endpoint for real time inference.

### Training data
We will use the [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset. This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state of the art 7b chat model.

### Model
We will use the `Phi-3-mini-128k-instruct` model to show how a user can finetune a model for the chat-completion task. If you opened this notebook from a specific model card, remember to replace the model name.

### Outline
* Setup pre-requisites such as compute.
* Pick a model to fine tune.
* Pick and explore training data.
* Configure the fine tuning job.
* Run the fine tuning job.
* Review training and evaluation metrics. 
* Register the fine tuned model. 
* Deploy the fine tuned model for real time inference.
* Clean up resources. 

### 1. Setup pre-requisites
* Install dependencies
* Connect to AzureML Workspace. Learn more at [set up SDK authentication](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?tabs=sdk). Replace  `<WORKSPACE_NAME>`, `<RESOURCE_GROUP>` and `<SUBSCRIPTION_ID>` below.
* Connect to `azureml` system registry
* Set an optional experiment name
* Check or create compute. A single GPU node can have multiple GPU cards. For example, in one node of `Standard_NC24rs_v3` there are 4 NVIDIA V100 GPUs while in `Standard_NC12s_v3`, there are 2 NVIDIA V100 GPUs. Refer to the [docs](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) for this information. The number of GPU cards per node is set in the param `gpus_per_node` below. Setting this value correctly will ensure utilization of all GPUs in the node. The recommended GPU compute SKUs can be found [here](https://learn.microsoft.com/en-us/azure/virtual-machines/ncv3-series) and [here](https://learn.microsoft.com/en-us/azure/virtual-machines/ndv2-series).

Install dependencies by running the cell below. This step is required when running in a new environment.

In [None]:
%pip install azure-ai-ml
%pip install azure-identity
%pip install datasets==2.9.0
%pip install mlflow
%pip install azureml-mlflow

In [105]:
# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
import time
from azure.ai.ml import MLClient, Input
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import load_component
credential = InteractiveBrowserCredential(tenant_id="0fbe7234-45ea-498b-b7e4-1a8b2d3be4d9")  # your tenant id
subscription_id = "840b5c5c-3f4a-459a-94fc-6bad2a969f9d"  # your subscription id
resource_group = "ml"  # your resource group
workspace = "ws01ent"  # your workspace name
workspace_ml_client = MLClient(credential, subscription_id, resource_group, workspace)
# the models, fine tuning pipelines and environments are available in the AzureML registry, "azureml"
registry_ml_client = MLClient(credential, registry_name="azureml")
experiment_name = "chat_completion_Phi-3-mini-128k-instruct"

# generating a unique timestamp that can be used for names and versions that need to be unique
timestamp = str(int(time.time()))

### 2. Pick a foundation model to fine tune

`Phi-3-mini-128k-instruct` is a lightweight, state-of-the-art open model with 3.8B parameters, built upon datasets used for Phi-2. The model belongs to the Phi-3 model family, and the Mini version comes in two variants, 4K and 128K, named for the context length (in tokens) each can support. To use the model for our specific purpose, we need to finetune it. You can browse these models in the Model Catalog in the AzureML Studio, filtering by the `chat-completion` task. In this example, we use the `Phi-3-mini-128k-instruct` model. If you have opened this notebook for a different model, replace the model name and version accordingly.

Note the model id property of the model. This will be passed as input to the fine tuning job. This is also available as the `Asset ID` field on the model details page in the AzureML Studio Model Catalog.

In [106]:
model_name = "Phi-3-mini-128k-instruct"
foundation_model = registry_ml_client.models.get(model_name, label="latest")
print(
    "\n\nUsing model name: {0}, version: {1}, id: {2} for fine tuning".format(
        foundation_model.name, foundation_model.version, foundation_model.id
    )
)



Using model name: Phi-3-mini-128k-instruct, version: 2, id: azureml://registries/azureml/models/Phi-3-mini-128k-instruct/versions/2 for fine tuning
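
The id printed above follows the pattern `azureml://registries/<registry>/models/<name>/versions/<version>`. As a minimal sketch, the snippet below splits such an id into its parts using only the standard library. `parse_registry_model_id` is a hypothetical helper written for illustration and is not part of the `azure-ai-ml` SDK.

```python
import re

# Hypothetical helper (not part of azure-ai-ml): split a registry model asset ID
# of the form seen in the output above into its components.
_ASSET_ID_PATTERN = re.compile(
    r"^azureml://registries/(?P<registry>[^/]+)"
    r"/models/(?P<name>[^/]+)"
    r"/versions/(?P<version>[^/]+)$"
)


def parse_registry_model_id(asset_id):
    match = _ASSET_ID_PATTERN.match(asset_id)
    if match is None:
        raise ValueError(f"Not a registry model asset ID: {asset_id}")
    return match.groupdict()


parts = parse_registry_model_id(
    "azureml://registries/azureml/models/Phi-3-mini-128k-instruct/versions/2"
)
print(parts)
```

This can be handy for logging which registry and version a fine tuning run started from.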


### 3. Create a compute to be used with the job

The finetune job works `ONLY` with `GPU` compute. The size of the compute depends on how big the model is and in most cases it becomes tricky to identify the right compute for the job. In this cell, we guide the user to select the right compute for the job.

`NOTE1` The computes listed below work with the most optimized configuration. Any changes to the configuration might lead to a CUDA out-of-memory error. In such cases, try upgrading to a larger compute size.

`NOTE2` While selecting the compute_cluster_size below, make sure the compute is available in your resource group. If a particular compute is not available you can make a request to get access to the compute resources.

In [107]:
import ast

if "computes_allow_list" in foundation_model.tags:
    computes_allow_list = ast.literal_eval(
        foundation_model.tags["computes_allow_list"]
    )  # convert string to python list
    print(f"Please create a compute from the above list - {computes_allow_list}")
else:
    computes_allow_list = None
    print("Computes allow list is not part of model tags")

Computes allow list is not part of model tags
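
Model tags are stored as strings, which is why the cell above parses the tag with `ast.literal_eval`. The sketch below shows what that parsing would produce when the tag is present. The tag value is an assumed example for illustration, not taken from a real model in the registry.

```python
import ast

# Illustrative tag dictionary; the value below is an assumed example,
# not copied from a real registry model.
example_tags = {
    "computes_allow_list": "['Standard_NC24rs_v3', 'Standard_ND40rs_v2']"
}

# literal_eval only evaluates Python literals, so it is safer than eval()
allow_list = ast.literal_eval(example_tags["computes_allow_list"])

# Lower-case the names so the comparison with a VM size is case-insensitive
allowed_lower = {size.lower() for size in allow_list}

print(allow_list)
print("standard_nd40rs_v2" in allowed_lower)
```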


In [108]:
from azure.ai.ml.entities import AmlCompute

# If you have a specific compute size to work with, change it here. By default we use the 8 x V100 compute
compute_cluster_size = "Standard_ND40rs_v2"

# If you already have a GPU cluster, set its name here. Otherwise a new cluster is created with this name
compute_cluster = "NC48adsA100"

try:
    compute = workspace_ml_client.compute.get(compute_cluster)
    print("The compute cluster already exists! Reusing it for the current run")
except Exception as ex:
    print(
        f"Looks like the compute cluster doesn't exist. Creating a new one with compute size {compute_cluster_size}!"
    )
    try:
        print("Attempt #1 - Trying to create a dedicated compute")
        compute = AmlCompute(
            name=compute_cluster,
            size=compute_cluster_size,
            tier="Dedicated",
            max_instances=2,  # For multi node training set this to an integer value more than 1
        )
        workspace_ml_client.compute.begin_create_or_update(compute).wait()
    except Exception as e:
        try:
            print(
                "Attempt #2 - Trying to create a low priority compute. Since this is a low priority compute, the job could get pre-empted before completion."
            )
            compute = AmlCompute(
                name=compute_cluster,
                size=compute_cluster_size,
                tier="LowPriority",
                max_instances=2,  # For multi node training set this to an integer value more than 1
            )
            workspace_ml_client.compute.begin_create_or_update(compute).wait()
        except Exception as e:
            print(e)
            raise ValueError(
                f"WARNING! Compute size {compute_cluster_size} not available in workspace"
            )


# Sanity check on the created compute
compute = workspace_ml_client.compute.get(compute_cluster)
if compute.provisioning_state.lower() == "failed":
    raise ValueError(
        f"Provisioning failed, Compute '{compute_cluster}' is in failed state. "
        f"please try creating a different compute"
    )

if computes_allow_list is not None:
    computes_allow_list_lower_case = [x.lower() for x in computes_allow_list]
    if compute.size.lower() not in computes_allow_list_lower_case:
        raise ValueError(
            f"VM size {compute.size} is not in the allow-listed computes for finetuning"
        )
else:
    # Computes with K80 GPUs are not supported
    unsupported_gpu_vm_list = [
        "standard_nc6",
        "standard_nc12",
        "standard_nc24",
        "standard_nc24r",
    ]
    if compute.size.lower() in unsupported_gpu_vm_list:
        raise ValueError(
            f"VM size {compute.size} is currently not supported for finetuning"
        )


# This is the number of GPUs in a single node of the selected 'vm_size' compute.
# Setting this to less than the number of GPUs will result in underutilized GPUs, taking longer to train.
# Setting this to more than the number of GPUs will result in an error.
gpu_count_found = False
workspace_compute_sku_list = workspace_ml_client.compute.list_sizes()
available_sku_sizes = []
for compute_sku in workspace_compute_sku_list:
    available_sku_sizes.append(compute_sku.name)
    if compute_sku.name.lower() == compute.size.lower():
        gpus_per_node = compute_sku.gpus
        gpu_count_found = True
# if the GPU count was not found, raise an error
if gpu_count_found:
    print(f"Number of GPU's in compute {compute.size}: {gpus_per_node}")
else:
    raise ValueError(
        f"Number of GPU's in compute {compute.size} not found. Available skus are: {available_sku_sizes}. "
        f"This should not happen. Please check the selected compute cluster: {compute_cluster} and try again."
    )

The compute cluster already exists! Reusing it for the current run
Number of GPU's in compute STANDARD_NC48ADS_A100_V4: 2


### 4. Pick the dataset for fine-tuning the model

We use the [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset. The dataset has four splits, suitable for:
* Supervised fine-tuning (sft).
* Generation ranking (gen).

The number of examples per split is shown as follows:

| train_sft | test_sft | train_gen | test_gen |
| :- | :- | :- | :- |
| 207865 | 23110 | 256032 | 28304 |

The next few cells show basic data preparation for fine tuning:
* Visualize some data rows
* We want this sample to run quickly, so save `train_sft`, `test_sft` files containing 5% of the already trimmed rows. This means the fine tuned model will have lower accuracy, hence it should not be put to real-world use.

> The [download-dataset.py](./download-dataset.py) script downloads the ultrachat_200k dataset and transforms it into a format that the finetune pipeline component can consume. Because the dataset is large, only part of it is used here.

> Running the script below downloads only 5% of the data. This can be increased by changing the `dataset_split_pc` parameter to the desired percentage.

> **Note** : Some language models have different language codes and hence the column names in the dataset should reflect the same.
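
To get a feel for how small the 5% subset is, the arithmetic can be sketched as follows. The split sizes come from the table above. The floor rounding is an assumption, so the download script may keep a slightly different number of rows.

```python
# Split sizes taken from the table above
split_sizes = {"train_sft": 207865, "test_sft": 23110}

# Percentage of rows to keep, mirroring the 5% mentioned above
dataset_split_pc = 5

# Integer arithmetic (floor) avoids floating point rounding surprises;
# the actual script may round differently.
rows_kept = {
    name: size * dataset_split_pc // 100 for name, size in split_sizes.items()
}
print(rows_kept)
```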

##### Here is an example of how the data should look

The chat-completion dataset is stored in parquet format with each entry using the following schema:
``` json
{
    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
    "messages":[
        {
            "content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
            "role": "user"
        },
        {
            "content": "Name: Ava\n\n Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...",
            "role": "assistant"
        },
        {
            "content": "Wow, Ava's story is so intense and inspiring! Can you provide me with more details.  ...",
            "role": "user"
        }, 
        {
            "content": "Certainly! ....",
            "role": "assistant"
        }
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
```
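
Before launching a job it can help to check that every record matches this shape. The following is a minimal sketch of such a check. `validate_chat_record` is a hypothetical helper, and its rules (three required keys, turns alternating between `user` and `assistant`, starting with `user`) are inferred from the example above rather than from the pipeline component's own validation.

```python
def validate_chat_record(record):
    """Return a list of problems found in one chat-completion record (empty if none)."""
    problems = []
    for key in ("prompt", "messages", "prompt_id"):
        if key not in record:
            problems.append(f"missing key: {key}")

    messages = record.get("messages", [])
    if not messages:
        problems.append("messages is empty")

    for index, message in enumerate(messages):
        # Assumption: turns alternate, starting with the user, as in the example above
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message.get("role") != expected_role:
            problems.append(f"message {index}: expected role {expected_role!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content:
            problems.append(f"message {index}: content must be a non-empty string")
    return problems


good = {
    "prompt": "Hello",
    "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi, how can I help?"},
    ],
    "prompt_id": "example-id",
}
bad = {"prompt": "Hello", "messages": [{"role": "assistant", "content": "Hi"}]}

print(validate_chat_record(good))
print(validate_chat_record(bad))
```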

In [93]:
import pandas as pd
import json
import uuid
import os

os.makedirs("sql_dataset", exist_ok=True)

original_train_data = pd.read_json("../llama2/data/sqltrain.jsonl", lines=True)
original_test_data = pd.read_json("../llama2/data/sqltest.jsonl", lines=True)


def to_chat_record(row):
    # Reformat one source row (context, input, output) into the chat-completion schema:
    # {"prompt": ..., "messages": [user turn, assistant turn], "prompt_id": random uuid string}
    return {
        "prompt": row["context"],
        "messages": [
            {"role": "user", "content": row["context"] + "\n###Question:" + row["input"]},
            {"role": "assistant", "content": row["output"]},
        ],
        "prompt_id": str(uuid.uuid4()),
    }


def write_jsonl(df, path):
    # Write one JSON object per line (JSON Lines format)
    with open(path, "w") as f:
        for _, row in df.iterrows():
            f.write(json.dumps(to_chat_record(row)) + "\n")


write_jsonl(original_train_data, "sql_dataset/sql_train.jsonl")
write_jsonl(original_test_data, "sql_dataset/sql_test.jsonl")


### 5. Submit the fine tuning job using the model and data as inputs
 
Create the job that uses the `chat-completion` pipeline component. [Learn more](https://github.com/Azure/azureml-assets/blob/main/assets/training/finetune_acft_hf_nlp/components/pipeline_components/chat_completion/README.md) about all the parameters supported for fine tuning.

Define finetune parameters

Finetune parameters can be grouped into two categories: training parameters and optimization parameters.

Training parameters define the training aspects, such as:
1. the optimizer and scheduler to use
2. the metric to optimize during fine tuning
3. the number of training steps and the batch size

Optimization parameters help reduce GPU memory use and make effective use of the compute resources. Below are a few of the parameters in this category. _The optimization parameters differ for each model and are packaged with the model to handle these variations._
1. enable DeepSpeed and LoRA
2. enable mixed precision training
3. enable multi-node training

In [109]:
# Training parameters
training_parameters = dict(
    num_train_epochs=15,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
)
print(f"The following training parameters are enabled - {training_parameters}")

# Optimization parameters - as these parameters are packaged with the model itself, let's retrieve them from the model tags
if "model_specific_defaults" in foundation_model.tags:
    optimization_parameters = ast.literal_eval(
        foundation_model.tags["model_specific_defaults"]
    )  # convert string to python dict
else:
    optimization_parameters = dict(apply_lora="true", apply_deepspeed="true")
print(f"The following optimizations are enabled - {optimization_parameters}")

The following training parameters are enabled - {'num_train_epochs': 15, 'per_device_train_batch_size': 1, 'per_device_eval_batch_size': 1, 'learning_rate': 5e-06, 'lr_scheduler_type': 'cosine'}
The following optimizations are enabled - {'apply_deepspeed': 'true', 'deepspeed_stage': 2, 'apply_lora': 'true', 'apply_ort': 'false', 'precision': 16, 'ignore_mismatched_sizes': 'false'}
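
With a per-device batch size of 1, the number of samples that contribute to each optimizer step depends on how many GPUs take part. The sketch below shows the usual data-parallel arithmetic. It assumes a single node and no gradient accumulation, and it assumes the pipeline component does not override these values internally. `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(
    per_device_batch_size,
    gpus_per_node,
    num_nodes=1,
    gradient_accumulation_steps=1,
):
    # Samples contributing to one optimizer step under data-parallel training
    return (
        per_device_batch_size
        * gpus_per_node
        * num_nodes
        * gradient_accumulation_steps
    )


# Values from this walkthrough: batch size 1 per device, 2 GPUs on the node.
# num_nodes and gradient_accumulation_steps are assumed to be 1 here.
print(effective_batch_size(per_device_batch_size=1, gpus_per_node=2))
```

Keeping this number in mind helps when moving to a different compute size, because a change in GPU count changes the effective batch size even if `per_device_train_batch_size` stays the same.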


In [110]:
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import Input

# fetch the pipeline component
pipeline_component_func = registry_ml_client.components.get(
    name="chat_completion_pipeline", label="latest"
)


# define the pipeline job
@pipeline()
def create_pipeline():
    chat_completion_pipeline = pipeline_component_func(
        mlflow_model_path=foundation_model.id,
        compute_model_import=compute_cluster,
        compute_preprocess=compute_cluster,
        compute_finetune=compute_cluster,
        compute_model_evaluation=compute_cluster,
        # map the dataset splits to parameters
        train_file_path=Input(
            type="uri_file", path="./sql_dataset/sql_train.jsonl"
        ),
        test_file_path=Input(
            type="uri_file", path="./sql_dataset/sql_test.jsonl"
        ),
        # Training settings
        number_of_gpu_to_use_finetuning=gpus_per_node,  # set to the number of GPUs available in the compute
        **training_parameters,
        **optimization_parameters
    )
    return {
        # map the output of the fine tuning job to the output of pipeline job so that we can easily register the fine tuned model
        # registering the model is required to deploy the model to an online or batch endpoint
        "trained_model": chat_completion_pipeline.outputs.mlflow_model_folder
    }


pipeline_object = create_pipeline()

# don't use cached results from previous jobs
pipeline_object.settings.force_rerun = True

# set continue on step failure to False
pipeline_object.settings.continue_on_step_failure = False

Validate the pipeline against data and compute

In [None]:
# comment this section to disable validation
# Make sure to turn off the validation if your data is too big. Alternatively, validate the run with small data before launching runs with large datasets

# %run ../../pipeline_validations/common.ipynb

# validate_pipeline(pipeline_object, workspace_ml_client)

Submit the job

In [111]:
# submit the pipeline job
pipeline_job = workspace_ml_client.jobs.create_or_update(
    pipeline_object, experiment_name=experiment_name
)
# wait for the pipeline job to complete
# workspace_ml_client.jobs.stream(pipeline_job.name)

In [117]:
workspace_ml_client.jobs.stream(pipeline_job.name)

RunId: eager_glass_bxty79y6hh
Web View: https://ml.azure.com/runs/eager_glass_bxty79y6hh?wsid=/subscriptions/840b5c5c-3f4a-459a-94fc-6bad2a969f9d/resourcegroups/ml/workspaces/ws01ent

Execution Summary
RunId: eager_glass_bxty79y6hh
Web View: https://ml.azure.com/runs/eager_glass_bxty79y6hh?wsid=/subscriptions/840b5c5c-3f4a-459a-94fc-6bad2a969f9d/resourcegroups/ml/workspaces/ws01ent



### 6. Register the fine tuned model with the workspace

We will register the model from the output of the fine tuning job. This will track lineage between the fine tuned model and the fine tuning job. The fine tuning job, further, tracks lineage to the foundation model, data and training code.

In [112]:
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

# check if the `trained_model` output is available
print("pipeline job outputs: ", workspace_ml_client.jobs.get(pipeline_job.name).outputs)

# build the path to the fine tuned model from the pipeline job output
model_path_from_job = "azureml://jobs/{0}/outputs/{1}".format(
    pipeline_job.name, "trained_model"
)

finetuned_model_name = model_name + "-nlp2sql_v6"
finetuned_model_name = finetuned_model_name.replace("/", "-")
print("path to register model: ", model_path_from_job)
prepare_to_register_model = Model(
    path=model_path_from_job,
    type=AssetTypes.MLFLOW_MODEL,
    name=finetuned_model_name,
    version=timestamp,  # use timestamp as version to avoid version conflict
    description=model_name + " fine tuned model for nlp2sql",
)
print("prepare to register model: \n", prepare_to_register_model)
# register the model from pipeline job output
registered_model = workspace_ml_client.models.create_or_update(
    prepare_to_register_model
)
print("registered model: \n", registered_model)

pipeline job outputs:  {'trained_model': <azure.ai.ml.entities._job.pipeline._io.base.PipelineOutput object at 0x000001FB04195EA0>}
path to register model:  azureml://jobs/eager_glass_bxty79y6hh/outputs/trained_model
prepare to register model: 
 description: Phi-3-mini-128k-instruct fine tuned model for nlp2sql
name: Phi-3-mini-128k-instruct-nlp2sql_v6
path: azureml://jobs/eager_glass_bxty79y6hh/outputs/trained_model
properties: {}
tags: {}
type: mlflow_model
version: '1714449363'

registered model: 
 creation_context:
  created_at: '2024-04-30T13:15:53.192444+00:00'
  created_by: James Nguyen
  created_by_type: User
  last_modified_at: '2024-04-30T13:15:53.192444+00:00'
  last_modified_by: James Nguyen
  last_modified_by_type: User
description: Phi-3-mini-128k-instruct fine tuned model for nlp2sql
flavors:
  hftransformersv2:
    code: code
    hf_config_class: AutoConfig
    hf_predict_module: predict_phi
    hf_pretrained_class: AutoModelForCausalLM
    hf_tokenizer_class: AutoToken

### 7. Deploy the fine tuned model to an online endpoint
Online endpoints give a durable REST API that can be used to integrate with applications that need to use the model.

In [113]:
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    ProbeSettings,
    OnlineRequestSettings,
)

# Create online endpoint - endpoint names need to be unique in a region; this example uses a fixed name, so change it (for example by adding the timestamp) if it is already taken

online_endpoint_name = "nlp2sqlphi3-completion"
# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Online endpoint for "
    + registered_model.name
    + ", fine tuned model for nlpsqlchat-completion",
    auth_mode="key",
)
workspace_ml_client.begin_create_or_update(endpoint).wait()

You can find here the list of SKU's supported for deployment - [Managed online endpoints SKU list](https://learn.microsoft.com/en-us/azure/machine-learning/reference-managed-online-endpoints-vm-sku-list)

In [114]:
# create a deployment
demo_deployment = ManagedOnlineDeployment(
    name="demo",
    endpoint_name=online_endpoint_name,
    model=registered_model.id,
    instance_type="standard_nc24ads_a100_v4",
    instance_count=1,
    liveness_probe=ProbeSettings(initial_delay=600),
    request_settings=OnlineRequestSettings(request_timeout_ms=90000),
)
workspace_ml_client.online_deployments.begin_create_or_update(demo_deployment).wait()
endpoint.traffic = {"demo": 100}
workspace_ml_client.begin_create_or_update(endpoint).result()

Check: endpoint nlp2sqlphi3-completion exists


..............................................................................................................

ManagedOnlineEndpoint({'public_network_access': 'Enabled', 'provisioning_state': 'Succeeded', 'scoring_uri': 'https://nlp2sqlphi3-completion.westus2.inference.ml.azure.com/score', 'openapi_uri': 'https://nlp2sqlphi3-completion.westus2.inference.ml.azure.com/swagger.json', 'name': 'nlp2sqlphi3-completion', 'description': 'Online endpoint for Phi-3-mini-128k-instruct-nlp2sql_v6, fine tuned model for nlpsqlchat-completion', 'tags': {}, 'properties': {'azureml.onlineendpointid': '/subscriptions/840b5c5c-3f4a-459a-94fc-6bad2a969f9d/resourcegroups/ml/providers/microsoft.machinelearningservices/workspaces/ws01ent/onlineendpoints/nlp2sqlphi3-completion', 'AzureAsyncOperationUri': 'https://management.azure.com/subscriptions/840b5c5c-3f4a-459a-94fc-6bad2a969f9d/providers/Microsoft.MachineLearningServices/locations/westus2/mfeOperationsStatus/oe:cf406cec-7620-4855-9880-0b18191dabd0:2b4d1cee-7f57-46f4-9299-69b6f71f74b9?api-version=2022-02-01-preview'}, 'print_as_yaml': True, 'id': '/subscriptions/
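
Because the endpoint uses `auth_mode="key"`, applications outside the notebook can call the scoring URI over plain HTTPS. The sketch below only builds such a request with the standard library and does not send it. The URI and key are placeholders, `build_scoring_request` is a hypothetical helper, and the payload mirrors the shape used in the scoring cell of this notebook.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, api_key, payload, deployment_name=None):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    if deployment_name is not None:
        # Routes the call to one deployment instead of following the traffic split
        headers["azureml-model-deployment"] = deployment_name
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        scoring_uri, data=body, headers=headers, method="POST"
    )


request = build_scoring_request(
    scoring_uri="https://my-endpoint.westus2.inference.ml.azure.com/score",  # placeholder
    api_key="<endpoint-key>",  # placeholder, never hard-code a real key
    payload={
        "input_data": {
            "input_string": [[{"role": "user", "content": "Hello"}]],
            "parameters": {"temperature": 0.6, "max_new_tokens": 200},
        },
        "params": {},
    },
    deployment_name="demo",
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request).read()
```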

### 8. Test the endpoint with sample data

We will fetch some sample data from the test dataset and submit it to the online endpoint for inference. We will then display the generated response, which can be compared with the ground-truth assistant message in the sample.

In [None]:
# read ./sql_dataset/sql_test.jsonl into a pandas dataframe
test_df = pd.read_json("./sql_dataset/sql_test.jsonl", lines=True)
# take one random sample
test_df = test_df.sample(n=1)
# rebuild index
test_df.reset_index(drop=True, inplace=True)
test_df.head(2)

In [None]:
test_df["messages"][0]

In [None]:
import json

# create a json object with the key "input_data"; its "input_string" value holds the chat messages of the sampled test row
parameters = {
    "temperature": 0.6,
    "top_p": 0.9,
    "do_sample": True,
    "max_new_tokens": 200,
}
test_json = {
    "input_data": {
        "input_string": [test_df["messages"][0]],
        "parameters": parameters,
    },
    "params": {},
}
# save the json object to a file named sample_score.json in the ./sql_dataset folder
with open("./sql_dataset/sample_score.json", "w") as f:
    json.dump(test_json, f)

In [116]:
# score the sample_score.json file using the online endpoint with the azureml endpoint invoke method
response = workspace_ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name="demo",
    request_file="./sql_dataset/sample_score.json",
)
print("raw response: \n", response, "\n")

raw response: 
 {"output": "To calculate the sales share of each product category in relation to the total sales, you can use the following SQL query:\n\n```sql\nSELECT \n    p.CategoryID, \n    SUM(od.UnitPrice * od.Quantity) AS CategorySales,\n    (SUM(od.UnitPrice * od.Quantity) / (SELECT SUM(UnitPrice * Quantity) FROM [Order Details])) AS SalesShare\nFROM Products p\nJOIN [Order Details] od ON p.ProductID = od.ProductID\nGROUP BY p.CategoryID;\n```\n\nThis query works as follows:\n\n1. It selects the `CategoryID` from the Products table.\n2. It calculates the total sales for each category by multiplying the `UnitPrice` and `Quantity` for each order detail record and summing them up. This is done using the `SUM(od.UnitPrice *"} 
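
The raw response is a JSON string whose `output` field mixes explanation and a fenced SQL block, and the answer above is cut off by `max_new_tokens`. As a minimal sketch, the snippet below pulls out just the SQL and tolerates a missing closing fence. `extract_sql` is a hypothetical helper, and the sample string is a shortened stand-in for the real response.

```python
import json

# Built programmatically so this snippet can itself sit inside a fenced block
FENCE = "`" * 3


def extract_sql(raw_response):
    """Return the SQL inside the first fenced sql block of the response, or None."""
    text = json.loads(raw_response)["output"]
    start_marker = FENCE + "sql"
    start = text.find(start_marker)
    if start == -1:
        return None  # no SQL block in the answer
    start += len(start_marker)
    end = text.find(FENCE, start)
    # If generation stopped before the closing fence, keep what was produced
    sql = text[start:] if end == -1 else text[start:end]
    return sql.strip()


# Shortened stand-in for the response printed above
sample = json.dumps(
    {
        "output": "You can use:\n\n"
        + FENCE
        + "sql\nSELECT 1;\n"
        + FENCE
        + "\n\nThis works because ..."
    }
)
print(extract_sql(sample))
```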



### 9. Delete the online endpoint
Don't forget to delete the online endpoint; otherwise the billing meter keeps running for the compute used by the endpoint.

In [None]:
workspace_ml_client.online_endpoints.begin_delete(name=online_endpoint_name).wait()