## Train Model

*[Florian Roscheck](https://www.linkedin.com/in/florianroscheck/), 2024-04-02*

In this notebook, we train a model on the TACO dataset using [Azure ML Automated Machine Learning](https://learn.microsoft.com/en-us/azure/machine-learning/concept-automated-ml?view=azureml-api-2). First, we connect to the Azure ML workspace. Then, we create a compute cluster for training the model. Next, we get a reference to the data asset we created earlier on Azure ML. Finally, we submit the training job to the Azure ML backend. After running this notebook, you can follow the training progress in the Azure ML web interface.

In [1]:
# Make a connection to the Azure ML workspace

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Read subscription, resource group, and workspace name from config.json
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
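
The `from_config` call above relies on a `config.json` file that describes the workspace. As a minimal sketch, using only the standard library, the following shows the three fields that file is expected to contain. The helper name `read_workspace_config` and the sample values are hypothetical and only serve as an illustration.

```python
import json
import tempfile
from pathlib import Path


def read_workspace_config(path):
    """Return (subscription_id, resource_group, workspace_name) from a config.json file."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    required = ("subscription_id", "resource_group", "workspace_name")
    missing = [key for key in required if key not in config]
    if missing:
        raise KeyError(f"config.json is missing: {', '.join(missing)}")
    return tuple(config[key] for key in required)


# Demo with placeholder values (hypothetical, not a real workspace)
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.json"
    config_path.write_text(
        json.dumps(
            {
                "subscription_id": "00000000-0000-0000-0000-000000000000",
                "resource_group": "my-resource-group",
                "workspace_name": "my-workspace",
            }
        ),
        encoding="utf-8",
    )
    workspace_config = read_workspace_config(config_path)

print(workspace_config)
```

If the connection cell fails, checking that such a file exists and contains these keys is a reasonable first debugging step.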

To train, we need a compute cluster, ideally one powerful enough to run multiple training jobs at once. Especially during a hackathon, where time is short, it pays to run as many training jobs in parallel as possible to keep the overall training time down. You can learn more about compute clusters in the [Azure ML documentation](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-cluster?view=azureml-api-2&tabs=python).

> Tip: If you are running on a budget, consider using low-priority VMs for training. They offer a significant price advantage, but they can be preempted when Azure needs the capacity back, which may interrupt a training run. You can learn more about them in the [Azure ML documentation](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-low-priority-batch?view=azureml-api-2&tabs=sdk).

In [1]:
# Define a name for the compute cluster

compute_name = "gpu-cluster-florian-lowprio"

In [1]:
# Define cluster and create it

from azure.ai.ml.entities import AmlCompute

cluster_basic = AmlCompute(
    name=compute_name,
    type="amlcompute",
    size="STANDARD_NC8AS_T4_V3",
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=120,
    tier='LowPriority' # Use low-priority VM (see tip above)
)
# begin_create_or_update returns a poller; .result() waits until the cluster exists
ml_client.begin_create_or_update(cluster_basic).result()

Now we are ready to start the automated machine learning training on our data. Let's get a reference to the data asset and then feed that reference into the training job. Note that we set the validation data size to 20% of the training data to get a reasonable estimate of the model's performance.

In [1]:
# Get reference to the data asset with the modified annotations we created earlier

data = ml_client.data.get("TACO-annotations", version="1")

In [1]:
# Define training job

from azure.ai.ml import automl
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import Input

image_object_detection_job = automl.image_instance_segmentation(
    compute=compute_name,
    experiment_name='taco-flrs',
    training_data=Input(type=AssetTypes.MLTABLE, path=data.id),
    validation_data_size=0.2,
    target_column_name="label",
)
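
To make the 20% figure concrete, here is a minimal sketch of a random holdout split in plain Python. It is an illustration only: how Azure AutoML selects the validation images internally is not shown in this notebook, and the helper `holdout_split` as well as the 1,000 image IDs are hypothetical.

```python
import random


def holdout_split(items, validation_size, seed=0):
    """Randomly split items into (training, validation) lists."""
    if not 0 < validation_size < 1:
        raise ValueError("validation_size must be between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_validation = round(len(shuffled) * validation_size)
    return shuffled[n_validation:], shuffled[:n_validation]


image_ids = list(range(1000))  # hypothetical: 1,000 annotated images
train_ids, validation_ids = holdout_split(image_ids, validation_size=0.2)
print(len(train_ids), len(validation_ids))
```

With `validation_size=0.2`, one in five images is held back for evaluation and never used to fit the model.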

In the next cell, we define the limits for the training job. It makes sense to set the maximum number of trials (`max_trials`) to a multiple of the cluster's `max_instances`, and to set `max_concurrent_trials` to the same value as `max_instances`. This way, every node in the cluster stays busy and the trials run in parallel as much as possible.

In [1]:
# Set job limits

image_object_detection_job.set_limits(max_trials=8, max_concurrent_trials=4)
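
The reasoning behind these two numbers can be sketched in a few lines. The helpers `plan_job_limits` and `sequential_rounds` are hypothetical, and the sketch assumes that each trial occupies exactly one cluster node and that trials take roughly the same time.

```python
import math


def plan_job_limits(max_instances, rounds=2):
    """Derive limits that keep every cluster node busy in each round."""
    return {
        "max_trials": rounds * max_instances,
        "max_concurrent_trials": max_instances,
    }


def sequential_rounds(max_trials, max_concurrent_trials):
    """Number of back-to-back rounds needed to run all trials."""
    return math.ceil(max_trials / max_concurrent_trials)


limits = plan_job_limits(max_instances=4, rounds=2)
print(limits)

# 8 trials on 4 nodes fit into 2 full rounds
print(sequential_rounds(**limits))

# 9 trials would need a third round in which 3 of the 4 nodes sit idle
print(sequential_rounds(max_trials=9, max_concurrent_trials=4))
```

Under these assumptions, a dictionary like `limits` could be passed on as `set_limits(**limits)`, and a trial count that is not a multiple of the node count adds a mostly idle extra round.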

Ready for action? Then let's submit the job to Azure ML. We can follow the progress in the Azure ML web interface.

In [1]:
# Submit the AutoML job
returned_job = ml_client.jobs.create_or_update(
    image_object_detection_job
)  # submit the job to the backend

print(f"Created job: {returned_job}")

Created job: compute: azureml:gpu-cluster-florian-lowprio
creation_context:
  created_at: '2024-04-01T13:39:35.483134+00:00'
  created_by: Florian Roscheck
  created_by_type: User
display_name: olden_head_4857xq834s
experiment_name: taco-flrs
id: azureml:/subscriptions/***/resourceGroups/***/providers/Microsoft.MachineLearningServices/workspaces/***/jobs/olden_head_4857xq834s
limits:
  max_concurrent_trials: 2
  max_trials: 2
  timeout_minutes: 10080
log_verbosity: info
name: olden_head_4857xq834s
outputs: {}
primary_metric: mean_average_precision
properties: {}
resources:
  instance_count: 1
  shm_size: 2g
services:
  Studio:
    endpoint: https://ml.azure.com/runs/olden_head_4857xq834s?wsid=/subscriptions/***/resourcegroups/***/workspaces/***&tid=***
  Tracking:
    endpoint: azureml://westeurope.api.azureml.ms/mlflow/v1.0/subscriptions/***/resourceGroups/***/providers/Microsoft.MachineLearningServices/workspaces/***?
status: NotStarted
tags: {}
target_column_name: label
task: image_
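
The output above shows the job in status `NotStarted`. Besides the web interface, the status can also be checked from Python. The following is a minimal sketch of a polling loop. The helper `wait_for_job` is hypothetical, the set of terminal states is an assumption, and the demo runs offline against a scripted status sequence rather than a real job.

```python
import time

# Terminal states assumed here; check the SDK documentation for the full list.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_job(get_status, poll_seconds=60, max_polls=None, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state, then return it."""
    polls = 0
    while True:
        status = get_status()
        print(f"Job status: {status}")
        if status in TERMINAL_STATES:
            return status
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return status  # give up and report the last status seen
        sleep(poll_seconds)


# Offline demo with a scripted sequence instead of a real job
fake_statuses = iter(["NotStarted", "Running", "Running", "Completed"])
final_status = wait_for_job(lambda: next(fake_statuses), poll_seconds=0)
```

In this notebook, the status callable could be `lambda: ml_client.jobs.get(returned_job.name).status`. Alternatively, `ml_client.jobs.stream(returned_job.name)` blocks and prints the job logs until the job finishes.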

Azure Automated Machine Learning uses [Mask R-CNN](https://arxiv.org/abs/1703.06870) as the default model for instance segmentation tasks. At the time of writing this notebook, Microsoft also offers, as a preview feature, the option to use other instance segmentation models. You can learn more about this in section 4.2.1 ("Individual runs with models from MMDetection (Preview)") of [this Jupyter Notebook by Microsoft](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/automl-standalone-jobs/automl-image-instance-segmentation-task-fridge-items/automl-image-instance-segmentation-task-fridge-items.ipynb).