# Many models training pipeline

In this example, we'll use ParallelRunStep to train many models in parallel.

In [8]:

import os
import azureml.core
from azureml.core import Workspace, Experiment, Dataset, RunConfiguration, Environment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep, ParallelRunStep, ParallelRunConfig
from azureml.data import OutputFileDatasetConfig
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig

print("Azure ML SDK version:", azureml.core.VERSION)


Azure ML SDK version: 1.32.0


First, we connect to the workspace. `Workspace.from_config()` will either:
* Read the workspace reference from a local `config.json` (if one exists), or
* Use the `az` CLI to connect to the workspace that was attached to the folder via `az ml folder attach -g <resource group> -w <workspace name>`

In [2]:
ws = Workspace.from_config()
print(f'WS name: {ws.name}\nRegion: {ws.location}\nSubscription id: {ws.subscription_id}\nResource group: {ws.resource_group}')

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


WS name: aml-mlops-workshop
Region: westeurope
Subscription id: 43ab27bb-ee6c-4f68-b9cf-a26c4c454a4a
Resource group: aml-mlops-workshop
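
For reference, the `config.json` that `Workspace.from_config()` reads is a small JSON file with three keys. The sketch below writes such a file and reads it back using only the standard library. The placeholder values and the temporary target folder are assumptions for illustration; replace them with your own workspace details and location.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values (assumptions for illustration); use your own workspace details.
workspace_config = {
    "subscription_id": "<subscription id>",
    "resource_group": "<resource group>",
    "workspace_name": "<workspace name>",
}

# Written to a temporary folder so the sketch has no side effects;
# in a real project the file would sit next to the notebook or in a parent folder.
config_dir = Path(tempfile.mkdtemp()) / ".azureml"
config_dir.mkdir(parents=True)
config_path = config_dir / "config.json"
config_path.write_text(json.dumps(workspace_config, indent=2))

# Read the file back to confirm the three expected keys are present.
loaded_config = json.loads(config_path.read_text())
print(sorted(loaded_config))
```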


# Preparation

Let's upload the data for our many models (well, not so many in this example, but you'll get the idea):

In [3]:
from azureml.core import Dataset

datastore = ws.get_default_datastore()
datastore.upload(src_dir='../data-many-models', target_path='many-models-training', overwrite=True)

Uploading an estimated of 3 files
Uploading ../data-many-models\model1\data.csv
Uploaded ../data-many-models\model1\data.csv, 1 files out of an estimated total of 3
Uploading ../data-many-models\model2\data.csv
Uploaded ../data-many-models\model2\data.csv, 2 files out of an estimated total of 3
Uploading ../data-many-models\model3\data.csv
Uploaded ../data-many-models\model3\data.csv, 3 files out of an estimated total of 3
Uploaded 3 files


$AZUREML_DATAREFERENCE_197a1bd0fa644474bf8edb43db567cce

As the data for each model sits in its own folder, we register the dataset with a partition definition using `partition_format`. This lets us later parallelize the training over each partition key.

In [4]:
ds = Dataset.File.from_files(path=[(datastore, 'many-models-training')], partition_format='{model}/*.csv')
ds.register(ws, name='many-models-training-tutorial', description='Dataset for many models tutorial', create_new_version=True)

{
  "source": [
    "('workspaceblobstore', 'many-models-training')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "AddColumnsFromPartitionFormat"
  ],
  "registration": {
    "id": "64da9a7f-e080-4bdd-8992-e864d7e2c0a1",
    "name": "many-models-training-tutorial",
    "version": 2,
    "description": "Dataset for many models tutorial",
    "workspace": "Workspace.create(name='aml-mlops-workshop', subscription_id='43ab27bb-ee6c-4f68-b9cf-a26c4c454a4a', resource_group='aml-mlops-workshop')"
  }
}
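
To make the effect of `partition_format='{model}/*.csv'` more concrete, here is a simplified sketch of how such a format can map file paths to partition keys. It is not the SDK's implementation and ignores features such as date formats. The helper `extract_partition_keys` is hypothetical, and the paths are assumed to be relative to the dataset's root folder.

```python
import re


def extract_partition_keys(relative_path, partition_format):
    """Return the partition values encoded in a path, or None if it does not match."""
    # Normalize Windows separators so both path styles are handled the same way.
    path = relative_path.replace("\\", "/")
    pattern = ""
    # Split the format into {key} placeholders, '*' wildcards, and literal text.
    for token in re.split(r"(\{\w+\}|\*)", partition_format):
        if token.startswith("{") and token.endswith("}"):
            pattern += f"(?P<{token[1:-1]}>[^/]+)"  # one named path segment
        elif token == "*":
            pattern += "[^/]*"  # wildcard within a single path segment
        else:
            pattern += re.escape(token)  # literal text such as '/' or '.csv'
    match = re.fullmatch(pattern, path)
    return match.groupdict() if match else None


# Group the three uploaded files by their 'model' partition key.
files = ["model1/data.csv", "model2/data.csv", "model3/data.csv"]
partitions = {}
for file_path in files:
    keys = extract_partition_keys(file_path, "{model}/*.csv")
    partitions.setdefault(keys["model"], []).append(file_path)

print(partitions)
```

Each key in `partitions` corresponds to one unit of work that can be trained independently of the others.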

Next, let's reference our newly created partitioned training dataset so that we can use it as the pipeline input. Note that we set the access mode to `direct`; this is important so that the data is accessed efficiently during parallelization.

In [5]:
training_dataset = Dataset.get_by_name(ws, "many-models-training-tutorial")
training_dataset_consumption = DatasetConsumptionConfig("training_dataset", training_dataset, mode='direct')

Now let's create an output dataset that will contain our models. This gives us complete freedom over where the models are stored on the datastore. Depending on your use case (e.g., "train & store" or "train & score"), you might not need to save the generated models.

In [6]:
datastore = ws.get_default_datastore()

# This will put the output results into a pre-defined folder on our datastore and optionally register it as a dataset (not required)
models = OutputFileDatasetConfig(name='many_models',
                                 destination=(datastore, 'many_models/{run-id}')).register_on_complete(name='many-models')
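
The `{run-id}` placeholder in the destination is replaced with the ID of the run that writes the output, so each run ends up in its own folder. A tiny sketch of that substitution follows; `resolve_destination` is a hypothetical helper (not part of the SDK) and the run IDs are made up.

```python
def resolve_destination(path_template, run_id):
    """Hypothetical helper: mimic how a '{run-id}' placeholder is filled in."""
    return path_template.replace("{run-id}", run_id)


# Made-up run IDs, for illustration only.
first = resolve_destination("many_models/{run-id}", "run-0001")
second = resolve_destination("many_models/{run-id}", "run-0002")

print(first)
print(second)
```

Because the folder name differs per run, outputs of earlier runs are not overwritten by later ones.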


Next, we can create a `ParallelRunStep` that runs our training code in parallel on one or more nodes. In this case, we load the `ParallelRunConfig` from a YAML file that defines how the job is parallelized (source script, environment, scale, target cluster, etc.). A second `PythonScriptStep` then registers the trained models.

In [9]:


parallel_run_config = ParallelRunConfig.load_yaml(workspace=ws, path="parallel_runconfig.yml")

train_step = ParallelRunStep(
    name="train-many-models-step",
    parallel_run_config=parallel_run_config,
    arguments=['--model_output_path', models],
    inputs=[training_dataset_consumption],
    side_inputs=[],
    output=models,
    allow_reuse=False
)

model_name = 'many_models_demo'
runconfig = RunConfiguration()
runconfig.environment = Environment.get(workspace=ws, name='workshop-env')

register_step = PythonScriptStep(
    name="register-step",
    source_directory="./",
    script_name="register.py",
    arguments=['--model_name', model_name, '--model_path', models],
    inputs=[models],
    compute_target='cpu-cluster',
    runconfig=runconfig,
    allow_reuse=False
)

steps = [train_step, register_step]
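
The training script referenced by the YAML config is not shown in this notebook. As a rough sketch of what such an entry script could look like: a `ParallelRunStep` entry script provides an `init()` function that runs once per worker process and a `run(mini_batch)` function that runs once per mini-batch. The version below assumes each mini-batch arrives as a list of CSV file paths, uses only the standard library, and replaces real training with the mean of a hypothetical column `y`. The optional `argv` parameter, the column name, and the JSON "model" format are assumptions for illustration.

```python
import argparse
import csv
import json
import os
import statistics

output_path = None  # set in init()


def init(argv=None):
    """Runs once per worker process: read the arguments passed by the step."""
    global output_path
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_output_path", type=str, default="outputs")
    # parse_known_args ignores any additional arguments added by the runtime.
    args, _ = parser.parse_known_args(argv)
    output_path = args.model_output_path
    os.makedirs(output_path, exist_ok=True)


def run(mini_batch):
    """Runs once per mini-batch; assumed here to be a list of CSV file paths."""
    results = []
    for file_path in mini_batch:
        # Use the parent folder name (e.g. 'model1') as the model name.
        model_name = os.path.basename(os.path.dirname(file_path))
        with open(file_path, newline="") as f:
            rows = list(csv.DictReader(f))
        # Stand-in for real training: the "model" is just the mean of a
        # hypothetical numeric column called 'y'.
        model = {"mean_y": statistics.mean(float(row["y"]) for row in rows)}
        with open(os.path.join(output_path, f"{model_name}.json"), "w") as f:
            json.dump(model, f)
        results.append(f"{model_name}: trained on {len(rows)} rows")
    # One result entry per processed file.
    return results
```

Writing each model under `--model_output_path` is what lets the subsequent register step pick the files up from the `models` output.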

With both steps defined, we can create our pipeline object and validate it. This checks that the inputs and outputs are properly linked and that the pipeline graph is acyclic:

In [10]:
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline.validate()

'enabled' is deprecated. Please use the azureml.core.runconfig.DockerConfiguration object with the 'use_docker' param instead.


Step train-many-models-step is ready to be created [04408cba]
Step register-step is ready to be created [8d392c52]


[]
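
To illustrate what "acyclic" means for this pipeline, the following sketch models the two steps as a dependency graph and checks it with the standard library's `graphlib` (Python 3.9+). This is only a conceptual illustration, not how the SDK implements `validate()`; the second, cyclic graph is a hypothetical broken example.

```python
from graphlib import CycleError, TopologicalSorter  # requires Python 3.9+

# Each step maps to the set of steps it depends on. The register step
# consumes the `models` output of the training step.
dependencies = {
    "train-many-models-step": set(),
    "register-step": {"train-many-models-step"},
}

# A valid (acyclic) graph has a well-defined execution order.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)

# Hypothetical broken graph: the two steps depend on each other.
cyclic = {
    "train-many-models-step": {"register-step"},
    "register-step": {"train-many-models-step"},
}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```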

Now we can submit the pipeline to an experiment:

In [11]:
pipeline_run = Experiment(ws, 'many-models-training-pipeline').submit(pipeline)
pipeline_run.wait_for_completion()

Created step train-many-models-step [04408cba][418e8a6b-9c1b-4200-83bf-aef268bd0de3], (This step will run and generate new outputs)
Created step register-step [8d392c52][6d75b191-add4-416e-9d21-82644aa07e83], (This step will run and generate new outputs)
Submitted PipelineRun 9a0cabb7-d82d-49ea-a1f7-56255990600a
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/9a0cabb7-d82d-49ea-a1f7-56255990600a?wsid=/subscriptions/43ab27bb-ee6c-4f68-b9cf-a26c4c454a4a/resourcegroups/aml-mlops-workshop/workspaces/aml-mlops-workshop&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
PipelineRunId: 9a0cabb7-d82d-49ea-a1f7-56255990600a
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/9a0cabb7-d82d-49ea-a1f7-56255990600a?wsid=/subscriptions/43ab27bb-ee6c-4f68-b9cf-a26c4c454a4a/resourcegroups/aml-mlops-workshop/workspaces/aml-mlops-workshop&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
PipelineRun Status: NotStarted
PipelineRun Status: Running


StepRunId: 0e37ecbd-a4ea-4a10-b3dc-fee33bee0d

Last but not least, we can now download the resulting models dataset. For ease of use, we'll just download the models and the summary to a folder named `temp`:

In [None]:
Dataset.get_by_name(ws, "many-models").download(target_path="temp/", overwrite=True)
with open('temp/train_results.txt','r') as f:
    print(f.read())