# Multi-step pipeline example

In this example, we'll build a three-step pipeline that passes data from the first step (prepare) to the second step (train) and then registers the trained model in a third step (register).

**Note:** This example requires that you have run the notebook from the first tutorial, so that the dataset, environment, and compute cluster are set up.

In [None]:
import os
import azureml.core
from azureml.core import Workspace, Experiment, Dataset, RunConfiguration, Environment
from azureml.pipeline.core import Pipeline, PipelineData, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig

print("Azure ML SDK version:", azureml.core.VERSION)

First, we will connect to the workspace. The call `Workspace.from_config()` will either:
* Read the local `config.json` with the workspace reference (if it exists), or
* Use the `az` CLI to connect to the workspace that was attached to this folder via `az ml folder attach -g <resource group> -w <workspace name>`

In [None]:
ws = Workspace.from_config()
print(f'WS name: {ws.name}\nRegion: {ws.location}\nSubscription id: {ws.subscription_id}\nResource group: {ws.resource_group}')
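
For reference, the `config.json` that `Workspace.from_config()` reads is a small JSON file with three keys. The sketch below writes one with placeholder values and checks that no key is missing; the helper names, the values, and the temporary directory are hypothetical, and in practice you would typically download the file from the Azure portal rather than generate it.

```python
import json
import tempfile
from pathlib import Path

# Keys expected in config.json; the values are placeholders, not real IDs
workspace_config = {
    "subscription_id": "<subscription id>",
    "resource_group": "<resource group>",
    "workspace_name": "<workspace name>",
}

def write_config(directory, config):
    """Write config.json into `directory` and return its path."""
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=2))
    return path

def missing_keys(path):
    """Return the required keys that are absent from the file at `path`."""
    loaded = json.loads(Path(path).read_text())
    return sorted(set(workspace_config) - set(loaded))

with tempfile.TemporaryDirectory() as tmp:
    config_path = write_config(tmp, workspace_config)
    print("Missing keys:", missing_keys(config_path))
```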

Next, let's reference our training dataset from the last tutorial, so that we can use it as the pipeline input for the prepare step:

In [None]:
# Use our dataset as the default (applies when no dataset is passed at pipeline invocation)
default_training_dataset = Dataset.get_by_name(ws, "german-credit-train-tutorial")

# Parametrize dataset input to the pipeline
training_dataset_parameter = PipelineParameter(name="training_dataset", default_value=default_training_dataset)
training_dataset_consumption = DatasetConsumptionConfig("training_dataset", training_dataset_parameter).as_download()


Let's also define a `PipelineData` placeholder which will be used to persist and pipe data from the prepare step to the train step:

In [None]:
default_datastore = ws.get_default_datastore()
prepared_data = PipelineData("prepared_data", datastore=default_datastore)

Next, we can create our three-step pipeline that runs some preprocessing on the data and then pipes the output to the training code. AML resolves the dependency graph automatically from the data inputs and outputs. The register step has no data dependency on the train step, so we need to tell AML explicitly that registration should happen last:

In [None]:
runconfig = RunConfiguration()
runconfig.environment = Environment.get(workspace=ws, name='workshop-env')

prepare_step = PythonScriptStep(name="prepare-step",
                        source_directory="./",
                        script_name='prepare.py',
                        arguments=['--data-input-path', training_dataset_consumption,
                                   '--data-output-path', prepared_data],
                        inputs=[training_dataset_consumption],
                        outputs=[prepared_data],
                        runconfig=runconfig,
                        compute_target='cpu-cluster',
                        allow_reuse=False)

train_step = PythonScriptStep(name="train-step",
                        source_directory="./",
                        script_name='train.py',
                        arguments=['--data-path', prepared_data],
                        inputs=[prepared_data],
                        runconfig=runconfig,
                        compute_target='cpu-cluster',
                        allow_reuse=False)

register_step = PythonScriptStep(name="register-step",
                        source_directory="./",
                        script_name='register.py',
                        arguments=['--model_name', 'workshop-model', '--model_path', 'outputs/model.pkl'],
                        runconfig=runconfig,
                        compute_target='cpu-cluster',
                        allow_reuse=False)

register_step.run_after(train_step) # Required, as there is no implicit data dependency between the train and register steps
steps = [prepare_step, train_step, register_step]
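
At run time, the dataset input and the `PipelineData` placeholder in `arguments` are replaced with local paths on the compute node, so the script only has to parse two path arguments. The following is a minimal sketch of how `prepare.py` might handle them. The actual script from the tutorial is not shown here, so the helper names and the copy-only "preprocessing" are assumptions.

```python
import argparse
import os
import shutil
import tempfile

def parse_args(argv=None):
    """Parse the two path arguments passed by the prepare step."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-input-path", required=True)
    parser.add_argument("--data-output-path", required=True)
    return parser.parse_args(argv)

def prepare(input_path, output_path):
    """Copy every file from input_path to output_path (stand-in for real preprocessing)."""
    os.makedirs(output_path, exist_ok=True)  # Create the output folder if it is missing
    copied = []
    for name in sorted(os.listdir(input_path)):
        source = os.path.join(input_path, name)
        if os.path.isfile(source):
            shutil.copy(source, os.path.join(output_path, name))
            copied.append(name)
    return copied

# Local dry run with temporary folders instead of AML-provided paths
with tempfile.TemporaryDirectory() as tmp:
    input_dir = os.path.join(tmp, "input")
    output_dir = os.path.join(tmp, "output")
    os.makedirs(input_dir)
    with open(os.path.join(input_dir, "sample.csv"), "w") as f:
        f.write("a,b\n1,2\n")
    args = parse_args(["--data-input-path", input_dir, "--data-output-path", output_dir])
    print("Prepared files:", prepare(args.data_input_path, args.data_output_path))
```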

Finally, we can create our pipeline object and validate it. This checks that the inputs and outputs are properly linked and that the pipeline graph is acyclic:

In [None]:
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline.validate()
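
To make the ordering rules concrete, here is a small standard-library sketch of the idea (not AML's actual implementation): a step depends on whichever step produces one of its inputs, plus any step named explicitly, as `run_after` does. The `step_specs` dictionaries are a hypothetical, simplified mirror of the pipeline above, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical, simplified description of the three steps defined above
step_specs = {
    "prepare-step": {"inputs": ["training_dataset"], "outputs": ["prepared_data"], "run_after": []},
    "train-step": {"inputs": ["prepared_data"], "outputs": [], "run_after": []},
    "register-step": {"inputs": [], "outputs": [], "run_after": ["train-step"]},
}

def execution_order(specs):
    """Return step names in a valid execution order, or raise ValueError on a cycle."""
    # Map each output name to the step that produces it
    producer = {out: name for name, spec in specs.items() for out in spec["outputs"]}
    graph = {}
    for name, spec in specs.items():
        # Implicit dependencies: steps that produce one of this step's inputs
        deps = {producer[i] for i in spec["inputs"] if i in producer}
        # Explicit dependencies: the equivalent of run_after
        deps.update(spec["run_after"])
        graph[name] = deps
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as error:
        raise ValueError(f"Pipeline graph contains a cycle: {error.args[1]}") from error

print(execution_order(step_specs))
```

Without the `run_after` entry, `register-step` would have no dependencies in this model and could be scheduled at any point, which is why the explicit call is needed.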

Lastly, we can submit the pipeline against an experiment:

In [None]:
pipeline_run = Experiment(ws, 'prepare-training-pipeline').submit(pipeline)
pipeline_run.wait_for_completion()

Alternatively, we can publish the pipeline as a REST endpoint. In this case, you can specify the dataset when invoking the pipeline. A convenient way to do this is the Studio UI: go to `Endpoints`, then `Pipeline Endpoints`, and select the pipeline. After you click the submit button, you can select the dataset at the bottom of the window.

In [None]:
published_pipeline = pipeline.publish('prepare-training-pipeline')
published_pipeline

# Using just a path as input (instead of a dataset)

What if we want to run the pipeline without a registered dataset, using just a path on the datastore? This can make the pipeline easier to use for, e.g., batch scoring, as it removes the requirement to register a dataset. For this we can use a `DataPath`:

In [None]:
from azureml.data.datapath import DataPath, DataPathComputeBinding

default_datastore = ws.get_default_datastore()

# Define default Datapath and make it configurable via PipelineParameter
data_path = DataPath(datastore=default_datastore, path_on_datastore='german-credit-train-tutorial/')
datapath_parameter = PipelineParameter(name="training_data_path", default_value=data_path)
datapath_input = (datapath_parameter, DataPathComputeBinding(mode='download'))

# Same as in example above
prepared_data = PipelineData("prepared_data", datastore=default_datastore)

The pipeline stays the same, except that we swap out the data input of the first step and set it to the datapath:

In [None]:
runconfig = RunConfiguration()
runconfig.environment = Environment.get(workspace=ws, name='workshop-env')

prepare_step = PythonScriptStep(name="prepare-step",
                        source_directory="./",
                        script_name='prepare.py',
                        arguments=['--data-input-path', datapath_input,
                                   '--data-output-path', prepared_data],
                        inputs=[datapath_input],
                        outputs=[prepared_data],
                        runconfig=runconfig,
                        compute_target='cpu-cluster',
                        allow_reuse=False)

train_step = PythonScriptStep(name="train-step",
                        source_directory="./",
                        script_name='train.py',
                        arguments=['--data-path', prepared_data],
                        inputs=[prepared_data],
                        runconfig=runconfig,
                        compute_target='cpu-cluster',
                        allow_reuse=False)

register_step = PythonScriptStep(name="register-step",
                        source_directory="./",
                        script_name='register.py',
                        arguments=['--model_name', 'workshop-model', '--model_path', 'outputs/model.pkl'],
                        runconfig=runconfig,
                        compute_target='cpu-cluster',
                        allow_reuse=False)

register_step.run_after(train_step) # Required, as there is no implicit data dependency between the train and register steps
steps = [prepare_step, train_step, register_step]

Lastly, we can validate the pipeline, try it out, and publish it (same as before):

In [None]:
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline.validate()

In [None]:
pipeline_run = Experiment(ws, 'prepare-training-pipeline-datapath').submit(pipeline)
pipeline_run.wait_for_completion()

In [None]:
published_pipeline = pipeline.publish('prepare-training-pipeline-datapath')
published_pipeline
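
Once published, the pipeline can be triggered over REST with a different path for each run. The sketch below only builds the JSON request body and does not send anything. The `DataPathAssignments` layout follows the request format used in Microsoft's published-pipeline samples as far as we are aware, so check it against the current documentation before relying on it. The helper name, the `batch-scoring` experiment name, the `workspaceblobstore` datastore name, and the relative path are placeholders.

```python
import json

def build_run_request(experiment_name, parameter_name, datastore_name, relative_path):
    """Build a JSON body for submitting a published pipeline with a DataPath override."""
    return {
        "ExperimentName": experiment_name,
        # Assumed layout: one entry per DataPath pipeline parameter, keyed by parameter name
        "DataPathAssignments": {
            parameter_name: {
                "DataStoreName": datastore_name,
                "RelativePath": relative_path,
            }
        },
    }

body = build_run_request(
    experiment_name="batch-scoring",          # placeholder experiment name
    parameter_name="training_data_path",      # PipelineParameter name defined above
    datastore_name="workspaceblobstore",      # placeholder datastore name
    relative_path="german-credit-scoring/",   # placeholder path on the datastore
)
print(json.dumps(body, indent=2))
```

The body would then be sent with an HTTP POST to `published_pipeline.endpoint`, together with an Azure AD bearer token in the `Authorization` header.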