# Multi-step pipeline example

In this example, we'll build a two-step pipeline that passes data from the first step (prepare) to the second step (train).

**Note:** This example requires that you have run the notebook from the first tutorial, so that the dataset and compute cluster are set up.

In [None]:
import os
import azureml.core
from azureml.core import Workspace, Experiment, Dataset, RunConfiguration
from azureml.pipeline.core import Pipeline, PipelineData, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig

print("Azure ML SDK version:", azureml.core.VERSION)

First, we will connect to the workspace. The command `Workspace.from_config()` will either:
* Read the local `config.json` with the workspace reference (if it exists), or
* Use the `az` CLI to connect to the workspace that was attached to the folder via `az ml folder attach -g <resource group> -w <workspace name>`

In [None]:
ws = Workspace.from_config()
print(f'WS name: {ws.name}\nRegion: {ws.location}\nSubscription id: {ws.subscription_id}\nResource group: {ws.resource_group}')
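
For reference, the `config.json` mentioned above is a small JSON file with three keys: `subscription_id`, `resource_group` and `workspace_name`. The sketch below writes one with placeholder values into a temporary folder and reads it back. The helper name `write_workspace_config` is made up for this illustration, and the placeholder values need to be replaced with your own before `Workspace.from_config()` can use the file.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values; replace them with your own subscription, resource group and workspace.
workspace_config = {
    "subscription_id": "<subscription id>",
    "resource_group": "<resource group>",
    "workspace_name": "<workspace name>",
}


def write_workspace_config(directory, config):
    """Write `config` as config.json into `directory` and return the file path."""
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=4))
    return path


# Write the file into a temporary folder and read it back to check its shape.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_workspace_config(tmp, workspace_config)
    loaded = json.loads(config_path.read_text())
    print(sorted(loaded))
```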

Next, let's reference our training dataset from the last tutorial, so that we can use it as the pipeline input for the prepare step:

In [None]:
# Use our dataset as the default (applies when the parameter is not set at pipeline invocation)
default_training_dataset = Dataset.get_by_name(ws, "german-credit-train-tutorial")

# Parametrize dataset input to the pipeline
training_dataset_parameter = PipelineParameter(name="training_dataset", default_value=default_training_dataset)
training_dataset_consumption = DatasetConsumptionConfig("training_dataset", training_dataset_parameter).as_download()


Let's also define a `PipelineData` placeholder which will be used to persist and pipe data from the prepare step to the train step:

In [None]:
default_datastore = ws.get_default_datastore()
prepared_data = PipelineData("prepared_data", datastore=default_datastore)


Next, we can create our two-step pipeline, which runs some preprocessing on the data and then pipes the output to the training code. In this case, we use a separate `runconfig` for each step. The dependency graph is resolved automatically from the data inputs and outputs, but we could also define it ourselves if desired:

In [None]:
prepare_runconfig = RunConfiguration.load("prepare_runconfig.yml")

prepare_step = PythonScriptStep(name="prepare-step",
                        runconfig=prepare_runconfig,
                        source_directory="./",
                        script_name=prepare_runconfig.script,
                        arguments=['--data-input-path', training_dataset_consumption,
                                   '--data-output-path', prepared_data],
                        inputs=[training_dataset_consumption],
                        outputs=[prepared_data],
                        allow_reuse=False)

train_runconfig = RunConfiguration.load("train_runconfig.yml")

train_step = PythonScriptStep(name="train-step",
                        runconfig=train_runconfig,
                        source_directory="./",
                        script_name=train_runconfig.script,
                        arguments=['--data-path', prepared_data],
                        inputs=[prepared_data],
                        allow_reuse=False)

train_step.run_after(prepare_step) # optional here: the prepared_data dependency already orders the steps
steps = [prepare_step, train_step]
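
The prepare step receives its paths through `--data-input-path` and `--data-output-path`, so the script it runs has to parse those two arguments. The actual script is not part of this notebook; what follows is only a minimal sketch of what such a script could look like. The function names and the "copy every file" preprocessing are placeholders, and creating the output folder with `os.makedirs` is an assumption that the folder may not exist yet when the step starts.

```python
import argparse
import os
import shutil
import tempfile


def parse_args(argv=None):
    """Parse the two path arguments that the prepare step passes in."""
    parser = argparse.ArgumentParser(description="Hypothetical prepare script")
    parser.add_argument("--data-input-path", required=True,
                        help="Folder the input dataset was downloaded to")
    parser.add_argument("--data-output-path", required=True,
                        help="Folder backing the prepared_data output")
    return parser.parse_args(argv)


def prepare(input_path, output_path):
    """Placeholder preprocessing: copy every input file to the output folder."""
    # Assumption: the output folder may not exist yet, so create it first.
    os.makedirs(output_path, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(input_path)):
        source = os.path.join(input_path, name)
        if os.path.isfile(source):
            shutil.copy(source, os.path.join(output_path, name))
            copied.append(name)
    return copied


def main(argv=None):
    args = parse_args(argv)
    return prepare(args.data_input_path, args.data_output_path)


# Local dry run: temporary folders stand in for the paths Azure ML would pass.
with tempfile.TemporaryDirectory() as tmp:
    input_dir = os.path.join(tmp, "input")
    output_dir = os.path.join(tmp, "prepared")
    os.makedirs(input_dir)
    with open(os.path.join(input_dir, "sample.csv"), "w") as handle:
        handle.write("a,b\n1,2\n")
    copied_files = main(["--data-input-path", input_dir,
                         "--data-output-path", output_dir])
    output_exists = os.path.isfile(os.path.join(output_dir, "sample.csv"))
```

Passing `argv` explicitly keeps the script testable on a local machine without submitting a pipeline run.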

Now we can create our pipeline object and validate it. This checks that the inputs and outputs are properly linked and that the pipeline graph is acyclic:

In [None]:
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline.validate()
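
To make the idea of a dependency graph derived from data inputs and outputs more concrete, here is a small standalone sketch. It is not how Azure ML implements validation; it only illustrates the principle with the standard library (`graphlib`, Python 3.9+). The `resolve_order` function and the plain-dict step descriptions are invented for this illustration.

```python
from graphlib import CycleError, TopologicalSorter


def resolve_order(steps):
    """Return step names in an order where producers run before consumers.

    `steps` maps a step name to {"inputs": [...], "outputs": [...]}.
    Raises CycleError if the steps depend on each other in a cycle.
    """
    # Map each named output to the step that produces it.
    producers = {}
    for name, spec in steps.items():
        for output in spec["outputs"]:
            producers[output] = name

    # A step depends on every step that produces one of its inputs.
    # Inputs without a producer (such as the source dataset) add no dependency.
    graph = {
        name: {producers[i] for i in spec["inputs"] if i in producers}
        for name, spec in steps.items()
    }
    return list(TopologicalSorter(graph).static_order())


# The two steps of this notebook, described as plain dictionaries.
example_steps = {
    "train-step": {"inputs": ["prepared_data"], "outputs": []},
    "prepare-step": {"inputs": ["training_dataset"], "outputs": ["prepared_data"]},
}
order = resolve_order(example_steps)
print(order)
```

Because `train-step` consumes `prepared_data`, which `prepare-step` produces, the prepare step is ordered first even though it is listed second.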

Lastly, we can submit the pipeline against an experiment:

In [None]:
pipeline_run = Experiment(ws, 'prepare-training-pipeline').submit(pipeline)
pipeline_run.wait_for_completion()

Alternatively, we can publish the pipeline as a REST endpoint. In this case, you can specify the dataset when invoking the pipeline. This is easy to do in the Studio UI: go to `Endpoints`, then `Pipeline Endpoints`, and select the pipeline. Once you hit the submit button, you can select the dataset at the bottom of the window.

In [None]:
published_pipeline = pipeline.publish('prepare-training-pipeline')
published_pipeline
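
The published pipeline can also be invoked programmatically by sending a POST request to `published_pipeline.endpoint`. The sketch below only builds the JSON request body for choosing a different dataset. The key names (`ExperimentName`, `DataSetDefinitionValueAssignments`, `SavedDataSetReference`) are assumed to match the payload format described in the Azure ML documentation for published pipelines and should be checked against the current docs. The helper `build_pipeline_request` is made up for this illustration, and the dataset id is a placeholder. Sending the request additionally requires an Azure AD bearer token in the `Authorization` header, which is not shown here.

```python
import json


def build_pipeline_request(experiment_name, dataset_parameter, dataset_id):
    """Build the request body for invoking a published pipeline with a dataset override.

    Assumption: the key names follow the REST payload format for published
    pipelines; verify them against the current Azure ML documentation.
    """
    return {
        "ExperimentName": experiment_name,
        "DataSetDefinitionValueAssignments": {
            # The key must match the PipelineParameter name used in the pipeline.
            dataset_parameter: {"SavedDataSetReference": {"Id": dataset_id}},
        },
    }


request_body = build_pipeline_request(
    experiment_name="prepare-training-pipeline",
    dataset_parameter="training_dataset",
    dataset_id="<dataset id>",  # placeholder for the id of a registered dataset
)
print(json.dumps(request_body, indent=2))
```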