# Single-step pipeline example

In this example, we'll build a simple pipeline that contains a single training step. The dataset and compute cluster created in this tutorial will be re-used in the subsequent examples in this module.

In [None]:
!pip install azureml-sdk --upgrade

In [None]:
import os
import azureml.core
from azureml.core import Workspace, Experiment, Dataset, RunConfiguration, Environment
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig

print("Azure ML SDK version:", azureml.core.VERSION)

First, we will connect to the workspace. The command `Workspace.from_config()` will either:
* Read the local `config.json` with the workspace reference (if it exists), or
* Use the `az` CLI to connect to the workspace that was attached to this folder via `az ml folder attach -g <resource group> -w <workspace name>`
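
For the first option, the `config.json` file is a small JSON document that identifies the workspace. The sketch below shows the three keys it typically contains and writes them to a temporary directory so it has no side effects. The values are placeholders and the helper name `write_workspace_config` is hypothetical; in practice you would usually download this file from the Azure portal and place it next to the notebook.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values (assumptions): replace them with your own workspace details.
workspace_config = {
    "subscription_id": "<subscription id>",
    "resource_group": "<resource group>",
    "workspace_name": "<workspace name>",
}


def write_workspace_config(directory, config):
    """Write `config` as config.json into `directory` and return the file path."""
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=4))
    return path


# Write to a temporary directory and read the file back to show its structure.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_workspace_config(tmp, workspace_config)
    loaded = json.loads(config_path.read_text())
    print(sorted(loaded))
```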

In [None]:
ws = Workspace.from_config()
print(f'WS name: {ws.name}\nRegion: {ws.location}\nSubscription id: {ws.subscription_id}\nResource group: {ws.resource_group}')

# Preparation

Let's quickly create a compute cluster named `cpu-cluster`, in case it does not exist yet. This is where our pipeline will run.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

aml_compute_target = "cpu-cluster"
try:
    # Re-use the cluster if it already exists in the workspace
    aml_compute = AmlCompute(ws, aml_compute_target)
except ComputeTargetException:
    # Otherwise create it; min_nodes=0 lets the cluster scale down to zero nodes when idle
    config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2", min_nodes=0, max_nodes=1,
                                                   idle_seconds_before_scaledown=3600)
    aml_compute = ComputeTarget.create(ws, aml_compute_target, config)
    aml_compute.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

Next, let's create an AzureML Environment. It holds the dependencies we need to execute our code inside the pipeline. We'll re-use the same Environment throughout most of the samples in this repo. Typically, you might use different Environments throughout the lifecycle (e.g., one for training and one for inferencing), but for the sake of simplicity, we'll keep it to one here.

In [None]:
env = Environment.from_conda_specification(name='workshop-env', file_path='conda.yml')
env.register(workspace=ws)

We can build the environment right away. This creates a new Docker image in Azure Container Registry (ACR) and 'bakes in' our dependencies from the conda definition. When we later use the Environment, all AML needs to do is pull the image for the environment, which saves the time of a potentially long-running conda environment creation.

In [None]:

build = env.build(workspace=ws)
build.wait_for_completion(show_output=True)

Next, we'll create a new dataset and register it with the workspace. We'll also use this dataset in the subsequent pipelines.

In [None]:
from azureml.core import Dataset

datastore = ws.get_default_datastore()
datastore.upload(src_dir='../data-training', target_path='german-credit-train-tutorial', overwrite=True)
ds = Dataset.File.from_files(path=[(datastore, 'german-credit-train-tutorial')])
ds.register(ws, name='german-credit-train-tutorial', description='Dataset for workshop tutorials', create_new_version=True)

Next, let's reference our newly created training dataset, so that we can use it as the pipeline input:

In [None]:
training_dataset = Dataset.get_by_name(ws, "german-credit-train-tutorial")
# Download the dataset to the compute node - we can use .as_mount() instead if the dataset is too large to download
training_dataset_consumption = DatasetConsumptionConfig("training_dataset", training_dataset).as_download()

Next, we can create a `PythonScriptStep` that runs our training code, referencing our code, passing in the args, and using the environment definition:

In [None]:
runconfig = RunConfiguration()
runconfig.environment = Environment.get(workspace=ws, name='workshop-env')

train_step = PythonScriptStep(name="train-step",
                              source_directory="./",
                              script_name="train.py",
                              arguments=['--data-path', training_dataset_consumption],
                              inputs=[training_dataset_consumption],
                              runconfig=runconfig,
                              compute_target='cpu-cluster',
                              allow_reuse=False)

steps = [train_step]
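
The step passes the dataset to the script through the `--data-path` argument; at run time, the consumption config is replaced by the local path where the dataset was downloaded (or mounted). The contents of `train.py` are not shown in this notebook, so the following is only a minimal sketch of how such a script might read that argument. The helper names `parse_args` and `list_data_files` and the file name `sample.csv` are hypothetical, and a temporary directory stands in for the downloaded dataset.

```python
import argparse
import os
import tempfile


def parse_args(argv=None):
    """Parse the arguments passed by the pipeline step (argv=None reads sys.argv)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-path", type=str, required=True,
                        help="Local path of the downloaded or mounted dataset")
    return parser.parse_args(argv)


def list_data_files(data_path):
    """Return all files below data_path, sorted, so the script can load them."""
    files = []
    for root, _dirs, names in os.walk(data_path):
        for name in names:
            files.append(os.path.join(root, name))
    return sorted(files)


# Local demo: a temporary directory plays the role of the dataset path.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "sample.csv"), "w").close()  # hypothetical data file
    args = parse_args(["--data-path", tmp])
    found = [os.path.basename(f) for f in list_data_files(args.data_path)]
    print(found)
```

Inside the real script you would call `parse_args()` without arguments so that it reads the values supplied by the step.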

Now we can create our pipeline object and validate it. Validation checks that the inputs and outputs are properly linked and that the pipeline graph is acyclic:

In [None]:
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline.validate()
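
To make the acyclic-graph requirement concrete, here is a small, self-contained illustration of a cycle check based on Kahn's algorithm. It is not the SDK's implementation, only a sketch of the idea: steps are nodes, each step lists the steps it depends on, and a valid pipeline must admit an execution order. The function name `execution_order` and the step name `evaluate-step` are hypothetical.

```python
def execution_order(dependencies):
    """Return a valid execution order for the steps, or raise ValueError on a cycle.

    `dependencies` maps each step name to the list of steps it depends on.
    """
    # Collect every step, including ones that only appear as a dependency.
    nodes = set(dependencies)
    for deps in dependencies.values():
        nodes.update(deps)

    # Unresolved dependencies per step.
    remaining = {node: set(dependencies.get(node, ())) for node in nodes}
    ready = sorted(node for node, deps in remaining.items() if not deps)
    order = []

    while ready:
        node = ready.pop(0)
        order.append(node)
        del remaining[node]
        # The finished step no longer blocks the steps that depend on it.
        for other in sorted(remaining):
            deps = remaining[other]
            if node in deps:
                deps.remove(node)
                if not deps:
                    ready.append(other)

    # Steps that never became ready are part of (or blocked by) a cycle.
    if remaining:
        raise ValueError(f"Cycle detected among steps: {sorted(remaining)}")
    return order


order = execution_order({"train-step": [], "evaluate-step": ["train-step"]})
print(order)
```

A graph such as `{"a": ["b"], "b": ["a"]}` has no valid order, so the function raises `ValueError`.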

Lastly, we can submit the pipeline against an experiment:

In [None]:
pipeline_run = Experiment(ws, 'training-pipeline').submit(pipeline)
pipeline_run.wait_for_completion()

Alternatively, we can publish the pipeline as a REST endpoint:

In [None]:
published_pipeline = pipeline.publish('training-pipeline')
published_pipeline
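
Once published, the pipeline can be triggered with an authenticated HTTP POST to its endpoint URL (available as `published_pipeline.endpoint`). The sketch below only builds such a request with the standard library and does not send it. The URL and token are placeholders, the helper name `build_pipeline_request` is hypothetical, and the `ExperimentName` / `ParameterAssignments` fields follow the request body shown in the Azure ML documentation; treat the details as assumptions to verify against your SDK version.

```python
import json
import urllib.request


def build_pipeline_request(endpoint_url, aad_token, experiment_name, parameters=None):
    """Build (but do not send) a POST request that would trigger a published pipeline."""
    body = {"ExperimentName": experiment_name}
    if parameters:
        # Optional values for pipeline parameters, keyed by parameter name.
        body["ParameterAssignments"] = parameters
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {aad_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder URL and token (assumptions): use published_pipeline.endpoint and a real AAD token.
request = build_pipeline_request(
    "https://example.invalid/published-pipeline-endpoint",
    "<token>",
    "training-pipeline",
)
print(request.get_method(), json.loads(request.data))
```

Passing the request to `urllib.request.urlopen(request)` would submit the run; this is left out here so the example has no network dependency.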

What if we want to continuously publish new pipelines, but keep them available under the same URL as the previous version? For this, we can use [`PipelineEndpoint`](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelineendpoint?view=azure-ml-py), which keeps multiple `PublishedPipeline`s behind a single endpoint URL. It lets us set a `default_version`, which determines the `PublishedPipeline` that requests are routed to.

In [None]:
from azureml.pipeline.core import PipelineEndpoint

endpoint_name = "training-pipeline-endpoint"

try:
    pipeline_endpoint = PipelineEndpoint.get(workspace=ws, name=endpoint_name)
    # Endpoint already exists: add the new version and make it the default (requires a PublishedPipeline)
    pipeline_endpoint.add_default(published_pipeline)
except Exception:
    # Lookup failed, most likely because the endpoint does not exist yet: create it from the pipeline
    pipeline_endpoint = PipelineEndpoint.publish(workspace=ws,
                                                 name=endpoint_name,
                                                 pipeline=pipeline,
                                                 description="New Training Pipeline Endpoint")
