# Multi-step pipeline example with R

In this example, we'll build a two-step pipeline in R that passes data from the first step (prepare) to the second step (train).

**Note:** This example requires that you've run the notebook from the first tutorial, so that the compute cluster is set up.

In [None]:
import os
import azureml.core
from azureml.core import Workspace, Experiment, Dataset, RunConfiguration
from azureml.pipeline.core import Pipeline, PipelineData, PipelineParameter
from azureml.pipeline.steps import RScriptStep
from azureml.core.environment import Environment, RSection
   
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig

print("Azure ML SDK version:", azureml.core.VERSION)

First, we will connect to the workspace. The command `Workspace.from_config()` will either:
* Read the local `config.json` that contains the workspace reference (if it exists), or
* Use the `az` CLI to connect to the workspace that was attached to this folder via `az ml folder attach -g <resource group> -w <workspace name>`

In [None]:
ws = Workspace.from_config()
print(f'WS name: {ws.name}\nRegion: {ws.location}\nSubscription id: {ws.subscription_id}\nResource group: {ws.resource_group}')
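
For reference, the `config.json` mentioned above is a small JSON file with three keys: `subscription_id`, `resource_group`, and `workspace_name`. The following is a minimal sketch, using only the Python standard library, of how such a file could be generated. The helper name `write_workspace_config` and the placeholder values are assumptions for illustration and are not part of the Azure ML SDK.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json with the three workspace keys and return its path.

    Hypothetical helper for illustration; not part of the Azure ML SDK.
    """
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=2))
    return path


# Demo: write into a temporary directory so nothing in the project is overwritten.
# The values are placeholders; replace them with your own identifiers.
demo_dir = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_dir, "<subscription id>", "<resource group>", "<workspace name>"
)
loaded = json.loads(config_path.read_text())
print(sorted(loaded))
```

If you already hold a `Workspace` object, calling `ws.write_config()` produces an equivalent file, so writing it by hand is only needed when no such object is available yet.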

Next, we'll upload the example data to the default datastore, create a new dataset from it for the R example, and register it with the workspace.

In [None]:
from azureml.core import Dataset

datastore = ws.get_default_datastore()
datastore.upload(src_dir='data/', target_path='r-pipeline', overwrite=True)
ds = Dataset.File.from_files(path=[(datastore, 'r-pipeline')])
ds.register(ws, name='r-pipeline-tutorial', description='Dataset for R pipeline tutorials', create_new_version=True)

Next, let's get our dataset ready for input to the training job:

In [None]:
training_dataset = Dataset.get_by_name(ws, "r-pipeline-tutorial")
training_dataset_consumption = DatasetConsumptionConfig("training_dataset", training_dataset).as_download(path_on_compute="/data")

Let's also define a `PipelineData` placeholder which will be used to persist and pipe data from the prepare step to the train step:

In [None]:
default_datastore = ws.get_default_datastore()
prepared_data = PipelineData("prepared_data", datastore=default_datastore)


Since R environments take quite a while to build, we'll use a Docker image that has all our dependencies built in. This ensures quick execution of the pipeline and avoids unnecessary, long-running image builds:

In [None]:
rc = RunConfiguration()
rc.framework='R'
rc.environment.r = RSection()
rc.environment.docker.enabled = True

# Replace with the name of your container registry!!!
rc.environment.docker.base_image = 'xxxxxxx.azurecr.io/r-tutorial:v1'

# Disable AML's automatic package installation and rely on the pre-built base image instead
rc.environment.r.user_managed = True
rc.environment.python.user_managed_dependencies = True

Next, we can create our two-step pipeline, which runs some preprocessing on the data and then pipes the output to the training code. The dependency graph is resolved automatically from the data inputs and outputs, but we could also define it ourselves if desired:

In [None]:
prepare_step = RScriptStep("prepare.R",
                           name="prepare-step",
                           arguments=['--data_path_input', '/data',
                                      '--data_path_output', prepared_data],
                           compute_target='cpu-cluster',
                           runconfig=rc,
                           inputs=[training_dataset_consumption],
                           outputs=[prepared_data],
                           source_directory="./",
                           custom_docker_image=None,
                           allow_reuse=False)

train_step = RScriptStep("train.R",
                         name="train-step",
                         arguments=['--data_path', prepared_data],
                         compute_target='cpu-cluster',
                         runconfig=rc,
                         inputs=[prepared_data],
                         source_directory="./",
                         custom_docker_image=None,
                         allow_reuse=False)

train_step.run_after(prepare_step)  # optional here: the prepared_data link already implies this order
steps = [prepare_step, train_step]

Finally, we can create our pipeline object and validate it. Validation checks that the inputs and outputs are properly linked and that the pipeline graph is acyclic:

In [None]:
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline.validate()
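
To make the acyclicity requirement concrete, here is a small standalone sketch that uses the standard-library `graphlib` module (Python 3.9+) to order the two steps from this example and to detect a cycle. It only illustrates the idea of a dependency check; it is not how the Azure ML SDK implements `validate()`, and the `execution_order` helper and the `cyclic` graph are assumptions made up for the demonstration.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(graph):
    """Return a valid execution order; raises CycleError if the graph has a cycle.

    `graph` maps each step name to the set of step names it depends on.
    """
    return list(TopologicalSorter(graph).static_order())


# Dependencies of this example: train-step consumes the output of prepare-step.
dependencies = {
    "prepare-step": set(),
    "train-step": {"prepare-step"},
}
order = execution_order(dependencies)
print(order)

# A deliberately broken graph in which each step waits for the other.
cyclic = {
    "prepare-step": {"train-step"},
    "train-step": {"prepare-step"},
}
try:
    execution_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

The first graph yields `prepare-step` before `train-step`, while the second one cannot be ordered at all, which is the kind of situation a pipeline validation step is meant to catch before anything is submitted.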

Lastly, we can submit the pipeline to an experiment and wait for the run to finish:

In [None]:
pipeline_run = Experiment(ws, 'prepare-training-pipeline-with-r').submit(pipeline)
pipeline_run.wait_for_completion()

Alternatively, we can publish the pipeline as a RESTful API endpoint. In this case, you can specify the dataset when invoking the pipeline. This is convenient to do in the Studio UI: go to `Endpoints`, then `Pipeline Endpoints`, and select the pipeline. Once you click the submit button, you can select the dataset at the bottom of the window.

In [None]:
published_pipeline = pipeline.publish('prepare-training-pipeline-with-r')
published_pipeline