Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Azure Machine Learning Pipelines with Data Dependency
In this notebook, we will see how we can build a pipeline with implicit data dependency.

## Prerequisites and Azure Machine Learning Basics
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration Notebook](https://aka.ms/pl-config) first if you haven't already. It sets you up with a working config file that contains your workspace, subscription ID, and related information.

### Azure Machine Learning and Pipeline SDK-specific Imports

In [1]:
import azureml.core
from azureml.core import Workspace, Experiment, Datastore
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.widgets import RunDetails

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

from azureml.data.data_reference import DataReference
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep
print("Pipeline SDK-specific imports completed")

SDK version: 1.0.83
Pipeline SDK-specific imports completed


### Initialize Workspace

Initialize a [workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace(class%29) object from persisted configuration.

In [2]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

# Default datastore (Azure blob storage)
# def_blob_store = ws.get_default_datastore()
blob_store = Datastore(ws, "workspaceblobstore")
print("Blobstore's name: {}".format(blob_store.name))

agd-mlws
azure-ml-workshop
westus2
c5ec24ce-9c5f-4da2-bf12-9ca8e9758d60
Blobstore's name: workspaceblobstore


### Source Directory
The best practice is to keep each step's script and its dependent files in a separate folder and to specify that folder as the step's `source_directory`. This reduces the size of the snapshot created for the step, because only that folder is snapshotted. Because a change to any file in the `source_directory` triggers a re-upload of the snapshot, separate folders also preserve step reuse: a step can be reused as long as nothing in its own `source_directory` has changed.
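
To make the reasoning about per-step folders concrete, here is a minimal, purely illustrative sketch. It does not show how Azure Machine Learning actually computes or compares snapshots; the `directory_fingerprint` and `demo` helpers are hypothetical and only model the idea that a content change in one folder leaves the other folder untouched.

```python
import hashlib
import os
import tempfile


def directory_fingerprint(path):
    """Return a SHA-256 digest over the relative paths and contents of all files under path."""
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(path):
        dirs.sort()  # make the walk order deterministic
        for name in sorted(files):
            full_path = os.path.join(root, name)
            digest.update(os.path.relpath(full_path, path).encode("utf-8"))
            with open(full_path, "rb") as handle:
                digest.update(handle.read())
    return digest.hexdigest()


def demo():
    """Edit only the join folder and report (pivot unchanged?, join changed?)."""
    with tempfile.TemporaryDirectory() as project:
        pivot_dir = os.path.join(project, "src", "pivot")
        join_dir = os.path.join(project, "src", "join")
        os.makedirs(pivot_dir)
        os.makedirs(join_dir)
        with open(os.path.join(pivot_dir, "pivot.py"), "w") as handle:
            handle.write("print('pivot')\n")
        with open(os.path.join(join_dir, "join.py"), "w") as handle:
            handle.write("print('join')\n")

        pivot_before = directory_fingerprint(pivot_dir)
        join_before = directory_fingerprint(join_dir)

        # Change only the join script (hypothetical edit).
        with open(os.path.join(join_dir, "join.py"), "a") as handle:
            handle.write("# changed\n")

        pivot_unchanged = directory_fingerprint(pivot_dir) == pivot_before
        join_changed = directory_fingerprint(join_dir) != join_before
        return pivot_unchanged, join_changed


print(demo())  # (True, True)
```

With a single shared folder, the same edit would change the fingerprint seen by both steps, which is the situation separate source directories are meant to avoid.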

In [3]:
# source directories
source_directory_pivot = 'src/pivot'
print('Pivot scripts in {} directory.'.format(source_directory_pivot))

source_directory_join = 'src/join'
print('Join scripts in {} directory.'.format(source_directory_join))

Pivot scripts in src/pivot directory.
Join scripts in src/join directory.


### Required data and script files for the tutorial
Sample files required to finish this tutorial are already copied to the project folders specified above. The sample .py files contain little actual ML work, although as a data scientist you will spend much of your time on scripts like these. Their contents are not important for completing this tutorial; the one-line files are for demonstration purposes only.

### Compute Targets
See the list of Compute Targets on the workspace.

In [4]:
cts = ws.compute_targets
for ct in cts:
    print(ct)

agd-inference
agd-inference-v
agd-training-cpu
agd-training-gpu


#### Retrieve an AML compute
Azure Machine Learning Compute is a service for provisioning and managing clusters of Azure virtual machines for running machine learning workloads. Let's retrieve an existing AmlCompute cluster from the current workspace. We will then run the pipeline step scripts on this compute target.

In [5]:
from azureml.core.compute_target import ComputeTargetException

aml_compute_target = "agd-training-cpu"
try:
    aml_compute = AmlCompute(ws, aml_compute_target)
    print("found existing compute target.")
    print("AML compute target attached")
except ComputeTargetException:
    print("compute target not found")

found existing compute target.
AML compute target attached


In [6]:
# For a more detailed view of the current Azure Machine Learning Compute status, use get_status()
print(aml_compute.get_status().serialize())

{'currentNodeCount': 2, 'targetNodeCount': 2, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 2, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2020-03-30T21:23:00.195000+00:00', 'errors': None, 'creationTime': '2020-03-09T22:37:36.770656+00:00', 'modifiedTime': '2020-03-30T21:11:23.007088+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT3600S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_D3_V2'}


**Wait for this call to finish before proceeding (you will see the asterisk turning to a number).**

Now that you have retrieved the compute target, note that the workspace's `compute_targets` property, listed earlier, includes an entry named 'agd-training-cpu'. This is the AmlCompute cluster used in the rest of this notebook.

## Building Pipeline Steps with Inputs and Outputs
As mentioned earlier, a step in the pipeline can take data as input. This data can be a data source that lives in one of the accessible data locations, or intermediate data produced by a previous step in the pipeline.

### Datasources
A datasource is represented by a **[DataReference](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.data_reference.datareference?view=azure-ml-py)** object and points to data that lives in, or is accessible from, a Datastore. A DataReference can point to a file or a directory.

In [10]:
# Reference the data uploaded to blob storage using DataReference
# Assign each datasource to its own variable (input_time_series_1, input_time_series_2)

# DataReference(datastore, 
#               data_reference_name=None, 
#               path_on_datastore=None, 
#               mode='mount', 
#               path_on_compute=None, 
#               overwrite=False)

input_time_series_1 = DataReference(
    datastore=blob_store,
    data_reference_name="time_series_1",
    path_on_datastore="datasets/time-series/time_series_1.csv")
print("DataReference input_time_series_1 created")

input_time_series_2 = DataReference(
    datastore=blob_store,
    data_reference_name="time_series_2",
    path_on_datastore="datasets/time-series/time_series_2.csv")
print("DataReference input_time_series_2 created")

DataReference input_time_series_1 created
DataReference input_time_series_2 created


### Intermediate/Output Data
Intermediate data (the output of a step) is represented by a **[PipelineData](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedata?view=azure-ml-py)** object. PipelineData is produced by one step and consumed by another: pass the same PipelineData object as an output of the producing step and as an input of one or more consuming steps.

#### Constructing PipelineData
- **name:** [*Required*] Name of the data item within the pipeline graph
- **datastore:** Datastore to write this output to
- **output_name:** Name of the output
- **output_mode:** Specifies "upload" or "mount" modes for producing output (default: mount)
- **output_path_on_compute:** For "upload" mode, the path to which the module writes this output during execution
- **output_overwrite:** Flag to overwrite pre-existing data

In [11]:
# Define intermediate data using PipelineData
# Syntax

# PipelineData(name, 
#              datastore=None, 
#              output_name=None, 
#              output_mode='mount', 
#              output_path_on_compute=None, 
#              output_overwrite=None, 
#              data_type=None, 
#              is_directory=None)

# Creating intermediate data definitions
output_time_series_1 = PipelineData("output_time_series_1",datastore=blob_store)
print("PipelineData output_time_series_1 created")

output_time_series_2 = PipelineData("output_time_series_2",datastore=blob_store)
print("PipelineData output_time_series_2 created")

PipelineData output_time_series_1 created
PipelineData output_time_series_2 created


### Pipeline steps using datasources and intermediate data
Machine learning pipelines can have many steps and these steps could use or reuse datasources and intermediate data. Here's how we construct such a pipeline:

#### Define a Step that consumes a datasource and produces intermediate data.
Here we define a step that consumes a datasource and produces intermediate data.

**Open `pivot.py` on your local machine and examine the arguments, inputs, and outputs for the script. That will give you a good sense of why the script argument names used below are important.**
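
The contents of `pivot.py` are not reproduced in this notebook, so the following is only a hypothetical sketch of how such a step script might read its arguments. The flag names `--input_data` and `--output_data` are the ones passed in the step's `arguments` list; everything else is an assumption made for illustration: the long-format columns `timestamp`, `series`, and `value`, the output file name `pivoted.csv`, and the choice to treat the output path as a directory that the script creates.

```python
import argparse
import csv
import os
from collections import defaultdict


def parse_args(argv=None):
    """Parse the two flags that the pipeline step passes to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", required=True)
    parser.add_argument("--output_data", required=True)
    return parser.parse_args(argv)


def pivot_rows(rows):
    """Turn long-format rows (timestamp, series, value) into one wide row per timestamp."""
    table = defaultdict(dict)
    series_names = set()
    for row in rows:
        table[row["timestamp"]][row["series"]] = row["value"]
        series_names.add(row["series"])
    header = ["timestamp"] + sorted(series_names)
    wide = [
        [timestamp] + [values.get(name, "") for name in header[1:]]
        for timestamp, values in sorted(table.items())
    ]
    return header, wide


def main(argv=None):
    args = parse_args(argv)
    with open(args.input_data, newline="") as handle:
        header, wide = pivot_rows(csv.DictReader(handle))
    # Assumption: the output path is a directory that the script must create.
    os.makedirs(args.output_data, exist_ok=True)
    output_file = os.path.join(args.output_data, "pivoted.csv")  # hypothetical file name
    with open(output_file, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(wide)
    return output_file
```

A real script would end with a call to `main()`. The point of the sketch is that the flag names declared in the script have to match the strings in the step's `arguments` list exactly, because that list is the only link between the pipeline definition and the script.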

#### Specify conda dependencies and a base docker image through a RunConfiguration

This step runs in a Docker image. Use a [**RunConfiguration**](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.runconfiguration?view=azure-ml-py) to specify these requirements, and pass it when creating the PythonScriptStep.

In [12]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DEFAULT_CPU_IMAGE

# create a new runconfig object
run_config = RunConfiguration()

# enable Docker 
run_config.environment.docker.enabled = True

# set Docker base image to the default CPU-based image
run_config.environment.docker.base_image = DEFAULT_CPU_IMAGE

# use conda_dependencies.yml to create a conda environment in the Docker image for execution
run_config.environment.python.user_managed_dependencies = False

# specify CondaDependencies obj
#run_config.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn'])

In [14]:
# Each pivot step consumes a datasource (DataReference) defined above
# and produces intermediate data (PipelineData)
time_series_1_pivot_step = PythonScriptStep(
    script_name="pivot.py", 
    arguments=["--input_data", input_time_series_1, "--output_data", output_time_series_1],
    inputs=[input_time_series_1],
    outputs=[output_time_series_1],
    compute_target=aml_compute, 
    source_directory=source_directory_pivot,
    runconfig=run_config
)
print("PythonScriptStep time_series_1_pivot_step created")

time_series_2_pivot_step = PythonScriptStep(
    script_name="pivot.py", 
    arguments=["--input_data", input_time_series_2, "--output_data", output_time_series_2],
    inputs=[input_time_series_2],
    outputs=[output_time_series_2],
    compute_target=aml_compute, 
    source_directory=source_directory_pivot,
    runconfig=run_config
)
print("PythonScriptStep time_series_2_pivot_step created")

PythonScriptStep time_series_1_pivot_step created
PythonScriptStep time_series_2_pivot_step created


#### Define a Step that consumes intermediate data and produces intermediate data
Here we define a step that consumes intermediate data and produces new intermediate data.

**Open `join.py` on your local machine and examine the arguments, inputs, and outputs for the script. That will give you a good sense of why the script argument names used below are important.**
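
As with the pivot script, `join.py` itself is not shown here. The sketch below is a hypothetical version that accepts the three flags used by the join step (`--input_data_1`, `--input_data_2`, `--output_data`) and performs a simple inner join. The shared `timestamp` key column, the assumption that each input path is a single CSV file with unique keys, and the output file name `joined.csv` are all illustrative choices, not facts about the sample script.

```python
import argparse
import csv
import os


def parse_args(argv=None):
    """Parse the three flags that the join step passes to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data_1", required=True)
    parser.add_argument("--input_data_2", required=True)
    parser.add_argument("--output_data", required=True)
    return parser.parse_args(argv)


def inner_join(left_rows, right_rows, key="timestamp"):
    """Merge rows that share the same key; rows without a match are dropped.

    Assumes keys are unique within right_rows (a later duplicate would win).
    """
    right_by_key = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != key})
            joined.append(merged)
    return joined


def read_rows(path):
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def main(argv=None):
    args = parse_args(argv)
    joined = inner_join(read_rows(args.input_data_1), read_rows(args.input_data_2))
    # Assumption: the output path is a directory that the script must create.
    os.makedirs(args.output_data, exist_ok=True)
    output_file = os.path.join(args.output_data, "joined.csv")  # hypothetical file name
    with open(output_file, "w", newline="") as handle:
        fieldnames = list(joined[0].keys()) if joined else []
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(joined)
    return output_file
```

The script never sees `PipelineData` objects. It only receives paths on the compute target, which is why the flag names are the contract between the pipeline definition and the script.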

In [15]:
# The join step consumes the intermediate data produced by the two pivot steps
# and produces its own output, output_joined_time_series
output_joined_time_series = PipelineData("output_joined_time_series", datastore=blob_store)

join_step = PythonScriptStep(
    script_name="join.py",
    arguments=["--input_data_1", output_time_series_1, "--input_data_2", output_time_series_2, "--output_data", output_joined_time_series],
    inputs=[output_time_series_1,output_time_series_2],
    outputs=[output_joined_time_series],
    compute_target=aml_compute, 
    source_directory=source_directory_join)
print("PythonScriptStep join_step created")

PythonScriptStep join_step created


#### Build the pipeline

In [16]:
pipeline = Pipeline(workspace=ws, steps=[join_step])
print("Pipeline is built")

Pipeline is built
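
Only `join_step` is passed to the `Pipeline` constructor, yet the submission log shows both `pivot.py` steps being created as well. That is the implicit data dependency this notebook is about: the join step's inputs are the outputs of the pivot steps, so the pivot steps are pulled in. The toy sketch below illustrates the idea with plain Python. It is not the SDK's implementation; the `Step` tuple and `resolve_steps` function are hypothetical stand-ins.

```python
from collections import namedtuple

# Toy stand-in for a pipeline step: a name plus the data names it reads and writes.
Step = namedtuple("Step", ["name", "inputs", "outputs"])


def resolve_steps(requested, all_steps):
    """Return the requested steps plus every upstream producer, in execution order."""
    producer_of = {out: step for step in all_steps for out in step.outputs}
    ordered = []
    visited = set()

    def visit(step):
        if step.name in visited:
            return
        visited.add(step.name)
        for data_name in step.inputs:
            producer = producer_of.get(data_name)
            if producer is not None:  # data with no producer is an external datasource
                visit(producer)
        ordered.append(step)  # a step is scheduled only after its producers

    for step in requested:
        visit(step)
    return [step.name for step in ordered]


pivot_1 = Step("pivot_1", ["time_series_1"], ["output_time_series_1"])
pivot_2 = Step("pivot_2", ["time_series_2"], ["output_time_series_2"])
join = Step(
    "join",
    ["output_time_series_1", "output_time_series_2"],
    ["output_joined_time_series"],
)

print(resolve_steps([join], [pivot_1, pivot_2, join]))
# ['pivot_1', 'pivot_2', 'join']
```

Requesting only the final step is enough to recover the full graph, and the producers always come before their consumer.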


In [17]:
pipeline_run = Experiment(ws, 'use-case-1-data-prep').submit(pipeline)
print("Pipeline is submitted for execution")

Created step join.py [d8e54cf2][66ea111b-7495-4fd0-b781-0f5924cf30ec], (This step will run and generate new outputs)
Created step pivot.py [ab44cb41][263dfce7-8640-48be-b87e-f887702d3e04], (This step will run and generate new outputs)
Created step pivot.py [90acf336][c2073419-2da8-4344-a9c2-7723aae36a23], (This step will run and generate new outputs)

Created data reference time_series_1 for StepId [e3e30525][89da74c8-910e-4467-92bd-f1fe7b4ba59b], (Consumers of this data will generate new runs.)
Created data reference time_series_2 for StepId [2ec31e49][4a9d92fd-08fd-4d76-8b97-67f52e04edec], (Consumers of this data will generate new runs.)
Submitted PipelineRun 559f07b2-84e5-4ac5-8fc7-1d71258dc6cd
Link to Azure Machine Learning studio: https://ml.azure.com/experiments/use-case-1-data-prep/runs/559f07b2-84e5-4ac5-8fc7-1d71258dc6cd?wsid=/subscriptions/c5ec24ce-9c5f-4da2-bf12-9ca8e9758d60/resourcegroups/azure-ml-workshop/workspaces/agd-mlws
Pipeline is submitted for execution


In [18]:
RunDetails(pipeline_run).show()


#### Wait for pipeline run to complete

In [19]:
pipeline_run.wait_for_completion(show_output=True)

PipelineRunId: 559f07b2-84e5-4ac5-8fc7-1d71258dc6cd
Link to Portal: https://ml.azure.com/experiments/use-case-1-data-prep/runs/559f07b2-84e5-4ac5-8fc7-1d71258dc6cd?wsid=/subscriptions/c5ec24ce-9c5f-4da2-bf12-9ca8e9758d60/resourcegroups/azure-ml-workshop/workspaces/agd-mlws
PipelineRun Status: Running


StepRunId: 1cb62951-59b6-4a10-8f3e-ba638defabfe
Link to Portal: https://ml.azure.com/experiments/use-case-1-data-prep/runs/1cb62951-59b6-4a10-8f3e-ba638defabfe?wsid=/subscriptions/c5ec24ce-9c5f-4da2-bf12-9ca8e9758d60/resourcegroups/azure-ml-workshop/workspaces/agd-mlws
StepRun( pivot.py ) Status: Running

Streaming azureml-logs/65_job_prep-tvmps_5a0048198f600b89a776e9215379629de230874e2e10c24d8f28f64275377801_d.txt
bash: /azureml-envs/azureml_1b417bb747e35859ebf611fb43071e9c/lib/libtinfo.so.5: no version information available (required by bash)
Starting job preparation. Current time:2020-03-30T22:00:46.024312
Extracting the control code.
fetching and extracting the control code on master

### See Outputs

See where outputs of each pipeline step are located on your datastore.

***Wait for pipeline run to complete, to make sure all the outputs are ready***

In [20]:
# Get Steps
for step in pipeline_run.get_steps():
    print("Outputs of step " + step.name)
    
    # Get a dictionary of StepRunOutputs with the output name as the key 
    output_dict = step.get_outputs()
    
    for name, output in output_dict.items():
        
        output_reference = output.get_port_data_reference() # Get output port data reference
        print("\tname: " + name)
        print("\tdatastore: " + output_reference.datastore_name)
        print("\tpath on datastore: " + output_reference.path_on_datastore)

Outputs of step join.py
	name: output_joined_time_series
	datastore: workspaceblobstore
	path on datastore: azureml/64722ebe-95d5-4a93-b7be-016cc7d2b895/output_joined_time_series
Outputs of step pivot.py
	name: output_time_series_1
	datastore: workspaceblobstore
	path on datastore: azureml/1cb62951-59b6-4a10-8f3e-ba638defabfe/output_time_series_1
Outputs of step pivot.py
	name: output_time_series_2
	datastore: workspaceblobstore
	path on datastore: azureml/7d101f85-4a2e-4ced-b605-6be6f7fafb60/output_time_series_2


### Download Outputs

We can download the output of any step to our local machine using the SDK.

In [None]:
# Retrieve the step runs by name 'join.py'
join_step_runs = pipeline_run.find_step_run('join.py')

if join_step_runs:
    join_step_run = join_step_runs[0] # only one step is named 'join.py'
    # The output name must match the one listed under "See Outputs"
    join_step_run.get_output_data('output_joined_time_series').download("./outputs") # download the output to ./outputs

# Next: Publishing the Pipeline and calling it from the REST endpoint
See this [notebook](https://aka.ms/pl-pub-rep) to learn how to publish the pipeline and call its REST endpoint to run it.