# Study Note - Building AI Solutions with Azure Machine Learning
This notebook collects the notes taken through the course of **[Build AI solutions with Azure Machine Learning](https://docs.microsoft.com/en-us/learn/paths/build-ai-solutions-with-azure-ml-service/)** offered by Microsoft, with supplements from the **[documentation of Azure Machine Learning SDK for Python](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/?view=azure-ml-py)**.

This notebook contains Labs 01 - 05 of the learning course, which correspond to "Set up an Azure Machine Learning Workspace" and "Run Experiments and Train Models" sections in the exam guideline.

## 01 Getting Started with Azure Machine Learning

The Azure ML SDK for Python provides classes you can use to work with Azure ML in your Azure subscription.

### azureml-core package
**High level process:**
1. **create a new <font color='blue'>*workspace*</font> or connect to an existing workspace** 
2. **create an Azure ML <font color='blue'>*experiment*</font> in workspace**
3. **create a <font color='blue'>*run*</font> to execute code**

### Workspace 

A **workspace** is a context for the **experiments, data, compute targets, and other assets** associated with **a machine learning workload**. Workspaces are Azure resources, and as such they are defined within a resource group in an Azure subscription, along with other related Azure resources that are required to support the workspace. A Workspace is a fundamental **resource** for machine learning in Azure Machine Learning. You use a workspace to **experiment, train, and deploy machine learning models**.

```python
from azureml.core import Workspace
```
- All experiments and associated resources are managed within your Azure ML workspace. You can connect to an existing workspace, create a new one using the Azure ML SDK, or load the workspace from the configuration file.

```python
# Load an existing workspace
ws = Workspace.get(name="myworkspace", subscription_id='<azure-subscription-id>', resource_group='myresourcegroup')

# Create a new one
ws = Workspace.create(name='myworkspace',
                      subscription_id='<azure-subscription-id>',
                      resource_group='myresourcegroup',
                      create_resource_group=True,
                      location='eastus2'
                     )

# Load from a configuration file
ws = Workspace.from_config()

```

- In most cases, you should store the workspace configuration in a JSON configuration file. This makes it easier to reconnect without needing to remember details like your Azure subscription ID.

```python
ws.write_config(path="./file-path", file_name="ws_config.json")
```

- You can download the JSON configuration file from the blade for your workspace in the Azure portal, but ***if you're using a Compute Instance within your workspace, the configuration file has already been downloaded to the root folder.***
    - ***Note: This means that if the new script is on the same compute instance, you can simply use `Workspace.from_config()` to retrieve the workspace.***
- With no arguments, `Workspace.from_config()` searches the current directory and its parent directories for the configuration file and uses it to connect to your workspace. Pass `path` to point at a configuration file stored elsewhere.

```python
ws_other_environment = Workspace.from_config(path="./file-path/ws_config.json")
```
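To make the contents of the configuration file concrete, here is a minimal sketch that writes and reads back a file with the three values that identify a workspace. It uses only the standard library rather than the SDK; the scratch directory and the placeholder values are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values; substitute your own subscription, resource group, and workspace.
workspace_details = {
    "subscription_id": "<azure-subscription-id>",
    "resource_group": "myresourcegroup",
    "workspace_name": "myworkspace",
}

# Write the file to a scratch directory (hypothetical location for this sketch).
config_dir = Path(tempfile.mkdtemp())
config_path = config_dir / "ws_config.json"
config_path.write_text(json.dumps(workspace_details, indent=4))

# Read it back, as a tool reconnecting to the workspace would.
loaded = json.loads(config_path.read_text())
print(sorted(loaded))
```

Because the file holds only identifiers and no credentials, authentication is still handled separately when you connect.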

### Experiment

In Azure Machine Learning, an **experiment** is a **named process**, usually the running of a script or a pipeline, that can generate metrics and outputs and be tracked in the Azure Machine Learning workspace. An experiment can be run multiple times, with different data, code, or settings; and Azure Machine Learning tracks each run, enabling you to view run history and compare results for each run.

When you submit an experiment, you use its run context to initialize and end the experiment run that is tracked in Azure Machine Learning.

```python
from azureml.core import Experiment

# create an experiment variable
experiment = Experiment(workspace=ws, name='test-experiment')

# start the experiment
run = experiment.start_logging()

# experiment code goes here

# end the experiment
run.complete()

```

After the experiment run has completed, you can view the details of the run in the **Experiments** tab in Azure Machine Learning studio.

### Run
**A run represents a single trial of an experiment.** **Run** is the object that you use to monitor the asynchronous execution of a trial, store the output of the trial, analyze results, and access generated artifacts. You use Run inside your experimentation code to log metrics and artifacts to the Run History service.

#### Note: *Run* is like running a model/pipeline each time.

- There are two ways to create a run. Both functions return a Run object.
    1. `experiment.start_logging()`, as in the previous example
    2. `experiment.submit()`, to run an experiment script

#### [IMPORTANT!] If you're interactively experimenting in a Jupyter notebook, use the `start_logging` function. If you're submitting an experiment from a standard Python environment, use the `submit` function. 

Create a Run object by submitting an Experiment object with a **run configuration object**. Use the **tags parameter** to attach custom categories and labels to your runs. You can easily find and retrieve them later from Experiment.

```python
tags = {"prod": "phase-1-model-tests"}
run = experiment.submit(config=your_config_object, tags=tags)
```

#### Create an experiment script
- Create a script that is separate from the experiment, store it in a folder along with any other files it needs, and then use Azure ML to run the experiment based on the script in the folder.
- Use the `Run.get_context()` method to *retrieve the experiment run context when the script is run*.
- <ins>**After a run object is created, use the various `.log*()` methods to log the outputs.**</ins>
- Call `run.complete()` at the end of the script.

```python
# An experiment script, experiment.py, saved in the experiment_files folder
from azureml.core import Run
import pandas as pd
import matplotlib.pyplot as plt
import os

# Get the experiment run context
run = Run.get_context()

# load the diabetes dataset
data = pd.read_csv('data.csv')

# Count the rows and log the result
row_count = (len(data))
run.log('observations', row_count)

# Save a sample of the data
os.makedirs('outputs', exist_ok=True)
data.sample(100).to_csv("outputs/sample.csv", index=False, header=True)

# Complete the run
run.complete()
```

#### Configuration

To run a script as an experiment, you must define a script configuration that defines **the script to be run** and **the Python environment in which to run it**. This is implemented by using a **ScriptRunConfig** object.

```python
from azureml.core import Experiment, RunConfiguration, ScriptRunConfig

# create a new RunConfig object
experiment_run_config = RunConfiguration()

# Create a script config
script_config = ScriptRunConfig(source_directory=experiment_folder, 
                                script='experiment.py',
                                run_config=experiment_run_config) 

# submit the experiment
experiment = Experiment(workspace = ws, name = 'my-experiment')
run = experiment.submit(config=script_config)
run.wait_for_completion(show_output=True)
```

The **RunConfiguration** object defines the Python environment for the experiment, including the packages available to the script. If your script depends on packages that are not included in the default environment, you must associate the **RunConfiguration** with an Environment object that makes use of a **CondaDependencies** object to specify the Python packages required.

### Note: How to create a simple machine learning workflow

1. Create a new workspace or load an existing workspace
2. Create an experiment script and save it in a folder along with the other files it needs
3. Configure the script run and submit the experiment

### [Lab: Getting Started with Azure Machine Learning](https://github.com/MicrosoftDocs/mslearn-aml-labs/blob/master/labdocs/Lab01.md)

- In this lab, we need to create a **workspace** in Azure Portal and then use **Azure Machine Learning studio** to manage the workspace.
    - Create a compute instance under the workspace. ***When creating a Compute Instance, a virtual machine is created.***
    - The cheapest virtual machine is STANDARD_D2S_V3
        - After the compute instance is created, click its **Jupyter link** to open Jupyter Notebooks on the VM.
- **[IMPORTANT!!]** When you have finished the lab, **close all Jupyter tabs and *Stop* your compute instance** to avoid incurring unnecessary costs.

### MLflow

**MLflow** is an open source platform for managing machine learning processes. It's **commonly (but not exclusively) used in Databricks environments** to coordinate experiments and track metrics. In Azure Machine Learning experiments, you can use MLflow to track metrics instead of the native log functionality if you desire.

```python
import mlflow
```
- Refer to the notebook code in the official GitHub repository for more details.


## 02 Training Models with Parameters

In Azure Machine Learning, you can use a **Run Configuration** and a **Script Run Configuration** to run a script-based experiment that trains a machine learning model. However, depending on the machine learning framework being used and **the dependencies** it requires, **the run configuration may become complex**.

Azure Machine Learning also provides a higher level abstraction called an **Estimator** that ***encapsulates a run configuration and a script configuration*** in a single object, and for which there are pre-defined, framework-specific variants that already include the package dependencies for common machine learning frameworks such as *Scikit-Learn, PyTorch, and TensorFlow*.

#### Note: 
- The only difference is that the estimator replaces `script_config`: create an estimator object and pass it to the `config` parameter of `experiment.submit()`.
- The rest of the process for running a model as an experiment is basically the same.

### Steps:
#### Create a training script and log key metrics of modeling performance
#### Run the script as experiment
- Option 1: Use an Estimator

```python
from azureml.train.estimator import Estimator
from azureml.core import Experiment

# Create an estimator
estimator = Estimator(source_directory='experiment_folder',
                      entry_script='training_script.py',
                      compute_target='local',
                      conda_packages=['scikit-learn']
                      )

# Create and run an experiment
experiment = Experiment(workspace = ws, name = 'training_experiment')
run = experiment.submit(config=estimator) # Note here the estimator is passed to the config parameter
```

- Option 2: Use a framework-specific estimator

```python
from azureml.train.sklearn import SKLearn
from azureml.core import Experiment

# Create an estimator
estimator = SKLearn(source_directory='experiment_folder',
                    entry_script='training_script.py',
                    compute_target='local'
                    )

# Create and run an experiment
experiment = Experiment(workspace = ws, name = 'training_experiment')
run = experiment.submit(config=estimator) # Note here the estimator is passed to the config parameter
```

#### Register the trained model to the workspace

Note that **the outputs of the experiment include the trained model file (model.pkl)**. You can register this model in your Azure Machine Learning workspace, making it possible to track model versions and retrieve them later.

Model registration enables you to track multiple versions of a model, and retrieve models for ***inferencing (predicting label values from new data)***. When you register a model, you can specify a name, description, tags, framework (such as Scikit-Learn or PyTorch), framework version, custom properties, and other useful metadata. Registering a model with the same name as an existing model automatically creates a new version of the model, starting with 1 and increasing in units of 1.

- Option 1: **register** method of **Model** object

```python
from azureml.core import Model

model = Model.register(workspace=ws,
                       model_name='classification_model',
                       model_path='model.pkl', # local path
                       description='A classification model',
                       tags={'dept': 'sales'},
                       model_framework=Model.Framework.SCIKITLEARN,
                       model_framework_version='0.20.3')
```

- Option 2: reference to the **Run**

```python
run.register_model( model_name='classification_model',
                    model_path='outputs/model.pkl', # run outputs path
                    description='A classification model',
                    tags={'dept': 'sales'},
                    model_framework=Model.Framework.SCIKITLEARN,
                    model_framework_version='0.20.3')
```

#### Viewing registered models
```python
from azureml.core import Model

for model in Model.list(ws):
    # Get model name and auto-generated version
    print(model.name, 'version:', model.version)
```
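The version numbering described above for registered models can be pictured with a toy, in-memory stand-in. This is only a sketch of the naming rule (same name, next version number); `ToyModelRegistry` is a hypothetical class for illustration and is not part of the Azure ML SDK.

```python
from collections import defaultdict


class ToyModelRegistry:
    """In-memory stand-in that mimics name-based version numbering."""

    def __init__(self):
        # Latest version number seen for each model name.
        self._latest = defaultdict(int)

    def register(self, model_name):
        # Registering under an existing name yields the next version number.
        self._latest[model_name] += 1
        return model_name, self._latest[model_name]


registry = ToyModelRegistry()
first = registry.register('classification_model')
second = registry.register('classification_model')
other = registry.register('regression_model')
print(first, second, other)
```

Each model name keeps its own counter, so registering a different name starts again at version 1.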
    

### Also: using script parameters

#### Add argument into script
Adding parameters to your script enables you to repeat the same training experiment with different settings. To use parameters in a script, you must use a library such as **argparse** to read the arguments passed to the script and assign them to variables.

```python
import argparse
from azureml.core import Run
from sklearn.linear_model import LogisticRegression
# also import other packages as necessary

# Get the experiment run context
run = Run.get_context()

# Set regularization hyperparameter
parser = argparse.ArgumentParser()
parser.add_argument('--reg_rate', type=float, dest='reg', default=0.01)
args = parser.parse_args()
reg = args.reg

# Prepare the dataset

# Train a logistic regression model
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# The rest of the script
```
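To see how the argument parsing behaves on its own, the following sketch runs the same `argparse` definition against explicit argument lists instead of the real command line. The helper name `parse_reg_rate` is an assumption made for this example.

```python
import argparse


def parse_reg_rate(argv=None):
    """Parse --reg_rate from argv (or from sys.argv when argv is None)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--reg_rate', type=float, dest='reg', default=0.01)
    args = parser.parse_args(argv)
    # dest='reg' means the value is exposed as args.reg, not args.reg_rate.
    return args.reg


# No argument supplied: the default is used.
default_reg = parse_reg_rate([])

# Argument supplied as strings, the way a command line delivers it.
custom_reg = parse_reg_rate(['--reg_rate', '0.1'])

print(default_reg, custom_reg)
```

Because `type=float` converts the incoming string, the script can use the value directly in an expression such as `C=1/reg`.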

#### Passing Script Arguments to an Estimator
```python
from azureml.train.sklearn import SKLearn
from azureml.core import Experiment

# Configure/create an estimator
estimator = SKLearn(source_directory='experiment_folder',
                    entry_script='training_script.py',
                    script_params = {'--reg_rate': 0.1},
                    compute_target='local'
                    )

# Create and run an experiment
experiment = Experiment(workspace = ws, name = 'training_experiment')
run = experiment.submit(config=estimator)
```

### Side Note: Revisit how to interpret ROC
- The Y axis plots the True Positive Rate (TPR); its denominator is the number of actual positive instances (e.g., 80 positives).
- The X axis plots the False Positive Rate (FPR); its denominator is the number of actual negative instances (e.g., 20 negatives).
    - If we label instances as positive at random, we pick up the same fraction of the positives as of the negatives, so TPR and FPR rise at about the same pace and the curve follows the diagonal.
    - A good predictive model scores actual positives higher than actual negatives, so TPR rises faster than FPR and the curve bends toward the top-left. ***The better the model separates positives from negatives, the higher the AUC.***

#### Don’t confuse the concept of AUC and Accuracy.
- AUC shows **how well a model ranks positive instances above negative ones** across all thresholds, and each axis has a different denominator.
- Accuracy is computed at a single threshold, and its denominator includes both positive and negative instances. On imbalanced data it can look high even when the model rarely identifies true positives.
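The difference between the two metrics can be checked on a small example. The sketch below computes ROC points, the area under them, and accuracy at a single threshold in plain Python; the labels and scores are made-up toy data, and the function names are assumptions for this illustration.

```python
def roc_points(y_true, scores):
    """Sweep the threshold from high to low and collect (FPR, TPR) points."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def accuracy(y_true, scores, threshold=0.5):
    """Fraction of instances classified correctly at one threshold."""
    correct = sum(1 for y, s in zip(y_true, scores) if (s >= threshold) == (y == 1))
    return correct / len(y_true)


# Toy data: 1 = positive, 0 = negative; higher score = more likely positive.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = area_under_curve(roc_points(y_true, scores))
acc = accuracy(y_true, scores)
print(f"AUC = {auc:.3f}, accuracy at 0.5 = {acc:.3f}")
```

In this toy data, eight of the nine positive/negative pairs are ordered correctly, so the AUC is 8/9, while accuracy at a threshold of 0.5 is 5/6. The two numbers answer different questions: ranking quality over all thresholds versus correctness at one threshold.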


## 03 Work with Data in Azure Machine Learning

### [IMPORTANT NOTE] Datastores are *file locations* whereas datasets are *real data*.

### Datastores
In Azure Machine Learning, ***datastores*** are abstractions for cloud data sources / storage locations.

```python
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# Register a new datastore
blob_ds = Datastore.register_azure_blob_container(workspace=ws,
    datastore_name='blob_data',
    container_name='data_container',
    account_name='az_store_acct',
    account_key='123456abcde789…')    

# Get a reference to a datastore
blob_store = Datastore.get(ws, datastore_name='blob_data')
default_store = ws.get_default_datastore()
ws.set_default_datastore('blob_data')

# Working directly with a datastore
blob_ds.upload(src_dir='/files',
               target_path='/data/files',
               overwrite=True, show_progress=True)

blob_ds.download(target_path='downloads',
                 prefix='/data',
                 show_progress=True)
```

When you want to use a datastore in an experiment script, you must pass a data reference to the script. The data reference is configured for one of the following data access modes: **download, upload, and mount.**

```python
# Get a data reference
data_ref = blob_ds.path('data/files').as_download(path_on_compute='training_data')

# Configuration
estimator = SKLearn(source_directory='experiment_folder',
                    entry_script='training_script.py',
                    compute_target='local',
                    script_params = {'--data_folder': data_ref})
```

In your training script, you can retrieve the parameter and use it like a local folder:
```python
import os
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--data_folder', type=str, dest='data_folder')
args = parser.parse_args()
data_files = os.listdir(args.data_folder)
```

### Datasets
***Datasets*** are versioned packaged data objects that can be easily consumed in experiments and pipelines. Datasets are the recommended way to work with data, and are the primary mechanism for advanced Azure Machine Learning capabilities like data labeling and data drift monitoring.

Datasets are typically based on **files in a datastore**, though they can also be based on URLs and other sources. You can create the following types of dataset: **tabular and file**.

```python
# Create - Type 1: Creating and registering tabular datasets
from azureml.core import Dataset

blob_ds = ws.get_default_datastore()

    # The dataset in this example includes data from two file paths within the default datastore
csv_paths = [(blob_ds, 'data/files/current_data.csv'),
             (blob_ds, 'data/files/archive/*.csv')]
tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)

    # After creating the dataset, the code registers it in the workspace with the name csv_table.
tab_ds = tab_ds.register(workspace=ws, name='csv_table')

# Create - Type 2: Creating and registering file datasets
file_ds = Dataset.File.from_files(path=(blob_ds, 'data/files/images/*.jpg'))
file_ds = file_ds.register(workspace=ws, name='img_files')

# Retrieve a registered dataset
import azureml.core
from azureml.core import Workspace, Dataset

    # Load the workspace from the saved config file
ws = Workspace.from_config()

    # Get a dataset from the workspace datasets collection (dictionary attribute)
ds1 = ws.datasets['csv_table']

    # Get a dataset by name from the datasets class (method)
ds2 = Dataset.get_by_name(ws, 'img_files')

# Dataset versioning - specifying the create_new_version property
img_paths = [(blob_ds, 'data/files/images/*.jpg'),
             (blob_ds, 'data/files/images/*.png')]
file_ds = Dataset.File.from_files(path=img_paths)
file_ds = file_ds.register(workspace=ws, name='img_files', create_new_version=True)

# Retrieving a specific dataset version
img_ds = Dataset.get_by_name(workspace=ws, name='img_files', version=2)
```

You can read data directly from a dataset, or you can pass a dataset as a named input to a script configuration or estimator.
```python
# Working with a dataset directly
    # Tabular
df = tab_ds.to_pandas_dataframe()
# code to work with dataframe goes here

    # File
for file_path in file_ds.to_path():
    print(file_path)
```

When you need to access a dataset in an experiment script, you can pass the dataset as an input to a **ScriptRunConfig** or an **Estimator**. Since the script will need to work with a Dataset object, you must include either **the full azureml-sdk package** or **the azureml-dataprep package with the pandas extra library** in the script's compute environment.

For example, the following code passes a tabular dataset to an estimator:

```python
estimator = SKLearn( source_directory='experiment_folder',
                     entry_script='training_script.py',
                     compute_target='local',
                     inputs=[tab_ds.as_named_input('csv_data')],
                     pip_packages=['azureml-dataprep[pandas]'])
```

In the experiment script itself, you can access the input and work with the Dataset object it references like this:

```python
run = Run.get_context()
data = run.input_datasets['csv_data'].to_pandas_dataframe()
```

When passing a file dataset, you must **specify the access mode**. For large volumes of data, you'd generally use the **as_mount** method to stream the files directly from the dataset source; but when running on local compute (as we are in this example), you need to use the **as_download** option to download the dataset files to a local folder.

```python
estimator = Estimator( source_directory='experiment_folder',
                     entry_script='training_script.py',
                     compute_target='local',
                     inputs=[img_ds.as_named_input('img_data').as_download(path_on_compute='data')],
                     pip_packages=['azureml-dataprep[pandas]'])
```

## 04 Work with Compute in Azure Machine Learning
The runtime context for each experiment run consists of two elements:
1. The *environment* for the script, which includes all packages used in the script.
2. The *compute target* on which the environment will be deployed and the script run. This could be the local workstation from which the experiment run is initiated, or a remote compute target such as a training cluster that is provisioned on-demand.
    - In Azure Machine Learning, *Compute Targets* are **physical or virtual computers on which experiments are run**.

### Environments in Azure Machine Learning
In general, Azure Machine Learning handles environment creation and package installation for you - usually through the creation of **Docker containers**. In addition to Python, you can also configure PySpark, Docker and R for environments. Internally, environments result in **Docker images** that are used to run the training and scoring processes on the compute target.

When you run a Python script as an experiment in Azure Machine Learning, a Conda environment is created to define the execution context for the script. Azure Machine Learning provides a default environment that includes many common packages; including the **azureml-defaults** package that contains the libraries necessary for working with an experiment run, as well as popular packages like **pandas** and **numpy**.

You can also define your own environment and add packages by using **conda** or **pip**, to ensure your experiment has access to all the libraries it requires.

You can have Azure Machine Learning manage environment creation and package installation to define an environment, and then register it for reuse. Alternatively, you can manage your own environments and register them. This makes it possible to define consistent, reusable runtime contexts for your experiments - regardless of where the experiment script is run.

```python
from azureml.core import Environment

# Create an environment
# Approach 1: Creating an environment from a specification file
env = Environment.from_conda_specification(name='training_environment',
                                           file_path='./conda.yml')

# Approach 2: Creating an environment from an existing Conda environment
env = Environment.from_existing_conda_environment(name='training_environment',
                                                  conda_environment_name='py_env')

```
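Approach 1 above expects a Conda specification file. As a rough sketch of what such a file might contain, the snippet below writes a small `conda.yml` to a scratch directory and prints it. The environment name and package list are assumptions chosen to match the packages mentioned in these notes, not a required set.

```python
import tempfile
from pathlib import Path
from textwrap import dedent

# Illustrative specification: conda packages first, then pip packages nested under "pip".
conda_spec = dedent("""\
    name: py_env
    dependencies:
      - python
      - scikit-learn
      - pip
      - pip:
        - azureml-defaults
""")

# Write to a scratch directory (hypothetical location for this sketch).
spec_path = Path(tempfile.mkdtemp()) / "conda.yml"
spec_path.write_text(conda_spec)

print(spec_path.read_text())
```

The resulting file path is what you would pass as `file_path` when creating an environment from a specification file.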

```python
# Approach 3: Creating an environment by specifying packages
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

    # Create a Python environment for the experiment
diabetes_env = Environment("diabetes-experiment-env")
diabetes_env.python.user_managed_dependencies = False # Let Azure ML manage dependencies
diabetes_env.docker.enabled = True # Use a docker container

    # Create a set of package dependencies (conda or pip as required)
diabetes_packages = CondaDependencies.create(conda_packages=['scikit-learn'],
                                          pip_packages=['azureml-defaults', 'azureml-dataprep[pandas]'])

    # Add the dependencies to the environment
diabetes_env.python.conda_dependencies = diabetes_packages
```

```python
# Register an environment
env.register(workspace=ws)

    # View registered environment
env_names = Environment.list(workspace=ws)
for env_name in env_names:
    print('Name:',env_name)
```
```python
# Retrieving and using an environment
from azureml.core import Environment
from azureml.train.estimator import Estimator

training_env = Environment.get(workspace=ws, name='training_environment')
estimator = Estimator(source_directory='experiment_folder',
                      entry_script='training_script.py',
                      compute_target='local',
                      environment_definition=training_env)
```

### Compute Targets

*In Azure Machine Learning, **Compute Targets** are physical or virtual computers on which experiments are run.* Azure Machine Learning supports multiple types of compute for experimentation and training, and for production inferencing. This enables you to select the most appropriate type of compute target for your particular needs.

#### Local compute

This runs the experiment on the same compute target as the code used to initiate the experiment, which may be your physical workstation or a virtual machine such as an Azure Machine Learning **compute instance on which you are running a notebook**.

#### Compute cluster

For experiment workloads with high scalability requirements, you can use Azure Machine Learning compute clusters; which are **multi-node clusters of Virtual Machines** that automatically scale up or down to meet demand. This is a cost-effective way to run experiments that need to handle large volumes of data or use parallel processing to distribute the workload and reduce the time it takes to run.

#### Inference clusters 

To deploy trained models as production services, you can use Azure Machine Learning inference clusters, which use **containerization technologies** to enable rapid initialization of compute for on-demand inferencing.

#### Attached compute

If you already use an Azure-based compute environment for data science, such as a virtual machine or an Azure Databricks cluster, you can attach it to your Azure Machine Learning workspace and use it as a compute target for certain types of workload.

#### Sample codes
```python
# Creating a managed compute target with the SDK
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, AmlCompute

# 1. Load the workspace from the saved config file
ws = Workspace.from_config()

# 2. Specify a name for the compute (unique within the workspace)
compute_name = 'aml-cluster'

# 3. Define compute configuration
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_V2',
                                                       min_nodes=0, max_nodes=4,
                                                       vm_priority='dedicated')

# 4. Create the compute
aml_cluster = ComputeTarget.create(ws, compute_name, compute_config)
aml_cluster.wait_for_completion(show_output=True)
```

In this example, a cluster that scales between zero and four nodes of the STANDARD_DS12_V2 virtual machine size will be created. The priority for the virtual machines (VMs) is set to dedicated, meaning they are reserved for use in this cluster (the alternative is to specify lowpriority, which has a lower cost but means that the VMs can be preempted if a higher-priority workload requires the compute).


```python
# Attaching an unmanaged compute target with the SDK
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, DatabricksCompute

# 1. Load the workspace from the saved config file
ws = Workspace.from_config()

# 2. Specify a name for the compute (unique within the workspace)
compute_name = 'db_cluster'

# 3. Define configuration for existing Azure Databricks cluster
db_workspace_name = 'db_workspace'
db_resource_group = 'db_resource_group'
db_access_token = '1234-abc-5678-defg-90...'
db_config = DatabricksCompute.attach_configuration(resource_group=db_resource_group,
                                                   workspace_name=db_workspace_name,
                                                   access_token=db_access_token)

# 4. Create the compute
databricks_compute = ComputeTarget.attach(ws, compute_name, db_config)
databricks_compute.wait_for_completion(True)
```

After you've created environments and compute targets in your workspace, you can use them to run specific workloads; such as experiments.

When an experiment for the estimator is submitted, the run will be queued while the compute target is started and the specified environment deployed to it, and then the run will be processed on the compute environment.


```python
estimator = Estimator(source_directory='experiment_folder',
                      entry_script='training_script.py',
                      environment_definition=training_env,
                      compute_target=training_cluster # a ComputeTarget object or the name of an existing compute target
                      )
```

## 05 Orchestrate Machine Learning with Pipelines

### Definition
The term pipeline is used extensively in machine learning, often with different meanings.
- Scikit-Learn pipelines combine data preprocessing transformations with a training algorithm.
- Azure Machine Learning pipelines encapsulate steps that can be run as an experiment.
- Azure DevOps pipelines define the build and configuration tasks required to deliver software.

### Azure Machine Learning Pipeline

In Azure Machine Learning, a pipeline is a workflow of machine learning tasks in which each task is implemented as a *step*.

Steps can be arranged sequentially or in parallel, enabling you to build sophisticated flow logic to orchestrate machine learning operations. *Each step can be run on a specific compute target*, making it possible to combine different types of processing as required to achieve an overall goal.

**A pipeline can be executed as a process by running the pipeline as an experiment. Each step in the pipeline runs on its allocated compute target as part of the overall experiment run.**

You can **publish a pipeline as a REST endpoint**, enabling client applications to initiate a pipeline run. You can also **define a schedule** for a pipeline, and have it run automatically at periodic intervals.

#### Types of step
Common kinds of step in an Azure Machine Learning pipeline include:

- **PythonScriptStep**: Runs a specified Python script.
- **EstimatorStep**: Runs an estimator.
- **DataTransferStep**: Uses Azure Data Factory to copy data between data stores.
- **DatabricksStep**: Runs a notebook, script, or compiled JAR on a Databricks cluster.
- **AdlaStep**: Runs a U-SQL job in Azure Data Lake Analytics.

To create a pipeline, you must first define each step and then create a pipeline that includes the steps. The specific configuration of each step depends on the step type.
#### Note: Each step is configured through its own step object from `azureml.pipeline.steps`. After the step objects are created, a `Pipeline` object chains them together.

```python
from azureml.pipeline.steps import PythonScriptStep, EstimatorStep

# Step to run a Python script
step1 = PythonScriptStep(name = 'prepare data',
                         source_directory = 'scripts',
                         script_name = 'data_prep.py',
                         compute_target = 'aml-cluster',
                         runconfig = run_config)

# Step to run an estimator
step2 = EstimatorStep(name = 'train model',
                      estimator = sk_estimator,
                      compute_target = 'aml-cluster')
```

After defining the steps, you can assign them to a pipeline, and run it as an experiment:

```python
from azureml.pipeline.core import Pipeline
from azureml.core import Experiment

# Construct the pipeline
train_pipeline = Pipeline(workspace = ws, steps = [step1,step2])

# Create an experiment and run the pipeline
experiment = Experiment(workspace = ws, name = 'training-pipeline')
pipeline_run = experiment.submit(train_pipeline) # Submit the Pipeline object in place of a run configuration
```

### Pass data between pipeline steps

The **PipelineData** object is a special kind of **DataReference** that:

- References a location in a datastore.
- Creates a **data dependency between pipeline steps**.

To use a PipelineData object to pass data between steps, you must:

1. Define a named PipelineData object that references a location in a datastore.
2. Specify the PipelineData object as an input or output for the steps that use it.
3. Pass the PipelineData object as a script parameter in steps that run scripts (and include code in those scripts to read or write data).

#### Note: A PipelineData object can also be used to pass trained models between steps. A separate class, PipelineDataset, is used for tabular data.

```python
from azureml.core import Dataset
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep, EstimatorStep

# Get a dataset for the initial data
raw_ds = Dataset.get_by_name(ws, 'raw_dataset')

# 1. Define a PipelineData object to pass data between steps
data_store = ws.get_default_datastore()
prepped_data = PipelineData('prepped',  datastore=data_store)

# Step to run a Python script
step1 = PythonScriptStep(name = 'prepare data',
                         source_directory = 'scripts',
                         script_name = 'data_prep.py',
                         compute_target = 'aml-cluster',
                         runconfig = run_config,
                         # Specify dataset as initial input
                         inputs=[raw_ds.as_named_input('raw_data')],
                         # 2 & 3. Specify PipelineData as output and create dependency between steps
                         outputs=[prepped_data],
                         # Also pass as data reference to script
                         arguments = ['--folder', prepped_data])

# Step to run an estimator
step2 = EstimatorStep(name = 'train model',
                      estimator = sk_estimator,
                      compute_target = 'aml-cluster',
                      # 2 & 3. Specify PipelineData as input and create dependency between steps
                      inputs=[prepped_data],
                      # Pass as data reference to estimator script
                      estimator_entry_script_arguments=['--folder', prepped_data])

```

Code in `data_prep.py`:
```python
from azureml.core import Run
import argparse
import os

# Get the experiment run context
run = Run.get_context()

# Get input dataset as dataframe
raw_df = run.input_datasets['raw_data'].to_pandas_dataframe()

# Get PipelineData argument
parser = argparse.ArgumentParser()
parser.add_argument('--folder', type=str, dest='folder')
args = parser.parse_args()
output_folder = args.folder

# code to prep data (in this case, just select specific columns)
prepped_df = raw_df[['col1', 'col2', 'col3']]

# Save prepped data to the PipelineData location
os.makedirs(output_folder, exist_ok=True)
output_path = os.path.join(output_folder, 'prepped_data.csv')
prepped_df.to_csv(output_path) # The file written here becomes the step's PipelineData output
```
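
The consuming step reads from the same location. As a minimal sketch, assuming the estimator's entry script receives the `--folder` argument defined above and that `data_prep.py` wrote `prepped_data.csv` there, the training script could load the rows as follows. The function name `load_prepped_rows` is hypothetical, and the standard-library `csv` module is used only to keep the sketch dependency-free; pandas would work just as well.

```python
import argparse
import csv
import os


def load_prepped_rows(argv=None):
    """Parse --folder and read prepped_data.csv from that folder.

    Hypothetical helper for the training script; argv=None means
    "read the arguments from sys.argv", as argparse does by default.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--folder', type=str, dest='folder', required=True)
    # parse_known_args ignores any other arguments passed to the script
    args, _ = parser.parse_known_args(argv)

    # Same file name that data_prep.py used when saving its output
    input_path = os.path.join(args.folder, 'prepped_data.csv')
    with open(input_path, newline='') as f:
        # Each row is returned as a dict keyed by column name
        return list(csv.DictReader(f))
```

Inside the training script you would call `load_prepped_rows()` with no arguments, so that it picks up the `--folder` value the pipeline passes on the command line.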

### Reuse pipeline steps

By default, the step output from a previous pipeline run is reused without rerunning the step provided the script, source directory, and other parameters for the step have not changed. **Step reuse can reduce the time it takes to run a pipeline, but it can lead to stale results when changes to downstream data sources have not been accounted for**.

To control reuse for an individual step, set `allow_reuse = False` in the step configuration.
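
A minimal sketch of the per-step setting, reusing the data preparation step from earlier. The wrapper function `make_prep_step` is hypothetical and exists only so the SDK import happens when the function is called; in a notebook you would normally pass `allow_reuse=False` straight to the `PythonScriptStep` constructor.

```python
def make_prep_step(run_config, compute_target='aml-cluster'):
    """Build the data preparation step with output reuse disabled.

    Hypothetical helper: run_config is assumed to be the RunConfiguration
    used elsewhere in these notes.
    """
    # Imported here so the sketch can be loaded without the SDK installed
    from azureml.pipeline.steps import PythonScriptStep

    return PythonScriptStep(name='prepare data',
                            source_directory='scripts',
                            script_name='data_prep.py',
                            compute_target=compute_target,
                            runconfig=run_config,
                            # Always rerun this step, even if nothing changed
                            allow_reuse=False)
```

Disabling reuse on a single step is useful when that step reads from a source that can change without the script or its parameters changing.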

When you have multiple steps, you can force all of them to run regardless of individual reuse configuration by setting the `regenerate_outputs` parameter when submitting the pipeline experiment:

```python
pipeline_run = experiment.submit(train_pipeline, regenerate_outputs=True)
```

### Publish pipelines

After you have created a pipeline, you can publish it to create a **REST endpoint** through which the pipeline can be run on demand.

```python
# Approach 1: call the publish method on the Pipeline object
published_pipeline = train_pipeline.publish(name='training_pipeline',
                                            description='Model training pipeline',
                                            version='1.0')

# Approach 2: call the publish_pipeline method on a successful run
# Get the most recent run of the pipeline
pipeline_experiment = ws.experiments.get('training-pipeline')
run = list(pipeline_experiment.get_runs())[0]

# Publish the pipeline from the run
published_pipeline = run.publish_pipeline(name='training_pipeline',
                                          description='Model training pipeline',
                                          version='1.0')
```

After the pipeline has been published, you can view it in Azure Machine Learning studio. You can also determine the URI of its endpoint like this:

```python
rest_endpoint = published_pipeline.endpoint
print(rest_endpoint)
```

To use the endpoint, client applications need to make a REST call over HTTP. This request must be authenticated, so an authorization header is required. A real application would authenticate with a service principal, but for testing you can use the authorization header from your current connection to the Azure workspace, which you can get using the following code:

```python
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()
```

To start a run of a published pipeline, you **make an HTTP POST request to its REST endpoint, passing an authorization header with a token for a service principal that has permission to run the pipeline, and a JSON payload specifying the experiment name**. The pipeline runs asynchronously, so the response from a successful REST call includes the run ID, which you can use to track the run in Azure Machine Learning studio.

```python
import requests

response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "run_training_pipeline"})
run_id = response.json()["Id"]
print(run_id)
```

### Use pipeline parameters

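Define a **PipelineParameter** object and pass it to a step as a script argument. Parameters must be defined before the pipeline is published: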
```python
from azureml.pipeline.core.graph import PipelineParameter

reg_param = PipelineParameter(name='reg_rate', default_value=0.01) # create a PipelineParameter object

...

step2 = EstimatorStep(name = 'train model',
                      estimator = sk_estimator,
                      compute_target = 'aml-cluster',
                      inputs=[prepped_data],
                      estimator_entry_script_arguments=['--folder', prepped_data,
                                                        '--reg', reg_param])
```
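
On the receiving side, the parameter arrives as an ordinary command-line argument. The following is a sketch of how the estimator's entry script might read it, assuming the `--folder` and `--reg` argument names used above; the function name `parse_training_args` and the `reg_rate` destination are hypothetical choices.

```python
import argparse


def parse_training_args(argv=None):
    """Read the folder and regularization rate passed by the pipeline step.

    argv=None means "read from sys.argv", which is what happens when the
    script runs as part of a pipeline.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--folder', type=str, dest='folder')
    # The fallback mirrors the default_value of the PipelineParameter
    parser.add_argument('--reg', type=float, dest='reg_rate', default=0.01)
    # Ignore any additional arguments the step may append
    args, _ = parser.parse_known_args(argv)
    return args
```

The script can then use `args.reg_rate` wherever the model's regularization hyperparameter is set.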

After you publish a parameterized pipeline, you can pass parameter values in the JSON payload for the REST interface:

```python
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "run_training_pipeline",
                               "ParameterAssignments": {"reg_rate": 0.1}})
```
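
If several clients call the endpoint, it can help to build the payload in one place. This is only a sketch: the helper name `build_run_payload` is hypothetical, and the two JSON keys are the ones shown in the requests above.

```python
def build_run_payload(experiment_name, **parameter_values):
    """Build the JSON body for a published pipeline REST call.

    Keyword arguments are sent as pipeline parameter assignments;
    with none given, the pipeline runs with its default values.
    """
    payload = {"ExperimentName": experiment_name}
    if parameter_values:
        payload["ParameterAssignments"] = dict(parameter_values)
    return payload
```

For example, `build_run_payload("run_training_pipeline", reg_rate=0.1)` produces the same dictionary that is passed as `json=` in the request above.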

### Schedule pipelines
```python
# Scheduling a pipeline for periodic intervals: define a ScheduleRecurrence that determines the run frequency, and use it to create a Schedule.
from azureml.pipeline.core import ScheduleRecurrence, Schedule

daily = ScheduleRecurrence(frequency='Day', interval=1)
pipeline_schedule = Schedule.create(ws, name='Daily Training',
                                    description='trains model every day',
                                    pipeline_id=published_pipeline.id,
                                    experiment_name='Training_Pipeline',
                                    recurrence=daily) # pass the ScheduleRecurrence object here

# Triggering a pipeline run on data changes: create a Schedule that monitors a specified path on a datastore
from azureml.core import Datastore
from azureml.pipeline.core import Schedule

training_datastore = Datastore(workspace=ws, name='blob_data')
pipeline_schedule = Schedule.create(ws, name='Reactive Training',
                                    description='trains model on data change',
                                    pipeline_id=published_pipeline.id,
                                    experiment_name='Training_Pipeline',
                                    datastore=training_datastore, # Pass the Datastore here
                                    path_on_datastore='data/training')
```

### [Pattern for creating and using pipelines](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/?view=azure-ml-py#pattern-for-creating-and-using-pipelines)

- **An Azure Machine Learning pipeline is associated with an <ins>Azure Machine Learning workspace</ins>.**
- **A pipeline step is associated with a <ins>compute target</ins> within that workspace.**

A common pattern for pipeline steps is:

1. Specify workspace, compute, and storage
2. Configure your input and output data using:
    - Dataset, which makes available an existing Azure datastore
    - PipelineDataset, which encapsulates typed tabular data
    - PipelineData, which is used for intermediate file or directory data written by one step and intended to be consumed by another
3. Define one or more pipeline steps
4. Instantiate a pipeline using your workspace and steps
5. Create an experiment to which you submit the pipeline
6. Monitor the experiment results