# Azure Machine Learning Pipelines with AutoML

This demonstration shows how to build a training pipeline in Azure Machine Learning that covers data preparation, training with AutoML, and model registration.

This demonstration is adapted from the following Azure ML pipeline example:

[AutoML House Pricing Regression in Pipeline](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/pipelines/1h_automl_in_pipeline/automl-regression-house-pricing-in-pipeline)

>[NOTE] This lab requires Python 3.10 and the Azure ML Python SDK v2.

## Import the required libraries

In [None]:
# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

from azure.ai.ml import MLClient, Input, command, Output
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.automl import regression
from azure.ai.ml.automl import TabularFeaturizationSettings
from azure.ai.ml.entities import Environment, AmlCompute

In [None]:
# Note: AutoML steps in Pipelines are in Preview state at this time. Set AZURE_ML_CLI_PRIVATE_FEATURES_ENABLED to true to enable.
import os
os.environ["AZURE_ML_CLI_PRIVATE_FEATURES_ENABLED"] = "true"

## Get a reference to the Machine Learning workspace from the config file created in previous steps
In this step, we read the details of the previously created Machine Learning workspace from the config file.

The cells below can be run if you are running the notebook locally on this machine and you created the workspace using the portal. `MLClient.from_config` takes the workspace details from the config file, so if the file still contains placeholders, replace the subscription-id, resource-group and workspace-name values with your own.
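
For orientation, the config file that `MLClient.from_config` reads is a small JSON document named `config.json` with three keys. The sketch below writes one into a temporary folder so the shape is visible. The helper name `write_workspace_config` and the placeholder values are assumptions for illustration, and in the lab the file should already exist from the previous steps.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json holding the three workspace identifiers."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=2))
    return path


# Placeholder values (assumptions): replace them with your own workspace details.
demo_dir = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_dir,
    subscription_id="<subscription-id>",
    resource_group="<resource-group>",
    workspace_name="<workspace-name>",
)
print(config_path.read_text())
```

If the file lives outside the notebook's working directory, `MLClient.from_config` also accepts a `path` argument pointing at the folder to search.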

In [None]:
try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

In [None]:
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

In [None]:
# Print workspace metadata
ml_client.workspaces.get()

## Create compute cluster
In the step below, we create the compute target that will run the pipeline.

In [None]:
# Name assigned to the compute cluster
cpu_compute_target = "cpu-cluster"

try:
    # let's see if the compute target already exists
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(
        f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print("Creating a new cpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    cpu_cluster = AmlCompute(
        name=cpu_compute_target,
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="STANDARD_DS3_V2",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=4,
        # Seconds a node stays idle after a job finishes before the cluster scales it down
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )

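    # Now, we pass the object to MLClient's begin_create_or_update method.
    # It returns a poller, so .result() waits for provisioning and returns the compute object
    cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster).result()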

print(
    f"AMLCompute with name {cpu_cluster.name} is created, the compute size is {cpu_cluster.size}"
)

## Basic pipeline job with AutoML regression task

### Define Environment for data preprocessing step in the pipeline

In [None]:
env_docker_conda = Environment(
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file="./environment/preprocessing_env.yaml",
    name="pipeline-custom-environment",
    description="Environment created from a Docker image plus Conda environment.",
)
ml_client.environments.create_or_update(env_docker_conda)

### Build pipeline

In [None]:
# Define pipeline
@pipeline(
    description="AutoML Car Price Regression Pipeline",
)
def automl_regression(
    regression_train_data, regression_validation_data, regression_test_data
):
    # define command function for preprocessing the data
    preprocessing_command_func = command(
        inputs=dict(
            train_data=Input(type="mltable"),
            test_data=Input(type="mltable"),
            validation_data=Input(type="mltable"),
        ),

        outputs=dict(
            preprocessed_train_data=Output(type="mltable"),
            preprocessed_test_data=Output(type="mltable"),
            preprocessed_validation_data=Output(type="mltable"),
        ),
        
        code="./src/preprocess.py",
        command="python preprocess.py "
        + "--train_data ${{inputs.train_data}} "
        + "--validation_data ${{inputs.validation_data}} "
        + "--test_data ${{inputs.test_data}} "
        + "--preprocessed_train_data ${{outputs.preprocessed_train_data}} "
        + "--preprocessed_validation_data ${{outputs.preprocessed_validation_data}} "
        + "--preprocessed_test_data ${{outputs.preprocessed_test_data}}",
        environment="pipeline-custom-environment@latest",
    )

    # define command task for preprocessing the data
    preprocess_node = preprocessing_command_func(
        train_data=regression_train_data,
        test_data=regression_test_data,
        validation_data=regression_validation_data,
    )

    # define the AutoML regression task with AutoML function
    regression_node = regression(
        primary_metric="r2_score",
        target_column_name="price",
        training_data=preprocess_node.outputs.preprocessed_train_data,
        test_data=preprocess_node.outputs.preprocessed_test_data,
        validation_data=preprocess_node.outputs.preprocessed_validation_data,
        featurization=TabularFeaturizationSettings(mode="auto"),
        enable_model_explainability=True,
        # currently the "best_model" output must be declared explicitly as an
        # "mlflow_model" so that the following nodes can reference it
        outputs={"best_model": Output(type="mlflow_model")},
    )

    # set limits & training
    regression_node.set_limits(max_trials=5, max_concurrent_trials=2)
    regression_node.set_training(
        enable_stack_ensemble=True, enable_vote_ensemble=True
    )

    # define command function for registering the model
    command_func = command(
        inputs=dict(
            model_input_path=Input(type="mlflow_model"),
            model_base_name="RULPredictInitial",
        ),
        code="./src/register.py",
        command="python register.py "
        + "--model_input_path ${{inputs.model_input_path}} "
        + "--model_base_name ${{inputs.model_base_name}}",
        environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1",
    )
    
    register_model = command_func(model_input_path=regression_node.outputs.best_model)


pipeline_regression = automl_regression(
    regression_train_data=Input(path="./data/car-price-data/training/", type="mltable"),
    regression_validation_data=Input(
        path="./data/car-price-data/validation/", type="mltable"
    ),
    regression_test_data=Input(path="./data/car-price-data/test/", type="mltable"),
)

# set pipeline level compute to the cluster created earlier
pipeline_regression.settings.default_compute = cpu_compute_target
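
The preprocessing command in the pipeline above passes six folder paths to `preprocess.py`, which is not shown in this notebook. As a rough sketch only, a script that accepts those arguments could begin as follows. The flag names are taken from the command string, while the function name `parse_args`, the example paths, and whatever the real script does with the data afterwards are assumptions.

```python
import argparse

# The three dataset splits named in the pipeline's preprocessing command.
SPLITS = ("train", "validation", "test")


def parse_args(argv=None):
    """Parse the input and output folder paths supplied by the pipeline step."""
    parser = argparse.ArgumentParser(description="Sketch of a preprocess.py entry point")
    for split in SPLITS:
        # Input folder that Azure ML substitutes for ${{inputs.<split>_data}}
        parser.add_argument(f"--{split}_data", type=str, required=True)
        # Output folder that Azure ML substitutes for ${{outputs.preprocessed_<split>_data}}
        parser.add_argument(f"--preprocessed_{split}_data", type=str, required=True)
    return parser.parse_args(argv)


# Hypothetical paths, used here only to exercise the parser locally.
example_argv = []
for split in SPLITS:
    example_argv += [f"--{split}_data", f"/tmp/in/{split}"]
    example_argv += [f"--preprocessed_{split}_data", f"/tmp/out/{split}"]

args = parse_args(example_argv)
print(args.train_data, "->", args.preprocessed_train_data)
```

When the step runs in the pipeline, Azure ML replaces each `${{inputs...}}` and `${{outputs...}}` placeholder with a real path before the command executes, so the script only ever sees ordinary folder paths.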

## Submit pipeline job

In [None]:
# submit the pipeline job
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_regression, experiment_name="Car-Price-Regression-Experiment"
)
pipeline_job

In [None]:
# Wait until the job completes
ml_client.jobs.stream(pipeline_job.name)
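
`ml_client.jobs.stream` blocks and prints logs until the job ends. If you would rather check the status periodically without streaming logs, a polling loop is one option. The sketch below is a minimal version under stated assumptions: the helper name `wait_for_job`, the polling defaults, and the set of terminal status strings are illustrative choices, and the stand-in client exists only so the loop can be tried without a workspace.

```python
import time
from types import SimpleNamespace

# Assumed terminal status strings for an Azure ML job.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_job(client, job_name, poll_seconds=30, max_polls=120):
    """Poll client.jobs.get(job_name).status until it is terminal, then return it."""
    for _ in range(max_polls):
        status = client.jobs.get(job_name).status
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_name} did not finish after {max_polls} polls")


# Stand-in client for a dry run (illustration only, not an Azure ML object).
statuses = iter(["Running", "Running", "Completed"])
fake_client = SimpleNamespace(
    jobs=SimpleNamespace(get=lambda name: SimpleNamespace(status=next(statuses)))
)
final_status = wait_for_job(fake_client, "demo-job", poll_seconds=0)
print(final_status)
```

Against a real workspace, the call would look like `wait_for_job(ml_client, pipeline_job.name)`.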