# Example Pipeline Demonstrating the Following Capabilities
- Impute Missing Data
- Split Data into Train/Test partitions
- Train a Machine Learning Model using a hyperparameter sweep

# Create an Impute Missing Job

This cell writes `local_src/impute_missing.py`, a script that fills missing `AGE` values with the column mean and saves the result as `imputed_data.csv`.

In [17]:
%%writefile local_src/impute_missing.py
import os
import argparse
import pandas as pd
from sklearn.model_selection import train_test_split
import logging
import mlflow


def main():
    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--clean_data_csv", type=str, help="name of imputed data")
    
    args = parser.parse_args()

    # Start Logging
    mlflow.start_run()

    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    print("input data:", args.data)

    credit_df = pd.read_csv(args.data, header=1, index_col=0)

    mlflow.log_metric("num_samples", credit_df.shape[0])
    mlflow.log_metric("num_features", credit_df.shape[1] - 1)

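    # Fill missing ages with the column mean; assign back rather than using inplace on a column selection
    credit_df['AGE'] = credit_df['AGE'].fillna(credit_df['AGE'].mean())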

    os.makedirs(args.clean_data_csv, exist_ok=True)

    out_path = os.path.join(os.getcwd(), args.clean_data_csv, "imputed_data.csv")

    credit_df.to_csv(out_path, index=False, header=True)


    credit_df_chk = pd.read_csv(args.data, header=1, index_col=0)

    # Stop Logging
    mlflow.end_run()


if __name__ == "__main__":
    main()

Overwriting local_src/impute_missing.py
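
The mean imputation performed by the script can be tried on a small in-memory frame. This is a minimal sketch, assuming a toy `AGE` column rather than the credit card data. It assigns the result of `fillna` back to the column instead of calling it with `inplace=True` on a column selection, a pattern that newer pandas releases discourage.

```python
import pandas as pd

# Toy data (an assumption for illustration): one missing AGE value.
df = pd.DataFrame({"AGE": [20.0, None, 40.0], "LIMIT_BAL": [1000, 2000, 3000]})

# mean() skips missing values, so it is computed from 20 and 40 only.
age_mean = df["AGE"].mean()

# Assign the filled column back instead of mutating a column selection in place.
df["AGE"] = df["AGE"].fillna(age_mean)

print(df["AGE"].tolist())
```

Because the mean ignores the missing entry, the filled value is the average of the observed ages only.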


This cell writes `impute_missing.yaml`, the command job specification that runs `impute_missing.py` on the `cpu-cluster` compute.

In [18]:
%%writefile impute_missing.yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
name: impute_missing
code: local_src
command: >-
  python impute_missing.py 
  --data ${{inputs.data}} 
  --clean_data_csv ${{inputs.clean_data_csv}}
inputs:
  data: 
    type: uri_file
  clean_data_csv: 
    type: string
outputs:
  clean_data_csv:
    type: uri_file
    mode: rw_mount
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1
compute: azureml:cpu-cluster
display_name: sweep_clean_data
experiment_name: aml-examples
description: Train a Machine Learning model using a workspace Data asset.

Overwriting impute_missing.yaml
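
The `command` string in the YAML is a template: `${{inputs.<name>}}` and `${{outputs.<name>}}` placeholders are replaced with concrete values before the script runs. The sketch below only illustrates that substitution idea and is not how Azure ML implements it. The helper `resolve_command` and the paths are hypothetical; the command uses the `${{outputs.clean_data_csv}}` form that appears in the pipeline definition later in this notebook.

```python
import re

# Matches placeholders such as ${{inputs.data}} or ${{outputs.clean_data_csv}}.
PLACEHOLDER = re.compile(r"\$\{\{\s*(inputs|outputs)\.(\w+)\s*\}\}")


def resolve_command(template, inputs, outputs):
    """Substitute ${{inputs.x}} / ${{outputs.y}} placeholders (hypothetical helper)."""
    values = {"inputs": inputs, "outputs": outputs}

    def lookup(match):
        section, key = match.group(1), match.group(2)
        # A KeyError here means the command references an undeclared name.
        return str(values[section][key])

    return PLACEHOLDER.sub(lookup, template)


command = (
    "python impute_missing.py "
    "--data ${{inputs.data}} "
    "--clean_data_csv ${{outputs.clean_data_csv}}"
)
resolved = resolve_command(
    command,
    inputs={"data": "/mnt/in/credit.csv"},         # made-up path
    outputs={"clean_data_csv": "/mnt/out/clean"},  # made-up path
)
print(resolved)
```

A placeholder that names an undeclared input or output raises `KeyError` in this sketch, which mirrors the kind of mismatch worth checking for between the `command` and the `inputs`/`outputs` sections.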


This cell registers the specification with `az ml component create -f impute_missing.yaml`; the CLI prints the created resource as JSON (truncated here).

In [19]:
!az ml component create -f impute_missing.yaml

{
  "$schema": "https://azuremlschemas.azureedge.net/latest/commandJob.schema.json",
  "code": "azureml:/subscriptions/781b03e7-6eb7-4506-bab8-cf3a0d89b1d4/resourceGroups/SandboxML/providers/Microsoft.MachineLearningServices/workspaces/quick-start-tutorial/codes/33a37ca7-0488-4bff-90e6-30e9e7e60eaf/versions/1",
  "command": "python impute_missing.py  --data ${{inputs.data}}  --clean_data_csv ${{inputs.clean_data_csv}}",
  "creation_context": {
    "created_at": "2023-11-10T15:22:26.920530+00:00",
    "created_by": "Anton Slutsky",
    "created_by_type": "User",
    "last_modified_at": "2023-11-10T15:22:27.012979+00:00",
    "last_modified_by": "Anton Slutsky",
    "last_modified_by_type": "User"
  },
  "description": "Train a Machine Learning model using a workspace Data asset.",
  "display_name": "sweep_clean_data",
  "environment": "azureml://registries/azureml/environments/AzureML-sklearn-1.0-ubuntu20.04-py38-cpu/versions/1",
  "id": "azureml:/subscriptions/781b03e7-6eb7-4506-bab8-c

# Create a Data Splitter Job

This cell writes `local_src/split_data.py`, a script that reads the imputed data and splits it into train and test partitions.

In [20]:
%%writefile local_src/split_data.py
import os
import argparse
import pandas as pd
from sklearn.model_selection import train_test_split
import logging
import mlflow
from os import listdir


def main():
    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--train_data_csv", type=str, help="name of train data")
    parser.add_argument("--test_data_csv", type=str, help="name of test data")
    args = parser.parse_args()

    # Start Logging
    mlflow.start_run()

    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    print("input data:", args.data)
    print("Dir:\n", listdir(args.data))

    credit_df = pd.read_csv(f"{args.data}/imputed_data.csv", header=1, index_col=0)

    mlflow.log_metric("num_samples", credit_df.shape[0])
    mlflow.log_metric("num_features", credit_df.shape[1] - 1)

    credit_train_df, credit_test_df = train_test_split(
        credit_df,
        test_size=args.test_train_ratio,
    )

    os.makedirs(args.train_data_csv, exist_ok=True)
    os.makedirs(args.test_data_csv, exist_ok=True)

    credit_train_df.to_csv(os.path.join(os.getcwd(), args.train_data_csv, "data.csv"), index=False)

    credit_test_df.to_csv(os.path.join(os.getcwd(), args.test_data_csv, "data.csv"), index=False)

    # Stop Logging
    mlflow.end_run()


if __name__ == "__main__":
    main()

Overwriting local_src/split_data.py
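
The split itself is a single call to scikit-learn's `train_test_split`. The following is a minimal sketch on a toy frame (the data and the `random_state` argument are assumptions added for repeatability; the script does not set a seed).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (an assumption for illustration): 8 rows, one feature, one label.
df = pd.DataFrame({"x": range(8), "label": [0, 1] * 4})

# test_size=0.25 mirrors the script's default --test_train_ratio.
# random_state is added here only to make the sketch repeatable.
train_df, test_df = train_test_split(df, test_size=0.25, random_state=0)

print(len(train_df), len(test_df))
```

With 8 rows and a ratio of 0.25, the test partition holds 2 rows and the training partition holds the remaining 6; every row lands in exactly one partition.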


This cell rewrites `local_src/split_data.py`. It now reads `imputed_data.csv` with `header=0`, because the imputed file has a single header row, and prints the column names for inspection.

In [21]:
%%writefile local_src/split_data.py
import os
import argparse
import pandas as pd
from sklearn.model_selection import train_test_split
import logging
import mlflow
from os import listdir


def main():
    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--train_data_csv", type=str, help="name of train data")
    parser.add_argument("--test_data_csv", type=str, help="name of test data")
    args = parser.parse_args()

    # Start Logging
    mlflow.start_run()

    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    print("input data:", args.data)
    print("Dir:\n", listdir(args.data))



    credit_df = pd.read_csv(f"{args.data}/imputed_data.csv", header=0, index_col=0)

    print("Headers:::::::", credit_df.columns)

    mlflow.log_metric("num_samples", credit_df.shape[0])
    mlflow.log_metric("num_features", credit_df.shape[1] - 1)

    credit_train_df, credit_test_df = train_test_split(
        credit_df,
        test_size=args.test_train_ratio,
    )

    os.makedirs(args.train_data_csv, exist_ok=True)
    os.makedirs(args.test_data_csv, exist_ok=True)

    credit_train_df.to_csv(os.path.join(os.getcwd(), args.train_data_csv, "data.csv"), index=False)

    credit_test_df.to_csv(os.path.join(os.getcwd(), args.test_data_csv, "data.csv"), index=False)

    # Stop Logging
    mlflow.end_run()


if __name__ == "__main__":
    main()

Overwriting local_src/split_data.py
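
The two versions of `split_data.py` differ in the `header` argument passed to `pd.read_csv`. The sketch below, using a made-up three-row CSV, shows why that matters for a file with a single header row: `header=1` promotes the first data row to column names and drops the real header.

```python
import io
import pandas as pd

# A CSV with a single header row, like a file written by to_csv(header=True).
csv_text = "AGE,LIMIT_BAL\n20,1000\n30,2000\n40,3000\n"

one_header = pd.read_csv(io.StringIO(csv_text), header=0)
# header=1 treats the second line as the header and discards the line above it.
shifted = pd.read_csv(io.StringIO(csv_text), header=1)

print(list(one_header.columns), len(one_header))
print(list(shifted.columns), len(shifted))
```

Reading with `header=0` keeps all three data rows under the `AGE` and `LIMIT_BAL` columns, while `header=1` leaves two rows under columns named after the first row's values.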


This cell writes `split_data.yaml`, the command job specification for the split step; it passes the two output folders to `split_data.py`.

In [22]:
%%writefile split_data.yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
name: sweep_clean_data
code: local_src
command: >-
  python split_data.py 
  --data ${{inputs.data}} 
  --test_train_ratio ${{inputs.test_train_ratio}} 
  --train_data_csv ${{outputs.train_data_csv}}
  --test_data_csv ${{outputs.test_data_csv}}
inputs:
  data: 
    type: uri_file
  test_train_ratio: 
    type: number
  train_data_csv: 
    type: string
  test_data_csv: 
    type: string
outputs:
  train_data_csv:
    type: uri_folder
    mode: rw_mount
  test_data_csv:
    type: uri_folder
    mode: rw_mount
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1
compute: azureml:cpu-cluster
display_name: sweep_clean_data
experiment_name: aml-examples
description: Train a Machine Learning model using a workspace Data asset.

Overwriting split_data.yaml


This cell registers the specification with `az ml component create -f split_data.yaml` (output truncated).

In [23]:
!az ml component create -f split_data.yaml

{
  "$schema": "https://azuremlschemas.azureedge.net/latest/commandJob.schema.json",
  "code": "azureml:/subscriptions/781b03e7-6eb7-4506-bab8-cf3a0d89b1d4/resourceGroups/SandboxML/providers/Microsoft.MachineLearningServices/workspaces/quick-start-tutorial/codes/33a37ca7-0488-4bff-90e6-30e9e7e60eaf/versions/1",
  "command": "python split_data.py  --data ${{inputs.data}}  --test_train_ratio ${{inputs.test_train_ratio}}  --train_data_csv ${{outputs.train_data_csv}} --test_data_csv ${{outputs.test_data_csv}}",
  "creation_context": {
    "created_at": "2023-11-10T15:22:51.781031+00:00",
    "created_by": "Anton Slutsky",
    "created_by_type": "User",
    "last_modified_at": "2023-11-10T15:22:51.881956+00:00",
    "last_modified_by": "Anton Slutsky",
    "last_modified_by_type": "User"
  },
  "description": "Train a Machine Learning model using a workspace Data asset.",
  "display_name": "sweep_clean_data",
  "environment": "azureml://registries/azureml/environments/AzureML-sklearn-1.0-ub

# Create Model Training Job

This cell writes `local_src/train.py`, which trains a `GradientBoostingClassifier`, logs its test accuracy to MLflow, and registers the model.

In [26]:
%%writefile local_src/train.py
import os
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from os import listdir

def main():
    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--n_estimators", required=False, default=100, type=int)
    parser.add_argument("--learning_rate", required=False, default=0.1, type=float)
    parser.add_argument("--registered_model_name", type=str, help="model name")
    args = parser.parse_args()
   
    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    ###################
    #<prepare the data>
    ###################
    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    print("input data:", args.data)

    print("Dir:", listdir(args.data))

    
    credit_df = pd.read_csv(f"{args.data}/data.csv", header=0, index_col=0)

    mlflow.log_metric("num_samples", credit_df.shape[0])
    mlflow.log_metric("num_features", credit_df.shape[1] - 1)

    #Split train and test datasets
    train_df, test_df = train_test_split(
        credit_df,
        test_size=args.test_train_ratio,
    )
    ####################
    #</prepare the data>
    ####################

    ##################
    #<train the model>
    ##################


    print("Columns:\n", list(train_df.columns))

    # Extracting the label column
    y_train = train_df.pop("default payment next month")

    # convert the dataframe values to array
    X_train = train_df.values

    # Extracting the label column
    y_test = test_df.pop("default payment next month")

    # convert the dataframe values to array
    X_test = test_df.values

    print(f"Training with data of shape {X_train.shape}")

    clf = GradientBoostingClassifier(
        n_estimators=args.n_estimators, learning_rate=args.learning_rate
    )
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)

    ####################
    # Log classifier accuracy
    ####################

    accuracy = clf.score(X_test, y_test)
    print('Accuracy of gradient boosting classifier on test set: {:.2f}'.format(accuracy))
    mlflow.log_metric('accuracy', float(accuracy))
    # mlflow.log_metric('accuracy', 1.0)
    print(classification_report(y_test, y_pred))
    ###################
    #</train the model>
    ###################

    ##########################
    #<save and register model>
    ##########################
    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=clf,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    # Saving the model to a file
    mlflow.sklearn.save_model(
        sk_model=clf,
        path=os.path.join(args.registered_model_name, "trained_model"),
    )
    ###########################
    #</save and register model>
    ###########################
    
    # Stop Logging
    mlflow.end_run()

if __name__ == "__main__":
    main()

Overwriting local_src/train.py
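
The training and scoring steps of the script can be tried without MLflow or the credit data. This is a minimal sketch on synthetic data from `make_classification` (an assumption for illustration), using the same estimator and the script's default `n_estimators` and `learning_rate`; the `random_state` values are added only for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same estimator and default hyperparameters as the script's argparse defaults.
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
# For classifiers, score() is the mean accuracy on the given data.
accuracy = clf.score(X_test, y_test)
print(f"accuracy={accuracy:.2f}")
```

The value returned by `clf.score` is the same accuracy that `accuracy_score(y_test, y_pred)` computes, which is the metric the sweep later maximizes.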


This cell writes `train.yaml`, the command component used as the trial of the sweep; `learning_rate` is the hyperparameter exposed as an input.

In [27]:
%%writefile train.yaml
# <component>
name: sweep_train_credit_defaults_component
display_name: Sweep Train Credit Defaults Component
# version: 1 # Not specifying a version will automatically update the version
type: command
inputs:
  train_data_csv: 
    type: uri_folder
    mode: ro_mount
  test_data_csv: 
    type: uri_file
    mode: ro_mount
  learning_rate:
    type: number     
  registered_model_name:
    type: string
outputs:
  model:
    type: uri_folder
code: local_src
environment:
  # for this step, we'll use an AzureML curated environment
  azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1
command: >-
  python train.py 
  --data ${{inputs.train_data_csv}}  
  --learning_rate ${{inputs.learning_rate}}
  --registered_model_name ${{inputs.registered_model_name}} 

Overwriting train.yaml


This cell registers the training component with `az ml component create -f train.yaml` (output truncated).

In [28]:
!az ml component create -f train.yaml

{
  "code": "azureml:/subscriptions/781b03e7-6eb7-4506-bab8-cf3a0d89b1d4/resourceGroups/SandboxML/providers/Microsoft.MachineLearningServices/workspaces/quick-start-tutorial/codes/33a37ca7-0488-4bff-90e6-30e9e7e60eaf/versions/1",
  "command": "python train.py  --data ${{inputs.train_data_csv}}   --learning_rate ${{inputs.learning_rate}} --registered_model_name ${{inputs.registered_model_name}} ",
  "creation_context": {
    "created_at": "2023-11-10T15:23:05.267614+00:00",
    "created_by": "Anton Slutsky",
    "created_by_type": "User",
    "last_modified_at": "2023-11-10T15:23:05.343383+00:00",
    "last_modified_by": "Anton Slutsky",
    "last_modified_by_type": "User"
  },
  "display_name": "Sweep Train Credit Defaults Component",
  "environment": "azureml://registries/azureml/environments/AzureML-sklearn-1.0-ubuntu20.04-py38-cpu/versions/1",
  "id": "azureml:/subscriptions/781b03e7-6eb7-4506-bab8-cf3a0d89b1d4/resourceGroups/SandboxML/providers/Microsoft.MachineLearningServices/wor

# Create Pipeline

This cell writes `sweep_pipeline.yaml`, which chains the impute, split, and sweep steps. `[RANDID]` in the pipeline name is a placeholder that a later cell replaces with a random integer.

In [29]:
%%writefile sweep_pipeline.yaml
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
name: pipeline_inpute_and_hyper_sweep_[RANDID]
display_name: pipeline_inpute_and_hyper_sweep_XX
description: Tune hyperparameters using TF component
settings:
    default_compute: azureml:cpu-cluster

jobs:
  impute_missing:
    type: command
    inputs:
      data: 
        type: uri_file
        path: azureml:credit_cards@latest
      clean_data_csv: clean 
    outputs:
      clean_data_csv: 
        mode: upload
    code: local_src
    environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
    compute: azureml:cpu-cluster
    command: >-
      python impute_missing.py 
      --data ${{inputs.data}} 
      --clean_data_csv ${{outputs.clean_data_csv}}

  split_data:
    type: command
    inputs:
      data: ${{parent.jobs.impute_missing.outputs.clean_data_csv}}
      test_train_ratio: 0.25
      train_data_csv: train 
      test_data_csv: test
    outputs:
      train_data_csv: 
        mode: upload
      test_data_csv: 
        mode: upload
    code: local_src
    environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
    compute: azureml:cpu-cluster
    command: >-
      python split_data.py 
      --data ${{inputs.data}} 
      --test_train_ratio ${{inputs.test_train_ratio}} 
      --train_data_csv ${{outputs.train_data_csv}}
      --test_data_csv ${{outputs.test_data_csv}}

  sweep_step:
    type: sweep
    trial: train.yaml
    inputs:
      train_data_csv: ${{parent.jobs.split_data.outputs.train_data_csv}}
      test_data_csv: ${{parent.jobs.split_data.outputs.test_data_csv}}
      registered_model_name: sweeped_credit_default_model
    sampling_algorithm: random
    search_space:
      learning_rate: 
        type: uniform
        min_value: 0.1
        max_value: 3.0
    objective:
      goal: maximize
      primary_metric: accuracy
    limits:
      max_total_trials: 4
      max_concurrent_trials: 2
      timeout: 3600



Overwriting sweep_pipeline.yaml
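
The sweep settings describe random sampling of `learning_rate` from a uniform range, a cap of four trials, and an objective that maximizes `accuracy`. The sketch below illustrates that idea in plain Python; it is not the Azure ML sweep implementation. The helpers `sample_learning_rates` and `pick_best` are hypothetical, and the metric is a placeholder formula standing in for the accuracy that a real trial would log.

```python
import random


def sample_learning_rates(min_value, max_value, max_total_trials, seed=None):
    """Draw one learning rate per trial from a uniform range (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.uniform(min_value, max_value) for _ in range(max_total_trials)]


def pick_best(trials, goal="maximize"):
    """Return the (learning_rate, metric) pair that best meets the objective."""
    chooser = max if goal == "maximize" else min
    return chooser(trials, key=lambda pair: pair[1])


# Values taken from the search_space and limits in the YAML above.
rates = sample_learning_rates(0.1, 3.0, max_total_trials=4, seed=0)

# Placeholder metric (an assumption): a real trial reports the accuracy logged by train.py.
trials = [(lr, 1.0 / (1.0 + lr)) for lr in rates]
best_lr, best_metric = pick_best(trials)
print(best_lr, best_metric)
```

Each trial corresponds to one run of the training component with a different sampled `learning_rate`; the sweep then reports the trial whose primary metric best meets the goal.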


This cell replaces the `[RANDID]` placeholder in `sweep_pipeline.yaml` with a random integer so that each run gets a unique pipeline name.

In [30]:
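import random

# Read the pipeline definition and substitute a random ID for the placeholder.
with open("sweep_pipeline.yaml") as src:
    pipeline_yaml = src.read().replace("[RANDID]", str(random.randint(0, 1000000)))

with open("sweep_pipeline.yaml", "w") as out:
    out.write(pipeline_yaml)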


This cell registers the pipeline as a component with `az ml component create -f sweep_pipeline.yaml`.

In [31]:
!az ml component create -f sweep_pipeline.yaml

{
  "$schema": "https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json",
  "creation_context": {
    "created_at": "2023-11-10T15:23:20.628895+00:00",
    "created_by": "Anton Slutsky",
    "created_by_type": "User",
    "last_modified_at": "2023-11-10T15:23:20.723304+00:00",
    "last_modified_by": "Anton Slutsky",
    "last_modified_by_type": "User"
  },
  "description": "Tune hyperparameters using TF component",
  "display_name": "pipeline_inpute_and_hyper_sweep_XX",
  "id": "azureml:/subscriptions/781b03e7-6eb7-4506-bab8-cf3a0d89b1d4/resourceGroups/SandboxML/providers/Microsoft.MachineLearningServices/workspaces/quick-start-tutorial/components/pipeline_inpute_and_hyper_sweep_766615/versions/1",
  "is_deterministic": false,
  "name": "pipeline_inpute_and_hyper_sweep_766615",
  "resourceGroup": "SandboxML",
  "type": "pipeline",
  "version": "1"
}


Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


This cell submits the pipeline with `az ml job create --file sweep_pipeline.yaml`, which starts the impute, split, and sweep steps (output truncated).

In [32]:
!az ml job create --file sweep_pipeline.yaml

{
  "creation_context": {
    "created_at": "2023-11-10T15:23:37.108591+00:00",
    "created_by": "Anton Slutsky",
    "created_by_type": "User"
  },
  "description": "Tune hyperparameters using TF component",
  "display_name": "pipeline_inpute_and_hyper_sweep_XX",
  "experiment_name": "aml-examples",
  "id": "azureml:/subscriptions/781b03e7-6eb7-4506-bab8-cf3a0d89b1d4/resourceGroups/SandboxML/providers/Microsoft.MachineLearningServices/workspaces/quick-start-tutorial/jobs/pipeline_inpute_and_hyper_sweep_766615",
  "jobs": {
    "impute_missing": {
      "component": "azureml:azureml_anonymous:f89e1f11-9fc5-43d7-b01c-823af0ebbf5b",
      "compute": "azureml:cpu-cluster",
      "inputs": {
        "clean_data_csv": "clean",
        "data": {
          "path": "azureml:/subscriptions/781b03e7-6eb7-4506-bab8-cf3a0d89b1d4/resourceGroups/SandboxML/providers/Microsoft.MachineLearningServices/workspaces/quick-start-tutorial/data/credit_cards/versions/3",
          "type": "uri_file"
        }

