# Train a Diabetes Model with an Azure ML Pipeline


## Contents
### 1. Connect to your workspace

### 2. Upload and register diabetes dataset

### 3. Set up an Azure ML pipeline

# 1. Connect to your workspace

To get started, connect to your workspace.

>**Note**: If you haven't already established an authenticated session with your Azure subscription, you'll be prompted to authenticate by clicking a link, entering an authentication code, and signing into Azure.

In [57]:
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to work with', ws.name)

Ready to work with isolation_forest


# 2. Upload and register diabetes dataset

In [58]:
from azureml.core import Dataset

# 1. If you want to use another type of datastore, you must create it first
default_ds = ws.get_default_datastore()

# 2. Upload data files to the default datastore
default_ds.upload_files(
    files=['./data/diabetes.csv'],
    target_path='diabetes-data/', # Datastore location
    overwrite=True,
    show_progress=True)

Uploading an estimated of 1 files
Uploading ./data/diabetes.csv
Uploaded ./data/diabetes.csv, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_b89e93e0ce524603b0f2223ec8db6233

In [59]:
# 3. Create a tabular dataset from the path on the datastore
print('Creating dataset...')
data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'diabetes-data/*.csv'))

# 4. Register the tabular dataset
print('Registering dataset...')

data_set = data_set.register(
    workspace=ws, 
    name='diabetes dataset',
    description='diabetes data',
    tags = {'format':'CSV'},
    create_new_version=True
    )



Creating dataset...
Registering dataset...


# 3. Set up an Azure ML pipeline

- Create scripts for pipeline steps
- Prepare a compute environment 
- Define Python environment
- Run pipeline as an experiment

## a. Create scripts for pipeline steps
Pipelines consist of one or more steps, which can be Python scripts, or specialized steps like a data transfer step that copies data from one location to another. Each step can run in its own compute context. In this demo, we will build a simple pipeline that contains two Python script steps: one to pre-process some training data, and another to use the pre-processed data to train and register a model.

First, let's create a folder for the script files we'll use in the pipeline steps.

In [60]:
import os
# Create a folder for the pipeline step files
experiment_folder = 'diabetes_pipeline'
os.makedirs(experiment_folder, exist_ok=True)

print(experiment_folder)

diabetes_pipeline


Now let's create the first script, which will read data from the diabetes dataset and apply some simple pre-processing to remove any rows with missing data and normalize the numeric features so they're on a similar scale.

The script includes an argument named --prepped-data, which references the folder where the resulting data should be saved.
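As a quick illustration of how the script's arguments are parsed, the minimal sketch below rebuilds the same two `argparse` definitions and parses a hand-written argument list. The argument values (`dataset-id-placeholder`, `/tmp/prepped`) are made up for the example; in the pipeline, Azure ML supplies the real values at run time.

```python
import argparse

# Same argument definitions as prep_diabetes.py
parser = argparse.ArgumentParser()
parser.add_argument('--input-data', type=str, dest='raw_dataset_id', help='raw dataset')
parser.add_argument('--prepped-data', type=str, dest='prepped_data',
                    default='prepped_data', help='Folder for results')

# Hypothetical values, standing in for what the pipeline passes at run time
args = parser.parse_args(['--input-data', 'dataset-id-placeholder',
                          '--prepped-data', '/tmp/prepped'])
print(args.raw_dataset_id)
print(args.prepped_data)

# With no --prepped-data argument, the default folder name is used
defaults = parser.parse_args([])
print(defaults.prepped_data)
```

Note that the hyphenated option name `--prepped-data` is exposed on the parsed object under the `dest` name `prepped_data`.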

In [61]:
%%writefile $experiment_folder/prep_diabetes.py 
# Import libraries
import os
import argparse
import pandas as pd
from azureml.core import Run
from sklearn.preprocessing import MinMaxScaler

# Get parameters
parser = argparse.ArgumentParser()
parser.add_argument("--input-data", type=str, dest='raw_dataset_id', help='raw dataset')
parser.add_argument('--prepped-data', type=str, dest='prepped_data', default='prepped_data', help='Folder for results')
args = parser.parse_args()
save_folder = args.prepped_data

# Get the experiment run context
run = Run.get_context()

# load the data (passed as an input dataset)
print("Loading Data...")
diabetes = run.input_datasets['raw_data'].to_pandas_dataframe()

# Log raw row count
row_count = (len(diabetes))
run.log('raw_rows', row_count)

# remove nulls
diabetes = diabetes.dropna()

# Normalize the numeric columns
scaler = MinMaxScaler()
num_cols = ['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree']
diabetes[num_cols] = scaler.fit_transform(diabetes[num_cols])

# Log processed rows
row_count = (len(diabetes))
run.log('processed_rows', row_count)

# Save the prepped data
print("Saving Data...")
os.makedirs(save_folder, exist_ok=True)
save_path = os.path.join(save_folder,'data.csv')
diabetes.to_csv(save_path, index=False, header=True)

# End the run
run.complete()

Overwriting diabetes_pipeline/prep_diabetes.py
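
For reference, min-max scaling maps each value x in a column to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The sketch below applies that formula in plain Python to a made-up column. It illustrates the arithmetic only and is not a replacement for `MinMaxScaler`; the function name `min_max_scale` and the handling of constant columns are choices made for this example.

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; this sketch maps every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up sample values, not taken from diabetes.csv
sample = [20.0, 25.0, 30.0, 40.0]
scaled = min_max_scale(sample)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```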


Now you can create the script for the second step, which will train a model. The script includes an argument named --training-folder, which references the folder where the prepared data was saved by the previous step.

In [62]:
%%writefile $experiment_folder/train_diabetes.py
# Import libraries
from azureml.core import Run, Model
import argparse
import pandas as pd
import numpy as np
import joblib
import os
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

# Get parameters
parser = argparse.ArgumentParser()
parser.add_argument("--training-folder", type=str, dest='training_folder', help='training data folder')
args = parser.parse_args()
training_folder = args.training_folder

# Get the experiment run context
run = Run.get_context()

# load the prepared data file in the training folder
print("Loading Data...")
file_path = os.path.join(training_folder,'data.csv')
diabetes = pd.read_csv(file_path)

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a decision tree model
print('Training a decision tree model...')
model = DecisionTreeClassifier().fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
fig = plt.figure(figsize=(6, 4))
# Plot the diagonal 50% line
plt.plot([0, 1], [0, 1], 'k--')
# Plot the FPR and TPR achieved by our model
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
run.log_image(name = "ROC", plot = fig)
plt.show()

# Save the trained model in the outputs folder
print("Saving model...")
os.makedirs('outputs', exist_ok=True)
model_file = os.path.join('outputs', 'diabetes_model.pkl')
joblib.dump(value=model, filename=model_file)

# Register the model
print('Registering model...')
Model.register(workspace=run.experiment.workspace,
               model_path = model_file,
               model_name = 'diabetes_model',
               tags={'Training context':'Pipeline'},
               properties={'AUC': float(auc), 'Accuracy': float(acc)})


run.complete()

Overwriting diabetes_pipeline/train_diabetes.py
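
To make the two logged metrics concrete, the sketch below computes them by hand on a tiny made-up set of labels and scores. Accuracy is the fraction of predictions that match the labels. AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The function names and the sample data are illustrative and are not part of the training script, which relies on scikit-learn for the real calculation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def auc(y_true, y_score):
    """Pairwise AUC: P(positive score > negative score), ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and scores for illustration
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

print('Accuracy:', accuracy(y_true, y_pred))  # 0.75
print('AUC:', auc(y_true, y_score))           # 0.75
```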


## b. Prepare a compute environment for the pipeline steps
In this demo, you'll use the same compute for both steps, but it's important to realize that each step runs independently, so you could specify a different compute target for each step if appropriate.

A compute target can be a local machine or a cloud resource, such as an Azure Machine Learning compute cluster. The right choice depends on the workload.

In [63]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "regression"

try:
    # Check for existing compute target
    pipeline_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If it doesn't already exist, create it. The configuration depends on the workload.
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='STANDARD_DS11_V2',
        max_nodes=2,
        vm_priority='dedicated'
        )
    pipeline_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
    pipeline_cluster.wait_for_completion(show_output=True)

Found existing cluster, use it.


## c. Define Python environment

The compute will require a Python environment with the necessary package dependencies installed, so you'll need to create a run configuration.

In [66]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import RunConfiguration
from azureml.core.runconfig import DockerConfiguration

# Create a Python environment for the experiment
diabetes_env = Environment("diabetes-pipeline-env")
#diabetes_env.python.user_managed_dependencies = False # Let Azure ML manage dependencies

# Create a set of package dependencies with versions
# (scikit-learn and matplotlib are imported by the step scripts)
diabetes_packages = CondaDependencies.create(
    conda_packages=['pandas 0.21', 'scikit-learn', 'matplotlib'],
    pip_packages=['azureml-defaults','azureml-dataprep[pandas]']
    )

# Add the dependencies to the environment
diabetes_env.python.conda_dependencies = diabetes_packages

# Register the environment 
diabetes_env.register(workspace=ws)
registered_env = Environment.get(ws, 'diabetes-pipeline-env')

# Create a new runconfig object for the pipeline
pipeline_run_config = RunConfiguration()

# Use the compute you created above. 
pipeline_run_config.target = pipeline_cluster

# Assign the environment to the run configuration
pipeline_run_config.environment = registered_env

print ("Run configuration created.")

Run configuration created.
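
Because the step scripts import several third-party packages, it can help to confirm that they resolve in whichever environment runs them. The helper below is a hypothetical check (the name `find_missing_modules` is not part of Azure ML) that uses only the standard library to report which top-level module names cannot be found.

```python
import importlib.util

def find_missing_modules(module_names):
    """Return the top-level names from module_names that cannot be found here."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Top-level modules imported by prep_diabetes.py and train_diabetes.py
required = ['pandas', 'numpy', 'sklearn', 'joblib', 'matplotlib', 'azureml']
missing = find_missing_modules(required)
print('Missing modules:', missing if missing else 'none')
```

Running this locally only describes the local interpreter; to check the remote environment, the same few lines would need to run inside a pipeline step.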


## d. Define pipeline steps and submit experiment

Now you're ready to create and run a pipeline.

First, you need to define the steps for the pipeline and any data references that need to be passed between them. In this case, the first step must write the prepared data to a folder that the second step can read. Since the steps run on remote compute (and could each run on different compute), the folder path must be passed as a data reference to a location in a datastore within the workspace. The `PipelineData` object is a special kind of data reference used for interim storage locations that can be passed between pipeline steps, so you'll create one and use it as the output for the first step and the input for the second step. Note that you also need to pass it as a script argument so that your code can access the datastore location referenced by the data reference.
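
The handoff can be pictured without any Azure resources. The sketch below is a purely local simulation, assuming a temporary directory stands in for the datastore location behind the `PipelineData` object: one function writes `data.csv` into the folder it is given, and a second function reads it back from the same path. The function names and the sample rows are invented for illustration.

```python
import csv
import os
import tempfile

def prep_step(output_folder):
    """Stand-in for the first step: write prepared rows to output_folder/data.csv."""
    os.makedirs(output_folder, exist_ok=True)
    # Made-up rows, not taken from diabetes.csv
    rows = [{'BMI': '0.25', 'Diabetic': '0'}, {'BMI': '0.75', 'Diabetic': '1'}]
    with open(os.path.join(output_folder, 'data.csv'), 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['BMI', 'Diabetic'])
        writer.writeheader()
        writer.writerows(rows)

def train_step(input_folder):
    """Stand-in for the second step: read the rows written by prep_step."""
    with open(os.path.join(input_folder, 'data.csv'), newline='') as f:
        return list(csv.DictReader(f))

# The shared folder plays the role of the intermediate data reference
with tempfile.TemporaryDirectory() as shared_folder:
    prep_step(shared_folder)
    loaded = train_step(shared_folder)

print(len(loaded), 'rows handed from step 1 to step 2')
```

In the real pipeline, neither script chooses the folder: each receives the path through its command-line argument, which is why the same object appears both in `arguments` and in `outputs` or `inputs`.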

In [67]:
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.widgets import RunDetails

# Get the training dataset
diabetes_ds = ws.datasets.get("diabetes dataset")

# Create a PipelineData (data reference) for the prepared data folder
prepped_data_folder = PipelineData("prepped_data_folder", datastore=ws.get_default_datastore())

# Step 1, Run the data prep script
prep_step = PythonScriptStep(
    name = "Prepare Data",
    source_directory = experiment_folder,
    script_name = "prep_diabetes.py",
    arguments = ['--input-data', diabetes_ds.as_named_input('raw_data'),
                    '--prepped-data', prepped_data_folder],
    outputs=[prepped_data_folder],
    compute_target = pipeline_cluster,
    runconfig = pipeline_run_config,
    allow_reuse = True
    )

# Step 2, run the training script
train_step = PythonScriptStep(
    name = "Train and Register Model",
    source_directory = experiment_folder,
    script_name = "train_diabetes.py",
    arguments = ['--training-folder', prepped_data_folder],
    inputs=[prepped_data_folder],
    compute_target = pipeline_cluster,
    runconfig = pipeline_run_config,
    allow_reuse = True
    )

print("Pipeline steps defined")

# Construct the pipeline from the steps you've defined and run it as an experiment.
pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
print("Pipeline is built.")

# Create an experiment and run the pipeline
experiment = Experiment(workspace=ws, name='diabetes_pipeline')

pipeline_run = experiment.submit(pipeline, regenerate_outputs=True)
print("Pipeline submitted for execution.")
#RunDetails(pipeline_run).show()
pipeline_run.wait_for_completion(show_output=True)


'Canceled'

In [32]:
for run in pipeline_run.get_children():
    print(run.name, ':')
    metrics = run.get_metrics()
    for metric_name in metrics:
        print('\t',metric_name, ":", metrics[metric_name])

Train and Register Model :
	 Accuracy : 0.8893333333333333
	 AUC : 0.8773322055823672
	 ROC : aml://artifactId/ExperimentRun/dcid.e6d33ad5-8df9-4a91-b49e-ef254c708b17/ROC_1620128462.png
Prepare Data :
	 raw_rows : 10000
	 processed_rows : 10000
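
If you want to work with these values programmatically rather than print them, one option is to collect them into a dictionary keyed by step name. The helper below is a hypothetical sketch: it relies only on each child run having a `name` attribute and a `get_metrics()` method, as used in the loop above, and it is exercised here with a stand-in class (`FakeRun`) holding the values shown in the output, so no workspace is needed.

```python
def collect_metrics(child_runs):
    """Map each child run's name to its metrics dictionary."""
    return {run.name: run.get_metrics() for run in child_runs}

class FakeRun:
    """Stand-in for a pipeline child run, for local illustration only."""
    def __init__(self, name, metrics):
        self.name = name
        self._metrics = metrics

    def get_metrics(self):
        return dict(self._metrics)

# Values copied from the output above
runs = [
    FakeRun('Train and Register Model',
            {'Accuracy': 0.8893333333333333, 'AUC': 0.8773322055823672}),
    FakeRun('Prepare Data', {'raw_rows': 10000, 'processed_rows': 10000}),
]
metrics_by_step = collect_metrics(runs)
prep = metrics_by_step['Prepare Data']
print('Rows dropped during prep:', prep['raw_rows'] - prep['processed_rows'])
```

With a real run, passing `pipeline_run.get_children()` to the helper should give the equivalent result.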


Assuming the pipeline was successful, a new model should be registered with a Training context tag indicating it was trained in a pipeline. Run the following code to verify this.



In [33]:
from azureml.core import Model

for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

diabetes_model version: 11
	 Training context : Pipeline
	 AUC : 0.8773322055823672
	 Accuracy : 0.8893333333333333


diabetes_model version: 10
	 Training context : Pipeline
	 AUC : 0.8744141499577093
	 Accuracy : 0.8896666666666667


diabetes_model version: 9
	 Training context : Pipeline
	 AUC : 0.8761281655803771
	 Accuracy : 0.89


diabetes_model version: 8
	 Training context : Pipeline
	 AUC : 0.8761057764067863
	 Accuracy : 0.889


diabetes_model version: 7
	 Training context : Pipeline
	 AUC : 0.876615752027464
	 Accuracy : 0.89


diabetes_model version: 6
	 Training context : Pipeline
	 AUC : 0.8728842230956764
	 Accuracy : 0.8866666666666667


diabetes_model version: 5
	 Training context : Pipeline
	 AUC : 0.8770809493009601
	 Accuracy : 0.889


diabetes_model version: 4
	 Training context : Pipeline
	 AUC : 0.8783148415344046
	 Accuracy : 0.8896666666666667


diabetes_model version: 3
	 Training context : Inline Training
	 AUC : 0.8785884869894024
	 Accuracy : 0.891


diabet