gatorduck/AzureML_SDK_Pipeline

XGBoost model to predict blood pressure, built as an Azure Machine Learning pipeline using the Python SDKv2

About

This repo illustrates how to create a machine learning pipeline in the cloud with the Python SDK. The pipeline is used to develop a model that predicts a patient's blood pressure from several features, e.g. insulin, BMI, and more. It is based on the UC Irvine Diabetes dataset found below.

https://archive.ics.uci.edu/dataset/34/diabetes

This scenario solves a regression problem: predicting a continuous numeric value for blood pressure. The main purpose is to guide you through creating a pipeline with Azure Machine Learning and the Python SDK.

Creating Machine Learning Pipelines with Python SDKv2 in Azure Machine Learning

There are many ways to create pipelines in Azure Machine Learning. This method uses a programmatic approach with the Python SDK (software development kit). This approach is ideal for developing pipelines and experimenting. For production, I recommend the CLI and the AML extension, which help create a production-grade pipeline.

flowchart TB
    subgraph pipeline with sdkv2
    c1(data prep component) --> c2(training component)
    end
    

The files are set up to follow a sequence of steps that mimics an ML workflow, starting with p01.

1 development

p01_development.ipynb

Assuming the role of an MLOps or ML engineer, we have received a notebook that creates an ML model.

2 productionalize

p02_refactor.py

  1. We convert our notebook into a Python script (.py), as pipeline jobs require scripts. Note, however, that I will ultimately create two scripts, one for each component found under the components folder.

  2. To make things easier to manage, we refactor our steps into functions.

  3. This step also introduces the argparse package, which is used to pass data into the Python script, along with any other inputs such as parameters (a minimal example is sketched below).

Note that we have removed references to our workspace and data filepaths.
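
A minimal sketch of the argparse pattern used here, assuming illustrative argument names rather than the exact ones in p02_refactor.py:

    import argparse

    def parse_args():
        # Inputs arrive on the command line instead of being hard-coded in the script
        parser = argparse.ArgumentParser()
        parser.add_argument("--data", type=str, help="path to the input data")
        parser.add_argument("--test_train_ratio", type=float, default=0.25,
                            help="fraction of data held out for testing")
        return parser.parse_args()

    def main(args):
        # use args.data and args.test_train_ratio in place of hard-coded values
        ...

    if __name__ == "__main__":
        main(parse_args())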

2a - Components with Python SDKv2

flowchart LR
    script:::bar --> notebook:::foobar --> component
    classDef bar stroke:#0f0
    classDef foobar stroke:#00f

We create components using two approaches: with the Python SDK we define the structure of our component in code, and in a second example we define a component with a yaml file.

At this level, we can specify the compute and environment for each component.

We start with the Python SDKv2, the programmatic approach.

components/p03_data_prep.py

We take our functions from the prep stage of the p02_refactor.py script and make a few adjustments. This becomes a component independent of the training portion of our pipeline, and it is used to scale our features.

This will be the staging point for data ingestion into our pipeline, so we define two arguments: one for the input, which is the data we ultimately want to use, and one for the output, which is the data that has been transformed and is ready for training.

Within our main() function we ultimately want to return our prepped data, so we adjust the function to write the data as a .csv and save it to the output argument named --prepped_data.

Arguments

--input_data - argument that references a filepath/filename, as it is defined as a uri_file

--prepped_data - argument that references a folder path, as it is defined as a uri_folder

We will not provide the filepath to our data, or where to store our transformed data, just yet. At this point we are only creating the skeleton of our pipeline.
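
A minimal sketch of what components/p03_data_prep.py could look like with the two arguments above; the target column name and the use of StandardScaler are assumptions for illustration:

    import argparse
    import os

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--input_data", type=str, help="uri_file: path to the raw csv")
        parser.add_argument("--prepped_data", type=str, help="uri_folder: folder to write the prepped csv to")
        args = parser.parse_args()

        # Read the raw data handed to the component by the pipeline
        df = pd.read_csv(args.input_data)

        # Scale the feature columns; the target column name is an assumption for illustration
        features = df.drop(columns=["blood_pressure"])
        prepped = pd.DataFrame(StandardScaler().fit_transform(features), columns=features.columns)
        prepped["blood_pressure"] = df["blood_pressure"].values

        # Write the result into the output folder referenced by --prepped_data
        prepped.to_csv(os.path.join(args.prepped_data, "prepped_data.csv"), index=False)

    if __name__ == "__main__":
        main()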

p03_create_data_prep_component.ipynb

We create the actual component in a notebook using the Python SDK's command function. We also take an additional step and register this component in our AML workspace.
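
A sketch of how such a component could be defined with the SDK's command() function and then registered; the component name, environment, and folder layout are assumptions:

    from azure.ai.ml import command, Input, Output

    data_prep_component = command(
        name="data_prep_blood_pressure",  # placeholder component name
        display_name="Data prep for blood pressure model",
        inputs={"input_data": Input(type="uri_file")},
        outputs={"prepped_data": Output(type="uri_folder")},
        code="./components",
        command="python p03_data_prep.py "
                "--input_data ${{inputs.input_data}} "
                "--prepped_data ${{outputs.prepped_data}}",
        environment="azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",  # assumed curated environment
    )

    # Register the component in the workspace so it can be reused (ml_client is an authenticated MLClient)
    data_prep_component = ml_client.create_or_update(data_prep_component.component)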

2b - Components with yaml

flowchart LR
    yaml:::foo & script:::bar --> notebook:::foobar --> component
    classDef foo stroke:#f00
    classDef bar stroke:#0f0
    classDef foobar stroke:#00f

This approach defines our component using a yaml file.

p04_create_register_component.ipynb

This notebook loads the yaml file and creates the component for us (note that the script is referenced in the yaml file).
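
A sketch of loading and registering the yaml-defined component, assuming an authenticated MLClient named ml_client:

    from azure.ai.ml import load_component

    # Load the component definition from the yaml file; the script, inputs,
    # outputs and environment are all declared there
    train_component = load_component(source="./p04_training.yaml")

    # Register it in the AML workspace so the pipeline can reference it
    train_component = ml_client.components.create_or_update(train_component)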

p04_training.yaml

The yaml file holding the component's inputs, outputs, script reference, and environment.

p04_training.py

Our actual script that is executed when running the training component.
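
A minimal sketch of the shape of such a training script; the argument names, target column, and hyperparameters are assumptions for illustration:

    import argparse
    import os

    import pandas as pd
    from xgboost import XGBRegressor

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--training_data", type=str, help="uri_folder holding the prepped csv")
        parser.add_argument("--model_output", type=str, help="uri_folder to write the trained model to")
        args = parser.parse_args()

        # Load the prepped data produced by the data prep component
        df = pd.read_csv(os.path.join(args.training_data, "prepped_data.csv"))
        X, y = df.drop(columns=["blood_pressure"]), df["blood_pressure"]

        # Fit an XGBoost regressor on the continuous blood pressure target
        model = XGBRegressor(n_estimators=200, max_depth=4)
        model.fit(X, y)

        # Persist the trained model into the output folder
        model.save_model(os.path.join(args.model_output, "model.json"))

    if __name__ == "__main__":
        main()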

3 create our pipeline

p05_create_pipeline.ipynb

This last notebook creates our pipeline. Because we didn't specify a compute target in our components, we specify the compute resource to run the pipeline here. We can also define a different compute per component; for example, it would make sense to use a GPU or Spark cluster for just the training component and CPU compute for data prep.
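
A sketch of how the pipeline could be assembled with the dsl.pipeline decorator; the compute name, data asset path, and the training component's input/output names are assumptions:

    from azure.ai.ml import dsl, Input

    @dsl.pipeline(compute="cpu-cluster", description="Blood pressure regression pipeline")  # assumed compute target
    def blood_pressure_pipeline(raw_data):
        # The prep component's output folder feeds straight into the training component
        prep_step = data_prep_component(input_data=raw_data)
        train_step = train_component(training_data=prep_step.outputs.prepped_data)
        # Per-component compute could be set here instead, e.g. train_step.compute = "gpu-cluster"
        return {"model_output": train_step.outputs.model_output}

    pipeline_job = blood_pressure_pipeline(
        raw_data=Input(type="uri_file", path="azureml:diabetes_data@latest")  # assumed registered data asset
    )
    pipeline_job = ml_client.jobs.create_or_update(pipeline_job, experiment_name="bp_pipeline")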

Benefits

  • we have a pipeline stood up primarily in Python

Drawbacks

  • reproducibility is important, and the CLI is better suited for this
