My first ML pipeline in Azure ML

However, processing the data into a format the model could consume was a bit more challenging. I had to combine one-hot-encoded categorical features with processed text data. The trick was that I had to split the data into train, validation and test sets first to avoid data leakage, and then process each split separately. I managed to do it by writing classes and functions that I could reuse. For future projects, I would separate the data processing pipeline and the model training pipeline into two pipelines.
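A minimal sketch of the split-first idea, using only the standard library. The `contract` column, the split fractions and the `OneHotEncoder` class are illustrative assumptions, not the code from this project. The point is that the encoder learns its categories from the training split alone and is then applied unchanged to the other two splits.

```python
import random


def split_rows(rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle a copy of rows and cut it into train/validation/test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


class OneHotEncoder:
    """Tiny one-hot encoder that learns its categories from one split only."""

    def fit(self, values):
        self.categories_ = sorted(set(values))
        self.index_ = {cat: i for i, cat in enumerate(self.categories_)}
        return self

    def transform(self, values):
        encoded = []
        for value in values:
            row = [0] * len(self.categories_)
            # Categories unseen during fit stay all-zero instead of
            # silently extending the vocabulary.
            if value in self.index_:
                row[self.index_[value]] = 1
            encoded.append(row)
        return encoded


# Hypothetical data: one categorical column named "contract".
rows = [{"contract": c} for c in ["full_time", "part_time", "contract", "full_time"] * 5]
train, validation, test = split_rows(rows)

# Fit on the training split only, then reuse the fitted encoder everywhere.
encoder = OneHotEncoder().fit([r["contract"] for r in train])
train_x = encoder.transform([r["contract"] for r in train])
validation_x = encoder.transform([r["contract"] for r in validation])
test_x = encoder.transform([r["contract"] for r in test])
```

Fitting on the full dataset before splitting would let category information from the validation and test rows influence the features, which is the leakage the split-first order avoids.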

Moving this to the cloud was a bit harder than expected. I decided to use the Azure CLI to separate my code from Azure commands. Defining the YAML files turned out to be less straightforward than with a DVC pipeline. I had to do a lot of trial and error to get it right.

I still don't know how to modularize my code so that I can reuse utility functions common to different components. For example, I have a logging function used in all stages. As a temporary solution, each component comes with its own utility functions, so some functions are duplicated across components. If I want to change the logging function, I have to change it in every component. In DVC this was easy, because Git tracked all the code files while DVC tracked the data files. Azure does its versioning differently: it tracks versions of each component, as well as data, and you can see which version was used in each pipeline run.
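One possible direction, sketched under assumptions: keep the helper in a single shared module (a hypothetical `common/logging_utils.py`) and have every component script import it. This presumes that each component's source folder can be set up to include that shared module, which has not been verified for this project. The helper itself could look like this:

```python
import logging
import sys


def get_logger(name, level=logging.INFO):
    """Return a logger that writes to stdout, configured at most once per name."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False  # keep messages from also going through the root logger
    return logger


# Each component script would import the helper instead of redefining it.
log = get_logger("preprocess")
log.info("starting preprocessing")
```

With a single definition, a change to the log format happens in one file, at the cost of every component depending on the shared folder.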

Also, experiments are not cached in Azure. So if you want to run the same experiment slightly differently, you have to run it from scratch, which is not computationally efficient, especially when you want to run hundreds of experiments. In DVC, a single command could run the experiment for different parameters (grid search).
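To make the grid-search idea concrete, here is a small sketch that expands a parameter grid into individual configurations. The parameter names and values are placeholders, and how each configuration is then submitted as a job is left out.

```python
import itertools


def expand_grid(grid):
    """Yield one dict per combination of the values listed for each parameter."""
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical search space; the names and values are placeholders.
grid = {"batch_size": [32, 64], "learning_rate": [0.01, 0.001]}

for params in expand_grid(grid):
    # Each combination could become one submitted run.
    print(params)
```

Two values for each of two parameters give four combinations, so the number of runs grows quickly as parameters are added. That is why reusing unchanged stages matters.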

Another thing worth mentioning is how to define parameters for an ML pipeline. In DVC, there is a great integration with Hydra, which can compose one config file from multiple config files. This is very useful and lets you be flexible with your parameters. In Azure, you have to define all parameters as command-line arguments using parser.add_argument.
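For illustration, a component script that receives its parameters this way might start like the sketch below. The argument names and defaults are assumptions and are not taken from the actual scripts.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line arguments a component script receives."""
    parser = argparse.ArgumentParser(description="Train stage (illustrative)")
    parser.add_argument("--train_data", type=str, required=True,
                        help="Path to the training batches")
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    return parser.parse_args(argv)


# Passing a list makes the parser easy to try out without a real command line.
args = parse_args(["--train_data", "data/train", "--batch_size", "64"])
print(args.batch_size, args.learning_rate)
```

Every parameter has to be declared explicitly, which is more verbose than a composed config file but keeps each script's interface visible in one place.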

In general, I like how everything is organized in Azure. There are three ways to interact with Azure ML: Studio, CLI, and SDK.

Confusions:

  • I defined my preprocess stage in the pipeline to produce outputs as folders. I expected the compute cluster to create the folders and save files into them. Instead, I see files with names that have no extensions.

  • Another thing is that mode: rw_mount means the compute cluster mounts the datastore and saves files directly into it; it does not copy the files from the compute cluster to the datastore afterwards. I thought you needed mode: download for that.

  • One more confusion is how to define parameters. If I want to change batch_size, I have to change it in both the component definition and the pipeline definition. Is there a way to see all parameters for the whole job in one place? Probably, redefining them in the pipeline overrides the component parameters.

outputs:
  batches_train:
    type: uri_folder
    path: azureml://subscriptions/a8c5d49d-e0aa-4576-97cc-fa6b18ce0f6a/resourcegroups/rg001/workspaces/WS001/datastores/workspaceblobstore/paths/LocalUpload/73375df799e563845861e11ed586aa7d/train
    mode: rw_mount

  • Where is the model saved? I don't see it in the datastore. I expected it to be saved in the datastore when the mode is upload.
train:

...

  outputs:
    model_dir:
      type: mlflow_model
      mode: upload

  • Running code locally vs running code in the cloud.

  • When running experiments in the cloud, is there any caching involved? If I run the same experiment with the same parameters, will it run from scratch or will it use the cached results? My test stage was failing, and when I resubmitted the job, only the test stage ran. It's not clear to me whether it was using cached results or not, but it's a good thing if it was. Probably, it checks whether the version of the component is the same as in the previous run.
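The guess in the last bullet can be illustrated with a generic sketch of step-level reuse. This is not Azure ML's actual logic, which was not inspected here. It only shows how a key built from the component version, the parameters and the inputs could decide whether a step is rerun. All names are hypothetical.

```python
import hashlib
import json


def cache_key(component_version, params, input_digests):
    """Build a deterministic key from everything that can change a step's output."""
    payload = json.dumps(
        {"version": component_version, "params": params, "inputs": input_digests},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


cache = {}


def run_step(name, component_version, params, input_digests, fn):
    """Run fn only if no result is stored under the same key."""
    key = (name, cache_key(component_version, params, input_digests))
    if key in cache:
        return cache[key], True  # reused a stored result
    result = fn(**params)
    cache[key] = result
    return result, False  # freshly computed


def step(seed):
    """Stand-in for a real pipeline stage."""
    return seed + 1


_, reused_first = run_step("split", "1", {"seed": 0}, ["abc"], step)
_, reused_second = run_step("split", "1", {"seed": 0}, ["abc"], step)
_, reused_new_version = run_step("split", "2", {"seed": 0}, ["abc"], step)
```

Under this scheme, resubmitting a job after fixing only the test stage would change the key for that stage alone, which would match the observed behaviour of the earlier stages not running again.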

End-to-end ML pipeline in Azure ML and GitHub Actions:

  • Convert notebook to scripts.

  • Work with YAML to define a command or pipeline job.

  • Run scripts as a job with the CLI v2.

  • Create and assign a service principal the permissions needed to run an Azure Machine Learning job.

  • Store Azure credentials securely using secrets in GitHub Secrets.

  • Create a GitHub Action using YAML that uses the stored Azure credentials to run an Azure Machine Learning job.

  • Run linters and unit tests with GitHub Actions.

  • Integrate code checks with pull requests.

  • Set up environments in GitHub.

  • Use environments in GitHub Actions.

  • Add approval gates to assign required reviewers before moving the model to the next environment.

  • Deploy a model to a managed endpoint.

  • Trigger model deployment with GitHub Actions.

  • Test the deployed model.

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: job-salary-prediction
experiment_name: basic-nn-architecture
description: A pipeline job to split-preprocess-train-test 

settings:
    default_compute: azureml:cluster-cpu

jobs:
  split:
    type: command
    component: azureml:component_pipeline_cli_split@latest
    inputs:
      raw_data:
        type: uri_file
        path: azureml:raw_data@latest
        mode: download
    outputs:
      train_data:
        type: uri_file
        mode: rw_mount
        path: azureml://subscriptions/a8c5d49d-e0aa-4576-97cc-fa6b18ce0f6a/resourcegroups/rg001/workspaces/WS001/datastores/workspaceblobstore/paths/LocalUpload/73375df799e563845861e11ed586aa7d/train.csv
      validation_data:
        type: uri_file
        mode: rw_mount
        path: azureml://subscriptions/a8c5d49d-e0aa-4576-97cc-fa6b18ce0f6a/resourcegroups/rg001/workspaces/WS001/datastores/workspaceblobstore/paths/LocalUpload/73375df799e563845861e11ed586aa7d/validation.csv
      test_data:
        type: uri_file
        mode: rw_mount
        path: azureml://subscriptions/a8c5d49d-e0aa-4576-97cc-fa6b18ce0f6a/resourcegroups/rg001/workspaces/WS001/datastores/workspaceblobstore/paths/LocalUpload/73375df799e563845861e11ed586aa7d/test.csv

  preprocess:
    type: command
    component: azureml:component_pipeline_cli_preprocess@latest
    inputs:
      train_data: ${{parent.jobs.split.outputs.train_data}}
      validation_data: ${{parent.jobs.split.outputs.validation_data}}
      test_data: ${{parent.jobs.split.outputs.test_data}}
    outputs:
      batches_train: 
        type: uri_folder
        path: azureml://subscriptions/a8c5d49d-e0aa-4576-97cc-fa6b18ce0f6a/resourcegroups/rg001/workspaces/WS001/datastores/workspaceblobstore/paths/LocalUpload/73375df799e563845861e11ed586aa7d/train
        mode: rw_mount
      batches_validation:
        type: uri_folder
        path: azureml://subscriptions/a8c5d49d-e0aa-4576-97cc-fa6b18ce0f6a/resourcegroups/rg001/workspaces/WS001/datastores/workspaceblobstore/paths/LocalUpload/73375df799e563845861e11ed586aa7d/validation
        mode: rw_mount
      batches_test:
        type: uri_folder
        path: azureml://subscriptions/a8c5d49d-e0aa-4576-97cc-fa6b18ce0f6a/resourcegroups/rg001/workspaces/WS001/datastores/workspaceblobstore/paths/LocalUpload/73375df799e563845861e11ed586aa7d/test
        mode: rw_mount