In [1]:
# checking Python SDKv2
!pip show azure-ai-ml

Name: azure-ai-ml
Version: 1.9.0
Summary: Microsoft Azure Machine Learning Client Library for Python
Home-page: https://github.com/Azure/azure-sdk-for-python
Author: Microsoft Corporation
Author-email: azuresdkengsysadmins@microsoft.com
License: MIT License
Location: /anaconda/envs/azureml_py38/lib/python3.8/site-packages
Requires: strictyaml, azure-storage-file-share, azure-storage-blob, azure-core, opencensus-ext-azure, pydash, tqdm, jsonschema, azure-storage-file-datalake, pyyaml, typing-extensions, msrest, pyjwt, marshmallow, colorama, azure-common, azure-mgmt-core, isodate
Required-by: 


## Connect to your workspace

With the required SDK packages installed, you're ready to connect to your workspace.

To connect to a workspace, you need three identifiers: a subscription ID, a resource group name, and a workspace name. Because you're working on a compute instance managed by Azure Machine Learning, you can use the default values to connect to the workspace.

In [2]:
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient

try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

In [45]:
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

Found the config file in: /config.json
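
`MLClient.from_config` looks for a `config.json` file that holds the three identifiers mentioned above. As a rough illustration only, the sketch below writes such a file with placeholder values (all hypothetical) to a temporary folder and reads it back with the standard library. It doesn't contact Azure, and the helper `load_workspace_config` is not part of the SDK.

```python
import json
import tempfile
from pathlib import Path

# Keys that a workspace config.json is expected to carry.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Read a workspace config file and check that all identifiers are present."""
    config = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise KeyError(f"config.json is missing: {', '.join(missing)}")
    return {key: config[key] for key in REQUIRED_KEYS}


# Placeholder values for illustration only; use your own identifiers.
example = {
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "resource_group": "my-resource-group",
    "workspace_name": "my-workspace",
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(example, indent=2))
    workspace_config = load_workspace_config(config_path)

print(workspace_config["workspace_name"])
```

On a compute instance the file is already present, which is why no identifiers have to be typed in the cell above.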


## Train a model and dump it for scoring later
The following step trains a simple model and saves it as a pickle file in the `model` folder.

In [None]:
# Open terminal and move to the src folder and run to save the model
!python train-model-parameters.py --training_data diabetes.csv

## Register the custom model

Batch deployments can only deploy models registered in the workspace. You'll register a custom model, which is stored in the local `model` folder.

https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-model-custom-output?view=azureml-api-2&tabs=python

In [4]:
from azure.ai.ml import MLClient, Input, load_component
from azure.ai.ml.entities import (
    AmlCompute,
    BatchEndpoint,
    BatchRetrySettings,
    CodeConfiguration,
    Data,
    Environment,
    Model,
    ModelBatchDeployment,
    ModelBatchDeploymentSettings,
    PipelineComponentBatchDeployment,
)
from azure.ai.ml.constants import AssetTypes, BatchDeploymentOutputAction
from azure.ai.ml.dsl import pipeline
from azure.identity import DefaultAzureCredential

In [15]:
# Registering the model
# Batch Endpoint can only deploy registered models. In this case, we already have a local copy of the model in the repository, 
# so we only need to publish the model to the registry in the workspace. You can skip this step if the model you are trying to deploy is already registered.


model_name = "model"
model_description = "A linear classifier."
model_local_path = "src/model/"

model = ml_client.models.create_or_update(
    Model(
        name=model_name,
        description=model_description,
        path=model_local_path,
        type=AssetTypes.CUSTOM_MODEL,
        tags={"framework": "scikit-learn", "estimator": "LogisticRegression"},
    )
)

## Creating a scoring script

We need to create a scoring script that reads the input data provided by the batch deployment and returns the scores of the model. The script also writes directly to the output folder of the job. In summary, the scoring script does the following:

- Reads the input data as CSV files.
- Runs the model's `predict` function over the input data.
- Appends the predictions to a pandas.DataFrame along with the input data.
- Writes the data to a parquet file named after the input file, with a `_pred` suffix.

### Remarks:

- Notice how the environment variable AZUREML_BI_OUTPUT_PATH is used to get access to the output path of the deployment job.
    - In Azure Machine Learning Studio, the environment variable AZUREML_BI_OUTPUT_PATH is used in the context of batch inferencing. It points to the location where the output results of the batch inferencing should be written.
- The init() function populates a global variable called output_path, which run() uses later to know where to write.
- The run function returns the scored data as a pandas.DataFrame. Batch deployments require the run function to return either a list or a pandas.DataFrame object.

In [12]:
%%writefile src/model/score.py

import os
import pickle
import glob
import pandas as pd
from pathlib import Path
from typing import List


def init():
    global model
    global output_path

    # AZUREML_BI_OUTPUT_PATH points to the output folder of the batch job
    output_path = os.environ["AZUREML_BI_OUTPUT_PATH"]
    # AZUREML_MODEL_DIR is an environment variable created during deployment;
    # it is the path to the registered model folder
    model_path = os.environ["AZUREML_MODEL_DIR"]
    # The model was registered from a folder, so the pickle file sits one level down
    model_file = glob.glob(f"{model_path}/*/*.pkl")[-1]

    with open(model_file, "rb") as file:
        model = pickle.load(file)


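def run(mini_batch: List[str]):
    results = []
    for file_path in mini_batch:
        data = pd.read_csv(file_path)
        X = data[['Pregnancies',
            'PlasmaGlucose',
            'DiastolicBloodPressure',
            'TricepsThickness',
            'SerumInsulin',
            'BMI',
            'DiabetesPedigree',
            'Age']].values

        pred = model.predict(X)

        data["prediction"] = pred

        # Write one parquet file per input file to the job's output folder
        output_file_name = Path(file_path).stem + '_pred'
        output_file_path = os.path.join(output_path, output_file_name + ".parquet")
        data.to_parquet(output_file_path)
        results.append(data)

    # Return the rows of every file in the mini-batch, not only the last one
    return pd.concat(results, ignore_index=True) if results else pd.DataFrame()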

Overwriting src/model/score.py
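
To make the `init()`/`run()` contract more concrete, here is a simplified local simulation of how a batch runtime might drive a scoring script: `init()` is called once, then `run()` is called once per mini-batch. This is only a sketch under stated assumptions. The `simulate_driver` function is hypothetical, the "model" is a placeholder that always predicts 0, and the output is written as CSV instead of parquet so that only the standard library is needed. The real runtime also spreads mini-batches over several workers.

```python
import csv
import os
import tempfile
from pathlib import Path
from typing import List


def init():
    # Same role as in score.py: resolve the output folder once per worker.
    global output_path
    output_path = os.environ["AZUREML_BI_OUTPUT_PATH"]


def run(mini_batch: List[str]) -> List[str]:
    # Stand-in for scoring: copy each CSV and add a constant "prediction" column.
    written = []
    for file_path in mini_batch:
        with open(file_path, newline="") as src:
            reader = csv.DictReader(src)
            fieldnames = list(reader.fieldnames or []) + ["prediction"]
            rows = [dict(row, prediction=0) for row in reader]  # 0 is a placeholder
        target = Path(output_path) / (Path(file_path).stem + "_pred.csv")
        with open(target, "w", newline="") as dst:
            writer = csv.DictWriter(dst, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        written.append(target.name)
    return written


def simulate_driver(files: List[str], mini_batch_size: int) -> List[str]:
    # Hypothetical local driver: init() once, then run() per mini-batch.
    init()
    results = []
    for start in range(0, len(files), mini_batch_size):
        results.extend(run(files[start:start + mini_batch_size]))
    return results


with tempfile.TemporaryDirectory() as tmp:
    inputs, outputs = Path(tmp) / "in", Path(tmp) / "out"
    inputs.mkdir()
    outputs.mkdir()
    for name in ("a.csv", "b.csv", "c.csv"):
        (inputs / name).write_text("Age,BMI\n30,21.5\n45,28.1\n")
    # Set by the platform in a real job; set by hand for this simulation.
    os.environ["AZUREML_BI_OUTPUT_PATH"] = str(outputs)
    processed = simulate_driver(sorted(str(p) for p in inputs.glob("*.csv")), 2)
    produced = sorted(p.name for p in outputs.iterdir())

print(processed)
```

With three input files and a mini-batch size of 2, `run()` is called twice, and one output file is produced per input file.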


In [7]:
### TEST the scoring script to check whether it works


import os
import pickle
import glob
import pandas as pd
from pathlib import Path
from typing import List


def init():
    global model
    global output_path

    # Replace with local paths for debugging
    output_path = "/home/azureuser/cloudfiles/code/my-azure-ml-projects/model-deployment-batch/data/"
    model_path = "/home/azureuser/cloudfiles/code/my-azure-ml-projects/model-deployment-batch/src/model/"
    model_file = glob.glob(f"{model_path}/*.pkl")[-1]

    with open(model_file, "rb") as file:
        model = pickle.load(file)


def run(mini_batch: List[str]):
    print(mini_batch)
    results = []
    for file_path in mini_batch:
        data = pd.read_csv(file_path)
        X = data[['Pregnancies',
                'PlasmaGlucose',
                'DiastolicBloodPressure',
                'TricepsThickness',
                'SerumInsulin',
                'BMI',
                'DiabetesPedigree',
                'Age']].values

        pred = model.predict(X)

        data["prediction"] = pred
        print(pred)

        output_file_name = Path(file_path).stem + '_pred'
        output_file_path = os.path.join(output_path, output_file_name + ".parquet")
        data.to_parquet(output_file_path)
        results.append(data)

    # Return the rows of every file in the mini-batch, not only the last one
    return pd.concat(results, ignore_index=True) if results else pd.DataFrame()


# Add this code to run the script locally
if __name__ == "__main__":
    init()   
    test_files = glob.glob("/home/azureuser/cloudfiles/code/my-azure-ml-projects/model-deployment-batch/data/*.csv")
    print(test_files)
    run(test_files)


https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


['/home/azureuser/cloudfiles/code/my-azure-ml-projects/model-deployment-batch/data/diabetes.csv']
['/home/azureuser/cloudfiles/code/my-azure-ml-projects/model-deployment-batch/data/diabetes.csv']
[0 0 0 ... 1 0 1]


## Create a Custom Environment

**Note:** for a custom model, it seems that a custom environment must be created and an existing one cannot be used (personal experience).

Your deployment requires an execution environment in which to run the scoring script. Any dependency your code requires should be included in the environment.

You can create an environment with a Docker image with Conda dependencies, or with a Dockerfile.

You'll also need to add the library azureml-core as it is required for batch deployments to work.

To create an environment using a base Docker image, you can define the Conda dependencies in a Conda YAML file (here, `conda-env.yml`):

In [8]:
%%writefile 'src/model/conda-env.yml'

name: basic-env-cpu
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pandas
  - scikit-learn
  - numpy
  - matplotlib
  - pip:
      - azureml-core
      - mlflow

Overwriting src/model/conda-env.yml


In [24]:
from azure.ai.ml.entities import Environment

env = Environment(
    # Double-check this base image; an unsuitable one can make the image build fail
    image="mcr.microsoft.com/azureml/minimal-ubuntu18.04-py37-cpu-inference:latest",
    conda_file="./src/model/conda-env.yml",
    name="FarbodTaymouri-env-batch",
    description="Environment created from a Docker image plus Conda environment.",
)
ml_client.environments.create_or_update(env)

Environment({'intellectual_property': None, 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'FarbodTaymouri-env-batch', 'description': 'Environment created from a Docker image plus Conda environment.', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/2a21ade8-9d70-4d5a-a619-083b264d1d56/resourceGroups/mlcertificate1/providers/Microsoft.MachineLearningServices/workspaces/ft_ml2/environments/FarbodTaymouri-env-batch/versions/1', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/farbodtaymouri2/code/my-azure-ml-projects/model-deployment-batch', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f3da45492d0>, 'serialize': <msrest.serialization.Serializer object at 0x7f3da454b2e0>, 'version': '1', 'latest_version': None, 'conda_file': {'channels': ['conda-forge'], 'dependencies': ['python=3.8', 'pandas', 'scikit-learn', 'numpy', 'matplotlib', {'pip': 

## Check that the custom environment works
https://learn.microsoft.com/en-us/azure/machine-learning/migrate-to-v2-local-runs?view=azureml-api-2


In [63]:
%%writefile src/model/sample_script.py

# Sample script to run in the custom environment

import sys
import os
import pickle
import glob
import pandas as pd
from pathlib import Path
from typing import List

print(sys.version)


Overwriting src/model/sample_script.py


In [61]:
#import required libraries
from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import Environment
from azure.identity import DefaultAzureCredential

# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)


# define the command
command_job = command(
    code='./src/model',
    command='python sample_script.py',
    # inputs={
    # "diabetes_data": Input(
    #     type=AssetTypes.URI_FILE, 
    #     path="azureml:diabetes-local:1"
    #     )
    # },
    environment='FarbodTaymouri-env-batch@latest',
    compute='farbodCluster',   # Local mode doesn't work
)

returned_job = ml_client.jobs.create_or_update(command_job)
returned_job

Found the config file in: /config.json


Experiment,Name,Type,Status,Details Page
model-deployment-batch,frosty_ticket_qkxnp151lv,command,Starting,Link to Azure Machine Learning studio


## Create a batch endpoint

A batch endpoint is an HTTPS endpoint that applications can call to trigger a batch scoring job. A batch endpoint name needs to be unique within an Azure region. You'll use the `datetime` module to generate a unique name based on the current date and time.

In [9]:
import datetime

endpoint_name = "batch-" + datetime.datetime.now().strftime("%m%d%H%M%f")
endpoint_name

'batch-09071159919380'
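
The format string packs month, day, hour, minute, and microsecond into 14 digits. As a small check, the sketch below rebuilds the name shown above from a fixed timestamp. The year and the seconds are not encoded in the name, so the values used for them here are assumptions, and `make_endpoint_name` is a hypothetical helper.

```python
from datetime import datetime

NAME_FORMAT = "%m%d%H%M%f"  # month, day, hour, minute, microsecond


def make_endpoint_name(moment: datetime, prefix: str = "batch-") -> str:
    """Build an endpoint name from a timestamp, as the cell above does."""
    return prefix + moment.strftime(NAME_FORMAT)


# Year (2023) and seconds (0) are assumed; they do not appear in the name.
moment = datetime(2023, 9, 7, 11, 59, 0, 919380)
name = make_endpoint_name(moment)
print(name)  # batch-09071159919380
```

Because the year and seconds are left out, two names could in principle collide; the microsecond component makes that unlikely in practice.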

In [10]:
from azure.ai.ml.entities import BatchEndpoint

# create a batch endpoint
endpoint = BatchEndpoint(
    name=endpoint_name,
    description="A batch endpoint for classifying diabetes in patients",
)

ml_client.batch_endpoints.begin_create_or_update(endpoint)

<azure.core.polling._poller.LROPoller at 0x7f8f062d31c0>

<p style="color:red;font-size:120%;background-color:yellow;font-weight:bold"> IMPORTANT! Wait until the endpoint is created before continuing! A green notification should appear in the studio. </p>
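
Instead of watching the studio, you can also block in code: the object returned by `begin_create_or_update` is a poller, and calling its `result()` method waits for the operation to finish. The sketch below shows the pattern with a hypothetical stand-in poller (`FakePoller`) so that it runs without an Azure connection. The dictionary it returns is a placeholder, not what the SDK returns.

```python
class FakePoller:
    """Hypothetical stand-in for the poller returned by begin_create_or_update."""

    def __init__(self, outcome):
        self._outcome = outcome

    def result(self, timeout=None):
        # The real poller blocks here until the operation completes.
        return self._outcome


def wait_for(poller, timeout=None):
    """Block until a long-running operation finishes and return its result."""
    return poller.result(timeout)


created = wait_for(FakePoller({"name": "batch-example", "state": "Succeeded"}))
print(created["state"])
```

With the real SDK objects from this notebook, the equivalent call would be `ml_client.batch_endpoints.begin_create_or_update(endpoint).result()`.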

## Create the deployment

A deployment is a set of resources required for hosting the model that does the actual inferencing. We will create a deployment for our endpoint using the `ModelBatchDeployment` class.

Because you're deploying a custom model rather than an MLflow model, you need to provide both a scoring script and an environment. (For MLflow models, Azure Machine Learning creates those assets automatically, using the `MLmodel` file to understand the expected inputs and outputs of the model.)

You'll deploy a model with the following parameters:

- `name`: Name of the deployment.
- `description`: Optional description to further clarify what the deployment represents.
- `endpoint_name`: Name of the previously created endpoint the model should be deployed to.
- `model`: The registered model to deploy.
- `compute`: Compute to be used when invoking the deployed model to generate predictions.
- `instance_count`: Count of compute nodes to use for generating predictions.
- `max_concurrency_per_instance`: Maximum number of parallel scoring script runs per compute node.
- `mini_batch_size`: Number of files passed per scoring script run.
    - How does parallelization work? Batch deployments distribute work at the file level: a folder containing 100 files with a mini-batch size of 10 generates 10 mini-batches of 10 files each, regardless of the size of the files. If your files are too big to be processed in large mini-batches, either split them into smaller files to achieve a higher level of parallelism or decrease the number of files per mini-batch. At the moment, batch deployments can't account for skew in the file size distribution.
- `output_action`: What to do with the values returned by the scoring script. With `append_row`, each returned row is appended to the output file.
- `output_file_name`: File to which predictions will be appended.
- `retry_settings`: Retry settings applied when a mini-batch fails.
- `logging_level`: The log verbosity level. Allowed values are `warning`, `info`, and `debug`. 
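
The file-level split described above can be checked with a few lines of arithmetic. The sketch below is a simplified model, not the service's actual scheduler: `split_into_mini_batches` is a hypothetical helper, and the "rounds" estimate assumes every mini-batch takes the same time.

```python
import math
from typing import List


def split_into_mini_batches(files: List[str], mini_batch_size: int) -> List[List[str]]:
    """Group files into consecutive mini-batches of at most mini_batch_size files."""
    if mini_batch_size < 1:
        raise ValueError("mini_batch_size must be at least 1")
    return [files[i:i + mini_batch_size] for i in range(0, len(files), mini_batch_size)]


# 100 files with mini-batches of 10 files, as in the example above.
files = [f"file_{i:03d}.csv" for i in range(100)]
batches = split_into_mini_batches(files, 10)
print(len(batches), len(batches[0]))  # 10 10

# Idealized estimate with the settings used in this notebook:
# 2 instances x 2 concurrent runs per instance = 4 mini-batches at a time.
instance_count, max_concurrency_per_instance = 2, 2
parallel_slots = instance_count * max_concurrency_per_instance
rounds = math.ceil(len(batches) / parallel_slots)
print(parallel_slots, rounds)  # 4 3
```

The split ignores file sizes, which is why a few very large files can dominate the running time.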

Running the following cell will configure and create the deployment.

In [11]:
# Check whether the environment is available
envs = ml_client.environments.list()
for my_env in envs:
    print(my_env.name)

FarbodTaymouri-env-batch
batch-inference-ncd-env
DefaultNcdEnv-openmpi4-1-0-ubuntu20-04
farbod-deployment-environment
deployment-environment
DefaultNcdEnv-mlflow-ubuntu20-04-py38-cpu-inference
AzureML-AI-Studio-Development
AzureML-ACPT-pytorch-1.13-py38-cuda11.7-gpu
AzureML-ACPT-pytorch-1.12-py38-cuda11.6-gpu
AzureML-ACPT-pytorch-1.12-py39-cuda11.6-gpu
AzureML-ACPT-pytorch-1.11-py38-cuda11.5-gpu
AzureML-ACPT-pytorch-1.11-py38-cuda11.3-gpu
AzureML-responsibleai-0.21-ubuntu20.04-py38-cpu
AzureML-responsibleai-0.20-ubuntu20.04-py38-cpu
AzureML-tensorflow-2.5-ubuntu20.04-py38-cuda11-gpu
AzureML-tensorflow-2.6-ubuntu20.04-py38-cuda11-gpu
AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu
AzureML-sklearn-1.0-ubuntu20.04-py38-cpu
AzureML-pytorch-1.10-ubuntu18.04-py38-cuda11-gpu
AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu
AzureML-pytorch-1.8-ubuntu18.04-py37-cuda11-gpu
AzureML-sklearn-0.24-ubuntu18.04-py37-cpu
AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu
AzureML-pytorch-1.7-ubuntu18.04-py37-c

In [54]:
deployment = ModelBatchDeployment(
    name="classifier-lm-custom",
    description="a linear model classifier for predicting diabetes",
    endpoint_name=endpoint.name,
    model=model,
    # environment= env,
    environment = 'FarbodTaymouri-env-batch@latest',
    code_configuration=CodeConfiguration(
        code="./src/model",
        scoring_script="score.py",
    ),
    compute='farbodCluster',
    settings=ModelBatchDeploymentSettings(
        mini_batch_size=1,
        instance_count=2,
        max_concurrency_per_instance=2,
        output_action=BatchDeploymentOutputAction.APPEND_ROW,
        output_file_name="predictions.csv",
        retry_settings=BatchRetrySettings(max_retries=1, timeout=100),
        logging_level="info",
    ),
)

In [55]:
ml_client.batch_deployments.begin_create_or_update(deployment)

<azure.core.polling._poller.LROPoller at 0x7f8e697a94b0>

<p style="color:red;font-size:120%;background-color:yellow;font-weight:bold"> IMPORTANT! Wait until the deployment is created before continuing! A green notification should appear in the studio. </p>

## Prepare the data for batch predictions

In the `unlabled-data` folder you'll find CSV files with unlabeled data. You'll create a data asset that points to the files in that folder, which you'll use as input for the batch job.

In [20]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

data_path = "./unlabled-data"
dataset_name = "patient-data-unlabeled"

patient_dataset_unlabeled = Data(
    path=data_path,
    type=AssetTypes.URI_FOLDER,
    description="An unlabeled dataset for diabetes classification",
    name=dataset_name,
)
ml_client.data.create_or_update(patient_dataset_unlabeled)

Uploading unlabled-data (0.02 MBs): 100%|██████████| 17119/17119 [00:00<00:00, 85250.06i

Data({'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_folder', 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'patient-data-unlabeled', 'description': 'An unlabeled dataset for diabetes classification', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/2a21ade8-9d70-4d5a-a619-083b264d1d56/resourceGroups/mlcertificate1/providers/Microsoft.MachineLearningServices/workspaces/ft_ml2/data/patient-data-unlabeled/versions/7', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/farbodtaymouri2/code/my-azure-ml-projects/model-deployment-batch', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f8ecc72b2e0>, 'serialize': <msrest.serialization.Serializer object at 0x7f8e832a3940>, 'version': '7', 'latest_version': None, 'path': 'azureml://subscriptions/2a21ade8-9d70-4d5a-a619-083b264d1d56/resourcegroups/mlcertifica

In [70]:

my_path = './data/diabetes.csv'

diabetes_local = Data(
    path=my_path,
    type=AssetTypes.URI_FILE,
    description="Data asset pointing to a local file, automatically uploaded to the default datastore",
    name="diabetes-local"
)

# Register the single-file asset defined above (not the folder asset from the previous cell)
ml_client.data.create_or_update(diabetes_local)

In [71]:
# Reading the dataset
patient_dataset_unlabeled = ml_client.data.get(
    name="patient-data-unlabeled", label="latest"
)

# # Read the data from data asset and print it
# import pandas as pd
# pd.read_csv(patient_dataset_unlabeled.path)


## Submit the job

Now that you have deployed a model to a batch endpoint, and have an unlabeled data asset, you're ready to invoke the endpoint to generate predictions on the unlabeled data.

First, you'll define the input by referring to the registered data asset. Then, you'll invoke the endpoint, which will submit a pipeline job. You can use the job URL to monitor it in the Studio. The job will contain a child job that represents the running of the scoring script to get the predictions.

In [72]:
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# Named job_input to avoid shadowing Python's built-in input()
job_input = Input(type=AssetTypes.URI_FOLDER, path=patient_dataset_unlabeled.id)

In [74]:
job = ml_client.batch_endpoints.invoke(
    endpoint_name=endpoint.name, 
    deployment_name=deployment.name,
    input=job_input)

ml_client.jobs.get(job.name)

Experiment,Name,Type,Status,Details Page
batch-09071159919380,batchjob-6462b5aa-cb56-4283-ab2d-f4725c08fe4d,pipeline,Preparing,Link to Azure Machine Learning studio


## Get the results

When the pipeline job that invokes the batch endpoint is completed, you can view the results. All predictions are collected in the `predictions.csv` file that is stored in the default datastore. You can download the file and visualize the data by running the following cells. 

The job generates a named output called `score` where all the generated files are placed. Because the scoring script writes one file per input file directly into that directory, you can expect the same number of output files as input files. In this example, each output file is named after its input file, with a `_pred` suffix and a parquet extension.

In [None]:
ml_client.jobs.download(name=job.name, download_path=".", output_name="score")
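
To know which files to look for after the download, you can derive the expected output names from the input names, using the same naming rule as the scoring script. This is a small sketch: `expected_output_name` is a hypothetical helper, and the second input path is made up for illustration.

```python
from pathlib import Path


def expected_output_name(input_file: str) -> str:
    """Mirror the naming rule in score.py: <input stem>_pred.parquet."""
    return Path(input_file).stem + "_pred.parquet"


# "diabetes.csv" comes from this notebook; the second path is hypothetical.
inputs = ["diabetes.csv", "data/patients-2.csv"]
expected = [expected_output_name(p) for p in inputs]
print(expected)  # ['diabetes_pred.parquet', 'patients-2_pred.parquet']
```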