# Automated ML

In [1]:
# import all the dependencies
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core import Workspace, Experiment, Dataset
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.train.automl import AutoMLConfig
from azureml.widgets import RunDetails
from azureml.train.automl.utilities import get_primary_metrics
from azureml.core.webservice import AciWebservice, LocalWebservice
from azureml.core import Environment
from azureml.core.model import InferenceConfig
from azureml.core.model import Model
import logging  # required by the "verbosity" entry in the AutoML settings
import requests
import json

ws = Workspace.from_config()

In [2]:
# Create compute cluster
# Use vm_size = "Standard_D2_V2" in your provisioning configuration.
# max_nodes should be no greater than 4.
# source: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.amlcompute(class)?view=azure-ml-py

cpu_cluster_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Dataset

### Overview

This project uses the data from a DrivenData competition - [Pump it Up: Data Mining the Water Table](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/).

The training data is divided into two files: one with the target variable (labels) and one with the other variables (values). The target variable describes the functioning status of each pump (*functional*, *functional needs repair*, and *non functional*). Descriptive variables include the waterpoint location, its funder, water quality and quantity, waterpoint type, etc.

Because access to the data requires a DrivenData login, it cannot be downloaded via direct links and was stored as .csv files in the *data* folder. The original data was uploaded to the Azure datastore, merged into a single data set, and registered as a dataset.
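
The merge step described above is not shown in this notebook. A minimal sketch of how it might look with pandas follows. It assumes the two files share an `id` key and that the label column is called `status_group`; those names and the toy rows are illustrative assumptions, not values taken from this notebook.

```python
import pandas as pd

# Toy stand-ins for train_values.csv and train_labels.csv (hypothetical rows).
values = pd.DataFrame({
    "id": [101, 102, 103],
    "quantity": ["enough", "dry", "insufficient"],
    "waterpoint_type": ["hand pump", "communal standpipe", "hand pump"],
})
labels = pd.DataFrame({
    "id": [103, 101, 102],
    "status_group": ["functional needs repair", "functional", "non functional"],
})

# Inner join on the shared key; validate="one_to_one" raises if an id repeats
# in either file, which would otherwise silently duplicate rows.
merged = values.merge(labels, on="id", how="inner", validate="one_to_one")

print(merged)
```

The `validate` argument is a cheap guard: a duplicated key in either file fails loudly instead of inflating the training set.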

In [3]:
# #local paths to train data
# path_labels = "data/train_labels.csv"
# path_values = "data/train_values.csv"

# # get the datastore to upload prepared data
# datastore = ws.get_default_datastore()

# # upload the local file from src_dir to the target_path in datastore
# datastore.upload(src_dir='data', target_path='data', overwrite=True)

# # create datasets referencing the cloud location
# ds_labels = Dataset.Tabular.from_delimited_files(path = [(datastore, (path_labels))])
# ds_values = Dataset.Tabular.from_delimited_files(path = [(datastore, (path_values))])

In [4]:
# Try to load the dataset from the Workspace. Otherwise, create it from the file
found = False
key = "Winery Dataset"
description_text = "Wine Dataset for Udacity Nanodegree"

if key in ws.datasets.keys():
    found = True
    dataset = ws.datasets[key]
    print("Registered dataset found in the workspace.")

if not found:
    # Register AML Dataset in Workspace
    dataset_url = "https://raw.githubusercontent.com/alihussainia/Azure3/main/wine.csv"
    ds = TabularDatasetFactory.from_delimited_files(path=dataset_url)
    dataset = ds.register(workspace=ws,
                          name=key,
                          description=description_text)
    print("Dataset registered in workspace.")

Registered dataset found in the workspace.


In [5]:
df = dataset.to_pandas_dataframe()
df.head()

| | name | alcohol | malicAcid | ash | ashalcalinity | magnesium | totalPhenols | flavanoids | nonFlavanoidPhenols | proanthocyanins | colorIntensity | hue | od280_od315 | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 2 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 3 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
| 4 | 1 | 14.2 | 1.76 | 2.45 | 15.2 | 112 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 | 1450 |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 177 entries, 0 to 176
Data columns (total 14 columns):
name                   177 non-null int64
alcohol                177 non-null float64
malicAcid              177 non-null float64
ash                    177 non-null float64
ashalcalinity          177 non-null float64
magnesium              177 non-null int64
totalPhenols           177 non-null float64
flavanoids             177 non-null float64
nonFlavanoidPhenols    177 non-null float64
proanthocyanins        177 non-null float64
colorIntensity         177 non-null float64
hue                    177 non-null float64
od280_od315            177 non-null float64
proline                177 non-null int64
dtypes: float64(11), int64(3)
memory usage: 19.5 KB


Basic exploratory data analysis (EDA) was completed by profiling the data in Azure. As a result, some variables are excluded. Deeper EDA and subsequent data wrangling are highly recommended but omitted here, as they are outside the goal of this project.

In [10]:
# create experiment
experiment_name = 'wine'
experiment = Experiment(ws, experiment_name)

## AutoML Configuration

The problem at hand is multiclass classification. Because the dataset is unbalanced, the commonly used accuracy metric can be misleading. Among the performance metrics available for AutoML classification (see below), the weighted AUC was chosen.

Using a cloud compute target provides more compute capacity than a local run. Enabling early stopping saves computation time by ending the search early when further child runs no longer improve the score.
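
To make the metric choice concrete, the sketch below computes class shares and a weighted one-vs-rest AUC in plain Python. It assumes the weighted AUC is the per-class AUC averaged with weights proportional to class frequency; it does not call AutoML's own implementation, and the labels and probabilities are made-up illustrative values.

```python
from collections import Counter


def binary_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_ovr_auc(y_true, proba, classes):
    """One-vs-rest AUC per class, averaged with class-frequency weights."""
    counts = Counter(y_true)
    total = len(y_true)
    auc = 0.0
    for k, cls in enumerate(classes):
        is_cls = [1 if y == cls else 0 for y in y_true]
        cls_scores = [row[k] for row in proba]
        auc += counts[cls] / total * binary_auc(is_cls, cls_scores)
    return auc


# Hypothetical labels and predicted probabilities (columns follow `classes`).
classes = [1, 2, 3]
y_true = [1, 1, 1, 1, 2, 2, 3]
proba = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.4, 0.5, 0.1],
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]

# Class shares show how unbalanced the labels are.
shares = {cls: n / len(y_true) for cls, n in Counter(y_true).items()}
print("class shares:", shares)
print("weighted AUC:", weighted_ovr_auc(y_true, proba, classes))
```

Unlike accuracy, this score depends on how well the predicted probabilities rank each class against the rest, so a model that always predicts the majority class gains nothing from the imbalance.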

In [11]:
# establish a list of available metrics
get_primary_metrics('classification')

['average_precision_score_weighted',
 'norm_macro_recall',
 'AUC_weighted',
 'accuracy',
 'precision_score_weighted']

In [15]:
# source: https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-auto-train-models
# automl settings 
automl_settings = {
    "n_cross_validations": 3,
    "primary_metric": "AUC_weighted",
    "enable_early_stopping": True,
    "experiment_timeout_hours": 1.0,
    "max_concurrent_iterations": 4,
    "max_cores_per_iteration": -1,
    "verbosity": logging.INFO,
}

# automl config
automl_config = AutoMLConfig(task="classification",
                             compute_target=cpu_cluster,
                             training_data=dataset,
                             label_column_name="name",
                             **automl_settings)

In [16]:
# Submit your experiment
remote_run = experiment.submit(automl_config)

Running on remote.


## Run Details

In [17]:
RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

## Best Model

This section retrieves the best-performing model, registers it, and downloads it.

In [21]:
# Retrieve the best automl model
# source: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train
best_automl_run, automl_model = remote_run.get_output()
automl_model_name = best_automl_run.properties['model_name']
print('Best AutoML model name: ' + automl_model_name,
      'Best AutoML model run: ' + str(best_automl_run),
      'Best AutoML model specification: ' + str(automl_model), sep = '\n\n')

Package:azureml-automl-runtime, training version:1.21.0, current version:1.20.0
Package:azureml-core, training version:1.21.0.post1, current version:1.20.0
Package:azureml-dataprep, training version:2.8.2, current version:2.7.3
Package:azureml-dataprep-native, training version:28.0.0, current version:27.0.0
Package:azureml-dataprep-rslex, training version:1.6.0, current version:1.5.0
Package:azureml-dataset-runtime, training version:1.21.0, current version:1.20.0
Package:azureml-defaults, training version:1.21.0, current version:1.20.0
Package:azureml-interpret, training version:1.21.0, current version:1.20.0
Package:azureml-pipeline-core, training version:1.21.0, current version:1.20.0
Package:azureml-telemetry, training version:1.21.0, current version:1.20.0
Package:azureml-train-automl-client, training version:1.21.0, current version:1.20.0
Package:azureml-train-automl-runtime, training version:1.21.0, current version:1.20.0


Best AutoML model name: AutoMLa5282e99622

Best AutoML model run: Run(Experiment: wine,
Id: AutoML_a5282e99-6962-4b8a-9682-aa3257bcac67_22,
Type: azureml.scriptrun,
Status: Completed)

Best AutoML model specification: Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('MaxAbsScaler', MaxAbsScaler(copy...
                 ExtraTreesClassifier(bootstrap=True, ccp_alpha=0.0,
                                      class_weight='balanced', criterion='gini',
                       

In [22]:
# Register and save the best model
automl_model_registered = remote_run.register_model(model_name='automl_model')

automl_model_registered.download(target_dir="outputs_automl", exist_ok=True)

'outputs_automl/model.pkl'

## Model Deployment

The AutoML model showed better performance and is therefore deployed as a web service.

In [33]:
# source: https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-deploy-models-with-aml?view=azure-ml-py
aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, 
                                               memory_gb=4, 
                                               enable_app_insights=True,
                                               description='Predict Wine Name')

model = Model(ws, 'automl_model')

env_deploy = Environment.get(workspace=ws, name='AzureML-AutoML')

inference_config = InferenceConfig(entry_script="score_AUTO.py", environment=env_deploy)

service = Model.deploy(workspace=ws, 
                       name='automl', 
                       models=[model], 
                       inference_config=inference_config, 
                       deployment_config=aciconfig)

In [34]:
service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running........................................
Succeeded
ACI service creation operation finished, operation "Succeeded"


In [44]:
print(service.scoring_uri)

http://1344b204-d393-48e9-a948-28f3d28b8a30.southcentralus.azurecontainer.io/score


The deployed endpoint is tested by sending input data to it.

In [49]:
# scoring endpoint
scoring_uri = service.scoring_uri

data = {"data":
        [
            {
               
                "alcohol": 14.23,
                "malicAcid": 1.71,
                "ash":2.43,
                "ashalcalinity": 15.6,
                "magnesium": 127,
                "totalPhenols": 2.80,
                "flavanoids": 3.06,
                "nonFlavanoidPhenols": 0.28,
                "proanthocyanins": 2.29,
                "colorIntensity":5.64,
                "hue":1.04,
                "od280_od315":3.92,
                "proline":1065


            },
            {
               
                "alcohol": 13.16,
                "malicAcid": 2.36,
                "ash":2.67,
                "ashalcalinity": 18.6,
                "magnesium": 101,
                "totalPhenols": 2.80,
                "flavanoids": 3.24,
                "nonFlavanoidPhenols": 0.30,
                "proanthocyanins": 2.81,
                "colorIntensity":5.68,
                "hue":1.03,
                "od280_od315":3.17,
                "proline":1185
            }
        ]
    }
# Convert to JSON string
input_data = json.dumps(data)

# Set the content type
headers = {'Content-Type': 'application/json'}

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.json())

[1, 1]
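
The request above can be wrapped in a small reusable helper. The sketch below is one possible shape, not the notebook's method: it swaps `requests` for the standard library's `urllib` so it has no third-party dependencies, and the names `build_payload` and `score` are hypothetical.

```python
import json
import urllib.error
import urllib.request


def build_payload(records):
    """Wrap feature dictionaries in the {"data": [...]} envelope used above."""
    if not records:
        raise ValueError("at least one record is required")
    return json.dumps({"data": list(records)}).encode("utf-8")


def score(scoring_uri, records, timeout=30):
    """POST records to the endpoint and return the decoded JSON response."""
    request = urllib.request.Request(
        scoring_uri,
        data=build_payload(records),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Surface the service's error body instead of a bare status code.
        detail = err.read().decode("utf-8", "replace")
        raise RuntimeError(f"scoring failed ({err.code}): {detail}") from err
```

With the objects from the cell above, a call would look like `score(service.scoring_uri, data["data"])`. The explicit timeout keeps a stalled container from blocking the notebook indefinitely.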


Access the logs of the web service and clean up resources (web service and compute cluster).

In [50]:
print(service.get_logs())

2021-02-05T01:28:05,745807000+00:00 - gunicorn/run 
2021-02-05T01:28:05,757940400+00:00 - iot-server/run 
2021-02-05T01:28:05,792456900+00:00 - rsyslog/run 
2021-02-05T01:28:05,793450700+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_7ade26eb614f97df8030bc480da59236/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7ade26eb614f97df8030bc480da59236/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7ade26eb614f97df8030bc480da59236/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7ade26eb614f97df8030bc480da59236/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7ade26eb614f97df8030bc480da59236/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
rsyslogd

In [52]:
service.update(enable_app_insights=True)

In [None]:

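# Leftover cell: `onnxmltools`, `xgb`, and `initial_type` are not defined in this
# notebook, so the ONNX conversion is left commented out.
# onx = onnxmltools.convert.convert_xgboost(xgb, initial_types=initial_type)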

In [None]:
# delete service
service.delete()

In [None]:
# delete compute cluster
cpu_cluster.delete()